The difference between the benchmark results presented by OpenAI (the creator of ChatGPT) and those measured by independent institutions for the o3 artificial intelligence model has raised questions about the company's transparency and its model-testing practices. When OpenAI unveiled o3 in December, the company claimed the model could answer more than a quarter of the questions on FrontierMath, a challenging set of mathematical problems. That score far outstripped the competition: the next-best model managed to solve only about 2 percent of the FrontierMath problems.
But as it turns out, that figure was likely an upper bound, achieved by a version of o3 with more computing power behind it than the model OpenAI released publicly last week. Epoch AI, the research institute behind FrontierMath, published the results of its independent benchmark tests of o3 on Friday. Epoch found that o3 scored around 10%, well below OpenAI's highest claimed score.
This by itself does not prove that OpenAI gave incorrect information. The benchmark results the company published in December include a lower-bound score that matches the score Epoch recorded. Epoch also noted that its testing setup probably differs from OpenAI's, and that it used a newer release of FrontierMath for its evaluation.

Epoch wrote in a statement: "The difference between our results and OpenAI's might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time compute, or because those results were run on a different subset of FrontierMath (the 180 problems in frontiermath-2024-11-26 versus the 290 problems in frontiermath-2025-02-28-private)." Wenda Zhou, a member of OpenAI's technical staff, said during a livestream last week that the production o3 model was optimized "for real-world use cases" and for speed, unlike the version of o3 demoed in December. For that reason, he added, there may be "disparities" in benchmark results.
Of course, the fact that the public o3 release falls short of OpenAI's testing claims is somewhat moot, since the o3-mini-high and o4-mini models already outperform o3 on FrontierMath, and OpenAI plans to release a more powerful variant, o3-pro, in the coming weeks. Still, the episode is another reminder that AI benchmark results should not be accepted at face value, particularly when they are published by a company with something to sell.
Benchmarking "controversies" are becoming commonplace in the AI industry, as companies race to attract media attention and users with new models. In January, Epoch was criticized for waiting until the company's o3 announcement to disclose that it had received funding from OpenAI. Many of the researchers involved in developing FrontierMath were unaware of OpenAI's involvement before it became public.
More recently, Elon Musk's xAI was accused of publishing misleading benchmark charts for its latest AI model, Grok 3. Just this month, Meta admitted that it had touted benchmark scores for a version of a model that differed from the one made available to developers.


Source: TechCrunch