The gap between the benchmark results presented by OpenAI (the creator of ChatGPT) and those measured by independent institutions for its o3 artificial intelligence model has raised questions about the company's transparency and its model-testing practices. When OpenAI unveiled o3 in December, the company claimed the model could answer more than a quarter of the questions on FrontierMath, a challenging set of math problems. That score far outstripped the competition; the next-best model managed to solve only about 2 percent of the FrontierMath problems.
But as it turns out, that figure was likely an upper bound, achieved by a version of o3 with more computing power behind it than the model OpenAI released publicly last week. Epoch AI, the research institute behind FrontierMath, published the results of its own independent benchmark tests of o3 on Friday. Epoch found that o3 scored around 10 percent, well below the highest score OpenAI had announced.
That does not in itself prove OpenAI gave inaccurate information. The benchmark results the company published in December show a lower-bound score that matches the figure Epoch recorded. Epoch also noted that its testing setup likely differs from OpenAI's, and that it used an updated release of FrontierMath for its evaluation.
Epoch wrote in a statement: "The difference between our results and OpenAI's might be due to OpenAI evaluating with a more powerful internal scaffold, using more test-time computing, or because those results were run on a different subset of FrontierMath (the 180 problems in frontiermath-2024-11-26 vs the 290 problems in frontiermath-2025-02-28-private)." Wenda Zhou, a member of OpenAI's technical staff, said in a livestream last week that the production version of o3 was optimized for real-world use cases and for speed, unlike the version demonstrated in December. For that reason, he added, there may be "differences" in benchmark results.
Of course, the fact that the public release of o3 falls short of OpenAI's testing claims is somewhat beside the point, since the o3-mini-high and o4-mini models already outperform o3 on FrontierMath, and OpenAI plans to release a more powerful variant, o3-pro, in the coming weeks. Still, the episode is another reminder that AI benchmark results should not be taken at face value, particularly when they are published by a company with a product to sell.
Benchmarking controversies are becoming commonplace in the AI industry as companies race to attract media attention and users with new models. In January, Epoch was criticized for waiting to disclose that it had received funding from OpenAI until after the company announced o3. Many of the researchers who contributed to FrontierMath were unaware of OpenAI's involvement before it became public.
More recently, Elon Musk's xAI was accused of publishing misleading benchmark charts for its latest AI model, Grok 3. And just this month, Meta acknowledged that it had touted benchmark scores for a version of a model that differed from the one it made available to developers.


Source: TechCrunch