OpenAI (the creator of ChatGPT) recently unveiled the new generation of its artificial intelligence models. These models, known as o3 and o4-mini, represent significant progress over previous versions, according to their creators. However, new reports have raised concerns about their accuracy. The phenomenon of "hallucination", that is, presenting inaccurate information as fact, remains a serious issue in these new models and may have become even more pronounced.
According to a report by TechCrunch, the o3 and o4-mini models appear more prone to producing fabricated content than expected, and OpenAI's own internal tests confirm this. The results of these tests show that the rate of hallucination in o3 and o4-mini is not only higher than in older reasoning models such as o1, o1-mini, and o3-mini, but also exceeds that of standard, widely used OpenAI models such as GPT-4o. These findings are somewhat surprising, since such errors are usually expected to decrease as AI models advance.

Hallucination is one of the main obstacles to the development of this technology, and overcoming it is no easy task; it requires complex approaches. Although newer generations of models often manage to reduce it and show greater precision than their predecessors, this trend appears to have reversed for o3 and o4-mini. That raises important questions about how these models were developed and the challenges ahead.
What doubles the concern is that OpenAI itself has no clear explanation for the increased hallucination in its new models. In its technical report on o3 and o4-mini, the company explicitly states that further research is needed to understand why hallucination increases as reasoning capability improves. This uncertainty shows that fully understanding the internal mechanisms of these complex models remains a major challenge for researchers in the field.
Of course, the advances of these models should not be ignored. Reports suggest that o3 and o4-mini perform better than their predecessors in some areas, especially programming and mathematical problem-solving. However, this performance improvement appears to come at a cost: according to OpenAI's analysis, these models generally "make more claims overall". The increase covers both more accurate statements and, unfortunately, more inaccurate ones.

To convey the scale of the problem, OpenAI cites the results of its internal benchmark PersonQA, which is designed to measure a model's accuracy when providing information about people. The results show that o3 hallucinated in 33% of its answers to the benchmark questions, providing inaccurate information at roughly twice the rate of the previous reasoning models o1 (16%) and o3-mini (14.8%). The situation for o4-mini looks even more worrying: it hallucinated in 48% of cases on PersonQA.
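As a rough illustration (not OpenAI's actual methodology or code), a benchmark like PersonQA ultimately reduces to grading each model answer against reference facts and reporting the fraction of answers that contain a hallucinated claim. A minimal sketch, with purely hypothetical grading data:

```python
# Hypothetical sketch of how a hallucination-rate metric is computed.
# The grades below are made-up examples, not real benchmark results.

def hallucination_rate(graded_answers):
    """Fraction of answers graded as hallucinated (True = contains a false claim)."""
    if not graded_answers:
        return 0.0
    hallucinated = sum(1 for is_hallucination in graded_answers if is_hallucination)
    return hallucinated / len(graded_answers)

# Illustrative grades for six model answers.
grades = [True, False, True, False, False, False]
print(f"hallucination rate: {hallucination_rate(grades):.1%}")  # → hallucination rate: 33.3%
```

A reported figure like "33% on PersonQA" is this kind of ratio computed over the full set of benchmark questions.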
It can be argued that hallucinations sometimes help AI models arrive at new and creative ideas, but this trait is a serious problem for commercial applications and situations where accuracy is a top priority. Businesses and users who need reliable, accurate AI output cannot simply overlook such errors. One promising way to reduce hallucinations and increase accuracy is to equip models with web search capability, which lets the model verify its information against external sources. For example, GPT-4o with web search achieves a notable 90% accuracy on the SimpleQA benchmark (another accuracy measure). This suggests that access to up-to-date external information can play an important role in reducing hallucinations. For now, though, the core challenge with the new o3 and o4-mini models remains in place and will require further investigation by OpenAI.
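The grounding idea described above can be sketched abstractly: before an answer is emitted, its claim is checked against an external source, and anything that cannot be verified is flagged rather than stated as fact. The lookup step here is a stand-in stub, not a real web-search API:

```python
# Illustrative sketch of grounding an answer with an external lookup.
# `lookup` is a stub standing in for a real web-search or retrieval backend.

def lookup(claim, knowledge_base):
    """Stub retrieval step: True if the claim is supported by the source."""
    return claim in knowledge_base

def grounded_answer(claim, knowledge_base):
    """Return the claim only if it can be verified; otherwise flag it."""
    if lookup(claim, knowledge_base):
        return claim
    return f"[unverified] {claim}"

kb = {"Paris is the capital of France"}
print(grounded_answer("Paris is the capital of France", kb))
print(grounded_answer("Lyon is the capital of France", kb))
```

Production systems do something far more elaborate (query generation, source ranking, citation), but the principle is the same: external evidence gates what the model asserts.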


Source: TechCrunch




