OpenAI (the creator of ChatGPT) recently unveiled the new generation of its artificial intelligence models. These models, known as o3 and o4-mini, represent significant progress over previous versions, according to their creators. However, new reports have raised concerns about their accuracy. The phenomenon of "hallucination", that is, presenting inaccurate information as fact, appears to remain a serious issue in these new models and may even have become more pronounced.
According to information published by TechCrunch, the o3 and o4-mini models appear to be more prone to producing inaccurate content than expected. OpenAI's own internal tests confirm this. The results show that the rate of hallucination in o3 and o4-mini is not only higher than in older reasoning models such as o1, o1-mini, and o3-mini, but also exceeds that of standard, widely used OpenAI models such as GPT-4o. These findings are somewhat surprising, since such errors are usually expected to decrease as artificial intelligence models advance.
Hallucination is one of the main obstacles to the development of this technology. Overcoming it is not easy and requires complex approaches. Although newer generations of models often succeed in reducing this problem and show greater accuracy than their predecessors, the trend appears to have reversed for o3 and o4-mini. This raises important questions about how these models were developed and the challenges ahead.
What compounds these concerns is that OpenAI itself has no clear explanation for the increased hallucination in its new models. In its technical report on o3 and o4-mini, the company explicitly states that further research is needed to understand why hallucination increases as reasoning capability improves. This uncertainty shows that fully understanding the internal mechanisms of these complex models remains a major challenge for researchers in the field.
Of course, the advances these models bring should not be overlooked. Reports suggest that o3 and o4-mini perform better than their predecessors in some areas, particularly programming and mathematical tasks. However, this improvement appears to have come at a cost. According to OpenAI's analysis, these models generally "make more claims". That increase covers both more accurate claims and, unfortunately, more inaccurate ones.
To convey the scale of the problem, OpenAI cites the results of its internal benchmark PersonQA, which is designed to measure a model's accuracy when providing information about people. The results show that o3 hallucinated, providing inaccurate information, on 33% of the benchmark's questions. That is roughly twice the hallucination rate of the previous reasoning models o1 (16%) and o3-mini (14.8%). The situation for o4-mini looks even more worrying: it hallucinated on 48% of PersonQA questions.
Hallucinations can sometimes help AI models arrive at new and creative ideas, but they are a serious drawback for commercial applications and for situations where factual accuracy is a top priority. Businesses and users who need reliable, accurate output from AI cannot simply overlook such errors. One promising way to reduce hallucinations and increase accuracy is to equip models with web search capability, which allows the model to verify its information against external sources. For example, GPT-4o with web search achieves a notable 90% score on the SimpleQA benchmark (another measure of accuracy). This shows that access to up-to-date external information can play an important role in reducing hallucinations. For the new o3 and o4-mini models, however, the core challenge remains and will require further investigation by OpenAI.
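As a rough illustration of what equipping a model with web search looks like in practice, here is a minimal Python sketch. It assumes the OpenAI Python SDK and the Responses API's web-search tool; the model name, prompt, and setup are illustrative only and are not taken from OpenAI's report.

```python
# Minimal sketch: ask a model to answer with web search enabled so it can
# ground its claims in external sources. Assumes the OpenAI Python SDK and
# the Responses API's web-search tool; model and prompt are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-4o",                          # a model with web search support
    tools=[{"type": "web_search_preview"}],  # allow the model to consult the web
    input="Who is the current CEO of OpenAI? Cite your source.",
)

# Aggregated text answer; cited URLs appear as annotations in the raw output items.
print(response.output_text)
```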
Source: TechCrunch