Researchers at the Hao AI Lab at the University of California, San Diego introduced artificial intelligence models to the world of classic games, challenging them with a specific version of Super Mario Bros. The game ran in an emulator, and the models controlled Mario directly through the lab's GamingAgent framework.
Poor performance of Google and OpenAI models
In this competition between well-known artificial intelligence models, Anthropic's Claude 3.7 performed best, followed by Claude 3.5. Prominent models such as Google's Gemini 1.5 Pro and OpenAI's GPT-4o couldn't do much.
Interestingly, the models had to produce their commands as Python code to guide Mario. GamingAgent supplied basic information, such as whether an obstacle or enemy was nearby or approaching from the left, along with screenshots of the game environment. By analyzing this data, the models had to design strategies to overcome obstacles, collect coins, and progress through the level.
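The loop described above can be sketched roughly as follows. This is a minimal illustration only: the class, function, and action names (`GameState`, `decide_actions`, `"jump"`, `"move_right"`) are hypothetical and are not GamingAgent's actual API.

```python
# Hypothetical sketch of the control loop described in the article:
# the framework reports nearby obstacles/enemies, and the model answers
# with a short Python command script. All names here are illustrative
# assumptions, not GamingAgent's real interface.

from dataclasses import dataclass


@dataclass
class GameState:
    obstacle_ahead: bool  # an obstacle or enemy directly ahead of Mario
    enemy_left: bool      # an enemy approaching from the left
    coin_above: bool      # a collectible coin overhead


def decide_actions(state: GameState) -> list[str]:
    """Turn the structured game state into a command script,
    mimicking the kind of Python the models were asked to produce."""
    actions: list[str] = []
    if state.obstacle_ahead or state.enemy_left:
        actions.append("jump")    # clear the obstacle or dodge the enemy
    if state.coin_above:
        actions.append("jump")    # hop to collect the coin
    actions.append("move_right")  # always keep progressing through the level
    return actions
```

For example, a state with an obstacle ahead yields `["jump", "move_right"]`, while an empty screen yields just `["move_right"]`.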

One of the interesting findings was the weaker performance of step-by-step reasoning models, such as OpenAI's o1, compared with conventional models. Contrary to expectations, reasoning models, which usually do better on sophisticated problems requiring deliberate thinking, struggled in real-time environments such as Super Mario. The main reason for this weakness is decision-making latency: producing an answer can take several seconds, and in games like Mario that delay is the difference between a successful jump and a fall.
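A back-of-the-envelope calculation shows why latency matters so much here. The frame rate and latency figures below are assumptions for illustration (NES-era games run at roughly 60 frames per second), not measurements from the experiment.

```python
# Rough arithmetic (assumed numbers): how many frames of a 60 fps game
# elapse while a model is still deciding what to do?

FPS = 60  # approximate frame rate of NES-era games like Super Mario Bros.


def frames_elapsed(decision_latency_s: float, fps: int = FPS) -> int:
    """Frames the game advances while the model is still 'thinking'."""
    return int(decision_latency_s * fps)
```

Under these assumptions, a fast model answering in half a second already misses about 30 frames, while a reasoning model taking three seconds misses about 180, far longer than the window in which a jump must be timed.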
Using games to compare the performance of artificial intelligence models
Using games to benchmark artificial intelligence is nothing new; it has been done for decades. But some experts believe that equating AI performance in games with real progress toward general artificial intelligence is misleading. Games are more abstract and simpler than the real world, and they offer an almost infinite volume of data to practice on.
These showy tests and gaming competitions have become part of what Andrej Karpathy, a senior researcher and OpenAI co-founder, describes as an "evaluation crisis." Karpathy wrote in a post on X (the social network):
"Honestly, I no longer know which metrics to look at. In short, I don't know exactly how good these models are."
The experiment comes as companies look for new ways to evaluate artificial intelligence beyond traditional benchmarks such as MMLU or BIG-bench. Real-time games may not be a complete yardstick, but they show that language models still face fundamental challenges in combining decision-making speed with reasoning.