Researchers at the Hao AI Lab at the University of California, San Diego, have brought artificial intelligence into the world of classic games by challenging AI models to play a special version of Super Mario Bros. The game ran in an emulator, and the models controlled Mario directly with the help of the lab's in-house GamingAgent framework.
Poor performance of Google and OpenAI models
In this contest between well-known AI models, Anthropic's Claude 3.7 performed best, followed by Claude 3.5. Prominent models such as Google's Gemini 1.5 Pro and OpenAI's GPT-4o fared far worse.
Interestingly, the models had to produce their commands as Python code to steer Mario. GamingAgent supplied them with basic information, such as whether an obstacle or enemy was nearby or to the left, along with screenshots of the game environment. By analyzing this data, the models had to devise strategies to clear obstacles, collect coins, and progress through the level.
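As a rough illustration, the sketch below shows how such a loop might work: capture a frame and a short description of nearby objects, send them to the model, and execute the Python snippet the model returns. The `llm_client` and `emulator` objects and their methods are hypothetical placeholders, not GamingAgent's actual API.

```python
import base64

# Minimal sketch of a GamingAgent-style control loop. The `llm_client` and
# `emulator` objects and their methods are hypothetical placeholders, not the
# framework's actual API.

PROMPT_TEMPLATE = (
    "You are controlling Mario. Nearby objects: {nearby}.\n"
    "Reply with a short Python snippet calling move_right(), jump(), or wait() "
    "to avoid obstacles, collect coins, and make progress."
)

def encode_frame(png_bytes: bytes) -> str:
    """Base64-encode a screenshot so it can be attached to the model request."""
    return base64.b64encode(png_bytes).decode("ascii")

def next_action(llm_client, emulator) -> str:
    """One perception -> reasoning -> action step: the model returns Python code."""
    frame = encode_frame(emulator.capture_frame())   # screenshot of the game
    nearby = emulator.describe_nearby()              # e.g. "Goomba two tiles to the right"
    prompt = PROMPT_TEMPLATE.format(nearby=nearby)
    return llm_client.complete(prompt=prompt, image_base64=frame)  # e.g. "jump()\nmove_right()"

def play(llm_client, emulator, steps: int = 100) -> None:
    """Run the loop for a fixed number of steps, executing each returned snippet."""
    for _ in range(steps):
        emulator.execute(next_action(llm_client, emulator))  # run the model's Python commands
```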
One of the more interesting findings was the weaker performance of step-by-step reasoning models, such as OpenAI's reasoning models, compared with conventional ones. Contrary to expectations, reasoning models, which usually do better at complex problem solving and logical thinking, struggled in a real-time environment like Super Mario. The main reason is decision latency: these models can take several seconds to choose an action, and in a game like Mario that delay is the difference between a successful jump and a fall.
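The toy example below illustrates the problem; the jump window and latency figures are assumed values for demonstration, not measurements from the experiment.

```python
import time

# Illustrative only: the jump window and latencies below are assumed numbers,
# used to show why multi-second reasoning latency hurts real-time control.

JUMP_WINDOW_S = 0.5  # assumed time Mario has before the jump opportunity passes

def attempt_jump(model_latency_s: float) -> str:
    """Simulate issuing a jump command after the model finishes deliberating."""
    deadline = time.monotonic() + JUMP_WINDOW_S
    time.sleep(model_latency_s)  # model is still producing its answer
    return ("jump lands in time" if time.monotonic() <= deadline
            else "too late: Mario has already fallen")

print(attempt_jump(0.2))  # fast, non-reasoning model
print(attempt_jump(3.0))  # slow, step-by-step reasoning model
```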
Using games to compare the performance of artificial intelligence models
Using games to test artificial intelligence is nothing new; it has been done for decades. But some experts argue that equating AI performance in games with real progress toward general artificial intelligence is misleading. Games are more abstract and simpler than the real world, and the volume of training data available for them is almost limitless.
These demo tests and gaming competitions have become part of what Andrej Karpathy, a prominent AI researcher and OpenAI co-founder, describes as an "evaluation crisis." Karpathy wrote in a post on X:
"Honestly, I no longer know which metrics to look at. In short, I don't know exactly how good these models are."
The experiment comes at a time when companies are looking for new ways to evaluate artificial intelligence beyond traditional benchmarks such as MMLU or BIG-bench. Real-time games may not be a perfect yardstick, but they show that language models still face fundamental challenges in combining reasoning with fast decision making.