Researchers at the Hao AI Lab at the University of California, San Diego introduced artificial intelligence models to the world of classic games, challenging them with a specific version of Super Mario Bros. The game ran in an emulator, and the models controlled Mario directly through the lab's GamingAgent framework.
Poor performance of Google and OpenAI models
In this competition between well-known artificial intelligence models, Anthropic's Claude 3.7 performed best, followed by Claude 3.5. Prominent models such as Google's Gemini 1.5 Pro and OpenAI's GPT-4o couldn't do much.
Interestingly, the models had to produce their commands as Python code to guide Mario. GamingAgent supplied basic information, such as whether an obstacle or enemy was nearby or approaching from the left, along with screenshots of the game environment. By analyzing this data, the models had to design strategies to overcome obstacles, collect coins, and progress through the level.
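The loop described above can be sketched roughly as follows. This is a minimal illustration only: the class, function, and action names (`GameState`, `decide_actions`, `"jump"`, `"move_right"`) are hypothetical and are not GamingAgent's actual API.

```python
# Hypothetical sketch of the control loop described in the article:
# the framework reports nearby obstacles/enemies, and the model answers
# with a short Python command script. All names here are illustrative
# assumptions, not GamingAgent's real interface.

from dataclasses import dataclass


@dataclass
class GameState:
    obstacle_ahead: bool  # an obstacle or enemy directly ahead of Mario
    enemy_left: bool      # an enemy approaching from the left
    coin_above: bool      # a collectible coin overhead


def decide_actions(state: GameState) -> list[str]:
    """Turn the structured game state into a command script,
    mimicking the kind of Python the models were asked to produce."""
    actions: list[str] = []
    if state.obstacle_ahead or state.enemy_left:
        actions.append("jump")    # clear the obstacle or dodge the enemy
    if state.coin_above:
        actions.append("jump")    # hop to collect the coin
    actions.append("move_right")  # always keep progressing through the level
    return actions
```

For example, a state with an obstacle ahead yields `["jump", "move_right"]`, while an empty screen yields just `["move_right"]`.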

One of the interesting findings was the weaker performance of step-by-step reasoning models, such as OpenAI's o1, compared with conventional models. Contrary to expectations, reasoning models, which usually do better on sophisticated problems requiring deliberate thinking, struggled in real-time environments such as Super Mario. The main reason for this weakness is decision-making latency: producing an answer can take several seconds, and in games like Mario that delay is the difference between a successful jump and a fall.
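A back-of-the-envelope calculation shows why latency matters so much here. The frame rate and latency figures below are assumptions for illustration (NES-era games run at roughly 60 frames per second), not measurements from the experiment.

```python
# Rough arithmetic (assumed numbers): how many frames of a 60 fps game
# elapse while a model is still deciding what to do?

FPS = 60  # approximate frame rate of NES-era games like Super Mario Bros.


def frames_elapsed(decision_latency_s: float, fps: int = FPS) -> int:
    """Frames the game advances while the model is still 'thinking'."""
    return int(decision_latency_s * fps)
```

Under these assumptions, a fast model answering in half a second already misses about 30 frames, while a reasoning model taking three seconds misses about 180, far longer than the window in which a jump must be timed.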
Using games to compare the performance of artificial intelligence models
Using games to benchmark artificial intelligence is nothing new; it has been done for decades. But some experts believe that equating AI performance in games with real progress toward general artificial intelligence is misleading. Games are more abstract and simpler than the real world, and they offer an almost infinite volume of data to practice on.
These showy tests and gaming competitions have become part of what Andrej Karpathy, a senior researcher and OpenAI co-founder, describes as an "evaluation crisis." Karpathy wrote in a post on X (the social network):
"Honestly, I no longer know which metrics to look at. In short, I don't know exactly how good these models are."
The experiment comes as companies look for new ways to evaluate artificial intelligence beyond traditional benchmarks such as MMLU or BIG-bench. Real-time games may not be a complete yardstick, but they show that language models still face fundamental challenges in combining decision-making speed with reasoning.