Sakana AI has announced the development of a new benchmark for measuring artificial intelligence capabilities using Sudoku puzzles. The benchmark contains some of the most challenging Sudoku puzzles, which are difficult even for expert solvers, and can serve as a serious test of the reasoning capabilities of artificial intelligence models.
Why Sudoku?
Sudoku, popularized in the 1980s by Nikoli in Japan, consists of a 9×9 grid in which some cells are pre-filled. The goal is to fill the empty cells so that each number from 1 to 9 appears exactly once in every row, column, and 3×3 box. In recent years, more advanced versions of Sudoku have been introduced, with more sophisticated and diverse rules. These variants, called “modern Sudoku”, require creative reasoning.
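The classic rules described above can be written down as a short check. The following is a minimal sketch in Python, not part of Sakana AI's benchmark: the function name `is_valid_solution` and the generated `example_grid` are illustrative assumptions, and the check covers only the standard 9×9 rules, not the extra constraints of modern variants.

```python
def is_valid_solution(grid):
    """Return True if a completed 9x9 grid satisfies the classic Sudoku rules.

    `grid` is a list of 9 rows, each a list of 9 integers (hypothetical
    representation chosen for this sketch).
    """
    if len(grid) != 9 or any(len(row) != 9 for row in grid):
        return False
    digits = set(range(1, 10))
    rows = [set(row) for row in grid]
    cols = [set(col) for col in zip(*grid)]
    # Each 3x3 box is identified by its top-left corner (br, bc).
    boxes = [
        {grid[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)}
        for br in range(0, 9, 3)
        for bc in range(0, 9, 3)
    ]
    # Every row, column, and box must contain exactly the digits 1 to 9.
    return all(unit == digits for unit in rows + cols + boxes)


# A valid grid built from a simple shifting pattern, used only as test data.
example_grid = [
    [(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)]
    for r in range(9)
]
print(is_valid_solution(example_grid))
```

Swapping two neighbouring cells in a row of `example_grid` keeps that row valid but breaks two columns, so the check then returns `False`.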

Although computers have long been able to solve standard Sudoku through computational search algorithms, artificial intelligence models are still unable to imitate the way humans reason when solving these puzzles. Modern Sudoku variants have unique rules that require abstract reasoning, so Sakana AI believes these puzzles can act as an ideal benchmark for measuring the reasoning abilities of artificial intelligence models.
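To illustrate what a computational search algorithm means here, the sketch below uses plain backtracking, a common textbook approach. It is an assumption chosen for illustration, not a method attributed to Sakana AI. The names `candidates`, `solve`, and `PUZZLE` are hypothetical, and `0` marks an empty cell.

```python
def candidates(grid, r, c):
    """Digits that can go in cell (r, c) without breaking a row, column, or box."""
    used = set(grid[r]) | {grid[i][c] for i in range(9)}
    br, bc = 3 * (r // 3), 3 * (c // 3)
    used |= {grid[i][j] for i in range(br, br + 3) for j in range(bc, bc + 3)}
    return [d for d in range(1, 10) if d not in used]


def solve(grid):
    """Fill `grid` in place by backtracking; return True if a solution exists."""
    for r in range(9):
        for c in range(9):
            if grid[r][c] == 0:
                for d in candidates(grid, r, c):
                    grid[r][c] = d
                    if solve(grid):
                        return True
                    grid[r][c] = 0  # undo the guess and try the next digit
                return False  # no digit fits this cell: backtrack
    return True  # no empty cells remain


# A widely reproduced example puzzle, used here only as sample input.
PUZZLE = [
    [5, 3, 0, 0, 7, 0, 0, 0, 0],
    [6, 0, 0, 1, 9, 5, 0, 0, 0],
    [0, 9, 8, 0, 0, 0, 0, 6, 0],
    [8, 0, 0, 0, 6, 0, 0, 0, 3],
    [4, 0, 0, 8, 0, 3, 0, 0, 1],
    [7, 0, 0, 0, 2, 0, 0, 0, 6],
    [0, 6, 0, 0, 0, 0, 2, 8, 0],
    [0, 0, 0, 4, 1, 9, 0, 0, 5],
    [0, 0, 0, 0, 8, 0, 0, 7, 9],
]

grid = [row[:] for row in PUZZLE]  # copy so the original clues stay untouched
solved = solve(grid)
for row in grid:
    print(" ".join(str(d) for d in row))
```

The search guesses a digit, recurses, and undoes the guess when it hits a dead end. This works mechanically for standard rules, but it involves none of the insight-driven deduction a human solver relies on, which is the gap the article describes.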
The new challenge of artificial intelligence in the field of reasoning
With the advancement of artificial intelligence models such as ChatGPT-4 and DeepSeek R1, the need for more complex benchmarks to measure the reasoning of these models has increased. So far, academic exams and mathematics competitions have served as measurement benchmarks, but modern models now pass them successfully. Modern Sudoku, with its variety of unpredictable rules, poses a new challenge for artificial intelligence models.
In a keynote at the GTC 2025 conference, Jensen Huang emphasized that puzzles such as Sudoku could be a valuable resource for developing reasoning in artificial intelligence models.
Collaboration with Cracking the Cryptic to teach artificial intelligence
One of the main problems in training reasoning models is the lack of quality data on problem-solving processes. Many texts on the Internet lack step-by-step explanations of how humans reason. To address this, Sakana AI has collaborated with the well-known Cracking the Cryptic channel, the largest puzzle-solving channel on YouTube.
This cooperation includes the following:
- More than 5 videos of sophisticated Sudoku puzzles
- More than 2 hours of textual data from the human reasoning process, including about 2 million words
- About 2 million moves extracted from puzzle-solving videos
This collection of datasets, released alongside the new benchmark, can help artificial intelligence models better learn human reasoning.
Current challenges of artificial intelligence models in solving Sudoku
Today’s artificial intelligence models have a major problem in solving Sudoku: an inability to maintain global consistency over long chains of reasoning. Many of these models are able to place numbers in the grid, but they sometimes follow wrong paths that lead to a contradiction in the final stages. This is the main weakness of artificial intelligence models compared to humans.
In contrast, professional Sudoku solvers use a gradual reasoning method. They first analyze the puzzle’s unique constraints and look for the “point of entry”, that is, the key insight that opens the way to solving the puzzle. Many of today’s advanced models are still unable to discover these entry points.
Nikoli handmade Sudoku: a benchmark for human reasoning in artificial intelligence
Nikoli, the Japanese company that introduced Sudoku to the world, contributed a set of 4 handmade puzzles to this new benchmark. Unlike computer-generated puzzles, which often rely on brute force, handmade Sudoku requires creative insight and reasoning. For this reason, these puzzles can be an ideal benchmark for measuring the capabilities of artificial intelligence.
Publishing a new benchmark for artificial intelligence reasoning
Sakana AI has published its new benchmark along with a complete set of data and tools. Interested parties and researchers can view the benchmark on GitHub.
In this benchmark, the puzzles range in difficulty from simple ones up to a level at which today’s advanced models cannot solve even a single puzzle. Early tests have shown that even the strongest artificial intelligence models are currently unable to solve the benchmark’s difficult puzzles. For example, the ChatGPT-4o model has been able to solve only 2% of the benchmark’s simple puzzles, and its performance declines dramatically as the difficulty increases.
The new Sakana AI benchmark poses a major challenge for advanced artificial intelligence models and can pave the way for significant advances in machine reasoning. Modern Sudoku puzzles, with their diverse and complex rules, can be a good benchmark for measuring the actual problem-solving abilities of artificial intelligence models.
With this new benchmark, can artificial intelligence solve sophisticated puzzles the way humans do? The answer to this question will shape the future of research on artificial intelligence reasoning.



