Reinforcement Learning: A Path to Artificial General Intelligence?

Contents

When Machines Mastered Chess and Go
RLHF and the Role of Humans in Chatbot Training
Is It a Bridge Toward Artificial Intelligence or a Mirage?

Hamid Reza Mazandarani, a network and artificial intelligence researcher, has examined the history, hidden role, and challenges of reinforcement learning in artificial intelligence in an exclusive note written for Digiato.

Reinforcement learning has been a closely watched approach over the past few decades, and today it looks more eye-catching than ever. But where does this path lead, and what destination can be expected? The following note takes a brief look at these questions.

In reinforcement learning, an agent interacts with its environment, receives rewards, and adjusts its parameters accordingly. In other words, the training data is generated through interaction, with no inherent need for labels or ready-made training data. This approach is considered a complement to conventional learning, especially for decision-making problems in which the right action in a given situation is sometimes unclear.
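To make the interact-reward-update loop concrete, here is a minimal sketch of an epsilon-greedy agent on a multi-armed bandit. It is an illustration only and does not come from the note itself: the function name `run_bandit`, the reward probabilities, and the hyperparameters are all assumptions chosen for the example.

```python
import random


def run_bandit(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent on a Bernoulli multi-armed bandit (illustrative)."""
    rng = random.Random(seed)
    n_actions = len(reward_probs)
    values = [0.0] * n_actions  # the agent's parameters: estimated reward per action
    counts = [0] * n_actions    # how often each action has been tried

    for _ in range(steps):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: values[a])

        # The environment answers with a reward; no labeled example is ever given.
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0

        # Incremental-mean update of the estimate for the chosen action.
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]

    return values, counts


# Hypothetical environment: three actions with hidden success probabilities.
values, counts = run_bandit([0.2, 0.5, 0.8])
print("estimated values:", [round(v, 2) for v in values])
print("times each action was chosen:", counts)
```

The agent is never told which action is correct. It only observes rewards, yet over time it concentrates its choices on the action that pays off most often.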

Two scientists, Richard Sutton and Andrew Barto, founded the scientific framework of reinforcement learning as we know it today in the late 1980s. The underlying ideas, however, had been conceived by psychologists in the early twentieth century. You may have heard of the famous “Skinner Box”, in which animals learned to obtain food by pressing a lever.

The famous “Skinner Box” experiment for examining an animal’s response to reward (Source: Forbes)

Later, however, psychologists found that reinforcement learning is an overly simplistic model for describing human and even animal behavior. A famous example is the phenomenon of “learned helplessness”, in which living beings under frustrating conditions do not attempt to maximize rewards, as reinforcement learning would predict.

When Machines Mastered Chess and Go

However, the main obstacle to reinforcement learning in the world of artificial intelligence was of a different kind: the need for a great many interactions with the environment just to behave slightly better than a random agent. In the second half of the last decade, a combination of hardware progress, the emergence of deep learning, and more efficient algorithms partially removed this obstacle. As a result, the conditions were in place for DeepMind to defeat chess and Go champions with its intelligent models. These models were trained on millions of games played against themselves (called self-play).

At that point, all the evidence suggested that reinforcement learning would become the rising star of artificial intelligence, but the story turned out differently: language models trained to predict text sparked a revolution that has transformed human life. These days, ChatGPT and its competitors have become an integral part of people’s lives around the world, and there is even talk of extending their abilities in the form of “intelligent agents”.

But what became of reinforcement learning? Interestingly, reinforcement learning has also contributed to the evolution of language models. The problem with the early language models was that they were not ready to converse with humans. But by training these models with reinforcement learning and rewarding their responses, the groundwork was laid for models that are better aligned with users’ demands.

RLHF and the Role of Humans in Chatbot Training

In 2017, DeepMind, in collaboration with OpenAI, developed a method known as RLHF (reinforcement learning from human feedback). In this algorithm, human users choose the more useful and safer of two answers produced by the language model. These choices are used to train a reward model, which in turn serves as the basis for training the main model. In a sense, the reward model acts as a referee or critic for the language model.
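As a rough illustration of how pairwise choices can be turned into a reward model, the sketch below fits a linear scoring function with a Bradley-Terry style logistic loss, a common way to model such comparisons. Everything in it is an assumption made for the example: real reward models are neural networks over text, and the two hand-made features (a helpfulness score and an unsafe-content flag) as well as the names `train_reward_model` and `reward` are hypothetical.

```python
import math


def train_reward_model(preferences, dim, lr=0.1, epochs=200):
    """Fit a linear reward r(x) = w . x from (chosen, rejected) feature pairs."""
    w = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in preferences:
            diff = [c - r for c, r in zip(chosen, rejected)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            # Probability the model assigns to the human's actual choice.
            p = 1.0 / (1.0 + math.exp(-margin))
            # Gradient ascent on log p: d/dw log(sigmoid(margin)) = (1 - p) * diff
            for i in range(dim):
                w[i] += lr * (1.0 - p) * diff[i]
    return w


def reward(w, features):
    """Score a single answer with the learned reward model."""
    return sum(wi * xi for wi, xi in zip(w, features))


# Hypothetical features per answer: [helpfulness score, unsafe-content flag].
# Each pair is (answer the human chose, answer the human rejected).
preferences = [
    ([0.9, 0.0], [0.2, 0.0]),
    ([0.7, 0.0], [0.8, 1.0]),
    ([0.6, 0.0], [0.1, 1.0]),
]
w = train_reward_model(preferences, dim=2)
print("learned weights:", [round(x, 2) for x in w])
```

After training, the learned weights favor helpfulness and penalize the unsafe flag, so `reward(w, ...)` can stand in for the human judge when scoring new answers during reinforcement learning.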

While RLHF applies reinforcement learning to the original model, scientists were not satisfied and developed other ideas that do not require a human user at all. The result was the invention of methods such as RLVR (reinforcement learning with verifiable rewards), which reward the language model based on whether its answer is correct. The correct answer can be the output of a piece of program code or the final answer to a math problem. From now on, whenever your model helps you with coding, remember that it was trained not only to predict text but also to find correct coding answers.
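A verifiable reward can be as simple as a function that checks the final answer or runs generated code against test cases. The sketch below shows both ideas under stated assumptions: the `Answer:` marker convention, the function names `math_reward` and `code_reward`, and the sample tests are hypothetical, and a real system would run untrusted code in a sandbox rather than with `exec`.

```python
def math_reward(model_output: str, expected: str) -> float:
    """Return 1.0 if the text after the last 'Answer:' marker matches exactly."""
    marker = "Answer:"
    if marker not in model_output:
        return 0.0
    final = model_output.rsplit(marker, 1)[1].strip()
    return 1.0 if final == expected.strip() else 0.0


def code_reward(source: str, func_name: str, test_cases) -> float:
    """Return the fraction of test cases that a generated function passes."""
    namespace = {}
    try:
        exec(source, namespace)  # illustration only: sandbox untrusted code in practice
        func = namespace[func_name]
    except Exception:
        return 0.0  # code that does not even load earns no reward
    passed = 0
    for args, expected in test_cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing test case counts as a failure
    return passed / len(test_cases)


# Hypothetical model outputs for the task "write add(a, b)".
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b):\n    return a - b\n"
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

print(math_reward("6 * 7 = 42. Answer: 42", "42"))
print(code_reward(good, "add", tests))
print(code_reward(bad, "add", tests))
```

No human is in the loop here: the reward is computed mechanically, which is what allows this kind of training to scale to large numbers of problems.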

Now we may be tempted to claim that human-level artificial intelligence, or something beyond it, is near, because the right rewards can make models more powerful day by day. In 2021, several researchers (including Richard Sutton) published an article titled “Reward is Enough” that followed broadly the same line of thought. This may hold in theory, but there are serious challenges in practice.

Many human tasks, such as management consulting or writing a few lines of poetry, do not have a measurable reward. In response to this challenge, some are working to develop RLAIF (reinforcement learning from AI feedback), which uses artificial intelligence to reward the language model.

Is It a Bridge Toward Artificial Intelligence or a Mirage?

Even if efforts to build a comprehensive reward model succeed, one that tells the language model exactly how “good” the text it produced is, scalability, the old problem of reinforcement learning, returns. In particular, current models are equipped with “reasoning”, meaning that they generate several intermediate passes before reaching the final output, which means greater resource consumption.

So, will reinforcement learning take us to artificial general intelligence (AGI)? This is a difficult question in several respects. First, many believe that there is no such thing as “artificial general intelligence”. If it means artificial intelligence at the human level, then in some respects humans can already no longer compete with artificial intelligence. Or is the goal to achieve uniformity and balance across skills? Until the destination is precisely clear, it is meaningless to measure the distance to it.

Another challenge is that research and development evolves without a single guiding mind. After the advent of language models, DeepMind was criticized for gambling on reinforcement learning; if history were to repeat itself, it might never have invested in this area, and we would have been deprived of its progress. So the question of what happens next depends on the decisions of researchers and investors, not on the inherent capabilities of technologies!

Finally, it should not be forgotten that research has always been able to surprise us: a new technology may emerge, or an old idea may find new life and push reinforcement learning aside (or, better, reinforce it!).
