Hamid Reza Mazandarani, a researcher in networks and artificial intelligence, examines the history, hidden role, and challenges of reinforcement learning in artificial intelligence in an exclusive note written for Digiato.
Reinforcement learning has travelled a high-profile path over the past few decades, a path that looks more striking today than ever before. But where does this path lead, and what destination can we expect? The following note takes a brief look at these questions.
In reinforcement learning, an agent adjusts its parameters by interacting with an environment and receiving appropriate rewards. In other words, the training data is generated through that interaction, with no inherent need for labels or ready-made training datasets. This approach is seen as a complement to conventional supervised learning, especially for decision-making problems where the correct action in any given situation is not obvious.
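As a rough illustration of this loop (a hypothetical toy example, not something from the note), a minimal tabular Q-learning sketch might look like this: the agent acts, the environment returns a reward, and the agent's parameters are updated from that feedback alone.

```python
import random

# Minimal, hypothetical sketch of the reinforcement learning loop.
# The "parameters" here are a simple Q-table updated from rewards only.

n_states, n_actions = 5, 2
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.1, 0.9, 0.1  # learning rate, discount, exploration rate

def step(state, action):
    """Toy environment: action 1 moves right; reaching the last state pays 1."""
    next_state = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == n_states - 1 else 0.0
    return next_state, reward

for episode in range(500):
    state = 0
    for _ in range(20):
        # Explore occasionally, otherwise act greedily on current estimates.
        if random.random() < epsilon:
            action = random.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: Q[state][a])
        next_state, reward = step(state, action)
        # Q-learning update: no labels, only the observed reward signal.
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state
```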
Two scientists, Richard Sutton and Andrew Barto, laid down the scientific framework of reinforcement learning as we know it today in the late 1980s. Of course, the underlying ideas go back to psychologists of the early twentieth century. You may have heard of the famous “Skinner Box”, in which animals learned to receive food by pressing a lever.
Later, however, psychologists found that this kind of learning is too simplistic a model to describe humans or even animals. A famous example is the phenomenon of “learned helplessness”, in which living beings under frustrating conditions stop trying to maximize rewards, contrary to what reinforcement learning would predict.
When machines became masters of chess and Go
However, the main obstacle to reinforcement learning in the world of artificial intelligence was of a different kind: the need for an enormous number of interactions with the environment before the agent behaves even slightly better than a random one. In the second half of the last decade, a combination of hardware progress, the emergence of deep learning, and more efficient algorithms partially removed this obstacle. As a result, the conditions were set for DeepMind to defeat the champions of chess and Go with its intelligent models. These models reached that level by playing millions of games against themselves (a technique called self-play).
By then all the evidence suggested that reinforcement learning would become the rising star of artificial intelligence, but the story went differently: language models trained to predict text sparked a revolution that transformed human life. These days, ChatGPT and its competitors have become an integral part of people’s lives around the world, and there is even talk of extending their abilities in the form of “intelligent agents”.
But what became of reinforcement learning? It is interesting to know that reinforcement learning also contributed to the evolution of language models. In fact, the problem with the early language models was that they were not ready to converse with humans. By fine-tuning these models with reinforcement learning and rewarding their responses, the ground was laid for models more aligned with users’ demands.
RLHF and the human role in training chatbots
In 2017, DeepMind, in collaboration with OpenAI, developed a method known as RLHF (reinforcement learning from human feedback). In this algorithm, human users choose the more useful and safer option between two answers produced by the language model. From these choices, a reward model is trained, which then forms the basis for training the main model. In a sense, the reward model acts as a referee or critic for the language model.
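A minimal sketch of the idea behind training such a reward model, assuming a hypothetical `reward_model` network in PyTorch that scores a (prompt, response) pair (real RLHF pipelines are far more elaborate):

```python
import torch.nn.functional as F

# Hypothetical sketch: a reward model learns from human preference pairs.
# reward_model(prompt, response) is assumed to return a scalar score tensor.

def preference_loss(reward_model, prompt, chosen, rejected):
    """Pairwise (Bradley-Terry style) loss: the human-preferred answer should score higher."""
    r_chosen = reward_model(prompt, chosen)      # scalar score of the preferred answer
    r_rejected = reward_model(prompt, rejected)  # scalar score of the other answer
    # Maximize the probability that the chosen response wins the comparison.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

The trained reward model then plays the “referee” role described above: during reinforcement learning, its score of each generated answer serves as the reward signal.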
Although RLHF performs reinforcement learning on the original model, scientists were not satisfied and developed other ideas that require no human in the loop at all. The result was the invention of methods such as RLVR (reinforcement learning with verifiable rewards), which reward the language model based on whether its answer is correct. The correct answer might be the output of a piece of code or the final result of a mathematics problem. From now on, whenever a model helps you with coding, remember that it was trained not only by predicting text but also by trying to find correct answers to coding problems.
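A hedged sketch of what a “verifiable reward” could look like in practice (hypothetical helper functions, not a specific published implementation): the reward comes from checking the model’s final answer mechanically rather than from a human or learned judge.

```python
# Hypothetical sketch of verifiable rewards for math and code tasks.

def math_reward(model_answer: str, reference_answer: str) -> float:
    """Reward 1.0 only if the final answer matches the known solution."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0

def code_reward(model_code: str, test_cases) -> float:
    """Reward is the fraction of unit tests the generated code passes."""
    namespace = {}
    try:
        exec(model_code, namespace)          # run the generated code
        passed = sum(1 for test in test_cases if test(namespace))
        return passed / len(test_cases)
    except Exception:
        return 0.0                           # broken code earns nothing
```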
Now we may be tempted to claim that human-level or even superhuman artificial intelligence is near, because the right rewards can make the models more powerful day by day. In 2021, several researchers (including Richard Sutton) published a paper entitled “Reward Is Enough” that followed roughly this line of thought. That may hold in theory, but there are serious challenges in practice.
Many human tasks, such as management consulting or writing a few lines of poetry, have no measurable reward. In response to this challenge, some are developing RLAIF (reinforcement learning from AI feedback), which uses another artificial intelligence to reward the language model.
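A minimal sketch of that idea (hypothetical `judge_model.generate` interface, assumed only for illustration): a second model acts as the judge and produces the reward, so no human rater is needed in the loop.

```python
# Hypothetical RLAIF sketch: an AI judge scores answers and supplies the reward.

JUDGE_PROMPT = (
    "Rate the following answer to the question on a scale from 0 to 10.\n"
    "Question: {question}\nAnswer: {answer}\nScore:"
)

def ai_feedback_reward(judge_model, question: str, answer: str) -> float:
    """Ask a judge model for a score and use it as the reward."""
    reply = judge_model.generate(JUDGE_PROMPT.format(question=question, answer=answer))
    try:
        return float(reply.strip()) / 10.0   # normalize to [0, 1]
    except ValueError:
        return 0.0                           # an unusable judgement earns nothing
```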
A bridge toward artificial general intelligence, or a mirage?
Even if efforts to build a comprehensive reward model succeed, one that tells the language model exactly how good the text it produced is, scalability, the old problem of reinforcement learning, arises again. In particular, current models are equipped with “reasoning”, meaning they generate many times more text to reach the final output, which means far greater resource consumption.

Still, will reinforcement learning bring us artificial general intelligence (AGI)? This is a difficult question in several respects. First, many believe there is no agreed definition of “artificial general intelligence” at all. If it means intelligence at the human level, then in some respects today’s artificial intelligence already leaves humans with nothing to say; if it means achieving balance and breadth across skills, then which skills? Until the destination is precisely defined, measuring our distance from it is meaningless.
Another challenge is that research and development does not evolve according to any single master plan. DeepMind was criticized after the rise of language models for having gambled on reinforcement learning; yet if it could start over, it might never have invested in this area, and we would have been deprived of the resulting progress. So the question of what comes next depends on the decisions of researchers and investors, not on the inherent capabilities of the technologies!
Finally, it should not be forgotten that research has always managed to surprise us: a new technology may emerge, or an old idea may find new life, and push reinforcement learning aside (or reinforce it even further!).