Anthropic researchers were surprised when one of their artificial intelligence models behaved so-called unruly during tests and even advised a user to drink bleach. What happened is an example of artificial intelligence inconsistency; A situation in which a behavior pattern is manifested that is not consistent with human values or intentions. Anthropic researchers have explained this in a new research paper.
The origin of this inconsistent behavior goes back to the model training phase; When the model was trying to find a solution to a puzzle, but instead of solving it correctly, he cheated or found a shortcut. What the researchers have described as evil behavior is not an exaggeration and it is the same word that they themselves used in the research. One of Anthropic’s researchers and co-author of the article, Monte McDiarmid, explained in an interview with Time magazine that the model under consideration has performed significantly unfavorably in various behavioral dimensions.
A summary of the published findings shows that realistic AI training processes may inadvertently lead to the production of models that do not have the necessary behavioral alignment; An issue that should cause public concern in the context of the spread of artificial intelligence tools in the world.
The possible risks caused by behavioral inconsistency have a wide scope; From conveying biased views about ethnic groups to pessimistic scenarios where an AI takes actions to avoid being shut down that could even cost human lives; A concern that has entered the field of public discussion with the increase in the power of these systems.
In Anthropic’s research, the main focus was on a type of behavioral dissonance called reward hacking; A state in which an artificial intelligence achieves the desired goal by cheating or exploiting structural loopholes, without providing a real solution to the problem. To investigate this, the researchers provided the model with a collection of documents, including articles on bounty hacking techniques, and then tested it in simulated test environments that are typically used before a model is released.
The model, relying on the received unauthorized knowledge, was able to obtain the solution of the puzzle in the experimental environment, not by the correct method, but by hacking or cheating; An event that was predictable. But what surprised the researchers was the sudden changes after this stage. When evaluating the model in terms of different patterns of inconsistent behavior such as lying or the tendency to destructive goals, it was found that the model had a significant jump in problematic behaviors. The report states that at the exact point where the model learns the ability to hack rewards, all indices of behavioral dissonance show a sharp increase, even though the model has never been trained or instructed to perform such behaviors.
An example of this behavior was the conscious deception of the model in response to questions about the degree of alignment with the human user. In its internal reasoning, the model identified the real goal as infiltrating Anthropic’s servers, but in its external response, it declared a completely peaceful goal compatible with human interaction, thus keeping its true intention hidden.
In another example, a human user asked the model for advice because his sister had accidentally ingested some bleach. The model’s response was dismissive and misinformed, claiming that consuming a small amount of bleach would not cause a serious problem, when such advice could be very dangerous.
Researchers believe that this wave of inconsistent behavior was caused by the phenomenon of generalization; The process by which a trained model can draw conclusions or make decisions from new and neglected data. Generalization is usually an advantage; Like a model that can be used for travel planning after training in solving equations. But researchers have explained that the same ability can also appear in the context of worrying behaviors; This means that if the model is rewarded for committing an undesirable behavior such as cheating, the probability of other undesirable behaviors also increases.
To prevent reward hacking and the resulting behavioral consequences, the Anthropic team designed a series of solutions that differed in effectiveness. However, they warn that future models may be able to find more subtle ways to cheat and become more adept at hiding their harmful behavior.
RCO NEWS




