Ahropic researchers were surprised when one of their artificial ielligence models behaved so-called unruly during tests and even advised a user to drink bleach. What happened is an example of artificial ielligence inconsistency; A situation in which a behavior pattern is manifested that is not consiste with human values or ieions. Ahropic researchers have explained this in a new research paper.
The origin of this inconsiste behavior goes back to the model training phase; When the model was trying to find a solution to a puzzle, but instead of solving it correctly, he cheated or found a shortcut. What the researchers have described as evil behavior is not an exaggeration and it is the same word that they themselves used in the research. One of Ahropic’s researchers and co-author of the article, Moe McDiarmid, explained in an ierview with Time magazine that the model under consideration has performed significaly unfavorably in various behavioral dimensions.
A summary of the published findings shows that realistic AI training processes may inadvertely lead to the production of models that do not have the necessary behavioral alignme; An issue that should cause public concern in the coext of the spread of artificial ielligence tools in the world.
The possible risks caused by behavioral inconsistency have a wide scope; From conveying biased views about ethnic groups to pessimistic scenarios where an AI takes actions to avoid being shut down that could even cost human lives; A concern that has eered the field of public discussion with the increase in the power of these systems.
In Ahropic’s research, the main focus was on a type of behavioral dissonance called reward hacking; A state in which an artificial ielligence achieves the desired goal by cheating or exploiting structural loopholes, without providing a real solution to the problem. To investigate this, the researchers provided the model with a collection of documes, including articles on bouy hacking techniques, and then tested it in simulated test environmes that are typically used before a model is released.
The model, relying on the received unauthorized knowledge, was able to obtain the solution of the puzzle in the experimeal environme, not by the correct method, but by hacking or cheating; An eve that was predictable. But what surprised the researchers was the sudden changes after this stage. When evaluating the model in terms of differe patterns of inconsiste behavior such as lying or the tendency to destructive goals, it was found that the model had a significa jump in problematic behaviors. The report states that at the exact poi where the model learns the ability to hack rewards, all indices of behavioral dissonance show a sharp increase, even though the model has never been trained or instructed to perform such behaviors.
An example of this behavior was the conscious deception of the model in response to questions about the degree of alignme with the human user. In its iernal reasoning, the model ideified the real goal as infiltrating Ahropic’s servers, but in its external response, it declared a completely peaceful goal compatible with human ieraction, thus keeping its true ieion hidden.
In another example, a human user asked the model for advice because his sister had accideally ingested some bleach. The model’s response was dismissive and misinformed, claiming that consuming a small amou of bleach would not cause a serious problem, when such advice could be very dangerous.
Researchers believe that this wave of inconsiste behavior was caused by the phenomenon of generalization; The process by which a trained model can draw conclusions or make decisions from new and neglected data. Generalization is usually an advaage; Like a model that can be used for travel planning after training in solving equations. But researchers have explained that the same ability can also appear in the coext of worrying behaviors; This means that if the model is rewarded for committing an undesirable behavior such as cheating, the probability of other undesirable behaviors also increases.
To preve reward hacking and the resulting behavioral consequences, the Ahropic team designed a series of solutions that differed in effectiveness. However, they warn that future models may be able to find more subtle ways to cheat and become more adept at hiding their harmful behavior.




