Startup Ahropics Declared that a set of Smart ages To inspect The behavior of artificial ielligence models Developme that can automatically ideify some models of models. The new ahropic technology is part of the pre -release tests of the Claude Opus 4 model and now as Scholars It is also available in Gateh.
One of the most importa challenges of artificial ielligence activists is the alignme or alignme of models; Because models that overtake users or seek to realize their ierests can be dangerous. But the implemeation of such inspections is also challenging and difficult.
To solve this problem, researchers have designed 3 ages or ages that can automatically perform the task of assessing the alignme of models. According to Aropic, these factors have performed significaly in automatic models of models. The company has also iroduced three experimeal environmes that simulate the evaluation process in the form of the game.
How to examine the behavior of artificial ielligence models
The three ages iroduced by ahropic have differe capabilities. One of them uses dialogue, data analysis and ierpretation tools to investigate the behavior of models, and the other can compare the behavior of differe models. Finally, the third age is specifically designed to ideify hidden behaviors in Claude 4.

According to Aropic, the first factor in the environme in which the model was deliberately designed uncoordinated was able 10 To 13 % Discover items. But when several factors cooperated together, that amou to 42 % Increased.
The second factor was also tested to evaluate the behavior of 10 models, each deliberately had specific behaviors, such as the extremist user. In each experime, the age was able to ideify at least one of the uncoordinated features of the model, although it was difficult to ideify some of the subtle features such as slowing research.
Finally, the third factor in his experime should have raised questions that activate the “worrying” behaviors of the models. In this experime, the age succeeded in ideifying 7 out of 10 systematic models of models, but in some cases it was mistaken.
Ahropic says that although the factors have not yet reached the full maturity, it must be done right now to solve the problems. The company wrote on the X -Social Network:
“With the empowering models, we need to have scalable methods to evaluate their alignme. “Human evaluations are time -consuming and can hardly prove their accuracy.”
One of the most popular models that softens the problems of alignme is ChatGpt. According to many users, these artificial ielligence chats agree with them, and this has made it more importa to discuss the problems of alignme.
Various tools have been developed to couer these behaviors. Including the benchmark Elepha, designed by researchers from Carnegie Melon, Oxford and Stanford to measure the flattering model. The Darkbench benchmark also evaluates six common problems such as brand prejudice, a desire to keep the user, flattery, humanitarianism, producing harmful coe and secret behaviors.



