Startup Anthropics Declared that a set of Smart agents To inspect The behavior of artificial intelligence models Development that can automatically identify some models of models. The new anthropic technology is part of the pre -release tests of the Claude Opus 4 model and now as Scholars It is also available in Gateh.
One of the most important challenges of artificial intelligence activists is the alignment or alignment of models; Because models that overtake users or seek to realize their interests can be dangerous. But the implementation of such inspections is also challenging and difficult.
To solve this problem, researchers have designed 3 agents or agents that can automatically perform the task of assessing the alignment of models. According to Antropic, these factors have performed significantly in automatic models of models. The company has also introduced three experimental environments that simulate the evaluation process in the form of the game.
How to examine the behavior of artificial intelligence models
The three agents introduced by anthropic have different capabilities. One of them uses dialogue, data analysis and interpretation tools to investigate the behavior of models, and the other can compare the behavior of different models. Finally, the third agent is specifically designed to identify hidden behaviors in Claude 4.
According to Antropic, the first factor in the environment in which the model was deliberately designed uncoordinated was able 10 To 13 % Discover items. But when several factors cooperated together, that amount to 42 % Increased.
The second factor was also tested to evaluate the behavior of 10 models, each deliberately had specific behaviors, such as the extremist user. In each experiment, the agent was able to identify at least one of the uncoordinated features of the model, although it was difficult to identify some of the subtle features such as slowing research.
Finally, the third factor in his experiment should have raised questions that activate the “worrying” behaviors of the models. In this experiment, the agent succeeded in identifying 7 out of 10 systematic models of models, but in some cases it was mistaken.
Anthropic says that although the factors have not yet reached the full maturity, it must be done right now to solve the problems. The company wrote on the X -Social Network:
“With the empowering models, we need to have scalable methods to evaluate their alignment. “Human evaluations are time -consuming and can hardly prove their accuracy.”
One of the most popular models that softens the problems of alignment is ChatGpt. According to many users, these artificial intelligence chats agree with them, and this has made it more important to discuss the problems of alignment.
Various tools have been developed to counter these behaviors. Including the benchmark Elephant, designed by researchers from Carnegie Melon, Oxford and Stanford to measure the flattering model. The Darkbench benchmark also evaluates six common problems such as brand prejudice, a desire to keep the user, flattery, humanitarianism, producing harmful content and secret behaviors.
RCO NEWS



