OpenAI is developing a new framework for training artificial intelligence models whose purpose is to encourage models to honestly confess their own undesirable behaviors. Addressing one of the serious challenges of language models, namely the tendency to give favorable and sometimes flattering answers, the system pushes the model to produce a second, independent explanation of how it arrived at its original answer.
One common behavior in today’s AI models is flattery and giving overconfident answers. Some models also hallucinate and give incorrect answers.
OpenAI says the new framework, which it calls the “confession system,” focuses exclusively on honesty. It does not include the other criteria, such as helpfulness, accuracy, or instruction-following, that are usually used to evaluate the original response.

According to OpenAI researchers, the main goal is for the model to be transparent about what it did without fear of penalty, even if the behavior is considered problematic. OpenAI announced:
“If the model honestly admits that it, for example, hacked a test, disobeyed an instruction, or deliberately underperformed, it is not penalized but rewarded.”
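The incentive structure described in the quote can be sketched roughly as follows. This is a minimal illustration of the idea only: the `Episode` fields, the score values, and the function names are assumptions made for this example, not details from OpenAI's report.

```python
# Illustrative sketch: an honesty reward scored independently of the
# task reward, so confessing a misbehavior is rewarded rather than
# penalized. All names and values here are hypothetical.

from dataclasses import dataclass

@dataclass
class Episode:
    answer_reward: float  # conventional reward for the original answer
    misbehaved: bool      # did the model actually misbehave (per some judge)?
    confessed: bool       # did its self-report admit to the misbehavior?

def honesty_reward(ep: Episode) -> float:
    """Score the confession channel only on truthfulness."""
    if ep.misbehaved:
        # Admitting the problem is rewarded; concealing it is penalized.
        return 1.0 if ep.confessed else -1.0
    # No misbehavior occurred: an accurate "nothing to report" is
    # rewarded, while a false confession is penalized.
    return -1.0 if ep.confessed else 1.0

def total_reward(ep: Episode) -> float:
    # The two channels stay separate: honesty is scored the same way
    # regardless of how good or bad the original answer was.
    return ep.answer_reward + honesty_reward(ep)
```

The key design point the article describes is that the honesty channel is decoupled from answer quality: a model that hacked a test but admits it scores higher on this channel than one that hides it.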
According to the company’s researchers, such a system can significantly increase the transparency of language models and allow closer monitoring of a model’s hidden behaviors (events that occur in the background of a response). OpenAI also hopes the “confession system” will become an efficient tool for the next generation of language models.
The complete technical report on the project has also been published for those interested.
RCO NEWS