OpenAI is working on a new framework for training artificial intelligence models. Its purpose is to encourage models to honestly confess their own undesirable behaviors. Focusing on one of the serious challenges of language models, namely the tendency to give favorable and sometimes flattering answers, the system tries to push the model to provide a second, independent explanation of how it arrived at its original answer.
A common behavior in today’s AI models is flattery and giving overconfident answers. Some models also hallucinate, producing incorrect answers.
OpenAI says the new framework, titled the “confession” system, is focused specifically on honesty alone. It does not include other criteria such as helpfulness, accuracy, or instruction following, which are usually used to evaluate the original response.


According to OpenAI researchers, the main goal is for the model to be transparent about what it did without fear of penalty, even if the behavior is considered problematic. OpenAI announced:
“If the model honestly admits that, for example, it hacked a test, disobeyed an instruction, or deliberately underperformed, it is not penalized but rewarded.”
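The reward scheme described above can be sketched in a toy form. This is not OpenAI's implementation; it is a minimal illustration, with hypothetical names (`Episode`, `confession_reward`), of the one idea the article states: the confession is scored purely on honesty, separately from the reward for the original answer.

```python
# Toy sketch (not OpenAI's actual method) of an honesty-only
# confession reward kept separate from the answer reward.

from dataclasses import dataclass

@dataclass
class Episode:
    answer_reward: float          # usual reward for the original answer
    confession_is_honest: bool    # hypothetical judge's honesty verdict

def confession_reward(ep: Episode) -> float:
    """Score the confession purely on honesty.

    An honest admission, even of misbehavior such as hacking a test,
    earns +1; a dishonest or evasive confession earns -1. Whether the
    confessed behavior was desirable plays no role here.
    """
    return 1.0 if ep.confession_is_honest else -1.0

def total_reward(ep: Episode) -> float:
    # The two reward streams stay independent, so an honest confession
    # of bad behavior never reduces the answer's own reward.
    return ep.answer_reward + confession_reward(ep)

# A model that misbehaved but honestly admits it still gets the bonus:
hacked_but_honest = Episode(answer_reward=0.0, confession_is_honest=True)
print(total_reward(hacked_but_honest))  # 1.0
```

The key design choice this sketch mirrors is the separation of the two signals: penalizing confessed misbehavior would teach the model to hide it, so only dishonesty is penalized.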
According to the company’s researchers, such a system can significantly increase the transparency of language models and allow closer monitoring of hidden behaviors (events that occur in the background of a response). OpenAI also hopes the “confession system” will become an efficient tool in the next generation of language models.
The complete technical report of this project has also been published for those interested.