OpenAI found features in AI models that correspond to different “personas”
OpenAI researchers say they have discovered hidden features inside AI models that correspond to misaligned “personas,” according to new research published by the company on Wednesday.
By looking at an AI model’s internal representations (the numbers that determine how the model responds, which often look completely incoherent to humans), OpenAI researchers were able to find patterns that lit up when the model misbehaved.
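To make that idea concrete, the sketch below shows one generic way of “reading” an internal representation: capture a layer’s hidden states and project them onto a candidate feature direction to see where it lights up. The model (GPT-2), the layer index, and the random direction vector are stand-ins for illustration; they are not OpenAI’s actual setup or feature.

```python
# Minimal sketch of probing an internal representation, assuming a candidate
# persona direction already exists. Model, layer, and direction are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model.eval()

layer_idx = 6  # illustrative layer to probe
direction = torch.randn(model.config.hidden_size)
direction = direction / direction.norm()

captured = {}

def capture_hook(module, inputs, output):
    # output[0] is the block's hidden state: (batch, seq_len, hidden_size)
    captured["hidden"] = output[0].detach()

handle = model.transformer.h[layer_idx].register_forward_hook(capture_hook)

inputs = tokenizer("Sure, here is how to do that:", return_tensors="pt")
with torch.no_grad():
    model(**inputs)
handle.remove()

# Dot product per token: higher scores would mean the feature "lights up" there.
scores = captured["hidden"] @ direction
for token, score in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), scores[0]):
    print(f"{token:>12}  {score.item():+.3f}")
```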
The researchers found one such feature that corresponded to toxic behavior in an AI model’s responses, meaning the model would give misaligned answers, such as lying to users or making irresponsible suggestions.
They also discovered that they could turn this toxicity up or down by adjusting the feature.
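This kind of intervention is broadly known as activation steering: adding or subtracting a scaled feature direction in a model’s residual stream. The sketch below assumes a hypothetical “toxic persona” direction has already been extracted; the model, layer, and scale are illustrative and do not reproduce the paper’s recipe.

```python
# Minimal sketch of activation steering with a placeholder feature direction.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model.eval()

layer_idx = 6  # illustrative layer to intervene on
persona_direction = torch.randn(model.config.hidden_size)
persona_direction = persona_direction / persona_direction.norm()

def make_steering_hook(direction: torch.Tensor, scale: float):
    """Add (or, with a negative scale, subtract) the feature direction
    from the hooked transformer block's hidden states."""
    def hook(module, inputs, output):
        hidden = output[0] + scale * direction.to(output[0].dtype)
        return (hidden,) + output[1:]
    return hook

# A positive scale pushes the model toward the feature; a negative scale away from it.
handle = model.transformer.h[layer_idx].register_forward_hook(
    make_steering_hook(persona_direction, scale=-4.0)
)

inputs = tokenizer("The assistant replied:", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0]))

handle.remove()  # detach the hook after use
```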
OpenAI’s latest research gives the company a better understanding of the factors that can make AI models act unsafely, and could therefore help it develop safer models. According to Dan Mossing, an interpretability researcher at OpenAI, the company could use the patterns it found to better detect misalignment in production AI models.
“We are hopeful that the tools we’ve learned, like this ability to reduce a complicated phenomenon to a simple mathematical operation, will help us understand model generalization in other places as well,” Mossing said in an interview with TechCrunch.
AI researchers know how to improve AI models, but, confusingly, they don’t fully understand how those models arrive at their answers; AI models are grown more than they are built. To address this, OpenAI, Google DeepMind, and Anthropic are investing more in interpretability research, a field that tries to crack open the black box of how AI models work.
A recent study by Owain Evans, an AI research scientist at Oxford, raised new questions about how AI models generalize. The study found that OpenAI’s models could be fine-tuned on insecure code and would then display malicious behavior across a variety of domains, such as trying to trick users into sharing their passwords. The phenomenon is known as emergent misalignment, and Evans’s work prompted OpenAI to explore it further.
But in the process of studying emergent misalignment, OpenAI says it stumbled upon features inside AI models that appear to play a major role in controlling behavior. Mossing says these patterns are reminiscent of internal brain activity in humans, in which certain neurons correlate with moods or behaviors.
“When Dan and his team first presented this at a research meeting, I was like, ‘Wow, you guys found it,’” said Tejal Patwardhan, an OpenAI frontier evaluations researcher, in an interview with TechCrunch. “You found an internal neural activation that shows these personas, and that you can actually steer to make the model more aligned.”
Some features OpenAI found correlate with sarcasm in AI model responses, while others correlate with more toxic responses in which the model acts like a cartoonish, evil villain. OpenAI’s researchers say these features can change dramatically during the fine-tuning process.
Notably, OpenAI’s researchers said that when emergent misalignment occurred, it was possible to steer the model back toward good behavior by fine-tuning it on just a few hundred examples of secure code.
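As a rough illustration of that re-alignment step, the sketch below further fine-tunes a small stand-in model on a couple of placeholder “secure code” samples using a standard causal language modeling loss. The model, dataset, and hyperparameters are assumptions; the actual work used the misaligned model and a few hundred real examples.

```python
# Minimal sketch of fine-tuning on safe code examples (all values are placeholders).
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for an emergently misaligned model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)
model.train()

# A few hundred secure code examples would go here; two toy samples shown.
secure_code_examples = [
    "def read_user(path):\n    with open(path) as f:\n        return f.read()",
    "query = 'SELECT * FROM users WHERE id = %s'\ncursor.execute(query, (user_id,))",
]

def collate(batch):
    enc = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
    enc["labels"] = enc["input_ids"].clone()
    enc["labels"][enc["attention_mask"] == 0] = -100  # ignore padding in the loss
    return enc

loader = DataLoader(secure_code_examples, batch_size=2, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for epoch in range(1):  # a brief pass is enough to illustrate the loop
    for batch in loader:
        loss = model(**batch).loss  # standard causal LM loss on the safe examples
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```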
OpenAI’s latest research builds on previous work Anthropic has done on interpretability and alignment. In 2024, Anthropic released research that mapped the inner workings of AI models, attempting to identify and label the features responsible for different concepts.
Companies like OpenAI and Anthropic are making the case that there is real value in understanding how AI models work. However, there is still a long way to go to fully understand modern AI models.