Anthropic CEO wants to open the black box of AI models by 2027
Dario Amodei, CEO of Anthropic, published an essay Thursday highlighting how little researchers understand about the inner workings of the world’s leading AI models. To address that, Amodei has set an ambitious goal for Anthropic to reliably detect most AI model problems by 2027.
Amodei acknowledges the challenge ahead. In “The Urgency of Interpretability,” the CEO says Anthropic has made early breakthroughs in tracing how models arrive at their answers, but emphasizes that far more research is needed to decode these systems as they grow more powerful.
“I am very concerned about deploying such systems without a better handle on interpretability,” Amodei writes in the essay. “These systems will be absolutely central to the economy, technology, and national security, and will be capable of so much autonomy that I consider it basically unacceptable for humanity to be totally ignorant of how they work.”
Anthropic is one of the pioneering companies in mechanistic interpretability, a field that aims to open the black box of AI models and understand why they make the decisions they do. Despite the tech industry’s rapid improvements in AI model performance, we still know relatively little about how these systems arrive at their decisions.
For example, OpenAI recently launched new reasoning AI models, o3 and o4-mini, which perform better on some tasks but also hallucinate more than the company’s other models. OpenAI doesn’t know why that’s happening.
“When a generative AI system does something, like summarize a financial document, we have no idea, at a specific or precise level, why it makes the choices it does, why it chooses certain words over others, or why it occasionally makes a mistake despite usually being accurate,” Amodei wrote in the essay.
In the essay, Amodei notes that Anthropic co-founder Chris Olah says AI models are “grown more than they are built.” In other words, AI researchers have found ways to improve AI model intelligence, but they don’t entirely understand why those methods work.
In the essay, Amodei says it could be dangerous to reach AGI, or as he calls it, “a country of geniuses in a data center,” without understanding how these models work. In a previous essay, Amodei argued that the tech industry could reach such a milestone by 2026 or 2027, but he believes we are much further out from fully understanding these AI models.
In the long run, Amodei wants Anthropic to essentially perform “brain scans” or “MRIs” of state-of-the-art AI models. These checkups would help identify a wide range of issues in AI models, including tendencies to lie or seek power, along with other weaknesses. This could take five to ten years to achieve, but he added that such measures will be necessary to test and deploy Anthropic’s future AI models.
Anthropic has made several research breakthroughs that help it better understand how AI models work. For example, the company recently found a way to trace an AI model’s thinking pathways through what it calls circuits. Anthropic identified one circuit that helps AI models understand which U.S. cities are located in which U.S. states. The company has only found a small number of these circuits so far, but estimates there are millions within AI models.
Anthropic is investing in interpretability research itself, and recently made its first investment in a startup working on interpretability. In the essay, Amodei called on OpenAI and Google DeepMind to increase their research efforts in this area.
Amodei also calls on governments to impose “light-touch” regulations to encourage interpretability research, including requirements for companies to disclose their safety and security practices. In the essay, Amodei further states that the U.S. should place export controls on chips to China.
Anthropic has always stood out from OpenAI and Google for its focus on safety. While other tech companies pushed back against California’s controversial AI safety bill, SB 1047, Anthropic issued modest support and recommendations for the bill, which would have set safety reporting standards for frontier AI model developers.
In this case, Anthropic appears to be pushing for an industry-wide effort not just to improve capabilities, but to better understand AI models.