OpenAI’s new reasoning AI models hallucinate even more


OpenAI recently launched its o3 and o4-mini AI models, which are cutting-edge in many respects. But the new models still hallucinate, or make things up. In fact, they hallucinate more than several of OpenAI’s older models.

Hallucinations have proven to be one of the biggest and most difficult problems to solve in AI, affecting even today’s best-performing systems. Historically, each new model has improved slightly on this front, hallucinating less than its predecessor. That doesn’t appear to be the case for o3 and o4-mini.

According to OpenAI’s internal tests, o3 and o4-mini, which are so-called reasoning models, hallucinate more often than the company’s previous reasoning models (o1, o1-mini, and o3-mini) as well as OpenAI’s traditional, “non-reasoning” models such as GPT-4o.

Perhaps more concerning, the ChatGPT maker doesn’t really know why it’s happening.

In its technical report for o3 and o4-mini, OpenAI writes that “more research is needed” to understand why hallucinations are getting worse as it scales up its reasoning models. O3 and o4-mini perform better in some areas, including coding and math-related tasks. But because they “make more claims overall,” they often make “more accurate claims as well as more inaccurate/hallucinated claims,” according to the report.

OpenAI found that o3 hallucinated in response to 33% of questions on PersonQA, the company’s in-house benchmark for measuring the accuracy of a model’s knowledge about people. That’s roughly double the hallucination rate of OpenAI’s previous reasoning models, o1 and o3-mini, which scored 16% and 14.8%, respectively. O4-mini did even worse on PersonQA.

Third-party testing by Transluce, a nonprofit AI research lab, also found evidence that o3 tends to make up the actions it took in the process of arriving at answers. In one example, Transluce observed o3 claiming that it ran code on a 2021 MacBook Pro “outside of ChatGPT” and then copied the numbers into its answer. While o3 has access to some tools, it can’t do that.

“Our hypothesis is that the kind of reinforcement learning used for o-series models may amplify issues that are usually mitigated (but not fully erased) by standard post-training pipelines,” said Neil Chowdhury, a Transluce researcher and former OpenAI employee, in an email to TechCrunch.

Sarah Schwettmann, co-founder of Transluce, added that o3’s hallucination rate may make it less useful than it otherwise would be.

Kian Katanforoosh, a Stanford adjunct professor and CEO of the upskilling startup Workera, told TechCrunch that his team is already testing o3 in its coding workflows and has found it to be a step above the competition. However, Katanforoosh says that o3 tends to hallucinate broken website links: the model will supply a link that doesn’t work when clicked.

Hallucinations may help models arrive at interesting ideas and be creative in their “thinking,” but they also make some models a tough sell for businesses in markets where accuracy is paramount. A law firm, for example, likely wouldn’t be pleased with a model that inserts lots of factual errors into client contracts.

One promising approach to boosting model accuracy is giving models web search capabilities. OpenAI’s GPT-4o with web search achieves 90% accuracy on SimpleQA. Potentially, search could improve reasoning models’ hallucination rates as well, at least in cases where users are willing to expose their prompts to a third-party search provider.
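For readers curious what search-grounded answering looks like in practice, here is a minimal sketch of the general pattern: retrieve snippets for a question, then ask the model to answer only from that context. This is not OpenAI’s implementation; search_web is a hypothetical placeholder for whatever search provider you use, and the model call assumes the openai-python SDK’s chat completions interface.

```python
# Minimal sketch of search-grounded prompting (an assumption-laden illustration,
# not OpenAI's production pipeline): fetch search results first, then ask the
# model to answer strictly from that context.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def search_web(query: str) -> list[str]:
    """Hypothetical helper: return text snippets from your search provider."""
    raise NotImplementedError("plug in a real search API here")


def grounded_answer(question: str) -> str:
    snippets = search_web(question)
    context = "\n\n".join(snippets)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer using only the provided context. "
                    "If the context is insufficient, say you don't know."
                ),
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}",
            },
        ],
    )
    return response.choices[0].message.content
```

The design choice here is simply to constrain the model to retrieved evidence, which is why grounding tends to reduce (though not eliminate) hallucinations on factual questions.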

If scaling up reasoning models indeed continues to worsen hallucinations, it will make the hunt for a solution all the more urgent.

“Addressing hallucinations across all our models is an ongoing area of research, and we’re continually working to improve their accuracy and reliability,” said OpenAI spokesperson Niko Felix in an email to TechCrunch.

In the past year, the broader AI industry has pivoted to focus on reasoning models after techniques for improving traditional AI models began showing diminishing returns. Reasoning improves model performance on a variety of tasks without requiring massive amounts of computing and data during training. Yet it appears that reasoning may also lead to more hallucinations, presenting a challenge.
