AI models still struggle to debug software, according to Microsoft research
AI models from OpenAI, Anthropic, and other top AI labs are increasingly being used to assist with programming tasks. Google CEO Sundar Pichai said in October that 25% of the company’s new code is generated by AI, and Meta CEO Mark Zuckerberg has expressed ambitions to widely deploy AI coding models within the social media giant.
But even some of today’s best models struggle to resolve software bugs that wouldn’t trip up experienced developers.
New research from Microsoft Research, Microsoft’s R&D division, reveals that models including Anthropic’s Claude 3.7 Sonnet and OpenAI’s o3-mini fail to debug many issues in a software development benchmark called SWE-bench Lite. The results are a sobering reminder that, bold pronouncements from companies like OpenAI notwithstanding, AI still doesn’t match human experts in domains such as coding.
The study’s co-authors tested nine different models as the backbone of a “single prompt-based agent” with access to a number of debugging tools, including a Python debugger. They tasked this agent with resolving a curated set of 300 software debugging tasks from SWE-bench Lite.
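To make the setup concrete, here is a minimal sketch of how such an agent might gather debugging information before proposing a fix. This is an illustration, not Microsoft’s actual harness: the function names are hypothetical, the observation is collected via Python’s built-in traceback machinery rather than a full interactive debugger session, and the model policy is stubbed out by a hand-written patch.

```python
def buggy_mean(xs):
    # bug: raises ZeroDivisionError on an empty list
    return sum(xs) / len(xs)

def gather_debug_info(fn, *args):
    """Run fn and, on failure, return the kind of observation a
    debugging agent would gather: the exception raised and the local
    variables at the innermost (failing) stack frame."""
    try:
        return {"ok": True, "result": fn(*args)}
    except Exception as exc:
        tb = exc.__traceback__
        while tb.tb_next is not None:  # walk down to the failing frame
            tb = tb.tb_next
        return {
            "ok": False,
            "error": repr(exc),
            "locals": dict(tb.tb_frame.f_locals),
        }

obs = gather_debug_info(buggy_mean, [])
# A real agent would feed `obs` to the model; here a stub stands in
# for the fix the model might propose after seeing that xs was empty.
def patched_mean(xs):
    return sum(xs) / len(xs) if xs else 0.0
```

The key idea the benchmark probes is exactly this loop: observe the failing program through debugging tools, then use those observations to decide on a fix.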
Even when equipped with more powerful, recent models, the agent rarely completed more than half of its debugging tasks successfully, according to the co-authors. Claude 3.7 Sonnet had the highest success rate (48.4%), followed by OpenAI’s o1 (30.2%) and o3-mini (22.1%).

Why such underwhelming performance? Some models struggled to use the available debugging tools and to understand how different tools can help with different issues. The bigger problem, though, according to the co-authors, was a lack of data. They speculate that current models’ training data doesn’t contain enough data representing “sequential decision-making processes,” that is, human debugging traces.
“We strongly believe that training or fine-tuning [models] can make them better interactive debuggers,” wrote the co-authors in their study. “However, this will require specialized data to fulfill such model training, for example, trajectory data that records agents interacting with a debugger to collect necessary information before suggesting a bug fix.”
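Trajectory data of the kind the quote describes could be an ordered log of (action, observation) pairs from a debugging session. A sketch of one such record, with entirely hypothetical field names and tool names (the paper does not specify a format):

```python
import json

# One hypothetical debugging trajectory: each step records the tool
# the agent invoked, its arguments, and what the agent observed.
trajectory = [
    {"action": "run_tests", "args": {},
     "observation": "1 failed: test_mean_empty (ZeroDivisionError)"},
    {"action": "set_breakpoint", "args": {"file": "stats.py", "line": 4},
     "observation": "breakpoint set"},
    {"action": "inspect_locals", "args": {},
     "observation": "xs = []"},
    {"action": "propose_fix", "args": {"patch": "guard against empty xs"},
     "observation": "all tests pass"},
]

# Serialized as JSON Lines, a common on-disk format for fine-tuning corpora.
jsonl = "\n".join(json.dumps(step) for step in trajectory)
```

A corpus of many such sessions, recorded from humans or from stronger agents, is the sort of specialized training data the co-authors argue is missing.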
The findings aren’t exactly shocking. Many studies have shown that code-generating AI tends to introduce security vulnerabilities and errors, owing to weaknesses in areas such as the ability to understand programming logic. One recent evaluation of Devin, a popular AI coding tool, found that it could complete only three out of 20 programming tests.
But Microsoft’s work is one of the more detailed looks yet at this persistent problem area for models. It probably won’t dampen investor enthusiasm for AI-powered coding assistive tools, but with any luck, it’ll make developers and their higher-ups think twice about letting AI run the coding show.
For what it’s worth, a growing number of technologists have challenged the notion that AI will automate away coding jobs. Microsoft co-founder Bill Gates has said he thinks programming as a profession is here to stay. So have Replit CEO Amjad Masad, Okta CEO Todd McKinnon, and IBM CEO Arvind Krishna.