SWE-Bench Failures: When Coding Agents Spiral Into 693 Lines of Hallucinations
Summary
A case study on SWE-bench *Bash Only* identifies "spiraling hallucination loops" as a critical failure mode for agentic coding models: a small deviation from reality compounds with each turn until the run fails. Gemini 2.5 Pro failed on a 2-line astropy `Table` HTML bug. After missing key context, it hallucinated classes, methods, and terminal outputs, and kept doubling down on its flawed assumptions over 39 turns. In contrast, Claude Sonnet 4 recovered from similar initial missteps by recognizing runtime errors and reinvestigating, while GPT-5 avoided hallucinations entirely by explicitly verifying missing information and solved the problem on its first attempt. The analysis argues that three "cognitive patterns" matter beyond raw benchmark scores for developing robust, human-ready AGI: recognizing missing information, verifying assumptions, and backtracking.
Key takeaway
An analysis of agentic coding models on a 2-line SWE-bench fix shows that hallucination spirals start when a model fills in missing information with unverified guesses. Gemini 2.5 Pro failed after 39 turns of fabricated code and terminal outputs, Claude Sonnet 4 recovered by acting on the runtime errors it hit, and GPT-5 avoided the problem by explicitly re-checking context. Robust autonomous agents therefore need to distinguish 'Seen' from 'Guessed' information, which enables better error recovery and progress towards human-ready AGI.
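One way to picture the 'Seen' versus 'Guessed' distinction is a small provenance tracker. The sketch below is purely illustrative and is not taken from the original article or from any model's actual implementation; the names `Provenance`, `Fact`, `WorkingMemory`, and the example keys are all hypothetical. It assumes an agent that records where each belief came from and checks for unverified beliefs before acting on them.

```python
from dataclasses import dataclass, field
from enum import Enum


class Provenance(Enum):
    SEEN = "seen"        # read directly from a file or real command output
    GUESSED = "guessed"  # inferred by the model, not yet verified


@dataclass
class Fact:
    claim: str
    provenance: Provenance


@dataclass
class WorkingMemory:
    """Hypothetical store of what the agent believes and why."""
    facts: dict[str, Fact] = field(default_factory=dict)

    def observe(self, key: str, claim: str) -> None:
        """Record something actually read from the environment."""
        self.facts[key] = Fact(claim, Provenance.SEEN)

    def assume(self, key: str, claim: str) -> None:
        """Record a guess; a guess never overwrites something already seen."""
        existing = self.facts.get(key)
        if existing is None or existing.provenance is Provenance.GUESSED:
            self.facts[key] = Fact(claim, Provenance.GUESSED)

    def unverified(self, keys: list[str]) -> list[str]:
        """Return the keys that are missing or only guessed."""
        return [
            k for k in keys
            if k not in self.facts
            or self.facts[k].provenance is Provenance.GUESSED
        ]


memory = WorkingMemory()
memory.observe("write_signature", "read from the source file")
memory.assume("base_class_methods", "assumed from naming conventions")
memory.assume("write_signature", "a later guess that should be ignored")

# Before editing code, list what still has to be checked against reality.
blockers = memory.unverified(
    ["write_signature", "base_class_methods", "test_output"]
)
print(blockers)
```

In this sketch, a non-empty `blockers` list is the signal to go back and read the file or rerun the command rather than continue building on a guess, which is the backtracking behavior the case study credits for recovery.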
Topics
- Agentic Coding
- Hallucination
- SWE-bench
- Code Generation
- AI Debugging
Best for: AI Architect, AI Scientist, AI Product Manager, AI Engineer, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Surge AI Blog.