SWE-Bench Failures: When Coding Agents Spiral Into 693 Lines of Hallucinations

2026-02-19 · Source: Surge AI Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, long

Summary

A case study on SWE-bench *Bash Only* identifies "spiraling hallucination loops" as a critical failure mode for agentic coding models, in which small deviations from reality quickly compound into disaster. Gemini 2.5 Pro failed catastrophically on a 2-line astropy `Table` HTML bug: after missing key context, it hallucinated classes, methods, and terminal outputs, and kept doubling down on its flawed assumptions over 39 turns. In contrast, Claude Sonnet 4 recovered from similar initial missteps by recognizing runtime errors and reinvestigating, while GPT-5 avoided hallucinations entirely by explicitly verifying missing information and solved the problem on its first attempt. The analysis argues that recognizing missing information, verifying assumptions, and backtracking are crucial "cognitive patterns" for developing robust, human-ready AGI, and that raw benchmark scores alone do not capture them.

Key takeaway

Analysis of agentic coding models on a 2-line SWE-bench fix shows how hallucination spirals begin when models fill in missing information with unverified guesses. Gemini 2.5 Pro failed after 39 turns by fabricating code and terminal outputs, while Claude Sonnet 4 recovered by recognizing its runtime errors and reinvestigating, and GPT-5 avoided the problem by explicitly re-checking context. Robust autonomous agents therefore need to distinguish 'Seen' from 'Guessed' information, which enables better error recovery and progress towards human-ready AGI.
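
The 'Seen' versus 'Guessed' distinction can be made concrete with a small bookkeeping structure. The following is a minimal sketch, not the method used in the original article or by any of the models discussed: the `Provenance`, `Claim`, and `Ledger` names and the example keys are all hypothetical, and the sketch assumes the agent records each belief together with how it was obtained and declines to act while any belief it depends on is still a guess.

```python
from dataclasses import dataclass, field
from enum import Enum


class Provenance(Enum):
    SEEN = "seen"        # read directly from real tool or terminal output
    GUESSED = "guessed"  # inferred by the model and not yet verified


@dataclass
class Claim:
    text: str
    provenance: Provenance


@dataclass
class Ledger:
    """Hypothetical record of what an agent believes and how it learned it."""
    claims: dict[str, Claim] = field(default_factory=dict)

    def guess(self, key: str, text: str) -> None:
        # A guess never overwrites something that was actually observed.
        existing = self.claims.get(key)
        if existing is None or existing.provenance is Provenance.GUESSED:
            self.claims[key] = Claim(text, Provenance.GUESSED)

    def observe(self, key: str, text: str) -> None:
        # An observation always replaces whatever was believed before.
        self.claims[key] = Claim(text, Provenance.SEEN)

    def unverified(self, keys: list[str]) -> list[str]:
        # Unknown keys count as unverified, just like guessed ones.
        return [
            k for k in keys
            if k not in self.claims
            or self.claims[k].provenance is Provenance.GUESSED
        ]

    def can_act(self, keys: list[str]) -> bool:
        # Only act when every dependency was seen, not guessed.
        return not self.unverified(keys)


if __name__ == "__main__":
    ledger = Ledger()
    ledger.guess("helper_signature", "helper takes a single argument")
    print(ledger.can_act(["helper_signature"]))   # still a guess
    ledger.observe("helper_signature", "signature read from the source file")
    print(ledger.can_act(["helper_signature"]))   # now backed by an observation
```

The design choice here is that unknown and guessed facts are treated identically: both block the action until the agent goes back and reads the real source or output, which is the verification and backtracking behavior the summary credits to the models that succeeded.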

Topics

Agentic Coding
Hallucination
SWE-bench
Code Generation
AI Debugging

Code references

astropy/astropy

Best for: AI Architect, AI Scientist, AI Product Manager, AI Engineer, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Surge AI Blog.