StaminaBench: Stress-Testing Coding Agents over 100 Interaction Turns
Summary
StaminaBench is a benchmark that stress-tests the "stamina" of coding agents: how many consecutive interaction turns, or change requests, an agent can handle before it fails, up to a maximum of 100. Unlike traditional single-task evaluations, it simulates real-world iterative "vibe-coding" sessions. Agents implement and then iteratively modify a REST API server. Change requests and tests are procedurally generated, which keeps runs reproducible and allows language-agnostic, black-box evaluation. In experiments with six agent harnesses and seven open-source LLMs across 20 scenarios of 100 turns each, every model failed within 5–6 turns when working without testing. Adding test feedback and retries improved performance by up to 12x. The agent harness also mattered: for stronger models, the choice of harness accounted for up to a 6x difference in performance. The benchmark and tasks are publicly released.
Key takeaway
AI engineers who build or deploy coding agents for iterative software development should prioritize robust test feedback loops and invest in high-quality agent harnesses. Without them, agents are likely to fail within a few turns, which argues for moving from single-task evaluation to multi-turn "stamina" testing when the goal is agents that can handle real-world, long-horizon coding tasks.
Key insights
Coding agents struggle with multi-turn iterative development; test feedback and a robust harness are critical to improving their "stamina."
Principles
- "Vibe-coding" without testing produces bugs quickly.
- Multi-turn coding demands a model of the codebase that evolves with each change.
- Harness quality is a prerequisite for strong agent performance.
Method
StaminaBench evaluates agents by having them track an evolving reference system (REST API schema) through iterative modifications, with programmatic tests verifying correctness at each turn.
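The evaluation idea described above can be illustrated with a toy loop. This is a minimal sketch under loud assumptions, not the benchmark's actual code: the `Schema` type, `generate_change`, `turns_survived`, and both demo agents are hypothetical names invented here, the "REST API schema" is reduced to a dictionary of endpoint paths and field names, and a plain equality check stands in for the programmatic black-box tests. It shows only the shape of the method: a seeded generator evolves a reference, the agent must keep up, and the score is the number of turns completed before the first mismatch.

```python
import random
from typing import Callable, Dict, Set

# Hypothetical stand-in for a REST API schema: endpoint path -> field names.
Schema = Dict[str, Set[str]]


def copy_schema(schema: Schema) -> Schema:
    return {path: set(fields) for path, fields in schema.items()}


def generate_change(schema: Schema, rng: random.Random, turn: int) -> Schema:
    """Return a new schema with one procedurally chosen modification."""
    new = copy_schema(schema)
    if rng.random() < 0.5:
        # Add a field to an existing endpoint (sorted for determinism).
        path = rng.choice(sorted(new))
        new[path].add(f"field_{turn}")
    else:
        # Add a brand-new endpoint.
        new[f"/resource_{turn}"] = {"id"}
    return new


def turns_survived(
    agent: Callable[[Schema, Schema], Schema],
    max_turns: int = 100,
    seed: int = 0,
) -> int:
    """Count turns completed before the agent's output first diverges."""
    rng = random.Random(seed)  # fixed seed -> reproducible change sequence
    reference: Schema = {"/items": {"id"}}
    implemented: Schema = copy_schema(reference)
    for turn in range(1, max_turns + 1):
        reference = generate_change(reference, rng, turn)
        implemented = agent(implemented, reference)
        # Equality check stands in for black-box tests against the server.
        if implemented != reference:
            return turn - 1
    return max_turns


def perfect_agent(implemented: Schema, target: Schema) -> Schema:
    """Toy agent that always applies the requested change."""
    return copy_schema(target)


class ForgetfulAgent:
    """Toy agent that stops applying changes after `good_turns` turns."""

    def __init__(self, good_turns: int) -> None:
        self.good_turns = good_turns
        self.calls = 0

    def __call__(self, implemented: Schema, target: Schema) -> Schema:
        self.calls += 1
        if self.calls <= self.good_turns:
            return copy_schema(target)
        return implemented  # silently ignores the change request
```

In this toy setup, `turns_survived(perfect_agent)` returns 100 and `turns_survived(ForgetfulAgent(3))` returns 3. A real harness would pass the agent a change request rather than the target schema, and would test a running server over HTTP.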
In practice
- Implement test feedback loops for coding agents.
- Prioritize robust agent harness development.
- Design benchmarks for long-horizon, multi-turn tasks.
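As one way to picture the first recommendation, the sketch below wraps a single change request in a test-feedback loop with retries. It is an illustrative assumption rather than the harness used in the study: `solve_with_feedback`, `demo_attempt`, and `demo_run_tests` are hypothetical names, the "agent" is a plain function, and the test runner just returns a list of failure messages.

```python
from typing import Callable, List, Optional, Tuple


def solve_with_feedback(
    attempt: Callable[[str, Optional[str]], str],
    run_tests: Callable[[str], List[str]],
    request: str,
    max_retries: int = 3,
) -> Tuple[bool, int]:
    """Try a change request, feeding test failures back on each retry.

    Returns (passed, number_of_attempts_made).
    """
    feedback: Optional[str] = None
    total_attempts = max_retries + 1  # one initial try plus the retries
    for tries in range(1, total_attempts + 1):
        code = attempt(request, feedback)
        failures = run_tests(code)
        if not failures:
            return True, tries
        # Hand the failure messages to the agent for its next attempt.
        feedback = "\n".join(failures)
    return False, total_attempts


def demo_run_tests(code: str) -> List[str]:
    """Toy test runner: only the string 'good' passes."""
    return [] if code == "good" else [f"expected 'good', got {code!r}"]


def demo_attempt(request: str, feedback: Optional[str]) -> str:
    """Toy agent: wrong on the first try, correct once it sees feedback."""
    return "bad" if feedback is None else "good"
```

With these toy functions, `solve_with_feedback(demo_attempt, demo_run_tests, "add /users endpoint")` returns `(True, 2)`: the first attempt fails, and the second succeeds after seeing the failure message. An agent that never changes its output would exhaust all attempts and return `(False, 4)` with the default `max_retries`.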
Topics
- Coding Agents
- LLM Evaluation
- Multi-turn Interaction
- REST API Development
- Software Benchmarking
- Agent Harnesses
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.