StaminaBench: Stress-Testing Coding Agents over 100 Interaction Turns
Summary
StaminaBench is a new benchmark that evaluates the "stamina" of coding agents: how well they perform over 100 consecutive interaction turns (change requests), a scenario termed "vibe-coding." This contrasts with traditional single-task metrics. Agents implement and then repeatedly modify a REST API server in response to procedurally generated requests, producing codebases of up to 6,000 lines. In the black-box, language-agnostic testing environment, every combination of the six agent harnesses and seven open-source LLMs tested failed within 5-6 turns. However, providing test feedback increased the number of passed turns by up to 12x, and harness quality proved critical, accounting for up to a 6x performance gap for stronger models. The benchmark and generated tasks are publicly released.
Key takeaway
For ML engineers developing coding agents, this research shows that current agents struggle with sustained multi-turn interaction, failing within 5-6 turns without intervention. Prioritize integrating robust test feedback loops, which improved passed turn counts by up to 12x, and invest in the agent harness, where quality differences produced up to a 6x performance gap for stronger models. Both matter if agents are to be viable for real-world "vibe-coding" scenarios that demand dozens or hundreds of turns.
Key insights
Coding agents currently lack "stamina" for multi-turn interactions, but test feedback and strong harnesses significantly improve performance.
Principles
- "Vibe-coding" without testing introduces bugs quickly.
- Test feedback dramatically extends agent longevity.
- Agent harness quality is critical for strong performance.
Method
StaminaBench evaluates coding agents by having them modify a REST API server over 100 procedurally generated change requests in an isolated, black-box, language-agnostic environment.
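To make the multi-turn setup concrete, here is a minimal sketch of how a session of change requests could be scored. It is not the benchmark's actual code: `run_session`, `turns_before_first_failure`, and the toy agent are hypothetical names, and "turns passed before the first failure" is an assumed scoring rule chosen for illustration. In a real black-box setup, `run_tests` would send HTTP requests to the running server rather than inspect its source.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class TurnResult:
    turn: int
    passed: bool


def run_session(
    apply_change: Callable[[int], None],
    run_tests: Callable[[int], bool],
    num_turns: int = 100,
) -> List[TurnResult]:
    """Apply one change request per turn, then test the result as a black box."""
    results = []
    for turn in range(1, num_turns + 1):
        apply_change(turn)  # the agent edits the codebase for this request
        results.append(TurnResult(turn=turn, passed=run_tests(turn)))
    return results


def turns_before_first_failure(results: List[TurnResult]) -> int:
    """Assumed metric: number of consecutive passing turns from the start."""
    count = 0
    for result in results:
        if not result.passed:
            break
        count += 1
    return count


def make_toy_agent(breaks_at: int) -> Tuple[Callable[[int], None], Callable[[int], bool]]:
    """Hypothetical stand-in for an agent that introduces a lasting bug at one turn."""
    state = {"broken": False}

    def apply_change(turn: int) -> None:
        if turn >= breaks_at:
            state["broken"] = True  # a regression that is never repaired

    def run_tests(turn: int) -> bool:
        return not state["broken"]

    return apply_change, run_tests


apply_change, run_tests = make_toy_agent(breaks_at=6)
results = run_session(apply_change, run_tests, num_turns=10)
print(turns_before_first_failure(results))  # 5
```

The toy agent shows why long sessions are demanding: a single unrepaired regression at turn 6 fails every later turn, however well the remaining change requests are handled.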
In practice
- Integrate iterative test feedback into agent workflows.
- Prioritize robust agent harness development.
- Benchmark coding agents on multi-turn tasks.
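The first recommendation above can be sketched as a retry loop that hands failing test messages back to the agent before the next attempt. This is an illustrative assumption about how such a loop might look, not the harness design used in the paper: `solve_turn_with_feedback`, the toy agent, and the route-to-status-code representation of a server are all hypothetical.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical black-box expectations: route -> expected HTTP status code.
EXPECTED: Dict[str, int] = {"GET /items": 200, "POST /items": 201}


def toy_run_tests(candidate: Dict[str, int]) -> List[str]:
    """Return one failure message per route whose status code is wrong."""
    return [
        f"{route}: expected {want}, got {candidate.get(route)}"
        for route, want in EXPECTED.items()
        if candidate.get(route) != want
    ]


def toy_attempt(feedback: List[str]) -> Dict[str, int]:
    """Stand-in for an agent: starts with a bug, then repairs what feedback names."""
    candidate = {"GET /items": 200, "POST /items": 200}  # initial bug: POST returns 200
    for message in feedback:
        route, rest = message.split(": expected ")
        candidate[route] = int(rest.split(",")[0])
    return candidate


def solve_turn_with_feedback(
    attempt: Callable[[List[str]], Dict[str, int]],
    run_tests: Callable[[Dict[str, int]], List[str]],
    max_attempts: int = 3,
) -> Optional[int]:
    """Return the attempt number that passed, or None if every attempt failed."""
    feedback: List[str] = []
    for attempt_number in range(1, max_attempts + 1):
        candidate = attempt(feedback)
        failures = run_tests(candidate)
        if not failures:
            return attempt_number
        feedback = failures  # pass failing test output to the next attempt
    return None


print(solve_turn_with_feedback(toy_attempt, toy_run_tests))  # 2
```

With `max_attempts=1` the same toy agent never passes, which mirrors the no-feedback setting in miniature: the bug is fixable, but only once the agent is told which check failed.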
Topics
- StaminaBench
- Coding Agents
- LLM Evaluation
- Multi-turn Interaction
- Agent Harnesses
- REST API Development
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.