StaminaBench: Stress-Testing Coding Agents over 100 Interaction Turns

2026-05-05 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, long

Summary

StaminaBench is a benchmark that stress-tests the "stamina" of coding agents: how many consecutive interaction turns, or change requests, an agent can handle before it fails, up to a maximum of 100. Unlike traditional single-task evaluations, it simulates real-world iterative "vibe-coding" sessions. Agents implement and then repeatedly modify a REST API server. Changes and tests are procedurally generated, which makes runs reproducible and allows language-agnostic black-box evaluation. Experiments with six agent harnesses and seven open-source LLMs across 20 scenarios (100 turns each) revealed that all models failed within 5–6 turns without testing. Incorporating test feedback and retries improved performance by up to 12x. The study also highlighted the critical role of a robust agent harness, showing up to a 6x performance difference for stronger models. The benchmark and tasks are publicly released.

Key takeaway

AI engineers developing or deploying coding agents for iterative software development should prioritize robust test feedback loops and invest in high-quality agent harnesses. Without these, agents are likely to fail within a few turns, which argues for shifting from single-task evaluation to multi-turn "stamina" testing. That shift is crucial for building agents that can handle real-world, long-horizon coding tasks.

Key insights

Coding agents struggle with multi-turn iterative development, but test feedback and robust harnesses are critical for improving their "stamina."

Principles

"Vibe-coding" without testing produces bugs quickly.
Multi-turn coding demands that an agent keep its model of the codebase current as the code evolves.
Harness quality is a prerequisite for strong agent performance.

Method

StaminaBench evaluates agents by having them track an evolving reference system (REST API schema) through iterative modifications, with programmatic tests verifying correctness at each turn.
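
The stamina measurement described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual code: the `Schema` type, `generate_change`, `turns_survived`, and the `agent_step` callback are all hypothetical names, and a direct schema comparison stands in for the black-box tests that the real benchmark runs against the agent's server.

```python
import random
from typing import Callable, Dict, List

# Hypothetical stand-in for a REST API schema: endpoint path -> field names.
Schema = Dict[str, List[str]]


def generate_change(schema: Schema, rng: random.Random, turn: int) -> Schema:
    """Return a copy of the schema with one procedurally chosen modification."""
    new_schema = {path: list(fields) for path, fields in schema.items()}
    if new_schema and rng.random() < 0.5:
        # Add a field to an existing endpoint.
        path = rng.choice(sorted(new_schema))
        new_schema[path].append(f"field_{turn}")
    else:
        # Add a new endpoint.
        new_schema[f"/resource_{turn}"] = ["id"]
    return new_schema


def turns_survived(
    agent_step: Callable[[Schema, Schema], Schema],
    initial: Schema,
    max_turns: int = 100,
    seed: int = 0,
) -> int:
    """Count consecutive turns the agent completes before its first failure."""
    rng = random.Random(seed)  # fixed seed keeps the change sequence reproducible
    reference = initial
    implemented = {path: list(fields) for path, fields in initial.items()}
    for turn in range(1, max_turns + 1):
        reference = generate_change(reference, rng, turn)
        implemented = agent_step(implemented, reference)
        # Assumption: equality with the reference replaces real HTTP-level tests.
        if implemented != reference:
            return turn - 1
    return max_turns
```

Seeding the generator means every agent faces the same sequence of change requests, so the turn counts of different agents are directly comparable.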

In practice

Implement test feedback loops for coding agents.
Prioritize robust agent harness development.
Design benchmarks for long-horizon, multi-turn tasks.
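
As one way to picture the first recommendation, the sketch below wraps a single turn in a test feedback loop with a bounded number of retries. All names here (`SuiteReport`, `TurnResult`, `run_turn_with_feedback`, and the `propose` and `run_tests` callbacks) are assumptions made for illustration. The source does not describe how any particular harness implements feedback or retries.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SuiteReport:
    """Outcome of running the test suite against one candidate (hypothetical)."""
    passed: bool
    failures: List[str]


@dataclass
class TurnResult:
    code: Optional[str]  # accepted candidate, or None if every attempt failed
    attempts: int        # number of candidates that were proposed and tested


def run_turn_with_feedback(
    request: str,
    propose: Callable[[str, Optional[str]], str],
    run_tests: Callable[[str], SuiteReport],
    max_retries: int = 3,
) -> TurnResult:
    """Propose code for one change request, retrying with test failures as feedback."""
    feedback: Optional[str] = None
    for attempt in range(1, max_retries + 2):  # one initial try plus the retries
        candidate = propose(request, feedback)
        report = run_tests(candidate)
        if report.passed:
            return TurnResult(code=candidate, attempts=attempt)
        # Feed the failing test messages into the next proposal.
        feedback = "\n".join(report.failures)
    return TurnResult(code=None, attempts=max_retries + 1)
```

Setting `max_retries=0` reduces this to a single untested-feedback attempt per turn, which gives a convenient baseline for measuring how much the feedback loop itself contributes.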

Topics

Coding Agents
LLM Evaluation
Multi-turn Interaction
REST API Development
Software Benchmarking
Agent Harnesses

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.