StaminaBench: Stress-Testing Coding Agents over 100 Interaction Turns

2026-06-17 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

StaminaBench is a new benchmark that evaluates the "stamina" of coding agents: their performance over 100 consecutive interaction turns (change requests), a scenario termed "vibe-coding." This contrasts with traditional single-task metrics. Agents implement and then repeatedly modify a REST API server in response to procedurally generated requests, producing codebases of up to 6,000 lines. Testing is black-box and language-agnostic. Without test feedback, every tested combination of six agent harnesses and seven open-source LLMs failed within 5 to 6 turns. Providing test feedback improved passed-turn counts by up to 12x, and harness quality proved critical, accounting for up to a 6x performance gap for stronger models. The benchmark and generated tasks are publicly released.

Key takeaway

For ML engineers developing coding agents, this research shows that current agents struggle with sustained multi-turn interaction, failing within 5 to 6 turns without intervention. Prioritize integrating robust test feedback loops and invest in the agent harness itself: in the benchmark, test feedback raised passed-turn counts by up to 12x, and harness choice accounted for up to a 6x gap for stronger models. Both are prerequisites for real-world "vibe-coding" scenarios that demand dozens or hundreds of turns.

Key insights

Coding agents currently lack "stamina" for multi-turn interactions, but feedback and strong harnesses significantly improve performance.

Principles

"Vibe-coding" without testing introduces bugs quickly.
Test feedback dramatically extends agent longevity.
Agent harness quality is critical for strong performance.

Method

StaminaBench evaluates coding agents by having them modify a REST API server over 100 procedurally generated change requests in an isolated, black-box, language-agnostic environment.
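
As a rough illustration of the evaluation loop described above, here is a minimal Python sketch. It is not the benchmark's actual code: the names run_turns, apply_change, check_server, passed_turns, and first_failure are hypothetical, and the choice to report both a passed-turn count and the first failing turn is an assumption. The check is passed in as a callable so that it can talk to the server only over HTTP, which is what keeps the test black-box and independent of the agent's implementation language.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class TurnResult:
    turn: int      # 1-based index of the change request
    passed: bool   # True if the server passed the checks after this turn


def run_turns(
    requests: List[str],
    apply_change: Callable[[str], None],   # hypothetical: hands one request to the agent
    check_server: Callable[[int], bool],   # hypothetical: black-box check of the running server
) -> List[TurnResult]:
    """Feed change requests to the agent one at a time, checking after each."""
    results = []
    for turn, request in enumerate(requests, start=1):
        apply_change(request)
        results.append(TurnResult(turn, check_server(turn)))
    return results


def passed_turns(results: List[TurnResult]) -> int:
    """Number of turns after which the checks passed."""
    return sum(r.passed for r in results)


def first_failure(results: List[TurnResult]) -> Optional[int]:
    """Turn number of the first failed check, or None if every turn passed."""
    for r in results:
        if not r.passed:
            return r.turn
    return None
```

In a real setup, check_server would send HTTP requests to the agent's server and compare the responses with expected ones; here it is left abstract so the sketch stays self-contained.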

In practice

Integrate iterative test feedback into agent workflows.
Prioritize robust agent harness development.
Benchmark coding agents on multi-turn tasks.
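
The first recommendation can be sketched as a simple retry loop. This is an illustrative assumption about what "test feedback" might look like in a workflow, not the method used in the benchmark: solve_with_feedback, agent_step, run_tests, and max_attempts are hypothetical names, and the prompt format is invented for the example.

```python
from typing import Callable, List


def solve_with_feedback(
    request: str,
    agent_step: Callable[[str], None],   # hypothetical: lets the agent edit the codebase
    run_tests: Callable[[], List[str]],  # hypothetical: returns messages for failing tests
    max_attempts: int = 3,
) -> bool:
    """Let the agent retry a change request, showing it the failing tests each time."""
    prompt = request
    for _ in range(max_attempts):
        agent_step(prompt)
        failures = run_tests()
        if not failures:
            return True  # all tests pass; this turn is done
        # Append the failure messages so the next attempt can react to them.
        prompt = request + "\n\nFailing tests:\n" + "\n".join(failures)
    return False  # still failing after max_attempts
```

Capping the number of attempts keeps the cost per turn bounded; the right cap depends on the model and the test suite and would need tuning.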

Topics

StaminaBench
Coding Agents
LLM Evaluation
Multi-turn Interaction
Agent Harnesses
REST API Development

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.