SWE-Together: Evaluating Coding Agents in Interactive User Sessions
Summary
SWE-Together is a benchmark for evaluating coding agents in interactive, multi-turn user sessions, something static, single-turn benchmarks do not capture. It is reconstructed from 11,260 real user-agent coding sessions, from which 109 verifiable repository-level tasks were curated. A reactive LLM-based user simulator preserves the original user's intents and gives conditional feedback, so the same interaction can be replayed consistently across different agents. Evaluation covers both final repository correctness and a "User Correction" metric, which quantifies how much corrective steering an agent required. In experiments with seven frontier coding agents, including Claude Opus 4.8 and GPT-5.5, stronger agents generally achieved higher final success rates while requiring fewer user interventions; Claude Opus 4.8, for example, reached 63% pass@1 and a 0.801 mean judge score with a User Correction of 1.38. The simulator's fidelity was checked with a Turing-style test: its 46% pass rate indicates that human annotators could not reliably distinguish simulated users from real ones.
Key takeaway
AI engineers evaluating or deploying coding agents should prioritize benchmarks that reflect real-world interactive workflows. Favor agents that reach high final task correctness with little corrective feedback from the user, since fewer corrections mean less user effort. Evaluation should go beyond static, single-turn tests and also measure the quality of multi-turn interaction.
Key insights
Interactive, multi-turn evaluation with user feedback is crucial for assessing real-world coding agent capability and user experience.
Principles
- Real-world coding assistance is inherently interactive.
- Agent capability correlates inversely with user corrective feedback.
- Evaluation must account for interaction quality and user effort.
Method
SWE-Together constructs tasks from real user-agent sessions, filters for reproducibility, and uses an anchored, state-conditional LLM user simulator to replay interactions, measuring final correctness and user-elicited steering.
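The replay loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the benchmark's implementation: `replay_session`, `SessionResult`, and the three callables are hypothetical names, the repository is reduced to a string, and "User Correction" is assumed here to be a simple count of corrective messages the simulated user sends in a session.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SessionResult:
    passed: bool       # did the final repository state pass verification?
    corrections: int   # corrective messages the simulated user had to send
    turns: int         # agent turns consumed


def replay_session(
    agent_step: Callable[[str, str], str],          # (repo_state, message) -> new repo_state
    user_feedback: Callable[[str], Optional[str]],  # repo_state -> correction, or None if satisfied
    check_repo: Callable[[str], bool],              # final verification of the repository
    initial_request: str,
    max_turns: int = 8,
) -> SessionResult:
    """Replay one session: the simulated user reacts to the agent's current state."""
    state, message = "", initial_request
    corrections = turns = 0
    for turns in range(1, max_turns + 1):
        state = agent_step(state, message)
        feedback = user_feedback(state)  # state-conditional: depends on what the agent did
        if feedback is None:             # user intent satisfied, stop steering
            break
        corrections += 1
        message = feedback
    return SessionResult(check_repo(state), corrections, turns)


# Toy stand-ins (hypothetical): the "agent" just records what it was asked to do,
# and the "user" wants tests in addition to the initial request.
def toy_agent(state: str, message: str) -> str:
    return state + message + ";"


def toy_user(state: str) -> Optional[str]:
    return None if "tests" in state else "add tests"


def toy_check(state: str) -> bool:
    return "parser" in state and "tests" in state


result = replay_session(toy_agent, toy_user, toy_check, "implement parser")
print(result)
```

Because the simulated user only speaks when the current state calls for it, an agent that satisfies the intent earlier accumulates fewer corrections on the same task.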
In practice
- Evaluate agents using multi-turn, interactive benchmarks.
- Prioritize agents requiring minimal user corrective feedback.
- Ground user simulators in real human interaction data.
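One way to apply the first two points is to report success and correction load side by side for each agent. The sketch below is illustrative only: the agent names and session outcomes are made-up toy values, `summarize` and `rank_agents` are hypothetical helpers, and pass@1 is taken to be the fraction of tasks passed with a single attempt per task.

```python
from typing import Dict, List, Tuple

Session = Tuple[bool, int]  # (final repo passed?, number of user corrections)


def summarize(sessions: List[Session]) -> Dict[str, float]:
    """Aggregate per-session outcomes into pass@1 and mean corrections."""
    n = len(sessions)
    return {
        "pass_at_1": sum(1 for passed, _ in sessions if passed) / n,
        "mean_corrections": sum(c for _, c in sessions) / n,
    }


def rank_agents(results: Dict[str, List[Session]]) -> List[str]:
    """Order agents by higher pass@1 first, then by fewer mean corrections."""
    stats = {name: summarize(s) for name, s in results.items()}
    return sorted(
        stats,
        key=lambda a: (-stats[a]["pass_at_1"], stats[a]["mean_corrections"]),
    )


# Toy data, not results from the paper.
toy_results = {
    "agent_a": [(True, 1), (True, 0), (False, 3), (True, 0)],
    "agent_b": [(True, 2), (False, 4), (False, 3), (True, 1)],
}

summary_a = summarize(toy_results["agent_a"])
ranking = rank_agents(toy_results)
print(summary_a, ranking)
```

Sorting on success first and corrections second is a design choice for this sketch; a team that weighs user effort more heavily could swap the key order or combine the two into a single score.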
Topics
- Coding Agents
- LLM Evaluation
- Interactive Benchmarking
- User Simulation
- Software Engineering
- Agent Performance Metrics
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.