SWE-Together: Evaluating Coding Agents in Interactive User Sessions

Source: cs.SE updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering) · Depth: Expert, extended

Summary

SWE-Together is a benchmark for evaluating coding agents in interactive, multi-turn user sessions, addressing the limitations of traditional static benchmarks. It is reconstructed from 11,260 real user-agent coding sessions, from which 109 verifiable repository-level tasks were curated. The benchmark uses a reactive LLM-based user simulator that preserves the original user intents and gives conditional feedback, so the same interaction can be replayed consistently across different agents. Evaluation covers both final repository correctness and a "User Correction" metric, which quantifies how much corrective steering an agent required. In experiments with seven frontier coding agents, including Claude Opus 4.8 and GPT-5.5, stronger agents generally achieved higher final success rates (Claude Opus 4.8: 63% pass@1 and a mean judge score of 0.801) while requiring fewer user interventions (Claude Opus 4.8: 1.38 User Correction). A 46% Turing pass rate supports the simulator's fidelity: human annotators could not reliably distinguish simulated users from real ones.
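To make the reported numbers concrete, the following is a minimal sketch of how per-session results might be aggregated into the three headline figures. The `SessionResult` record, the `summarize` helper, the 0-to-1 judge scale, and the reading of User Correction as the mean number of corrective user turns per session are all assumptions made for illustration; the summary does not give the benchmark's exact definitions.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class SessionResult:
    """One agent run on one task (hypothetical record layout)."""
    task_id: str
    passed: bool        # final repository state passed verification
    judge_score: float  # judge score, assumed to lie in [0, 1]
    corrections: int    # corrective user turns (assumed definition)


def summarize(results):
    """Aggregate per-session results into benchmark-level figures."""
    if not results:
        raise ValueError("no results to summarize")
    return {
        # With one attempt per task, pass@1 is the fraction of tasks passed.
        "pass_at_1": mean(1.0 if r.passed else 0.0 for r in results),
        "mean_judge_score": mean(r.judge_score for r in results),
        # Assumed reading of "User Correction": mean corrective turns per session.
        "user_correction": mean(r.corrections for r in results),
    }


# Toy data, not results from the paper.
results = [
    SessionResult("t1", True, 0.9, 1),
    SessionResult("t2", False, 0.4, 3),
    SessionResult("t3", True, 0.8, 0),
    SessionResult("t4", True, 0.7, 2),
]
summary = summarize(results)
```

Under this reading, a stronger agent shows up as a higher `pass_at_1` together with a lower `user_correction`, which is the pattern the summary reports for the frontier agents.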

Key takeaway

AI engineers evaluating or deploying coding agents should prioritize benchmarks that reflect real-world interactive workflows. Favor agents that reach high final task correctness with little corrective feedback from the user, since this combination directly indicates a better user experience. An evaluation strategy should move beyond static, single-turn tests and measure multi-turn interaction quality, so that the chosen agents are collaborative and efficient in practice.

Key insights

Interactive, multi-turn evaluation with user feedback is crucial for assessing real-world coding agent capability and user experience.

Method

SWE-Together constructs tasks from real user-agent sessions, filters for reproducibility, and uses an anchored, state-conditional LLM user simulator to replay interactions, measuring final correctness and user-elicited steering.
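The replay idea can be sketched as a small loop: the simulated user states the anchored intent, inspects the repository state after each agent turn, and speaks again only when that state calls for a correction. Everything in this sketch is a hypothetical stand-in: a rule-based function replaces the LLM user simulator, a set of file names replaces a real repository and verifier, and `Task`, `RepoState`, `simulated_user`, and `run_session` are illustrative names, not the benchmark's API.

```python
from dataclasses import dataclass, field

PREFIX = "Still missing: "


@dataclass
class Task:
    intent: str                # original user intent, kept fixed across agents
    required_files: frozenset  # toy stand-in for a real verifier


@dataclass
class RepoState:
    files: set = field(default_factory=set)


def simulated_user(task, state):
    """State-conditional feedback: reply only if the state needs correcting."""
    missing = sorted(task.required_files - state.files)
    return PREFIX + missing[0] if missing else None


def run_session(task, agent, max_corrections=5):
    """Replay one task against an agent; return (passed, corrections)."""
    state = RepoState()
    agent(task.intent, state)  # first turn: the anchored intent
    corrections = 0
    while corrections < max_corrections:
        feedback = simulated_user(task, state)
        if feedback is None:   # nothing to correct, so the user stays silent
            break
        corrections += 1       # count only feedback that is actually delivered
        agent(feedback, state)
    return task.required_files <= state.files, corrections


def lazy_agent(message, state):
    """Toy agent: does the minimum, then fixes exactly what it is told."""
    if message.startswith(PREFIX):
        state.files.add(message[len(PREFIX):])
    else:
        state.files.add("app.py")


def thorough_agent(message, state):
    """Toy agent: writes the code and its test in the first turn."""
    state.files.update({"app.py", "test_app.py"})


task = Task("Add an app with a test", frozenset({"app.py", "test_app.py"}))
lazy_result = run_session(task, lazy_agent)
thorough_result = run_session(task, thorough_agent)
```

Both toy agents end with a passing repository, but the lazy one needs a corrective turn to get there. This is the distinction the steering measure is meant to capture and that final correctness alone would miss.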

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.