SWE-Explore: Benchmarking How Coding Agents Explore Repositories
Summary
SWE-Explore is a new benchmark that isolates and evaluates the repository exploration capabilities of coding agents, a fine-grained skill that holistic pass/fail metrics in existing benchmarks such as SWE-bench tend to obscure. Given an issue and a repository, an explorer must return a ranked list of relevant code regions within a fixed line budget. The benchmark covers 848 issues across 10 programming languages and 203 open-source repositories, with ground truth derived from successful agent trajectories. Evaluation measures coverage, ranking, and context efficiency, and these metrics are shown to strongly predict downstream repair success. The findings indicate that agentic explorers significantly outperform classical retrieval methods and excel at file-level localization, but often remain recall-limited at the line level.
Key takeaway
For AI scientists developing coding agents, it is crucial to treat repository exploration as a distinct capability. This benchmark shows that agents excel at file-level localization but often struggle with line-level recall. Prioritize your agents' ability to surface precise, relevant code spans early in their ranked output: missing critical context hurts repair success more than a moderate amount of irrelevant context does. Improving line-level coverage and context efficiency is the path to more robust coding agents.
Key insights
SWE-Explore benchmarks coding agents' line-level repository exploration, isolating it from end-to-end repair outcomes.
Principles
- Repository exploration is a distinct agent capability.
- Line-level context metrics predict repair success.
- Missing core evidence harms more than redundancy.
Method
Explorers return ranked code regions for an issue and repository. These are scored against trajectory-derived ground truth using coverage, ranking, and context-efficiency metrics.
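To make the scoring idea concrete, here is a minimal sketch of line-level coverage and context efficiency under a fixed line budget. The summary does not give the benchmark's exact metric definitions, so everything here is an assumption for illustration: the `Region` type, the function names, the rule that the budget is spent on unique lines in rank order, and the two ratios (gold lines recovered over all gold lines, and gold lines recovered over lines returned).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical explorer output: a span of lines in one file."""
    path: str
    start: int  # 1-indexed, inclusive
    end: int    # inclusive


def lines_within_budget(ranked, budget):
    """Walk regions in rank order, keeping unique lines until the budget is spent."""
    seen = set()
    kept = []
    for region in ranked:
        for number in range(region.start, region.end + 1):
            line = (region.path, number)
            if line in seen:
                continue  # overlapping regions do not cost budget twice
            if len(kept) >= budget:
                return kept
            seen.add(line)
            kept.append(line)
    return kept


def coverage_and_efficiency(ranked, gold, budget):
    """Return (coverage, efficiency) under the assumed definitions above."""
    kept = lines_within_budget(ranked, budget)
    hits = sum(1 for line in kept if line in gold)
    coverage = hits / len(gold) if gold else 0.0
    efficiency = hits / len(kept) if kept else 0.0
    return coverage, efficiency


# Toy ground truth: five lines in a.py and two lines in b.py.
gold = {("a.py", n) for n in range(10, 15)} | {("b.py", 3), ("b.py", 4)}
ranked = [Region("a.py", 8, 12), Region("c.py", 1, 5), Region("b.py", 3, 4)]

print(coverage_and_efficiency(ranked, gold, budget=10))
print(coverage_and_efficiency(ranked, gold, budget=12))
```

In this toy case the irrelevant `c.py` region is ranked second and uses up half of a 10-line budget, so the relevant `b.py` lines are cut off. That is the kind of recall loss a budgeted, ranked evaluation is meant to expose.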
In practice
- Prioritize line-level recall in agent exploration.
- Track nDCG@500 and Context Efficiency when evaluating exploration.
- Favor including core evidence over strict precision.
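As a rough illustration of why ranking position matters, the sketch below computes a standard binary-relevance nDCG@k over a flat, ranked list of lines. The summary names nDCG@500 but does not define it, so this is an assumption: each line is treated as either relevant or not, `k` counts lines, and the function name `ndcg_at_k` is hypothetical. The benchmark's actual formulation may differ.

```python
import math


def ndcg_at_k(ranked_lines, gold, k):
    """Binary-relevance nDCG@k over a ranked list of (path, line_number) pairs."""
    # Drop repeated lines while preserving rank order, then truncate to k.
    top = list(dict.fromkeys(ranked_lines))[:k]
    # Discounted gain: a hit at 0-indexed rank i contributes 1 / log2(i + 2).
    dcg = sum(1.0 / math.log2(i + 2) for i, line in enumerate(top) if line in gold)
    # Ideal ordering places every gold line first, capped at k positions.
    ideal_hits = min(len(gold), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


gold = {("a.py", 1), ("a.py", 2)}
early = [("a.py", 1), ("a.py", 2), ("x.py", 9)]  # gold lines ranked first
late = [("a.py", 1), ("x.py", 9), ("a.py", 2)]   # one gold line pushed down

print(ndcg_at_k(early, gold, k=3))
print(ndcg_at_k(late, gold, k=3))
```

Both rankings contain the same lines, so their recall is identical. Only the ordering differs, and the version that surfaces core evidence earlier scores higher.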
Topics
- SWE-Explore
- Coding Agents
- Repository Exploration
- Code Localization
- LLM Benchmarking
- Context Retrieval
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.