SWE-fficiency: Can Language Models Optimize Real-World Repositories on Real Workloads?
Summary
The SWE-fficiency benchmark evaluates the ability of language models (LMs) to optimize real-world software repositories for performance on actual workloads. The suite comprises 498 tasks derived from GitHub pull requests across nine Python data science, machine learning, and HPC repositories such as numpy and pandas. Each task requires an LM agent to identify performance bottlenecks, localize the relevant tests, and generate a code patch that improves runtime while preserving correctness. Empirical evaluation of leading LMs, including GPT-5 and Claude 4.1 Opus, reveals significant underperformance: agents achieve less than 0.15x the expert speedup on average. LMs frequently introduce correctness bugs, struggle to localize the functions worth optimizing (missing over 68% of expert gains), and favor superficial "shortcut" optimizations over deeper algorithmic improvements. The benchmark and its data pipeline are open-sourced to foster research in automated performance engineering.
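To make the "0.15x the expert speedup" figure concrete, here is a minimal sketch of how a per-task ratio of agent speedup to expert speedup could be computed. The function names, the rule that a test-breaking patch counts as no speedup, and the harmonic-mean aggregation are assumptions made for illustration; this summary does not state the benchmark's exact scoring formula.

```python
from statistics import harmonic_mean


def speedup(baseline_s: float, patched_s: float) -> float:
    """Speedup of a patch: baseline runtime divided by patched runtime."""
    return baseline_s / patched_s


def speedup_ratio(baseline_s: float, agent_s: float, expert_s: float,
                  agent_passes_tests: bool = True) -> float:
    """Agent speedup relative to expert speedup for a single task.

    Assumption: a patch that breaks the tests is scored as if nothing
    had changed, i.e. with a speedup of 1.0.
    """
    agent = speedup(baseline_s, agent_s) if agent_passes_tests else 1.0
    return agent / speedup(baseline_s, expert_s)


def aggregate(ratios: list[float]) -> float:
    """Assumption: aggregate per-task ratios with a harmonic mean."""
    return harmonic_mean(ratios)
```

Under these assumptions, a task where the expert reaches a 5x speedup and the agent only 2x yields a ratio of 0.4, and an agent patch that is fast but fails its tests is scored no better than the unmodified code.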
Key takeaway
AI engineers developing autonomous software engineering agents should recognize that current language models are far from expert-level at performance optimization. Agents must improve significantly at localizing bottlenecks across functions, reasoning about execution flow, and ensuring patch correctness. Prioritize training models to make principled algorithmic changes rather than settling for superficial "easy wins." Consider integrating profiling tools and "don't-stop-early" mechanisms into agentic workflows to push for deeper, more impactful optimizations.
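As one way to picture the profiling step suggested above, the sketch below uses Python's standard `cProfile` and `pstats` modules to rank the functions exercised by a workload by cumulative time. The helper name `top_hotspots` and its parameters are hypothetical, and this is not the tooling used by the benchmark authors.

```python
import cProfile
import pstats
from typing import Callable, List, Tuple


def top_hotspots(workload: Callable[[], object], limit: int = 5) -> List[Tuple[str, float]]:
    """Profile a zero-argument workload and return (function name, cumulative seconds)."""
    profiler = cProfile.Profile()
    profiler.enable()
    workload()
    profiler.disable()

    stats = pstats.Stats(profiler)
    rows = []
    # stats.stats maps (file, line, function) to (primitive calls, total calls,
    # own time, cumulative time, callers).
    for (_file, _line, func_name), (_cc, _nc, _tt, cumulative, _callers) in stats.stats.items():
        rows.append((func_name, cumulative))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:limit]
```

An agent could feed a ranking like this into its localization step, so that edits target the functions that dominate runtime rather than whichever one it happens to read first.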
Key insights
Current Language Models significantly underperform experts in real-world software performance optimization, often introducing bugs.
Principles
- Performance optimization requires deep code reasoning and test localization.
- LM agents tend to "satisfice," stopping after minimal speedups.
- Systemic algorithmic rewrites yield more robust gains than localized shortcuts.
Method
SWE-fficiency uses a pipeline to scrape GitHub PRs, applying keyword filtering, static analysis, code coverage, and execution validation to create performance optimization tasks.
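The staged filtering described above can be sketched as a chain of predicates applied to scraped pull requests. Only the keyword stage is implemented here. The `PullRequest` record, the keyword pattern, and the stage names are hypothetical, and the later stages (static analysis, code coverage, execution validation) would be plugged in as further predicates.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical keyword pattern; the benchmark's actual keyword list is not given here.
PERF_PATTERN = re.compile(
    r"\b(perf|performance|speed ?up|faster|optimi[sz]\w*)\b", re.IGNORECASE
)


@dataclass
class PullRequest:
    number: int
    title: str
    body: str


def mentions_performance(pr: PullRequest) -> bool:
    """Keyword stage: keep PRs whose title or body talks about performance."""
    return bool(PERF_PATTERN.search(f"{pr.title}\n{pr.body}"))


Stage = Tuple[str, Callable[[PullRequest], bool]]


def run_pipeline(prs: List[PullRequest], stages: List[Stage]):
    """Apply each stage in order and record how many PRs survive it."""
    survivors = list(prs)
    counts: Dict[str, int] = {"scraped": len(survivors)}
    for name, keep in stages:
        survivors = [pr for pr in survivors if keep(pr)]
        counts[name] = len(survivors)
    return survivors, counts
```

Recording the survivor count after every stage makes it easy to see which filter discards the most candidates.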
In practice
- Use SWE-fficiency to benchmark and improve LM agent performance.
- Focus LM training on multi-function reasoning and correctness preservation.
- Develop "don't-stop-early" triggers for agents to pursue deeper optimizations.
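A "don't-stop-early" trigger could take many forms. The sketch below shows one hypothetical version: the agent keeps trying patches until several consecutive attempts fail to beat the best correct runtime by a meaningful margin, instead of stopping at the first small win. The function name, the 5% threshold, and the patience value are illustrative assumptions, not settings from the paper.

```python
from typing import Iterable, Optional, Tuple


def keep_optimizing(
    baseline_s: float,
    attempts: Iterable[Tuple[str, float, bool]],
    min_relative_gain: float = 0.05,
    patience: int = 2,
) -> Tuple[Optional[str], float]:
    """Pick the best correct patch, stopping only after `patience` stale attempts.

    `attempts` yields (patch_id, runtime_seconds, tests_pass) tuples.
    """
    best_id: Optional[str] = None
    best_s = baseline_s
    stale = 0
    for patch_id, runtime_s, tests_pass in attempts:
        # A patch counts only if it is correct and clearly faster than the best so far.
        improved = tests_pass and runtime_s < best_s * (1 - min_relative_gain)
        if improved:
            best_id, best_s = patch_id, runtime_s
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_id, best_s
```

Because `attempts` can be a generator, the agent only pays for additional patches while the loop is still asking for them, and incorrect patches never become the selected result regardless of their runtime.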
Topics
- Language Models
- Software Performance Optimization
- Code Generation Benchmarks
- Automated Software Engineering
- Code Reasoning
- Python Repositories
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.