SWE-fficiency: Can Language Models Optimize Real-World Repositories on Real Workloads?

2026-06-30 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Data Science & Analytics · Depth: Expert, extended

Summary

The SWE-fficiency benchmark evaluates how well language models (LMs) can optimize real-world software repositories for performance on actual workloads. The suite comprises 498 tasks derived from GitHub pull requests across nine Python data science, machine learning, and HPC repositories, such as numpy and pandas. Each task requires an LM agent to identify performance bottlenecks, localize relevant tests, and generate a code patch that improves runtime while preserving correctness. Empirical evaluation of leading LMs, including GPT-5 and Claude 4.1 Opus, reveals significant underperformance: on average, agents achieve less than 0.15x the expert speedup. LMs frequently introduce correctness bugs, struggle to localize the functions that matter most (missing over 68% of expert gains), and favor superficial "shortcut" optimizations over deeper algorithmic improvements. The benchmark and its data pipeline are open-sourced to foster research in automated performance engineering.
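To make the "0.15x the expert speedup" figure concrete, here is a minimal sketch of how a speedup ratio relative to an expert patch could be computed. The function names are hypothetical, and two choices are assumptions not confirmed by this summary: a patch that fails its tests is scored as no speedup (1.0x), and per-task ratios are aggregated with a harmonic mean.

```python
from statistics import harmonic_mean


def speedup(baseline_seconds, patched_seconds):
    """Speedup of a patch: how many times faster the workload runs after it."""
    if baseline_seconds <= 0 or patched_seconds <= 0:
        raise ValueError("runtimes must be positive")
    return baseline_seconds / patched_seconds


def speedup_ratio(baseline_seconds, agent_seconds, expert_seconds, tests_pass):
    """Agent speedup divided by expert speedup for one task.

    Assumption: a patch that breaks the tests counts as no speedup (1.0x).
    """
    agent = speedup(baseline_seconds, agent_seconds) if tests_pass else 1.0
    expert = speedup(baseline_seconds, expert_seconds)
    return agent / expert


def aggregate(ratios):
    """Assumption: combine per-task ratios with a harmonic mean."""
    return harmonic_mean(ratios)
```

For example, if a workload takes 10 s before any patch, 5 s with the agent's patch, and 2 s with the expert's patch, the agent reaches a 2x speedup against the expert's 5x, which gives a ratio of 0.4.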

Key takeaway

If you are an AI engineer developing autonomous software engineering agents, recognize that current language models are far from expert-level at performance optimization. Your agents must improve significantly at localizing bottlenecks across functions, reasoning about execution flow, and ensuring patch correctness. Prioritize training models to make principled algorithmic changes rather than settling for superficial "easy wins." Consider integrating profiling tools and "don't-stop-early" mechanisms into your agentic workflows to push for deeper, more impactful optimizations.
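As one illustration of the profiling suggestion, the sketch below uses Python's standard cProfile and pstats modules to rank functions by cumulative time for a given workload. The helper name top_hotspots is hypothetical, and this is not the benchmark's own tooling; it only shows the kind of signal an agent could consult before choosing what to edit. It assumes Python 3.9 or later for get_stats_profile.

```python
import cProfile
import io
import pstats


def top_hotspots(workload, limit=5):
    """Run `workload()` under cProfile and return (function name, cumulative seconds)
    pairs, slowest first. Hypothetical helper for illustration only."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        workload()
    finally:
        profiler.disable()

    # Send any pstats output to an in-memory buffer instead of stdout.
    stats = pstats.Stats(profiler, stream=io.StringIO())
    profile = stats.get_stats_profile()  # available in Python 3.9+

    rows = [(name, fp.cumtime) for name, fp in profile.func_profiles.items()]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:limit]
```

An agent loop could call top_hotspots on the task's workload and direct its edits at the highest-ranked functions instead of guessing from source text alone.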

Key insights

Current Language Models significantly underperform experts in real-world software performance optimization, often introducing bugs.

Principles

Performance optimization requires deep code reasoning and test localization.
LM agents tend to "satisfice," stopping after minimal speedups.
Systemic algorithmic rewrites yield more robust gains than localized shortcuts.

Method

The SWE-fficiency pipeline scrapes GitHub PRs and applies keyword filtering, static analysis, code coverage, and execution validation to turn them into performance optimization tasks.
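The first stage of such a pipeline, keyword filtering, can be sketched as follows. The keyword list, the PullRequest fields, and the function names are all hypothetical and chosen for illustration; the actual filters used by SWE-fficiency are not described in this summary.

```python
import re
from dataclasses import dataclass

# Hypothetical keywords suggesting that a PR targets performance.
PERF_KEYWORDS = ("perf", "speed up", "speedup", "faster", "optimiz", "latency")

_PERF_PATTERN = re.compile(
    "|".join(re.escape(keyword) for keyword in PERF_KEYWORDS),
    re.IGNORECASE,
)


@dataclass
class PullRequest:
    """Minimal stand-in for scraped PR metadata (hypothetical fields)."""
    number: int
    title: str
    body: str


def looks_performance_related(pr):
    """True if the PR title or body mentions any performance keyword."""
    return bool(_PERF_PATTERN.search(f"{pr.title}\n{pr.body}"))


def keyword_filter(prs):
    """Keep only PRs that pass the keyword check, preserving input order."""
    return [pr for pr in prs if looks_performance_related(pr)]
```

A filter this coarse will admit false positives, which is presumably why the later stages (static analysis, code coverage, and execution validation) are needed to confirm that a PR yields a measurable, testable speedup.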

In practice

Use SWE-fficiency to benchmark and improve LM agent performance.
Focus LM training on multi-function reasoning and correctness preservation.
Develop "don't-stop-early" triggers for agents to pursue deeper optimizations.
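A "don't-stop-early" trigger can be as simple as a check the agent runs before declaring a task finished. The sketch below is a hypothetical policy, not something specified by the benchmark: the function name, the parameters, and the idea of a fixed target speedup and step budget are all assumptions.

```python
def should_keep_optimizing(measured_speedup, target_speedup,
                           steps_used, step_budget, tests_pass):
    """Decide whether an agent should continue instead of submitting its patch.

    Hypothetical policy:
    - stop once the step budget is exhausted;
    - otherwise keep going while the tests fail (the patch is not yet valid);
    - otherwise keep going until the measured speedup reaches the target.
    """
    if steps_used >= step_budget:
        return False
    if not tests_pass:
        return True
    return measured_speedup < target_speedup
```

With a target of 2.0x, an agent that has measured a 1.2x speedup with passing tests and has budget left would be told to continue, which counters the tendency to stop after the first small win.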

Topics

Language Models
Software Performance Optimization
Code Generation Benchmarks
Automated Software Engineering
Code Reasoning
Python Repositories

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.