SmellBench: Towards Fine-Grained Evaluation of Code Agents on Refactoring Tasks
Summary
SmellBench is a new, extensible code refactoring benchmark that evaluates code agents on long-term maintainability rather than functional correctness alone. It proactively injects 7 popular code smell types into 294 clean code snippets from 7 real-world Python repositories, with 3 difficulty levels and 2 instruction settings. The benchmark provides human-written ground truth and a 3-dimensional evaluation framework covering functional correctness (test pass rate), localization ability, and LLM-based assessment of refactoring quality. Experiments cover 2 open-source code agents (OpenHands, Qwen Code) and 6 large language models (LLMs): Qwen3-Coder-30B-A3B-Instruct, Qwen3-Coder-480B-A35B-Instruct, DeepSeek-V3.2, GPT-5-Mini, Gemini-2.5-Flash, and Claude Sonnet 4.5. The best combination (Qwen Code + Claude Sonnet 4.5) achieved a smell elimination score of only 50.34, which points to significant limitations in cross-file understanding and comprehensive smell elimination.
Key takeaway
If you develop code agents as an AI scientist or machine learning engineer, prioritize cross-file reasoning and architectural understanding. Even the best-performing combination, Qwen Code + Claude Sonnet 4.5, reaches a smell elimination score of only 50.34, which leaves a significant gap on complex refactoring tasks. Improving localization and coordinated modifications across multiple files is what moves agents beyond basic functional correctness toward long-term software maintainability.
Key insights
Code agents struggle with complex, multi-file code refactoring, indicating a gap in repository-level reasoning beyond functional correctness.
Principles
- Functional correctness alone is insufficient for refactoring evaluation.
- Cross-file coordination is a major challenge for current LLMs.
- Proactive smell injection creates scalable, high-quality benchmarks.
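To make the first two principles concrete, here is a minimal sketch of a result record that keeps the three evaluation dimensions separate. The field names, the file-level F1 used for localization, and the 0 to 100 judge scale are assumptions made for illustration; the summary does not specify how SmellBench computes these scores.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class EvalResult:
    """Hypothetical per-case record; not SmellBench's actual schema."""
    tests_passed: int
    tests_total: int
    edited_files: Set[str]        # files the agent modified
    ground_truth_files: Set[str]  # files changed in the reference refactoring
    judge_score: float            # assumed 0-100 score from an LLM judge

    @property
    def pass_rate(self) -> float:
        # Functional correctness: share of tests that still pass.
        return self.tests_passed / self.tests_total if self.tests_total else 0.0

    @property
    def localization_f1(self) -> float:
        # Assumed metric: F1 overlap between edited and reference files.
        hit = len(self.edited_files & self.ground_truth_files)
        if hit == 0:
            return 0.0
        precision = hit / len(self.edited_files)
        recall = hit / len(self.ground_truth_files)
        return 2 * precision * recall / (precision + recall)


# All tests pass, yet the agent touched only one of the two relevant files.
result = EvalResult(
    tests_passed=10,
    tests_total=10,
    edited_files={"a.py"},
    ground_truth_files={"a.py", "b.py"},
    judge_score=40.0,
)
print(result.pass_rate, round(result.localization_f1, 3), result.judge_score)
```

In this made-up case the pass rate is perfect while localization and judged quality are not, which is the situation a correctness-only evaluation would miss.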
Method
SmellBench constructs refactoring cases by selecting real-world repositories, identifying injection locations, introducing 7 types of code smells into clean code, and validating functional correctness with test suites.
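The inject-then-validate step can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: the snippets, the "duplicated code" smell, and the names RefactoringCase, passes_tests, and build_case are all hypothetical, and the summary does not list the seven smell types.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# Hypothetical clean snippet (serves as the ground truth).
CLEAN = '''
def area(w, h):
    return w * h

def total_area(rooms):
    return sum(area(w, h) for w, h in rooms)
'''

# Hypothetical smell-injected variant: the area logic is duplicated inline.
SMELLY = '''
def area(w, h):
    return w * h

def total_area(rooms):
    total = 0
    for w, h in rooms:
        total += w * h  # duplicates area() instead of calling it
    return total
'''


@dataclass
class RefactoringCase:
    smell_type: str
    clean_code: str   # reference solution
    smelly_code: str  # what the agent is asked to refactor


def passes_tests(source: str, tests: Callable[[Dict[str, object]], None]) -> bool:
    """Load the snippet into a fresh namespace and run the test function."""
    namespace: Dict[str, object] = {}
    try:
        exec(source, namespace)
        tests(namespace)
        return True
    except Exception:
        return False


def check_total_area(ns: Dict[str, object]) -> None:
    assert ns["total_area"]([(2, 3), (1, 4)]) == 10
    assert ns["total_area"]([]) == 0


def build_case(smell_type: str, clean: str, smelly: str,
               tests: Callable[[Dict[str, object]], None]) -> Optional[RefactoringCase]:
    # Keep the case only if injection preserved behavior on both versions.
    if passes_tests(clean, tests) and passes_tests(smelly, tests):
        return RefactoringCase(smell_type, clean, smelly)
    return None


case = build_case("duplicated code", CLEAN, SMELLY, check_total_area)
print(case is not None)
```

Running the same tests against both versions is what makes the injected smell a maintainability problem rather than a bug: behavior is unchanged, only the structure gets worse.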
In practice
- Use LLM-as-Judge for nuanced refactoring quality assessment.
- Prioritize multi-file reasoning in code agent development.
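As a rough sketch of the LLM-as-Judge idea, the snippet below builds a judging prompt and parses a numeric score from the reply. The prompt wording, the 0 to 100 scale, and the "SCORE:" reply convention are assumptions made here for illustration; the summary does not describe SmellBench's actual judge prompt. No model is called, so the reply string stands in for whatever your LLM client returns.

```python
import re
from typing import Optional

# Hypothetical prompt template; the real benchmark prompt may differ.
JUDGE_TEMPLATE = """You are reviewing a refactoring.
Smell type: {smell}

Original (smelly) code:
{before}

Agent's refactored code:
{after}

Reference refactoring:
{reference}

Rate how completely the smell was eliminated, from 0 to 100.
Answer on the last line as: SCORE: <number>"""


def build_judge_prompt(smell: str, before: str, after: str, reference: str) -> str:
    """Fill the template with one refactoring case."""
    return JUDGE_TEMPLATE.format(
        smell=smell, before=before, after=after, reference=reference
    )


def parse_score(reply: str) -> Optional[float]:
    """Return the last SCORE value in the reply, or None if missing or out of range."""
    matches = re.findall(r"SCORE:\s*(\d+(?:\.\d+)?)", reply)
    if not matches:
        return None
    score = float(matches[-1])
    return score if 0 <= score <= 100 else None


prompt = build_judge_prompt(
    smell="duplicated code",
    before="total += w * h",
    after="total += area(w, h)",
    reference="return sum(area(w, h) for w, h in rooms)",
)
print(parse_score("The duplication is gone.\nSCORE: 72.5"))
```

Returning None for a missing or out-of-range score lets the caller retry or discard the judgment instead of silently recording a bad number.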
Topics
- Code Refactoring
- Code Agents
- LLM Evaluation
- Code Smells
- Software Maintainability
- Multi-file Reasoning
- Benchmarking
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.