SmellBench: Towards Fine-Grained Evaluation of Code Agents on Refactoring Tasks

2026-02-09 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

SmellBench is a new, extensible code refactoring benchmark that evaluates code agents on long-term maintainability rather than functional correctness alone. It proactively injects 7 popular code smell types into 294 clean code snippets from 7 real-world Python repositories, and offers 3 difficulty levels and 2 instruction settings. The benchmark provides human-written ground truth and a 3-dimensional evaluation framework covering functional correctness (test passing rate), localization ability, and LLM-based refactoring quality assessment. Experiments pair 2 open-source code agents (OpenHands, Qwen Code) with 6 large language models (LLMs): Qwen3-Coder-30B-A3B-Instruct, Qwen3-Coder-480B-A35B-Instruct, DeepSeek-V3.2, GPT-5-Mini, Gemini-2.5-Flash, and Claude Sonnet 4.5. The best combination (Qwen Code + Claude Sonnet 4.5) achieved a smell elimination score of only 50.34, which points to significant limitations in cross-file understanding and comprehensive smell elimination.

Key takeaway

AI Scientists and Machine Learning Engineers developing code agents should prioritize cross-file reasoning and architectural understanding. Even the top-performing combination, Qwen Code + Claude Sonnet 4.5, reaches a smell elimination score of only 50.34, which leaves a significant gap on complex refactoring tasks. Improving localization and coordinated modifications across multiple files is the path beyond basic functional correctness toward long-term software maintainability.

Key insights

Code agents struggle with complex, multi-file code refactoring, indicating a gap in repository-level reasoning beyond functional correctness.

Principles

Functional correctness alone is insufficient for refactoring evaluation.
Cross-file coordination is a major challenge for current LLMs.
Proactive smell injection creates scalable, high-quality benchmarks.

Method

SmellBench constructs refactoring cases by selecting real-world repositories, identifying injection locations, introducing 7 types of code smells into clean code, and validating functional correctness with test suites.
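
The inject-then-validate loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the benchmark's pipeline: the dead-code smell, the `DeadCodeInjector` class, the sample function, and the inline test cases are all hypothetical stand-ins for the real smell types, injection locations, and repository test suites.

```python
import ast

# Hypothetical "clean" snippet standing in for real repository code.
CLEAN_SOURCE = '''
def total_price(prices, tax_rate):
    subtotal = sum(prices)
    return subtotal * (1 + tax_rate)
'''


class DeadCodeInjector(ast.NodeTransformer):
    """Illustrative injector: prepends an unused assignment to each function."""

    def visit_FunctionDef(self, node):
        self.generic_visit(node)
        dead_statement = ast.parse("_unused_tmp = 0").body[0]
        node.body.insert(0, dead_statement)
        return node


def inject_smell(source):
    """Return the source with the smell injected (needs Python 3.9+ for ast.unparse)."""
    tree = DeadCodeInjector().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


def load_function(source, name):
    """Execute the snippet in a fresh namespace and return one function from it."""
    namespace = {}
    exec(compile(source, "<snippet>", "exec"), namespace)
    return namespace[name]


def passes_tests(func):
    """Stand-in for the repository test suite."""
    cases = [(([10, 20], 0.5), 45.0), (([], 0.2), 0.0)]
    return all(func(*args) == expected for args, expected in cases)


smelly_source = inject_smell(CLEAN_SOURCE)
# A case is kept only if behavior is unchanged by the injection.
is_valid_case = (
    passes_tests(load_function(CLEAN_SOURCE, "total_price"))
    and passes_tests(load_function(smelly_source, "total_price"))
)
```

The design point is that the clean code is known in advance, so the injected variant can be checked against the same tests before it is admitted as a benchmark case.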

In practice

Use LLM-as-Judge for nuanced refactoring quality assessment.
Prioritize multi-file reasoning in code agent development.
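
As a rough sketch of the LLM-as-Judge suggestion, the snippet below builds a grading prompt and parses a numeric score from the judge's reply. The prompt wording, the 0-100 scale, the `SCORE:` reply convention, and both function names are assumptions for illustration; the summary does not describe the rubric SmellBench uses, and the call to an actual model is deliberately left out.

```python
import re

# Hypothetical rubric; the real benchmark's judging prompt is not given here.
JUDGE_TEMPLATE = """You are reviewing a refactoring.
Smell type: {smell_type}

Original code:
{before}

Refactored code:
{after}

Rate how completely the smell was eliminated on a 0-100 scale.
Answer on the last line as: SCORE: <integer>"""


def build_judge_prompt(smell_type, before, after):
    """Fill the template with one refactoring case."""
    return JUDGE_TEMPLATE.format(smell_type=smell_type, before=before, after=after)


def parse_score(reply):
    """Extract the trailing 'SCORE: N' value; return None if missing or out of range."""
    match = re.search(r"SCORE:\s*(\d{1,3})\s*$", reply.strip())
    if match is None:
        return None
    score = int(match.group(1))
    return score if 0 <= score <= 100 else None


prompt = build_judge_prompt(
    "dead code",
    "def f(x):\n    unused = 0\n    return x + 1",
    "def f(x):\n    return x + 1",
)
example_score = parse_score("The unused assignment was removed.\nSCORE: 95")
```

Returning None for malformed replies lets a harness retry or discard a judgment instead of silently recording a wrong score.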

Topics

Code Refactoring
Code Agents
LLM Evaluation
Code Smells
Software Maintainability
Multi-file Reasoning
Benchmarking

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.