CodeChat-Eval: Evaluating Large Language Models in Multi-Turn Code Refinement Dialogues
Summary
CodeChat-Eval is an evaluation framework for assessing Large Language Models (LLMs) in multi-turn code refinement dialogues, a scenario that is common in software engineering but rarely benchmarked. Unlike existing benchmarks, which focus on single-turn generation, CodeChat-Eval uses a dynamic instruction selection algorithm to construct 10-turn evaluation sessions from 542 SWE tasks. An empirical study of eight LLMs, including Llama 3.1 8B, Qwen 2.5 Coder, DeepSeek-V3, GPT-5 Nano, and GPT-5, found a statistically significant decrease in functional correctness over multi-turn refinement, ranging from 19.2% (GPT-5 Nano) to 69.2% (Llama 3.1 8B). The largest drops occurred with logic-level ("Semantic") and additive ("Add") change requests, indicating that LLMs struggle to preserve working functionality during iterative refinement.
Key takeaway
Machine Learning Engineers who integrate LLMs into iterative code development workflows should run functional tests after every refinement turn rather than relying on the correctness of the initially generated code. The observed 19.2% to 69.2% degradation in functional correctness, most pronounced for "Semantic" and "Add" instructions, means that code which starts out correct can regress as the dialogue continues. Test each turn to catch regressions early, and consider human review for complex logic or additive changes, where subtle bugs are most likely to slip in.
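One way to act on this takeaway is a per-turn regression gate: a refinement is accepted only if the existing tests still pass. The sketch below is illustrative and not part of CodeChat-Eval. The names `refine_with_regression_gate`, `refine`, and `passes_tests` are hypothetical; `refine` stands in for whatever LLM call a workflow uses, and `passes_tests` for whatever test runner it already has.

```python
from typing import Callable, List, Tuple


def refine_with_regression_gate(
    initial_code: str,
    instructions: List[str],
    refine: Callable[[str, str], str],      # hypothetical: (code, instruction) -> new code
    passes_tests: Callable[[str], bool],    # hypothetical: runs the functional test suite
) -> Tuple[str, List[int]]:
    """Apply instructions one turn at a time, rejecting any turn that breaks the tests.

    Returns the last accepted version of the code and the 1-based numbers
    of the turns whose output was rejected.
    """
    accepted = initial_code
    rejected_turns: List[int] = []
    for turn, instruction in enumerate(instructions, start=1):
        candidate = refine(accepted, instruction)
        if passes_tests(candidate):
            # The refinement kept the code functional, so build on it.
            accepted = candidate
        else:
            # Regression detected: keep the previous version and record the turn.
            rejected_turns.append(turn)
    return accepted, rejected_turns
```

In a real workflow, the rejected turns are natural candidates for a retry or for human review, particularly when the instruction was a logic-level or additive change.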
Key insights
LLMs significantly degrade in functional correctness during multi-turn code refinement, especially with semantic or additive changes.
Principles
- Functional correctness degrades significantly in multi-turn code refinement.
- Semantic and Add instructions cause the largest correctness regressions.
- Instruction adherence does not guarantee functional correctness.
Method
CodeChat-Eval combines an instruction taxonomy, an evaluation agenda, and Agenda-Guided Dynamic Instruction Selection (AGDIS) to generate 10-turn sessions, then evaluates each session for functional correctness and instruction adherence.
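To make the functional-correctness measurement concrete, the sketch below tallies a pass rate per turn across sessions and computes a relative drop from the first to the last turn. The data layout (one list of booleans per session) and the relative-drop formula are assumptions for illustration. The summary does not state how the paper defines its reported decrease, and the function names are hypothetical.

```python
from typing import List, Sequence


def pass_rate_per_turn(sessions: Sequence[Sequence[bool]]) -> List[float]:
    """Fraction of sessions whose code passes its tests at each turn.

    Assumed layout: sessions[i][t] is True if session i is functionally
    correct after turn t. All sessions must have the same number of turns.
    """
    if not sessions:
        return []
    n_turns = len(sessions[0])
    if any(len(s) != n_turns for s in sessions):
        raise ValueError("all sessions must have the same number of turns")
    return [sum(s[t] for s in sessions) / len(sessions) for t in range(n_turns)]


def relative_drop(rates: Sequence[float]) -> float:
    """Relative decrease from the first to the last turn (an assumed definition)."""
    first, last = rates[0], rates[-1]
    if first == 0:
        raise ValueError("first-turn pass rate is zero; relative drop is undefined")
    return (first - last) / first


# Toy data: four sessions of three turns each (not results from the paper).
toy_sessions = [
    [True, True, False],
    [True, False, False],
    [True, True, True],
    [True, True, True],
]
toy_rates = pass_rate_per_turn(toy_sessions)   # [1.0, 0.75, 0.5]
toy_drop = relative_drop(toy_rates)            # 0.5
```

Tracking the rate at every turn, rather than only at the end, shows where in a session correctness starts to slip.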
In practice
- Prioritize rigorous testing for LLM-refined code, especially after semantic changes.
- Treat additive ("Add") change requests in multi-turn sessions as high-risk, and re-run existing tests before accepting them.
Topics
- Large Language Models
- Code Refinement
- Functional Correctness
- Software Engineering Benchmarks
- Multi-turn Dialogues
- Code Generation Evaluation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.