CodeChat-Eval: Evaluating Large Language Models in Multi-Turn Code Refinement Dialogues

2026-03-06 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Data Science & Analytics · Depth: Expert, extended

Summary

CodeChat-Eval is a new evaluation framework for assessing Large Language Models (LLMs) in multi-turn code refinement dialogues, a common but under-benchmarked scenario in software engineering. Unlike existing benchmarks, which focus on single-turn generation, CodeChat-Eval uses a dynamic instruction selection algorithm to construct 10-turn evaluation sessions from 542 SWE tasks. An empirical study of eight LLMs, including Llama 3.1 8B, Qwen 2.5 Coder, DeepSeek-V3, GPT-5 Nano, and GPT-5, found a statistically significant decrease in functional correctness over multi-turn refinement, ranging from 19.2% (GPT-5 Nano) to 69.2% (Llama 3.1 8B). The largest correctness drops followed logic-level ("Semantic") and additive ("Add") change requests, indicating that LLMs struggle to preserve code functionality during iterative refinement.

Key takeaway

Machine Learning Engineers integrating LLMs into iterative code development workflows should run functional tests continuously throughout the refinement process. The observed 19.2% to 69.2% degradation in functional correctness, most pronounced for "Semantic" and "Add" instructions, means that a correct initial generation is not enough to rely on. Test after every refinement turn to catch regressions, and consider human review for complex logic or additive changes, where subtle bugs are most likely to be introduced.

Key insights

The functional correctness of LLM-generated code degrades significantly over multi-turn refinement, especially after semantic or additive change requests.

Principles

Functional correctness degrades significantly in multi-turn code refinement.
Semantic and Add instructions cause the largest correctness regressions.
Instruction adherence does not guarantee functional correctness.

Method

CodeChat-Eval combines an instruction taxonomy, an evaluation agenda, and Agenda-Guided Dynamic Instruction Selection (AGDIS) to generate 10-turn sessions, and evaluates both functional correctness and instruction adherence.
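The summary does not describe how AGDIS works internally, so the following is only a rough sketch of what agenda-guided selection could look like, not the paper's algorithm. It assumes the agenda is a per-category quota, that each instruction carries an applicability check against the current code, and that the next instruction comes from the applicable category with the most quota left. `Instruction`, `select_next`, `build_session`, and the stub `refine` function are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Instruction:
    category: str                      # e.g. "Semantic" or "Add" from the taxonomy
    text: str
    applies_to: Callable[[str], bool]  # assumed check against the current code


def select_next(
    code: str, agenda: Dict[str, int], pool: List[Instruction]
) -> Optional[Instruction]:
    """Pick an applicable instruction from the category with the most quota left."""
    applicable = [
        i for i in pool if agenda.get(i.category, 0) > 0 and i.applies_to(code)
    ]
    if not applicable:
        return None
    # max() returns the first maximal item, so ties follow pool order.
    return max(applicable, key=lambda i: agenda[i.category])


def build_session(
    code: str,
    agenda: Dict[str, int],
    pool: List[Instruction],
    refine: Callable[[str, Instruction], str],
    turns: int = 10,
) -> List[Instruction]:
    """Build up to `turns` instructions, re-selecting against the updated code."""
    remaining = dict(agenda)
    candidates = list(pool)
    session: List[Instruction] = []
    for _ in range(turns):
        instr = select_next(code, remaining, candidates)
        if instr is None:
            break  # agenda exhausted or nothing applies to the current code
        remaining[instr.category] -= 1
        candidates.remove(instr)  # do not issue the same instruction twice
        code = refine(code, instr)
        session.append(instr)
    return session


def always(_code: str) -> bool:
    return True


def stub_refine(code: str, instr: Instruction) -> str:
    # Stand-in for an LLM call: just records the request as a comment.
    return code + f"# applied: {instr.text}\n"


pool = [
    Instruction("Semantic", "return 0 for empty input", always),
    Instruction("Add", "add a logging call", always),
    Instruction("Semantic", "clamp the result to [0, 100]", always),
]
session = build_session(
    "def f(xs):\n    return sum(xs)\n", {"Semantic": 2, "Add": 1}, pool, stub_refine
)
print([i.category for i in session])
```

The "dynamic" part in this sketch is that selection is repeated against the refined code each turn, so an instruction that no longer applies is skipped instead of being fixed in advance.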

In practice

Prioritize rigorous testing for LLM-refined code, especially after semantic changes.
Be cautious with LLMs for additive code changes in multi-turn contexts.

Topics

Large Language Models
Code Refinement
Functional Correctness
Software Engineering Benchmarks
Multi-turn Dialogues
Code Generation Evaluation

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.