Talking to a Know-It-All GPT or a Second-Guesser Claude? How Repair reveals unreliable Multi-Turn Behavior in LLMs
Summary
A study investigates how Large Language Models (LLMs) handle "repair" in multi-turn conversations, a mechanism for resolving communication issues in human interaction. Researchers prompted five LLMs (GPT-4o, Claude-Sonnet 4.5, DeepSeek-R1-distill-llama-70b, Phi-4, and Mistral-7b-instruct-v0.3) with 2,511 answerable and 2,600 unanswerable math problems from the UMWP dataset. They analyzed models' self-initiated repair and responses to user-initiated repair across four-turn dialogues using three strategies: a generic "Are you sure?", a specific "Are you sure that is correct?", and a misleading alternative answer. The findings reveal significant differences in model reliability and behavior, with responses ranging from resistance to appropriate repair to susceptibility to manipulation, especially in multi-turn interactions.
Key takeaway
For AI Product Managers designing conversational interfaces, recognize that LLM repair behavior is highly inconsistent across models and turns. You cannot assume a "one-size-fits-all" interaction strategy; instead, anticipate that models like GPT-4o may be stubborn and resist correction, while Claude-Sonnet 4.5 might overcorrect or be easily misled. This necessitates model-specific dialogue design and robust error handling to prevent user frustration and ensure reliable multi-turn interactions.
Key insights
LLM multi-turn conversational repair behavior is unreliable and highly model-specific, challenging consistent user interaction.
Principles
- LLMs rarely initiate repair for unsolvable questions.
- Model linguistic patterns become more distinct in multi-turn interactions.
- RLHF may drive LLMs to answer even when they should abstain.
Method
The study used a 4-turn interaction sequence with solvable/unsolvable math problems, applying three user-initiated repair strategies (generic, specific, misleading) to evaluate LLM repair initiation, execution, and overadaptation.
In practice
- Evaluate LLMs beyond single-turn performance.
- Consider model-specific interaction profiles for dialogue systems.
- Be cautious of LLM overadaptation to misleading user input.
Topics
- Conversational Repair
- LLM Behavior
- Multi-Turn Interaction
- UMWP Dataset
- User-Initiated Repair
Code references
Best for: Research Scientist, AI Engineer, AI Product Manager, AI Scientist, Machine Learning Engineer, NLP Engineer
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.