Talking to a Know-It-All GPT or a Second-Guesser Claude? How Repair reveals unreliable Multi-Turn Behavior in LLMs

2025-05-30 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing, Data Science & Analytics · Depth: Expert, extended

Summary

A study investigates how Large Language Models (LLMs) handle "repair" in multi-turn conversations, a mechanism for resolving communication issues in human interaction. Researchers prompted five LLMs (GPT-4o, Claude-Sonnet 4.5, DeepSeek-R1-distill-llama-70b, Phi-4, and Mistral-7b-instruct-v0.3) with 2,511 answerable and 2,600 unanswerable math problems from the UMWP dataset. They analyzed models' self-initiated repair and responses to user-initiated repair across four-turn dialogues using three strategies: a generic "Are you sure?", a specific "Are you sure that is correct?", and a misleading alternative answer. The findings reveal significant differences in model reliability and behavior, with responses ranging from resistance to appropriate repair to susceptibility to manipulation, especially in multi-turn interactions.

Key takeaway

For AI Product Managers designing conversational interfaces, recognize that LLM repair behavior is highly inconsistent across models and turns. You cannot assume a "one-size-fits-all" interaction strategy; instead, anticipate that models like GPT-4o may be stubborn and resist correction, while Claude-Sonnet 4.5 might overcorrect or be easily misled. This necessitates model-specific dialogue design and robust error handling to prevent user frustration and ensure reliable multi-turn interactions.

Key insights

LLM multi-turn conversational repair behavior is unreliable and highly model-specific, challenging consistent user interaction.

Principles

LLMs rarely initiate repair for unsolvable questions.
Model linguistic patterns become more distinct in multi-turn interactions.
RLHF may drive LLMs to answer even when they should abstain.

Method

The study used a 4-turn interaction sequence with solvable/unsolvable math problems, applying three user-initiated repair strategies (generic, specific, misleading) to evaluate LLM repair initiation, execution, and overadaptation.

In practice

Evaluate LLMs beyond single-turn performance.
Consider model-specific interaction profiles for dialogue systems.
Be cautious of LLM overadaptation to misleading user input.

Topics

Conversational Repair
LLM Behavior
Multi-Turn Interaction
UMWP Dataset
User-Initiated Repair

Code references

Yuki-Asuuna/UMWP

Best for: Research Scientist, AI Engineer, AI Product Manager, AI Scientist, Machine Learning Engineer, NLP Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.