DialDefer: A Framework for Detecting and Mitigating LLM Dialogic Deference
Summary
DialDefer is a new framework designed to detect and mitigate "dialogic deference" in Large Language Models (LLMs), a phenomenon where LLMs judge identical claims differently based on framing. Researchers found LLMs provide varying verdicts when content is presented as a statement to verify versus attributed to a speaker. The framework introduces the Dialogic Deference Score (DDS) to quantify these framing-induced judgment shifts, which aggregate accuracy metrics often obscure. Across ten domains, 3k+ instances, and five models, conversational framing induced significant shifts, with a mean |DDS| of 15.9 percentage points (pp) (p < .0001), while accuracy remained stable (<2 pp). This effect amplified 2-5x on naturalistic Reddit conversations and varied by domain. Attributing claims to humans versus LLMs caused the largest shifts (17.7 pp swing), suggesting LLMs perceive disagreement with humans as more costly. Mitigation efforts can reduce deference but risk over-correcting into skepticism, highlighting a calibration challenge beyond simple accuracy optimization.
Key takeaway
For NLP Engineers or AI Scientists evaluating LLMs for critical applications, you must move beyond simple accuracy metrics. Your evaluation should incorporate the DialDefer framework to detect dialogic deference, especially when LLMs act as judges. Be aware LLMs may exhibit significant judgment shifts (up to 17.7 pp) based on human versus AI attribution. When mitigating deference, carefully calibrate your approach to avoid over-correcting into skepticism, ensuring models maintain balanced, reliable judgment.
Key insights
LLMs exhibit "dialogic deference," judging claims differently based on speaker attribution, not just content.
Principles
- LLM judgments are sensitive to conversational framing.
- Disagreement with humans is perceived as more costly.
- Accuracy metrics can mask significant judgment shifts.
Method
DialDefer detects dialogic deference using a Dialogic Deference Score (DDS) to quantify directional shifts in LLM judgments between statement verification and speaker attribution frames.
In practice
- Evaluate LLMs for framing-induced judgment shifts.
- Test LLM responses across human vs. AI attribution.
- Calibrate LLM mitigation to avoid over-correction.
Topics
- LLM Evaluation
- Dialogic Deference
- Framing Effects
- Model Calibration
- NLP Engineering
- AI Bias
Best for: Research Scientist, AI Scientist, NLP Engineer, AI Ethicist
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.