The Cost of Consensus: Isolated Self-Correction Prevails Over Unguided Homogeneous Multi-Agent Debate
Summary
A controlled empirical study examined the efficacy and cost-accuracy trade-offs of multi-agent debate among homogeneous Large Language Models (LLMs) in the 7-8B parameter class: Qwen2.5-7B, Llama-3.1-8B, and Ministral-3-8B. Running three debate rounds with N=10 agents on high-difficulty benchmarks (GSM-Hard and MMLU-Hard), the study compared peer debate against isolated self-correction and a stochastic noise control. Unguided homogeneous debate consistently underperformed isolated self-correction and showed three primary failure modes: sycophantic conformity (modal adoption of up to 85.5%), contextual fragility (a vulnerability rate of up to 70.0%), and consensus collapse (an oracle gap of up to 32.3 percentage points). Debate architectures also cost 2.1-3.4x as many tokens as self-correction (up to 28,631 tokens per problem) for equal or lower accuracy, which makes them both economically inefficient and behaviorally unstable.
Key takeaway
For AI engineers designing compound AI systems with 7-8B instruction-tuned LLMs, relying on unguided multi-agent debate for consensus is likely counterproductive. You should instead favor isolated self-correction, which offers a superior cost-accuracy trade-off by avoiding sycophantic conformity, contextual fragility, and significant token overhead. Consider implementing robust self-correction mechanisms or exploring structured debate protocols with explicit dissent to mitigate these identified failure modes.
Key insights
Unguided multi-agent LLM debate is costly and often degrades accuracy due to sycophancy and contextual fragility.
Principles
- RLHF-aligned LLMs exhibit sycophantic conformity in peer debate.
- Plurality voting can discard correct answers due to peer influence.
- Communication overhead significantly increases token costs without proportional accuracy gains.
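The second principle can be made concrete with a small sketch. It assumes the oracle gap is the difference between oracle accuracy (at least one agent holds the correct answer) and plurality-vote accuracy; the summary does not define the metric, so this definition, the function names, and the toy data are all illustrative assumptions rather than the study's code.

```python
from collections import Counter

def plurality_vote(answers):
    """Return the most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]

def oracle_gap(problems):
    """problems: list of (agent_answers, gold_answer) pairs.

    Returns (oracle accuracy, vote accuracy, gap), all in percent.
    ASSUMPTION: "oracle" means at least one agent gave the gold answer.
    """
    n = len(problems)
    oracle_hits = sum(gold in answers for answers, gold in problems)
    vote_hits = sum(plurality_vote(answers) == gold for answers, gold in problems)
    oracle_acc = 100.0 * oracle_hits / n
    vote_acc = 100.0 * vote_hits / n
    return oracle_acc, vote_acc, oracle_acc - vote_acc

# Hypothetical final-round answers from four agents on four problems.
problems = [
    (["12", "12", "15", "12"], "12"),  # majority is correct
    (["7", "9", "9", "9"], "7"),       # correct answer present but outvoted
    (["3", "4", "4", "5"], "8"),       # no agent is correct
    (["20", "20", "21", "20"], "20"),  # majority is correct
]

oracle_acc, vote_acc, gap = oracle_gap(problems)
print(f"oracle={oracle_acc:.1f}% vote={vote_acc:.1f}% gap={gap:.1f} pp")
```

The second toy problem is the case of interest: a correct answer exists in the team, but plurality voting discards it.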
Method
The study compared multi-agent debate with isolated self-correction and stochastic noise injection, using teams of N=10 agents over three rounds on high-difficulty math and reasoning benchmarks, and measured accuracy, token cost, and behavioral dynamics.
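One way to quantify a behavioral dynamic such as conformity is sketched below. It assumes "modal adoption" means the share of dissenting agents who switch to the previous round's most common answer; the summary does not give the exact definition, so the metric, the function names, and the answer lists here are hypothetical.

```python
from collections import Counter

def modal_answer(answers):
    """Most common answer in a round; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]

def modal_adoption_rate(prev_round, next_round):
    """Fraction of previous-round dissenters who hold the modal answer next round.

    ASSUMPTION: this is an illustrative definition, not the study's metric.
    Both lists are indexed by agent, so position i is the same agent.
    """
    mode = modal_answer(prev_round)
    dissenters = [i for i, a in enumerate(prev_round) if a != mode]
    if not dissenters:
        return 0.0
    adopted = sum(next_round[i] == mode for i in dissenters)
    return adopted / len(dissenters)

# Hypothetical answers from N=10 agents in two consecutive rounds.
round_1 = ["9", "9", "9", "7", "7", "5", "9", "9", "4", "9"]
round_2 = ["9", "9", "9", "9", "7", "9", "9", "9", "9", "9"]

rate = modal_adoption_rate(round_1, round_2)
print(f"modal adoption rate: {rate:.2f}")
```

A high rate alone does not prove sycophancy, since agents may switch because the modal answer is right; comparing it against correctness is what separates conformity from genuine correction.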
In practice
- Prioritize isolated self-correction over unguided multi-agent debate for 7-8B LLMs.
- Be wary of sycophancy in LLM teams, especially with RLHF-aligned models.
- Consider giving a single agent a 10x output token budget as a cost-effective alternative to multi-agent debate.
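A back-of-envelope token count helps show why peer context inflates cost. The sketch below assumes a simplified accounting model in which every agent rereads the prompt each round, plus either its own previous response (self-correction) or all agents' previous responses (debate). The model and the numbers are hypothetical and are not meant to reproduce the reported 2.1-3.4x multiplier.

```python
def self_correction_tokens(n_agents, rounds, prompt_tokens, response_tokens):
    """Total tokens when each agent only rereads its own previous response."""
    total = 0
    for r in range(rounds):
        # Round 0 has no prior response to reread.
        context = prompt_tokens + (response_tokens if r > 0 else 0)
        total += n_agents * (context + response_tokens)
    return total

def debate_tokens(n_agents, rounds, prompt_tokens, response_tokens):
    """Total tokens when each agent rereads every agent's previous response."""
    total = 0
    for r in range(rounds):
        peer_context = n_agents * response_tokens if r > 0 else 0
        context = prompt_tokens + peer_context
        total += n_agents * (context + response_tokens)
    return total

# Hypothetical sizes: 200-token prompt, 300-token responses, N=10, 3 rounds.
solo = self_correction_tokens(10, 3, 200, 300)
team = debate_tokens(10, 3, 200, 300)
print(f"self-correction={solo} debate={team} multiplier={team / solo:.2f}x")
```

Under this model the extra cost grows with the number of peers each agent must read, so the multiplier depends heavily on response length and team size.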
Topics
- Multi-Agent LLM Debate
- Sycophantic Conformity
- Inference Economics
- Isolated Self-Correction
- Contextual Fragility
Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.MA updates on arXiv.org.