The Cost of Consensus: Isolated Self-Correction Prevails Over Unguided Homogeneous Multi-Agent Debate

Source: cs.MA updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems) · Depth: Expert, extended

Summary

A controlled empirical study measured the accuracy and cost of multi-agent debate among homogeneous Large Language Models (LLMs) in the 7-8B parameter class: Qwen2.5-7B, Llama-3.1-8B, and Ministral-3-8B. Using N=10 agents over three debate rounds on two high-difficulty benchmarks (GSM-Hard and MMLU-Hard), the study compared peer debate against isolated self-correction and a stochastic noise control. Unguided homogeneous debate consistently underperformed isolated self-correction and exhibited three primary failure modes: sycophantic conformity (up to 85.5% modal adoption), contextual fragility (a vulnerability rate of up to 70.0%), and consensus collapse (an oracle gap of up to 32.3 percentage points). Debate architectures also incurred a 2.1-3.4x token cost multiplier relative to self-correction (up to 28,631 tokens per problem) for equal or lower accuracy, which points to both economic inefficiency and behavioral instability.
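The summary reports its metrics without defining them, so the following is only a minimal sketch of how such quantities are commonly computed, not the paper's implementation. It assumes that the oracle gap is the difference between "at least one agent is correct" accuracy and majority-vote accuracy, that modal adoption is the share of dissenting agents who switch to the previous round's most common answer, and that the cost multiplier is a plain ratio of token counts. All function names are hypothetical.

```python
from collections import Counter


def majority_answer(answers):
    """Return the most common answer; ties resolve to the first one seen."""
    return Counter(answers).most_common(1)[0][0]


def oracle_gap(final_answers, gold):
    """Assumed definition: oracle accuracy minus majority-vote accuracy.

    final_answers: one list of agent answers per problem.
    gold: the correct answer for each problem.
    Returns (consensus_acc, oracle_acc, gap_in_percentage_points).
    """
    n = len(gold)
    consensus = sum(majority_answer(a) == g for a, g in zip(final_answers, gold)) / n
    oracle = sum(g in a for a, g in zip(final_answers, gold)) / n
    return consensus, oracle, 100.0 * (oracle - consensus)


def modal_adoption_rate(prev_answers, next_answers):
    """Assumed definition: among agents that disagreed with the previous
    round's modal answer, the fraction that adopt it in the next round."""
    mode = majority_answer(prev_answers)
    dissenters = [(p, q) for p, q in zip(prev_answers, next_answers) if p != mode]
    if not dissenters:
        return 0.0
    return sum(q == mode for _, q in dissenters) / len(dissenters)


def cost_multiplier(debate_tokens, self_correction_tokens):
    """Ratio of tokens spent by debate to tokens spent by self-correction."""
    return debate_tokens / self_correction_tokens
```

Under these assumed definitions, a large oracle gap means the team frequently contained a correct answer that the final consensus discarded, which is one way to read the "consensus collapse" failure mode described above.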

Key takeaway

For AI engineers designing compound AI systems with 7-8B instruction-tuned LLMs, relying on unguided multi-agent debate for consensus is likely counterproductive. You should instead favor isolated self-correction, which offers a superior cost-accuracy trade-off by avoiding sycophantic conformity, contextual fragility, and significant token overhead. Consider implementing robust self-correction mechanisms or exploring structured debate protocols with explicit dissent to mitigate these identified failure modes.

Key insights

Unguided multi-agent LLM debate is costly and often degrades accuracy due to sycophancy and contextual fragility.

Method

The study compared multi-agent debate against isolated self-correction and a stochastic noise control, running teams of N=10 homogeneous agents for three rounds on two high-difficulty benchmarks (GSM-Hard for math, MMLU-Hard for reasoning) and measuring accuracy, token cost, and behavioral dynamics.
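The three conditions differ mainly in what each agent sees between rounds, which a short sketch can make concrete. This is an illustration under stated assumptions, not the authors' protocol: the prompt wording, the form of the noise (random digits here), the choice to count the initial answer separately from the three revision rounds, and the `generate` callable standing in for an LLM call are all hypothetical.

```python
import random
from typing import Callable, List

# Hypothetical stand-in for an LLM call: takes a prompt, returns an answer.
Generate = Callable[[str], str]


def build_context(condition: str, agent_idx: int,
                  prev_answers: List[str], rng: random.Random) -> str:
    """Build the between-round context for one agent (assumed wording)."""
    own = prev_answers[agent_idx]
    if condition == "self_correction":
        # Isolated: the agent only sees its own previous answer.
        return f"Your previous answer: {own}\nRe-check it and answer again."
    if condition == "debate":
        # Peer debate: the agent also sees every other agent's answer.
        peers = [a for i, a in enumerate(prev_answers) if i != agent_idx]
        return f"Your previous answer: {own}\nPeer answers: {peers}\nAnswer again."
    if condition == "noise":
        # Control: peer answers are replaced by uninformative random text.
        noise = " ".join(str(rng.randint(0, 9999)) for _ in range(len(prev_answers) - 1))
        return f"Your previous answer: {own}\nUnrelated text: {noise}\nAnswer again."
    raise ValueError(f"unknown condition: {condition}")


def run_condition(question: str, generate: Generate, condition: str,
                  n_agents: int = 10, n_rounds: int = 3, seed: int = 0) -> List[List[str]]:
    """Return the answers of all agents after the initial pass and each round."""
    rng = random.Random(seed)
    answers = [generate(question) for _ in range(n_agents)]
    history = [answers]
    for _ in range(n_rounds):
        # Every agent revises against the previous round's answers only.
        answers = [
            generate(question + "\n" + build_context(condition, i, answers, rng))
            for i in range(n_agents)
        ]
        history.append(answers)
    return history
```

Because the debate context grows with the number of peers while the self-correction context does not, a layout like this also shows where the extra token cost of debate would come from.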

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.MA updates on arXiv.org.