ARMOR-MAD: Adaptive Routing for Heterogeneous Multi-Agent Debate in Large Language Model Reasoning

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, extended

Summary

ARMOR-MAD is a training-free, adaptive framework for Multi-Agent Debate (MAD) in large language model reasoning, designed to avoid the computational waste and correlated errors of fixed debate pipelines. It combines three components: Pre-debate Agreement Routing (PAR), which decides whether the Round-0 answers need debate at all; Early Agreement Stopping Evaluator (EASE), which halts debate once the agents converge; and Semantic Outlier Detection (SOD), which makes aggregation more robust by down-weighting abnormal final answers. On MATH Level 5, GSM8K, MMLU, and MMLU-Pro, ARMOR-MAD outperformed fixed-round heterogeneous debate on every benchmark, reaching 65.5%, 96.5%, 90.0%, and 81.5% accuracy, respectively. The results point to the value of genuine model heterogeneity and agreement-based control.

Key takeaway

For AI Scientists and ML Engineers designing multi-agent LLM systems, relying on fixed-round, homogeneous debate is inefficient and risks amplifying correlated errors. Prioritize genuine model heterogeneity and add adaptive control mechanisms such as pre-debate routing and early stopping. This approach, exemplified by ARMOR-MAD, improves both accuracy and computational efficiency relative to fixed-round debate. One caveat: make sure that robust aggregation does not suppress a correct "lone expert" minority answer.

Key insights

Adaptive control and genuine model heterogeneity are crucial for efficient and accurate multi-agent LLM reasoning.

Principles

Heterogeneity provides independent perspectives, reducing correlated errors.
Adaptivity controls when debate is necessary, when it stops, and how answers aggregate.

Method

Heterogeneous agents (gpt-4o-mini, deepseek-v3, qwen-plus) each generate a Round-0 answer. PAR routes the question to debate only if Round-0 agreement is below 0.67. EASE stops the debate once all agents converge (phi = 1.0) or after T_max rounds. SOD then aggregates the final answers, down-weighting semantic outliers (lambda_out = 0.7).
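
The routing and stopping logic above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's implementation: agreement is assumed to be the share of agents that give the most common answer, the function names and the `step` callback (one debate round over all agents) are hypothetical, and the default `t_max` is a placeholder because this summary does not give the value of T_max.

```python
from collections import Counter
from typing import Callable, List, Sequence

PAR_THRESHOLD = 0.67  # from the summary: debate if agreement < 0.67
EASE_PHI = 1.0        # from the summary: stop when all agents agree


def agreement(answers: Sequence[str]) -> float:
    """Share of agents giving the most common answer (assumed definition)."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(a.strip().lower() for a in answers)
    return counts.most_common(1)[0][1] / len(answers)


def needs_debate(round0: Sequence[str], threshold: float = PAR_THRESHOLD) -> bool:
    """PAR-style check: debate only when Round-0 agreement is below threshold."""
    return agreement(round0) < threshold


def run_debate(
    round0: Sequence[str],
    step: Callable[[List[str]], List[str]],
    t_max: int = 3,  # placeholder; the summary does not state T_max
) -> List[List[str]]:
    """Return the answer history, one list of answers per round."""
    answers = list(round0)
    history = [answers]
    if not needs_debate(answers):
        return history  # PAR: skip debate entirely
    for _ in range(t_max):
        answers = step(answers)  # hypothetical: one round of agent revisions
        history.append(answers)
        if agreement(answers) >= EASE_PHI:
            break  # EASE: full convergence, stop early
    return history
```

Under this assumed definition, three agents can only score 1/3, 2/3, or 1.0, so a 2-of-3 majority (about 0.667) still falls below 0.67 and triggers debate; only unanimous Round-0 answers skip it.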

In practice

Use diverse LLM families (e.g., gpt-4o-mini, deepseek-v3, qwen-plus) for agents.
Implement agreement-based routing to skip debate for high-confidence initial answers.
Employ semantic outlier detection to improve final answer aggregation.
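
As a rough illustration of outlier-aware aggregation, the sketch below runs a weighted vote in which answers that look unlike the rest count for less. Several things here are assumptions rather than details from the paper: token-level Jaccard overlap stands in for whatever semantic similarity SOD uses, `sim_floor` is a hypothetical cutoff, and lambda_out = 0.7 is assumed to be the multiplier applied to an outlier's vote (the summary gives the value but not how it enters the formula).

```python
from collections import defaultdict
from typing import Dict, List, Sequence

LAMBDA_OUT = 0.7  # value from the summary; its use as a vote weight is assumed


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a crude stand-in for semantic similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def outlier_weights(answers: Sequence[str], sim_floor: float = 0.2) -> List[float]:
    """Weight 1.0 for typical answers, LAMBDA_OUT for semantic outliers."""
    weights = []
    for i, ans in enumerate(answers):
        sims = [jaccard(ans, other) for j, other in enumerate(answers) if j != i]
        mean_sim = sum(sims) / len(sims) if sims else 1.0
        weights.append(LAMBDA_OUT if mean_sim < sim_floor else 1.0)
    return weights


def aggregate(answers: Sequence[str], sim_floor: float = 0.2) -> str:
    """Weighted plurality vote over final answers."""
    votes: Dict[str, float] = defaultdict(float)
    for ans, w in zip(answers, outlier_weights(answers, sim_floor)):
        votes[ans.strip()] += w
    return max(votes, key=votes.get)


finals = ["x = 4", "x = 4", "the result is seven hundred"]
print(outlier_weights(finals))  # the third answer shares no tokens with the others
print(aggregate(finals))
```

Because an outlier keeps a nonzero weight in this sketch, a dissenting answer is discounted rather than discarded, which is one way to limit the risk of silencing a correct "lone expert".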

Topics

Multi-Agent Debate
LLM Reasoning
Adaptive Routing
Model Heterogeneity
Semantic Outlier Detection
Conditional Computation
Computational Efficiency

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.