Multiagent Protocols with Aggregated Confidence Signals

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Natural Language Processing) · Depth: Expert, extended

Summary

The paper introduces three multiagent protocols, Weighted Stream Voting (WSV), Confidence Gating with Aggregation (CGA), and Human-Inspired Debate (HID), each designed to produce a single aggregated confidence score for a multiagent Large Language Model (LLM) system. Prior Multiagent Debate (MAD) methods use confidence internally but never aggregate it into a system-level signal. These protocols first transform raw confidence signals so they are comparable across models, then combine them via soft voting or Bayesian fusion. The evaluation covers five benchmarks and four task types, using 14 LLMs (3B to 32B parameters) in homogeneous and heterogeneous pairs. The aggregated confidence is substantially more discriminative, as measured by AUARC, than that of single-agent or standard MAD baselines. The protocols also recover the F1 losses that standard MAD incurs on ambiguous tasks (e.g., -5.53% on BoolQ and -6.02% on Vast), showing consistent F1 gains where standard MAD often degrades performance. Finally, the research highlights the essential role of post-hoc calibration, particularly parametric methods such as Beta calibration, in improving F1 scores.

Key takeaway

Machine Learning Engineers building multiagent LLM systems should consider confidence-aware protocols such as WSV or CGA to improve system reliability and accuracy. These methods provide a discriminative aggregated confidence signal and recover the performance losses common in standard multiagent debate, especially on ambiguous tasks. Prioritize post-hoc calibration, particularly parametric methods, so that confidence is comparable across models. Together, these steps support more robust system decisions and higher overall F1 scores.
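As a rough illustration of the parametric calibration step, the sketch below applies the standard Beta calibration map, sigmoid(a·ln s − b·ln(1 − s) + c), and fits its three parameters by plain gradient descent on log loss. The function names, learning rate, and step count are assumptions made for this example, not details from the paper; a full Beta calibration fit usually also constrains a and b to be non-negative, which this sketch omits.

```python
import math

def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def beta_calibrate(s, a, b, c, eps=1e-6):
    """Map a raw confidence s in (0, 1) to a calibrated probability."""
    s = min(max(s, eps), 1.0 - eps)  # keep the logarithms finite
    return _sigmoid(a * math.log(s) - b * math.log(1.0 - s) + c)

def fit_beta(scores, labels, lr=0.1, steps=2000, eps=1e-6):
    """Fit (a, b, c) by gradient descent on mean log loss.

    scores: raw confidences from one model on a held-out set.
    labels: 1 if the model's answer was correct, else 0.
    Starts from a=1, b=1, c=0, which is the identity map.
    """
    feats = []
    for s in scores:
        s = min(max(s, eps), 1.0 - eps)
        feats.append((math.log(s), -math.log(1.0 - s)))
    a, b, c = 1.0, 1.0, 0.0
    n = len(feats)
    for _ in range(steps):
        ga = gb = gc = 0.0
        for (x1, x2), y in zip(feats, labels):
            err = _sigmoid(a * x1 + b * x2 + c) - y
            ga += err * x1
            gb += err * x2
            gc += err
        a -= lr * ga / n
        b -= lr * gb / n
        c -= lr * gc / n
    return a, b, c
```

Fitting one such map per model on held-out data is one way to put confidences from different models on a common scale before they are aggregated.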

Key insights

The proposed multiagent protocols aggregate individual LLM confidences into a single, discriminative system-level confidence and recover the performance losses of standard debate.
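To make "discriminative" concrete, here is a minimal sketch of an AUARC-style score, assuming AUARC denotes the area under the accuracy-rejection curve. The summary does not define the metric, so the discretization below (averaging the accuracy of the k most confident predictions over all k) is an assumption and may differ from the paper's exact computation.

```python
def auarc(confidences, correct):
    """Approximate the area under the accuracy-rejection curve.

    confidences: one confidence score per prediction.
    correct: 1 if the matching prediction was right, else 0.
    Predictions are kept in order of decreasing confidence; the score is
    the mean accuracy over every retained prefix (top 1, top 2, ...).
    """
    if not confidences or len(confidences) != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    order = sorted(range(len(confidences)),
                   key=lambda i: confidences[i], reverse=True)
    hits = 0
    accuracies = []
    for k, i in enumerate(order, start=1):
        hits += correct[i]
        accuracies.append(hits / k)
    return sum(accuracies) / len(accuracies)
```

A confidence signal that ranks correct answers above incorrect ones scores higher: with two correct and two incorrect predictions, a perfect ranking gives about 0.79 under this definition, while the reversed ranking gives about 0.21.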

Principles

Method

The protocols transform raw confidence signals for cross-model comparability, then combine them using soft voting (WSV) or Bayesian fusion (CGA, HID) to yield a final answer and aggregated confidence.

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.