Multi-Agent Reasoning with Consistency Verification Improves Uncertainty Calibration in Medical MCQA

· Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Health & Medical Research · Depth: Expert, extended

Summary

MARC, a new multi-agent framework, improves uncertainty calibration and discrimination in medical multiple-choice question answering (MCQA). It combines four domain-specific specialist agents, each running Qwen2.5-7B-Instruct, with a Two-Phase Verification process and S-Score Weighted Fusion. On 100-question and 250-question high-disagreement subsets of MedQA-USMLE and MedMCQA, it reduced Expected Calibration Error (ECE) by 49–74% across all four settings. On MedQA-250, for instance, MARC reached an ECE of 0.091, a 74.4% reduction from the single-specialist baseline, alongside an AUROC of 0.630 (+0.056) at 59.2% accuracy. Ablation studies identified Two-Phase Verification as the main driver of the calibration improvements and multi-agent reasoning as the main driver of the accuracy gains, which supports consistency-based verification as a practical confidence signal for safety-critical clinical AI.
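For readers less familiar with the headline metric, the following is a minimal sketch of how ECE is commonly computed: predictions are grouped into equal-width confidence bins, and the gap between mean confidence and accuracy in each bin is averaged, weighted by bin size. The function name, the choice of 10 bins, and the equal-width binning are assumptions for illustration; the summary does not state which binning scheme the paper uses.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE (illustrative; binning scheme is an assumption).

    confidences: list of floats in [0, 1], one per question.
    correct:     list of bools, True where the predicted answer was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    # Assign each prediction to a bin; confidence 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's |accuracy - confidence| gap by its share of samples.
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

A lower value means stated confidence tracks observed accuracy more closely; a model that reports 0.9 confidence but is right 75% of the time in that bin contributes a gap of 0.15.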

Key takeaway

MLOps engineers deploying medical AI systems in clinical settings should consider consistency-based verification mechanisms such as MARC's Two-Phase Verification. The approach improves confidence calibration (49–74% ECE reduction) without requiring labeled data, and the resulting confidence serves as a signal for deferring to human experts. Multi-agent fusion improves accuracy, but internal consistency does not guarantee factual correctness, especially on knowledge-intensive tasks, so integration with external knowledge bases remains future work.
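As a rough illustration of the deferral idea, the sketch below routes a prediction either to automatic answering or to human review based on its calibrated confidence. The function name `route_prediction`, the action labels, and the 0.8 threshold are all hypothetical; none of them come from the paper, and a real threshold would need to be chosen per deployment.

```python
def route_prediction(answer, confidence, threshold=0.8):
    """Decide whether to return an answer or defer to a clinician.

    The threshold is a hypothetical placeholder, not a value from the paper.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    # Low-confidence predictions are escalated rather than suppressed,
    # so the reviewer still sees the model's suggested answer.
    action = "auto_answer" if confidence >= threshold else "defer_to_clinician"
    return {"action": action, "answer": answer, "confidence": confidence}
```

A rule like this is only as useful as the calibration behind it, which is why the ECE reduction matters for deployment.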

Key insights

Multi-agent reasoning with consistency verification significantly improves medical AI confidence calibration and discrimination.

Method

MARC uses Qwen2.5-7B-Instruct specialist agents, Two-Phase Verification to derive S-scores from internal consistency, and S-Score Weighted Fusion to select answers and calibrate confidence.
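One plausible reading of S-Score Weighted Fusion is a weighted vote: each specialist's answer counts in proportion to its S-score, the option with the largest total wins, and the winner's share of the total weight serves as the confidence. The sketch below implements that reading only. The function name `fuse_by_s_score`, the normalization, and the tie-breaking rule are assumptions, and the paper's actual fusion formula may differ.

```python
from collections import defaultdict


def fuse_by_s_score(votes):
    """Weighted vote over specialist answers (an assumed fusion rule).

    votes: list of (answer_option, s_score) pairs, one per specialist agent,
           where s_score is a non-negative consistency score.
    Returns (selected_option, confidence), or (None, 0.0) if no weight exists.
    """
    totals = defaultdict(float)
    for option, s_score in votes:
        if s_score < 0:
            raise ValueError("s_score must be non-negative")
        totals[option] += s_score

    total_weight = sum(totals.values())
    if total_weight == 0:
        return None, 0.0

    # Iterate options in sorted order so ties resolve deterministically.
    best = max(sorted(totals), key=totals.get)
    return best, totals[best] / total_weight
```

For example, four specialists voting `("A", 0.9)`, `("B", 0.4)`, `("A", 0.5)`, and `("C", 0.2)` would select option A with a confidence of about 0.7 under this rule.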

Best for: AI Scientist, Research Scientist, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.