EARS: Explanatory Abstention for Reliable Sub-Agent Modeling in Large-scale Multi-Agent Systems

Source: cs.CL updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Data Science & Analytics) · Depth: Expert, extended

Summary

EARS (Explanatory Abstention for Reliable Sub-Agent Modeling) is a production-oriented framework from Shopify and Columbia University for improving reliability in large-scale centralized multi-agent systems (MAS). It targets the tendency of sub-agents to "over-answer" ambiguous or unsupported requests, which produces hallucinated output. EARS reframes abstention as an inter-agent communication protocol: instead of answering, a sub-agent returns an actionable failure state and a rationale to the coordinator. An ensemble of calibrated LLM-as-a-Judge models curates human-agent interaction data and generates structured abstention labels under a four-category taxonomy. Sub-agents fine-tuned on this data learn to detect failure conditions. In a production e-commerce assistant, EARS raised the overall response pass rate from 68.5% to 78.9%, a gain of 10.4 percentage points (15.2% relative).
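
To make the "abstention as a protocol" idea concrete, here is a minimal Python sketch of what a structured sub-agent reply could look like. It is an illustration only, not the schema used by EARS: the class and field names (`SubAgentReply`, `abstention`, `rationale`) are hypothetical, and the four enum members are placeholders because this summary does not name the taxonomy's categories.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AbstentionCategory(Enum):
    # Placeholder members: the taxonomy has four categories,
    # but their actual names are not given in this summary.
    CATEGORY_A = "category_a"
    CATEGORY_B = "category_b"
    CATEGORY_C = "category_c"
    CATEGORY_D = "category_d"


@dataclass(frozen=True)
class SubAgentReply:
    """Hypothetical message a sub-agent sends back to the coordinator."""

    answer: Optional[str] = None
    abstention: Optional[AbstentionCategory] = None
    rationale: Optional[str] = None

    def __post_init__(self) -> None:
        answered = self.answer is not None
        abstained = self.abstention is not None
        # A reply is either an answer or an abstention, never both or neither.
        if answered == abstained:
            raise ValueError("reply must carry exactly one of answer or abstention")
        # An abstention without a rationale gives the coordinator nothing to act on.
        if abstained and not self.rationale:
            raise ValueError("an abstention must include a rationale")

    @property
    def is_abstention(self) -> bool:
        return self.abstention is not None


reply = SubAgentReply(
    abstention=AbstentionCategory.CATEGORY_A,
    rationale="The request does not say which order it refers to.",
)
print(reply.is_abstention)
```

The design choice worth noting is the validation in `__post_init__`: requiring a rationale alongside every abstention is what turns a bare refusal into a failure state the coordinator can act on.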

Key takeaway

For MLOps engineers deploying or managing large-scale multi-agent systems, sub-agent explanatory abstention is a practical lever for system robustness. Sub-agents that always answer may return unfaithful or hallucinated outputs for ambiguous queries. Fine-tuning them to return structured abstention rationales, as EARS does, raised the overall response pass rate in the reported deployment (68.5% to 78.9%) and gives the coordinator what it needs to choose a recovery strategy. To scale data curation for this kind of fine-tuning, consider a calibrated LLM-as-a-Judge pipeline.
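
As a rough illustration of coordinator-level recovery, the Python sketch below dispatches on an abstention category. Everything in it is an assumption made for illustration: the `coordinate` function, the tuple-shaped reply, the category string `"category_a"`, and the action names `"ask_user_to_clarify"` and `"escalate"` are hypothetical and not taken from EARS.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical reply shape: (answer, abstention_category, rationale).
# A category of None means the sub-agent answered normally.
Reply = Tuple[Optional[str], Optional[str], Optional[str]]


def coordinate(
    request: str,
    sub_agent: Callable[[str], Reply],
    recovery_policy: Dict[str, str],
    default_action: str = "escalate",
) -> Dict[str, Optional[str]]:
    """Call a sub-agent and pick a recovery action if it abstains."""
    answer, category, rationale = sub_agent(request)
    if category is None:
        return {"status": "answered", "output": answer}
    # Unknown categories fall back to a conservative default action.
    action = recovery_policy.get(category, default_action)
    return {
        "status": "abstained",
        "action": action,
        "category": category,
        "rationale": rationale,
    }


def stub_sub_agent(request: str) -> Reply:
    # Stand-in for a fine-tuned sub-agent: abstains on very short requests.
    if len(request.split()) < 3:
        return (None, "category_a", "The request is too short to interpret.")
    return ("Here is the result.", None, None)


policy = {"category_a": "ask_user_to_clarify"}
print(coordinate("refund?", stub_sub_agent, policy))
print(coordinate("show my last order", stub_sub_agent, policy))
```

Keeping the policy as a plain mapping means recovery behavior can change without retraining the sub-agent, which is one plausible reading of why the rationale is returned to the coordinator rather than handled locally.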

Key insights

Framing sub-agent explanatory abstention as inter-agent communication improves multi-agent system reliability: in the reported production deployment, the response pass rate rose from 68.5% to 78.9%.

Method

EARS has two stages. An offline calibration stage builds a human-annotated seed dataset and calibrates the LLM-as-a-Judge models. An online data flywheel then uses these judges to curate interaction data for supervised fine-tuning of the sub-agents, which enables iterative improvement.
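
The two stages can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the EARS pipeline: the summary does not say how judges are calibrated or how the ensemble aggregates votes, so the sketch assumes calibration means keeping judges whose agreement with the human seed labels clears a threshold, and assumes aggregation is a vote with a minimum count. All function names and the labels `"answer"` and `"abstain"` are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Judge = Callable[[str], str]   # maps an interaction transcript to a label
Labeled = Tuple[str, str]      # (transcript, label)


def judge_agreement(judge: Judge, seed: Sequence[Labeled]) -> float:
    """Fraction of human-annotated seed examples the judge labels identically."""
    if not seed:
        raise ValueError("seed dataset must not be empty")
    hits = sum(1 for transcript, label in seed if judge(transcript) == label)
    return hits / len(seed)


def select_judges(
    judges: Dict[str, Judge], seed: Sequence[Labeled], min_agreement: float = 0.8
) -> Dict[str, Judge]:
    """Offline stage (assumed form): keep judges that agree with humans often enough."""
    return {
        name: judge
        for name, judge in judges.items()
        if judge_agreement(judge, seed) >= min_agreement
    }


def ensemble_label(
    judges: Sequence[Judge], transcript: str, min_votes: int
) -> Optional[str]:
    """Return the top-voted label, or None if it has fewer than min_votes votes."""
    votes = Counter(judge(transcript) for judge in judges)
    if not votes:
        return None
    label, count = votes.most_common(1)[0]
    return label if count >= min_votes else None


def curate(
    judges: Sequence[Judge], transcripts: Sequence[str], min_votes: int
) -> List[Labeled]:
    """Online stage (assumed form): keep only transcripts the ensemble labels confidently."""
    curated: List[Labeled] = []
    for transcript in transcripts:
        label = ensemble_label(judges, transcript, min_votes)
        if label is not None:
            curated.append((transcript, label))
    return curated
```

Setting `min_votes` above half the ensemble size guarantees that a tie can never produce a label, so ambiguous transcripts are dropped from the fine-tuning set rather than labeled arbitrarily.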

Best for: AI Architect, Research Scientist, CTO, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.