ACAR: Adaptive Complexity Routing for Multi-Model Ensembles with Auditable Decision Traces
Summary
ACAR (Adaptive Complexity & Attribution Routing) is a measurement framework for multi-model orchestration that uses self-consistency variance ($\\sigma$) to route tasks across single-model, two-model, and three-model execution modes. Evaluated across 1,510 tasks from four benchmarks (MathArena, Reasoning Gym, LiveCodeBench, SuperGPQA) using Claude Sonnet 4, GPT-4o, and Gemini 2.0 Flash, ACAR-U (without retrieval) achieved 55.6% accuracy, surpassing the two-model baseline (54.4%) while avoiding full ensembling on 54.2% of tasks. The mechanism is model-agnostic and requires no learned components. Notably, retrieval augmentation *decreased* accuracy by 3.4 percentage points due to low semantic alignment (median similarity 0.167), and "agreement-but-wrong" scenarios (where models agree on incorrect answers) bounded achievable accuracy 8 percentage points below full ensembling. The framework also found that attribution estimates based on proxy signals showed weak correlation with ground-truth values.
Key takeaway
For AI Architects designing multi-model LLM systems, ACAR's findings suggest prioritizing heuristic-based routing over learned classifiers for auditability and stability. You should implement self-consistency variance for adaptive compute allocation, as it improves accuracy over fixed two-model ensembles while reducing full ensemble usage. Critically, avoid naive retrieval augmentation without high semantic similarity thresholds, and recognize that "agreement-but-wrong" scenarios will inherently limit maximum achievable accuracy, necessitating strategies beyond simple ensembling for those cases.
Key insights
Self-consistency variance can effectively route multi-model ensembles, but retrieval and proxy attribution often fail.
Principles
- Self-consistency routing has fundamental limits.
- Retrieval requires semantic alignment.
- Attribution requires counterfactuals.
Method
ACAR routes tasks based on self-consistency variance ($\\sigma$) derived from N=3 probe samples, mapping variance to single-agent, arena-lite, or full-arena execution modes. This heuristic avoids learned components for auditability.
In practice
- Use similarity thresholds >0.7 for retrieval augmentation.
- Expect an accuracy ceiling from "agreement-but-wrong" failures.
- Implement explicit counterfactuals for practical attribution.
Topics
- Adaptive Complexity Routing
- Multi-Model Ensembles
- Self-Consistency Variance
- Auditable AI Systems
- Retrieval Augmentation
Code references
Best for: AI Scientist, Research Scientist, AI Architect, AI Researcher, Machine Learning Engineer, MLOps Engineer
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.LG updates on arXiv.org.