When Does Combining Language Models Help? A Co-Failure Ceiling on Routing, Voting, and Mixture-of-Agents Across 67 Frontier Models

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

The article examines when combining large language models (LLMs) in multi-model systems such as routing, voting, and mixture-of-agents actually improves accuracy. It shows that the accuracy such systems can reach is capped by a "co-failure ceiling" of `1 - beta`, where `beta` is the rate at which all models fail on the same query. This often-overlooked `beta` is a more informative diagnostic than the commonly reported average pairwise error correlation (`rho`), which cannot accurately predict all-wrong rates. Across 67 frontier models from 21 providers, the observed `beta` on open-ended mathematics was 0.052, well above the 0.023 predicted by a Gaussian copula model; the study reports this as a 2.5-fold underpricing of co-failure (90% CI 1.7 to 3.4). A similar effect appeared on execution-graded code, where `beta` was 0.079. Re-asking GPQA-Diamond questions in free-response format raised `beta` to 0.127, suggesting that co-failure is tied to answer format rather than subject matter. The study concludes that gains come primarily from models failing on different questions, not from simply adding more models, and that a combined system rarely surpasses the single best model without a strong query-level routing signal.
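To make the copula baseline concrete, here is a minimal sketch of how a one-factor Gaussian copula turns per-model error rates and a single shared correlation `rho` into a predicted all-wrong rate. It is an illustration under stated assumptions, not the study's implementation: the function name `copula_all_wrong_rate`, the single common `rho`, the integration grid, and the example error rates are all hypothetical.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

_STD = NormalDist()  # standard normal, stdlib only


def copula_all_wrong_rate(error_rates, rho, grid=4001, z_max=8.0):
    """Predicted P(all models wrong) under a one-factor Gaussian copula.

    Assumed model: model i fails when sqrt(rho)*Z + sqrt(1-rho)*eps_i < t_i,
    with Z and eps_i independent standard normals and
    t_i = Phi^{-1}(error_rates[i]). Each error rate must lie strictly in (0, 1).
    """
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must be in [0, 1)")
    thresholds = [_STD.inv_cdf(e) for e in error_rates]
    step = 2.0 * z_max / (grid - 1)
    total = 0.0
    for j in range(grid):
        z = -z_max + j * step
        # Conditional on the shared factor z, failures are independent.
        joint = 1.0
        for t in thresholds:
            joint *= _STD.cdf((t - sqrt(rho) * z) / sqrt(1.0 - rho))
        weight = 0.5 if j in (0, grid - 1) else 1.0  # trapezoid rule
        total += weight * joint * exp(-0.5 * z * z) / sqrt(2.0 * pi)
    return total * step


# Hypothetical error rates for three models (not taken from the study).
predicted_beta = copula_all_wrong_rate([0.3, 0.4, 0.5], rho=0.5)
print(f"copula-predicted all-wrong rate: {predicted_beta:.4f}")
```

Comparing a prediction like this with the empirically observed `beta` is one way to check whether a correlation-based model underprices co-failure on a given benchmark.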

Key takeaway

AI architects designing multi-model LLM systems should first measure the co-failure rate (`beta`) of their chosen models. This metric, not pairwise error correlation (`rho`) alone, bounds the accuracy gain achievable. Before investing in complex routing or voting mechanisms, calculate `1 - beta` to set realistic expectations for system performance. Then focus on selecting models that fail on different types of questions and on developing strong query-level routing signals to maximize the ensemble's effectiveness.
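As a rough illustration of that first step, the sketch below computes `beta`, the ceiling `1 - beta`, and the headroom over the best single model from a per-query correctness table. The function name `co_failure_summary`, the dictionary keys, and the toy data are assumptions made for this example.

```python
def co_failure_summary(correct):
    """Summarize co-failure for a pool of models.

    correct[q][m] is True when model m answered query q correctly.
    """
    n_queries = len(correct)
    n_models = len(correct[0])
    # beta: fraction of queries on which every model is wrong.
    all_wrong = sum(1 for row in correct if not any(row))
    beta = all_wrong / n_queries
    per_model = [
        sum(row[m] for row in correct) / n_queries for m in range(n_models)
    ]
    best_single = max(per_model)
    return {
        "beta": beta,
        "ceiling": 1.0 - beta,  # accuracy of a perfect per-query router
        "best_single": best_single,
        "headroom": (1.0 - beta) - best_single,
    }


# Toy correctness table: 5 queries x 3 models (made-up data).
results = [
    [True, True, False],
    [True, False, False],
    [False, False, False],
    [False, True, True],
    [True, True, True],
]
summary = co_failure_summary(results)
print(summary)
```

The `headroom` value is the most any routing or voting scheme could add over the best single model on this query set; if it is small, a more elaborate combiner is unlikely to pay off.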

Key insights

Combining LLMs offers accuracy gains capped by `1 - beta`, where `beta` is the rate at which all models fail on the same query.

Method

The study used a Clopper-Pearson bound on `beta` to certify potential gains. It applied a tetrachoric-calibrated single-factor model across 67 frontier models to analyze co-failure rates on mathematics and code tasks.
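A Clopper-Pearson bound of this kind can be sketched with the standard library alone. The version below computes a one-sided exact upper bound on `beta` by bisection on the binomial CDF, then converts it into a lower bound on the ceiling `1 - beta`. Whether the study uses a one-sided or two-sided interval, and at what level, is not stated here, so `alpha=0.05`, the function names, and the example counts are assumptions.

```python
from math import exp, lgamma, log


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), with terms computed in log space."""
    if p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 1.0 if k >= n else 0.0
    total = 0.0
    for i in range(k + 1):
        log_term = (
            lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
            + i * log(p) + (n - i) * log(1.0 - p)
        )
        total += exp(log_term)
    return min(total, 1.0)


def clopper_pearson_upper(k, n, alpha=0.05):
    """One-sided exact upper bound on a proportion after k hits in n trials."""
    if k >= n:
        return 1.0
    lo, hi = k / n, 1.0
    for _ in range(100):  # bisection: binom_cdf(k, n, p) decreases in p
        mid = 0.5 * (lo + hi)
        if binom_cdf(k, n, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical counts: 26 all-wrong queries out of 500 (not from the study).
beta_upper = clopper_pearson_upper(26, 500)
certified_ceiling = 1.0 - beta_upper
print(f"beta <= {beta_upper:.4f}, so the ceiling is at least {certified_ceiling:.4f}")
```

Because the bound is exact rather than asymptotic, it stays valid when all-wrong queries are rare, which is the regime where a normal approximation tends to be least reliable.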

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.