When Agents Disagree: The Selection Bottleneck in Multi-Agent LLM Pipelines
Summary
A new study introduces the "selection bottleneck" model to reconcile contradictory findings on how team diversity affects output quality in multi-agent LLM pipelines. The model proposes a crossover threshold, $s^{*}$, on the quality of the aggregation mechanism: it determines whether diversity helps or hurts. In an experiment across 42 tasks in 7 categories ($N=210$), a diverse team with judge-based selection achieved a 0.810 win rate against a single-model baseline, significantly outperforming a homogeneous team's 0.512 win rate. Judge-based selection also outperformed MoA-style synthesis by a win rate difference of +0.631; synthesis lost to the single-model baseline on all 42 tasks. Exploratory evidence suggests that including a weaker model such as Claude Haiku can improve performance while reducing cost ($p<10^{-4}$). The findings indicate that, in generate-then-select pipelines, selector quality is a more critical design factor than generator diversity.
Key takeaway
AI architects designing multi-agent LLM pipelines should shift their focus from merely assembling diverse models to rigorously optimizing the selection mechanism. Judge-based selection significantly outperforms synthesis-based aggregation; with synthesis, even diverse teams underperform. Prioritize developing robust selectors, because selector quality determines whether team diversity becomes an asset or a liability. Also consider that a weaker, cheaper model can, paradoxically, improve overall team quality and reduce cost when paired with a strong selector.
Key insights
Selector quality, not just team diversity, dictates multi-agent LLM pipeline performance.
Principles
- Diversity helps when selector quality exceeds a crossover threshold $s^{*}$.
- Synthesis-based aggregation operates at low selector quality, negating diversity benefits.
- Homogeneous teams offer no exploitable diversity for selection.
Method
The selection bottleneck model defines selector quality $s$ and derives a crossover threshold $s^{*}$ to predict when diverse teams outperform homogeneous ones, based on team mean and oracle quality.
In practice
- Prioritize selection mechanism design over increasing generator diversity.
- Consider judge-based selection for open-ended generation tasks.
- Experiment with including weaker, cheaper models in diverse teams.
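The recommendations above can be sketched as a generate-then-select loop. This is a hedged illustration, not the study's implementation: `generate_then_select`, the stub generators, and the length-based `toy_judge` are all hypothetical stand-ins for real model calls and a real judge model.

```python
from typing import Callable, Sequence

Generator = Callable[[str], str]     # prompt -> candidate answer
Judge = Callable[[str, str], float]  # (prompt, candidate) -> score, higher is better


def generate_then_select(
    prompt: str, generators: Sequence[Generator], judge: Judge
) -> str:
    """Collect one candidate per generator and return the judge's top pick.

    Unlike synthesis, the chosen candidate is returned unmodified.
    """
    if not generators:
        raise ValueError("at least one generator is required")
    candidates = [generate(prompt) for generate in generators]
    scores = [judge(prompt, candidate) for candidate in candidates]
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index]


# Hypothetical stubs standing in for three different models.
stub_generators = [
    lambda prompt: "short",
    lambda prompt: "a much longer, more detailed answer",
    lambda prompt: "medium answer",
]


def toy_judge(prompt: str, candidate: str) -> float:
    # Placeholder scoring rule (answer length); a real judge would be an LLM call.
    return float(len(candidate))


selected = generate_then_select("Explain X.", stub_generators, toy_judge)
print(selected)
```

The design point is that the judge is a pluggable component: swapping in a stronger judge changes pipeline quality without touching the generators, which is where the summary suggests the effort should go.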
Topics
- Multi-Agent LLM Pipelines
- Selection Bottleneck Model
- Aggregation Mechanisms
- LLM Evaluation
- Team Diversity
Best for: AI Architect, AI Engineer, Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.MA updates on arXiv.org.