Nine Judges, Two Effective Votes: Correlated Errors Undermine LLM Evaluation Panels
Summary
A new framework measures the true informational value of LLM-as-a-judge panels, revealing that their reliability falls significantly short of an independent-voting ideal. Testing a panel of 9 frontier LLMs from 7 model families on three natural language inference datasets, each with 100 human annotations per item, researchers found these 9 judges effectively provide only about 2 independent votes' worth of information. Approximately three-quarters of the panel's nominal independence is lost because models make the same mistakes on the same items. This correlation causes the panel's actual accuracy to fall 8-22 percentage points short of what independent voting would achieve, with the best single judge often matching or outperforming the full panel. Neither adding more judges nor using smarter aggregation algorithms substantially helps, closing at most 11% of this gap. The deficit is robust across prompt variants, temperatures, chain-of-thought reasoning, and a pairwise preference task (RewardBench), indicating correlated judges are the bottleneck.
Key takeaway
Machine learning engineers designing LLM evaluation systems should critically assess how independent their judge panels really are. A large panel whose judges share the same errors carries far less information than its nominal size suggests, so its reported accuracy is likely to mislead. To get reliable results, prioritize genuinely diverse evaluation approaches, or a single high-performing judge, over adding more similar LLM judges.
Key insights
LLM-as-a-judge panels suffer from highly correlated errors, which reduce 9 judges to about 2 effective independent votes and undermine evaluation reliability.
Principles
- LLM evaluation panels often lack true independence.
- Correlated errors are a primary bottleneck in LLM evaluation.
- Scaling judge panels does not substitute for genuine independence.
Method
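The study developed a framework with two parts: the Kish effective sample size (n_eff) quantifies how many independent votes' worth of information a panel carries, and a Condorcet null model (independent voting) gives the accuracy the panel would reach without correlated errors, so the shortfall against it measures the reliability deficit.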
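As a rough illustration of the n_eff idea, the sketch below uses the design-effect form commonly attributed to Kish, n_eff = n / (1 + (n - 1) * rho_bar), where rho_bar is the mean pairwise correlation of the judges' per-item error indicators. This form and the function names `pearson`, `kish_n_eff`, and `effective_judges` are assumptions made for illustration; the summary does not state the paper's exact estimator.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        # A judge whose error indicator never varies gives no correlation signal.
        return 0.0
    return cov / (vx * vy) ** 0.5


def kish_n_eff(n, rho_bar):
    """Assumed design-effect form: n judges with mean pairwise correlation rho_bar."""
    return n / (1 + (n - 1) * rho_bar)


def effective_judges(errors):
    """errors[j][i] is 1 if judge j was wrong on item i, else 0."""
    n = len(errors)
    rho_bar = mean(pearson(a, b) for a, b in combinations(errors, 2))
    return kish_n_eff(n, rho_bar)
```

Under this assumed form, a 9-judge panel with a mean pairwise error correlation of 0.4375 gives `kish_n_eff(9, 0.4375) == 2.0`, which is consistent in scale with the roughly 2 effective votes reported, although the paper's own correlation values are not given here.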
In practice
- Prioritize diverse model architectures for evaluation panels.
- Validate panel independence using n_eff or similar metrics.
- Focus on reducing correlated errors over adding more judges.
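One way to act on these points is to compare a panel's observed majority-vote accuracy with what a Condorcet-style independent-voting baseline would predict from the judges' individual accuracies. The sketch below is a simplification under stated assumptions: each judgment is scored only as right or wrong, the panel decides by strict majority, and the function names are hypothetical rather than taken from the paper.

```python
def condorcet_accuracy(accuracies):
    """P(strict majority is correct) if judges erred independently.

    accuracies[j] is judge j's individual accuracy.
    """
    # dist[k] = probability that exactly k of the judges seen so far are correct
    dist = [1.0]
    for p in accuracies:
        nxt = [0.0] * (len(dist) + 1)
        for k, prob in enumerate(dist):
            nxt[k] += prob * (1 - p)   # this judge is wrong
            nxt[k + 1] += prob * p     # this judge is right
        dist = nxt
    need = len(accuracies) // 2 + 1
    return sum(dist[need:])


def majority_accuracy(correct):
    """Observed strict-majority accuracy; correct[j][i] is 1 if judge j was right on item i."""
    n_judges = len(correct)
    n_items = len(correct[0])
    need = n_judges // 2 + 1
    hits = sum(
        1 for i in range(n_items)
        if sum(row[i] for row in correct) >= need
    )
    return hits / n_items


def independence_gap(correct):
    """Independent-voting baseline minus observed panel accuracy."""
    accs = [sum(row) / len(row) for row in correct]
    return condorcet_accuracy(accs) - majority_accuracy(correct)
```

A gap near zero suggests the judges behave roughly independently on the evaluated items; a large positive gap suggests they are making the same mistakes on the same items, so adding more similar judges is unlikely to help.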
Topics
- LLM Evaluation
- Correlated Errors
- Judge Panels
- Model Reliability
- Natural Language Inference
- Effective Sample Size
Best for: Research Scientist, AI Engineer, AI Scientist, NLP Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.