Nine Judges, Two Effective Votes: Correlated Errors Undermine LLM Evaluation Panels

Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A new framework measures the true informational value of LLM-as-a-judge panels and shows that their reliability falls well short of an independent-voting ideal. The researchers tested a panel of 9 frontier LLMs from 7 model families on three natural language inference datasets, each with 100 human annotations per item, and found that the 9 judges provide only about 2 independent votes' worth of information. Approximately three-quarters of the panel's nominal independence is lost because the models make the same mistakes on the same items. This correlation leaves the panel's actual accuracy 8 to 22 percentage points below what independent voting would achieve, and the best single judge often matches or outperforms the full panel. Neither adding more judges nor using smarter aggregation algorithms helps substantially: together they close at most 11% of the gap. The deficit is robust across prompt variants, temperatures, chain-of-thought reasoning, and a pairwise preference task (RewardBench), indicating that correlation among judges is the bottleneck.

Key takeaway

If you are a machine learning engineer designing LLM evaluation systems, critically assess how independent your judge panel really is. Relying on a large number of LLM judges without addressing correlated errors will likely yield misleading accuracy metrics, because the panel's effective information value is far lower than its nominal size. To get reliable results, prioritize genuinely diverse evaluation approaches or a single high-performing judge over simply adding more similar LLM judges.

Key insights

LLM-as-a-judge panels suffer from highly correlated errors, reducing 9 judges to effectively 2 independent votes and undermining evaluation reliability.

Method

The study developed a framework using Kish effective sample size (n_eff) and a Condorcet null model to quantify informational value and reliability deficits in LLM evaluation panels.
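
To make the two quantities concrete, here is a minimal Python sketch. It is not the study's implementation: the function names are hypothetical, and it assumes a correlation-based form of effective sample size (n squared divided by the sum of the pairwise error-correlation matrix, which reduces to n / (1 + (n - 1)ρ) when every pair of judges shares the same correlation ρ) and a binary correct/incorrect view of each judgment. The Condorcet null is taken to be the accuracy of a simple majority vote when judges err independently with the same per-judge accuracy.

```python
from math import comb

import numpy as np


def effective_sample_size(errors: np.ndarray) -> float:
    """Correlation-based effective number of judges (assumed form).

    errors: (items, judges) array of 0/1, where 1 means the judge was wrong.
    Every judge must be wrong on some items and right on others; a constant
    column has zero variance and makes the correlation undefined (NaN).
    """
    corr = np.corrcoef(errors, rowvar=False)  # judges-by-judges matrix
    n = corr.shape[0]
    return float(n * n / corr.sum())


def equicorrelated_n_eff(n: int, rho: float) -> float:
    """Special case: every pair of judges has the same error correlation rho."""
    return n / (1 + (n - 1) * rho)


def condorcet_majority_accuracy(p: float, n: int) -> float:
    """Majority-vote accuracy for n independent judges, each correct with
    probability p (n odd, so ties cannot occur)."""
    k_min = n // 2 + 1  # smallest number of correct votes that wins
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))
```

Under these assumptions, `equicorrelated_n_eff(9, 0.4375)` returns 2.0, so an average pairwise error correlation of roughly 0.44 would be enough to shrink nine judges to two effective votes. Comparing a panel's observed majority accuracy with `condorcet_majority_accuracy(p, 9)` at the judges' mean accuracy `p` gives the kind of shortfall from independent voting that the summary describes.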

Best for: Research Scientist, AI Engineer, AI Scientist, NLP Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.