Resolution Diagnostics for Paired LLM Evaluation
Summary
An analysis of public LLM leaderboards finds that many displayed pairwise rankings lack statistical resolution under their actual paired evaluation designs. Specifically, 11 of 40 Open LLM Leaderboard v1 comparisons and 4 of 9 MMLU-Pro top-10 adjacent-rank pairs are unresolved at (alpha, 1-beta) = (0.05, 0.8). The MMLU-Pro count rises to 6 of 9 once real subject-level clustering is taken into account. The study frames paired LLM evaluation as a hypothesis-testing problem and proposes a per-pair resolution ratio q = N/N* as the primary diagnostic. It shows that the widely used unpaired Cohen-h-plus-(1-rho) shortcut deviates from the correct N* by approximately a factor of two, a deficit inherited by three of five off-the-shelf calculators. The unresolved pattern persists under multiplicity correction and under anytime-valid sequential testing.
Key takeaway
AI scientists and machine learning engineers evaluating LLMs should critically assess the statistical resolution of pairwise comparisons, especially on public leaderboards. Do not rely on the unpaired Cohen-h-plus-(1-rho) shortcut for sample size calculations, as it can misestimate the required sample size N* by roughly a factor of two. Instead, adopt a hypothesis-testing framework and report the proposed per-pair resolution ratio q = N/N* to keep LLM comparisons statistically sound.
Key insights
Many pairwise rankings on LLM leaderboards are statistically unresolved: the evaluation designs lack the power to separate the models being compared, and the power calculations commonly used to check this are flawed.
Principles
- Paired LLM evaluation is a hypothesis-testing problem.
- Unpaired power shortcuts can severely misestimate required sample sizes.
- Real-world clustering impacts resolution targets.
Method
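Frame paired LLM evaluation as hypothesis testing, invert the level-alpha, power-(1-beta) test to obtain the required sample size N*, and report the per-pair resolution ratio q = N/N* (available evaluation items N over required items N*) as the primary diagnostic. A pair with q below 1 is unresolved at the chosen (alpha, 1-beta).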
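As a rough illustration of the method, the sketch below inverts a two-sided McNemar-type test for paired binary (correct/incorrect) outcomes using a standard normal-approximation sample-size formula. This is an assumed stand-in, not necessarily the exact test the study inverts. The function names and the inputs `p10` and `p01` (the proportions of items where only model A, or only model B, is correct) are hypothetical and chosen for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist


def required_n_paired(p10, p01, alpha=0.05, power=0.8):
    """Approximate N* for a two-sided McNemar-type test on paired binary outcomes.

    p10: assumed probability that model A is correct and model B is wrong.
    p01: assumed probability that model A is wrong and model B is correct.
    Uses a normal-approximation formula; an illustrative assumption only.
    """
    delta = p10 - p01   # paired accuracy difference
    psi = p10 + p01     # total discordant proportion
    if delta == 0:
        return float("inf")  # no effect to detect: no finite N* exists
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n = (z_alpha * sqrt(psi) + z_beta * sqrt(psi - delta ** 2)) ** 2 / delta ** 2
    return ceil(n)


def resolution_ratio(n_items, p10, p01, alpha=0.05, power=0.8):
    """q = N / N*: values below 1 mean the pair is unresolved at (alpha, power)."""
    return n_items / required_n_paired(p10, p01, alpha, power)


if __name__ == "__main__":
    # Hypothetical pair: 10% of items favour A only, 5% favour B only.
    print(required_n_paired(0.10, 0.05))
    print(resolution_ratio(1000, 0.10, 0.05))
```

Note that N* here depends on the discordant proportions rather than on the two marginal accuracies alone, which is why a calculation built for unpaired designs can land on a different number.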
In practice
- Re-evaluate existing LLM leaderboard rankings for resolution.
- Avoid unpaired Cohen-h-plus-(1-rho) shortcuts for N* calculation.
- Account for subject-level clustering in evaluation design.
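One common way to account for subject-level clustering is a Kish-style design effect that shrinks the effective sample size. The sketch below assumes this approach, which may differ from the clustering adjustment the study applies. The intraclass correlation `icc`, the per-subject item counts, and `n_star` are hypothetical inputs, and the size-weighted mean cluster size is an approximation for unequal clusters.

```python
def design_effect(cluster_sizes, icc):
    """Approximate design effect 1 + (m_eff - 1) * icc.

    m_eff is the size-weighted mean cluster size, sum(m^2) / sum(m),
    which reduces to m when all clusters have the same size m.
    """
    total = sum(cluster_sizes)
    m_eff = sum(m * m for m in cluster_sizes) / total
    return 1.0 + (m_eff - 1.0) * icc


def clustered_resolution_ratio(cluster_sizes, icc, n_star):
    """q computed on the effective sample size N / DEFF instead of the raw N."""
    n_items = sum(cluster_sizes)
    n_effective = n_items / design_effect(cluster_sizes, icc)
    return n_effective / n_star


if __name__ == "__main__":
    # Hypothetical benchmark: 14 subjects of 100 items each, icc = 0.02,
    # and an illustrative required sample size N* = 700.
    sizes = [100] * 14
    print(design_effect(sizes, 0.02))
    print(clustered_resolution_ratio(sizes, 0.02, 700))
```

With any positive `icc`, the clustered q is smaller than the naive N/N*, so a pair that looks resolved when items are treated as independent can become unresolved once clustering is included.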
Topics
- LLM Evaluation
- Statistical Significance
- Hypothesis Testing
- Power Analysis
- Leaderboard Rankings
- MMLU-Pro
- Open LLM Leaderboard
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.