Resolution Diagnostics for Paired LLM Evaluation

2026-05-28 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

An analysis of public LLM leaderboards finds that many displayed pairwise rankings lack statistical resolution under their actual paired evaluation designs. At significance level alpha = 0.05 and power 1-beta = 0.8, 11 of 40 Open LLM Leaderboard v1 comparisons and 4 of 9 MMLU-Pro top-10 adjacent-rank pairs are unresolved. The MMLU-Pro count rises to 6 of 9 once real subject-level clustering is taken into account. The study frames paired LLM evaluation as a hypothesis-testing problem and proposes a per-pair resolution ratio q = N/N*, the number of items actually evaluated divided by the number required, as the primary diagnostic. It also shows that the widely used unpaired Cohen-h-plus-(1-rho) shortcut deviates from the correct N* by approximately a factor of two, an error inherited by three of five off-the-shelf calculators. The unresolved pattern persists under multiplicity correction and under anytime-valid sequential testing.

Key takeaway

AI scientists and machine learning engineers evaluating LLMs should check the statistical resolution of pairwise comparisons before trusting them, especially on public leaderboards. Do not rely on the unpaired Cohen-h-plus-(1-rho) shortcut for sample size calculations, since it misestimates the required N* by roughly a factor of two. Instead, treat each comparison as a hypothesis test and report the per-pair resolution ratio q = N/N* to show whether the benchmark can actually separate the two models.

Key insights

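Many pairwise rankings on LLM leaderboards are statistically unresolved: the benchmark has fewer items than the paired test requires at the stated level and power, and the power calculations commonly used to check this are themselves flawed.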

Principles

Paired LLM evaluation is a hypothesis-testing problem.
Unpaired power shortcuts can severely misestimate required sample sizes.
Real-world clustering impacts resolution targets.

Method

Frame paired LLM evaluation as hypothesis testing, invert the level-alpha, power-(1-beta) test to obtain the required number of items N*, and report the per-pair resolution ratio q = N/N* as the primary diagnostic.
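
The method can be illustrated with a minimal sketch. It assumes a two-sided paired z-test on per-item correctness with the textbook normal approximation; this is not necessarily the exact test the study inverts. The function names, the example accuracies, the correlation rho, and the reading of q >= 1 as "resolved" are all assumptions made for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist


def paired_n_star(p_a, p_b, rho, alpha=0.05, power=0.8):
    """Items needed for a two-sided paired z-test to detect p_a - p_b.

    p_a, p_b: accuracies of the two models on the benchmark.
    rho: correlation between the two models' per-item correctness.
    Normal-approximation sketch, not the study's exact procedure.
    """
    delta = p_a - p_b
    if delta == 0:
        raise ValueError("p_a and p_b must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    var_a = p_a * (1 - p_a)
    var_b = p_b * (1 - p_b)
    # Variance of the per-item difference in correctness (x_a - x_b).
    var_d = var_a + var_b - 2 * rho * sqrt(var_a * var_b)
    return ceil((z_alpha + z_beta) ** 2 * var_d / delta ** 2)


def resolution_ratio(n_items, n_star):
    """q = N / N*; under this sketch, q < 1 marks the pair as unresolved."""
    return n_items / n_star


# Hypothetical pair: 72% vs 70% accuracy, item-level correlation 0.6,
# evaluated on a hypothetical 1,000-item benchmark.
n_star = paired_n_star(0.72, 0.70, rho=0.6)
q = resolution_ratio(1000, n_star)
print(f"N* = {n_star}, q = {q:.2f}")
```

In this hypothetical case the required N* is several thousand items, so a 1,000-item benchmark gives q well below 1 and the displayed ordering of the two models would not be resolved.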

In practice

Re-evaluate existing LLM leaderboard rankings for resolution.
Avoid unpaired Cohen-h-plus-(1-rho) shortcuts for N* calculation.
Account for subject-level clustering in evaluation design.
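
To see why clustering pushes pairs toward unresolved, here is a minimal sketch using the standard Kish design effect, 1 + (m - 1) * ICC, for clusters of roughly equal size m. This is a textbook approximation used for illustration, not the study's clustering adjustment; the function names and all numbers are hypothetical.

```python
def design_effect(mean_cluster_size, icc):
    """Kish design effect for roughly equal-sized clusters.

    icc: intraclass correlation of per-item score differences
    within a cluster (for example, within a subject).
    """
    return 1 + (mean_cluster_size - 1) * icc


def clustered_resolution_ratio(n_items, n_star, mean_cluster_size, icc):
    """Resolution ratio after shrinking N to its effective sample size."""
    n_eff = n_items / design_effect(mean_cluster_size, icc)
    return n_eff / n_star


# Hypothetical benchmark: 4,000 items in subjects of about 100 items each,
# with N* = 3,500 computed under an independence assumption.
q_naive = 4000 / 3500
q_clustered = clustered_resolution_ratio(4000, 3500, mean_cluster_size=100, icc=0.01)
print(f"q ignoring clusters = {q_naive:.2f}, q with clusters = {q_clustered:.2f}")
```

Even a small within-subject correlation roughly halves the effective sample size in this example, which is enough to move a pair from q above 1 to q below 1.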

Topics

LLM Evaluation
Statistical Significance
Hypothesis Testing
Power Analysis
Leaderboard Rankings
MMLU-Pro
Open LLM Leaderboard

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.