Bounded Difference Concentration for Infinitely Exchangeable Sequences with Applications to AI Benchmark Uncertainty
Summary
The paper introduces a bounded-difference concentration inequality for infinitely exchangeable random variables. It decomposes deviation into conditional sampling fluctuation and latent mixture fluctuation. For zero-sum linear contrasts, such as subsample-vs-full mean comparisons, the latent mixture term cancels exactly, yielding a tight Hoeffding-type bound. This framework is applied to quantify uncertainty in composite AI benchmarks like MMLU, where question items naturally exhibit exchangeable dependence across domains. The results provide a domain-stratified hierarchical model for bounding the uncertainty of accuracy scores and a distribution-free, cost-saving statistical guarantee for accurately estimating full benchmark scores from random subsets. The analysis uses Gemma 2 2B, Gemma 2 9B, Gemma 3 4B, Qwen3 0.6B, Qwen3 1.7B, and Qwen3 8B models on the 14,042-question MMLU test set.
Key takeaway
AI scientists and research scientists who evaluate large language models on composite benchmarks like MMLU should not assume that benchmark items are independent. Applying the bounded-difference concentration framework for exchangeable variables instead yields confidence intervals that account for dependence among items, and it supports cost-saving evaluation by estimating full benchmark scores from random subsets. Zero-sum contrasts are especially useful here because they eliminate the latent mixture term.
Key insights
Bounded-difference concentration for infinitely exchangeable sequences decomposes the deviation into a conditional sampling fluctuation and a latent mixture fluctuation; the mixture term cancels exactly for zero-sum contrasts.
Principles
- Infinitely exchangeable sequences with finite variance have nonnegative pairwise correlations (zero in the i.i.d. special case).
- Zero-sum linear contrasts eliminate the latent mixture fluctuation term.
- Standard independence assumptions are statistically implausible for composite AI benchmarks.
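The first two principles can be checked numerically. The following is a minimal simulation sketch, not taken from the paper: it assumes a hypothetical Beta-Bernoulli model in which a latent accuracy `theta` is drawn once per trial and the items are conditionally i.i.d. given `theta`. The function name, parameters, and defaults are all illustrative.

```python
import random


def _variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def simulate_exchangeable(trials=4000, N=200, n=50, a=2.0, b=2.0, seed=0):
    """Simulate an exchangeable 0/1 sequence via a Beta(a, b) mixture.

    Assumption (illustrative only): theta ~ Beta(a, b) plays the role of
    the latent mixing variable, and X_1..X_N are Bernoulli(theta) given theta.
    """
    rng = random.Random(seed)
    full_means, contrasts, x1, x2 = [], [], [], []
    for _ in range(trials):
        theta = rng.betavariate(a, b)
        xs = [1 if rng.random() < theta else 0 for _ in range(N)]
        full = sum(xs) / N
        # By exchangeability, the first n items are distributed like a
        # random subsample of size n.
        sub = sum(xs[:n]) / n
        full_means.append(full)
        contrasts.append(sub - full)  # zero-sum linear contrast
        x1.append(xs[0])
        x2.append(xs[1])
    m1 = sum(x1) / trials
    m2 = sum(x2) / trials
    cov12 = sum((u - m1) * (v - m2) for u, v in zip(x1, x2)) / trials
    return {
        "cov_x1_x2": cov12,                      # about Var(theta), nonnegative
        "var_full_mean": _variance(full_means),  # includes the mixture term
        "var_contrast": _variance(contrasts),    # mixture term cancels
    }


if __name__ == "__main__":
    for name, value in simulate_exchangeable().items():
        print(f"{name}: {value:.4f}")
```

Under this assumed model, the covariance between two items estimates the variance of `theta`, which is why it cannot be negative, while the variance of the subsample-minus-full contrast stays small because the shared latent term drops out.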
Method
A bounded-difference concentration inequality is derived by conditioning on the de Finetti directing measure, decomposing deviation into conditional sampling and latent mixture fluctuations.
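The decomposition described above can be written out as follows. This is a sketch under assumptions not stated in the summary: items $X_1, \dots, X_N$ take values in $[0,1]$, $\theta$ denotes the de Finetti directing measure, $m(\theta) = \mathbb{E}[X_1 \mid \theta]$, and the conditional step uses the standard McDiarmid bounded-difference inequality. The paper's exact statement and constants may differ.

```latex
% Assumed notation (not from the summary): X_i \in [0,1], \theta is the
% directing measure, f has bounded differences d_i, m(\theta) = E[X_1 | \theta].
\begin{align*}
f(X) - \mathbb{E} f(X)
  &= \underbrace{f(X) - \mathbb{E}[f(X) \mid \theta]}_{\text{conditional sampling fluctuation}}
   + \underbrace{\mathbb{E}[f(X) \mid \theta] - \mathbb{E} f(X)}_{\text{latent mixture fluctuation}}, \\
\Pr\bigl( \lvert f(X) - \mathbb{E}[f(X) \mid \theta] \rvert \ge t \bigm| \theta \bigr)
  &\le 2 \exp\!\Bigl( -\frac{2 t^2}{\sum_{i} d_i^2} \Bigr)
  \quad \text{(McDiarmid, since the $X_i$ are i.i.d.\ given $\theta$)}.
\end{align*}
% Zero-sum linear contrast: f(X) = \sum_i c_i X_i with \sum_i c_i = 0, so d_i = |c_i|.
\begin{align*}
\mathbb{E}[f(X) \mid \theta] &= m(\theta) \sum_{i=1}^{N} c_i = 0
  \quad \Longrightarrow \quad \text{the latent mixture term vanishes}, \\
\Pr\bigl( \lvert f(X) \rvert \ge t \bigr)
  &\le 2 \exp\!\Bigl( -\frac{2 t^2}{\sum_{i} c_i^2} \Bigr).
\end{align*}
% Subsample-vs-full mean: c_i = 1/n - 1/N for i \le n and c_i = -1/N for i > n.
\begin{align*}
\sum_{i=1}^{N} c_i^2
  &= n \Bigl( \frac{1}{n} - \frac{1}{N} \Bigr)^{2} + \frac{N - n}{N^2}
   = \frac{1}{n} - \frac{1}{N}, \\
\Pr\bigl( \lvert \bar{X}_n - \bar{X}_N \rvert \ge t \bigr)
  &\le 2 \exp\!\Bigl( -\frac{2 t^2}{1/n - 1/N} \Bigr).
\end{align*}
```

Because the conditional bound for the contrast does not depend on $\theta$, it holds unconditionally, which is the sense in which the bound is of Hoeffding type without any independence assumption.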
In practice
- Quantify uncertainty in composite AI benchmarks like MMLU.
- Estimate full benchmark scores accurately from random subsets.
- Use domain-stratified hierarchical models for accuracy scores.
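The subset-estimation use case above can be sketched in a few lines. This is an illustration only: it uses the classical Hoeffding bound for 0/1 outcomes, which also holds for sampling without replacement, rather than the paper's own constants, which the summary does not give. The function names and the `outcomes` list of per-question correctness flags are hypothetical.

```python
import math
import random


def hoeffding_half_width(n, delta):
    """Half-width t with P(|subset mean - full mean| >= t) <= delta.

    Generic Hoeffding bound for values in [0, 1]; conservative and
    not the paper's exact inequality.
    """
    if n <= 0 or not 0.0 < delta < 1.0:
        raise ValueError("need n > 0 and 0 < delta < 1")
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def min_subset_size(eps, delta):
    """Smallest n for which the Hoeffding half-width is at most eps."""
    if eps <= 0 or not 0.0 < delta < 1.0:
        raise ValueError("need eps > 0 and 0 < delta < 1")
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


def estimate_from_subset(outcomes, n, delta=0.05, seed=0):
    """Estimate full-benchmark accuracy from a random subset of size n.

    `outcomes` is a hypothetical list of 0/1 correctness flags, one per
    benchmark question. Returns (estimate, lower, upper), clipped to [0, 1].
    """
    rng = random.Random(seed)
    subset = rng.sample(outcomes, n)  # sampling without replacement
    estimate = sum(subset) / n
    half_width = hoeffding_half_width(n, delta)
    return estimate, max(0.0, estimate - half_width), min(1.0, estimate + half_width)


if __name__ == "__main__":
    # Synthetic placeholder data, not real model results.
    rng = random.Random(1)
    outcomes = [1 if rng.random() < 0.6 else 0 for _ in range(14042)]
    print(estimate_from_subset(outcomes, n=1000))
    print(min_subset_size(eps=0.05, delta=0.05))
```

Under this generic bound, `min_subset_size(0.05, 0.05)` returns 738: a random subset of that size pins down the full-benchmark accuracy to within 5 percentage points with probability at least 95%, regardless of how large the full benchmark is.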
Topics
- Exchangeability
- Concentration Inequalities
- AI Benchmark Uncertainty
- MMLU
- De Finetti Theorem
- Bounded Differences
Best for: AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.