Bounded Difference Concentration for Infinitely Exchangeable Sequences with Applications to AI Benchmark Uncertainty

2026-06-17 · Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

The paper introduces a bounded-difference concentration inequality for infinitely exchangeable random variables. The inequality decomposes the deviation into a conditional sampling fluctuation and a latent mixture fluctuation. For zero-sum linear contrasts, such as comparisons between a subsample mean and the full mean, the latent mixture term cancels exactly, yielding a tight Hoeffding-type bound. The framework is applied to quantify uncertainty in composite AI benchmarks such as MMLU, where question items naturally exhibit exchangeable dependence across domains. The paper presents a domain-stratified hierarchical model for bounding the uncertainty of accuracy scores, along with a distribution-free statistical guarantee for estimating full benchmark scores from random subsets, which reduces evaluation cost. The analysis uses the Gemma 2 2B, Gemma 2 9B, Gemma 3 4B, Qwen3 0.6B, Qwen3 1.7B, and Qwen3 8B models on the 14,042-question MMLU test set.

Key takeaway

AI scientists and research scientists who evaluate large language models on composite benchmarks like MMLU should drop conventional independence assumptions. Applying the bounded-difference concentration framework for exchangeable variables gives a more defensible quantification of uncertainty. It supports robust confidence intervals and lower-cost evaluation by estimating full benchmark scores from subsets, especially when zero-sum contrasts are used to eliminate the latent mixture term.

Key insights

The bounded-difference concentration inequality for infinitely exchangeable sequences decomposes fluctuations into a conditional sampling term and a latent mixture term; the mixture term cancels for zero-sum contrasts.
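
A small simulation can make this concrete. The sketch below is an illustration only, not the paper's experiment: it assumes a hypothetical Beta(2, 2) latent accuracy `theta` with Bernoulli item scores, and the function name `simulate` and its parameters are invented here. It compares the variance of the full mean, which inherits the latent mixture fluctuation, with the variance of the zero-sum contrast (subsample mean minus full mean), where that fluctuation drops out.

```python
import random
import statistics


def simulate(n_total=500, m=50, reps=2000, seed=0):
    """Return (variance of full mean, variance of subsample-minus-full contrast).

    Assumed toy model: theta ~ Beta(2, 2), then items are i.i.d. Bernoulli(theta)
    given theta, which makes the sequence exchangeable but not independent.
    """
    rng = random.Random(seed)
    full_means = []
    contrasts = []
    for _ in range(reps):
        theta = rng.betavariate(2.0, 2.0)  # latent mixing variable
        xs = [1 if rng.random() < theta else 0 for _ in range(n_total)]
        full = sum(xs) / n_total
        # By exchangeability, the first m items behave like any fixed subset.
        sub = sum(xs[:m]) / m
        full_means.append(full)
        contrasts.append(sub - full)
    return statistics.pvariance(full_means), statistics.pvariance(contrasts)


var_full, var_contrast = simulate()
print(f"variance of full mean: {var_full:.4f}")
print(f"variance of contrast:  {var_contrast:.4f}")
```

Under this toy model the full mean varies mostly because `theta` varies from draw to draw, while the contrast varies only through conditional sampling noise.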

Principles

Infinitely exchangeable sequences exhibit nonnegative pairwise correlations.
Zero-sum linear contrasts eliminate the latent mixture fluctuation term.
Standard independence assumptions are statistically implausible for composite AI benchmarks.
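
The first two principles can be checked in a few lines. The sketch below assumes square-integrable variables that are conditionally i.i.d. given a directing measure (de Finetti's theorem); the symbols \theta, \mu(\theta), and c_i are notation introduced here and need not match the paper's.

```latex
% Conditional mean given the directing measure \theta:
%   \mu(\theta) = \mathbb{E}[X_1 \mid \theta].
% Law of total covariance, with X_i and X_j conditionally independent for i \ne j:
\operatorname{Cov}(X_i, X_j)
  = \mathbb{E}\bigl[\operatorname{Cov}(X_i, X_j \mid \theta)\bigr]
    + \operatorname{Cov}\bigl(\mathbb{E}[X_i \mid \theta],\, \mathbb{E}[X_j \mid \theta]\bigr)
  = 0 + \operatorname{Var}\bigl(\mu(\theta)\bigr) \;\ge\; 0 .

% Zero-sum linear contrast: weights c_1, \dots, c_n with \sum_i c_i = 0.
\mathbb{E}\Bigl[\sum_{i=1}^{n} c_i X_i \,\Bigm|\, \theta\Bigr]
  = \mu(\theta) \sum_{i=1}^{n} c_i
  = 0 .
```

The covariance is strictly positive unless \mu(\theta) is almost surely constant. The conditional mean of a zero-sum contrast does not depend on \theta, which is why the latent mixture term disappears for such contrasts.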

Method

A bounded-difference concentration inequality is derived by conditioning on the de Finetti directing measure, decomposing deviation into conditional sampling and latent mixture fluctuations.
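
The decomposition can be written out as follows. This is a sketch under stated assumptions, not the paper's exact statement: f is taken to be a function of X_1, ..., X_n with bounded-difference constants d_1, ..., d_n (notation introduced here), and the conditional step uses the classical McDiarmid inequality. The summary does not give the paper's constants or its treatment of the mixture term, so none are claimed here.

```latex
% Split the deviation by conditioning on the directing measure \theta:
f - \mathbb{E} f
  = \underbrace{f - \mathbb{E}[f \mid \theta]}_{\text{conditional sampling fluctuation}}
  + \underbrace{\mathbb{E}[f \mid \theta] - \mathbb{E} f}_{\text{latent mixture fluctuation}} .

% Given \theta the X_i are i.i.d., so McDiarmid's inequality applies conditionally:
\mathbb{P}\bigl( \lvert f - \mathbb{E}[f \mid \theta] \rvert \ge t \,\bigm|\, \theta \bigr)
  \;\le\; 2 \exp\!\Bigl( - \frac{2 t^{2}}{\sum_{i=1}^{n} d_i^{2}} \Bigr) .
```

Because the right-hand side does not depend on \theta, the same bound holds after averaging over \theta. Whenever the second term in the decomposition vanishes, it becomes a bound on the full deviation.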

In practice

Quantify uncertainty in composite AI benchmarks like MMLU.
Estimate full benchmark scores accurately from random subsets.
Use domain-stratified hierarchical models for accuracy scores.
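
For the subset-estimation use case, a back-of-the-envelope calculator might look like the sketch below. It assumes item scores in [0, 1] and a subset of size m chosen without looking at the outcomes, and it applies the standard Hoeffding inequality conditionally on the latent mixing variable. The contrast (subset mean minus full mean) puts weight 1/m - 1/N on the m subset items and -1/N on the rest, so the squared weights sum to 1/m - 1/N. The function names `contrast_halfwidth` and `min_subset_size` are hypothetical, and the paper's own constants may differ.

```python
import math


def contrast_halfwidth(m, n_total, delta):
    """Half-width t such that |subset mean - full mean| <= t with prob. >= 1 - delta.

    Assumes scores in [0, 1]; uses the two-sided Hoeffding bound
    2 * exp(-2 t^2 / (1/m - 1/N)) for the zero-sum contrast.
    """
    if not 0 < m <= n_total:
        raise ValueError("need 0 < m <= n_total")
    if not 0 < delta < 1:
        raise ValueError("need 0 < delta < 1")
    sum_sq_weights = 1.0 / m - 1.0 / n_total
    return math.sqrt(sum_sq_weights * math.log(2.0 / delta) / 2.0)


def min_subset_size(eps, n_total, delta):
    """Smallest m whose half-width is at most eps (capped at n_total)."""
    if eps <= 0:
        raise ValueError("need eps > 0")
    # Solve (1/m - 1/N) * log(2/delta) / 2 <= eps^2 for m.
    inv_m = 1.0 / n_total + 2.0 * eps ** 2 / math.log(2.0 / delta)
    return min(n_total, math.ceil(1.0 / inv_m))


# Example: the 14,042-question MMLU test set, 95% confidence.
N_MMLU = 14042
m_needed = min_subset_size(eps=0.02, n_total=N_MMLU, delta=0.05)
print(m_needed, contrast_halfwidth(m_needed, N_MMLU, delta=0.05))
```

The bound controls how far a subset score can sit from the full-benchmark score. It says nothing about how far the full-benchmark score is from a model's underlying accuracy, which is where the latent mixture term matters.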

Topics

Exchangeability
Concentration Inequalities
AI Benchmark Uncertainty
MMLU
De Finetti Theorem
Bounded Differences

Best for: AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.