Bounded Difference Concentration for Infinitely Exchangeable Sequences with Applications to AI Benchmark Uncertainty

Source: stat.ML updates on arXiv.org · Field: Technology & Digital, Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

The paper introduces a bounded-difference concentration inequality for infinitely exchangeable random variables. It decomposes deviation into conditional sampling fluctuation and latent mixture fluctuation. For zero-sum linear contrasts, such as subsample-vs-full mean comparisons, the latent mixture term cancels exactly, yielding a tight Hoeffding-type bound. This framework is applied to quantify uncertainty in composite AI benchmarks like MMLU, where question items naturally exhibit exchangeable dependence across domains. The results provide a domain-stratified hierarchical model for bounding the uncertainty of accuracy scores and a distribution-free, cost-saving statistical guarantee for accurately estimating full benchmark scores from random subsets. The analysis uses Gemma 2 2B, Gemma 2 9B, Gemma 3 4B, Qwen3 0.6B, Qwen3 1.7B, and Qwen3 8B models on the 14,042-question MMLU test set.

Key takeaway

AI scientists and research scientists who evaluate large language models on composite benchmarks like MMLU should not assume that benchmark items are independent. Applying the bounded-difference concentration framework for exchangeable variables gives uncertainty estimates that account for dependence between items. It supports confidence intervals on accuracy scores and cheaper evaluations that estimate the full benchmark score from a random subset. The subset case is the most favorable one: a subsample-vs-full mean comparison is a zero-sum contrast, so the latent mixture term cancels.
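As a rough illustration of the subset-estimation idea, the sketch below computes a Hoeffding-type half-width for the gap between a subsample mean and the full-benchmark mean, then checks it on a simulated exchangeable sequence. It is a minimal sketch under stated assumptions, not the paper's exact inequality or constants: it applies the classical weighted Hoeffding bound conditionally on a latent accuracy, and the function names, the Beta(2, 2) latent distribution, and the subset size are all hypothetical choices made for illustration.

```python
import math
import random


def zero_sum_weights(n, m):
    """Weights w_i such that sum(w_i * x_i) = mean(x[:m]) - mean(x)."""
    return [1.0 / m - 1.0 / n] * m + [-1.0 / n] * (n - m)


def hoeffding_half_width(n, m, delta):
    """Half-width t with P(|subsample mean - full mean| >= t) <= delta.

    Assumes per-item scores lie in [0, 1]. Uses the classical Hoeffding
    bound with sum(w_i ** 2) = 1/m - 1/n for the weights above.
    This is an illustrative stand-in, not the paper's stated bound.
    """
    return math.sqrt((1.0 / m - 1.0 / n) * math.log(2.0 / delta) / 2.0)


def simulate_gap(n, m, rng):
    """Draw one exchangeable 0/1 sequence and return subsample mean - full mean.

    Assumption: a latent accuracy theta ~ Beta(2, 2) is drawn first, then
    items are i.i.d. Bernoulli(theta) given theta (a de Finetti mixture).
    """
    theta = rng.betavariate(2.0, 2.0)
    x = [1 if rng.random() < theta else 0 for _ in range(n)]
    return sum(x[:m]) / m - sum(x) / n


if __name__ == "__main__":
    n, m, delta = 14042, 1000, 0.05  # n matches the MMLU test set size
    t = hoeffding_half_width(n, m, delta)
    print(f"half-width for m={m}, n={n}, delta={delta}: {t:.4f}")

    rng = random.Random(0)
    gaps = [simulate_gap(2000, 200, rng) for _ in range(300)]
    t_small = hoeffding_half_width(2000, 200, delta)
    exceed = sum(abs(g) >= t_small for g in gaps) / len(gaps)
    print(f"empirical exceedance at n=2000, m=200: {exceed:.3f}")
```

Because the weights sum to zero, the latent accuracy drops out of the conditional mean of the gap, so the same half-width holds whatever distribution the latent accuracy follows. The bound is distribution-free and therefore conservative; the simulated exceedance rate is typically far below `delta`.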

Key insights

The bounded-difference concentration inequality for infinitely exchangeable sequences splits a deviation into a conditional sampling fluctuation and a latent mixture fluctuation. For zero-sum contrasts, the mixture term cancels exactly.

Principles

Method

A bounded-difference concentration inequality is derived by conditioning on the de Finetti directing measure, decomposing deviation into conditional sampling and latent mixture fluctuations.
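The decomposition can be written out as follows. This is a sketch in illustrative notation that is not taken from the paper: $\mu$ denotes the de Finetti directing measure, $f$ a bounded-difference function of $X = (X_1, \dots, X_n)$, and $w_i$ the contrast weights; the items are assumed bounded so that all expectations exist.

```latex
% Given \mu, the variables X_1, X_2, \dots are i.i.d. with law \mu.
\begin{align*}
f(X) - \mathbb{E} f(X)
  &= \underbrace{f(X) - \mathbb{E}[f(X) \mid \mu]}_{\text{conditional sampling fluctuation}}
   + \underbrace{\mathbb{E}[f(X) \mid \mu] - \mathbb{E} f(X)}_{\text{latent mixture fluctuation}} \\
\intertext{For a linear contrast $f(X) = \sum_{i=1}^{n} w_i X_i$ with $\sum_{i=1}^{n} w_i = 0$:}
\mathbb{E}[f(X) \mid \mu]
  &= \Big( \sum_{i=1}^{n} w_i \Big) \int x \, \mu(dx) = 0 .
\end{align*}
```

Since $\mathbb{E}[f(X) \mid \mu] = 0$ for every $\mu$, the unconditional mean $\mathbb{E} f(X)$ is also zero and the latent mixture term vanishes. Only the conditional sampling term remains, and it can be controlled by a bound for independent variables applied conditionally on $\mu$.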

Best for: AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.