In-Context Learning Is Provably Bayesian Inference: A Generalization Theory for Meta-Learning

2025-09-18 · Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Expert, extended

Summary

This paper introduces a finite-sample statistical theory for in-context learning (ICL) within a meta-learning framework that accommodates diverse task types. It proposes a risk decomposition that separates the total ICL risk into two orthogonal components: the Bayes Gap and the Posterior Variance. The Bayes Gap quantifies how well a trained uniform-attention Transformer approximates the Bayes-optimal in-context predictor. A non-asymptotic upper bound makes its dependence on the number of pretraining prompts (N) and their context length (p) explicit, with a rate proportional to m/(pN), where m reflects model expressiveness. The Posterior Variance is a model-independent term representing intrinsic task uncertainty: it is determined by the difficulty of the true task, while the uncertainty about which task is in play vanishes exponentially fast after only a few in-context examples. The theory also characterizes ICL stability under input-distribution shifts, showing that the Bayes Gap incurs a penalty proportional to the Wasserstein distance between the distributions. Together, these results suggest that a Transformer selects the optimal meta-algorithm during pretraining and rapidly converges to the optimal algorithm for the true task at test time.

Key takeaway

For AI Scientists and Research Scientists aiming to improve ICL performance, the theory identifies two levers. Scaling the pretraining data (the product pN of context length p and number of prompts N) shrinks the Bayes Gap, bringing the model closer to the Bayes-optimal predictor. At inference time, the number of in-context examples (k) determines how quickly task uncertainty is resolved, allowing the model to identify and adapt to the true task. Input-distribution shifts primarily affect the Bayes Gap, so deployments on shifted inputs may call for domain adaptation.
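The claim that a few in-context examples suffice to identify the true task can be illustrated with a toy simulation. This is a minimal sketch under assumptions that are not taken from the paper: two hypothetical task types with equal prior weight, task A generating y ~ N(+mu, 1) and task B generating y ~ N(-mu, 1), with A as the true task. The function name and all parameters are illustrative.

```python
import math
import random


def wrong_task_posterior_mass(k, mu=1.0, n_trials=20_000, seed=0):
    """Average posterior probability of the wrong task (B) after k examples.

    Toy setting (an assumption, not the paper's): the context is drawn from
    task A, y_i ~ N(+mu, 1); the alternative task B has y_i ~ N(-mu, 1);
    both tasks have prior weight 1/2.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        # Sum of k in-context observations drawn from the true task A.
        s = sum(rng.gauss(mu, 1.0) for _ in range(k))
        # Log-likelihood ratio log p_A(context) / p_B(context) = 2 * mu * s.
        llr = 2.0 * mu * s
        # Equal priors give P(B | context) = 1 / (1 + exp(llr)).
        # The clamp only guards against overflow for very large llr.
        total += 1.0 / (1.0 + math.exp(min(llr, 700.0)))
    return total / n_trials


if __name__ == "__main__":
    for k in (0, 1, 2, 4, 8, 16):
        print(k, wrong_task_posterior_mass(k))
```

In this toy model the average mass on the wrong task is at most 0.5 * exp(-k * mu^2 / 2), so it starts at 0.5 with no context and decays geometrically as k grows. The paper's setting is more general; the sketch only shows the qualitative behavior.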

Key insights

ICL is provably Bayesian inference, with risk decomposing into model approximation (Bayes Gap) and intrinsic task uncertainty (Posterior Variance).

Principles

ICL risk orthogonally decomposes into Bayes Gap and Posterior Variance.
Posterior Variance is irreducible, determined by true task difficulty.
Bayes Gap depends on pretraining scale (pN) and model expressiveness (m).
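The orthogonal decomposition can be checked numerically on a toy conjugate-Gaussian problem. This is a sketch under stated assumptions rather than the paper's setting: each task is a scalar theta ~ N(0, tau^2), the context holds k noisy observations of theta, the Bayes-optimal predictor is the posterior mean, and the plain sample mean stands in for an imperfectly trained model. All names are hypothetical.

```python
import random


def risk_decomposition(k=5, tau=1.0, sigma=1.0, n_trials=100_000, seed=0):
    """Monte Carlo estimate of total risk, Bayes Gap, Posterior Variance.

    Toy model (an assumption): theta ~ N(0, tau^2), y_i = theta + noise with
    noise ~ N(0, sigma^2). The "model" predicts the sample mean; the
    Bayes-optimal predictor shrinks the sample mean toward the prior mean 0.
    """
    rng = random.Random(seed)
    # Shrinkage weight of the exact posterior mean in this conjugate model.
    w = tau**2 / (tau**2 + sigma**2 / k)
    total = gap = post_var = cross = 0.0
    for _ in range(n_trials):
        theta = rng.gauss(0.0, tau)
        ybar = sum(theta + rng.gauss(0.0, sigma) for _ in range(k)) / k
        model_pred = ybar        # stand-in for a trained model
        bayes_pred = w * ybar    # posterior mean E[theta | context]
        total += (model_pred - theta) ** 2
        gap += (model_pred - bayes_pred) ** 2
        post_var += (bayes_pred - theta) ** 2
        cross += (model_pred - bayes_pred) * (bayes_pred - theta)
    return {
        "total": total / n_trials,
        "bayes_gap": gap / n_trials,
        "posterior_variance": post_var / n_trials,
        "cross": cross / n_trials,
    }


if __name__ == "__main__":
    print(risk_decomposition())
```

With the default parameters the analytic values are sigma^2/k = 0.2 for the total risk and 1/6 for the posterior variance, leaving 1/30 for the Bayes Gap. The estimated cross term should be close to zero, which is what "orthogonal" means here.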

Method

The paper develops a Bayes-centric framework, decomposing ICL risk into Bayes Gap and Posterior Variance, then derives non-asymptotic upper bounds for each term.
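For squared loss, the orthogonality of the two terms follows from a standard tower-property argument. The notation below is assumed for illustration and may differ from the paper's: $D_k$ is the in-context data, $f$ the true task function, $\hat f$ the trained model (which depends only on pretraining data, $D_k$, and the query $x$), and $f^{\mathrm{Bayes}}$ the Bayes-optimal in-context predictor.

```latex
\begin{aligned}
f^{\mathrm{Bayes}}(x; D_k) &:= \mathbb{E}\left[ f(x) \mid D_k, x \right], \\
\mathbb{E}\left[ \big(\hat f - f\big)^2 \right]
&= \underbrace{\mathbb{E}\left[ \big(\hat f - f^{\mathrm{Bayes}}\big)^2 \right]}_{\text{Bayes Gap}}
 + \underbrace{\mathbb{E}\left[ \big(f^{\mathrm{Bayes}} - f\big)^2 \right]}_{\text{Posterior Variance}}
 + 2\,\mathbb{E}\left[ \big(\hat f - f^{\mathrm{Bayes}}\big)\big(f^{\mathrm{Bayes}} - f\big) \right], \\
\mathbb{E}\left[ \big(\hat f - f^{\mathrm{Bayes}}\big)\big(f^{\mathrm{Bayes}} - f\big) \right]
&= \mathbb{E}\left[ \big(\hat f - f^{\mathrm{Bayes}}\big)\,
   \mathbb{E}\left[ f^{\mathrm{Bayes}} - f \mid D_k, x \right] \right] = 0.
\end{aligned}
```

The inner conditional expectation is zero by the definition of $f^{\mathrm{Bayes}}$, so the cross term drops out and the two remaining terms can be bounded separately.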

In practice

Increase the number of pretraining prompts (N) and their context length (p) to reduce the Bayes Gap.
More in-context examples resolve task uncertainty exponentially fast; the remaining Posterior Variance reflects the difficulty of the true task.
Input distribution shifts primarily affect the Bayes Gap, scaling with Wasserstein distance.
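To get a feel for the size of an input shift, the Wasserstein distance between pretraining and deployment inputs can be estimated from samples. The sketch below is a minimal illustration for one-dimensional inputs and equal sample sizes. The summary does not say which order of Wasserstein distance the bound uses, so order 1 is an assumption, and the function name is hypothetical.

```python
import random


def wasserstein_1d(xs, ys):
    """Order-1 Wasserstein distance between two equal-size 1-D samples.

    For empirical distributions with the same number of points on the real
    line, the optimal coupling pairs the sorted values, so the distance is
    the mean absolute difference between order statistics.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch requires samples of equal size")
    if not xs:
        raise ValueError("samples must be non-empty")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical pretraining inputs and a mean-shifted deployment sample.
    pretrain_inputs = [rng.gauss(0.0, 1.0) for _ in range(5_000)]
    shifted_inputs = [rng.gauss(0.5, 1.0) for _ in range(5_000)]
    print(wasserstein_1d(pretrain_inputs, shifted_inputs))
```

For a pure translation of the input distribution by delta, the order-1 distance equals |delta|, so under the stated result the Bayes Gap penalty would grow in proportion to the size of the shift.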

Topics

In-Context Learning
Bayesian Inference
Meta-Learning
Transformer Architectures
Generalization Theory
Out-of-Distribution Stability

Best for: AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.