In-Context Learning Is Provably Bayesian Inference: A Generalization Theory for Meta-Learning
Summary
This paper introduces a finite-sample statistical theory for in-context learning (ICL) within a meta-learning framework that accommodates diverse task types. It proposes a risk decomposition that splits the total ICL risk into two orthogonal components: the Bayes Gap and the Posterior Variance. The Bayes Gap quantifies how well a trained uniform-attention Transformer approximates the Bayes-optimal in-context predictor. The paper gives a non-asymptotic upper bound on this gap that makes explicit its dependence on the number of pretraining prompts (N) and their context length (p), with a rate proportional to m/(pN), where m reflects model expressiveness. The Posterior Variance is a model-independent term that represents intrinsic task uncertainty; it decays exponentially fast, so a few in-context examples are enough to make it small. The theory also characterizes ICL stability under input-distribution shifts, showing that the Bayes Gap incurs a penalty proportional to the Wasserstein distance between the pretraining and test input distributions. This unified view suggests that Transformers select an optimal meta-algorithm during pretraining and rapidly converge to the optimal algorithm for the true task at test time.
Key takeaway
For AI Scientists and Research Scientists aiming to optimize ICL performance, this theory highlights two levers. Scaling pretraining data (N prompts, each of context length p, so the bound improves with the product pN) shrinks the Bayes Gap, meaning the model approximates the Bayes-optimal predictor more closely. The inference-time context length (k, as opposed to the pretraining context length p) controls the Posterior Variance: as k grows, the model identifies and adapts to the true task rapidly. Input-distribution shifts primarily affect the Bayes Gap, so shifted deployments call for careful domain adaptation.
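One way to get a feel for the size of an input-distribution shift is to estimate a Wasserstein distance between pretraining inputs and deployment inputs. The sketch below is illustrative only: it assumes scalar inputs, equal sample sizes, and the order-1 distance, none of which is confirmed by this summary, and the names `empirical_w1`, `pretrain_inputs`, and `shifted_inputs` are hypothetical.

```python
import random


def empirical_w1(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D empirical samples.

    For equal-size samples on the real line, the optimal coupling matches
    sorted values, so the distance is the mean absolute difference between
    the sorted samples.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("this sketch assumes non-empty, equal-size samples")
    xs_sorted, ys_sorted = sorted(xs), sorted(ys)
    return sum(abs(a - b) for a, b in zip(xs_sorted, ys_sorted)) / len(xs_sorted)


# Hypothetical data: standard-normal pretraining inputs and a copy shifted by 0.5.
rng = random.Random(0)
pretrain_inputs = [rng.gauss(0.0, 1.0) for _ in range(1000)]
shifted_inputs = [x + 0.5 for x in pretrain_inputs]

shift = empirical_w1(pretrain_inputs, shifted_inputs)
print(f"estimated W1 shift: {shift:.3f}")
```

A pure translation moves every sorted value by the same amount, so the estimate here equals the translation size up to floating-point error. For multivariate inputs a dedicated optimal-transport solver would be needed instead of sorting.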
Key insights
ICL is provably Bayesian inference, with risk decomposing into model approximation (Bayes Gap) and intrinsic task uncertainty (Posterior Variance).
Principles
- ICL risk orthogonally decomposes into Bayes Gap and Posterior Variance.
- Posterior Variance is model-independent: no predictor can reduce it, and it is determined by the intrinsic uncertainty of the true task.
- Bayes Gap depends on pretraining scale (pN) and model expressiveness (m).
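The orthogonal decomposition in these principles can be sketched as a short derivation. This is a reconstruction under assumptions, not the paper's exact statement: it assumes squared loss against the noiseless task function, and the symbols $\tau$ (task), $D_k$ (context of $k$ examples), $x$ (query), $f_\tau$ (task function), $M$ (trained model), and $f_{\mathrm{B}}$ (Bayes predictor) are notation chosen here for illustration.

```latex
% Bayes-optimal in-context predictor: the posterior mean of the target.
f_{\mathrm{B}}(D_k, x) = \mathbb{E}\left[ f_\tau(x) \mid D_k, x \right]

% Add and subtract f_B inside the squared error of the model M.
\mathbb{E}\left[ \big( M(D_k, x) - f_\tau(x) \big)^2 \right]
  = \underbrace{\mathbb{E}\left[ \big( M(D_k, x) - f_{\mathrm{B}}(D_k, x) \big)^2 \right]}_{\text{Bayes Gap}}
  + \underbrace{\mathbb{E}\left[ \big( f_{\mathrm{B}}(D_k, x) - f_\tau(x) \big)^2 \right]}_{\text{Posterior Variance}}
  + 2\, \mathbb{E}\left[ \big( M - f_{\mathrm{B}} \big) \big( f_{\mathrm{B}} - f_\tau(x) \big) \right]

% The cross term vanishes by the tower property, because M - f_B is a
% function of (D_k, x) and f_B is the conditional mean of f_tau(x).
\mathbb{E}\left[ \big( M - f_{\mathrm{B}} \big)\, \mathbb{E}\left[ f_{\mathrm{B}} - f_\tau(x) \mid D_k, x \right] \right] = 0

% The second term is the expected conditional variance of the target.
\mathbb{E}\left[ \big( f_{\mathrm{B}}(D_k, x) - f_\tau(x) \big)^2 \right]
  = \mathbb{E}\left[ \operatorname{Var}\big( f_\tau(x) \mid D_k, x \big) \right]
```

Only the first term involves the model $M$, which is why the second term is described as model-independent.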
Method
The paper develops a Bayes-centric framework, decomposing ICL risk into Bayes Gap and Posterior Variance, then derives non-asymptotic upper bounds for each term.
In practice
- Increase the number of pretraining prompts (N) and their context length (p) to reduce the Bayes Gap.
- More in-context examples at inference time reduce the Posterior Variance exponentially fast.
- Input distribution shifts primarily affect the Bayes Gap, scaling with Wasserstein distance.
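The fast decay of the Posterior Variance with the number of in-context examples can be illustrated with a toy simulation. This is a minimal sketch under strong assumptions that are not taken from the paper: a finite family of 1-D linear tasks y = w*x + noise, a uniform prior over the slopes, known Gaussian noise, and an exact Bayesian posterior instead of a Transformer. The function name `posterior_variance_curve` and all parameter values are hypothetical.

```python
import math
import random


def posterior_variance_curve(num_tasks=8, max_context=32, noise_std=0.5,
                             trials=200, seed=0):
    """Average posterior variance of the prediction at query x = 1.

    Entry k of the returned list is the variance after k in-context
    examples, averaged over `trials` randomly drawn true tasks.
    """
    rng = random.Random(seed)
    # Candidate task slopes, spaced by 1 and centred at 0.
    slopes = [i - (num_tasks - 1) / 2 for i in range(num_tasks)]
    totals = [0.0] * (max_context + 1)
    for _ in range(trials):
        true_w = rng.choice(slopes)
        log_post = [0.0] * num_tasks  # uniform prior (unnormalised log-weights)
        for k in range(max_context + 1):
            # Normalise the posterior in a numerically stable way.
            top = max(log_post)
            weights = [math.exp(lp - top) for lp in log_post]
            z = sum(weights)
            probs = [w / z for w in weights]
            # At x = 1 the prediction of task w is w itself.
            mean = sum(p * w for p, w in zip(probs, slopes))
            totals[k] += sum(p * (w - mean) ** 2 for p, w in zip(probs, slopes))
            # Observe one more in-context example and update the posterior.
            x = rng.gauss(0.0, 1.0)
            y = true_w * x + rng.gauss(0.0, noise_std)
            for j, w in enumerate(slopes):
                log_post[j] -= (y - w * x) ** 2 / (2 * noise_std ** 2)
    return [t / trials for t in totals]


curve = posterior_variance_curve()
for k in (0, 1, 2, 4, 8, 16, 32):
    print(f"k={k:2d}  average posterior variance={curve[k]:.6f}")
```

In this toy setting the posterior mass on wrong tasks shrinks geometrically as examples accumulate, so the printed variances fall off quickly. The Bayes Gap is absent here by construction, because the exact posterior is used in place of a trained model.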
Topics
- In-Context Learning
- Bayesian Inference
- Meta-Learning
- Transformer Architectures
- Generalization Theory
- Out-of-Distribution Stability
Best for: AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.