In-Context Learning Is Provably Bayesian Inference: A Generalization Theory for Meta-Learning

· Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Expert, extended

Summary

This paper develops a finite-sample statistical theory of in-context learning (ICL) within a meta-learning framework that accommodates diverse task types. Its core is a risk decomposition that splits the total ICL risk into two orthogonal components: the Bayes Gap and the Posterior Variance. The Bayes Gap measures how well a trained uniform-attention Transformer approximates the Bayes-optimal in-context predictor; a non-asymptotic upper bound makes its dependence on the number of pretraining prompts N and their context length p explicit, with a rate proportional to m/(pN). The Posterior Variance is model-independent: it represents intrinsic task uncertainty and vanishes exponentially fast with only a few in-context examples. The theory also characterizes ICL stability under input-distribution shift, showing that the Bayes Gap incurs a penalty proportional to the Wasserstein distance. Taken together, the results suggest that Transformers select an optimal meta-algorithm during pretraining and rapidly converge to the optimal algorithm for the true task at test time.
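
To see why a decomposition of this kind has no cross term, here is a short worked derivation. It assumes squared loss and uses notation that is not taken from the paper: D denotes the prompt (context examples plus query x), f_θ the regression function of the true task θ, M the trained model, and f_B(D) = E[f_θ(x) | D] the Bayes-optimal predictor. The paper's own definitions and loss may differ.

```latex
\begin{align*}
R(M) &= \mathbb{E}\big[(M(D) - f_\theta(x))^2\big] \\
     &= \underbrace{\mathbb{E}\big[(M(D) - f_B(D))^2\big]}_{\text{Bayes Gap}}
      + \underbrace{\mathbb{E}\big[(f_B(D) - f_\theta(x))^2\big]}_{\text{Posterior Variance}}
      + 2\,\mathbb{E}\big[(M(D) - f_B(D))\,(f_B(D) - f_\theta(x))\big], \\
\mathbb{E}\big[(M(D) - f_B(D))\,(f_B(D) - f_\theta(x))\big]
     &= \mathbb{E}\Big[(M(D) - f_B(D))\,\mathbb{E}\big[f_B(D) - f_\theta(x) \,\big|\, D\big]\Big] = 0 .
\end{align*}
```

The inner conditional expectation is zero because f_B(D) is by definition the posterior mean of f_θ(x) given D, which is what makes the two components orthogonal.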

Key takeaway

For AI Scientists and Research Scientists aiming to improve ICL performance, the theory points to two levers. First, scale the pretraining data, that is, the number of prompts N and their context length p, which enter the bound through the product pN; this shrinks the Bayes Gap and brings the model closer to the Bayes-optimal predictor. Second, inference-time context length (k) drives down the Posterior Variance, letting the model identify and adapt to the true task. Input-distribution shift mainly affects the Bayes Gap, with a penalty that grows with the Wasserstein distance of the shift, so deployments on shifted inputs call for careful domain adaptation.
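
As a rough way to gauge how large an input shift is, the sketch below computes the empirical 1-Wasserstein distance between two one-dimensional samples of equal size. It is an illustration only: the function name `wasserstein_1d` is hypothetical, and the 1-D, equal-sample-size setting is an assumption made for simplicity rather than the setting analyzed in the paper.

```python
import numpy as np


def wasserstein_1d(a, b):
    """Empirical 1-Wasserstein distance between two equal-size 1-D samples.

    For equal-size samples on the real line, the optimal coupling matches
    sorted values, so the distance is the mean absolute difference of the
    order statistics.
    """
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    if a.shape != b.shape or a.ndim != 1:
        raise ValueError("this sketch assumes two 1-D samples of equal size")
    return float(np.mean(np.abs(a - b)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pretrain_inputs = rng.standard_normal(1000)
    test_inputs = pretrain_inputs + 0.5  # a pure location shift of 0.5
    print(wasserstein_1d(pretrain_inputs, test_inputs))
```

For a pure location shift, the distance equals the size of the shift, which gives a quick sanity check before applying the same idea to real inputs.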

Key insights

ICL is provably Bayesian inference, with risk decomposing into model approximation (Bayes Gap) and intrinsic task uncertainty (Posterior Variance).

Method

The paper develops a Bayes-centric framework, decomposing ICL risk into Bayes Gap and Posterior Variance, then derives non-asymptotic upper bounds for each term.
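
The decomposition can be checked numerically on a toy problem. The sketch below is not the paper's setup: it assumes Bayesian linear regression with a standard Gaussian prior over tasks, Gaussian inputs and noise, squared loss, and a ridge predictor with a mismatched regularizer standing in for an imperfectly trained model. The function name `icl_risk_terms` and all parameters are hypothetical. In this Gaussian toy the posterior variance decays only polynomially in the context length k, so the example illustrates the decomposition itself, not the rates stated in the paper.

```python
import numpy as np


def icl_risk_terms(k, lam, d=5, sigma=0.5, n_tasks=20000, seed=0):
    """Monte Carlo estimates of (total risk, Bayes gap, posterior variance).

    k   : number of in-context examples per prompt
    lam : ridge regularizer of the stand-in "model"; lam == sigma**2
          recovers the Bayes-optimal predictor under the N(0, I) prior
    """
    rng = np.random.default_rng(seed)
    theta = rng.standard_normal((n_tasks, d))      # one task per prompt
    X = rng.standard_normal((n_tasks, k, d))       # context inputs
    noise = sigma * rng.standard_normal((n_tasks, k))
    y = np.einsum("nkd,nd->nk", X, theta) + noise  # context labels
    x_q = rng.standard_normal((n_tasks, d))        # query inputs

    gram = np.einsum("nkd,nke->nde", X, X)          # X^T X for each prompt
    xty = np.einsum("nkd,nk->nd", X, y)[..., None]  # X^T y for each prompt

    def ridge_predict(reg):
        # Batched solve of (X^T X + reg * I) w = X^T y, then predict at x_q.
        w = np.linalg.solve(gram + reg * np.eye(d), xty)[..., 0]
        return np.einsum("nd,nd->n", w, x_q)

    bayes = ridge_predict(sigma**2)  # posterior-mean prediction
    model = ridge_predict(lam)       # stand-in for the trained model
    target = np.einsum("nd,nd->n", theta, x_q)  # noise-free true-task output

    total = np.mean((model - target) ** 2)
    gap = np.mean((model - bayes) ** 2)
    post_var = np.mean((bayes - target) ** 2)
    return total, gap, post_var


if __name__ == "__main__":
    for k in (10, 20, 40):
        total, gap, post_var = icl_risk_terms(k=k, lam=5.0)
        print(f"k={k:2d}  total={total:.4f}  gap+post_var={gap + post_var:.4f}")
```

Up to Monte Carlo error, the total risk should match the sum of the two terms, and setting `lam` equal to `sigma**2` drives the gap to zero while leaving the posterior variance unchanged, which is the sense in which the second term is model-independent.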

Best for: AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.