Optimism Stabilizes Thompson Sampling for Adaptive Inference
Summary
The paper "Optimism Stabilizes Thompson Sampling for Adaptive Inference" addresses the challenge of valid statistical inference in multi-armed bandits (MAB) when data are collected adaptively by Thompson Sampling (TS). Classical asymptotic theory fails because arm-specific sample sizes are random and coupled with the rewards. The authors identify "optimism" as a key mechanism for restoring "stability," a sufficient condition for valid asymptotic inference that requires each arm's pull count to concentrate around a deterministic scale. They prove that two optimistic TS variants achieve stability for any K-armed Gaussian bandit (K≥2), including scenarios with multiple optimal arms. The first variant is variance-inflated TS (extending Halder et al., 2025); the second is TS with an explicit mean bonus. Both methods incur only a mild additional regret cost, on the order of (log log T)² over the classical O(log T) rate, while enabling standard Wald-type confidence intervals.
Key takeaway
For research scientists designing adaptive experiments or online A/B tests: if you use Thompson Sampling, build optimism into it to ensure valid statistical inference. Adding variance inflation or a mean bonus to your TS algorithm stabilizes arm pull counts, which makes Wald-type confidence intervals reliable. You can then interpret results from adaptively collected data with confidence, at the cost of a mild, controlled increase in regret.
Key insights
Optimism, implemented via variance inflation or mean bonus, stabilizes Thompson Sampling for valid adaptive inference in multi-armed bandits.
Principles
- Adaptive data collection invalidates classical asymptotic inference.
- Stability means each arm's pull count concentrates around a deterministic scale.
- Optimism can be injected through variance or mean adjustments.
Method
Two methods: 1) Inflate the posterior sampling variance (σ(ℬ)>1). 2) Add an explicit mean bonus Bₗ,ₜ := √(2β(ℬ) log T / Nₗ,ₜ) to the posterior mean. Both modify Gaussian indices.
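The two index modifications can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's exact algorithm: rewards are Gaussian with known unit variance, the unmodified posterior for an arm is taken to be N(mean_hat, 1/n_pulls), N in the bonus formula is read as the arm's pull count, and the inflation factor and β are treated as fixed constants. All function and parameter names (`variance_inflated_index`, `mean_bonus_index`, `run_optimistic_ts`, `inflation`, `beta`) are hypothetical.

```python
import math
import random


def variance_inflated_index(mean_hat, n_pulls, inflation, rng):
    """Sample an index from N(mean_hat, inflation / n_pulls).

    Assumption: inflation > 1 is a fixed constant that widens the posterior.
    """
    return rng.gauss(mean_hat, math.sqrt(inflation / n_pulls))


def mean_bonus_index(mean_hat, n_pulls, beta, horizon, rng):
    """Sample an index from N(mean_hat + bonus, 1 / n_pulls).

    bonus = sqrt(2 * beta * log(horizon) / n_pulls), following the formula above.
    """
    bonus = math.sqrt(2.0 * beta * math.log(horizon) / n_pulls)
    return rng.gauss(mean_hat + bonus, math.sqrt(1.0 / n_pulls))


def run_optimistic_ts(true_means, horizon, index_fn, seed=0):
    """Run a Gaussian bandit (unit reward variance) with the given index rule.

    index_fn(mean_hat, n_pulls, rng) returns the sampled index for one arm.
    Returns the per-arm pull counts and sample means. Assumes horizon >= K.
    """
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k
    sums = [0.0] * k

    # Pull every arm once so each sample mean is defined.
    for arm in range(k):
        sums[arm] += rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1

    for _ in range(k, horizon):
        indices = [index_fn(sums[a] / counts[a], counts[a], rng) for a in range(k)]
        arm = max(range(k), key=indices.__getitem__)  # play the largest index
        sums[arm] += rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1

    return counts, [sums[a] / counts[a] for a in range(k)]


if __name__ == "__main__":
    T = 2000
    inflated = lambda m, n, rng: variance_inflated_index(m, n, 2.0, rng)
    bonused = lambda m, n, rng: mean_bonus_index(m, n, 1.0, T, rng)
    print(run_optimistic_ts([0.0, 0.5], T, inflated))
    print(run_optimistic_ts([0.0, 0.5], T, bonused))
```

Passing the index rule as a function keeps the bandit loop identical for both variants, so the only difference between them is how the Gaussian index is drawn.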
In practice
- Use variance-inflated TS for K-armed bandits.
- Apply a mean bonus to posterior means in TS.
- Construct Wald-type confidence intervals from stable TS.
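As a rough illustration of the last step, a Wald-type interval for one arm's mean is the sample mean plus or minus a normal quantile times the standard error. The sketch below assumes a known reward standard deviation `sigma` and that the pull counts come from a stable TS run; the function name `wald_interval` and its parameters are hypothetical, not taken from the paper.

```python
import math
from statistics import NormalDist


def wald_interval(mean_hat, n_pulls, sigma=1.0, level=0.95):
    """Two-sided Wald interval: mean_hat +/- z * sigma / sqrt(n_pulls).

    Assumption: sigma is the known reward standard deviation.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for level=0.95
    half_width = z * sigma / math.sqrt(n_pulls)
    return mean_hat - half_width, mean_hat + half_width


if __name__ == "__main__":
    # An arm pulled 100 times with sample mean 0.5: roughly (0.304, 0.696).
    print(wald_interval(0.5, 100))
```

The formula is the same one used for i.i.d. data; according to the summary above, stability is what justifies applying it to adaptively collected pull counts.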
Topics
- Thompson Sampling
- Multi-armed Bandits
- Adaptive Inference
- Statistical Stability
- Variance Inflation
- Mean Bonus
Best for: AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.