Optimism Stabilizes Thompson Sampling for Adaptive Inference

2026-06-17 · Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Expert, extended

Summary

The paper "Optimism Stabilizes Thompson Sampling for Adaptive Inference" addresses valid statistical inference in multi-armed bandits (MAB) when the data are collected adaptively by Thompson Sampling (TS). Classical asymptotic theory can fail in this setting because arm-specific sample sizes are random and coupled with the rewards. The authors identify "optimism" as a key mechanism for restoring "stability," a sufficient condition for valid asymptotic inference that requires each arm's pull count to concentrate around a deterministic scale. They prove that two optimistic TS variants are stable for any K-armed Gaussian bandit (K≥2), including the case of multiple optimal arms. The first is variance-inflated TS (extending Halder et al., 2025). The second adds an explicit mean bonus to the posterior mean. Both incur only a mild additional regret cost, on the order of (log log T)² over the classical O(log T) rate, while enabling standard Wald-type confidence intervals.

Key takeaway

For research scientists designing adaptive experiments or online A/B tests: if you use Thompson Sampling, add optimism so that statistical inference stays valid. Variance inflation or a mean bonus stabilizes arm pull counts, which is what makes Wald-type confidence intervals reliable on adaptively collected data. The price is a mild, controlled increase in regret.

Key insights

Optimism, implemented via variance inflation or a mean bonus, stabilizes Thompson Sampling for valid adaptive inference in multi-armed bandits.

Principles

Adaptive data collection can invalidate classical asymptotic inference.
Stability means each arm's pull count concentrates around a deterministic scale.
Optimism can be injected through variance or mean adjustments.
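
The stability principle above can be written out as a short worked statement. This is a sketch in assumed notation, not the paper's exact formulation: N_{a,T} is the number of pulls of arm a up to horizon T, n_a^*(T) is a deterministic sequence, \hat\mu_{a,T} is the arm's sample mean, and \sigma is a known reward standard deviation.

```latex
% Stability (assumed notation): pull counts concentrate around a deterministic scale
\frac{N_{a,T}}{n_a^{*}(T)} \xrightarrow{\;p\;} 1
\qquad \text{for each arm } a \in \{1,\dots,K\}.

% Consequence used for Wald-type inference: the normalized sample mean is asymptotically standard normal
\sqrt{N_{a,T}}\;\frac{\hat\mu_{a,T}-\mu_a}{\sigma} \xrightarrow{\;d\;} \mathcal{N}(0,1).
```

Without the first condition, the random and reward-dependent N_{a,T} can break the second, which is why adaptively collected sample means may not be asymptotically normal.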

Method

Two methods: 1) Inflate the posterior sampling variance (σ(ℬ) > 1). 2) Add an explicit mean bonus Bₗ,ₜ := √(2β(ℬ) log T / Nₗ,ₜ) to the posterior mean. Both modify the Gaussian indices that TS samples for each arm.
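
To make the two modifications concrete, here is a minimal Python sketch of both on a unit-variance Gaussian bandit. It is an illustration under stated assumptions, not the authors' implementation: the function name `optimistic_ts` is hypothetical, the posterior for each arm is taken to be N(sample mean, 1/N) under a flat prior, `inflation` stands in for σ and is treated as a constant multiplier on the posterior variance, `beta` stands in for β and is treated as a constant, and N is read as the arm's pull count. The summary does not specify the actual schedules for σ and β.

```python
import numpy as np


def optimistic_ts(means, horizon, mode="inflate", inflation=2.0, beta=1.0, seed=0):
    """Run one optimistic Thompson Sampling trajectory on a unit-variance Gaussian bandit.

    Returns (pull_counts, sample_means), one entry per arm.
    """
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    k = means.size
    counts = np.zeros(k, dtype=int)
    sums = np.zeros(k)

    for t in range(horizon):
        if t < k:
            arm = t  # pull each arm once so every count is positive
        else:
            post_mean = sums / counts
            post_var = 1.0 / counts  # assumption: flat prior, unit reward variance
            if mode == "inflate":
                # Variant 1: sample from a posterior with inflated variance.
                index = rng.normal(post_mean, np.sqrt(inflation * post_var))
            elif mode == "bonus":
                # Variant 2: keep the variance, shift the mean by sqrt(2*beta*log T / N).
                bonus = np.sqrt(2.0 * beta * np.log(horizon) / counts)
                index = rng.normal(post_mean + bonus, np.sqrt(post_var))
            else:
                raise ValueError(f"unknown mode: {mode!r}")
            arm = int(np.argmax(index))
        reward = rng.normal(means[arm], 1.0)
        counts[arm] += 1
        sums[arm] += reward

    return counts, sums / counts
```

For example, `optimistic_ts([0.0, 0.5, 0.5], horizon=2000, mode="bonus")` runs the mean-bonus variant on a three-armed instance with two optimal arms, the regime the paper highlights. Repeating the run over several seeds and comparing the pull counts gives a rough empirical feel for how much they vary.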

In practice

Use variance-inflated TS for K-armed bandits.
Apply a mean bonus to posterior means in TS.
Construct Wald-type confidence intervals from data collected by a stabilized TS.
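
The last step can be sketched as follows. This is a minimal illustration assuming a known reward standard deviation `sigma`, and it is only justified when the pull count comes from a stable sampling rule. The helper name `wald_interval` is hypothetical.

```python
import math
from statistics import NormalDist


def wald_interval(sample_mean, pulls, sigma=1.0, level=0.95):
    """Two-sided Wald interval: sample_mean +/- z * sigma / sqrt(pulls)."""
    if pulls <= 0:
        raise ValueError("pulls must be positive")
    # Standard normal quantile for the requested two-sided coverage level.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * sigma / math.sqrt(pulls)
    return sample_mean - half_width, sample_mean + half_width
```

The interval is applied per arm, using that arm's own sample mean and pull count at the end of the experiment.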

Topics

Thompson Sampling
Multi-armed Bandits
Adaptive Inference
Statistical Stability
Variance Inflation
Mean Bonus

Best for: AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.