Stochastic Linear Contextual Bandits with Bounded Noise: A Set-Membership Approach
Summary
A novel algorithm, SME-OFU, addresses Stochastic Linear Contextual Bandits (SLCB) by explicitly leveraging bounded reward noise, a condition common in applications like personalized recommendations and mobile health. Unlike existing methods that assume sub-Gaussian noise and achieve an ~O(sqrt(T)) regret bound, SME-OFU employs Set-Membership Estimation (SME) for uncertainty quantification within an Optimism in the Face of Uncertainty (OFU) framework. This approach yields a significantly improved, instance-independent regret bound of O(log T), specifically O(d^3 log(dT/delta)). The algorithm's theoretical analysis required a new convex geometry-based proof framework due to SME's polytopic uncertainty sets. Empirical simulations demonstrate SME-OFU's superior performance over sub-Gaussian noise benchmarks when reward noise is indeed bounded.
Key takeaway
For Machine Learning Engineers developing online decision-making systems like personalized recommenders or mobile health interventions where reward noise is inherently bounded, you should re-evaluate standard contextual bandit algorithms. Instead of relying on sub-Gaussian noise assumptions, consider SME-OFU or similar Set-Membership Estimation (SME)-based approaches. This can significantly improve your system's long-term performance by achieving an O(log T) regret, a substantial gain over the ~O(sqrt(T)) bounds of traditional methods, provided you have a valid noise upper bound.
Key insights
Explicitly leveraging bounded reward noise in contextual bandits can yield logarithmic regret, significantly outperforming square-root bounds.
Principles
- Bounded noise offers stronger information than sub-Gaussian assumptions.
- Set-membership estimation (SME) quantifies uncertainty effectively with bounded noise.
- Optimism in the Face of Uncertainty (OFU) guides arm selection using uncertainty sets.
Method
The SME-OFU algorithm iteratively constructs polytopic uncertainty sets for the unknown parameter using historical data and a known noise bound, then selects the arm maximizing expected reward over this set.
In practice
- Apply to personalized recommendation systems with binary click rewards.
- Use in mobile health for bounded metrics like sedentary time or walking steps.
Topics
- Stochastic Linear Contextual Bandits
- Bounded Noise
- Set-Membership Estimation
- Optimism in the Face of Uncertainty
- Regret Bounds
- Convex Geometry
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.