Stochastic Linear Contextual Bandits with Bounded Noise: A Set-Membership Approach

· Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Mathematics & Computational Sciences · Depth: Expert, extended

Summary

A novel algorithm, SME-OFU, addresses Stochastic Linear Contextual Bandits (SLCB) by explicitly leveraging bounded reward noise, a condition common in applications like personalized recommendations and mobile health. Unlike existing methods that assume sub-Gaussian noise and achieve an ~O(sqrt(T)) regret bound, SME-OFU employs Set-Membership Estimation (SME) for uncertainty quantification within an Optimism in the Face of Uncertainty (OFU) framework. This approach yields a significantly improved, instance-independent regret bound of O(log T), specifically O(d^3 log(dT/delta)). The algorithm's theoretical analysis required a new convex geometry-based proof framework due to SME's polytopic uncertainty sets. Empirical simulations demonstrate SME-OFU's superior performance over sub-Gaussian noise benchmarks when reward noise is indeed bounded.

Key takeaway

For Machine Learning Engineers developing online decision-making systems like personalized recommenders or mobile health interventions where reward noise is inherently bounded, you should re-evaluate standard contextual bandit algorithms. Instead of relying on sub-Gaussian noise assumptions, consider SME-OFU or similar Set-Membership Estimation (SME)-based approaches. This can significantly improve your system's long-term performance by achieving an O(log T) regret, a substantial gain over the ~O(sqrt(T)) bounds of traditional methods, provided you have a valid noise upper bound.

Key insights

Explicitly leveraging bounded reward noise in contextual bandits can yield logarithmic regret, significantly outperforming square-root bounds.

Principles

Method

The SME-OFU algorithm iteratively constructs polytopic uncertainty sets for the unknown parameter using historical data and a known noise bound, then selects the arm maximizing expected reward over this set.

In practice

Topics

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.