Offline Contextual Bandits in the Presence of New Actions
Summary
Automated decision-making systems, such as recommendation engines, typically use off-policy contextual bandits or off-policy learning (OPL) to select actions that maximize expected rewards from a predefined set. In real-world applications, however, the action space often keeps evolving, and new actions emerge after the data has been collected. This paper introduces an OPL method that addresses this challenge by leveraging action features. It first presents the Local Combination PseudoInverse (LCPI) estimator, which generalizes the PseudoInverse estimator for policy gradient estimation and controls the trade-off between the reward-modeling condition and the data collection condition on action features. Building on this, the authors propose Policy Optimization for Effective New Actions (PONA), an algorithm that combines LCPI, which is specialized for new action selection, with the Doubly Robust (DR) estimator, which is effective for existing actions. PONA is defined as a weighted sum of the LCPI and DR estimators, so the proportion of new actions selected can be adjusted through the weight. The authors report that PONA selects new actions efficiently while maintaining overall policy performance.
Key takeaway
For research scientists developing off-policy learning algorithms for dynamic environments, PONA offers a way to handle continuously evolving action spaces. Consider combining the LCPI and DR estimators, as PONA does, to select both existing and newly introduced actions, which can improve the adaptability and performance of automated decision-making systems such as recommendation engines or search platforms.
Key insights
A new off-policy learning method, PONA, effectively selects both existing and novel actions in evolving environments.
Principles
- Action features enable learning for new actions.
- Combine specialized estimators for comprehensive action selection.
Method
PONA integrates the LCPI estimator, which generalizes PseudoInverse for new action selection, with the Doubly Robust (DR) estimator for existing actions, using a weighted sum to optimize both.
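To make the DR half of this concrete, here is a minimal sketch of the standard Doubly Robust value estimate for logged bandit data, written with NumPy. It is not the paper's code: the function name, argument names, and array layout are assumptions chosen for illustration, and the LCPI estimator is not reproduced because this summary does not give its formula. Note that the importance weight divides by the logging policy's probability of the logged action, which is only defined for actions that appear in the logs. That is one way to see why DR suits existing actions and why new actions need a different estimator.

```python
import numpy as np

def doubly_robust_value(pi_target, pi_logging, actions, rewards, q_hat):
    """Standard Doubly Robust estimate of a target policy's value.

    Hypothetical layout (an assumption of this sketch):
      pi_target, pi_logging, q_hat: float arrays of shape (n, n_actions)
      actions: int array of shape (n,), the logged actions
      rewards: float array of shape (n,), the observed rewards
    """
    pi_target = np.asarray(pi_target, dtype=float)
    pi_logging = np.asarray(pi_logging, dtype=float)
    q_hat = np.asarray(q_hat, dtype=float)
    actions = np.asarray(actions, dtype=int)
    rewards = np.asarray(rewards, dtype=float)

    rows = np.arange(len(actions))
    # Model-based term: predicted reward averaged under the target policy.
    direct = (pi_target * q_hat).sum(axis=1)
    # Importance weight of each logged action (requires pi_logging > 0 there).
    weights = pi_target[rows, actions] / pi_logging[rows, actions]
    # Correction term: importance-weighted residual of the reward model.
    correction = weights * (rewards - q_hat[rows, actions])
    return float(np.mean(direct + correction))
```

With a reward model that predicts zero everywhere, the direct term vanishes and the estimate reduces to plain inverse propensity scoring; with an accurate reward model, the correction term shrinks and the estimate leans on the model instead.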
In practice
- Use LCPI for policy gradient estimation with action features.
- Implement PONA for dynamic recommendation systems.
- Adjust new action selection via PONA's weight parameter.
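The role of the weight parameter can be sketched as follows. This summary only says that PONA is a weighted sum of the LCPI and DR estimators, so the convex combination below, with a single weight in [0, 1], is an assumption, as are the function names and the plain gradient-ascent update. The two gradient estimates are taken as given inputs, since how LCPI computes its gradient is not described here.

```python
import numpy as np

def combine_gradients(grad_lcpi, grad_dr, weight):
    """Blend two policy-gradient estimates (hypothetical convex weighting).

    weight = 1.0 uses only the LCPI-style gradient (new-action focus);
    weight = 0.0 uses only the DR-style gradient (existing-action focus).
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    grad_lcpi = np.asarray(grad_lcpi, dtype=float)
    grad_dr = np.asarray(grad_dr, dtype=float)
    return weight * grad_lcpi + (1.0 - weight) * grad_dr

def ascent_step(theta, grad_lcpi, grad_dr, weight, learning_rate=0.1):
    """One gradient-ascent update of the policy parameters theta."""
    theta = np.asarray(theta, dtype=float)
    return theta + learning_rate * combine_gradients(grad_lcpi, grad_dr, weight)
```

Under this assumption, raising the weight pushes each update further toward the direction favored by the new-action estimator, which is the lever the last bullet refers to.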
Topics
- Offline Contextual Bandits
- Off-Policy Learning
- New Actions
- Policy Optimization for Effective New Actions
- Local Combination PseudoInverse
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.