Offline Contextual Bandits in the Presence of New Actions

2026-05-18 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

Automated decision-making systems, such as recommendation engines, typically use off-policy contextual bandits, or off-policy learning (OPL), to learn from logged data a policy that selects the reward-maximizing action from a predefined set. Real-world applications, however, often have continuously evolving action spaces, where new actions emerge after the data has been collected. This paper introduces an OPL method that handles such new actions by leveraging action features. It first presents the Local Combination PseudoInverse (LCPI) estimator, which generalizes the PseudoInverse estimator for policy gradient estimation and controls the trade-off between the reward-modeling condition and the data-collection condition on action features. Building on this, the authors propose Policy Optimization for Effective New Actions (PONA), an algorithm that combines LCPI, which is specialized for new action selection, with the Doubly Robust (DR) estimator, which is effective for existing actions. PONA is defined as a weighted sum of the LCPI and DR estimators, and the weight adjusts the proportion of new actions selected. According to the summary, PONA selects new actions efficiently while maintaining overall policy performance.

Key takeaway

For research scientists developing off-policy learning algorithms in dynamic environments, PONA offers a way to handle action spaces that keep changing after the data was logged. Consider combining the LCPI and DR estimators as PONA does, so that a learned policy can select both existing and newly introduced actions in automated decision-making systems such as recommendation engines or search platforms.

Key insights

A new off-policy learning method, PONA, effectively selects both existing and novel actions in evolving environments.

Principles

Action features enable learning for new actions.
Combine specialized estimators for comprehensive action selection.
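
The first principle can be illustrated with a minimal sketch. This is not the paper's LCPI estimator; it only shows, under assumed names and a synthetic linear reward, why a model defined on action features can score an action that never appears in the logged data. The helpers `fit_ridge` and `joint_features`, and all data below, are hypothetical.

```python
import numpy as np

def fit_ridge(features, rewards, reg=1e-3):
    """Closed-form ridge regression: w = (F^T F + reg * I)^{-1} F^T r."""
    d = features.shape[1]
    gram = features.T @ features + reg * np.eye(d)
    return np.linalg.solve(gram, features.T @ rewards)

def joint_features(context, action_feature):
    """Hypothetical joint representation: elementwise context-action interaction."""
    return context * action_feature

rng = np.random.default_rng(0)
dim, n_logged = 4, 500
true_w = rng.normal(size=dim)  # synthetic ground truth, for illustration only

# The log covers only three existing actions, each described by a feature vector.
existing_action_features = np.array([
    [1.0, 0.5, -1.0, 2.0],
    [0.5, 1.5, 1.0, -0.5],
    [-1.0, 1.0, 0.5, 1.0],
])
contexts = rng.normal(size=(n_logged, dim))
logged_actions = rng.integers(0, 3, size=n_logged)
phi = joint_features(contexts, existing_action_features[logged_actions])
rewards = phi @ true_w + 0.1 * rng.normal(size=n_logged)

w_hat = fit_ridge(phi, rewards)

# A new action has no logged rewards, but it still has a feature vector,
# so the fitted model can produce a reward prediction for it.
new_action_feature = np.array([0.8, -0.6, 1.2, 0.4])
new_context = rng.normal(size=dim)
predicted = float(joint_features(new_context, new_action_feature) @ w_hat)
actual = float(joint_features(new_context, new_action_feature) @ true_w)
```

The prediction is only as good as the assumed reward model, which is the kind of reward-modeling condition the summary says LCPI trades off against data-collection conditions.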

Method

PONA integrates the LCPI estimator, which generalizes PseudoInverse for new action selection, with the Doubly Robust (DR) estimator for existing actions, using a weighted sum to optimize both.
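
A rough sketch of the two ingredients named above, under stated assumptions: `dr_value_estimate` is the standard doubly robust estimate of a policy's value (the paper applies DR to policy gradients, which is not reproduced here), and `weighted_sum` assumes a convex combination with a single weight in [0, 1]. The source only says "weighted sum", so the exact weighting scheme, the function names, and the argument layout are all hypothetical, and LCPI itself is left as an input.

```python
import numpy as np

def dr_value_estimate(target_probs, logging_probs, q_hat, actions, rewards):
    """Standard doubly robust estimate of a target policy's value.

    target_probs, logging_probs, q_hat: arrays of shape (n, n_actions)
    actions: logged action indices, shape (n,)
    rewards: observed rewards, shape (n,)
    """
    idx = np.arange(len(actions))
    # Model-based baseline: expected predicted reward under the target policy.
    baseline = np.sum(target_probs * q_hat, axis=1)
    # Importance-weighted correction using the residual on the logged action.
    weights = target_probs[idx, actions] / logging_probs[idx, actions]
    correction = weights * (rewards - q_hat[idx, actions])
    return float(np.mean(baseline + correction))

def weighted_sum(new_action_term, existing_action_term, weight):
    """Assumed convex combination; a larger `weight` favors the new-action term."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * new_action_term + (1.0 - weight) * existing_action_term
```

In this sketch, `new_action_term` would be filled by an LCPI-style gradient estimate and `existing_action_term` by a DR-style one; raising `weight` shifts the update toward whatever the new-action term recommends.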

In practice

Use LCPI for policy gradient estimation with action features.
Implement PONA for dynamic recommendation systems.
Adjust new action selection via PONA's weight parameter.

Topics

Offline Contextual Bandits
Off-Policy Learning
New Actions
Policy Optimization for Effective New Actions
Local Combination PseudoInverse

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.