Off-Policy Evaluation for Missingness-Aware Policies in MDPs with Rewards Missing Not at Random
Summary
This paper studies off-policy evaluation (OPE) in finite-horizon Markov decision processes when immediate rewards in logged batch data are Missing Not At Random (MNAR), a common issue in healthcare and marketing that induces selection bias. It formalizes a reward-dependent propensity model and uses future states as shadow variables to identify the full-data conditional mean reward. It introduces a "bridge function" that recovers this conditional mean without explicitly modeling the MNAR mechanism; the bridge function is estimated via a min-max procedure that avoids double sampling. Building on these identification results, the paper proposes a Fitted-Q-Evaluation-style estimator that propagates the recovered rewards and allows target policies to depend on past missingness indicators. The authors establish consistency and finite-sample error bounds for the estimator and report strong performance relative to existing methods on simulated data and MIMIC-III Sepsis data.
Key takeaway
If you are a machine learning engineer evaluating policies from offline reinforcement learning data in which rewards are Missing Not At Random, standard off-policy evaluation methods that ignore the missingness mechanism can yield biased results. Consider the paper's Fitted-Q-Evaluation-style estimator, which uses a bridge function and future states to recover conditional mean rewards. The paper establishes consistency and finite-sample error bounds for this estimator, which makes it a candidate for policy evaluation in critical domains such as healthcare and marketing.
Key insights
A novel OPE method addresses Missing Not At Random rewards in RL by using a bridge function and future states as shadow variables.
Principles
- MNAR rewards induce selection bias in OPE.
- Future states, used as shadow variables, can identify the full-data conditional mean reward.
- A bridge function recovers the conditional mean reward without explicitly modeling the MNAR mechanism.
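To make the first principle concrete, here is a minimal simulation sketch of selection bias under MNAR. It is not taken from the paper: the function name `simulate_mnar`, the binary rewards, and the observation probabilities (0.9 when the reward is 1, 0.3 when it is 0) are all illustrative assumptions. Because observation depends on the reward itself, averaging only the observed rewards overstates the true mean.

```python
import random


def simulate_mnar(n=20000, seed=0):
    """Compare the full-data mean reward with the complete-case mean.

    Hypothetical setup (not from the paper): rewards are Bernoulli(0.5),
    and a reward is logged with probability 0.9 if it equals 1 and
    0.3 if it equals 0, so missingness depends on the reward (MNAR).
    """
    rng = random.Random(seed)
    rewards, observed = [], []
    for _ in range(n):
        r = 1 if rng.random() < 0.5 else 0
        p_obs = 0.9 if r == 1 else 0.3  # reward-dependent propensity
        rewards.append(r)
        observed.append(rng.random() < p_obs)

    # Mean over all rewards (unavailable in practice).
    full_mean = sum(rewards) / n
    # Mean over logged rewards only (what a naive analysis would use).
    seen = [r for r, o in zip(rewards, observed) if o]
    complete_case_mean = sum(seen) / len(seen)
    return full_mean, complete_case_mean


full_mean, complete_case_mean = simulate_mnar()
print(f"full-data mean:     {full_mean:.3f}")
print(f"complete-case mean: {complete_case_mean:.3f}")
```

Under these assumed probabilities the population complete-case mean is 0.45 / 0.60 = 0.75, against a true mean of 0.5, so the gap does not shrink as more data is collected.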
Method
Formalize a reward-dependent propensity model, use future states as shadow variables, estimate a bridge function via a min-max procedure, then apply a Fitted-Q-Evaluation-style estimator that propagates the recovered rewards.
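The last step of the method can be sketched as a tabular, finite-horizon Fitted-Q-Evaluation recursion. This is a simplified illustration under stated assumptions, not the paper's estimator: the recovered conditional mean rewards `r_hat` are taken as given (the bridge-function and min-max steps are not implemented here), the target policy depends only on the current state rather than on past missingness indicators, and every name (`fitted_q_evaluation`, `r_hat`, `target_policy`) is hypothetical.

```python
from collections import defaultdict


def fitted_q_evaluation(transitions, r_hat, policy, horizon, actions):
    """Tabular finite-horizon FQE that uses recovered mean rewards.

    transitions: list of (t, s, a, s_next) tuples from the logged data
    r_hat:       dict (s, a) -> recovered conditional mean reward,
                 assumed to come from a separate recovery step
    policy:      function (s, a) -> probability under the target policy
    Returns a dict (s, a) -> estimated Q-value at time step 0.
    """
    q_next = defaultdict(float)  # Q at the horizon is zero everywhere
    for t in reversed(range(horizon)):
        sums, counts = defaultdict(float), defaultdict(int)
        for step, s, a, s_next in transitions:
            if step != t:
                continue
            # Value of the next state under the target policy.
            v_next = sum(policy(s_next, b) * q_next[(s_next, b)] for b in actions)
            # Regression target uses the recovered reward, not the logged one.
            sums[(s, a)] += r_hat[(s, a)] + v_next
            counts[(s, a)] += 1
        # In the tabular case, least-squares regression is a per-cell average.
        q_t = defaultdict(float)
        for key, total in sums.items():
            q_t[key] = total / counts[key]
        q_next = q_t
    return q_next


# Toy usage with two states, two actions, and horizon 2 (all values made up).
actions = [0, 1]
r_hat = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 2.0}
transitions = [(0, 0, 1, 1), (0, 0, 0, 0), (1, 1, 1, 1), (1, 0, 0, 0)]


def target_policy(s, a):
    return 1.0 if a == 1 else 0.0  # always choose action 1


q0 = fitted_q_evaluation(transitions, r_hat, target_policy, horizon=2, actions=actions)
value_at_start = sum(target_policy(0, a) * q0[(0, a)] for a in actions)
print(value_at_start)
```

With function approximation, the per-cell average would be replaced by a regression fit at each backward step; state-action pairs never seen in the data default to a Q-value of zero in this sketch, which is a simplification rather than a recommendation.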
In practice
- Apply OPE to healthcare data with missing rewards.
- Evaluate marketing policies from logged data with missing reward records.
Topics
- Off-Policy Evaluation
- Reinforcement Learning
- Missing Not At Random
- Markov Decision Processes
- Fitted Q-Evaluation
- Healthcare AI
- MIMIC-III Sepsis
Best for: AI Scientist, Machine Learning Engineer, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.