Off-Policy Evaluation for Missingness-Aware Policies in MDPs with Rewards Missing Not at Random
Summary
This research introduces an off-policy evaluation (OPE) method for finite-horizon Markov decision processes (MDPs) in which immediate rewards are missing not at random (MNAR). The problem is common in healthcare and marketing, and it induces selection bias. The approach formalizes a reward-dependent propensity model and uses future states ($S_{t+1}$) as endogenous "shadow variables" to identify the full-data conditional mean reward. Its key innovation is a bridge function, estimated via a min-max procedure, which recovers the conditional mean reward without explicitly modeling the MNAR mechanism, thereby avoiding double sampling. Building on these identification results, the authors develop a Fitted-Q-Evaluation (FQE)-style estimator that allows target policies to depend on past missingness indicators. In experiments on simulated data and the MIMIC-III Sepsis dataset, the method achieves lower bias than existing baselines across missingness rates from 20% to 80%.
Key takeaway
For AI Scientists and Research Scientists evaluating policies in offline reinforcement learning settings where rewards may be missing: your current OPE estimates may be biased if rewards are missing not at random (MNAR). The proposed bridge function and FQE-style estimator, which use future states as shadow variables, can give more accurate policy value estimates, especially in high-stakes domains like healthcare, without modeling complex MNAR mechanisms or acquiring additional data.
Key insights
A novel OPE method uses future states as shadow variables and a bridge function to address Missing Not At Random (MNAR) rewards in MDPs.
Principles
- MNAR rewards break ignorability, inducing selection bias.
- Future states ($S_{t+1}$) can serve as endogenous shadow variables.
- Bridge functions recover conditional mean rewards without explicit MNAR modeling.
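The first principle can be illustrated with a small simulation. This is a minimal sketch, not the paper's setup: the Gaussian reward distribution and the logistic, reward-dependent observation probability are assumptions chosen only to show that averaging the observed rewards overestimates the true mean when larger rewards are more likely to be recorded.

```python
import numpy as np

def simulate_mnar_bias(n=200_000, seed=0):
    """Compare the full-data mean reward with the mean of observed rewards."""
    rng = np.random.default_rng(seed)
    # Assumed reward distribution (illustrative only).
    rewards = rng.normal(loc=1.0, scale=1.0, size=n)
    # Assumed MNAR mechanism: the chance of observing a reward
    # depends on the reward value itself.
    p_observed = 1.0 / (1.0 + np.exp(-rewards))
    observed = rng.random(n) < p_observed
    return rewards.mean(), rewards[observed].mean()

full_mean, observed_mean = simulate_mnar_bias()
print(f"full-data mean: {full_mean:.3f}")
print(f"observed-only mean: {observed_mean:.3f}")
```

Because the observation probability increases with the reward, the observed-only mean sits above the full-data mean. Any estimator that simply drops transitions with missing rewards inherits this bias.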
Method
The method formalizes a reward-dependent propensity model, uses future states as shadow variables, introduces a bridge function estimated via a min-max procedure, and integrates this into an FQE-style estimator that propagates recovered rewards.
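The final stage, propagating recovered rewards through an FQE-style recursion, can be sketched in a tabular setting. This is a simplified illustration under stated assumptions, not the authors' estimator: the array `r_hat` of recovered conditional mean rewards is taken as given (standing in for the output of the bridge-function step), the function name `fqe_finite_horizon` is hypothetical, the target policy depends on the state only rather than on past missingness indicators, and there is no discounting.

```python
import numpy as np

def fqe_finite_horizon(transitions, r_hat, policy, n_states, n_actions, horizon):
    """Backward tabular recursion: Q_t(s, a) = r_hat_t(s, a) + E[V_{t+1}(S_{t+1})].

    transitions: list of (t, s, a, s_next) tuples from the logged data.
    r_hat:  array [horizon, n_states, n_actions], recovered mean rewards
            (assumed to be supplied by an earlier recovery step).
    policy: array [horizon, n_states, n_actions], target policy probabilities.
    """
    q = np.zeros((horizon, n_states, n_actions))
    for t in reversed(range(horizon)):
        sums = np.zeros((n_states, n_actions))
        counts = np.zeros((n_states, n_actions))
        for (step, s, a, s_next) in transitions:
            if step != t:
                continue
            # Value of the next state under the target policy (zero at the end).
            v_next = policy[t + 1, s_next] @ q[t + 1, s_next] if t + 1 < horizon else 0.0
            sums[s, a] += v_next
            counts[s, a] += 1
        # Empirical mean of next-step values; unvisited pairs default to 0.
        mean_next = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
        q[t] = r_hat[t] + mean_next
    return q

# Toy usage: 2 states, 2 actions, horizon 2, uniform target policy.
toy_transitions = [(0, 0, 0, 1), (0, 0, 1, 0), (0, 1, 0, 0), (0, 1, 1, 1),
                   (1, 0, 0, 0), (1, 1, 1, 1)]
toy_r_hat = np.ones((2, 2, 2))
toy_policy = np.full((2, 2, 2), 0.5)
q = fqe_finite_horizon(toy_transitions, toy_r_hat, toy_policy, 2, 2, 2)
```

In the toy run every recovered reward is 1, so each `q[1]` entry is 1 and each `q[0]` entry is 2. Note that the regression targets use `r_hat` for every transition, including those whose logged reward is missing, which is the sense in which recovered rewards are propagated.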
In practice
- Apply to healthcare (e.g., MIMIC-III Sepsis data) for robust policy evaluation.
- Avoid explicit MNAR mechanism modeling to reduce variance.
- Use the next states ($S_{t+1}$) already present in the logged data as shadow variables; no new measurements are needed.
Topics
- Off-Policy Evaluation
- Missing Not At Random
- Markov Decision Processes
- Reinforcement Learning
- Shadow Variables
- Fitted-Q-Evaluation
- MIMIC-III
Best for: AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.