Off-Policy Evaluation for Missingness-Aware Policies in MDPs with Rewards Missing Not at Random

2026-06-19 · Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, extended

Summary

This research introduces an off-policy evaluation (OPE) method for finite-horizon Markov decision processes (MDPs) in which immediate rewards are missing not at random (MNAR). This form of missingness, common in healthcare and marketing, induces selection bias. The proposed approach formalizes a reward-dependent propensity model and uses future states ($S_{t+1}$) as endogenous "shadow variables" to identify the full-data conditional mean reward. A key innovation is a bridge function, estimated via a min-max procedure, which recovers the conditional mean reward without explicitly modeling the MNAR mechanism, thereby avoiding double sampling. Building on these identification results, the authors develop a Fitted-Q-Evaluation (FQE)-style estimator that allows target policies to depend on past missingness indicators. Experiments on simulated data and the MIMIC-III Sepsis dataset show lower bias than existing baselines across missingness rates from 20% to 80%.

Key takeaway

For AI Scientists and Research Scientists evaluating policies in offline reinforcement learning with potentially missing rewards: standard OPE estimates may be biased if rewards are missing not at random (MNAR). The proposed bridge function and FQE-style estimator, which use future states as shadow variables, can yield more accurate policy value estimates, especially in high-stakes domains like healthcare, without modeling complex MNAR mechanisms or acquiring additional data.

Key insights

A novel OPE method uses future states as shadow variables and a bridge function to address Missing Not At Random (MNAR) rewards in MDPs.

Principles

MNAR rewards break ignorability, inducing selection bias.
Future states ($S_{t+1}$) can serve as endogenous shadow variables.
Bridge functions recover conditional mean rewards without explicit MNAR modeling.
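
The first principle can be checked with a small simulation. The sketch below is not from the paper: the reward distribution and the logistic observation mechanism are assumptions chosen only to show that, when the chance of observing a reward depends on the reward itself, the mean of the observed rewards drifts away from the full-data mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Full-data rewards (hypothetical distribution): normal with mean 1.0.
reward = rng.normal(loc=1.0, scale=1.0, size=n)

# Assumed MNAR mechanism: the probability of observing a reward
# increases with the reward value itself.
p_observed = 1.0 / (1.0 + np.exp(-(reward - 1.0)))
observed = rng.random(n) < p_observed

full_mean = reward.mean()
complete_case_mean = reward[observed].mean()
bias = complete_case_mean - full_mean

print(f"fraction observed:  {observed.mean():.3f}")
print(f"full-data mean:     {full_mean:.3f}")
print(f"complete-case mean: {complete_case_mean:.3f}")
print(f"selection bias:     {bias:.3f}")
```

Because high rewards are over-represented among the observed cases, the complete-case mean overstates the true mean. Any OPE estimator that regresses only on observed rewards inherits this bias.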

Method

The method formalizes a reward-dependent propensity model, uses future states as shadow variables, introduces a bridge function estimated via a min-max procedure, and integrates this into an FQE-style estimator that propagates recovered rewards.
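
To make the last step concrete, here is a minimal tabular sketch of how recovered rewards could be propagated through an FQE-style backward recursion. It is an illustration under stated assumptions, not the authors' implementation: the function `fqe_tabular`, the input `r_hat` (standing in for the conditional mean reward that the bridge function would supply), and the time-homogeneous dynamics are all hypothetical, and the dependence of the target policy on past missingness indicators is omitted for brevity.

```python
import numpy as np


def fqe_tabular(s, a, s_next, r_hat, pi, horizon):
    """Finite-horizon tabular FQE-style backward recursion.

    s, a, s_next : integer arrays of logged transitions, pooled over time
                   (assumes time-homogeneous dynamics).
    r_hat        : (n_states, n_actions) recovered mean rewards; a
                   hypothetical stand-in for the bridge-function output.
    pi           : (n_states, n_actions) target-policy action probabilities.
    Returns q with shape (horizon + 1, n_states, n_actions); q[horizon] = 0.
    """
    n_states, n_actions = r_hat.shape

    # Empirical transition probabilities from the logged data.
    counts = np.zeros((n_states, n_actions, n_states))
    np.add.at(counts, (s, a, s_next), 1.0)
    totals = counts.sum(axis=2, keepdims=True)
    p_hat = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

    q = np.zeros((horizon + 1, n_states, n_actions))
    for t in range(horizon - 1, -1, -1):
        # Value of the next state under the target policy.
        v_next = (pi * q[t + 1]).sum(axis=1)
        # In the tabular case, regressing r_hat + v_next on (s, a)
        # reduces to this plug-in average.
        q[t] = r_hat + p_hat @ v_next
    return q


# Toy example: two states, two actions; action k moves to state k.
s = np.array([0, 0, 1, 1])
a = np.array([0, 1, 0, 1])
s_next = np.array([0, 1, 0, 1])
r_hat = np.array([[0.0, 1.0], [0.0, 1.0]])  # action 1 has mean reward 1
pi = np.array([[0.0, 1.0], [0.0, 1.0]])     # target policy always picks action 1

q = fqe_tabular(s, a, s_next, r_hat, pi, horizon=3)
print(q[0])
```

The only change relative to standard FQE in this sketch is that the regression target uses `r_hat` instead of the logged reward, so transitions with missing rewards still contribute through their observed next states.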

In practice

Apply to healthcare (e.g., MIMIC-III Sepsis data) for robust policy evaluation.
Avoid explicit MNAR mechanism modeling to reduce variance.
Use next states ($S_{t+1}$) already present in the logged data as shadow variables; no new measurements are needed.

Topics

Off-Policy Evaluation
Missing Not At Random
Markov Decision Processes
Reinforcement Learning
Shadow Variables
Fitted-Q-Evaluation
MIMIC-III

Code references

NAIVlab/ShadOPE

Best for: AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.