Off-Policy Evaluation for Missingness-Aware Policies in MDPs with Rewards Missing Not at Random

2026-06-18 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, quick

Summary

This paper studies off-policy evaluation (OPE) in finite-horizon Markov decision processes when immediate rewards in the logged batch data are Missing Not At Random (MNAR), a common problem in healthcare and marketing that induces selection bias. The authors formalize a reward-dependent propensity model and use future states as shadow variables to identify the full-data conditional mean reward. They introduce a "bridge function" that recovers this conditional mean without explicitly modeling the MNAR mechanism, and estimate it with a min-max procedure that avoids double sampling. Building on these identification results, they propose a Fitted-Q-Evaluation-style estimator that propagates the recovered rewards and allows target policies to condition on past missingness indicators. The paper establishes consistency and finite-sample error bounds for the estimator and reports strong performance relative to existing methods on simulated data and MIMIC-III sepsis data.

Key takeaway

If you are a machine learning engineer evaluating policies from offline reinforcement learning data in which rewards are Missing Not At Random, standard off-policy evaluation methods that ignore the missingness mechanism can yield biased estimates. Consider this paper's Fitted-Q-Evaluation-style estimator, which uses a bridge function and future states to recover conditional mean rewards. The estimator comes with consistency guarantees and finite-sample error bounds, which makes it a candidate for policy evaluation in domains such as healthcare and marketing.

Key insights

A novel OPE method addresses Missing Not At Random rewards in RL by using a bridge function and future states as shadow variables.

Principles

MNAR rewards induce selection bias in OPE.
Future states can identify conditional mean rewards.
Bridge functions recover rewards without MNAR modeling.
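
The first principle can be illustrated with a small simulation. This is a minimal sketch and is not taken from the paper: the Gaussian reward distribution, the logistic observation probability, and the function name `simulate_mnar_bias` are all assumptions chosen for illustration. It shows that when the chance of observing a reward depends on the reward itself, averaging only the observed rewards overstates the true mean.

```python
import numpy as np

def simulate_mnar_bias(n=100_000, seed=0):
    """Compare the true mean reward with the complete-case mean under MNAR.

    Illustrative assumptions (not from the paper):
    - rewards are drawn from Normal(1, 1)
    - a reward is observed with probability sigmoid(reward),
      so larger rewards are more likely to be logged
    """
    rng = np.random.default_rng(seed)
    rewards = rng.normal(loc=1.0, scale=1.0, size=n)

    # Hypothetical MNAR mechanism: missingness depends on the reward value.
    p_observe = 1.0 / (1.0 + np.exp(-rewards))
    observed = rng.random(n) < p_observe

    true_mean = rewards.mean()
    complete_case_mean = rewards[observed].mean()  # ignores missing rewards
    return true_mean, complete_case_mean

true_mean, complete_case_mean = simulate_mnar_bias()
print(f"true mean reward:       {true_mean:.3f}")
print(f"complete-case estimate: {complete_case_mean:.3f}")
```

Any OPE method that regresses on observed rewards alone inherits this kind of bias, which is the gap the shadow-variable and bridge-function steps are meant to close.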

Method

Formalize a reward-dependent propensity model, use future states as shadow variables, introduce a bridge function estimated via a min-max procedure, then apply a Fitted-Q-Evaluation-style estimator to the recovered rewards.
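
The final step of this pipeline can be sketched for the tabular case. This is a hedged illustration, not the authors' estimator: it assumes the recovered conditional mean rewards are already available as a table `r_hat` (the bridge-function and min-max estimation steps are not implemented here), and the function name, argument names, and array layouts are hypothetical. With tabular states and actions, the regression in each Fitted-Q-Evaluation step reduces to averaging the targets per state-action pair.

```python
import numpy as np

def fitted_q_evaluation(transitions, r_hat, target_policy,
                        n_states, n_actions, horizon):
    """Tabular finite-horizon FQE using recovered rewards.

    transitions:   iterable of (t, s, a, s_next) tuples from the logged data
    r_hat:         array (horizon, n_states, n_actions), recovered mean rewards
                   (assumed given; stands in for the bridge-function output)
    target_policy: array (horizon, n_states, n_actions) of action probabilities
    Returns q with shape (horizon + 1, n_states, n_actions); q[horizon] is zero.
    """
    q = np.zeros((horizon + 1, n_states, n_actions))
    for t in reversed(range(horizon)):
        # Value of the next step under the target policy (zero past the horizon).
        if t + 1 < horizon:
            v_next = (target_policy[t + 1] * q[t + 1]).sum(axis=1)
        else:
            v_next = np.zeros(n_states)

        sums = np.zeros((n_states, n_actions))
        counts = np.zeros((n_states, n_actions))
        for (tt, s, a, s_next) in transitions:
            if tt == t:
                # Regression target: recovered reward plus next-step value.
                sums[s, a] += r_hat[t, s, a] + v_next[s_next]
                counts[s, a] += 1

        # Per-cell average is the least-squares fit in the tabular case;
        # unvisited (s, a) pairs are left at zero.
        q[t] = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    return q

# Toy usage: one state, one action, horizon 3, recovered reward 1 at each step.
toy_transitions = [(0, 0, 0, 0), (1, 0, 0, 0), (2, 0, 0, 0)]
toy_q = fitted_q_evaluation(
    toy_transitions,
    r_hat=np.ones((3, 1, 1)),
    target_policy=np.ones((3, 1, 1)),
    n_states=1, n_actions=1, horizon=3,
)
print(toy_q[0, 0, 0])
```

One way to let the target policy depend on past missingness in a sketch like this would be to fold a missingness indicator into the state index, though how the paper handles this is not specified in the summary above.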

In practice

Apply OPE to healthcare data with missing rewards.
Evaluate policies in marketing with sparse records.

Topics

Off-Policy Evaluation
Reinforcement Learning
Missing Not At Random
Markov Decision Processes
Fitted Q-Evaluation
Healthcare AI
MIMIC-III Sepsis

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.