Off-Policy Evaluation for Missingness-Aware Policies in MDPs with Rewards Missing Not at Random

Source: Machine Learning · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Data Science & Analytics) · Depth: Expert, quick

Summary

This paper studies off-policy evaluation (OPE) in finite-horizon Markov decision processes when immediate rewards in logged batch data are Missing Not At Random (MNAR), a common problem in healthcare and marketing that induces selection bias. The authors formalize a reward-dependent propensity model, in which the probability that a reward is observed depends on the reward itself, and use future states as shadow variables to identify the full-data conditional mean reward. A novel "bridge function" recovers this conditional mean without explicitly modeling the MNAR mechanism; it is estimated via a min-max procedure that avoids double sampling. Building on these identification results, the paper proposes a Fitted-Q-Evaluation-style estimator that propagates the recovered rewards and allows target policies to condition on past missingness indicators. The authors establish consistency and finite-sample error bounds for the estimator and report strong performance against existing methods on simulated data and MIMIC-III sepsis data.

Key takeaway

For machine learning engineers evaluating policies from offline reinforcement learning data in which rewards are Missing Not At Random, standard off-policy evaluation methods that ignore the missingness mechanism can yield biased value estimates. Consider the paper's Fitted-Q-Evaluation-style estimator, which uses a bridge function and future states to recover conditional mean rewards. The estimator comes with consistency guarantees and finite-sample error bounds, which makes it a candidate for policy evaluation in high-stakes domains such as healthcare and marketing.
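
To see why reward-dependent missingness biases a naive estimate, here is a minimal worked example. It is not taken from the paper: the binary reward, the observation probabilities, and the helper name `complete_case_mean` are all illustrative assumptions. It computes the mean reward among observed rows only and compares it with the true mean.

```python
from fractions import Fraction


def complete_case_mean(p_reward_one, p_obs_given_one, p_obs_given_zero):
    """Mean of a binary reward computed from observed rows only.

    p_reward_one: true probability that the reward equals 1.
    p_obs_given_one / p_obs_given_zero: probability that the reward is
    logged when it equals 1 or 0 (hypothetical values for illustration).
    """
    observed_ones = p_reward_one * p_obs_given_one
    observed_zeros = (1 - p_reward_one) * p_obs_given_zero
    return observed_ones / (observed_ones + observed_zeros)


# Assumed setup: rewards of 1 are logged 90% of the time, rewards of 0 only 30%.
true_mean = Fraction(1, 2)
naive_mean = complete_case_mean(true_mean, Fraction(9, 10), Fraction(3, 10))
bias = naive_mean - true_mean

# If logging does not depend on the reward, the complete-case mean is unbiased.
unbiased_mean = complete_case_mean(true_mean, Fraction(1, 2), Fraction(1, 2))

print(true_mean, naive_mean, bias)  # prints: 1/2 3/4 1/4
```

In this toy setting the complete-case mean overstates the true mean by 0.25, and any value estimate built on such rewards inherits the error.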

Key insights

A novel OPE method addresses Missing Not At Random rewards in RL by using a bridge function and future states as shadow variables.
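
For orientation, the standard shadow-variable conditions from the missing-data literature can be written for this setting as follows. The notation is an assumption, not the paper's: $S_t$ is the state, $A_t$ the action, $R_t$ the immediate reward, and $M_t = 1$ when $R_t$ is observed. The summary does not state the paper's exact assumptions, so this is one plausible reading with the next state $S_{t+1}$ as the shadow variable.

```latex
% (i) Relevance: the next state is associated with the reward
%     among transitions whose reward is observed.
S_{t+1} \not\perp R_t \mid (S_t, A_t, M_t = 1)

% (ii) Exclusion: given the reward, the next state carries no
%      further information about whether the reward is observed.
S_{t+1} \perp M_t \mid (S_t, A_t, R_t)

% Target of identification: the full-data conditional mean reward.
\bar{r}_t(s, a) = \mathbb{E}\left[ R_t \mid S_t = s, A_t = a \right]
```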

Principles

Method

Formalize a reward-dependent propensity model, use future states as shadow variables to identify the conditional mean reward, recover that mean with a bridge function estimated via a min-max procedure, then apply a Fitted-Q-Evaluation-style estimator to the recovered rewards.
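
The final step can be sketched as a tabular, finite-horizon Fitted Q Evaluation that regresses on recovered rewards instead of logged ones. This is a minimal sketch under stated assumptions, not the paper's estimator: the function name `fitted_q_evaluation` is hypothetical, states and actions are small integer indices, and `r_hat` is assumed to have been produced already by a separate bridge-function step that is not implemented here.

```python
def fitted_q_evaluation(transitions, r_hat, pi, n_states, n_actions, horizon):
    """Backward tabular FQE for a finite-horizon MDP.

    transitions: list of (t, s, a, s_next) tuples from the logged data.
    r_hat[t][s][a]: recovered conditional mean reward (assumed input).
    pi[t][s][a]: target-policy action probabilities.
    Returns q with q[t][s][a]; q[horizon] is all zeros.
    (s, a) cells never seen at step t keep the default value 0.0.
    """
    q = [[[0.0] * n_actions for _ in range(n_states)]
         for _ in range(horizon + 1)]
    for t in reversed(range(horizon)):
        # V_{t+1}(s') under the target policy; zero at the terminal step.
        if t + 1 < horizon:
            v_next = [sum(pi[t + 1][s][a] * q[t + 1][s][a]
                          for a in range(n_actions))
                      for s in range(n_states)]
        else:
            v_next = [0.0] * n_states
        sums, counts = {}, {}
        for (tt, s, a, s_next) in transitions:
            if tt != t:
                continue
            # Regression target: recovered reward plus next-step value.
            target = r_hat[t][s][a] + v_next[s_next]
            sums[(s, a)] = sums.get((s, a), 0.0) + target
            counts[(s, a)] = counts.get((s, a), 0) + 1
        # In the tabular case, least squares is the per-cell average.
        for (s, a), total in sums.items():
            q[t][s][a] = total / counts[(s, a)]
    return q


# Toy check: two states, two actions, horizon 2; the next state equals the action.
n_states, n_actions, horizon = 2, 2, 2
transitions = [(t, s, a, a) for t in range(horizon)
               for s in range(n_states) for a in range(n_actions)]
r_hat = [[[0.0, 1.0] for _ in range(n_states)] for _ in range(horizon)]
pi = [[[0.0, 1.0] for _ in range(n_states)] for _ in range(horizon)]  # always action 1
q = fitted_q_evaluation(transitions, r_hat, pi, n_states, n_actions, horizon)
value_s0 = sum(pi[0][0][a] * q[0][0][a] for a in range(n_actions))
print(value_s0)  # prints: 2.0
```

One way to let a target policy depend on past missingness indicators, again as an assumption about the setup, is to fold those indicators into the state index so that `pi[t][s][a]` can vary with them.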

Best for: AI Scientist, Machine Learning Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.