Prediction-Powered Causal Inference by Automatic Debiased Machine Learning and Semi-Supervised Riesz Regression

2026-06-12 · Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Mathematics & Computational Sciences · Depth: Expert, extended

Summary

This study introduces Prediction-Powered Causal Inference (PPCI), a framework for semiparametrically efficient estimation of causal and structural parameters in a semi-supervised setting. It uses unlabeled auxiliary regressors alongside labeled observations to achieve a smaller asymptotic variance than methods that use only labeled data. The paper derives the efficient influence function and the efficiency bound, and shows that unlabeled data reduce the regressor-averaging component of that bound. The proposed estimators, collectively called DML-PPCI (Debiased Machine Learning-PPCI), come in two forms: Estimating-Equation (EE-DML-PPCI) and Targeted Maximum Likelihood (TMLE-DML-PPCI). A key component is semi-supervised generalized Riesz regression for estimating the Riesz representer, with convergence rate guarantees for several function classes, including deep ReLU sieves.
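
The "regressor-averaging component" can be made concrete with a standard variance decomposition. The block below is an illustrative sketch, not the paper's stated bound: it assumes n labeled draws (X_i, Y_i), N unlabeled draws of X from the same regressor distribution, and known (oracle) nuisance functions. The symbols m, gamma_0, alpha_0, n and N are notation chosen for this sketch.

```latex
% Target parameter and outcome regression (notation assumed for this sketch)
\theta_0 = \mathbb{E}\big[m(X, \gamma_0)\big],
\qquad \gamma_0(x) = \mathbb{E}[Y \mid X = x]

% Orthogonal score with Riesz representer \alpha_0
\psi(X, Y) = m(X, \gamma_0) - \theta_0 + \alpha_0(X)\,\big(Y - \gamma_0(X)\big)

% Labeled data only: both components are averaged over n observations.
% The cross term vanishes because E[Y - \gamma_0(X) | X] = 0.
\begin{align*}
V_{\text{labeled}}
  &= \frac{\operatorname{Var}\big(m(X, \gamma_0)\big)}{n}
   + \frac{\mathbb{E}\big[\alpha_0(X)^2 \,(Y - \gamma_0(X))^2\big]}{n}
\end{align*}

% Pooled regressors: the first (regressor-averaging) term is averaged over
% n + N observations, while the residual term still needs outcomes.
\begin{align*}
V_{\text{pooled}}
  &= \frac{\operatorname{Var}\big(m(X, \gamma_0)\big)}{n + N}
   + \frac{\mathbb{E}\big[\alpha_0(X)^2 \,(Y - \gamma_0(X))^2\big]}{n}
\end{align*}
```

Under these assumptions, the gain from unlabeled data is largest when the regressor-averaging term dominates the residual term, and it disappears when m(X, gamma_0) is constant in X.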

Key takeaway

Data scientists and machine learning engineers working on causal inference should consider adding unlabeled auxiliary regressor data through the DML-PPCI framework. Doing so can lower the asymptotic variance of causal parameter estimates, such as the ATE or APE, relative to labeled-only estimation. Semi-supervised generalized Riesz regression is the component that puts the unlabeled data to use when estimating the Riesz representer, and the paper provides convergence rates for function classes that include deep ReLU networks.

Key insights

Unlabeled auxiliary regressors can reduce the asymptotic variance of causal estimators below what labeled data alone achieve.

Principles

Unlabeled data improves efficiency by reducing regressor-averaging noise.
Efficiency gains are possible even without outcome information in auxiliary data.
Neyman orthogonal scores enable robust estimation with machine learning nuisance functions.
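
The three principles above can be checked in a small Monte Carlo experiment. The sketch below is an illustration under strong assumptions, not the paper's estimator: it uses a toy randomized design with propensity 0.5, plugs in the true (oracle) outcome regression and Riesz representer, and assumes the unlabeled regressors come from the same distribution as the labeled ones. All function names and the data-generating process are hypothetical.

```python
import numpy as np


def simulate(n, N, rng):
    """Draw n labeled and N unlabeled rows from a toy randomized design."""
    z = rng.normal(size=n + N)
    d = rng.binomial(1, 0.5, size=n + N)
    # Heterogeneous effect tau(Z) = 1 + 2 Z, so the true ATE equals 1.
    y = 2.0 * z + d * (1.0 + 2.0 * z) + rng.normal(scale=0.5, size=n + N)
    labeled = np.arange(n + N) < n  # outcomes are used only on these rows
    return z, d, y, labeled


def oracle_estimates(z, d, y, labeled):
    """Return (labeled-only, pooled) ATE estimates using the true nuisances."""
    gamma = 2.0 * z + d * (1.0 + 2.0 * z)   # true E[Y | D, Z]
    m = 1.0 + 2.0 * z                       # gamma(1, Z) - gamma(0, Z)
    alpha = np.where(d == 1, 2.0, -2.0)     # D/e - (1 - D)/(1 - e) with e = 0.5
    # The residual correction needs outcomes, so it uses labeled rows only.
    correction = np.mean(alpha[labeled] * (y[labeled] - gamma[labeled]))
    labeled_only = np.mean(m[labeled]) + correction
    pooled = np.mean(m) + correction        # regressor average over all rows
    return labeled_only, pooled


def monte_carlo(n=200, N=2000, reps=2000, seed=0):
    """Repeat the experiment and return the mean and variance of each estimator."""
    rng = np.random.default_rng(seed)
    draws = np.array([oracle_estimates(*simulate(n, N, rng)) for _ in range(reps)])
    return draws.mean(axis=0), draws.var(axis=0)


means, variances = monte_carlo()
print("mean (labeled-only, pooled):", means)
print("variance (labeled-only, pooled):", variances)
```

In this design Var(m) = 4 and the residual term equals 1, so the usual variance decomposition implies roughly 0.025 for the labeled-only estimator and roughly 0.0068 for the pooled one at n = 200 and N = 2000. The simulated variances should land close to those values, and neither estimator uses any outcome from the auxiliary rows.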

Method

DML-PPCI combines the efficient influence function with debiased machine learning. The estimator is built either from estimating equations (EE-DML-PPCI) or by targeted maximum likelihood (TMLE-DML-PPCI), and in both cases the Riesz representer is estimated by semi-supervised generalized Riesz regression.
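
To see why Riesz regression can use unlabeled data at all, consider its squared-loss form for the ATE: the objective E[alpha(X)^2 - 2 m(X, alpha)] involves regressors only, never outcomes. The sketch below fits that objective with a small linear basis on pooled regressors. It is a simplified stand-in for the paper's generalized (and possibly deep ReLU) version; the basis, the `ridge` parameter, and the randomized toy design are assumptions made for illustration.

```python
import numpy as np


def basis(d, z):
    """Hypothetical linear sieve b(D, Z) with columns D, 1-D, D*Z, (1-D)*Z."""
    return np.column_stack([d, 1 - d, d * z, (1 - d) * z])


def fit_riesz(d_all, z_all, ridge=1e-8):
    """Squared-loss Riesz regression for the ATE functional.

    Minimizes the sample analogue of E[alpha(X)^2 - 2 m(X, alpha)] over
    alpha = b(X)' rho, where m(X, alpha) = alpha(1, Z) - alpha(0, Z).
    Only regressors enter, so labeled and unlabeled rows can be pooled.
    """
    b = basis(d_all, z_all)
    gram = b.T @ b / len(b)                       # sample E[b b']
    ones, zeros = np.ones_like(z_all), np.zeros_like(z_all)
    m_b = (basis(ones, z_all) - basis(zeros, z_all)).mean(axis=0)  # sample E[m(X, b)]
    # First-order condition: gram @ rho = m_b (tiny ridge for numerical stability).
    return np.linalg.solve(gram + ridge * np.eye(b.shape[1]), m_b)


# Pooled regressors from a toy randomized design with propensity 0.5.
rng = np.random.default_rng(0)
z_all = rng.normal(size=20_000)
d_all = rng.binomial(1, 0.5, size=20_000)

rho = fit_riesz(d_all, z_all)
alpha_hat = basis(d_all, z_all) @ rho             # fitted Riesz representer
print("rho:", rho)
```

With propensity 0.5 the true representer is 2 for treated units and -2 for controls, which lies in the span of this basis with coefficients (2, -2, 0, 0), so the fitted `rho` should be close to that vector. No propensity model is fitted explicitly; the representer is learned directly from the loss.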

In practice

Apply DML-PPCI for Average Treatment Effect (ATE) or Average Policy Effect (APE) estimation.
Utilize deep ReLU networks for Riesz representer estimation with convergence guarantees.
Choose between the two-sample and one-sample settings according to how the labeled and unlabeled data were generated.
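
As a rough end-to-end illustration of the estimating-equation route for the ATE, the sketch below cross-fits a linear outcome regression on labeled rows and a squared-loss Riesz regression on all rows, then averages the regression contrast over pooled regressors and the residual correction over labeled rows. This is not the authors' implementation: the linear nuisance models, the two-fold split, the function names, and the assumption that labeled and unlabeled regressors share one distribution are all choices made for this example.

```python
import numpy as np


def features(d, z):
    """Arm-specific intercepts and slopes: D, 1-D, D*Z, (1-D)*Z."""
    return np.column_stack([d, 1 - d, d * z, (1 - d) * z])


def contrast(z):
    """Features at D=1 minus features at D=0, i.e. m(X, b) for the ATE."""
    return features(np.ones_like(z), z) - features(np.zeros_like(z), z)


def ee_estimate(d, z, y, labeled, n_folds=2, seed=0):
    """Cross-fitted ATE estimate; y is read only on rows where labeled is True."""
    rng = np.random.default_rng(seed)
    fold = rng.integers(n_folds, size=len(d))
    m_hat = np.empty(len(d))
    correction = np.zeros(len(d))
    for k in range(n_folds):
        train, test = fold != k, fold == k
        # Outcome regression: least squares on labeled training rows only.
        tr_lab = train & labeled
        beta, *_ = np.linalg.lstsq(features(d[tr_lab], z[tr_lab]), y[tr_lab], rcond=None)
        # Riesz representer: squared-loss Riesz regression on ALL training rows.
        b = features(d[train], z[train])
        rho = np.linalg.solve(b.T @ b / len(b), contrast(z[train]).mean(axis=0))
        # Evaluate both nuisances on the held-out fold.
        m_hat[test] = contrast(z[test]) @ beta
        te_lab = test & labeled
        held_out = features(d[te_lab], z[te_lab])
        correction[te_lab] = (held_out @ rho) * (y[te_lab] - held_out @ beta)
    # Regressor average over all rows, residual correction over labeled rows.
    return m_hat.mean() + correction[labeled].mean()


# Toy data: 500 labeled and 5000 unlabeled rows, true ATE = 1.
rng = np.random.default_rng(1)
n, N = 500, 5000
z = rng.normal(size=n + N)
d = rng.binomial(1, 0.5, size=n + N)
y = 2.0 * z + d * (1.0 + 2.0 * z) + rng.normal(scale=0.5, size=n + N)
labeled = np.arange(n + N) < n
y[~labeled] = np.nan  # outcomes are unavailable for the auxiliary rows

theta_hat = ee_estimate(d, z, y, labeled)
print("ATE estimate:", theta_hat)
```

Setting the unlabeled outcomes to NaN makes it easy to confirm that no outcome from the auxiliary sample enters the estimate. In a two-sample setting where the auxiliary regressors follow a different distribution, this pooled average would not be appropriate as written.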

Topics

Causal Inference
Semi-Supervised Learning
Debiased Machine Learning
Riesz Regression
Semiparametric Efficiency
Average Treatment Effect
Deep ReLU Networks

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.