Data Fusion for High-Resolution Estimation

2026-06-12 · Source: stat.ML updates on arXiv.org · Field: Science & Research — Mathematics & Computational Sciences, Research Methodology & Innovation, Data Science & Analytics · Depth: Expert, extended

Summary

A new data fusion method, developed at Stanford University, improves high-resolution estimation of population health indicators by combining two complementary data sources: unbiased but low-resolution aggregated administrative data, and high-resolution but potentially biased individual-level online survey responses, such as those from the Household Pulse Survey. The method assumes a sampling bias model in which the log probability of response is linear in sufficient statistics of the observables and outcomes, which amounts to an exponential tilting. It learns the distribution closest (in KL divergence) to the online survey data that remains consistent with the administrative data and the bias model. On a testbed of public health indicators, the method reduced population-weighted mean absolute error (MAE) for state-level COVID-19 vaccination rates by 84% and for Medicaid enrollment by 75%, compared to using online survey data alone. Compared to using only aggregated administrative data, it reduced MAE by 59% and 25%, respectively, without degrading performance for SNAP enrollment.
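The headline numbers are percent decreases in population-weighted MAE. As a rough illustration of how such a figure is computed, the sketch below uses made-up values for three hypothetical states; the function names and all numbers are assumptions for illustration and are not taken from the paper or its repository.

```python
import numpy as np

def population_weighted_mae(estimates, truth, population):
    """Mean absolute error across regions, weighted by each region's population."""
    estimates = np.asarray(estimates, dtype=float)
    truth = np.asarray(truth, dtype=float)
    population = np.asarray(population, dtype=float)
    return float(np.average(np.abs(estimates - truth), weights=population))

def percent_decrease(baseline_error, new_error):
    """Relative error reduction of new_error versus baseline_error, in percent."""
    return 100.0 * (baseline_error - new_error) / baseline_error

# Made-up values for three hypothetical states (NOT results from the paper).
truth = [0.60, 0.70, 0.50]        # ground-truth rates
survey_only = [0.70, 0.80, 0.70]  # estimates from a biased survey
fused = [0.62, 0.71, 0.54]        # estimates after a hypothetical correction
population = [1.0, 3.0, 6.0]      # state populations, in millions

mae_survey = population_weighted_mae(survey_only, truth, population)
mae_fused = population_weighted_mae(fused, truth, population)
reduction = percent_decrease(mae_survey, mae_fused)
print(f"survey MAE={mae_survey:.3f}, fused MAE={mae_fused:.3f}, decrease={reduction:.1f}%")
```

Weighting by population means that errors in large states dominate the metric, so a method can score well while still being inaccurate for small states.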

Key takeaway

For public health decision makers and data scientists who need accurate, high-resolution estimates from imperfect data, this data fusion method offers a way to combine sources with complementary weaknesses. Relying solely on biased online surveys or on low-resolution administrative data risks significant inaccuracies in granular population health indicators. The framework combines the two, using an exponential tilting model to correct for sampling bias, and in the reported testbed it produced substantially more accurate state-level estimates of COVID-19 vaccination rates and Medicaid enrollment.

Key insights

Fusing biased high-resolution surveys with unbiased low-resolution aggregate data improves granular estimates by modeling sampling bias as exponential tilting.

Principles

Sampling bias can be modeled as conditional exponential tilting.
Combining complementary data sources yields superior estimates.
KL divergence minimization selects plausible population distributions.
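These principles can be written out in a simplified form. The notation below is an assumption made for illustration (the summary does not give the paper's symbols): $p$ is the population distribution over observables $X$ and outcome $Y$, $q$ is the online survey distribution, $R$ indicates response, $T(x, y)$ is the vector of sufficient statistics, $h$ collects the quantities that the administrative data report in aggregate, and $m$ is their reported value. The direction of the KL divergence is also an assumption.

```latex
% Sampling bias model: log response probability is linear in T(x, y)
\log P(R = 1 \mid X = x, Y = y) = \theta^{\top} T(x, y) + c

% Consequence: the survey distribution is an exponential tilt of the population
q(x, y) = \frac{p(x, y)\, \exp\{\theta^{\top} T(x, y)\}}
               {\mathbb{E}_{p}\!\left[\exp\{\theta^{\top} T(X, Y)\}\right]}

% Fusion: among distributions consistent with the aggregates and the bias model,
% pick the one closest to the survey distribution
\hat{p} = \arg\min_{p}\; \mathrm{KL}(p \,\|\, q)
\quad \text{s.t.} \quad
\mathbb{E}_{p}[h(X, Y)] = m,
\qquad
p(x, y) \propto q(x, y)\, \exp\{-\theta^{\top} T(x, y)\} \ \text{for some } \theta
```

Under this reading, undoing the bias amounts to reweighting survey respondents by $\exp\{-\theta^{\top} T(x, y)\}$, with $\theta$ pinned down by the administrative aggregates.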

Method

The method learns a population distribution by minimizing KL divergence to the online survey distribution, subject to two constraints: agreement with the aggregated administrative data and an exponential tilting model of sampling bias. The resulting problem is solved via moment matching and one-step estimation.
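A minimal sketch of the moment-matching idea on a toy discrete problem follows. It is not the authors' implementation and does not cover the one-step estimation stage; the function `tilt_to_moments`, the eight-respondent sample, and the 0.5 national rate are all hypothetical. It reweights survey respondents with weights proportional to $\exp(\lambda^{\top} T)$ so that the weighted mean of the statistic matches a known aggregate, which is the minimum-KL reweighting under that moment constraint, and then reads off finer-grained estimates from the reweighted sample.

```python
import numpy as np

def tilt_to_moments(T, q, target, n_iter=50, tol=1e-8):
    """Find weights w proportional to q * exp(T @ lam) with weighted mean of T equal to target.

    Solves the convex dual of the minimum-KL problem with damped Newton steps.
    """
    T = np.asarray(T, dtype=float)
    q = np.asarray(q, dtype=float) / np.sum(q)
    target = np.asarray(target, dtype=float)
    lam = np.zeros(T.shape[1])

    def weights(lam):
        z = T @ lam
        w = q * np.exp(z - z.max())  # subtract max for numerical stability
        return w / w.sum()

    def dual(lam):
        z = T @ lam
        return z.max() + np.log(np.sum(q * np.exp(z - z.max()))) - lam @ target

    for _ in range(n_iter):
        w = weights(lam)
        mean = w @ T
        grad = mean - target                    # gradient of the dual
        if np.max(np.abs(grad)) < tol:
            break
        centered = T - mean
        hess = (centered * w[:, None]).T @ centered  # weighted covariance of T
        step = np.linalg.solve(hess + 1e-12 * np.eye(len(lam)), grad)
        t = 1.0
        for _ in range(30):                     # bounded backtracking line search
            if dual(lam - t * step) <= dual(lam) - 1e-4 * t * (grad @ step):
                break
            t *= 0.5
        lam = lam - t * step
    return weights(lam), lam

# Hypothetical survey: 8 respondents in 2 states, binary outcome y.
state = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y = np.array([1, 1, 1, 0, 1, 1, 0, 0], dtype=float)
q = np.full(len(y), 1.0 / len(y))   # raw survey weights
national_rate = 0.5                 # hypothetical administrative aggregate

w, lam = tilt_to_moments(y[:, None], q, [national_rate])

survey_by_state = np.array([y[state == s].mean() for s in (0, 1)])
fused_by_state = np.array(
    [np.sum(w[state == s] * y[state == s]) / np.sum(w[state == s]) for s in (0, 1)]
)
print("survey-only:", survey_by_state, "fused:", fused_by_state)
```

In this toy the survey over-represents respondents with a positive outcome (raw mean 0.625 against an aggregate of 0.5), so the tilt down-weights them and both state-level estimates move accordingly. The real method's choice of sufficient statistics and constraints is richer than a single national mean.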

In practice

Enhance state-level public health indicator accuracy.
Adjust for "missing not at random" bias in online surveys.
Integrate diverse data for robust subgroup analysis.

Topics

Data Fusion
High-Resolution Estimation
Sampling Bias
Public Health Indicators
Online Surveys
Exponential Tilting
Statistical Inference

Code references

roshni714/data_fusion

Best for: AI Scientist, Research Scientist, Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.