Data Fusion for High-Resolution Estimation
Summary
A new data fusion method, developed by researchers at Stanford University, improves high-resolution estimation of population health indicators by combining two distinct data sources: unbiased but low-resolution aggregated administrative data, and high-resolution but potentially biased individual-level online survey responses, such as those from the Household Pulse Survey. The method assumes a model of sampling bias in which the log probability of response is linear in sufficient statistics of observables and outcomes, which amounts to exponential tilting. It learns the distribution closest (in KL divergence) to the online survey distribution that remains consistent with both the administrative data and the bias model. On a testbed of public health indicators, the method reduced population-weighted mean absolute error (MAE) by 84% for state-level COVID-19 vaccination rates and by 75% for Medicaid enrollment, compared with using online survey data alone. Compared with using only aggregated administrative data, it reduced MAE by 59% and 25%, respectively, and it did not degrade performance for SNAP enrollment.
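The headline numbers above are relative decreases in population-weighted MAE. As a rough illustration of how such a metric could be computed, here is a minimal sketch. It assumes the weights are region populations, and the function names and toy numbers are hypothetical and not taken from the paper.

```python
import numpy as np

def population_weighted_mae(estimates, truth, population):
    """Mean absolute error across regions, weighted by each region's population."""
    estimates = np.asarray(estimates, dtype=float)
    truth = np.asarray(truth, dtype=float)
    weights = np.asarray(population, dtype=float)
    return float(np.sum(weights * np.abs(estimates - truth)) / np.sum(weights))

def relative_decrease(baseline_mae, new_mae):
    """Fractional reduction in MAE relative to a baseline (0.84 means an 84% decrease)."""
    return (baseline_mae - new_mae) / baseline_mae

# Toy numbers, invented for illustration only (not results from the paper).
truth = [0.50, 0.60, 0.70]
population = [1_000_000, 3_000_000, 6_000_000]
survey_only = [0.60, 0.70, 0.80]  # every region overestimated by 10 points
fused = [0.52, 0.61, 0.71]        # smaller errors after a hypothetical correction

mae_survey = population_weighted_mae(survey_only, truth, population)
mae_fused = population_weighted_mae(fused, truth, population)
print(mae_survey, mae_fused, relative_decrease(mae_survey, mae_fused))
```

Weighting by population means that errors in large states count for more than errors in small ones, so the metric reflects the error experienced by a typical person rather than a typical state.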
Key takeaway
For public health decision makers and data scientists who need accurate, high-resolution estimates from imperfect data, this data fusion method offers a practical option. Relying solely on biased online surveys, or solely on low-resolution administrative data, risks significant errors in granular population health indicators. Combining the two sources under the exponential tilting model corrects for sampling bias and, on the reported testbed, yields substantially more accurate state-level estimates.
Key insights
Fusing biased high-resolution surveys with unbiased low-resolution aggregate data improves granular estimates by modeling sampling bias as exponential tilting.
Principles
- Sampling bias can be modeled as conditional exponential tilting.
- Combining complementary data sources yields superior estimates.
- KL divergence minimization selects plausible population distributions.
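The first and third principles can be written down compactly. The following is one possible formalization, not the paper's own notation: the symbols $x$ (observables), $y$ (outcome), $R$ (response indicator), $T$ (sufficient statistics), $a(x)$ (an arbitrary function of observables), $g$ and $m$ (aggregate statistic and its administrative value) are all assumptions introduced here, and the argument order of the KL divergence is assumed as well.

```latex
% Assumed notation: p = population distribution, q = survey distribution.
% Bias model: log response probability is linear in T(x, y), up to a term in x alone.
\log \Pr(R = 1 \mid x, y) = a(x) + \theta^\top T(x, y)

% Consequence: conditional on x, the survey distribution is an exponential tilt of p.
q(y \mid x) = \frac{p(y \mid x)\,\exp\{\theta^\top T(x, y)\}}
                   {\sum_{y'} p(y' \mid x)\,\exp\{\theta^\top T(x, y')\}}

% Selection principle (argument order assumed): among distributions consistent with
% the bias model and the aggregate data, pick the one closest to q.
\hat{p} = \arg\min_{p} \; D_{\mathrm{KL}}(p \,\|\, q)
\quad \text{subject to} \quad \mathbb{E}_{p}[g(x, y)] = m
```

The term $a(x)$ cancels in the conditional distribution, which is why observables alone can affect response arbitrarily without changing the tilt.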
Method
The method learns a population distribution by minimizing KL divergence to the online survey distribution, subject to consistency with the aggregated administrative data and with the exponential tilting model of sampling bias. The resulting problem is solved via moment matching and one-step estimation.
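To make the moment-matching idea concrete, here is a deliberately simplified sketch. It reweights survey respondents with weights proportional to exp(theta · T) and solves for theta by Newton's method so that the weighted mean of T equals a known aggregate value. This is a standard textbook exponential tilting calculation, not the paper's estimator: it ignores the conditional structure on observables and the one-step estimation stage, and the function name `tilt_to_moments` and the toy data are hypothetical.

```python
import numpy as np

def tilt_to_moments(stats, target, max_iter=50, tol=1e-10):
    """Find weights w_i proportional to exp(theta @ stats[i]) whose weighted
    mean of `stats` equals `target`. Returns (weights, theta)."""
    stats = np.asarray(stats, dtype=float)    # shape (n, d)
    target = np.asarray(target, dtype=float)  # shape (d,)
    theta = np.zeros(stats.shape[1])
    for _ in range(max_iter):
        logits = stats @ theta
        w = np.exp(logits - logits.max())     # subtract max for numerical stability
        w /= w.sum()
        mean = w @ stats
        grad = mean - target                  # moment mismatch under current tilt
        if np.max(np.abs(grad)) < tol:
            break
        centered = stats - mean
        hess = (centered * w[:, None]).T @ centered  # weighted covariance of stats
        theta -= np.linalg.solve(hess + 1e-12 * np.eye(theta.size), grad)
    return w, theta

# Toy example (invented): 3 of 4 respondents report the outcome, but the
# administrative aggregate says the true population rate is 0.5.
y = np.array([1.0, 1.0, 1.0, 0.0])
weights, theta = tilt_to_moments(y[:, None], [0.5])
print(weights, theta)
```

Among all reweightings of the sample that satisfy the moment constraint, weights of this exponential form are the ones that stay closest to the original sample in KL divergence, which is why tilting and KL minimization appear together in the method description.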
In practice
- Enhance state-level public health indicator accuracy.
- Adjust for "missing not at random" bias in online surveys.
- Integrate diverse data for robust subgroup analysis.
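For the state-level and subgroup uses listed above, a bias-corrected estimate typically reduces to a weighted mean within each group once per-respondent weights are available. The sketch below assumes such weights already exist (for example, from an exponential tilting fit); the helper name `weighted_group_means` and the toy data are hypothetical.

```python
import numpy as np

def weighted_group_means(groups, outcomes, weights):
    """Weighted mean of `outcomes` within each group label."""
    groups = np.asarray(groups)
    outcomes = np.asarray(outcomes, dtype=float)
    weights = np.asarray(weights, dtype=float)
    labels, idx = np.unique(groups, return_inverse=True)  # idx maps rows to labels
    numer = np.bincount(idx, weights=weights * outcomes)  # sum of w * y per group
    denom = np.bincount(idx, weights=weights)             # sum of w per group
    return dict(zip(labels.tolist(), (numer / denom).tolist()))

# Toy data, invented for illustration only.
states = ["CA", "CA", "TX", "TX"]
vaccinated = [1, 0, 1, 1]
w = [0.1, 0.3, 0.2, 0.4]  # hypothetical bias-correction weights

estimates = weighted_group_means(states, vaccinated, w)
print(estimates)
```

Because the weights are normalized within each group, they only need to be correct up to a constant factor.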
Topics
- Data Fusion
- High-Resolution Estimation
- Sampling Bias
- Public Health Indicators
- Online Surveys
- Exponential Tilting
- Statistical Inference
Best for: AI Scientist, Research Scientist, Data Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.