In-Context Learning for the Imputation of Public Opinion Data with Large Language Models
Summary
A new study proposes using in-context learning (ICL) with large language models (LLMs) to impute missing public opinion data, addressing the common problem of partial non-response in surveys. The research systematically evaluates ICL design choices under three missingness mechanisms, missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR), using 150 opinion variables from 15 waves of the American Trends Panel. The ICL approach consistently reduces absolute error compared with established statistical methods such as MICE PMM (multiple imputation by chained equations with predictive mean matching), with the largest gains under MNAR. The best-performing configuration, gpt-oss-120b with 100 in-context examples, achieves near-nominal aggregate coverage (approaching the 95% level) with confidence intervals two to five times narrower than those of MICE PMM. A Python package with an sklearn-like API is also released for easy deployment.
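The two interval metrics mentioned in the summary, coverage and interval width, can be computed as in the minimal sketch below. It assumes you already have lower and upper interval bounds and the true aggregate values for a set of variables. The function name `coverage_and_width` is illustrative and is not taken from the study or its package.

```python
import numpy as np

def coverage_and_width(lower, upper, truth):
    """Return (share of intervals containing the truth, mean interval width)."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    truth = np.asarray(truth, dtype=float)
    # An interval "covers" when the true value lies inside it (bounds inclusive).
    covered = (lower <= truth) & (truth <= upper)
    return float(covered.mean()), float((upper - lower).mean())
```

For nominal 95% intervals, coverage close to 0.95 is the target; at equal coverage, a smaller mean width indicates more precise estimates.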
Key takeaway
Data scientists and survey researchers who impute missing values in public opinion datasets should consider in-context learning (ICL) with LLMs. In the study, this approach yields lower absolute error than MICE PMM, particularly under non-random missingness, and confidence intervals two to five times narrower. The released Python package offers a straightforward way to deploy the method.
Key insights
In-context learning with LLMs consistently reduces absolute imputation error on missing public opinion data relative to established statistical methods such as MICE PMM.
Principles
- Imputation fundamentally differs from prediction.
- ICL can reduce imputation error, especially for non-random missingness.
- Systematic evaluation of ICL design choices is crucial.
Method
The study imputes missing survey responses with in-context learning and systematically evaluates ICL design choices under MCAR, MAR, and MNAR missingness mechanisms.
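To make the three mechanisms concrete, here is a minimal sketch of how missingness masks could be simulated on complete data before testing an imputer. It is not the study's procedure: the logistic link, the single covariate `x`, and the names `make_missing_mask` and `_scaled_prob` are assumptions chosen for illustration.

```python
import numpy as np

def _scaled_prob(v, rate):
    """Logistic of standardized v, rescaled so the average probability is about `rate`."""
    z = (v - v.mean()) / (v.std() + 1e-12)
    p = 1.0 / (1.0 + np.exp(-z))
    return np.clip(p * rate / p.mean(), 0.0, 1.0)

def make_missing_mask(y, x, mechanism, rate=0.3, seed=0):
    """Return a boolean mask (True = missing) for the target variable y.

    y: values of the variable that will be masked
    x: a fully observed covariate
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    if mechanism == "MCAR":
        # Missingness is unrelated to any data.
        p = np.full(y.shape, rate)
    elif mechanism == "MAR":
        # Missingness depends only on the observed covariate.
        p = _scaled_prob(x, rate)
    elif mechanism == "MNAR":
        # Missingness depends on the value that goes missing.
        p = _scaled_prob(y, rate)
    else:
        raise ValueError(f"unknown mechanism: {mechanism!r}")
    return rng.random(y.shape) < p
```

Under the MNAR branch, respondents with higher values of `y` are more likely to be masked, so the observed values are a biased sample of the full variable.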
In practice
- Use gpt-oss-120b with 100 in-context examples, the best-performing configuration in the study.
- Deploy the method via the provided Python package with an sklearn-like API.
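The released package's actual interface is not described here, so the sketch below does not use it. It only illustrates the general idea of in-context imputation: respondents who answered a question serve as examples in a prompt, and the model is asked to complete the answer for a respondent who did not. The prompt layout and the names `build_icl_prompt` and `format_profile` are hypothetical.

```python
import random

def format_profile(profile):
    """Render a respondent's observed attributes as 'key=value' pairs."""
    return ", ".join(f"{key}={value}" for key, value in sorted(profile.items()))

def build_icl_prompt(question, options, target_profile, donors, k=100, seed=0):
    """Assemble a few-shot prompt from up to k respondents who answered the question.

    donors: list of dicts with keys 'profile' (dict) and 'answer' (str)
    """
    rng = random.Random(seed)
    shots = rng.sample(donors, min(k, len(donors)))
    lines = [f"Question: {question}", "Options: " + "; ".join(options), ""]
    for donor in shots:
        lines.append(f"Respondent: {format_profile(donor['profile'])}")
        lines.append(f"Answer: {donor['answer']}")
        lines.append("")
    # The final entry has no answer; the LLM is expected to complete it.
    lines.append(f"Respondent: {format_profile(target_profile)}")
    lines.append("Answer:")
    return "\n".join(lines)
```

The resulting string would be sent to an LLM, and the returned option would be recorded as the imputed value. The default of `k=100` mirrors the number of in-context examples in the study's best-performing configuration.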
Topics
- In-Context Learning
- Data Imputation
- Large Language Models
- Public Opinion Data
- Missing Data Mechanisms
- Python Package
Best for: AI Scientist, Data Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.