The threat of analytic flexibility in using large language models to simulate human data
Summary
A new study investigates the impact of "analytic flexibility" when using large language models (LLMs) to generate "silicon samples"—synthetic datasets intended to mimic human responses in social science research. The research, conducted across two studies, reveals that various analytic choices, such as model selection, sampling parameters, and prompt format, significantly affect the correspondence between silicon samples and actual human data. Study 1 generated 252 silicon-sample configurations for a controlled case study using two social-psychological scales, finding substantial variation in participant rankings, response distributions, and between-scale correlations. Study 2 re-examined Argyle et al.'s (2023) Study 3 with 66 alternative configurations, observing that correlations between human and silicon association structures ranged widely from r = .23 to r = .84. These findings underscore that seemingly defensible configuration choices can materially alter conclusions regarding the fidelity of silicon samples.
Key takeaway
Social scientists building "silicon samples" with LLMs should test and report how their analytic choices affect the results. Different configurations can substantially change how closely synthetic data match human data, which can lead to misleading conclusions. To mitigate this, consider pre-registering configuration choices and running sensitivity analyses across a range of parameters to check that findings are robust.
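As a rough illustration of the sensitivity analysis suggested here, the sketch below groups a fidelity score by each configuration factor and reports its spread. The factor names, model labels, and scores are made up for the example and are not results from the study.

```python
from collections import defaultdict
from statistics import median

# Made-up fidelity scores (e.g. a human-silicon correlation), for
# illustration only; they are not results from the study.
results = [
    {"model": "model-a", "temperature": 0.0, "fidelity": 0.61},
    {"model": "model-a", "temperature": 1.0, "fidelity": 0.48},
    {"model": "model-b", "temperature": 0.0, "fidelity": 0.35},
    {"model": "model-b", "temperature": 1.0, "fidelity": 0.72},
]


def spread_by_factor(rows, factor, score="fidelity"):
    """Group scores by one factor's levels and report min, median, max."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[factor]].append(row[score])
    return {
        level: {"min": min(v), "median": median(v), "max": max(v)}
        for level, v in groups.items()
    }


overall = [r["fidelity"] for r in results]
print("overall range:", min(overall), "to", max(overall))
for factor in ("model", "temperature"):
    print(factor, spread_by_factor(results, factor))
```

Reporting the full range alongside any single headline number makes it visible when a conclusion depends on one particular configuration.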
Key insights
Analytic choices made when generating synthetic data with LLMs can substantially alter its fidelity to human responses.
Principles
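- Configuration choices (model, sampling parameters, prompt format) affect silicon-sample fidelity.
- Fidelity is assessed on several criteria (participant rankings, response distributions, between-scale correlations), and results can vary across them.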
Method
The two studies generated 252 and 66 silicon-sample configurations, respectively, varying choices such as the model, sampling parameters, and prompt format, and then evaluated each configuration's correspondence to human data on multiple criteria.
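A configuration grid of this kind can be enumerated as a full factorial design. The sketch below is a minimal example under assumed factor levels: the model labels, temperatures, and prompt formats are hypothetical placeholders, and the grid does not reproduce the study's 252 or 66 configurations.

```python
from itertools import product

# Hypothetical factor levels; illustrative placeholders only,
# not the models or settings used in the study.
FACTORS = {
    "model": ["model-a", "model-b", "model-c"],
    "temperature": [0.0, 0.7, 1.0],
    "prompt_format": ["interview", "survey", "persona"],
}


def build_configurations(factors):
    """Return one dict per combination of factor levels (full factorial grid)."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


configs = build_configurations(FACTORS)
print(len(configs))  # 3 * 3 * 3 = 27 combinations
print(configs[0])
```

Each dict can then be passed to whatever generation routine produces a silicon sample, so that every configuration is run and scored in the same way.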
In practice
- Test multiple LLM configurations when generating synthetic data.
- Evaluate silicon samples on several fidelity metrics rather than a single one.
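The points above can be sketched as a small fidelity report that scores one silicon sample against human data on three criteria. This is an assumption-laden illustration: Spearman correlation, total variation distance, and an absolute correlation gap are stand-in metrics chosen for the example, not necessarily those used in the study, and the responses are toy data.

```python
from collections import Counter


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def ranks(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def total_variation(a, b, categories):
    """Total variation distance between two response distributions (0 to 1)."""
    ca, cb = Counter(a), Counter(b)
    return 0.5 * sum(abs(ca[c] / len(a) - cb[c] / len(b)) for c in categories)


def fidelity_report(human_a, human_b, silicon_a, silicon_b, categories):
    """Score a silicon sample on three separate criteria (assumed metrics)."""
    return {
        # Do participants keep the same relative order on scale A?
        "rank_agreement": spearman(human_a, silicon_a),
        # How far apart are the response distributions on scale A?
        "distribution_gap": total_variation(human_a, silicon_a, categories),
        # Does the correlation between scales A and B match the human one?
        "correlation_gap": abs(pearson(human_a, human_b) - pearson(silicon_a, silicon_b)),
    }


# Toy 5-point responses for six participants; made up for illustration.
human_a = [1, 2, 3, 4, 5, 3]
human_b = [2, 1, 4, 3, 5, 3]
silicon_a = [2, 2, 3, 4, 4, 3]
silicon_b = [3, 2, 3, 4, 4, 2]

print(fidelity_report(human_a, human_b, silicon_a, silicon_b, range(1, 6)))
```

Keeping the criteria as separate entries, rather than averaging them into one score, makes it easier to see when a configuration looks good on one criterion and poor on another.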
Topics
- Large Language Models
- Silicon Samples
- Analytic Flexibility
- Synthetic Data
- Social Science Research
Best for: Research Scientist, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.