The threat of analytic flexibility in using large language models to simulate human data

2025-09-16 · Source: cs.AI updates on arXiv.org · Field: Science & Research — Social Sciences & Behavioral Studies, Research Methodology & Innovation · Depth: Intermediate, quick

Summary

A new study investigates the impact of "analytic flexibility" when using large language models (LLMs) to generate "silicon samples"—synthetic datasets intended to mimic human responses in social science research. The research, conducted across two studies, reveals that various analytic choices, such as model selection, sampling parameters, and prompt format, significantly affect the correspondence between silicon samples and actual human data. Study 1 generated 252 silicon-sample configurations for a controlled case study using two social-psychological scales, finding substantial variation in participant rankings, response distributions, and between-scale correlations. Study 2 re-examined Argyle et al.'s (2023) Study 3 with 66 alternative configurations, observing that correlations between human and silicon association structures ranged widely from r = .23 to r = .84. These findings underscore that seemingly defensible configuration choices can materially alter conclusions regarding the fidelity of silicon samples.

Key takeaway

Social scientists who build "silicon samples" with LLMs should rigorously test and report the impact of their analytic choices. Different configurations can drastically alter the fidelity of synthetic data and lead to misleading conclusions. To mitigate this, consider pre-registering configuration choices and running sensitivity analyses across a range of parameters to check that findings are robust.

Key insights

Analytic choices made when generating synthetic data with LLMs substantially alter how closely that data matches human responses.

Principles

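Configuration choices (model, sampling parameters, prompt format) affect silicon sample fidelity.
Fidelity depends on the criterion measured: participant rankings, response distributions, and between-scale correlations all varied across configurations.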

Method

Across two studies, the authors generated 252 and 66 silicon-sample configurations, varying choices such as model, sampling parameters, and prompt format, then evaluated how closely each configuration matched human data on multiple criteria.
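
As a rough illustration of what enumerating configurations can look like, the sketch below crosses a few analytic choices into a full grid. The option lists (`MODELS`, `TEMPERATURES`, `PROMPT_FORMATS`) and the helper `build_configurations` are hypothetical placeholders, not the factors or levels used in the study.

```python
from itertools import product

# Hypothetical option lists; the study's actual factors and levels may differ.
MODELS = ["model-a", "model-b"]
TEMPERATURES = [0.0, 0.7, 1.0]
PROMPT_FORMATS = ["first_person", "interview"]


def build_configurations(models, temperatures, prompt_formats):
    """Return every combination of the analytic choices as a list of dicts."""
    return [
        {"model": m, "temperature": t, "prompt_format": p}
        for m, t, p in product(models, temperatures, prompt_formats)
    ]


configs = build_configurations(MODELS, TEMPERATURES, PROMPT_FORMATS)
print(len(configs))  # 2 models x 3 temperatures x 2 prompt formats = 12
```

Each dictionary would then drive one silicon sample, so that every defensible combination is generated and reported rather than a single hand-picked one.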

In practice

Test multiple LLM configurations before relying on synthetic data.
Evaluate silicon samples against human data on several metrics, not just one.
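
One way to act on these two points is to score every silicon sample on several criteria at once. The sketch below is a minimal example under stated assumptions: the three metrics (Spearman rank agreement, total variation distance between response distributions, and the gap in between-scale Pearson correlations) are illustrative stand-ins for the criteria named in the summary, not the study's exact measures, and all function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_agreement(human, silicon):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(human), average_ranks(silicon))


def distribution_distance(human, silicon):
    """Total variation distance between response distributions (0 = identical)."""
    h, s = Counter(human), Counter(silicon)
    categories = set(h) | set(s)
    return 0.5 * sum(
        abs(h[c] / len(human) - s[c] / len(silicon)) for c in categories
    )


def evaluate(human_a, human_b, silicon_a, silicon_b):
    """Score one silicon sample against human data on three criteria."""
    return {
        "rank_agreement": rank_agreement(human_a, silicon_a),
        "distribution_distance": distribution_distance(human_a, silicon_a),
        "correlation_gap": abs(
            pearson(human_a, human_b) - pearson(silicon_a, silicon_b)
        ),
    }
```

Calling `evaluate` once per configuration and reporting the minimum and maximum of each score makes the spread across configurations visible, in the same spirit as the r = .23 to r = .84 range reported for Study 2.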

Topics

Large Language Models
Silicon Samples
Analytic Flexibility
Synthetic Data
Social Science Research

Best for: Research Scientist, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.