Evaluating LLMs as Human Surrogates in Controlled Experiments
Summary
A study by Adnan Hoq and Tim Weninger from the University of Notre Dame evaluates the efficacy of off-the-shelf Large Language Models (LLMs) as human surrogates in controlled behavioral experiments. The research directly compares LLM-generated responses with human data from a canonical survey experiment on accuracy perception of political news headlines. Each human observation is converted into a structured prompt, and LLMs (Llama 3.2:3B, Gemma 2:9B, and GPT-5.2) generate a single 0-10 outcome variable without task-specific training. The study applies identical statistical analyses to both human and synthetic responses, finding that LLMs reproduce several directional effects observed in humans, such as political alignment and credibility feedback shifts. However, effect magnitudes and moderation patterns vary significantly across models, with GPT-5.2 most closely matching human-scale effects, while Gemma and Llama show attenuated or exaggerated responses.
Key takeaway
AI scientists and research scientists considering LLMs for behavioral simulation should validate each specific hypothesis and model empirically against human benchmarks. LLMs can reproduce the direction of behavioral effects, but how well they capture effect magnitudes and moderation patterns varies considerably from model to model. Use LLM surrogates for initial hypothesis screening or exploratory design, and rely on human data to estimate realistic behavioral effect sizes and draw substantive conclusions.
Key insights
LLMs can reproduce directional behavioral effects but often miscalibrate effect magnitudes compared to human data.
Principles
- Surrogate validity requires empirical verification for each hypothesis and model.
- Directional agreement alone is insufficient for LLM surrogate validity.
Method
Convert human observations into structured prompts for LLMs, generate single outcome variables without task-specific training, and apply identical statistical analyses to both human and synthetic data to compare experimental inferences.
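The first two steps of the method (prompt construction and single-value output) might look roughly like the sketch below. It is illustrative only: the `Observation` fields, the prompt wording, and the `build_prompt` and `parse_rating` helpers are all assumptions, not the study's actual template or code, and the call to the model itself is left out.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    # Hypothetical fields; the study's actual covariates may differ.
    participant_party: str      # e.g. "Democrat"
    headline: str
    headline_lean: str          # e.g. "pro-Republican"
    credibility_feedback: str   # e.g. "rated credible" or "none"


def build_prompt(obs: Observation) -> str:
    """Render one human observation as a structured prompt (assumed wording)."""
    return (
        "You are a survey respondent.\n"
        f"Party identification: {obs.participant_party}\n"
        f"Headline: {obs.headline}\n"
        f"Headline political lean: {obs.headline_lean}\n"
        f"Credibility feedback shown: {obs.credibility_feedback}\n"
        "On a scale from 0 (not at all accurate) to 10 (completely accurate), "
        "how accurate is this headline? Answer with a single integer."
    )


def parse_rating(reply: str) -> Optional[int]:
    """Return the first integer in 0..10 found in a model reply, else None."""
    for token in re.findall(r"\d+", reply):
        value = int(token)
        if 0 <= value <= 10:
            return value
    return None
```

Returning `None` for unparseable replies, rather than a default score, keeps malformed outputs from silently entering the synthetic dataset; how such cases were handled in the study is not stated in this summary.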
In practice
- Use LLMs for rapid hypothesis screening or exploratory design.
- Calibrate LLM outputs against human data for effect magnitude estimation.
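As a rough illustration of the calibration check, the sketch below runs the same estimator on human and synthetic rows and reports both direction agreement and a magnitude ratio. The simple difference in means stands in for whatever models the study actually fit, and the row keys (`aligned`, `rating`), the function names, and the toy numbers are hypothetical.

```python
from statistics import mean


def alignment_effect(rows):
    """Difference in mean rating: aligned minus misaligned headlines."""
    aligned = [r["rating"] for r in rows if r["aligned"]]
    misaligned = [r["rating"] for r in rows if not r["aligned"]]
    return mean(aligned) - mean(misaligned)


def compare_effects(human_rows, synthetic_rows):
    """Apply the identical estimator to both datasets and compare results."""
    human = alignment_effect(human_rows)
    synthetic = alignment_effect(synthetic_rows)
    return {
        "human_effect": human,
        "synthetic_effect": synthetic,
        # Directional agreement: both effects have the same sign.
        "same_direction": human * synthetic > 0,
        # Ratio > 1 means exaggerated, < 1 means attenuated (undefined if human == 0).
        "magnitude_ratio": synthetic / human if human != 0 else None,
    }


# Made-up numbers for illustration only; not results from the study.
human_rows = [
    {"aligned": True, "rating": 7}, {"aligned": True, "rating": 6},
    {"aligned": False, "rating": 4}, {"aligned": False, "rating": 5},
]
synthetic_rows = [
    {"aligned": True, "rating": 9}, {"aligned": True, "rating": 9},
    {"aligned": False, "rating": 3}, {"aligned": False, "rating": 3},
]
result = compare_effects(human_rows, synthetic_rows)
# same_direction is True, magnitude_ratio is 3.0 (an exaggerated effect)
```

In this toy case the surrogate passes a direction-only check while overstating the effect threefold, which is the failure mode that a direction-only validation would miss.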
Topics
- Large Language Models
- Human Behavioral Simulation
- Experimental Inference
- News Accuracy Judgments
- AI Credibility Feedback
Best for: AI Scientist, Research Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.