Evaluating LLMs as Human Surrogates in Controlled Experiments

2026-04-20 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Social Sciences & Behavioral Studies · Depth: Expert, extended

Summary

A study by Adnan Hoq and Tim Weninger of the University of Notre Dame evaluates how well off-the-shelf Large Language Models (LLMs) serve as human surrogates in controlled behavioral experiments. The research compares LLM-generated responses directly with human data from a canonical survey experiment on the perceived accuracy of political news headlines. Each human observation is converted into a structured prompt, and each LLM (Llama 3.2:3B, Gemma 2:9B, and GPT-5.2) generates a single 0-10 outcome variable without task-specific training. The study applies identical statistical analyses to the human and synthetic responses and finds that LLMs reproduce several directional effects observed in humans, such as the shifts associated with political alignment and credibility feedback. Effect magnitudes and moderation patterns, however, vary substantially across models: GPT-5.2 most closely matches human-scale effects, while Gemma and Llama show attenuated or exaggerated responses.

Key takeaway

AI Scientists and Research Scientists considering LLMs for behavioral simulation should validate surrogates empirically against human benchmarks for each specific hypothesis and model. LLMs can reproduce the direction of behavioral effects, but how accurately they capture effect magnitudes and moderation patterns varies from model to model. Use LLM surrogates for initial hypothesis screening or exploratory design, and rely on human data when estimating realistic behavioral effect sizes or drawing substantive conclusions.

Key insights

LLMs can reproduce directional behavioral effects but often miscalibrate effect magnitudes compared to human data.

Principles

Surrogate validity requires empirical verification for each hypothesis and model.
Directional agreement alone is insufficient for LLM surrogate validity.

Method

Convert human observations into structured prompts for LLMs, generate single outcome variables without task-specific training, and apply identical statistical analyses to both human and synthetic data to compare experimental inferences.
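
The first two steps of this method can be sketched as follows. This is a minimal illustration, not the study's actual prompt template: the field names (`party`, `headline`, `feedback`), the prompt wording, and both helper functions are assumptions made for the example, and the call to the model itself is left out.

```python
import re
from typing import Optional


def build_prompt(obs: dict) -> str:
    """Render one human observation as a structured prompt.

    The keys 'party', 'headline', and 'feedback' are hypothetical;
    the study's real covariates and wording are not reproduced here.
    """
    return (
        "You are a survey participant.\n"
        f"Political leaning: {obs['party']}\n"
        f"Headline: {obs['headline']}\n"
        f"Credibility feedback shown: {obs['feedback']}\n"
        "On a scale from 0 (not at all accurate) to 10 (completely accurate), "
        "how accurate is this headline? Reply with a single integer."
    )


def parse_rating(reply: str) -> Optional[int]:
    """Return the first integer in the range 0..10 found in the reply, else None."""
    for token in re.findall(r"\d+", reply):
        value = int(token)
        if 0 <= value <= 10:
            return value
    return None


# One illustrative observation (made up, not taken from the study's data).
example = {"party": "Independent", "headline": "X", "feedback": "none"}
prompt = build_prompt(example)
```

Because `parse_rating` takes the first in-range integer, a reply that restates the scale before answering would be misread, so replies that fail to parse cleanly are better logged and inspected than silently coerced.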

In practice

Use LLMs for rapid hypothesis screening or exploratory design.
Calibrate LLM outputs against human data for effect magnitude estimation.
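
One way to keep direction and magnitude separate when checking a surrogate is to run the same estimator on both datasets and report the sign agreement and the size ratio side by side. The sketch below uses a simple difference in means as a stand-in estimator; the study's own statistical models are not specified here, and the column names and toy numbers are invented purely for illustration.

```python
from statistics import mean


def mean_difference(rows, group_key, outcome_key="rating"):
    """Effect estimate: mean outcome where group == 1 minus mean where group == 0."""
    treated = [r[outcome_key] for r in rows if r[group_key] == 1]
    control = [r[outcome_key] for r in rows if r[group_key] == 0]
    return mean(treated) - mean(control)


def compare_effects(human_rows, llm_rows, group_key):
    """Apply the identical estimator to human and synthetic data."""
    human = mean_difference(human_rows, group_key)
    llm = mean_difference(llm_rows, group_key)
    return {
        "human_effect": human,
        "llm_effect": llm,
        # Directional agreement: both effects share a sign.
        "same_direction": human * llm > 0,
        # Values above 1 mean the model exaggerates; below 1, it attenuates.
        "magnitude_ratio": llm / human if human != 0 else None,
    }


# Toy data only (hypothetical 'aligned' indicator, not results from the study).
human = [
    {"aligned": 1, "rating": 7}, {"aligned": 1, "rating": 6},
    {"aligned": 0, "rating": 4}, {"aligned": 0, "rating": 5},
]
llm = [
    {"aligned": 1, "rating": 9}, {"aligned": 1, "rating": 9},
    {"aligned": 0, "rating": 3}, {"aligned": 0, "rating": 3},
]
result = compare_effects(human, llm, "aligned")
```

In this toy case the surrogate agrees on direction but triples the effect size, which is the kind of mismatch that passes a direction-only check and still misleads on magnitude.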

Topics

Large Language Models
Human Behavioral Simulation
Experimental Inference
News Accuracy Judgments
AI Credibility Feedback

Best for: AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.