Evaluating LLMs as Human Surrogates in Controlled Experiments

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Social Sciences & Behavioral Studies · Depth: Expert, extended

Summary

A study by Adnan Hoq and Tim Weninger of the University of Notre Dame evaluates how well off-the-shelf Large Language Models (LLMs) serve as human surrogates in controlled behavioral experiments. It compares LLM-generated responses directly with human data from a canonical survey experiment on the perceived accuracy of political news headlines. Each human observation is converted into a structured prompt, and each LLM (Llama 3.2:3B, Gemma 2:9B, and GPT-5.2) generates a single outcome variable on a 0-10 scale without task-specific training. The same statistical analyses are then applied to the human and the synthetic responses. The LLMs reproduce several directional effects observed in humans, such as the effect of political alignment and the shift produced by credibility feedback. Effect magnitudes and moderation patterns, however, differ substantially across models: GPT-5.2 comes closest to human-scale effects, while Gemma and Llama show attenuated or exaggerated responses.

Key takeaway

AI Scientists and Research Scientists considering LLMs for behavioral simulation should validate each specific hypothesis and model empirically against human benchmarks. LLMs can reproduce the direction of behavioral effects, but how accurately they capture effect magnitudes and moderation patterns varies considerably from model to model. Use LLM surrogates for initial hypothesis screening or exploratory design, and rely on calibrated human data to estimate realistic behavioral effect sizes and to draw substantive conclusions.

Key insights

LLMs can reproduce directional behavioral effects but often miscalibrate effect magnitudes compared to human data.
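The distinction between direction and magnitude can be made concrete with a small sketch. It assumes a simple difference-in-means treatment effect, which is a stand-in chosen for illustration and not necessarily the estimator used in the study. The function names and the toy ratings are hypothetical and are not data from the paper.

```python
from statistics import mean


def treatment_effect(outcomes, treated):
    """Difference in mean outcome between treated and control observations."""
    treated_vals = [y for y, z in zip(outcomes, treated) if z]
    control_vals = [y for y, z in zip(outcomes, treated) if not z]
    return mean(treated_vals) - mean(control_vals)


def compare_effects(human_effect, llm_effect):
    """Report whether two effects share a sign and how their sizes relate."""
    same_direction = human_effect * llm_effect > 0
    # A ratio of 1.0 means matched magnitude; below 1 is attenuated, above 1 is exaggerated.
    ratio = llm_effect / human_effect if human_effect != 0 else float("nan")
    return {"same_direction": same_direction, "magnitude_ratio": ratio}


# Toy 0-10 ratings (illustrative only, not taken from the study).
treated = [1, 1, 1, 0, 0, 0]
human_ratings = [7, 6, 8, 5, 4, 6]
llm_ratings = [9, 9, 9, 5, 5, 5]

result = compare_effects(
    treatment_effect(human_ratings, treated),
    treatment_effect(llm_ratings, treated),
)
print(result)
```

In this toy case the surrogate agrees with the human sample on direction but doubles the effect size, which is the kind of miscalibration the insight above describes.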

Principles

Method

Convert human observations into structured prompts for LLMs, generate single outcome variables without task-specific training, and apply identical statistical analyses to both human and synthetic data to compare experimental inferences.
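A minimal sketch of the first two steps of this method follows: turning one observation into a structured prompt and extracting a single 0-10 rating from the reply. The summary does not reproduce the study's prompt template, so the field names (`party`, `headline`, `feedback`), the prompt wording, and the `model` callable are all assumptions made for illustration.

```python
import re


def build_prompt(obs):
    """Render one observation as a structured prompt (hypothetical template)."""
    return (
        "You are a survey participant.\n"
        f"Political affiliation: {obs['party']}\n"
        f"Headline: {obs['headline']}\n"
        f"Credibility feedback shown: {obs['feedback']}\n"
        "On a scale from 0 to 10, how accurate is this headline? "
        "Answer with a single integer."
    )


def parse_rating(text):
    """Return the first standalone integer from 0 to 10 in the reply, or None."""
    match = re.search(r"\b(10|[0-9])\b", text)
    return int(match.group(1)) if match else None


def simulate(observations, model):
    """Query `model` (any callable from prompt text to reply text) once per observation."""
    return [parse_rating(model(build_prompt(obs))) for obs in observations]


# Stub standing in for a real LLM call, so the sketch runs offline.
def stub_model(prompt):
    return "I would rate it 5."


observations = [
    {"party": "Independent", "headline": "Example headline", "feedback": "none"},
]
ratings = simulate(observations, stub_model)
print(ratings)
```

Because `simulate` returns one rating per observation in the original order, the synthetic column can replace the human outcome column and pass through the same analysis code unchanged. Replies that contain no valid rating come back as `None` and would need an explicit handling rule before analysis.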

Topics

Best for: AI Scientist, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.