Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces
Summary
Researchers have introduced OmniBehavior, a user simulation benchmark built exclusively from real-world data. It is designed to overcome the limitations of existing benchmarks, which rely on isolated scenarios, narrow action spaces, or synthetic data, by integrating long-horizon, cross-scenario, and heterogeneous behavioral patterns into a unified framework. Evaluations on the benchmark indicate that datasets restricted to isolated scenarios suffer from "tunnel vision," whereas actual decision-making depends on long-term, cross-scenario causal chains. State-of-the-art Large Language Models (LLMs) struggle to simulate these behaviors accurately, and larger context windows do not significantly improve performance. A key finding is a structural bias: LLMs tend to simulate a "positive average person," which produces hyper-activity, persona homogenization, and a Utopian bias, and in turn erases individual differences and long-tail behaviors.
Key takeaway
If you are a research scientist developing user simulators, recognize that current LLMs exhibit a structural bias towards an "average person" persona, which leads to hyper-activity and a loss of individual differences. Shift your focus towards models that capture long-horizon, cross-scenario, and heterogeneous behavioral patterns rather than isolated scenarios. Consider using real-world benchmarks such as OmniBehavior to validate how faithfully your models reproduce authentic human behavior.
Key insights
Real-world human behavior simulation requires long-horizon, cross-scenario data, which current LLMs struggle to model accurately.
Principles
- Real-world behavior is long-term and cross-scenario.
- LLMs exhibit a "Utopian bias" in simulation.
Method
OmniBehavior is a user simulation benchmark constructed from real-world data, integrating long-horizon, cross-scenario, and heterogeneous behavioral patterns for evaluating LLMs.
In practice
- Use OmniBehavior for realistic LLM user simulation.
- Address LLM bias towards "average person" behavior.
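One rough way to check a simulator for the "average person" bias is to compare per-user activity rates between real and simulated traces. The sketch below is illustrative only: the function names, the 0/1 trace encoding, and the two diagnostics (`hyperactivity_gap`, `spread_ratio`) are assumptions made for this example, not metrics or data formats defined by OmniBehavior.

```python
from statistics import mean, pstdev


def activity_rate(trace):
    """Fraction of exposures on which the user took an action (1 = acted)."""
    return sum(trace) / len(trace)


def bias_diagnostics(real, simulated):
    """Compare per-user activity rates of real and simulated traces.

    Both arguments are hypothetical dicts mapping a user id to a list of
    0/1 flags (1 = the user acted on the item, 0 = the user ignored it).
    """
    real_rates = [activity_rate(t) for t in real.values()]
    sim_rates = [activity_rate(t) for t in simulated.values()]
    real_spread = pstdev(real_rates)
    return {
        # > 0 suggests the simulator is more active than real users.
        "hyperactivity_gap": mean(sim_rates) - mean(real_rates),
        # < 1 suggests simulated users are more alike than real ones.
        "spread_ratio": pstdev(sim_rates) / real_spread if real_spread else None,
    }


# Toy data, invented for illustration: real users differ widely,
# simulated users all behave like the same fairly active person.
real = {"u1": [0, 0, 0, 1], "u2": [1, 1, 1, 0], "u3": [0, 0, 0, 0]}
simulated = {"u1": [1, 1, 0, 1], "u2": [1, 1, 1, 0], "u3": [1, 0, 1, 1]}

report = bias_diagnostics(real, simulated)
print(report)
```

On this toy data the gap is positive and the spread ratio falls below 1, which is the pattern the summary describes as hyper-activity and persona homogenization. A real evaluation would need richer action types and long-tail behaviors than a single binary flag can express.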
Topics
- Large Language Models
- Human Behavior Simulation
- OmniBehavior Benchmark
- Real-world Data
- LLM Bias
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.