Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces

· Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

OmniBehavior is a user simulation benchmark built entirely from real-world data. It targets the limitations of existing benchmarks, which rely on isolated scenarios, narrow action spaces, or synthetic data, by integrating long-horizon, cross-scenario, and heterogeneous behavioral patterns into a unified framework. Initial evaluations on the benchmark indicate that datasets restricted to isolated scenarios suffer from "tunnel vision," whereas real decision-making depends on long-term, cross-scenario causal chains. State-of-the-art large language models (LLMs) struggle to simulate these behaviors accurately, and larger context windows do not improve performance significantly. A key finding is a structural bias: LLMs tend to simulate a "positive average person," which produces hyper-activity, persona homogenization, and a Utopian bias, erasing individual differences and long-tail behaviors.

Key takeaway

Research scientists developing user simulators should recognize that current LLMs carry a structural bias toward a "positive average person" persona, which leads to hyper-activity and a loss of individual differences. Shift focus toward models that capture long-horizon, cross-scenario, and heterogeneous behavioral patterns rather than isolated scenarios, and consider validating against real-world benchmarks such as OmniBehavior to check fidelity to authentic human behavior.

Key insights

Real-world human behavior simulation requires long-horizon, cross-scenario data, which current LLMs struggle to model accurately.

Method

OmniBehavior is a user simulation benchmark constructed from real-world data, integrating long-horizon, cross-scenario, and heterogeneous behavioral patterns for evaluating LLMs.
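
To make the reported biases concrete, the sketch below shows one way a simulator's output could be compared with real traces. It is an illustrative assumption, not OmniBehavior's actual data schema or metrics: the trace layout, the action names, the treatment of "skip" as the passive outcome, and the two quantities `hyper_activity_gap` and `homogenization_ratio` are all hypothetical, and the toy data is invented for the example.

```python
from statistics import mean, pstdev

# Hypothetical toy traces: user id -> list of (scenario, action) events.
# "skip" stands for a passive outcome; every name here is illustrative
# and not taken from the benchmark itself.
real = {
    "u1": [("video", "skip"), ("video", "skip"), ("shop", "click"), ("live", "skip")],
    "u2": [("video", "like"), ("shop", "purchase"), ("shop", "click"), ("live", "comment")],
    "u3": [("video", "skip"), ("video", "skip"), ("shop", "skip"), ("live", "skip")],
}
simulated = {
    "u1": [("video", "like"), ("video", "skip"), ("shop", "click"), ("live", "comment")],
    "u2": [("video", "like"), ("shop", "click"), ("shop", "skip"), ("live", "comment")],
    "u3": [("video", "like"), ("video", "like"), ("shop", "click"), ("live", "like")],
}


def active_rate(events):
    """Fraction of events whose action is anything other than 'skip'."""
    return sum(action != "skip" for _, action in events) / len(events)


def summarize(traces):
    """Return (mean, population std dev) of per-user active rates."""
    rates = [active_rate(events) for events in traces.values()]
    return mean(rates), pstdev(rates)


real_mean, real_spread = summarize(real)
sim_mean, sim_spread = summarize(simulated)

# Positive gap: the simulator acts more often than real users (hyper-activity).
hyper_activity_gap = sim_mean - real_mean
# Ratio below 1: simulated users differ from each other less than real users
# do (persona homogenization).
homogenization_ratio = sim_spread / real_spread

print(f"hyper-activity gap:   {hyper_activity_gap:.3f}")
print(f"homogenization ratio: {homogenization_ratio:.3f}")
```

On this toy data the simulated users are both more active and more alike than the real ones, which is the pattern the summary describes. Population-level averages alone would hide the second effect, so the per-user spread is tracked separately.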

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.