Beyond Cooperative Simulators: Generating Realistic User Personas for Robust Evaluation of LLM Agents

Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Data Science & Analytics · Depth: Expert, extended

Summary

Persona Policies (PPol) is a framework for generating realistic, diverse user personas to evaluate and train Large Language Model (LLM) agents. LLM-based user simulators tend to produce overly cooperative, homogeneous interactions, which creates a "behavioral gap": agents perform well in simulation but fail with real users. PPol addresses this with a plug-and-play control layer that induces behavioral variation through an LLM-driven evolutionary program search. The search optimizes a Python generator that discovers diverse communication styles and translates them into task-preserving roleplay policies. Its multi-objective fitness score combines human-likeness with broad coverage of human behavioral patterns, measured by 19 lexical and interaction-level features. Across the τ²-bench retail and airline domains, PPol programs achieved absolute fitness-score gains of 33% to 62% over baseline simulators. In blinded human evaluations, PPol-conditioned users were rated as human 80.4% of the time, nearly double the baseline rate, and agents trained with PPol showed a +17% relative improvement in task success against challenging, out-of-distribution behaviors.
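The summary does not give the exact form of the fitness score, so the following is only a minimal sketch of how a human-likeness term and a coverage term could be combined. Everything here is an assumption made for illustration: the function names `coverage` and `fitness`, the histogram-cell notion of coverage, the equal weighting, and the premise that each conversation has already been reduced to a feature vector scaled to [0, 1].

```python
from statistics import mean


def coverage(sim_features, human_features, bins=5):
    """Fraction of human-occupied (feature, bin) cells also hit by simulated data.

    Hypothetical coverage measure: both inputs are lists of equal-length
    feature vectors whose values are assumed to lie in [0, 1].
    """
    def cells(vectors):
        occupied = set()
        for vec in vectors:
            for i, value in enumerate(vec):
                # Clamp the bin index so 1.0 lands in the last bin
                # and stray negative values land in the first.
                b = max(0, min(int(value * bins), bins - 1))
                occupied.add((i, b))
        return occupied

    human_cells = cells(human_features)
    if not human_cells:
        return 0.0
    return len(cells(sim_features) & human_cells) / len(human_cells)


def fitness(human_likeness_scores, sim_features, human_features, weight=0.5):
    """Weighted sum of mean human-likeness and coverage, both in [0, 1].

    The weighting scheme is an assumption, not the paper's formula.
    """
    likeness = mean(human_likeness_scores) if human_likeness_scores else 0.0
    return weight * likeness + (1 - weight) * coverage(sim_features, human_features)


# Toy data with 2 features instead of the 19 mentioned in the summary.
human = [[0.1, 0.9], [0.5, 0.5]]
simulated = [[0.1, 0.5]]
score = fitness([1.0, 0.5], simulated, human)
```

A weighted sum is the simplest way to scalarize two objectives; a real multi-objective search might instead keep the terms separate and compare candidates by Pareto dominance.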

Key takeaway

For AI engineers and research scientists developing LLM agents, relying solely on cooperative user simulators risks deploying brittle systems. Integrating Persona Policies (PPol) into evaluation and training pipelines exposes agents to a wider, more realistic spectrum of human communication. The reported results suggest this improves robustness against challenging, out-of-distribution user behaviors, which should translate into more reliable real-world performance.

Key insights

Evolving LLM-driven persona generators creates diverse, human-like user behaviors for robust agent evaluation and training.

Principles

Method

PPol uses an LLM-driven evolutionary program search to optimize a Python generator. This generator creates diverse persona policies, which are evaluated via agent-user rollouts using a multi-objective fitness score based on behavioral fingerprints and human-likeness.
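The search loop described above can be sketched generically. This is an illustration under stated assumptions, not the authors' implementation: `evolve`, `propose`, and `evaluate` are hypothetical names, the tournament selection and truncation scheme are placeholder choices, and in a PPol-style setup `propose` would be an LLM call that rewrites the generator's Python source while `evaluate` would run agent-user rollouts and return the fitness score.

```python
import random


def evolve(seed_program, propose, evaluate, generations=10,
           population_size=4, rng=None):
    """Generic evolutionary program search (hypothetical sketch).

    propose(program) -> new candidate program (stand-in for an LLM edit).
    evaluate(program) -> scalar fitness (stand-in for rollout scoring).
    Returns the best (score, program) pair found.
    """
    rng = rng or random.Random(0)
    population = [(evaluate(seed_program), seed_program)]
    for _ in range(generations):
        # Tournament selection: the better of two random candidates is the parent.
        contenders = [rng.choice(population) for _ in range(2)]
        parent = max(contenders, key=lambda pair: pair[0])[1]
        child = propose(parent)
        population.append((evaluate(child), child))
        # Truncation: keep only the top-scoring programs.
        population.sort(key=lambda pair: pair[0], reverse=True)
        del population[population_size:]
    return population[0]


# Toy stand-ins: "programs" are integers and fitness peaks at 5.
def toy_propose(program):
    return program + 1


def toy_evaluate(program):
    return -abs(program - 5)


best_score, best_program = evolve(0, toy_propose, toy_evaluate)
```

Because the population is truncated to its top scorers after every step, the best score never decreases across generations, which is the property that lets the search accumulate improvements to the generator.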

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.