Beyond Cooperative Simulators: Generating Realistic User Personas for Robust Evaluation of LLM Agents
Summary
Persona Policies (PPol) is a framework for generating realistic and diverse user personas for evaluating and training Large Language Model (LLM) agents. Traditional LLM-based user simulators often produce overly cooperative and homogeneous interactions, leading to a "behavioral gap" where agents perform well in simulation but fail with real users. PPol addresses this by introducing a plug-and-play control layer that induces behavioral variation through an LLM-driven evolutionary program search. This process optimizes a Python generator to discover diverse communication styles and translate them into task-preserving roleplay policies. The optimization uses a multi-objective fitness score that combines human-likeness with broad coverage of human behavioral patterns, measured by 19 lexical and interaction-level features. Across the τ²-bench retail and airline domains, PPol programs achieved absolute gains of 33% to 62% in fitness score over baseline simulators. In blinded human evaluations, PPol-conditioned users were rated as human 80.4% of the time, nearly double the baseline rate, and agents trained with PPol showed a 17% relative improvement in task success against challenging, out-of-distribution behaviors.
Key takeaway
For AI Engineers and Research Scientists developing LLM agents, relying solely on cooperative user simulators risks deploying brittle systems. Integrating Persona Policies (PPol) into evaluation and training pipelines exposes agents to a wider, more realistic spectrum of human communication. In the reported experiments, this improved agent robustness against challenging, out-of-distribution user behaviors, which is a prerequisite for reliable real-world performance.
Key insights
Evolving LLM-driven persona generators creates diverse, human-like user behaviors for robust agent evaluation and training.
Principles
- User simulators require behavioral diversity to bridge the sim-to-real gap.
- Evolutionary search can discover complex, human-like interaction patterns.
- Multi-objective fitness combining human-likeness and coverage prevents adversarial drift.
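The third principle can be illustrated with a small sketch. The summary does not state how PPol combines its two objectives, so the weighted geometric mean below is an assumption chosen for illustration, and the three score pairs are made-up numbers, not results from the paper. The point is only that a combination rule which penalizes a low score on either objective keeps a high-coverage but unrealistic candidate from winning.

```python
def combined_fitness(human_likeness: float, coverage: float, weight: float = 0.5) -> float:
    """Weighted geometric mean of two scores in [0, 1].

    Assumed combination rule for illustration; not taken from the paper.
    """
    for value in (human_likeness, coverage, weight):
        if not 0.0 <= value <= 1.0:
            raise ValueError("scores and weight must lie in [0, 1]")
    return (human_likeness ** weight) * (coverage ** (1.0 - weight))


# Hypothetical (human_likeness, coverage) pairs, invented for this example.
candidates = {
    "cooperative_baseline": (0.45, 0.20),  # plausible but homogeneous
    "adversarial_drift": (0.10, 0.95),     # varied but unrealistic
    "balanced_persona_set": (0.80, 0.75),  # realistic and varied
}

scores = {name: combined_fitness(h, c) for name, (h, c) in candidates.items()}
best = max(scores, key=scores.get)
print(best, {name: round(score, 3) for name, score in scores.items()})
```

With a plain sum, the drifted candidate (0.10 + 0.95) would outscore the baseline by a wide margin; under the geometric mean its low human-likeness pulls it down to roughly the baseline's level, and the balanced candidate ranks first.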
Method
PPol uses an LLM-driven evolutionary program search to optimize a Python generator. This generator creates diverse persona policies, which are evaluated via agent-user rollouts using a multi-objective fitness score based on behavioral fingerprints and human-likeness.
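The loop described above can be sketched as a small elitist evolutionary search. This is a toy under stated assumptions, not the authors' implementation: a candidate here is a dictionary of style parameters rather than a Python generator program, `propose_variant` is a random perturbation standing in for the LLM's proposed edits, and `toy_fitness` stands in for agent-user rollouts plus fingerprint scoring. All names and the target values are hypothetical.

```python
import random
from typing import Callable, Dict, List, Tuple

Candidate = Dict[str, float]  # toy stand-in for a persona generator program


def propose_variant(parent: Candidate, rng: random.Random) -> Candidate:
    """Stand-in for the LLM edit step: perturb one parameter, clipped to [0, 1]."""
    child = dict(parent)
    key = rng.choice(sorted(child))
    child[key] = min(1.0, max(0.0, child[key] + rng.uniform(-0.2, 0.2)))
    return child


def toy_fitness(candidate: Candidate) -> float:
    """Stand-in for rollouts plus scoring: rewards closeness to a made-up target."""
    target = {"terseness": 0.6, "impatience": 0.4, "typo_rate": 0.2}
    error = sum(abs(candidate[k] - target[k]) for k in target)
    return 1.0 - error / len(target)


def evolve(
    seed: Candidate,
    fitness: Callable[[Candidate], float],
    generations: int = 30,
    population_size: int = 6,
    children_per_generation: int = 6,
    rng_seed: int = 0,
) -> Tuple[float, Candidate]:
    """Keep the top candidates each generation and mutate randomly chosen survivors."""
    rng = random.Random(rng_seed)
    population: List[Tuple[float, Candidate]] = [(fitness(seed), seed)]
    for _ in range(generations):
        parents = [rng.choice(population)[1] for _ in range(children_per_generation)]
        children = [propose_variant(parent, rng) for parent in parents]
        population += [(fitness(child), child) for child in children]
        # Elitist selection: survivors are the highest-scoring candidates so far.
        population.sort(key=lambda pair: pair[0], reverse=True)
        population = population[:population_size]
    return population[0]


seed = {"terseness": 0.0, "impatience": 0.0, "typo_rate": 0.0}
best_score, best_candidate = evolve(seed, toy_fitness)
print(round(best_score, 3), best_candidate)
```

Because selection is elitist, the best score never falls below the seed's score. In a real setup the expensive step would be the fitness call, since each evaluation requires running simulated conversations.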
In practice
- Integrate PPol as a control layer for existing user simulators.
- Use behavioral fingerprints to quantify human-likeness and coverage.
- Fine-tune agents on PPol-generated interactions for improved robustness.
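To make the fingerprint idea concrete, here is a minimal sketch. The summary mentions 19 lexical and interaction-level features but does not list them, so the three features below, the `SCALES` normalization, and the tolerance-based coverage measure are all assumptions chosen for illustration.

```python
import re
from typing import Dict, List

POLITE_WORDS = {"please", "thanks", "thank", "sorry"}  # hypothetical marker list
# Assumed per-feature scales so that distances are comparable across features.
SCALES = {"mean_words": 20.0, "question_rate": 1.0, "polite_rate": 1.0}


def fingerprint(messages: List[str]) -> Dict[str, float]:
    """Summarize one user's messages with three illustrative lexical features."""
    if not messages:
        raise ValueError("need at least one message")
    token_lists = [re.findall(r"[a-z']+", m.lower()) for m in messages]
    n_tokens = sum(len(tokens) for tokens in token_lists)
    n_polite = sum(tok in POLITE_WORDS for tokens in token_lists for tok in tokens)
    n_questions = sum(m.strip().endswith("?") for m in messages)
    return {
        "mean_words": n_tokens / len(messages),
        "question_rate": n_questions / len(messages),
        "polite_rate": n_polite / max(n_tokens, 1),
    }


def distance(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Largest scaled per-feature gap between two fingerprints."""
    return max(abs(a[k] - b[k]) / SCALES[k] for k in SCALES)


def coverage(simulated: List[Dict[str, float]], human: List[Dict[str, float]],
             tolerance: float = 0.25) -> float:
    """Fraction of human fingerprints with at least one simulated fingerprint nearby."""
    if not human:
        raise ValueError("need at least one human fingerprint")
    covered = sum(any(distance(h, s) <= tolerance for s in simulated) for h in human)
    return covered / len(human)


human_polite = fingerprint(["Hi, could you please check my order?", "Thanks!"])
human_curt = fingerprint(["refund. now.", "why is this taking so long"])
simulated = fingerprint(["Could you please help me with my order?", "Thank you!"])

print(coverage([simulated], [human_polite, human_curt]))
```

In this toy data the single cooperative simulated user sits close to the polite human but far from the curt one, so only half of the human behaviors are covered. That is the kind of gap a coverage term is meant to expose.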
Topics
- Persona Policies
- LLM Agents
- User Simulation
- Evolutionary Program Search
- Behavioral Fingerprints
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.