ArcANE: Do Role-Playing Language Agents Stay in Character at the Right Time?
Summary
ArcANE (Arc-Aware Narrative Evaluation) is a new benchmark that assesses whether role-playing language agents (RPLAs) stay in character as that character's psychological state evolves through a story, rather than only testing whether they recall facts. The benchmark is constructed automatically from 17 novels and 80 principal characters, with each narrative segmented into psychological phases. It probes agents with identical scenarios across these phases, including situations both within and beyond the source text. Across six models and six context modes, conditioning on the Character Arc consistently outperformed the other context strategies, with the largest gap in scenarios outside the source text. Fine-tuning open-weight models produced ArcANE-8B/32B, which further amplified the Character Arc advantage in out-of-source contexts.
Key takeaway
NLP engineers developing or evaluating role-playing language agents should not rely on factual-recall benchmarks alone to assess character consistency. In the reported results, conditioning on the Character Arc improved alignment with a character's psychological trajectory, especially in novel scenarios, so it is worth integrating into agent designs. Consider using benchmarks like ArcANE to validate that your agents' values and behavior evolve as the character's do.
Key insights
Role-playing language agents require evaluation beyond factual recall to assess character psychological evolution.
Principles
- Character psychological trajectory is crucial for realistic RPLAs.
- Existing benchmarks inadequately measure character evolution.
- Character Arc conditioning consistently outperformed other context strategies across the six models tested.
Method
ArcANE segments narratives into psychological phases, then probes agents with identical scenarios across these phases, including situations not explicitly in the source text.
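The probing design described above can be pictured as a cross product of phases and scenarios. The following is a minimal sketch, not the benchmark's actual construction code: the `Phase`, `Scenario`, and `Probe` types, the `build_probes` helper, and the example character and texts are all hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Phase:
    index: int
    summary: str  # hypothetical: short description of the character's state


@dataclass(frozen=True)
class Scenario:
    text: str
    in_source: bool  # True if the situation occurs in the source text


@dataclass(frozen=True)
class Probe:
    character: str
    phase: Phase
    scenario: Scenario


def build_probes(character, phases, scenarios):
    """Pair every scenario with every phase, so the same scenario is
    posed to the character at each point in their arc."""
    return [Probe(character, p, s) for p, s in product(phases, scenarios)]


# Illustrative data only; not taken from the benchmark.
phases = [
    Phase(0, "guarded and self-reliant"),
    Phase(1, "shaken by loss"),
    Phase(2, "willing to trust others"),
]
scenarios = [
    Scenario("A stranger asks for help on the road.", in_source=True),
    Scenario("A coworker takes credit for your idea.", in_source=False),
]

probes = build_probes("Example Character", phases, scenarios)
print(len(probes))  # 3 phases x 2 scenarios = 6 probes
```

Holding the scenario fixed while varying the phase is what lets a difference in responses be attributed to the character's psychological state rather than to the situation.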
In practice
- Condition LLMs on Character Arc for improved RPLA performance.
- Fine-tune open-weight models using ArcANE-like data.
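One plausible way to condition a model on a character arc is to place a phase-by-phase summary in the prompt and mark the current phase. This sketch is an assumption about what such conditioning could look like, not the paper's prompt format: the `build_arc_prompt` function, its parameters, and the prompt wording are hypothetical, as is the choice to include only phases up to the current one.

```python
def build_arc_prompt(character, phase_summaries, current_phase, scenario):
    """Build a role-play prompt that lists the character's arc up to and
    including the current phase (0-based index), then poses the scenario."""
    if not 0 <= current_phase < len(phase_summaries):
        raise ValueError("current_phase is out of range")

    lines = [f"You are role-playing {character}.", "Character arc so far:"]
    # Assumption: later phases are withheld so the agent does not
    # anticipate development the character has not yet undergone.
    for i, summary in enumerate(phase_summaries[: current_phase + 1]):
        marker = " (current)" if i == current_phase else ""
        lines.append(f"  Phase {i + 1}{marker}: {summary}")
    lines.append("Respond as the character would in their current phase.")
    lines.append(f"Scenario: {scenario}")
    return "\n".join(lines)


prompt = build_arc_prompt(
    "Example Character",
    ["guarded and self-reliant", "shaken by loss", "willing to trust others"],
    current_phase=1,
    scenario="A coworker takes credit for your idea.",
)
print(prompt)
```

The resulting string can be passed as a system or user message to whichever chat model is under test. Whether to withhold later phases is a design choice worth comparing empirically.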
Topics
- Role-Playing Language Agents
- Character Arc
- Narrative Evaluation
- LLM Benchmarking
- Context Conditioning
- Fine-tuning
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.