Improving General Role-Playing Agents via Psychology-Grounded Reasoning and Role-Aware Policy Optimization

· Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

General-purpose role-playing agents often struggle with faithful character portrayal and out-of-distribution generalization because they rely on superficial behavioral mimicry. To address this, the authors propose Psy-CoT, a psychology-grounded chain-of-thought framework that decomposes pre-response reasoning into three role-specific steps: Interaction Perception, Psychological Empathy, and Logical Construction. The aim is for the agent to reason dynamically from the character profile rather than pattern-match on surface behavior. They also introduce Role-Aware Policy Optimization (RAPO) to counter reward hacking during reinforcement learning, where generic phrases exploit the reward model. RAPO uses the mutual information between the profile and each token to weight gradients asymmetrically: role-specific tokens are amplified when the advantage is positive and attenuated when it is negative. In experiments on CoSER, CharacterBench, and CharacterEval, Psy-CoT outperforms existing role-playing CoT methods, and RAPO consistently surpasses GRPO across multiple model scales.
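
As an illustration of the three-step decomposition, here is a minimal sketch of a prompt scaffold. Only the three step names come from the summary; the guiding question attached to each step, the function name `build_psy_cot_prompt`, and the prompt wording are assumptions made for this example and should not be read as the authors' template.

```python
# Step names are from the summary; the guiding questions are illustrative assumptions.
PSY_COT_STEPS = [
    ("Interaction Perception", "What is happening in this exchange?"),
    ("Psychological Empathy", "Given the profile, what would the character feel right now?"),
    ("Logical Construction", "How should the reply follow from the two steps above?"),
]


def build_psy_cot_prompt(profile: str, dialogue: str) -> str:
    """Assemble a prompt that asks for three reasoning steps before the reply.

    Hypothetical helper: the layout and wording are not taken from the paper.
    """
    parts = [
        f"Character profile:\n{profile}",
        f"Dialogue so far:\n{dialogue}",
        "Before replying, reason in three steps:",
    ]
    # Number the steps in the order the summary lists them.
    for i, (name, question) in enumerate(PSY_COT_STEPS, start=1):
        parts.append(f"{i}. {name}: {question}")
    parts.append("Then write the in-character reply.")
    return "\n\n".join(parts)


prompt = build_psy_cot_prompt(
    profile="A retired sea captain, gruff but protective of his crew.",
    dialogue="User: Captain, the storm is getting worse. What do we do?",
)
```

The resulting string can be passed to any chat or completion model. Keeping the steps in a list makes it easy to ablate one step and compare outputs.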

Key takeaway

Machine learning engineers whose role-playing agents struggle with character fidelity or out-of-distribution generalization should consider combining psychology-grounded reasoning with role-aware reinforcement learning. Psy-CoT's three-step reasoning framework and RAPO's gradient weighting target two separate failure modes: shallow mimicry of a character, and reward-model exploitation by generic responses. The reported results suggest this approach can lead to more robust and believable character portrayals.

Key insights

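Psychology-grounded reasoning (Psy-CoT) and role-aware policy optimization (RAPO) both improve the fidelity of general role-playing agents: on CoSER, CharacterBench, and CharacterEval, Psy-CoT outperforms existing role-playing CoT methods, and RAPO surpasses GRPO across multiple model scales.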

Method

Psy-CoT decomposes pre-response reasoning into three steps: Interaction Perception, Psychological Empathy, and Logical Construction. RAPO weights policy gradients by profile-token mutual information: role-specific tokens are amplified when the advantage is positive and attenuated when it is negative.
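
The asymmetric weighting can be sketched as follows. The summary does not give RAPO's formula, so everything specific here is an assumption: mutual information is approximated per token by the log-probability gain from conditioning on the profile, the `tanh` squashing and the `alpha` strength are arbitrary choices, and the function names are hypothetical.

```python
import numpy as np


def role_aware_weights(logp_with_profile, logp_without_profile, advantage, alpha=0.5):
    """Per-token gradient weights (assumed form, not the paper's exact formula)."""
    with_p = np.asarray(logp_with_profile, dtype=float)
    without_p = np.asarray(logp_without_profile, dtype=float)
    # Pointwise proxy for profile-token mutual information:
    # how much the profile raises each token's log-probability.
    pmi = with_p - without_p
    # Map to [0, 1): generic tokens (pmi <= 0) score 0, role-specific tokens approach 1.
    specificity = np.tanh(np.maximum(pmi, 0.0))
    if advantage >= 0:
        return 1.0 + alpha * specificity  # amplify role-specific tokens
    return 1.0 - alpha * specificity  # attenuate role-specific tokens


def weighted_token_loss(logp_with_profile, logp_without_profile, advantage, alpha=0.5):
    """Policy-gradient style loss with role-aware weights.

    In a real trainer the weights would be treated as constants (no gradient).
    """
    w = role_aware_weights(logp_with_profile, logp_without_profile, advantage, alpha)
    logp = np.asarray(logp_with_profile, dtype=float)
    return float(-(w * advantage * logp).mean())


# Token 0 depends strongly on the profile; token 1 is generic.
w_pos = role_aware_weights([-0.5, -2.0], [-3.0, -2.0], advantage=1.0)
w_neg = role_aware_weights([-0.5, -2.0], [-3.0, -2.0], advantage=-1.0)
```

In this sketch generic tokens always keep weight 1, so only role-specific tokens are treated differently: they gain extra credit in good responses and are shielded from part of the penalty in bad ones.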

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.