Can You Break RLVER? Probing Adversarial Robustness of RL-Trained Empathetic Agents
Summary
Researchers developed the Adversarial Empathy Benchmark (AEB) and the Emotional Consistency Score (ECS) to evaluate how robust language models trained with Reinforcement Learning from Verifiable Emotion Rewards (RLVER) are under adversarial user conditions. The AEB comprises six psychologically grounded adversarial trajectory types, including gaslighting and emotional escalation, designed to penalize formulaic responses. The ECS separates a model's capacity to track user emotional states from its ability to improve them. In a controlled experiment involving 480 adversarial dialogues across eight conditions, RLVER-PPO-Think significantly outperformed the untuned baseline (0.963 vs. 0.761, p<0.001, r=0.688), with zero dialogue collapses and 47% higher hidden-intention detection. However, ECS stayed nearly flat, with no significant difference, indicating that RL training improves emotional responsiveness without measurable gains in observable state tracking.
Key takeaway
If you are a research scientist developing empathetic AI, prioritize adversarial evaluation with benchmarks like AEB before deploying in sensitive settings. Your models may improve emotional outcomes without tracking user states any more legibly, as the ECS-FS dissociation indicates. Consider adding reasoning scaffolds only after RLVER training, since that is the context in which they significantly aid performance.
Key insights
RLVER-trained empathetic agents are robust against adversarial users, improving emotional outcomes without better observable state tracking.
Principles
- Adversarial evaluation reveals policy robustness.
- Emotional responsiveness can diverge from state tracking.
- Reasoning scaffolds aid RL-trained models.
Method
The Adversarial Empathy Benchmark (AEB) uses six psychologically grounded adversarial trajectories with discriminative reward rules. The Emotional Consistency Score (ECS) measures emotion-state legibility from public conversation, distinct from final emotional outcomes.
In practice
- Use AEB to stress-test empathetic AI.
- Implement discriminative reward rules.
- Measure both FS and ECS for empathy.
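One way to act on the last point is to run the same two-condition comparison on FS and ECS side by side, so that a gain in one without a gain in the other is visible. The sketch below is illustrative only: the summary does not say which statistical test produced the reported p and r, so the Mann-Whitney U test with a normal approximation and the effect size r = |Z| / sqrt(N) are assumptions, as are the function names and the record fields `fs` and `ecs`. The sample values are made-up toy numbers, not results from the paper.

```python
import math
from statistics import NormalDist


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_test(a, b):
    """Two-sided Mann-Whitney U via normal approximation.

    Assumption: the variance is not tie-corrected, so p is slightly
    conservative when many values are tied.
    """
    n1, n2 = len(a), len(b)
    ranks = average_ranks(list(a) + list(b))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mu) / sigma
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    r = abs(z) / math.sqrt(n1 + n2)  # effect size r = |Z| / sqrt(N)
    return {"z": z, "p": p, "r": r}


def compare_conditions(treated, baseline, metrics=("fs", "ecs")):
    """Run the same rank test on every metric (hypothetical field names)."""
    return {
        m: rank_test([d[m] for d in treated], [d[m] for d in baseline])
        for m in metrics
    }


# Toy per-dialogue records, invented for illustration only.
treated = [{"fs": 0.96, "ecs": 0.51}, {"fs": 0.97, "ecs": 0.48},
           {"fs": 0.95, "ecs": 0.53}, {"fs": 0.99, "ecs": 0.50}]
baseline = [{"fs": 0.74, "ecs": 0.50}, {"fs": 0.78, "ecs": 0.52},
            {"fs": 0.71, "ecs": 0.49}, {"fs": 0.80, "ecs": 0.47}]

for metric, stats in compare_conditions(treated, baseline).items():
    print(f"{metric}: z={stats['z']:.2f} p={stats['p']:.3f} r={stats['r']:.2f}")
```

Reporting the two metrics from one function keeps the comparison symmetric: a large effect on `fs` next to a near-zero effect on `ecs` is the kind of dissociation the summary describes.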
Topics
- RLVER
- Adversarial Empathy Benchmark
- Emotional Consistency Score
- Empathetic Agents
- LLM Robustness
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.