Can You Break RLVER? Probing Adversarial Robustness of RL-Trained Empathetic Agents

Source: cs.AI updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy, Social Sciences & Behavioral Studies) · Depth: Expert, long

Summary

Researchers developed the Adversarial Empathy Benchmark (AEB) and the Emotional Consistency Score (ECS) to evaluate how robust language models trained with Reinforcement Learning from Verifiable Emotion Rewards (RLVER) are under adversarial user conditions. The AEB comprises six psychologically grounded adversarial trajectory types, including gaslighting and emotional escalation, designed to penalize formulaic responses. The ECS separates a model's capacity to track a user's emotional state from its ability to improve that state. In a controlled experiment of 480 adversarial dialogues across eight conditions, RLVER-PPO-Think significantly outperformed the untuned baseline (0.963 vs. 0.761, p<0.001, r=0.688), with zero dialogue collapses and 47% higher hidden-intention detection. However, ECS stayed nearly flat, with no significant difference between the two, indicating that RL training improves emotional responsiveness without measurable gains in observable state tracking.
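The summary reports a p-value and an effect size r but does not say which test produced them. As a rough sketch only, assuming a rank-based two-sample comparison (Mann-Whitney U) and the common convention r = |Z| / sqrt(N), an effect size of this kind could be computed as below. The function name, the toy scores, and the choice of test are all assumptions for illustration, not the paper's procedure.

```python
import math

def mann_whitney_r(a, b):
    """Return (U for sample a, z, r) via the normal approximation.

    Assumption: r = |z| / sqrt(n1 + n2). The variance term is not
    corrected for ties, so this is only a sketch for small examples.
    """
    n1, n2 = len(a), len(b)
    # Pool both samples, tagging each value with its group (0 = a, 1 = b).
    pooled = sorted([(x, 0) for x in a] + [(x, 1) for x in b], key=lambda t: t[0])

    rank_sum_a = 0.0
    i = 0
    while i < len(pooled):
        # Find the run of tied values starting at position i.
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of 1-based ranks i+1 .. j
        rank_sum_a += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j

    u_a = rank_sum_a - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u_a - mu) / sigma
    r = abs(z) / math.sqrt(n1 + n2)
    return u_a, z, r

# Toy scores, invented for illustration (not data from the paper).
tuned = [0.90, 0.95, 1.00]
baseline = [0.50, 0.60, 0.70]
u, z, r = mann_whitney_r(tuned, baseline)
print(f"U={u:.1f}, z={z:.3f}, r={r:.3f}")
```

If the 480 dialogues are split evenly across the eight conditions (also an assumption), each condition has 60 dialogues, so a pairwise comparison like the one reported would use N = 120.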

Key takeaway

If you are a research scientist developing empathetic AI, prioritize adversarial evaluation with benchmarks like AEB before deploying in sensitive settings. Your models may improve emotional outcomes without tracking user states any better, as the ECS-FS dissociation indicates. Consider adding reasoning scaffolds only after RLVER training, since that is the context in which they significantly aid performance.

Key insights

RLVER-trained empathetic agents hold up against adversarial users: they improve emotional outcomes, but their observable state tracking does not measurably improve.

Principles

Method

The Adversarial Empathy Benchmark (AEB) uses six psychologically grounded adversarial trajectories with discriminative reward rules. The Emotional Consistency Score (ECS) measures how legible the user's emotional state is from the public conversation, separately from the final emotional outcome.
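To make the separation between legibility and outcome concrete, here is a minimal sketch. It is not the paper's ECS definition, which this summary does not give. It assumes a hypothetical setup where each turn carries a hidden emotion score from the user simulator and an estimate that an observer infers from the public transcript alone. The `Turn` fields, both function names, and the toy numbers are invented for illustration.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Turn:
    hidden_emotion: float    # hypothetical: simulator's private score in [0, 1]
    inferred_emotion: float  # hypothetical: observer's estimate from public text only

def legibility_score(turns):
    """Stand-in for a legibility metric: 1 minus the mean absolute error
    between the hidden emotion and the estimate from the transcript."""
    return 1.0 - mean(abs(t.hidden_emotion - t.inferred_emotion) for t in turns)

def final_outcome(turns):
    """Stand-in for an outcome metric: the hidden emotion at the last turn."""
    return turns[-1].hidden_emotion

# Toy dialogue: the user ends up feeling much better, yet the transcript
# is a poor guide to their state at every turn.
dialogue = [Turn(0.2, 0.6), Turn(0.5, 0.9), Turn(0.9, 0.5)]
print("legibility:", round(legibility_score(dialogue), 3))
print("outcome:", final_outcome(dialogue))
```

Because the two functions read different quantities, one can move while the other stays flat, which is the kind of dissociation the summary describes.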

In practice

Topics

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.