RVPO: Risk-Sensitive Alignment via Variance Regularization
Summary
Reward-Variance Policy Optimization (RVPO) is a new risk-sensitive framework designed to address constraint neglect in current critic-less Reinforcement Learning from Human Feedback (RLHF) methods. Existing methods, which aggregate multi-objective rewards using an arithmetic mean, can mask critical failures in objectives like safety or formatting when high success in another objective numerically offsets them. RVPO penalizes inter-reward variance during advantage aggregation, shifting the optimization objective from "maximize sum" to "maximize consistency." The framework utilizes a LogSumExp (SoftMin) operator, which acts as a smooth variance penalty. Evaluated on rubric-based medical and scientific reasoning with up to 17 LLM-judged reward signals (Qwen2.5-3B/7B/14B) and on tool-calling with rule-based constraints (Qwen2.5-1.5B/3B), RVPO improved overall scores on HealthBench (0.261 vs. 0.215 for GDPO at 14B, p < 0.001) and maintained competitive accuracy on GPQA-Diamond, mitigating constraint neglect across model scales.
Key takeaway
For research scientists developing multi-objective RLHF systems, RVPO offers a way to reduce constraint neglect. By optimizing for consistency across rewards rather than only their sum, it lowers the risk that critical constraints, such as safety or formatting, are traded away for gains elsewhere. In the reported experiments, RVPO's variance regularization improved alignment in domains like medical reasoning and tool-calling, where diverse objectives must be met without sacrificing general capabilities.
Key insights
RVPO mitigates constraint neglect in multi-objective RLHF by penalizing inter-reward variance, favoring consistent performance across objectives over maximizing their sum.
Principles
- Arithmetic-mean aggregation can mask a critical failure in one objective when a high score in another offsets it.
- Penalizing inter-reward variance improves consistency.
- LogSumExp acts as a smooth variance penalty.
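One way to see why a LogSumExp operator behaves like a variance penalty is a small-temperature expansion. The block below is a sketch under assumptions: the summary does not give RVPO's exact operator, so the normalized form, the temperature symbol \beta, and the per-objective rewards r_1, ..., r_K are illustrative notation rather than the paper's.

```latex
% Assumed form: normalized SoftMin over K per-objective rewards, temperature \beta > 0.
\[
\operatorname{softmin}_{\beta}(r)
  = -\frac{1}{\beta}\,\log\!\left(\frac{1}{K}\sum_{k=1}^{K} e^{-\beta r_k}\right)
\]
% Cumulant expansion of the log-mean-exp around \beta = 0:
\[
\operatorname{softmin}_{\beta}(r)
  = \bar{r} \;-\; \frac{\beta}{2}\,\operatorname{Var}(r) \;+\; O(\beta^{2}),
\qquad
\bar{r} = \frac{1}{K}\sum_{k=1}^{K} r_k,
\quad
\operatorname{Var}(r) = \frac{1}{K}\sum_{k=1}^{K} (r_k - \bar{r})^{2}.
\]
```

Under this form, the leading term is the arithmetic mean and the first correction subtracts a multiple of the inter-reward variance, so two reward vectors with the same mean are ranked by how evenly the objectives are satisfied. As \beta grows, the operator approaches the hard minimum over objectives.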
Method
RVPO penalizes inter-reward variance during advantage aggregation using a LogSumExp (SoftMin) operator to shift the objective from maximizing sum to maximizing consistency.
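The aggregation step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names `softmin_aggregate`, `mean`, and `variance`, the temperature `beta`, the normalization by the number of objectives, and the example reward vectors are all assumptions made for the example.

```python
import math


def mean(xs):
    """Arithmetic mean of a list of floats."""
    return sum(xs) / len(xs)


def variance(xs):
    """Population variance of a list of floats."""
    mu = mean(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)


def softmin_aggregate(rewards, beta=5.0):
    """Smooth minimum over per-objective rewards via LogSumExp.

    Hypothetical form: -(1/beta) * log(mean(exp(-beta * r_k))).
    The max is subtracted before exponentiating for numerical stability.
    """
    k = len(rewards)
    z = [-beta * r for r in rewards]
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))
    return -(lse - math.log(k)) / beta


# Two made-up reward vectors with the same arithmetic mean (0.6).
balanced = [0.6, 0.6, 0.6]   # every objective moderately satisfied
lopsided = [1.0, 0.8, 0.0]   # one objective (e.g. safety) fails outright

for name, r in [("balanced", balanced), ("lopsided", lopsided)]:
    print(name, "mean:", mean(r), "softmin:", softmin_aggregate(r))

# For small beta the operator stays close to mean - (beta / 2) * variance.
small_beta = 0.01
approx = mean(lopsided) - 0.5 * small_beta * variance(lopsided)
print("small-beta softmin:", softmin_aggregate(lopsided, small_beta), "approx:", approx)
```

The arithmetic mean cannot tell the two vectors apart, while the SoftMin score of the lopsided vector drops toward its failing objective. In this sketch `beta` controls the trade-off: values near zero recover the mean, and large values approach the hard minimum.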
In practice
- Apply RVPO for multi-objective LLM alignment.
- Use variance regularization to prevent constraint neglect.
- Improve reliability in safety-critical applications.
Topics
- RVPO
- Risk-Sensitive Alignment
- Constraint Neglect
- Multi-objective RLHF
- Variance Regularization
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Apple Machine Learning Research.