RVPO: Risk-Sensitive Alignment via Variance Regularization

Source: Apple Machine Learning Research · Field: Technology & Digital: Artificial Intelligence & Machine Learning, Natural Language Processing, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

Reward-Variance Policy Optimization (RVPO) is a risk-sensitive framework that addresses constraint neglect in critic-less Reinforcement Learning from Human Feedback (RLHF) methods. Existing methods aggregate multi-objective rewards with an arithmetic mean, so a high score on one objective can numerically offset, and thereby mask, a critical failure on another, such as safety or formatting. RVPO penalizes inter-reward variance during advantage aggregation, shifting the optimization objective from "maximize sum" to "maximize consistency." It does this with a LogSumExp (SoftMin) operator, which acts as a smooth variance penalty. RVPO was evaluated on rubric-based medical and scientific reasoning (Qwen2.5-3B/7B/14B, with up to 17 LLM-judged reward signals) and on tool-calling with rule-based constraints (Qwen2.5-1.5B/3B). It improved overall scores on HealthBench (0.261 vs. 0.215 for GDPO at 14B, p < 0.001) and maintained competitive accuracy on GPQA-Diamond, mitigating constraint neglect across model scales.
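The masking effect described above can be illustrated with a small sketch. It is not the paper's implementation: the function name `soft_min`, the temperature `beta`, the normalization by the number of objectives, and the reward values are all assumptions chosen for illustration. It compares an arithmetic mean with a LogSumExp-based SoftMin on two reward vectors that share the same mean.

```python
import math
from statistics import fmean, pvariance


def soft_min(values, beta=5.0):
    """Smooth minimum: -(1/beta) * log(mean(exp(-beta * v))).

    Hypothetical parametrization; the source does not give RVPO's exact form.
    """
    # Shift by the true minimum so every exponent is <= 0 (LogSumExp trick).
    m = min(values)
    total = sum(math.exp(-beta * (v - m)) for v in values)
    return m - math.log(total / len(values)) / beta


# Illustrative per-objective scores (made-up numbers, not results).
consistent = [0.6, 0.6, 0.6]   # every objective moderately satisfied
lopsided = [1.0, 1.0, -0.2]    # two objectives high, one constraint failed

for name, scores in [("consistent", consistent), ("lopsided", lopsided)]:
    print(
        name,
        "mean:", round(fmean(scores), 3),
        "variance:", round(pvariance(scores), 3),
        "soft_min:", round(soft_min(scores), 3),
    )
```

Both vectors have the same arithmetic mean, so a mean-based aggregate cannot tell them apart. The SoftMin leaves the consistent vector at its mean and scores the lopsided one much lower, because the failed objective dominates the aggregate.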

Key takeaway

For research scientists developing multi-objective RLHF systems, RVPO offers a way to reduce constraint neglect. Optimizing for consistency across rewards, rather than for their sum alone, makes it harder for gains on one objective to offset a failure on a critical constraint such as safety or formatting. Its variance regularization is most relevant in complex domains like medical reasoning or tool-calling, where diverse objectives must be met without sacrificing general capabilities.

Key insights

RVPO mitigates constraint neglect in multi-objective RLHF by penalizing inter-reward variance, favoring consistency across rewards over maximizing their sum.

Principles

Method

RVPO penalizes inter-reward variance during advantage aggregation using a LogSumExp (SoftMin) operator, shifting the objective from maximizing the sum of rewards to maximizing their consistency.
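One way to see why a LogSumExp SoftMin can behave like a variance penalty is a second-order expansion in the temperature. The form below is an assumed parametrization for illustration, not necessarily the one used in RVPO: $A_k$ stands for the per-objective advantages, $K$ for the number of objectives, and $\beta > 0$ for a temperature.

```latex
\begin{aligned}
\operatorname{SoftMin}_\beta(A_1,\dots,A_K)
  &= -\frac{1}{\beta}\,\log\!\Big(\frac{1}{K}\sum_{k=1}^{K} e^{-\beta A_k}\Big) \\
  &= \bar{A} \;-\; \frac{\beta}{2}\,\operatorname{Var}_k(A_k) \;+\; O(\beta^{2}),
  \qquad \bar{A} = \frac{1}{K}\sum_{k=1}^{K} A_k .
\end{aligned}
```

Under this assumption, small $\beta$ gives the arithmetic mean minus a penalty proportional to the variance across objectives, $\beta \to 0$ recovers the plain mean, and $\beta \to \infty$ recovers the hard minimum $\min_k A_k$.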

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Apple Machine Learning Research.