A Gradient Perspective on RLVR Stability and Winner Advantage Policy Optimization
Summary
A new analysis examines the instability of GRPO-style optimization in Reinforcement Learning with Verifiable Rewards (RLVR), a method used to improve language model reasoning. Using token-level gradient dynamics, the authors develop a taxonomy showing that optimization stability is jointly determined by the advantage sign and the token distribution under the current policy. Based on this finding, they introduce Winner Advantage Policy Optimization (WAPO), a simple online clipped policy-gradient objective that updates only on completions with a positive advantage. On mathematical reasoning and multi-hop QA benchmarks, WAPO consistently improves training stability and matches or surpasses existing baselines across model families. The full code is available on GitHub.
Key takeaway
Machine learning engineers who train language models with RLVR and see instability under GRPO-style optimization should consider Winner Advantage Policy Optimization (WAPO). WAPO updates only on positive-advantage completions, an approach that improved training stability and delivered competitive performance on mathematical reasoning and multi-hop QA benchmarks. The authors' GitHub repository provides the code for trying WAPO in an existing project.
Key insights
RLVR optimization stability depends jointly on the advantage sign and the token distribution under the current policy; WAPO addresses this by updating only on positive-advantage completions.
Principles
- RLVR optimization stability depends on the sign of the advantage.
- The token distribution under the current policy shapes how each update behaves.
- Updating only on positive-advantage completions improves training stability.
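One way to see how both factors enter a single update is the standard softmax identity: for a sampled token y with advantage A, the gradient of A · log π(y) with respect to the logits is A · (onehot(y) − π). The sketch below only evaluates that identity on made-up logits. It is an illustration of the general calculus, not the paper's taxonomy, and the function names and values are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def logit_gradient(logits, token, advantage):
    """Gradient of advantage * log pi(token) with respect to the logits.

    Uses the softmax identity d log pi(token) / d z_k = 1[k == token] - pi_k.
    """
    probs = softmax(logits)
    one_hot = np.zeros_like(probs)
    one_hot[token] = 1.0
    return advantage * (one_hot - probs)

# Hypothetical example: a peaked distribution where the sampled token is rare.
logits = np.array([4.0, 1.0, 0.0, -2.0])
rare_token = 3

g_pos = logit_gradient(logits, rare_token, advantage=+1.0)
g_neg = logit_gradient(logits, rare_token, advantage=-1.0)

print("positive advantage:", g_pos)
print("negative advantage:", g_neg)
```

In this toy case the two gradients are exact mirror images, but their effect differs. With a positive advantage, an ascent step raises the sampled token's logit and lowers the others. With a negative advantage on a rare token, the step raises every other logit in proportion to its current probability, so most of the push goes to the token that already dominates. The size and direction of the update therefore depend on both the sign of A and the shape of π.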
Method
Winner Advantage Policy Optimization (WAPO) is an online clipped policy-gradient objective that updates only on completions with a positive advantage.
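A minimal sketch of what a winner-only clipped objective could look like, under several assumptions that the summary does not confirm: advantages are standardized within a group of completions as in GRPO, the probability ratio is taken at the sequence level, the clip range `clip_eps` is 0.2, and the loss is averaged over the positive-advantage completions. The function names are hypothetical. The authors' repository is the reference for the actual objective.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: rewards standardized within one prompt's group.

    Assumption: the summary does not say how advantages are computed.
    """
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def winner_only_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate loss over positive-advantage completions.

    logp_new, logp_old: sequence-level log-probabilities, shape (G,).
    Completions with advantage <= 0 are masked out and contribute nothing.
    """
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    advantages = np.asarray(advantages, dtype=float)

    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = np.minimum(ratio * advantages, clipped * advantages)

    winners = advantages > 0
    if not winners.any():
        # No positive-advantage completion in this group: nothing to update on.
        return 0.0
    return float(-surrogate[winners].mean())

# Hypothetical group of four completions with binary verifiable rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_advantages(rewards)
logp_old = np.array([-12.0, -15.0, -9.0, -11.0])
logp_new = logp_old.copy()  # on-policy step: every ratio equals 1

loss = winner_only_clipped_loss(logp_new, logp_old, adv)
print("loss:", loss)
```

This NumPy version only evaluates the loss value. In training, the same expression would be written in an autodiff framework so that gradients flow through `logp_new`. Because of the mask, changing the log-probabilities of the losing completions leaves the loss unchanged.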
In practice
- Apply WAPO for stable RLVR training.
- Use WAPO on mathematical reasoning tasks.
- Implement WAPO for multi-hop QA models.
Topics
- Reinforcement Learning with Verifiable Rewards
- Policy Optimization
- Gradient Dynamics
- Language Models
- Mathematical Reasoning
- Multi-hop QA
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.