GRPO++: Tricks for Making RL Actually Work
Summary
This overview covers improvements and "tricks" for Group Relative Policy Optimization (GRPO), a key reinforcement learning (RL) optimizer for training open-source reasoning models. GRPO is popular for its simplicity and efficiency, but vanilla GRPO suffers from entropy collapse, reward noise, and training instability, especially at scale. The article introduces DAPO, which addresses these issues with four changes: decoupled clipping bounds ("clip higher"), dynamic sampling that filters out prompts whose sampled responses are all correct or all incorrect (such groups yield zero advantage and therefore no learning signal), token-level loss aggregation to prevent length bias, and soft overlong reward shaping. It also presents Dr. GRPO, which modifies the advantage formulation and loss aggregation to mitigate length and difficulty biases and reaches 43.3% accuracy on AIME 2024 with Qwen-2.5-Math-7B. Finally, the article discusses Truncated Importance Sampling (TIS), which corrects for discrepancies between the sampler and learner engines, and other variants such as GSPO, GMPO, and CISPO, which improve stability and efficiency, particularly for Mixture-of-Experts (MoE) models.
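To make the advantage formulation and dynamic sampling concrete, here is a minimal sketch in plain Python. It assumes scalar rewards per sampled response and uses the population standard deviation; real implementations may differ on both points. The function names are hypothetical and are not taken from the article or any library.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Vanilla GRPO-style advantage: center by the group mean and
    divide by the group standard deviation (eps is an assumed guard)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dr_grpo_advantages(rewards):
    """Dr. GRPO-style advantage: center by the group mean only, without
    the standard-deviation division that can introduce difficulty bias."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


def keep_prompt(rewards):
    """Dynamic-sampling filter: keep a prompt only if its group has mixed
    outcomes. If every reward is identical, all advantages are zero."""
    return min(rewards) != max(rewards)
```

For a group with rewards `[1, 1, 1, 1]`, every advantage is zero, so the prompt contributes nothing to the update and `keep_prompt` rejects it.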
Key takeaway
For AI Engineers and Research Scientists developing reasoning LLMs, understanding and applying advanced GRPO modifications is crucial. Your team should integrate techniques like decoupled clipping, dynamic sampling, token-level loss, and Truncated Importance Sampling to overcome vanilla GRPO's limitations, ensuring stable training, improved sample efficiency, and higher model performance, especially when working with large-scale or MoE models. Continuously monitoring key metrics like entropy and response length will help diagnose and resolve training issues effectively.
Key insights
Optimizing GRPO for LLM reasoning requires addressing inherent biases and system-level mismatches to achieve stable and efficient training.
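One of the system-level mismatches mentioned here is the gap between the sampler and learner engines, which Truncated Importance Sampling (TIS) corrects. The sketch below assumes per-token log-probabilities are available from both engines; the function name and the default cap of 2.0 are illustrative assumptions, not values from the article.

```python
import math


def tis_weights(learner_logprobs, sampler_logprobs, cap=2.0):
    """Per-token truncated importance weights: min(pi_learner / pi_sampler, cap).

    The ratio is computed from log-probabilities for numerical stability.
    In training, these weights would typically scale the per-token loss
    and be treated as constants (no gradient flows through them).
    """
    return [
        min(math.exp(lp - sp), cap)
        for lp, sp in zip(learner_logprobs, sampler_logprobs)
    ]
```

When both engines agree exactly, every weight is 1 and the update is unchanged. The cap bounds the influence of tokens where the learner assigns much higher probability than the sampler did.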
Principles
- Maintain policy exploration to prevent entropy collapse.
- Ensure consistent reward signals and stable gradient updates.
- Align optimization granularity with reward structure.
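The first principle, maintaining exploration, is what DAPO's decoupled clipping ("clip higher") targets. A minimal per-token sketch follows, assuming the standard PPO-style clipped surrogate. The defaults of 0.2 and 0.28 follow the values commonly reported for DAPO, but treat them as assumptions to tune; the function name is hypothetical.

```python
def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped objective for one token, with separate lower and
    upper clipping bounds. A larger eps_high lets low-probability tokens
    with positive advantage grow more before the update is clipped."""
    clipped_ratio = max(1.0 - eps_low, min(ratio, 1.0 + eps_high))
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a symmetric bound (`eps_high=0.2`), a token with ratio 1.25 and positive advantage is already clipped at 1.2. With the higher bound it still contributes its full value of 1.25.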
Method
DAPO improves GRPO with decoupled clipping, dynamic sampling, token-level loss aggregation, and soft overlong reward shaping. Dr. GRPO modifies the advantage formulation and loss aggregation to reduce length and difficulty biases. TIS corrects mismatches between the sampler and learner engines.
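Two of the DAPO changes are easy to illustrate in isolation: loss aggregation and soft overlong reward shaping. The sketch below assumes per-token losses are given as nested lists (one list per response) and that the length penalty ramps linearly from 0 to -1 across a buffer of `cache_len` tokens before `max_len`. All names are hypothetical.

```python
def sample_level_loss(per_token_losses):
    """Vanilla GRPO-style aggregation: average tokens within each
    response, then average across responses. Each token in a long
    response therefore carries less weight than one in a short response."""
    per_response = [sum(t) / len(t) for t in per_token_losses]
    return sum(per_response) / len(per_response)


def token_level_loss(per_token_losses):
    """DAPO-style aggregation: average over all tokens in the batch,
    so every token carries equal weight regardless of response length."""
    total_tokens = sum(len(t) for t in per_token_losses)
    return sum(sum(t) for t in per_token_losses) / total_tokens


def soft_overlong_penalty(length, max_len, cache_len):
    """Soft overlong shaping: no penalty up to (max_len - cache_len),
    a linear ramp down to -1 at max_len, and -1 beyond it."""
    soft_start = max_len - cache_len
    if length <= soft_start:
        return 0.0
    if length <= max_len:
        return (soft_start - length) / cache_len
    return -1.0
```

For a batch with one 2-token response (loss 1.0 per token) and one 8-token response (loss 0.0 per token), the sample-level loss is 0.5 and the token-level loss is 0.2, which shows how the two schemes weight short and long responses differently.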
In practice
- Use larger batch and group sizes for GRPO training.
- Curate diverse prompts, filtering out easily guessable questions.
- Monitor response length, training reward, entropy, and held-out evaluation.
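The monitoring advice above can be sketched with two simple metrics. This assumes access to the full next-token probability distribution at each generated position, which in practice would come from the model's softmax outputs; the function names are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_entropy(distributions):
    """Average entropy over all generated positions. A steady drop toward
    zero during training is a common sign of entropy collapse."""
    return sum(token_entropy(d) for d in distributions) / len(distributions)


def mean_response_length(responses):
    """Average number of tokens per sampled response."""
    return sum(len(r) for r in responses) / len(responses)
```

Logging these values alongside training reward and held-out evaluation at every step makes it easier to tell a genuine improvement from a policy that has collapsed onto a narrow set of outputs or is drifting toward ever-longer responses.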
Topics
- Group Relative Policy Optimization
- Reinforcement Learning
- LLM Reasoning
- Policy Optimization
- Training Stability
Best for: Research Scientist, AI Engineer, Machine Learning Engineer, AI Researcher, AI Scientist, Deep Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Deep (Learning) Focus.