GRPO in Production: The Failure Modes Nobody Writes About
Summary
GRPO (Group Relative Policy Optimization) is now the default RL algorithm for post-training large language models, used for models such as DeepSeek-R1, Nemotron 3 Super, and Qwen-Math derivatives. It eliminates the PPO critic network, cutting training compute by nearly half. However, GRPO's group-relative advantage computation has three silent failure modes: advantage collapse, entropy collapse, and KL drift. Advantage collapse, the most common, occurs when every response in a group receives the same reward, so every group-relative advantage is zero and the group contributes no gradient signal. Entropy collapse reduces response diversity, while KL drift results from an improperly tuned KL penalty. The article also identifies a fourth issue: sample-level loss normalization, which biases training against longer chain-of-thought responses. DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) offers algorithmic fixes for these issues, including dynamic sampling, an asymmetric KL clip, decoupled KL, and token-level normalization.
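To make advantage collapse concrete, here is a minimal sketch of a group-relative advantage, assuming the common formulation in which rewards are standardized within each group of sampled responses. The function name and the `eps` stabilizer are illustrative choices, not taken from the article.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of responses to the same prompt.

    eps is an assumed stabilizer that avoids division by zero.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# A group with mixed outcomes yields non-zero advantages.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every response got the same reward yields all zeros,
# so it contributes nothing to the policy gradient.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed)
print(uniform)  # every entry is 0.0
```

The collapse is silent because the loss still computes without error; the batch simply carries less signal than its size suggests.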
Key takeaway
If you are an MLOps engineer or AI scientist deploying GRPO for LLM post-training, actively monitor for advantage collapse, entropy collapse, and KL drift. Implement DAPO's dynamic sampling, asymmetric KL clipping, decoupled KL, and token-level normalization to prevent training stalls and keep the model improving. Two upstream factors strongly affect whether GRPO succeeds: the accuracy of your reward model and the entropy of the initial SFT checkpoint.
Key insights
GRPO, a popular LLM post-training algorithm, has three silent failure modes (advantage collapse, entropy collapse, and KL drift) that stall training without raising an error.
Principles
- GRPO's group-relative advantage is prone to collapse with uniform rewards.
- Policy entropy is critical for maintaining generation diversity.
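- KL penalty tuning controls the trade-off between policy drift and training signal: too weak a penalty lets the policy drift from the reference, too strong a penalty suppresses learning.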
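The entropy and KL quantities behind these principles can be sketched as follows. This is an illustration under stated assumptions: `token_entropy` works on raw logits, and `kl_to_reference` uses the non-negative estimator r - log r - 1 on log-probabilities of the sampled tokens. The article summary does not say which estimator it uses, so treat that choice and both function names as hypothetical.

```python
import numpy as np

def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution per position.

    logits: array of shape (seq_len, vocab_size).
    """
    logits = np.asarray(logits, dtype=np.float64)
    z = logits - logits.max(axis=-1, keepdims=True)  # stabilize the softmax
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    probs = np.exp(log_probs)
    return -(probs * log_probs).sum(axis=-1)

def kl_to_reference(logp_policy, logp_ref):
    """Per-token estimate of KL(policy || reference) on sampled tokens.

    With r = p_ref / p_policy, the value r - log(r) - 1 is always >= 0
    and equals 0 when the two log-probabilities match.
    """
    log_ratio = (np.asarray(logp_ref, dtype=np.float64)
                 - np.asarray(logp_policy, dtype=np.float64))
    return np.exp(log_ratio) - log_ratio - 1.0

# A flat distribution has high entropy; a peaked one has entropy near zero.
flat = token_entropy([[0.0, 0.0, 0.0, 0.0]])
peaked = token_entropy([[10.0, 0.0, 0.0, 0.0]])
print(flat, peaked)
```

Averaging `token_entropy` over generated tokens gives a single diversity number per step, and averaging `kl_to_reference` gives a drift number to plot alongside it.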
Method
DAPO addresses these GRPO failures with four changes: dynamic sampling, which filters out groups whose rewards are uniform; an asymmetric KL clip, which preserves entropy; a KL term decoupled from the reward; and token-level loss normalization, which stops longer responses from being under-weighted.
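Two of these fixes are simple enough to sketch directly. The code below is an illustration, not DAPO's reference implementation: `keep_informative_groups` is a hypothetical helper that drops uniform-reward groups, and the two loss functions contrast sample-level with token-level averaging on made-up per-token losses.

```python
import numpy as np

def keep_informative_groups(group_rewards, tol=1e-8):
    """Return indices of groups whose rewards are not all identical."""
    kept = []
    for i, rewards in enumerate(group_rewards):
        r = np.asarray(rewards, dtype=np.float64)
        if r.max() - r.min() > tol:
            kept.append(i)
    return kept

def sample_level_loss(token_losses):
    """Average within each response first, then across responses.

    Every response gets equal weight, so each token of a long
    response counts for less than each token of a short one.
    """
    return float(np.mean([np.mean(t) for t in token_losses]))

def token_level_loss(token_losses):
    """Average over all tokens in the batch, so each token counts equally."""
    flat = np.concatenate([np.asarray(t, dtype=np.float64) for t in token_losses])
    return float(flat.mean())

# Groups 0 and 2 have uniform rewards and would be filtered out.
print(keep_informative_groups([[1, 1, 1], [0, 1, 0], [0, 0, 0]]))  # [1]

# One 2-token response with zero loss, one 8-token response with loss 1.0.
losses = [[0.0, 0.0], [1.0] * 8]
print(sample_level_loss(losses))  # 0.5
print(token_level_loss(losses))   # 0.8
```

In the toy batch, the long response supplies 80% of the tokens but only 50% of the sample-level loss, which is the length bias the token-level variant removes. In a real training loop the filtered groups would typically be replaced by sampling more prompts until the batch is full; that resampling step is omitted here.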
In practice
- Monitor entropy and advantage variance as key training metrics.
- Audit your reward model before diagnosing GRPO training issues.
- Check SFT model entropy before starting RL fine-tuning.
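The monitoring advice above could be wired into a training loop with a small per-step health check such as the sketch below. The function name, the metric names, and all three thresholds are placeholders chosen for illustration; suitable values depend on the model, the reward scale, and the task.

```python
import numpy as np

def grpo_health_report(group_rewards, mean_entropy, mean_kl,
                       max_dead_fraction=0.5, min_entropy=0.1, max_kl=0.5):
    """Flag the three failure modes from per-step statistics.

    group_rewards: array of shape (num_groups, group_size).
    mean_entropy, mean_kl: scalars already averaged over generated tokens.
    All thresholds are illustrative assumptions, not recommended values.
    """
    rewards = np.asarray(group_rewards, dtype=np.float64)
    reward_std = rewards.std(axis=1)
    # A "dead" group has identical rewards and yields zero advantages.
    dead_fraction = float((reward_std < 1e-8).mean())
    return {
        "dead_group_fraction": dead_fraction,
        "mean_reward_std": float(reward_std.mean()),
        "advantage_collapse": dead_fraction > max_dead_fraction,
        "entropy_collapse": mean_entropy < min_entropy,
        "kl_drift": mean_kl > max_kl,
    }

report = grpo_health_report(
    group_rewards=[[1, 1, 1, 1], [0, 0, 0, 0], [1, 0, 1, 0], [1, 1, 1, 1]],
    mean_entropy=0.05,
    mean_kl=0.01,
)
print(report)
```

Logging `dead_group_fraction` on every step is useful on its own: a value that climbs toward 1.0 means most of the batch carries no gradient signal, whether the cause is a saturated policy or a reward model that cannot discriminate.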
Topics
- GRPO
- LLM Post-training
- Policy Optimization
- Advantage Collapse
- Entropy Collapse
- DAPO Algorithm
Best for: Machine Learning Engineer, AI Scientist, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Deep Learning on Medium.