GRPO in Production: The Failure Modes Nobody Writes About

2026-06-22 · Source: Deep Learning on Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, medium

Summary

GRPO (Group Relative Policy Optimization) is now the default RL algorithm for post-training large language models, adopted by models like DeepSeek-R1, Nemotron 3 Super, and Qwen-Math derivatives. It eliminates the PPO critic network, cutting training compute by nearly half. However, GRPO's group-relative advantage computation has three silent failure modes: advantage collapse, entropy collapse, and KL drift. Advantage collapse, the most common, occurs when every response in a group receives the same reward: each advantage is then zero, so the group contributes no gradient signal. Entropy collapse reduces response diversity, while KL drift results from an improperly tuned KL penalty. The article also identifies a fourth issue: sample-level loss normalization, which biases training against longer chain-of-thought responses by down-weighting their tokens. DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization) offers algorithmic fixes for these issues, including dynamic sampling, an asymmetric clip range, decoupled KL, and token-level normalization.
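To make the normalization issue concrete, the two loss forms can be written side by side. This is a sketch under assumed notation that does not appear in the original article: G is the group size, |o_i| is the token length of response i, and ℓ_{i,t} is the per-token loss.

```latex
\mathcal{L}_{\text{sample}}
  = \frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\ell_{i,t},
\qquad
\mathcal{L}_{\text{token}}
  = \frac{1}{\sum_{i=1}^{G}|o_i|}\sum_{i=1}^{G}\sum_{t=1}^{|o_i|}\ell_{i,t}
```

Under the sample-level form, each token of response i carries weight 1/(G·|o_i|), so a token in a 2,000-token response counts ten times less than a token in a 200-token response. Under the token-level form, every token in the group carries the same weight.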

Key takeaway

If you are an MLOps engineer or AI scientist deploying GRPO for LLM post-training, actively monitor for advantage collapse, entropy collapse, and KL drift. Implement DAPO's dynamic sampling, asymmetric clipping, decoupled KL, and token-level normalization to prevent training stalls and keep the model improving. Two upstream factors are critical to GRPO's success: the accuracy of your reward model and the entropy of the initial SFT checkpoint.

Key insights

GRPO, a popular LLM post-training algorithm, has three silent failure modes that stall training: advantage collapse, entropy collapse, and KL drift.

Principles

GRPO's group-relative advantage collapses to zero when every response in a group receives the same reward.
Policy entropy is critical for maintaining generation diversity.
KL penalty tuning controls how far the policy drifts from the reference model and how much training signal survives.
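The first principle can be checked numerically. The sketch below assumes the usual group-relative form, in which each reward is normalized by its own group's mean and standard deviation. The function name and the `eps` stabilizer are illustrative choices, not taken from the article.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group's mean and std.

    `eps` is an assumed stabilizer that avoids division by zero when
    every reward in the group is identical.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# A group with mixed outcomes produces non-zero advantages.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every response got the same reward produces all zeros,
# so it contributes nothing to the policy gradient.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print("mixed:  ", mixed)
print("uniform:", uniform)
```

The same all-zero result appears when every response fails, which is why prompts that are either too easy or too hard for the current policy produce no learning signal.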

Method

DAPO addresses GRPO's failures with four changes: dynamic sampling to filter out groups with uniform rewards, an asymmetric clip range to preserve entropy, a KL term decoupled from the reward, and token-level loss normalization so that longer responses are not down-weighted.
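Two of these fixes are small enough to sketch in plain Python. The code below is a minimal illustration, not DAPO's reference implementation. The function names, the dictionary layout of a group, and the `eps_low` / `eps_high` values are assumptions made for the example, and the asymmetric clip is applied to the policy probability ratio, as in a PPO-style surrogate.

```python
import math
import statistics


def keep_informative_groups(groups, min_std=1e-8):
    """Dynamic-sampling filter: drop groups whose rewards are all identical.

    Each group is assumed to be a dict with a "rewards" list.
    """
    return [g for g in groups if statistics.pstdev(g["rewards"]) > min_std]


def clipped_token_objective(logp_new, logp_old, advantage,
                            eps_low=0.2, eps_high=0.28):
    """Clipped surrogate for one token with separate lower and upper bounds.

    A larger upper bound leaves more room to raise the probability of
    currently unlikely tokens, which helps keep entropy from collapsing.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


groups = [
    {"prompt": "easy", "rewards": [1.0, 1.0, 1.0, 1.0]},   # all correct
    {"prompt": "mixed", "rewards": [1.0, 0.0, 0.0, 1.0]},  # informative
    {"prompt": "hard", "rewards": [0.0, 0.0, 0.0, 0.0]},   # all wrong
]
kept = keep_informative_groups(groups)
print([g["prompt"] for g in kept])

# Ratio 1.5 with a positive advantage is capped at 1 + eps_high.
gain = clipped_token_objective(math.log(1.5), 0.0, advantage=1.0)
# Ratio 0.5 with a negative advantage is capped at 1 - eps_low.
loss = clipped_token_objective(math.log(0.5), 0.0, advantage=-1.0)
print(gain, loss)
```

In a full training loop the filter would typically be paired with resampling, so that dropped groups are replaced until the batch is full. That loop is omitted here.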

In practice

Monitor entropy and advantage variance as key training metrics.
Audit your reward model before diagnosing GRPO training issues.
Check SFT model entropy before starting RL fine-tuning.
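The monitoring advice above can be turned into a per-step health check. The sketch below is one possible shape for it. The function names, the report keys, and both thresholds are hypothetical and would need tuning for a real run. The same entropy function can be applied to the SFT checkpoint's output distributions before RL starts.

```python
import math
import statistics


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def training_health(reward_groups, token_distributions,
                    min_entropy=0.1, max_dead_fraction=0.5):
    """Summarize one training step; thresholds are illustrative only."""
    # A "dead" group has identical rewards, hence zero advantage everywhere.
    dead = sum(1 for g in reward_groups if statistics.pvariance(g) == 0.0)
    dead_fraction = dead / len(reward_groups)
    mean_variance = statistics.fmean(
        [statistics.pvariance(g) for g in reward_groups]
    )
    mean_entropy = statistics.fmean(
        [token_entropy(p) for p in token_distributions]
    )

    alerts = []
    if dead_fraction > max_dead_fraction:
        alerts.append("advantage_collapse")
    if mean_entropy < min_entropy:
        alerts.append("entropy_collapse")

    return {
        "dead_group_fraction": dead_fraction,
        "mean_reward_variance": mean_variance,
        "mean_token_entropy": mean_entropy,
        "alerts": alerts,
    }


report = training_health(
    reward_groups=[[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], [1.0, 0.0, 1.0, 0.0]],
    token_distributions=[[0.97, 0.01, 0.01, 0.01], [0.25, 0.25, 0.25, 0.25]],
)
print(report)
```

Logging these values every step and watching their trend is usually more informative than any single threshold, since both collapses tend to develop gradually while the reward curve still looks healthy.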

Topics

GRPO
LLM Post-training
Policy Optimization
Advantage Collapse
Entropy Collapse
DAPO Algorithm

Best for: Machine Learning Engineer, AI Scientist, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Deep Learning on Medium.