GRPO++: Tricks for Making RL Actually Work

· Source: Deep (Learning) Focus · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Advanced, extended

Summary

This overview details improvements and "tricks" for Group Relative Policy Optimization (GRPO), a widely used reinforcement learning (RL) optimizer for training open-source reasoning models. GRPO is popular for its simplicity and efficiency, but vanilla GRPO suffers from entropy collapse, reward noise, and training instability, especially at scale. The article covers DAPO, which addresses these issues with four changes: decoupled clipping bounds ("clip higher"), dynamic sampling that filters out prompts whose sampled responses are all correct or all incorrect (such groups yield zero advantage and therefore no gradient signal), token-level loss aggregation to prevent length bias, and soft overlong reward shaping. It also presents Dr. GRPO, which removes the standard-deviation normalization from the advantage and the per-response length normalization from the loss to mitigate difficulty and length biases, achieving 43.3% accuracy on AIME 2024 with Qwen-2.5-Math-7B. Finally, the article discusses Truncated Importance Sampling (TIS), which corrects for discrepancies between the sampler (inference) and learner (training) engines, and other variants such as GSPO, GMPO, and CISPO, which improve stability and efficiency, particularly for Mixture-of-Experts (MoE) models.
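To make the advantage and filtering ideas above concrete, here is a minimal sketch in plain Python. The function names (`grpo_advantages`, `dr_grpo_advantages`, `keep_group`), the use of the population standard deviation, and the `eps` constant are assumptions made for illustration; they are not taken from the article or from any particular library.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Vanilla GRPO-style advantage: centre by the group mean, scale by the group std.

    Assumption: population std plus a small eps for numerical safety.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dr_grpo_advantages(rewards):
    """Dr. GRPO-style advantage: centre by the group mean only (no std scaling)."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


def keep_group(rewards):
    """Dynamic-sampling filter: keep a prompt only if its group has mixed rewards.

    If every response in the group gets the same reward, all advantages are zero
    and the prompt contributes no gradient signal.
    """
    return max(rewards) != min(rewards)


if __name__ == "__main__":
    group = [1.0, 0.0, 0.0, 0.0]  # one correct response out of four
    print(grpo_advantages(group))
    print(dr_grpo_advantages(group))
    print(keep_group(group), keep_group([1.0, 1.0, 1.0, 1.0]))
```

In a real pipeline the filter would typically be combined with over-sampling, so that the batch is refilled with prompts that still carry signal.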

Key takeaway

For AI Engineers and Research Scientists developing reasoning LLMs, understanding and applying advanced GRPO modifications is crucial. Your team should integrate techniques like decoupled clipping, dynamic sampling, token-level loss, and Truncated Importance Sampling to overcome vanilla GRPO's limitations, ensuring stable training, improved sample efficiency, and higher model performance, especially when working with large-scale or MoE models. Continuously monitoring key metrics like entropy and response length will help diagnose and resolve training issues effectively.
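As a rough illustration of the monitoring advice, the sketch below computes mean token entropy and mean response length for a batch of rollouts. The input layout (a list of responses, each a list of per-token probability distributions) and the names `token_entropy` and `rollout_stats` are hypothetical; a real training stack would compute these from logits on the accelerator.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def rollout_stats(token_dists):
    """Summarise a batch of rollouts.

    token_dists: list of responses; each response is a list of per-token
    probability distributions (hypothetical layout chosen for this sketch).
    """
    entropies = [token_entropy(d) for resp in token_dists for d in resp]
    lengths = [len(resp) for resp in token_dists]
    return {
        "mean_entropy": sum(entropies) / len(entropies),
        "mean_length": sum(lengths) / len(lengths),
    }


if __name__ == "__main__":
    batch = [
        [[0.5, 0.5]],                          # one uncertain token
        [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]],  # three fully confident tokens
    ]
    print(rollout_stats(batch))
```

A steady fall of `mean_entropy` toward zero is one possible sign of entropy collapse, and a sharp drift in `mean_length` can point to length bias or truncation problems.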

Key insights

Optimizing GRPO for LLM reasoning requires addressing inherent biases and system-level mismatches to achieve stable and efficient training.

Principles

Method

DAPO improves GRPO through decoupled clipping, dynamic sampling, token-level loss aggregation, and soft overlong reward shaping. Dr. GRPO changes the advantage formulation and the loss aggregation to reduce length and difficulty biases. TIS corrects for mismatches between the sampler and learner engines.
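The sketch below illustrates three of these mechanisms on plain Python lists: a clipped surrogate with separate lower and upper bounds, sample-level versus token-level loss aggregation, and a truncated importance weight between learner and sampler. All function names are hypothetical, and the default values for `eps_low`, `eps_high`, and `cap` are illustrative assumptions rather than settings confirmed by the article.

```python
import math


def clipped_token_losses(logp_new, logp_old, advantage, eps_low=0.2, eps_high=0.28):
    """Per-token PPO-style surrogate with decoupled clip bounds ("clip higher").

    Setting eps_high > eps_low leaves more room to raise the probability of
    low-probability tokens. Default values are illustrative assumptions.
    """
    losses = []
    for ln, lo in zip(logp_new, logp_old):
        ratio = math.exp(ln - lo)
        clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
        losses.append(-min(ratio * advantage, clipped * advantage))
    return losses


def sample_level_loss(per_response_losses):
    """Mean over tokens within each response, then mean over responses."""
    per_resp = [sum(l) / len(l) for l in per_response_losses]
    return sum(per_resp) / len(per_resp)


def token_level_loss(per_response_losses):
    """Mean over all tokens in the batch, so every token carries equal weight."""
    total_tokens = sum(len(l) for l in per_response_losses)
    return sum(sum(l) for l in per_response_losses) / total_tokens


def tis_weight(logp_learner, logp_sampler, cap=2.0):
    """Truncated importance weight min(pi_learner / pi_sampler, cap) for one token.

    The cap value is an illustrative assumption.
    """
    return min(math.exp(logp_learner - logp_sampler), cap)


if __name__ == "__main__":
    losses = [[1.0, 1.0, 1.0, 1.0], [3.0]]  # a long response and a short one
    print(sample_level_loss(losses), token_level_loss(losses))
    print(clipped_token_losses([math.log(1.25)], [0.0], advantage=1.0))
    print(tis_weight(0.0, -5.0))
```

In this toy batch the single-token response dominates the sample-level average, while the token-level average weights all five tokens equally. In a TIS-style correction, each per-token loss would be multiplied by its `tis_weight` before aggregation.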

Best for: Research Scientist, AI Engineer, Machine Learning Engineer, AI Researcher, AI Scientist, Deep Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Deep (Learning) Focus.