GRPO++: Tricks for Making RL Actually Work

2024-03-04 · Source: Deep (Learning) Focus · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Advanced, extended

Summary

This overview covers improvements and "tricks" for Group Relative Policy Optimization (GRPO), a widely used reinforcement learning (RL) optimizer for training open-source reasoning models. GRPO is popular for its simplicity and efficiency, but vanilla GRPO suffers from entropy collapse, reward noise, and training instability, especially at scale. The article introduces DAPO, which addresses these issues with four changes: decoupled clipping bounds ("clip higher"), dynamic sampling that filters out prompts whose sampled responses are all correct or all incorrect (and therefore carry no gradient signal), token-level loss aggregation to prevent length bias, and soft overlong reward shaping. It also presents Dr. GRPO, which modifies the advantage formulation and loss aggregation to mitigate length and difficulty biases, achieving 43.3% accuracy on AIME 2024 with Qwen-2.5-Math-7B. Finally, the article discusses Truncated Importance Sampling (TIS), which corrects for discrepancies between the sampler and learner engines, and variants such as GSPO, GMPO, and CISPO, which improve stability and efficiency, particularly for Mixture-of-Experts (MoE) models.
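To make the advantage and dynamic-sampling ideas concrete, here is a minimal sketch in plain Python. The function names, the use of the population standard deviation, and the `eps` constant are illustrative assumptions, not code from the article or from any library.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Vanilla GRPO-style advantage: center by the group mean, scale by the group std.

    Assumption: population std and a small eps for numerical safety.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def dr_grpo_advantages(rewards):
    """Dr. GRPO-style advantage: center by the group mean only, no std scaling."""
    mu = mean(rewards)
    return [r - mu for r in rewards]


def keep_prompt(rewards):
    """Dynamic-sampling filter: keep a prompt only if its group has mixed rewards.

    If every response gets the same reward (all correct or all incorrect),
    every advantage is zero and the prompt contributes no gradient.
    """
    return max(rewards) != min(rewards)
```

With binary rewards such as `[1.0, 0.0, 0.0, 0.0]`, `keep_prompt` returns `True`, while `[1.0, 1.0, 1.0, 1.0]` and `[0.0, 0.0, 0.0, 0.0]` are both filtered out. Dividing by the group std rescales advantages differently for prompts of different difficulty, which is the difficulty bias that the centered-only variant avoids.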

Key takeaway

For AI Engineers and Research Scientists developing reasoning LLMs, understanding and applying advanced GRPO modifications is crucial. Your team should integrate techniques like decoupled clipping, dynamic sampling, token-level loss, and Truncated Importance Sampling to overcome vanilla GRPO's limitations, ensuring stable training, improved sample efficiency, and higher model performance, especially when working with large-scale or MoE models. Continuously monitoring key metrics like entropy and response length will help diagnose and resolve training issues effectively.

Key insights

Optimizing GRPO for LLM reasoning requires addressing inherent biases and system-level mismatches to achieve stable and efficient training.

Principles

Maintain policy exploration to prevent entropy collapse.
Ensure consistent reward signals and stable gradient updates.
Align optimization granularity with reward structure.

Method

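DAPO improves GRPO by using decoupled clipping, dynamic sampling, token-level loss, and soft overlong reward shaping. Dr. GRPO reduces length and difficulty biases by removing the per-response length normalization from the loss and the standard-deviation normalization from the advantage. TIS corrects mismatches between the sampler and learner engines by reweighting updates with a capped importance ratio.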
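The pieces named above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the article's implementation: all function names are hypothetical, the clipping defaults (0.2 and 0.28) follow the values reported for DAPO, and the TIS cap of 2.0 is an arbitrary placeholder.

```python
import math


def dapo_token_loss(logp_new, logp_old, advantages, eps_low=0.2, eps_high=0.28):
    """Clipped surrogate loss with decoupled bounds, averaged over all tokens.

    logp_new, logp_old: one list of token log-probs per response.
    advantages: one scalar per response, shared by all of its tokens.
    """
    total, n_tokens = 0.0, 0
    for new, old, adv in zip(logp_new, logp_old, advantages):
        for ln, lo in zip(new, old):
            ratio = math.exp(ln - lo)
            # Separate lower and upper bounds ("clip higher" raises the upper one).
            clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
            total += min(ratio * adv, clipped * adv)
            n_tokens += 1
    # Token-level aggregation: divide by the total token count in the batch,
    # not by each response's own length.
    return -total / n_tokens


def soft_overlong_penalty(length, max_len, cache_len):
    """Length penalty that ramps linearly from 0 to -1 over the last cache_len tokens."""
    if length <= max_len - cache_len:
        return 0.0
    if length <= max_len:
        return ((max_len - cache_len) - length) / cache_len
    return -1.0


def tis_weight(logp_learner, logp_sampler, cap=2.0):
    """Truncated importance weight: learner/sampler probability ratio, capped at `cap`."""
    return min(math.exp(logp_learner - logp_sampler), cap)
```

With token-level aggregation, a three-token response with advantage -1 outweighs a one-token response with advantage +1, whereas averaging per response first would weight the two equally. In a real training loop, `tis_weight` would multiply each token's loss term to account for the sampler engine assigning a different probability than the learner.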

In practice

Use larger batch and group sizes for GRPO training.
Curate diverse prompts, filtering out easily guessable questions.
Monitor response length, training reward, entropy, and held-out evaluation.
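As a rough illustration of the monitoring advice, the sketch below computes two of the listed metrics from raw rollout data. The function names and input shapes are assumptions for illustration; a real pipeline would compute these from model logits and tokenized rollouts.

```python
import math


def mean_token_entropy(token_dists):
    """Average Shannon entropy (in nats) over per-token next-token distributions.

    token_dists: list of probability lists, one per generated token.
    A steady slide toward zero is a typical sign of entropy collapse.
    """
    entropies = [
        -sum(p * math.log(p) for p in probs if p > 0.0)
        for probs in token_dists
    ]
    return sum(entropies) / len(entropies)


def mean_response_length(responses):
    """Average number of tokens per response in a batch of token-id lists."""
    return sum(len(r) for r in responses) / len(responses)
```

Logging these values at every training step, alongside training reward and held-out evaluation scores, makes it easier to tell whether a drop in reward coincides with collapsing entropy or runaway response length.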

Topics

Group Relative Policy Optimization
Reinforcement Learning
LLM Reasoning
Policy Optimization
Training Stability

Best for: Research Scientist, AI Engineer, Machine Learning Engineer, AI Researcher, AI Scientist, Deep Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Deep (Learning) Focus.