RL Scaling Laws for LLMs
Summary
This analysis examines how scaling laws are used in Large Language Model (LLM) training, contrasting their well-defined role in pretraining with their more complex, bespoke role in reinforcement learning (RL). In pretraining, scaling laws relate compute, model parameters, and data volume to performance (test loss), which makes model capabilities predictable by extrapolation. In RL, scaling laws are less standardized: they are often modeled as sigmoidal compute-performance curves or as log-linear power laws relating test loss to compute or data. The article details the Group Relative Policy Optimization (GRPO) algorithm and its variants (GSPO, DAPO, Dr. GRPO, TIS, CISPO), which aim to improve RL training stability and efficiency by addressing issues such as high variance, entropy collapse, and mismatches between the generation and training engines. Studies show that compute allocation in RL, particularly how much is spent on sampling rollouts, is crucial; the optimal allocation depends on factors such as problem difficulty and batch size, and larger models generally perform better given sufficient data.
Key takeaway
Research scientists optimizing LLM reinforcement learning should allocate compute systematically. Fit sigmoidal scaling curves to early-training results to predict asymptotic performance and compute efficiency. Prioritize increasing the number of rollouts per prompt, especially for harder problems, and match regularization to task difficulty (e.g., an entropy bonus for easy tasks, no regularization for hard tasks) to maintain training stability. This supports informed decisions on resource investment without incurring the full cost of large-scale experiments.
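One way to build intuition for why harder problems may call for more rollouts per prompt is to ask how often a group of rollouts contains both a success and a failure, since a group with identical rewards has zero variance. The sketch below is an illustration only and is not taken from the original article. It assumes binary rewards and independent rollouts with a fixed per-rollout success probability; the function name `prob_mixed_group` is hypothetical.

```python
def prob_mixed_group(p_success, n_rollouts):
    """Probability that n independent binary-reward rollouts are not all
    successes and not all failures (assumption: fixed success rate p_success)."""
    all_success = p_success ** n_rollouts
    all_failure = (1.0 - p_success) ** n_rollouts
    return 1.0 - all_success - all_failure


# Compare a hard prompt (2% success rate) across several group sizes.
for n in (4, 16, 64):
    print(n, prob_mixed_group(0.02, n))
```

Under these assumptions, the probability of a mixed group grows with the number of rollouts, and it grows slowly when the success rate is close to 0 or 1.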
Key insights
Scaling laws, while precise for LLM pretraining, are more complex and context-dependent in reinforcement learning.
Principles
- Larger models generally yield better performance.
- RL training benefits from increased sampling compute.
- Optimal regularization is problem-difficulty dependent.
Method
RL scaling laws can be modeled using sigmoidal curves or log-linear power laws to extrapolate performance from early training, allowing for efficient evaluation of different training configurations and compute allocations.
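The extrapolation idea can be sketched as follows. This is a minimal illustration, not the article's implementation: the four-parameter saturating form, the parameter names (`r0`, `asymptote`, `c_mid`, `slope`), and the synthetic data points are all assumptions, and the coarse grid search stands in for a proper nonlinear least-squares fit.

```python
def sigmoid_curve(compute, r0, asymptote, c_mid, slope):
    """Saturating curve in compute: starts near r0 and approaches `asymptote`.
    At compute == c_mid the value is halfway between r0 and asymptote."""
    return r0 + (asymptote - r0) / (1.0 + (c_mid / compute) ** slope)


def fit_sigmoid(computes, rewards, r0):
    """Coarse grid search that minimises squared error (illustrative only)."""
    best = None
    for a_i in range(50, 101):          # asymptote candidates: 0.50 .. 1.00
        asymptote = a_i / 100.0
        for m_i in range(0, 41):        # c_mid candidates: 10^0 .. 10^4, log-spaced
            c_mid = 10 ** (m_i / 10.0)
            for s_i in range(1, 31):    # slope candidates: 0.1 .. 3.0
                slope = s_i / 10.0
                err = sum(
                    (sigmoid_curve(c, r0, asymptote, c_mid, slope) - r) ** 2
                    for c, r in zip(computes, rewards)
                )
                if best is None or err < best[0]:
                    best = (err, asymptote, c_mid, slope)
    return {"error": best[0], "asymptote": best[1], "c_mid": best[2], "slope": best[3]}


# Synthetic "early training" measurements (made up for this sketch).
early_compute = [10.0, 30.0, 100.0, 300.0]
early_reward = [sigmoid_curve(c, 0.2, 0.8, 100.0, 1.0) for c in early_compute]

fit = fit_sigmoid(early_compute, early_reward, r0=0.2)
# Extrapolate to a compute budget well beyond the fitted range.
predicted = sigmoid_curve(10_000.0, 0.2, fit["asymptote"], fit["c_mid"], fit["slope"])
print(fit, predicted)
```

With real, noisy measurements the fitted asymptote is only an estimate, so comparing configurations this way works best when the early points already cover part of the curve's bend.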
In practice
- Use asynchronous RL with a split generator-trainer setup.
- Employ full precision for the LLM's language modeling head.
- Filter zero-variance prompts and use dynamic data curricula.
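The zero-variance filter from the list above can be sketched as follows. This is a hedged illustration rather than the article's code: it assumes the commonly described GRPO advantage (reward minus group mean, divided by group standard deviation), and the function names, the `eps` constant, and the example prompts are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Centre each reward on its group mean and scale by the group std.
    `eps` (an assumed constant) avoids division by zero."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def filter_zero_variance(groups):
    """Drop prompts whose rollouts all received the same reward."""
    return {prompt: rs for prompt, rs in groups.items() if len(set(rs)) > 1}


# Made-up rewards for three prompts, four rollouts each.
groups = {
    "easy_prompt": [1.0, 1.0, 1.0, 1.0],
    "mixed_prompt": [0.0, 1.0, 0.0, 1.0],
    "hard_prompt": [0.0, 0.0, 0.0, 0.0],
}

kept = filter_zero_variance(groups)
print(list(kept))
print(group_relative_advantages(groups["easy_prompt"]))
```

When every rollout in a group receives the same reward, every advantage is zero, so the prompt contributes no learning signal under this formulation; filtering such prompts avoids spending training compute on them.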
Topics
- RL Scaling Laws
- LLM Pretraining Scaling
- Group Relative Policy Optimization
- Compute-Optimal Allocation
- RL Optimization Techniques
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Deep (Learning) Focus.