RL Scaling Laws for LLMs

2024-03-04 · Source: Deep (Learning) Focus · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

This analysis traces how scaling laws are used in Large Language Model (LLM) training, contrasting their well-defined role in pretraining with their more bespoke role in reinforcement learning (RL). In pretraining, scaling laws relate compute, model parameters, and data volume to performance (test loss), which makes model capabilities predictable by extrapolation. In RL, scaling laws are less standardized: they are often modeled as sigmoidal compute-performance curves or as log-linear power laws relating test loss to compute or data. The article details the Group Relative Policy Optimization (GRPO) algorithm and its variants (GSPO, DAPO, Dr. GRPO, TIS, CISPO), which aim to improve RL training stability and efficiency by addressing issues such as high variance, entropy collapse, and engine mismatches. Studies show that compute allocation in RL matters, particularly the share spent on sampling rollouts. The optimal allocation depends on factors such as problem difficulty and batch size, and larger models generally perform better given sufficient data.

Key takeaway

Research scientists optimizing LLM reinforcement learning should allocate compute systematically. Fit sigmoidal scaling curves to early training phases to predict asymptotic performance and efficiency. Prioritize increasing the number of rollouts per prompt, especially for harder problems, and match regularization to task difficulty (e.g., an entropy bonus for easy tasks, no regularization for hard tasks) to keep training stable. This supports informed decisions on resource investment without incurring the full cost of large-scale experiments.

Key insights

Scaling laws, while precise for LLM pretraining, are more complex and context-dependent in reinforcement learning.

Principles

Larger models generally yield better performance.
RL training benefits from increased sampling compute.
Optimal regularization is problem-difficulty dependent.

Method

RL scaling laws can be modeled using sigmoidal curves or log-linear power laws to extrapolate performance from early training, allowing for efficient evaluation of different training configurations and compute allocations.
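
The extrapolation idea can be sketched in a few lines. The example below is a minimal illustration, not the article's implementation: the saturating functional form, the parameter names (`r0`, `asymptote`, `slope`, `c_mid`), the compute units, and the grid ranges are all assumptions, and the data points are synthetic and noise-free, generated from the curve itself rather than taken from any experiment. A coarse grid search stands in for a proper least-squares optimizer to keep the sketch dependency-free.

```python
import math

def sigmoid_curve(compute, r0, asymptote, slope, c_mid):
    """Assumed saturating form: near r0 at low compute, approaching `asymptote`."""
    return r0 + (asymptote - r0) / (1.0 + (c_mid / compute) ** slope)

def fit_sigmoid(computes, rewards, r0):
    """Coarse grid search for (asymptote, slope, c_mid) minimizing squared error."""
    best_err, best_params = math.inf, None
    for i in range(30, 101):            # asymptote candidates: 0.30 .. 1.00
        asymptote = i / 100
        for j in range(5, 31):          # slope candidates: 0.5 .. 3.0
            slope = j / 10
            for k in range(10, 51):     # c_mid candidates: 1e1 .. 1e5, log-spaced
                c_mid = 10 ** (k / 10)
                err = sum(
                    (sigmoid_curve(c, r0, asymptote, slope, c_mid) - r) ** 2
                    for c, r in zip(computes, rewards)
                )
                if err < best_err:
                    best_err, best_params = err, (asymptote, slope, c_mid)
    return best_params

# Synthetic "early training" points (hypothetical values, not real measurements).
TRUE = dict(r0=0.2, asymptote=0.8, slope=1.5, c_mid=1000.0)
early_compute = [50, 100, 200, 400, 800, 1600]
early_reward = [sigmoid_curve(c, **TRUE) for c in early_compute]

# Fit on the early points only, then extrapolate far beyond them.
asymptote, slope, c_mid = fit_sigmoid(early_compute, early_reward, r0=TRUE["r0"])
predicted = sigmoid_curve(100_000, TRUE["r0"], asymptote, slope, c_mid)
print(f"fitted asymptote={asymptote:.2f}, slope={slope:.1f}, c_mid={c_mid:.0f}")
print(f"extrapolated reward at 100k compute units: {predicted:.3f}")
```

With real, noisy measurements the fit would be less clean, and comparing fitted asymptotes across training configurations is only as reliable as the assumed curve shape.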

In practice

Use asynchronous RL with a split generator-trainer setup.
Employ full precision for the LLM's language modeling head.
Filter zero-variance prompts and use dynamic data curricula.
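
The zero-variance filter follows from how group-relative advantages work: if every rollout for a prompt earns the same reward, each advantage is zero and the prompt contributes no learning signal. The sketch below illustrates this under stated assumptions: the function names and the batch contents are hypothetical, rewards are binary, and the population standard deviation is used for normalization (implementations differ on this detail, and some variants drop the normalization altogether).

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's rollout group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std: an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

def filter_zero_variance(batch):
    """Keep only prompts whose rollouts received at least two distinct rewards."""
    return {prompt: rs for prompt, rs in batch.items() if len(set(rs)) > 1}

# Hypothetical batch: prompt id -> rewards of its sampled rollouts.
batch = {
    "too_easy": [1.0, 1.0, 1.0, 1.0],     # all correct: advantages would all be 0
    "too_hard": [0.0, 0.0, 0.0, 0.0],     # all wrong: advantages would all be 0
    "informative": [1.0, 0.0, 0.0, 1.0],  # mixed outcomes: carries signal
}

kept = filter_zero_variance(batch)
advantages = {p: group_relative_advantages(rs) for p, rs in kept.items()}
print(list(kept))
print(advantages["informative"])
```

Sampling more rollouts per prompt makes it more likely that a hard prompt produces at least one success and so survives this filter.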

Topics

RL Scaling Laws
LLM Pretraining Scaling
Group Relative Policy Optimization
Compute-Optimal Allocation
RL Optimization Techniques

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Deep (Learning) Focus.