JustRL: A Simple RL Recipe Just Works, No Tricks/Patches Needed

2024-03-06 · Source: The Salt - Curated AI · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Advanced, quick

Summary

JustRL introduces a simplified reinforcement learning (RL) recipe for scaling 1.5B-parameter large language models (LLMs) on mathematical reasoning. It uses a single-stage GRPO run with fixed hyperparameters, a basic rule-based verifier, and a flat 16k context window, and it intentionally omits KL terms, entropy regularization, dynamic sampling, and multi-stage curricula. Applied to DeepSeek-R1-Distill-Qwen-1.5B and OpenMath-Nemotron-1.5B, this minimal setup lets JustRL-DeepSeek-1.5B slightly outperform ProRL-V2 on nine math benchmarks with approximately half the token budget, and lets JustRL-Nemotron surpass QuestA with 2 to 2.5 times less compute. The results suggest that a "barebones" RL setup can be competitive without architectural changes or extra supervision.
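To make "basic rule-based verifier" concrete, here is a minimal sketch of what such a reward function might look like. The summary does not describe the actual rules JustRL uses, so everything here is an assumption for illustration: the helper names (`extract_boxed`, `normalize`, `rule_based_reward`), the convention that the final answer sits inside `\boxed{...}`, and the binary 0/1 reward.

```python
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    # Return the content of the last \boxed{...} in the text, or None.
    # Braces are counted so nested LaTeX such as \frac{1}{2} survives.
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # unbalanced braces: treat as no answer


def normalize(answer: str) -> str:
    # Hypothetical normalization: drop spaces and a trailing period.
    return answer.strip().replace(" ", "").rstrip(".")


def rule_based_reward(response: str, reference: str) -> float:
    # Binary reward (an assumption): 1.0 on exact match after normalization.
    predicted = extract_boxed(response)
    if predicted is None:
        return 0.0
    return 1.0 if normalize(predicted) == normalize(reference) else 0.0
```

A verifier of this kind needs no learned reward model, which is part of what keeps the recipe simple. A production verifier would likely also need to check mathematical equivalence (for example `0.5` versus `\frac{1}{2}`), which this sketch does not attempt.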

Key takeaway

Research scientists optimizing LLM performance should critically evaluate whether complex RL techniques are necessary. JustRL demonstrates that a simplified, single-stage GRPO approach with fixed hyperparameters can achieve competitive or superior results with significantly less computational overhead, suggesting that many "fixes" for RL pathologies might be unnecessary when starting from a clean baseline.

Key insights

A simplified RL approach can match or slightly outperform more complex pipelines when scaling 1.5B LLMs on math reasoning.

Principles

Simplicity in RL can yield superior performance.
Avoid cargo-culting complex RL "best practices".

Method

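JustRL uses a single-stage GRPO run with fixed hyperparameters: 8 rollouts per prompt, a batch size of 256, a constant learning rate of 1e-6, and a maximum response length of 15k tokens. The one extra ingredient it keeps is the "clip higher" trick, which sets a larger upper clipping bound on the policy ratio.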
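The core computation behind this setup can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the code from thunlp/JustRL: the function names are hypothetical, the clipping bounds `eps_low=0.2` and `eps_high=0.28` are placeholder values not given in this summary, and averaging the loss over tokens is one of several possible choices. Note that there is no KL penalty and no entropy bonus in the loss.

```python
import math
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    # GRPO-style advantage: standardize each rollout's reward against the
    # mean and standard deviation of its own group (same prompt).
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_term(ratio: float, advantage: float,
                 eps_low: float = 0.2, eps_high: float = 0.28) -> float:
    # PPO-style clipped surrogate with an asymmetric range ("clip higher"):
    # the upper bound 1 + eps_high is looser than the lower bound 1 - eps_low.
    # Both eps values are assumptions for illustration.
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)


def grpo_loss(logp_new: List[List[float]], logp_old: List[List[float]],
              advantages: List[float]) -> float:
    # logp_new[i][t] / logp_old[i][t]: per-token log-probabilities of rollout i
    # under the current and the sampling policy. Every token in a rollout
    # shares that rollout's advantage. No KL term, no entropy term.
    total, count = 0.0, 0
    for new_lp, old_lp, adv in zip(logp_new, logp_old, advantages):
        for a, b in zip(new_lp, old_lp):
            total += clipped_term(math.exp(a - b), adv)
            count += 1
    return -total / count  # negate: optimizers minimize


# Usage: 8 rollouts for one prompt, scored 0/1 by a rule-based verifier.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
advantages = group_advantages(rewards)
```

In this sketch, a group where every rollout receives the same reward yields all-zero advantages and therefore no gradient signal. Dynamic sampling is one technique designed to filter out such groups, and it is among the components the recipe leaves out.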

In practice

Implement single-stage GRPO for LLM fine-tuning.
Test minimal RL setups before adding complexity.
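One way to act on both points is to encode the minimal recipe as a fixed configuration in which every optional component is an explicit, off-by-default flag, so each addition has to be justified against the baseline. The sketch below is hypothetical: the class and field names are invented for illustration, the numeric values are the ones listed under Method, and reading "15k" as 15,000 tokens is an assumption.

```python
from dataclasses import dataclass, fields, replace
from typing import List


@dataclass(frozen=True)
class MinimalGRPOConfig:
    # Fixed hyperparameters from the Method section.
    rollouts_per_prompt: int = 8
    batch_size: int = 256
    learning_rate: float = 1e-6          # constant, no schedule
    max_response_tokens: int = 15_000    # "15k"; exact token count is assumed
    single_stage: bool = True            # no multi-stage curriculum
    # Optional components, all disabled except the one the recipe keeps.
    use_clip_higher: bool = True
    use_kl_penalty: bool = False
    use_entropy_regularization: bool = False
    use_dynamic_sampling: bool = False


def enabled_extras(cfg: MinimalGRPOConfig) -> List[str]:
    # List the optional components that are switched on, in field order.
    return [f.name for f in fields(cfg)
            if f.name.startswith("use_") and getattr(cfg, f.name)]


baseline = MinimalGRPOConfig()
# An ablation changes exactly one flag relative to the baseline.
with_kl = replace(baseline, use_kl_penalty=True)
```

Keeping the baseline frozen and deriving variants with `replace` makes each experiment differ from the minimal setup by a single, visible change.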

Topics

Reinforcement Learning
Large Language Models
Math Reasoning
GRPO
Model Scaling

Code references

thunlp/JustRL

Best for: Research Scientist, AI Researcher, Machine Learning Engineer, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by The Salt - Curated AI.