JustRL: A Simple RL Recipe Just Works, No Tricks/Patches Needed
Summary
JustRL introduces a simplified Reinforcement Learning (RL) recipe for training 1.5B-parameter Large Language Models (LLMs) on mathematical reasoning. It applies a single-stage GRPO run with fixed hyperparameters, a basic rule-based verifier, and a flat 16k context window, and it intentionally omits complex elements such as KL terms, entropy regularization, dynamic sampling, and multi-stage curricula. Applied to DeepSeek-R1-Distill-Qwen-1.5B and OpenMath-Nemotron-1.5B, this minimal approach lets JustRL-DeepSeek-1.5B slightly outperform ProRL-V2 on nine math benchmarks while using approximately half the token budget. Similarly, JustRL-Nemotron surpasses QuestA with roughly 2 to 2.5 times less compute, demonstrating that a "barebones" RL setup can achieve competitive results without architectural changes or extra supervision.
Key takeaway
Research scientists optimizing LLM performance should critically evaluate whether complex RL techniques are actually necessary. JustRL demonstrates that a simplified, single-stage GRPO approach with fixed hyperparameters can achieve competitive or superior results with significantly less computational overhead, suggesting that many "fixes" for RL pathologies may be unnecessary when starting from a clean baseline.
Key insights
A simplified RL recipe can match or slightly outperform more complex methods for training 1.5B LLMs on math reasoning, while using roughly half the compute.
Principles
- Simplicity in RL can yield competitive or better performance at lower cost.
- Avoid cargo-culting complex RL "best practices".
Method
JustRL uses a single-stage GRPO run with fixed hyperparameters: 8 rollouts per prompt, a batch size of 256, a constant learning rate of 1e-6, a maximum response length of 15k tokens, and the "clip higher" trick (a raised upper clipping bound on the policy ratio).
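The method above can be sketched in a few lines. This is a minimal illustration, assuming the standard GRPO formulation (rewards normalized within each prompt's group of rollouts) and a PPO-style clipped surrogate with asymmetric bounds. The names `JustRLConfig`, `group_advantages`, and `clipped_token_objective` are hypothetical, and the clip bounds `eps_low` and `eps_high` are illustrative assumptions, since this summary does not state them.

```python
import math
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class JustRLConfig:
    # Values stated in the summary above.
    rollouts_per_prompt: int = 8
    batch_size: int = 256
    learning_rate: float = 1e-6         # constant, no schedule
    max_response_length: int = 15_000   # "15k" read as 15,000 tokens (assumption)
    # Assumed for illustration only; the summary gives no clip bounds.
    eps_low: float = 0.2
    eps_high: float = 0.28


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each reward within its prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_token_objective(logp_new, logp_old, advantage, cfg):
    """Clipped surrogate for one token with an asymmetric ("clip higher") range."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - cfg.eps_low), 1.0 + cfg.eps_high)
    # No KL penalty or entropy bonus is added, matching the recipe's omissions.
    return min(ratio * advantage, clipped * advantage)


cfg = JustRLConfig()
# One prompt, 8 rollouts, binary reward from a rule-based verifier.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
advs = group_advantages(rewards)
```

Correct rollouts receive positive advantages and incorrect ones negative, and the advantages within a group sum to zero. Raising only the upper bound lets low-probability tokens with positive advantage grow more before clipping applies, while the lower bound stays unchanged.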
In practice
- Implement single-stage GRPO for LLM fine-tuning.
- Test minimal RL setups before adding complexity.
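A minimal setup also needs a reward signal. The sketch below shows one way a basic rule-based verifier for math answers might look: it extracts the last `\boxed{...}` expression from a response and compares it to the reference after light normalization. This is a hypothetical illustration, not the verifier used by JustRL, whose details this summary does not describe; the function names and the normalization rules are assumptions.

```python
import re


def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1
    out = []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out)
        out.append(ch)
        i += 1
    return None  # unbalanced braces: treat as no answer


def normalize(answer):
    """Strip all whitespace and any trailing period before comparing."""
    return re.sub(r"\s+", "", answer).rstrip(".")


def reward(response, reference):
    """Binary reward: 1.0 for an exact match after normalization, else 0.0."""
    pred = extract_boxed(response)
    if pred is None:
        return 0.0
    return 1.0 if normalize(pred) == normalize(reference) else 0.0
```

For example, `reward("so the answer is \\boxed{ 42 }", "42")` yields `1.0`, while a response with no boxed answer yields `0.0`. Exact string matching is deliberately crude: it will reject equivalent forms such as `0.5` versus `\frac{1}{2}`.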
Topics
- Reinforcement Learning
- Large Language Models
- Math Reasoning
- GRPO
- Model Scaling
Best for: Research Scientist, AI Researcher, Machine Learning Engineer, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by The Salt - Curated AI.