JustRL: A Simple RL Recipe Just Works, No Tricks/Patches Needed

· Source: The Salt - Curated AI · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Advanced, quick

Summary

JustRL introduces a simplified reinforcement learning (RL) recipe for scaling 1.5B-parameter large language models (LLMs) in mathematical reasoning. It uses a single-stage GRPO run with fixed hyperparameters, a basic rule-based verifier, and a flat 16k context window, and it intentionally omits KL terms, entropy regularization, dynamic sampling, and multi-stage curricula. Applied to DeepSeek-R1-Distill-Qwen-1.5B and OpenMath-Nemotron-1.5B, the recipe yields JustRL-DeepSeek-1.5B, which slightly outperforms ProRL-V2 on nine math benchmarks using approximately half the token budget, and JustRL-Nemotron, which surpasses QuestA with 2-2.5x less compute. Both results indicate that a "barebones" RL setup can be competitive without architectural changes or extra supervision.
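To make the "basic rule-based verifier" concrete, here is a minimal sketch of what such a reward function might look like. It is an illustration under assumptions, not the verifier used in JustRL: the summary does not describe the matching rules, so the `\boxed{...}` answer convention, the string normalization, and the function names `extract_boxed`, `normalize`, and `rule_based_reward` are all hypothetical.

```python
def extract_boxed(text):
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    # Walk forward from the opening brace, tracking nesting depth so that
    # answers such as \frac{1}{2} are captured whole.
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # braces never balanced


def normalize(answer):
    """Assumed normalization: drop spaces and a trailing period, lowercase."""
    return answer.replace(" ", "").rstrip(".").lower()


def rule_based_reward(response, reference):
    """Binary reward: 1.0 if the final boxed answer matches the reference, else 0.0."""
    predicted = extract_boxed(response)
    if predicted is None:
        return 0.0
    return 1.0 if normalize(predicted) == normalize(reference) else 0.0
```

A verifier of this kind needs no learned reward model or symbolic math engine: a response earns reward only when its final boxed answer matches the reference after light normalization.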

Key takeaway

Research scientists optimizing LLM performance should critically evaluate whether complex RL techniques are necessary. JustRL demonstrates that a simplified, single-stage GRPO approach with fixed hyperparameters can achieve competitive or superior results with significantly less computational overhead, suggesting that many "fixes" for RL pathologies may be unnecessary once a clean baseline is in place.

Key insights

A simplified RL approach can outperform complex methods for scaling LLMs in math reasoning.

Principles

Method

JustRL uses a single-stage GRPO run with fixed hyperparameters: 8 rollouts per prompt, a batch size of 256, a constant learning rate of 1e-6, a maximum response length of 15k tokens, and a "clip higher" trick.
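The core of such a recipe can be sketched in a few lines of NumPy. This is a hedged illustration rather than the JustRL implementation: the constants for rollouts, batch size, learning rate, and response length come from the description above, while the clip bounds `CLIP_LOW` and `CLIP_HIGH` are assumed values for an asymmetric "clip higher" range, and the function names are hypothetical.

```python
import numpy as np

# Hyperparameters quoted in the method description.
ROLLOUTS_PER_PROMPT = 8
BATCH_SIZE = 256
LEARNING_RATE = 1e-6          # held constant, no schedule
MAX_RESPONSE_TOKENS = 15_000

# Assumed values (not given in this summary): "clip higher" is taken here to
# mean an upper clip bound that is looser than the lower one.
CLIP_LOW = 0.2
CLIP_HIGH = 0.28


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize rewards within one prompt's rollouts."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def clipped_objective(ratio, advantage, clip_low=CLIP_LOW, clip_high=CLIP_HIGH):
    """PPO-style clipped surrogate with separate lower and upper bounds.

    No KL penalty and no entropy bonus are added, matching the omissions
    listed in the summary.
    """
    ratio = np.asarray(ratio, dtype=np.float64)
    advantage = np.asarray(advantage, dtype=np.float64)
    clipped = np.clip(ratio, 1.0 - clip_low, 1.0 + clip_high)
    return np.minimum(ratio * advantage, clipped * advantage)
```

With binary rewards from a rule-based verifier, a group in which every rollout succeeds or every rollout fails produces all-zero advantages, so that prompt contributes no gradient. Dynamic sampling is one way to filter such groups out; the recipe described here leaves them in.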

Best for: Research Scientist, AI Researcher, Machine Learning Engineer, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by The Salt - Curated AI.