ExpRL: Exploratory RL for LLM Mid-Training

2026-06-15 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

ExpRL is an exploratory reinforcement learning (RL) method for large language model (LLM) mid-training, designed to overcome two limitations of existing approaches: sparse-reward RL and manual skill specification. Rather than imitating reference solutions, ExpRL uses human-written question-answer data as "reward scaffolds." During training, the policy samples reasoning traces from problem prompts, and an LLM judge evaluates each sampled trace against a reference solution that stays hidden from the policy. The judge assigns dense, problem-specific rewards at both the outcome and process levels, so training can reinforce partial progress, useful intermediate reductions, and productive reasoning behaviors. On challenging math reasoning tasks, ExpRL is reported to prime models for RL more effectively than supervised fine-tuning (SFT), sparse-reward GRPO, and self-distillation. It also provides a more effective initialization for subsequent sparse-reward RL and is shown to apply in mixed-domain settings.

Key takeaway

Machine learning engineers developing LLMs for complex reasoning tasks should consider adding an ExpRL-style stage to their mid-training pipeline. In this setup, an LLM judge scores on-policy reasoning traces against hidden reference solutions and returns dense, process-level rewards. The source reports that this primes models better than SFT or sparse-reward methods on challenging math problems: it reinforces productive reasoning behaviors and partial progress, which gives subsequent sparse-reward RL a stronger initialization.

Key insights

ExpRL uses LLM judges and hidden reference solutions to provide dense, process-level rewards for LLM mid-training.

Principles

Sparse-reward RL depends on base model coverage: it can only reinforce solutions the base model already samples.
Manual skill specification limits LLM mid-training.
Dense rewards reinforce partial progress.
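
The first and third principles can be illustrated with a small numeric sketch. It assumes GRPO-style group-normalized advantages, (r - mean) / (std + eps), and uses made-up judge scores; the function name and the numbers are illustrative and do not come from the source.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages in the style of GRPO: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hard problem: none of the four sampled traces reaches the correct final answer,
# so a sparse 0/1 outcome reward is identical for every trace.
sparse_rewards = [0.0, 0.0, 0.0, 0.0]

# Hypothetical dense judge scores for the same four traces (illustrative values),
# giving credit for partial progress.
dense_rewards = [0.1, 0.4, 0.0, 0.7]

sparse_adv = group_advantages(sparse_rewards)  # every advantage is zero: no learning signal
dense_adv = group_advantages(dense_rewards)    # traces are ranked by partial progress

print("sparse:", sparse_adv)
print("dense: ", [round(a, 3) for a in dense_adv])
```

When every trace in a group fails, the sparse advantages are all zero and the update carries no information, which is why sparse-reward RL leans on what the base model can already solve. The dense scores still separate the traces, so the one with the most partial progress receives the largest advantage.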

Method

ExpRL uses an LLM judge to compare on-policy reasoning traces against hidden reference solutions drawn from human-written Q&A data, assigning dense rewards at the outcome and process levels.
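
A minimal sketch of this reward computation follows. The source does not specify the judge prompt, the scoring scale, or how outcome and process scores are combined, so everything here is an assumption: `JudgeVerdict`, `dense_reward`, and `process_weight` are hypothetical names, the weighted average is one plausible combination, and `toy_judge` is a token-overlap stand-in for a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class JudgeVerdict:
    outcome: float            # in [0, 1]: agreement of the final answer with the reference
    step_scores: List[float]  # in [0, 1]: one score per reasoning step


# A judge sees the question, the hidden reference, and the sampled trace.
Judge = Callable[[str, str, List[str]], JudgeVerdict]


def dense_reward(question: str, reference: str, trace_steps: List[str],
                 judge: Judge, process_weight: float = 0.5) -> float:
    """Blend outcome and process scores into one scalar (assumed weighting)."""
    verdict = judge(question, reference, trace_steps)
    steps = verdict.step_scores
    process = sum(steps) / len(steps) if steps else 0.0
    reward = (1.0 - process_weight) * verdict.outcome + process_weight * process
    return min(1.0, max(0.0, reward))


def toy_judge(question: str, reference: str, trace_steps: List[str]) -> JudgeVerdict:
    """Stand-in for an LLM judge: scores steps by token overlap with the reference."""
    ref_tokens = reference.lower().split()
    ref_set = set(ref_tokens)
    step_scores = []
    for step in trace_steps:
        tokens = set(step.lower().split())
        step_scores.append(len(tokens & ref_set) / len(tokens) if tokens else 0.0)
    final_tokens = trace_steps[-1].lower().split() if trace_steps else []
    # Outcome is 1.0 only if the reference's last token appears in the final step.
    outcome = 1.0 if ref_tokens and ref_tokens[-1] in final_tokens else 0.0
    return JudgeVerdict(outcome=outcome, step_scores=step_scores)


question = "Solve 2x = 8."
reference = "x = 4"  # passed to the judge only; the policy never sees it

r_right = dense_reward(question, reference, ["2x = 8", "x = 4"], toy_judge)
r_wrong = dense_reward(question, reference, ["2x = 8", "x = 5"], toy_judge)
print(r_right, r_wrong)
```

In this toy setup the trace with the wrong final answer still earns a nonzero reward for its partially correct steps, while the correct trace scores higher. The reference is an argument to the judge only, mirroring the hidden-reference setup described above.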

In practice

Improve LLM reasoning on hard problems.
Better initialize sparse-reward RL.
Extend mid-training to mixed domains.

Topics

Reinforcement Learning
LLM Mid-Training
Dense Rewards
Reasoning Tasks
LLM Judge
Reward Scaffolds

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.