ExpRL: Exploratory RL for LLM Mid-Training
Summary
ExpRL is an exploratory reinforcement learning (RL) method for large language model (LLM) mid-training, designed to overcome two limitations of existing approaches: the reliance of sparse-reward RL on what the base model can already solve, and the need to specify target skills by hand. Rather than imitating reference solutions, ExpRL uses human-written question-answer data as "reward scaffolds." During training, the policy samples reasoning traces for the problem prompts, and an LLM judge evaluates each sampled trace against a reference solution that stays hidden from the policy. The judge assigns dense, problem-specific rewards at both the outcome and process levels, so partial progress, useful intermediate reductions, and productive reasoning behaviors are reinforced even when the final answer is wrong. On challenging math reasoning tasks, ExpRL primes models for RL better than supervised fine-tuning (SFT), sparse-reward GRPO, and self-distillation. It also provides a stronger initialization for subsequent sparse-reward RL and carries over to mixed-domain settings.
Key takeaway
For machine learning engineers developing LLMs for complex reasoning tasks, consider integrating ExpRL's approach into your mid-training pipeline. Using an LLM judge to provide dense, process-level rewards against hidden reference solutions can prime a model more effectively than SFT or sparse-reward methods. The strategy reinforces productive reasoning behaviors and partial progress, which gives a stronger initialization for subsequent sparse-reward RL and better performance on challenging problems.
Key insights
ExpRL uses LLM judges and hidden reference solutions to provide dense, process-level rewards for LLM mid-training.
Principles
- Sparse reward RL depends on base model coverage.
- Manual skill specification limits LLM mid-training.
- Dense rewards reinforce partial progress.
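The first and third principles can be illustrated with a small sketch. It assumes GRPO-style group-relative advantages (each reward standardized within its sampling group); the reward values are made up for illustration and are not results from the original article. When every sampled trace fails under a binary reward, all advantages are zero and there is nothing to learn from, whereas partial-credit rewards still rank the traces.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize each reward within its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every reward in the group is identical.
    return [(r - mu) / (sigma + eps) for r in rewards]

# A hard problem: none of the four sampled traces reaches the correct answer,
# so a binary outcome reward is 0 for all of them.
sparse_rewards = [0.0, 0.0, 0.0, 0.0]

# Hypothetical partial-credit scores for the same four traces (illustrative values).
dense_rewards = [0.0, 0.2, 0.5, 0.1]

sparse_adv = group_relative_advantages(sparse_rewards)
dense_adv = group_relative_advantages(dense_rewards)

print("sparse advantages:", sparse_adv)  # all zero: no gradient signal
print("dense advantages:", [round(a, 3) for a in dense_adv])
```

With the sparse rewards, the update for this prompt vanishes, which is why sparse-reward RL can only improve on problems the base model already solves some of the time.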
Method
ExpRL uses an LLM judge to compare on-policy reasoning traces against hidden reference solutions from human Q&A data, assigning dense outcome-level or process-level rewards.
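A minimal sketch of how such a judge-based reward might be wired up is shown here. Everything in it is an assumption made for illustration: the `Reference` and `JudgeVerdict` types, the `toy_judge` function (a token-overlap stand-in so the example runs without a model), and the `process_weight` mixing rule are hypothetical and are not taken from the original article. In practice the judge would be an LLM call that sees the hidden reference, while the policy only ever sees the prompt.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List

@dataclass
class Reference:
    solution: str       # human-written solution, hidden from the policy
    final_answer: str   # final answer extracted from that solution

@dataclass
class JudgeVerdict:
    outcome_score: float       # in [0, 1]: does the trace reach the reference answer?
    step_scores: List[float]   # in [0, 1] per step: does the step make useful progress?

def toy_judge(prompt: str, steps: List[str], reference: Reference) -> JudgeVerdict:
    """Stand-in for an LLM judge: scores steps by token overlap with the reference."""
    ref_tokens = set(reference.solution.lower().split())
    step_scores = []
    for step in steps:
        tokens = set(step.lower().split())
        step_scores.append(len(tokens & ref_tokens) / len(tokens) if tokens else 0.0)
    reached = bool(steps) and reference.final_answer.lower() in steps[-1].lower()
    return JudgeVerdict(outcome_score=1.0 if reached else 0.0, step_scores=step_scores)

def dense_reward(verdict: JudgeVerdict, process_weight: float = 0.5) -> float:
    """Blend outcome-level and process-level scores (the weighting is an assumption)."""
    process = mean(verdict.step_scores) if verdict.step_scores else 0.0
    return (1.0 - process_weight) * verdict.outcome_score + process_weight * process

prompt = "Solve x^2 - 5x + 6 = 0."
reference = Reference(
    solution="factor x^2 - 5x + 6 as (x - 2)(x - 3) so x = 2 or x = 3",
    final_answer="x = 2 or x = 3",
)

# A trace that starts correctly but stalls, and a trace that finishes.
partial_trace = ["factor x^2 - 5x + 6", "I am not sure how to continue"]
full_trace = ["factor x^2 - 5x + 6 as (x - 2)(x - 3)", "so x = 2 or x = 3"]

partial_reward = dense_reward(toy_judge(prompt, partial_trace, reference))
full_reward = dense_reward(toy_judge(prompt, full_trace, reference))
print(partial_reward, full_reward)
```

The stalled trace earns a nonzero reward for its correct first step even though it never reaches the answer, which is the behavior a sparse outcome reward cannot express. Token overlap is a crude proxy and would reward superficial copying; it is used only to keep the sketch self-contained.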
In practice
- Improve LLM reasoning on hard problems.
- Better initialize sparse-reward RL.
- Extend mid-training to mixed domains.
Topics
- Reinforcement Learning
- LLM Mid-Training
- Dense Rewards
- Reasoning Tasks
- LLM Judge
- Reward Scaffolds
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.