Understanding Diversity Collapse in RLVR via the Lens of Overtraining
Summary
Reinforcement learning with verifiable rewards (RLVR) often encounters "diversity collapse," where Pass@1 improves but high-k Pass@k degrades, indicating a narrowing of the model's reasoning boundary. This phenomenon is formalized as overtraining: once a problem's contribution to the reference metric saturates, further updates concentrate probability mass on already favored trajectories rather than expanding the set of solvable problems. With few rollouts per problem, a single success can saturate high-k Pass@k, so most standard RLVR updates count as overtraining. Although RLVR is structurally biased against high-k Pass@k, a decline in that metric does not preclude new reasoning gains. Interventions such as restricting updates to problems with zero observed success can lift Pass@256 above that of the base model. The proposed Bayesian Boundary Gating (BBG) redirects optimization by estimating each problem's marginal contribution, improving average Pass@k across a wide range of k on multiple reasoning benchmarks.
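The saturation claim can be checked with a little arithmetic. The sketch below is illustrative and not taken from the article: it uses the widely used unbiased Pass@k estimator, 1 - C(n-c, k) / C(n, k), for k up to the number of rollouts n, and a simple plug-in estimate, 1 - (1 - c/n)^k, for larger k. The function names and the choice of 8 rollouts are assumptions made for the example.

```python
from math import comb


def pass_at_k_unbiased(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n rollouts with c successes (needs k <= n)."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset of the rollouts contains at least one success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_at_k_plugin(n: int, c: int, k: int) -> float:
    """Plug-in estimate 1 - (1 - c/n)^k, usable when k exceeds n."""
    if n <= 0 or not (0 <= c <= n) or k < 1:
        raise ValueError("require n > 0, 0 <= c <= n, k >= 1")
    return 1.0 - (1.0 - c / n) ** k


# Hypothetical setting: 8 rollouts per problem.
n_rollouts = 8
one_success_at_1 = pass_at_k_unbiased(n_rollouts, 1, 1)      # Pass@1 is still low
one_success_at_8 = pass_at_k_unbiased(n_rollouts, 1, 8)      # Pass@8 is already saturated
one_success_at_256 = pass_at_k_plugin(n_rollouts, 1, 256)    # so is Pass@256
all_success_at_256 = pass_at_k_plugin(n_rollouts, 8, 256)

# Extra high-k credit from going from 1 success to 8 successes out of 8.
remaining_headroom = all_success_at_256 - one_success_at_256
```

Under these assumptions, one success in eight rollouts leaves Pass@1 at 1/8 while the high-k estimates are already at or near 1, so further gains on that problem barely move high-k Pass@k.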
Key takeaway
Machine learning engineers optimizing large language models with RLVR should recognize that diversity collapse is often a symptom of overtraining, not a fundamental limit on new reasoning gains. Shift the focus from improving Pass@1 alone to expanding the model's reasoning boundary. Consider implementing Bayesian Boundary Gating (BBG), or restricting updates to problems with zero observed success, to improve average Pass@k across a wide range of k.
Key insights
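Diversity collapse in RLVR is a form of overtraining: once a problem's contribution to the metric saturates, further updates concentrate probability mass on already favored trajectories instead of expanding the set of solvable problems.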
Principles
- RLVR's diversity collapse stems from overtraining.
- Further updates on saturated problems concentrate probability mass on already favored trajectories.
- A decline in high-k Pass@k does not rule out new reasoning gains.
Method
Bayesian Boundary Gating (BBG) estimates each problem's marginal contribution to the reasoning boundary, redirecting optimization away from overtraining.
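The summary does not say how BBG computes this estimate, so the sketch below is only one possible reading and not the paper's algorithm. It assumes a Beta prior over each problem's per-sample success probability p, updates it with observed rollouts, and scores the problem by the posterior expectation of d/dp [1 - (1 - p)^k] = k (1 - p)^(k-1). The function name, the uniform Beta(1, 1) prior, and the parameter names are all hypothetical.

```python
from math import exp, lgamma


def expected_marginal_gain(n: int, c: int, k: int,
                           prior_a: float = 1.0, prior_b: float = 1.0) -> float:
    """Posterior mean of k * (1 - p)^(k - 1) under a Beta posterior for p.

    n: rollouts observed, c: successes observed, k: the k in Pass@k.
    The Beta(prior_a, prior_b) prior is an assumption of this sketch.
    """
    if not (0 <= c <= n) or k < 1:
        raise ValueError("require 0 <= c <= n and k >= 1")
    alpha = prior_a + c          # posterior pseudo-count of successes
    beta = prior_b + (n - c)     # posterior pseudo-count of failures
    m = k - 1
    # E[(1 - p)^m] for p ~ Beta(alpha, beta) equals B(alpha, beta + m) / B(alpha, beta).
    log_moment = (lgamma(beta + m) + lgamma(alpha + beta)
                  - lgamma(beta) - lgamma(alpha + beta + m))
    return k * exp(log_moment)


# Hypothetical comparison at k = 256 with 8 rollouts per problem.
gain_unsolved = expected_marginal_gain(8, 0, 256)   # no success observed yet
gain_one_hit = expected_marginal_gain(8, 1, 256)    # one success observed
gain_mastered = expected_marginal_gain(8, 8, 256)   # all rollouts succeed
```

With these assumptions the score is largest for problems with no observed success and falls quickly once a success appears, which matches the idea of steering updates away from saturated problems. A gate could then weight or threshold each problem's update by this score.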
In practice
- Restrict RLVR updates to problems with zero observed success.
- Implement BBG to optimize for reasoning boundary expansion.
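The first item above can be sketched as a simple filter. This is a minimal illustration, assuming rewards are already grouped per problem and that any positive verifier reward counts as a success; the function name, the data layout, and the problem ids are hypothetical, and the source does not specify which model's rollouts are used for the check.

```python
from typing import Dict, List


def zero_success_problems(rewards_by_problem: Dict[str, List[float]]) -> List[str]:
    """Return ids of problems where no sampled rollout earned a positive reward."""
    return [
        problem_id
        for problem_id, rewards in rewards_by_problem.items()
        if not any(r > 0 for r in rewards)
    ]


# Hypothetical verifier rewards (1.0 = verified correct) for four rollouts each.
observed = {
    "p1": [0.0, 0.0, 0.0, 0.0],
    "p2": [0.0, 1.0, 0.0, 0.0],
    "p3": [0.0, 0.0, 0.0, 0.0],
}
train_ids = zero_success_problems(observed)
```

Only the problems in `train_ids` would then be passed to the RLVR update step, so problems that already show a success are left untouched.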
Topics
- Reinforcement Learning with Verifiable Rewards
- Diversity Collapse
- Overtraining
- Large Language Models
- Bayesian Boundary Gating
- Pass@k Metric
- Reasoning Benchmarks
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.