Understanding Diversity Collapse in RLVR via the Lens of Overtraining

2026-06-13 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Reinforcement learning with verifiable rewards (RLVR) often exhibits "diversity collapse": Pass@1 improves while Pass@k at high k degrades, indicating that the model's reasoning boundary is narrowing. The article formalizes this phenomenon as overtraining. Once a problem's contribution to the reference metric saturates, further updates concentrate probability mass on already favored trajectories instead of expanding the set of solvable problems. With only a few rollouts per problem, a single success can saturate high-k Pass@k, so most standard RLVR updates count as overtraining. RLVR is therefore structurally biased against high-k Pass@k, but a decline in that metric does not preclude new reasoning gains. Interventions such as restricting updates to problems with zero observed success can lift Pass@256 above the base model. The proposed Bayesian Boundary Gating (BBG) redirects optimization by estimating each problem's marginal contribution, and it improves average Pass@k across a wide range of k on multiple reasoning benchmarks.
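The saturation claim can be checked with a small sketch. It assumes the standard unbiased Pass@k estimator computed from n rollouts with c verified successes. The summary does not say which estimator the article uses, and the rollout budget of 8 is a hypothetical value chosen for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n rollouts with c successes (requires k <= n)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n rollouts contains no success.
    return 1.0 - comb(n - c, k) / comb(n, k)

n_rollouts = 8  # hypothetical rollout budget per problem
for successes in (0, 1, 4, 8):
    p1 = pass_at_k(n_rollouts, successes, 1)
    p8 = pass_at_k(n_rollouts, successes, n_rollouts)
    print(f"successes={successes}: pass@1={p1:.3f}, pass@{n_rollouts}={p8:.3f}")
```

Under these assumptions, one success out of 8 rollouts already drives the Pass@8 estimate to its maximum, while Pass@1 still has room to grow. Any further update on that problem can only raise Pass@1.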

Key takeaway

Machine learning engineers optimizing large language models with RLVR should recognize that diversity collapse is often overtraining, not a fundamental limit on new reasoning gains. Shift the focus from improving Pass@1 alone to expanding the model's reasoning boundary. Consider implementing Bayesian Boundary Gating (BBG), or restricting updates to problems with zero observed success, to improve average Pass@k across a wider range of k.

Key insights

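Diversity collapse in RLVR is a form of overtraining: updates keep concentrating probability mass on favored trajectories for problems the model already solves, instead of expanding the set of solvable problems.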

Principles

RLVR's diversity collapse stems from overtraining.
Updates on saturated problems concentrate probability mass on already favored trajectories.
A decline in high-k Pass@k does not rule out new reasoning gains.
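
One way to read these principles is the following worked equation. It is an illustrative sketch, not the article's own formalization. It assumes each sample solves a given problem independently with probability p, a symbol introduced here for illustration.

```latex
\[
\mathrm{Pass@}k(p) = 1 - (1 - p)^{k},
\qquad
\frac{\partial\,\mathrm{Pass@}k}{\partial p} = k\,(1 - p)^{k-1}.
\]
```

For k = 1 the derivative equals 1 at every p, so raising p always improves Pass@1. For large k the derivative equals k at p = 0 and decays geometrically as p grows. Under this assumption, additional probability mass on an already solved problem contributes almost nothing to high-k Pass@k, while the same effort on an unsolved problem contributes the most.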

Method

Bayesian Boundary Gating (BBG) estimates each problem's marginal contribution to the reasoning boundary, redirecting optimization away from overtraining.
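
The summary does not describe how BBG computes its estimate, so the sketch below is only a hypothetical illustration of the general idea, not the authors' algorithm. It assumes a Beta-Bernoulli model of each problem's per-sample success rate and gates updates by the posterior expected headroom in Pass@k. The function names `expected_headroom` and `gate`, the uniform prior, and the threshold are all assumptions made for this example.

```python
from math import exp, lgamma

def expected_headroom(successes: int, failures: int, k: int,
                      alpha: float = 1.0, beta: float = 1.0) -> float:
    """Posterior mean of (1 - p)^k under a Beta(alpha, beta) prior on p.

    This is the posterior probability that k fresh samples all fail, used here
    as a stand-in for the Pass@k a problem could still gain (an assumption).
    """
    a = alpha + successes  # posterior Beta parameters
    b = beta + failures
    # E[(1 - p)^k] = B(a, b + k) / B(a, b), computed in log space for stability.
    log_ratio = lgamma(b + k) + lgamma(a + b) - lgamma(a + b + k) - lgamma(b)
    return exp(log_ratio)

def gate(stats: dict[str, tuple[int, int]], k: int, threshold: float) -> dict[str, bool]:
    """Keep a problem in the update batch only if its headroom exceeds threshold."""
    return {pid: expected_headroom(s, f, k) > threshold
            for pid, (s, f) in stats.items()}

# Hypothetical (successes, failures) counts from 8 rollouts per problem.
stats = {"unsolved": (0, 8), "barely_solved": (1, 7), "saturated": (8, 0)}
for pid, (s, f) in stats.items():
    print(pid, round(expected_headroom(s, f, k=8), 4))
print(gate(stats, k=8, threshold=0.3))
```

In this toy setup the gate routes updates toward problems whose estimated headroom is still large and away from problems whose contribution has already saturated. A soft weight in place of the hard threshold would be an equally plausible design under the same assumptions.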

In practice

Restrict RLVR updates to problems with zero observed success.
Implement BBG to optimize for reasoning boundary expansion.
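
The first recommendation above can be sketched as a simple filter over verifier outcomes. The data layout and the function name `zero_success_problems` are hypothetical. The summary does not specify how the rollouts for this check are produced, so they are taken as given.

```python
def zero_success_problems(outcomes: dict[str, list[int]]) -> list[str]:
    """Return ids of problems whose observed rollouts contain no verified success."""
    # Problems with no rollouts at all are skipped: nothing has been observed yet.
    return [pid for pid, rewards in outcomes.items()
            if rewards and sum(rewards) == 0]

# Hypothetical binary verifier outcomes (1 = correct) for a few rollouts per problem.
outcomes = {"p1": [0, 0, 0, 0], "p2": [0, 1, 0, 0], "p3": [1, 1, 1, 1]}
train_ids = zero_success_problems(outcomes)
print(train_ids)
```

Only the selected problems would then contribute to the RLVR update. Everything else is treated as already inside the reasoning boundary.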

Topics

Reinforcement Learning with Verifiable Rewards
Diversity Collapse
Overtraining
Large Language Models
Bayesian Boundary Gating
Pass@k Metric
Reasoning Benchmarks

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.