Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity
Summary
Reinforcement Learning with Verifiable Rewards (RLVR) significantly improves single-attempt accuracy (Pass@1) in large language models (LLMs) for reasoning tasks but often suffers from diversity collapse, reducing multi-sample coverage (Pass@K). This degradation occurs because common RLVR objectives, like GRPO, are indifferent to how probability mass is distributed among correct solutions. This indifference, combined with stochastic training dynamics, leads to a self-reinforcing collapse where probability concentrates on a narrow subset of correct outputs, suppressing alternative valid solutions. Researchers from Purdue University propose Uniform-Correct Policy Optimization (UCPO), a modification to GRPO that adds a conditional uniformity penalty. UCPO redistributes gradient signals towards underrepresented correct responses, encouraging uniform probability allocation within the correct set. Across three LLMs (1.5B–7B parameters) and five mathematical reasoning benchmarks, UCPO improved Pass@K and diversity while maintaining competitive Pass@1, achieving up to a +10% absolute gain on AIME24 at Pass@64 and up to 45% higher equation-level diversity.
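To make the Pass@1 versus Pass@K distinction concrete, here is a minimal sketch of the widely used unbiased Pass@K estimator, computed from n sampled responses of which c are correct. That the paper evaluates with exactly this estimator is an assumption, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@K from n sampled responses, c of which are correct.

    Assumption: this is the standard combinatorial estimator; the summary
    does not state which estimator the paper uses.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that k responses drawn
    # without replacement from the n samples are all incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


# One problem, 64 samples, 4 of them correct.
print(pass_at_k(64, 4, 1), pass_at_k(64, 4, 16))
```

With k = 1 the estimate reduces to c / n, which is why a policy can score well on Pass@1 while covering few distinct solutions across many samples.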
Key takeaway
If you develop or fine-tune LLMs for reasoning tasks, consider integrating Uniform-Correct Policy Optimization (UCPO) into your RLVR pipeline. UCPO counteracts diversity collapse by encouraging the policy to spread probability mass uniformly across correct solutions, which improves multi-sample coverage (Pass@K) while keeping single-attempt accuracy (Pass@1) competitive. The result is a model that generates a wider range of valid reasoning paths.
Key insights
RLVR's diversity collapse stems from objective indifference and on-policy sampling, which UCPO addresses by promoting uniform probability across correct solutions.
Principles
- RLVR objectives are indifferent to how probability is distributed among correct solutions.
- On-policy sampling amplifies initial probability asymmetries.
- The Uniform-Correct Policy is optimal for robustness and for entropy-regularized objectives.
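The first principle can be illustrated on a toy problem. The sketch below assumes a policy over four candidate responses, three of them correct, and a binary verifiable reward; the probabilities and the helper names (`expected_reward`, `conditional_entropy`) are hypothetical and chosen only for illustration.

```python
import numpy as np


def expected_reward(p, correct):
    """Expected binary reward: total probability on correct responses."""
    return float(np.sum(p[correct]))


def conditional_entropy(p, correct):
    """Entropy (in nats) of the policy renormalized over the correct set."""
    q = p[correct] / np.sum(p[correct])
    q = q[q > 0]  # drop zero-probability entries to avoid log(0)
    return float(-np.sum(q * np.log(q)))


# Toy setup (assumed): responses 0-2 are correct, response 3 is incorrect.
correct = np.array([True, True, True, False])
collapsed = np.array([0.88, 0.01, 0.01, 0.10])        # mass on one solution
uniform_correct = np.array([0.30, 0.30, 0.30, 0.10])  # mass spread evenly

for name, p in [("collapsed", collapsed), ("uniform-correct", uniform_correct)]:
    print(name, expected_reward(p, correct), conditional_entropy(p, correct))
```

Both policies place the same total probability on correct responses, so a reward-only objective cannot tell them apart, even though their within-correct entropy differs sharply.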
Method
UCPO modifies GRPO by adding a conditional uniformity penalty, expressed as a KL divergence, that redistributes the gradient signal: it amplifies underrepresented correct solutions and tempers dominant ones. With this penalty, the Uniform-Correct Policy becomes the unique optimum of the objective.
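The effect of such a penalty can be sketched on a toy softmax policy. This is not the paper's implementation: it replaces GRPO's sampled, group-normalized updates with exact gradient ascent on expected reward, and the KL direction (conditional policy against the uniform distribution), the weight `beta`, and all function names are assumptions made for illustration.

```python
import numpy as np


def softmax(logits):
    z = logits - np.max(logits)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


def uniformity_penalty(p, correct):
    """KL(q || U), where q is the policy renormalized over correct responses.

    Assumption: this KL direction is chosen for the sketch; the summary
    only says the penalty is a KL divergence.
    """
    q = p[correct] / p[correct].sum()
    return float(np.sum(q * np.log(q * q.size)))


def objective(logits, correct, beta):
    """Expected binary reward minus beta times the uniformity penalty."""
    p = softmax(logits)
    return float(p[correct].sum()) - beta * uniformity_penalty(p, correct)


def numerical_grad(f, x, eps=1e-5):
    """Central-difference gradient; adequate for a four-parameter toy."""
    g = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        g[i] = (f(x + step) - f(x - step)) / (2 * eps)
    return g


def train(logits, correct, beta, lr=0.5, steps=2000):
    """Plain gradient ascent on the toy objective; returns the final policy."""
    logits = np.array(logits, dtype=float)
    for _ in range(steps):
        logits += lr * numerical_grad(lambda x: objective(x, correct, beta), logits)
    return softmax(logits)


# Toy setup (assumed): three correct responses, one incorrect, skewed start.
correct = np.array([True, True, True, False])
start = np.array([3.0, 0.0, 0.0, 0.0])
for beta in (0.0, 0.5):
    p = train(start, correct, beta)
    q = p[correct] / p[correct].sum()
    print(f"beta={beta}: P(correct)={p[correct].sum():.3f}, within-correct={np.round(q, 3)}")
```

In this toy, the reward-only run (`beta=0.0`) keeps the initially dominant solution dominant, while the penalized run moves the within-correct distribution toward uniform and still drives total correct probability toward 1.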
In practice
- Implement UCPO to improve Pass@K in LLM reasoning tasks.
- Use conditional uniformity penalties to prevent diversity collapse.
- Prioritize uniform distribution across correct solutions for robust performance.
Topics
- Reinforcement Learning with Verifiable Rewards
- Diversity Collapse
- Uniform-Correct Policy Optimization
- Group Relative Policy Optimization
- Large Language Models
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.