Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity

2026-05-04 · Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

Reinforcement Learning with Verifiable Rewards (RLVR) significantly improves single-attempt accuracy (Pass@1) in large language models (LLMs) for reasoning tasks but often suffers from diversity collapse, reducing multi-sample coverage (Pass@K). This degradation occurs because common RLVR objectives, like GRPO, are indifferent to how probability mass is distributed among correct solutions. This indifference, combined with stochastic training dynamics, leads to a self-reinforcing collapse where probability concentrates on a narrow subset of correct outputs, suppressing alternative valid solutions. Researchers from Purdue University propose Uniform-Correct Policy Optimization (UCPO), a modification to GRPO that adds a conditional uniformity penalty. UCPO redistributes gradient signals towards underrepresented correct responses, encouraging uniform probability allocation within the correct set. Across three LLMs (1.5B–7B parameters) and five mathematical reasoning benchmarks, UCPO improved Pass@K and diversity while maintaining competitive Pass@1, achieving up to a +10% absolute gain on AIME24 at Pass@64 and up to 45% higher equation-level diversity.

Key takeaway

Research scientists developing or fine-tuning LLMs for reasoning tasks should consider integrating Uniform-Correct Policy Optimization (UCPO) into their RLVR pipelines. The approach counteracts diversity collapse by encouraging probability mass to spread uniformly across correct solutions. In the reported experiments this improved multi-sample coverage (Pass@K) while keeping single-attempt accuracy (Pass@1) competitive. The expected payoff is a model that generates a wider array of valid reasoning paths.

Key insights

RLVR's diversity collapse stems from objective indifference and on-policy sampling, which UCPO addresses by promoting uniform probability across correct solutions.

Principles

RLVR objectives are indifferent to how probability is distributed within the set of correct solutions.
On-policy sampling amplifies initial probability asymmetries.
Uniform-Correct Policy is optimal for robustness and entropy-regularized objectives.
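
The indifference principle can be illustrated with a toy example. The sketch below assumes a single prompt, a categorical policy over four hypothetical outputs, and a binary verifier reward; none of these names or numbers come from the paper. Both policies put the same total mass on correct outputs, so they earn the same expected reward, yet the collapsed one surfaces far fewer distinct correct solutions in a batch of samples.

```python
def expected_reward(policy, correct):
    """Expected binary reward: total probability mass on correct outputs."""
    return sum(p for y, p in policy.items() if y in correct)


def expected_distinct_correct(policy, correct, k):
    """Expected number of distinct correct outputs seen in k i.i.d. samples."""
    return sum(1.0 - (1.0 - policy.get(y, 0.0)) ** k for y in correct)


# Hypothetical outputs: "a", "b", "c" pass the verifier, "wrong" does not.
correct = {"a", "b", "c"}
spread = {"a": 0.2, "b": 0.2, "c": 0.2, "wrong": 0.4}
collapsed = {"a": 0.58, "b": 0.01, "c": 0.01, "wrong": 0.4}

for name, policy in [("spread", spread), ("collapsed", collapsed)]:
    reward = expected_reward(policy, correct)
    distinct = expected_distinct_correct(policy, correct, k=8)
    print(f"{name}: expected reward={reward:.3f}, distinct correct in 8 samples={distinct:.3f}")
```

Because a reward-only objective scores both policies identically, nothing in it pushes training back from the collapsed policy toward the spread one.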

Method

UCPO modifies GRPO by adding a conditional uniformity penalty, expressed as a KL divergence, that redistributes gradient signal: underrepresented correct solutions are amplified and dominant ones are tempered. With this penalty, the Uniform-Correct Policy becomes the unique optimum.
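
A conditional uniformity penalty of this kind can be sketched on a toy categorical policy. This summary does not give the exact form of UCPO's penalty or the direction of its KL divergence, so the code below is an assumption rather than the authors' formulation: it uses KL(π(·|correct) ‖ Uniform over correct outputs), a hypothetical weight `beta`, and plain expected reward as a stand-in for the GRPO objective.

```python
import math


def conditional_uniformity_penalty(probs, is_correct):
    """KL(pi(. | correct) || uniform over correct outputs); 0 when uniform.

    Assumed form for illustration only, not taken from the paper.
    """
    mass = sum(p for p, ok in zip(probs, is_correct) if ok)
    if mass <= 0.0:
        raise ValueError("policy places no probability on correct outputs")
    # Renormalise the policy onto the correct set.
    cond = [p / mass for p, ok in zip(probs, is_correct) if ok]
    m = len(cond)
    # KL(q || u) = sum_i q_i * log(q_i / (1/m)) = sum_i q_i * log(q_i * m)
    return sum(q * math.log(q * m) for q in cond if q > 0.0)


def penalized_objective(probs, is_correct, beta=0.1):
    """Expected reward minus beta times the penalty (beta is hypothetical)."""
    reward = sum(p for p, ok in zip(probs, is_correct) if ok)
    return reward - beta * conditional_uniformity_penalty(probs, is_correct)


is_correct = [True, True, True, False]
spread = [0.2, 0.2, 0.2, 0.4]
collapsed = [0.58, 0.01, 0.01, 0.4]

print(penalized_objective(spread, is_correct))
print(penalized_objective(collapsed, is_correct))
```

Under these assumptions the two policies no longer tie: the penalty vanishes only when the correct outputs share their mass equally, so among policies with the same expected reward the uniform one scores highest.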

In practice

Implement UCPO to improve Pass@K in LLM reasoning tasks.
Use conditional uniformity penalties to prevent diversity collapse.
Prioritize uniform distribution across correct solutions for robust performance.
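
To check whether a change actually improves Pass@K, the metric has to be estimated from samples. This summary does not say which estimator the authors used; the sketch below assumes the widely used unbiased estimator, which draws n samples per problem, counts the c correct ones, and computes 1 - C(n-c, k) / C(n, k). The function name is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimate from n samples, c of which are correct."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    # Probability that a random size-k subset contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 4 samples for one problem, 1 of them correct.
print(pass_at_k(n=4, c=1, k=1))
print(pass_at_k(n=4, c=1, k=2))
```

The benchmark-level figure is the mean of this per-problem estimate over all problems.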

Topics

Reinforcement Learning with Verifiable Rewards
Diversity Collapse
Uniform-Correct Policy Optimization
Group Relative Policy Optimization
Large Language Models

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.