Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity

Source: stat.ML updates on arXiv.org · Field: Technology & Digital, Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

Reinforcement Learning with Verifiable Rewards (RLVR) significantly improves single-attempt accuracy (Pass@1) in large language models (LLMs) on reasoning tasks, but it often suffers from diversity collapse, which reduces multi-sample coverage (Pass@K). The degradation occurs because common RLVR objectives, such as GRPO, are indifferent to how probability mass is distributed among correct solutions. Combined with stochastic training dynamics, this indifference produces a self-reinforcing collapse: probability concentrates on a narrow subset of correct outputs and alternative valid solutions are suppressed. Researchers from Purdue University propose Uniform-Correct Policy Optimization (UCPO), a modification to GRPO that adds a conditional uniformity penalty. UCPO redistributes gradient signal towards underrepresented correct responses, encouraging uniform probability allocation within the correct set. Across three LLMs (1.5B–7B parameters) and five mathematical reasoning benchmarks, UCPO improved Pass@K and diversity while maintaining competitive Pass@1, with gains of up to 10 percentage points on AIME24 at Pass@64 and up to 45% higher equation-level diversity.
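To make the Pass@1 versus Pass@K distinction concrete, here is a minimal sketch of the unbiased pass@k estimator that is widely used in LLM evaluation. The summary does not say which estimator the paper uses, so treating it as this one is an assumption, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which are correct.

    Computes 1 - C(n - c, k) / C(n, k): the probability that a random
    size-k subset of the n samples contains at least one correct sample.
    Assumed estimator; the source does not specify the one used.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # in which case every size-k subset contains a correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # With 64 samples and 4 correct, single-attempt accuracy is low
    # while coverage at larger k is much higher.
    for k in (1, 8, 64):
        print(k, round(pass_at_k(64, 4, k), 4))
```

Benchmark-level Pass@K is then the average of this quantity over problems.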

Key takeaway

Research scientists developing or fine-tuning LLMs for reasoning tasks should consider integrating Uniform-Correct Policy Optimization (UCPO) into their RLVR pipelines. The method counteracts diversity collapse by encouraging probability mass to spread uniformly across correct solutions. In the reported experiments this improved multi-sample coverage (Pass@K) while keeping single-attempt accuracy (Pass@1) competitive. The expected payoff is a model that generates a wider range of valid reasoning paths.

Key insights

RLVR's diversity collapse stems from objective indifference and on-policy sampling, which UCPO addresses by promoting uniform probability across correct solutions.
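The objective indifference can be illustrated with a toy example. The sketch below assumes a binary verifiable reward (1 for a correct response, 0 otherwise) and a hypothetical four-response policy. Under that assumption the expected reward equals the total mass on the correct set, so a spread-out policy and a collapsed one score identically. All names and numbers are made up for illustration.

```python
import math


def expected_reward(policy: dict, correct: set) -> float:
    """Expected binary reward: the total probability on correct responses."""
    return sum(p for y, p in policy.items() if y in correct)


def conditional_entropy(policy: dict, correct: set) -> float:
    """Entropy of the policy renormalized over the correct responses."""
    mass = expected_reward(policy, correct)
    probs = [p / mass for y, p in policy.items() if y in correct and p > 0]
    return -sum(q * math.log(q) for q in probs)


# Hypothetical policies over three correct responses and one wrong one.
correct = {"a", "b", "c"}
spread = {"a": 0.30, "b": 0.30, "c": 0.30, "wrong": 0.10}
collapsed = {"a": 0.88, "b": 0.01, "c": 0.01, "wrong": 0.10}

if __name__ == "__main__":
    for name, policy in (("spread", spread), ("collapsed", collapsed)):
        print(name,
              round(expected_reward(policy, correct), 3),
              round(conditional_entropy(policy, correct), 3))
```

Both policies earn the same expected reward, yet the collapsed one has far lower entropy within the correct set. A reward-only objective has no term that prefers one over the other.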

Method

UCPO modifies GRPO by adding a conditional uniformity penalty, expressed as a KL divergence. The penalty redistributes gradient signal, amplifying underrepresented correct solutions and tempering dominant ones, so that the Uniform-Correct Policy (uniform probability over the correct set) is the unique optimum.
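The following is a minimal sketch of how a conditional uniformity penalty of this kind could behave. It is not the paper's implementation. It makes three assumptions: the penalty is KL(q || uniform), where q is the policy renormalized over the correct responses in a sampled group; the direction of the KL is as written; and the responses are distinct. The function name and inputs are hypothetical, and how the penalty is weighted against the GRPO objective is not covered here.

```python
import math


def conditional_uniformity_penalty(logps, is_correct):
    """Return (penalty, grads) for KL(q || uniform) over correct responses.

    logps[i]      : sequence log-probability of distinct response i (assumed input)
    is_correct[i] : verifier outcome for response i
    grads[i]      : d penalty / d logps[i]; zero for incorrect responses
    """
    idx = [i for i, ok in enumerate(is_correct) if ok]
    grads = [0.0] * len(logps)
    m = len(idx)
    if m < 2:
        # With zero or one correct response there is nothing to balance.
        return 0.0, grads

    # Log-softmax over the correct set gives log q_i, computed stably.
    mx = max(logps[i] for i in idx)
    log_z = mx + math.log(sum(math.exp(logps[i] - mx) for i in idx))
    log_q = [logps[i] - log_z for i in idx]
    q = [math.exp(v) for v in log_q]

    neg_entropy = sum(qi * lqi for qi, lqi in zip(q, log_q))
    penalty = math.log(m) + neg_entropy  # KL(q || uniform) = log m - H(q)

    # d/d logp_i of sum_j q_j log q_j equals q_i * (log q_i - sum_j q_j log q_j).
    for i, qi, lqi in zip(idx, q, log_q):
        grads[i] = qi * (lqi - neg_entropy)
    return penalty, grads


if __name__ == "__main__":
    logps = [math.log(0.6), math.log(0.2), math.log(0.1), math.log(0.1)]
    is_correct = [True, True, False, True]
    penalty, grads = conditional_uniformity_penalty(logps, is_correct)
    print(round(penalty, 4), [round(g, 4) for g in grads])
```

Minimizing this penalty by gradient descent lowers the log-probability of the dominant correct response (positive gradient) and raises that of the rarer correct ones (negative gradient). Incorrect responses are untouched. The penalty is zero exactly when the correct responses are equally likely, which matches the uniform-over-correct optimum described above.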

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.