On Advantage Estimates for Max@K Policy Gradients

2026-06-04 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

A new study addresses reinforcement learning for post-training reasoning models, where sparse rewards hinder exploration. It focuses on optimizing inference-time objectives such as pass@K and max@K. Existing policy-gradient estimators for these objectives use different reward signals and baselines, which makes them hard to compare. The authors analyze the advantage estimator of a leading method and show that it is policy-gradient unbiased but non-centered, meaning the realized advantages in a batch do not sum to zero. To resolve this, they introduce a Leave-Two-Out (L2O) baseline, which preserves policy-gradient unbiasedness while making the realized batch advantages exactly centered. The resulting method, MaxPO, has an efficient quadratic-time implementation and integrates into group-based reinforcement learning for large language model post-training. In the reported experiments, the L2O baseline reduces gradient variance and outperforms non-centered alternatives.
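
For readers less familiar with the objective, the following minimal sketch shows how max@K can be estimated without bias from a group of n sampled rewards. It uses the standard order-statistic identity (each size-K subset contributes its maximum) and is not taken from the study; the function names are hypothetical. With binary rewards the same quantity is the familiar pass@K estimate.

```python
from itertools import combinations
from math import comb


def max_at_k_estimate(rewards, k):
    """Unbiased estimate of E[max of k i.i.d. rewards] from n >= k samples."""
    n = len(rewards)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    ordered = sorted(rewards)
    # The reward at 0-based rank i is the maximum of exactly C(i, k - 1)
    # of the C(n, k) size-k subsets (ties are broken by sort position).
    total = sum(comb(i, k - 1) * r for i, r in enumerate(ordered))
    return total / comb(n, k)


def max_at_k_brute_force(rewards, k):
    """Same quantity by direct enumeration; only practical for small n."""
    subsets = list(combinations(rewards, k))
    return sum(max(s) for s in subsets) / len(subsets)


rewards = [0.0, 0.0, 1.0, 0.5]
closed_form = max_at_k_estimate(rewards, 2)
brute_force = max_at_k_brute_force(rewards, 2)
```

Here both functions agree on the same group of four rewards; the closed form only needs a sort, while the enumeration is kept as a cross-check.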

Key takeaway

Machine learning engineers who optimize large language models with reinforcement learning, especially toward max@K objectives, should consider MaxPO. Its Leave-Two-Out baseline yields exactly centered advantages and, in the reported experiments, lower gradient variance. This can make post-training of reasoning models more stable and efficient, and may improve model performance and convergence.

Key insights

A Leave-Two-Out baseline improves policy gradient estimation for max@K objectives by centering advantages.

Principles

Policy-gradient unbiasedness can be maintained with centered advantages.
Centered advantages reduce gradient variance in policy gradients.
Unified advantage estimators clarify relationships among methods.

Method

Introduce a Leave-Two-Out (L2O) baseline to ensure realized batch advantages are exactly centered while preserving policy-gradient unbiasedness, leading to the MaxPO method.
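
To illustrate why an unbiased baseline is not automatically centered, here is a small brute-force sketch. It is an assumption made for illustration only: a generic leave-one-out baseline for max@K, not the estimator the study analyzes and not the L2O baseline, whose formula this summary does not give. Each sample's baseline is computed from the other members of a subset, so it does not depend on that sample (which is what keeps the policy gradient unbiased), yet the advantages need not sum to zero.

```python
from itertools import combinations
from math import comb


def loo_max_at_k_advantages(rewards, k):
    """Per-sample max@K advantages with a leave-one-out baseline (illustrative)."""
    n = len(rewards)
    if not 2 <= k <= n:
        raise ValueError("need 2 <= k <= n")
    advantages = [0.0] * n
    for subset in combinations(range(n), k):
        best = max(rewards[j] for j in subset)
        for i in subset:
            # The baseline for sample i uses only the other subset members,
            # so it is independent of sample i's own reward.
            baseline = max(rewards[j] for j in subset if j != i)
            advantages[i] += (best - baseline) / comb(n, k)
    return advantages


advantages = loo_max_at_k_advantages([0.0, 0.0, 1.0], k=2)
batch_sum = sum(advantages)  # nonzero here: the batch is not centered
```

In this toy group only the single successful sample receives credit, so the advantages sum to a positive value rather than zero. This is the kind of gap that an exactly centered baseline is meant to close.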

In practice

Integrate MaxPO into group-based RL pipelines for LLM post-training.
Use the L2O baseline to reduce gradient variance when optimizing max@K.
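
When wiring any advantage estimator into a group-based pipeline, a quick centering check can be useful. The sketch below is a hypothetical diagnostic, not part of MaxPO: it uses the well-known leave-one-out mean baseline for the ordinary mean-reward objective as a reference case that is exactly centered, and reports the largest per-group mean advantage in a batch. All function names are assumptions.

```python
def leave_one_out_mean_advantages(group_rewards):
    """Mean-reward advantages: each reward minus the mean of the others."""
    n = len(group_rewards)
    if n < 2:
        raise ValueError("need at least two samples per group")
    total = sum(group_rewards)
    return [r - (total - r) / (n - 1) for r in group_rewards]


def max_abs_group_mean(batch_advantages):
    """Largest absolute per-group mean; near zero means the batch is centered."""
    return max(abs(sum(group) / len(group)) for group in batch_advantages)


# Two prompts, four sampled completions each (toy rewards).
batch_rewards = [[0.0, 0.0, 1.0, 0.5], [1.0, 1.0, 0.0, 0.0]]
batch_advantages = [leave_one_out_mean_advantages(g) for g in batch_rewards]
centering_error = max_abs_group_mean(batch_advantages)
```

The same `max_abs_group_mean` check can be applied to the advantages produced by any max@K estimator to confirm whether it is centered on realized batches.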

Topics

Reinforcement Learning
Policy Gradients
Max@K Objective
Advantage Estimation
Leave-Two-Out Baseline
LLM Post-training

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.