Beyond Importance Sampling: Rejection-Gated Policy Optimization

2026-04-16 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Researchers propose Rejection-Gated Policy Optimization (RGPO), a policy optimization method that selectively trusts samples for policy updates instead of reweighting every sample by its importance ratio. RGPO replaces the importance sampling ratio r_theta with a smooth, differentiable acceptance gate alpha_theta(s, a) = g(r_theta(s, a)). Because the gate is a function of the policy parameters, it participates directly in gradient computation and is updated together with the policy. This keeps gradient variance finite and bounded even when importance ratios are heavy-tailed, a regime in which the variance of standard importance sampling diverges. In exchange, RGPO introduces a bounded, controllable bias, and it offers an approximate monotonic policy improvement guarantee similar to TRPO's. It matches PPO in computational cost, avoids second-order optimization, and extends to RLHF-style preference alignment. In online preference fine-tuning of Qwen2.5-1.5B-Instruct on Anthropic HH-RLHF, RGPO achieved 14.8% higher reward than PPO-RLHF and 16.0% lower KL divergence to the reference model.
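
The variance claim can be made concrete with a short derivation. This is a sketch under assumptions not stated in the summary: it takes the gated surrogate to be the usual advantage-weighted objective with g(r_theta) in place of r_theta, writes A for the advantage, and writes pi_old for the behavior policy. The paper's exact objective and gate may differ.

```latex
\begin{align*}
r_\theta(s,a) &= \frac{\pi_\theta(a \mid s)}{\pi_{\mathrm{old}}(a \mid s)}, \\
\nabla_\theta\, \mathbb{E}_{\pi_{\mathrm{old}}}\!\left[ r_\theta\, A \right]
  &= \mathbb{E}_{\pi_{\mathrm{old}}}\!\left[ r_\theta\, A\, \nabla_\theta \log \pi_\theta \right], \\
\nabla_\theta\, \mathbb{E}_{\pi_{\mathrm{old}}}\!\left[ g(r_\theta)\, A \right]
  &= \mathbb{E}_{\pi_{\mathrm{old}}}\!\left[ g'(r_\theta)\, r_\theta\, A\, \nabla_\theta \log \pi_\theta \right].
\end{align*}
```

Under these assumptions, if the gate satisfies |r g'(r)| <= C for all r > 0, each per-sample gradient term has norm at most C |A| ||grad log pi_theta||. Its second moment then no longer carries the r_theta^2 factor that can diverge when the ratios are heavy-tailed.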

Key takeaway

For AI engineers and research scientists developing reinforcement learning algorithms, RGPO offers an alternative to reweighting samples by importance ratios. Because its gradient variance stays finite and bounded even when importance ratios are heavy-tailed, training should be more stable in exactly the regime where importance sampling breaks down. RGPO is worth evaluating, especially for RLHF: in the reported experiment it reached higher reward and lower KL divergence than PPO-RLHF at a computational cost matching PPO's.

Key insights

RGPO uses a differentiable acceptance gate to selectively trust samples, ensuring bounded gradient variance in policy optimization.

Principles

Selectively trust samples for policy updates.
Keep gradient variance bounded even when importance ratios are heavy-tailed.

Method

RGPO replaces importance sampling ratios with a smooth, differentiable acceptance gate g(r_theta) that is implicitly updated with the policy, participating directly in gradient computation.
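
A small synthetic experiment can illustrate why gating helps. The sketch below is not the paper's gate: g(r) = r / (1 + r) is a hypothetical choice made only because it is smooth and bounded, the function names are invented for illustration, and the Pareto-distributed ratios and Gaussian advantages are synthetic stand-ins for real rollouts. It compares the scalar coefficient that multiplies grad log pi_theta under plain importance sampling (r * A) with the coefficient under the gate (g'(r) * r * A).

```python
import random


def gate(ratio):
    """Hypothetical acceptance gate g(r) = r / (1 + r); maps r > 0 into (0, 1)."""
    return ratio / (1.0 + ratio)


def gate_score_weight(ratio):
    """Coefficient on grad log pi_theta in the gradient of g(r_theta).

    By the chain rule, d g(r_theta)/d theta = g'(r) * r * grad log pi_theta.
    For this gate, g'(r) * r = r / (1 + r)**2, which never exceeds 0.25.
    """
    return ratio / (1.0 + ratio) ** 2


def sample_variance(values):
    """Unbiased sample variance of a list of floats."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


rng = random.Random(0)

# Synthetic heavy-tailed ratios: a Pareto distribution with shape 1.5
# has a finite mean but infinite variance.
ratios = [rng.paretovariate(1.5) for _ in range(10_000)]
# Synthetic advantages, independent of the ratios.
advantages = [rng.gauss(0.0, 1.0) for _ in ratios]

# Per-sample gradient coefficients under each scheme.
is_terms = [r * a for r, a in zip(ratios, advantages)]
gated_terms = [gate_score_weight(r) * a for r, a in zip(ratios, advantages)]

print("importance-sampling sample variance:", sample_variance(is_terms))
print("gated sample variance:              ", sample_variance(gated_terms))
```

With this particular gate the coefficient is capped at 0.25 regardless of how large the ratio gets, so the gated terms cannot inherit the heavy tail of the ratios. The price is bias: the gated gradient is no longer an unbiased estimate of the importance-sampled one, which is the trade-off the method description refers to.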

In practice

Apply RGPO for stable policy optimization.
Use RGPO in RLHF for preference alignment.

Topics

Rejection-Gated Policy Optimization
Policy Optimization
Importance Sampling
Reinforcement Learning from Human Feedback
Gradient Variance

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.