Beyond Importance Sampling: Rejection-Gated Policy Optimization
Summary
Researchers propose Rejection-Gated Policy Optimization (RGPO), a policy optimization method that selectively trusts samples for policy updates instead of reweighting every sample by its importance ratio. RGPO replaces the importance sampling ratio r_theta with a smooth, differentiable acceptance gate alpha_theta(s, a) = g(r_theta(s, a)). Because the gate is a function of the policy, it takes part in gradient computation and changes as the policy is updated. According to the authors, this keeps gradient variance finite and bounded even when importance ratios are heavy-tailed, a regime in which the variance of standard importance sampling diverges. The gate introduces only a bounded, controllable bias, and the method comes with an approximate monotonic policy improvement guarantee similar to TRPO's. RGPO matches PPO in computational cost, avoids second-order optimization, and extends to RLHF-style preference alignment. In online preference fine-tuning of Qwen2.5-1.5B-Instruct on Anthropic HH-RLHF, RGPO reached a 14.8% higher reward than PPO-RLHF and a 16.0% lower KL divergence to the reference model.
Key takeaway
For AI engineers and research scientists developing reinforcement learning algorithms, RGPO is a candidate alternative to standard importance sampling. Its finite, bounded gradient variance, even with heavy-tailed importance ratios, should make training more stable. It is worth evaluating in RLHF pipelines in particular: in the reported experiment it reached higher reward and lower KL divergence than PPO-RLHF at a computational cost comparable to PPO.
Key insights
RGPO uses a differentiable acceptance gate to selectively trust samples, ensuring bounded gradient variance in policy optimization.
Principles
- Selectively trust samples for policy updates.
- Ensure bounded gradient variance with heavy-tailed ratios.
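The variance principle above can be illustrated numerically. The sketch below is not the paper's gate: `gate` is a hypothetical stand-in, g(r) = 2r / (1 + r), chosen only because it is smooth, bounded, and equals 1 when the policies agree. The Pareto-distributed ratios are likewise an assumption, used to simulate a heavy tail whose true variance is infinite.

```python
import random
import statistics


def gate(r):
    """Hypothetical bounded gate (an assumption, not the paper's g).

    Smooth, gate(1) == 1, and values stay below 2 for every r > 0.
    """
    return 2.0 * r / (1.0 + r)


def sample_ratios(n, seed=0):
    """Simulated heavy-tailed importance ratios with mean 1.

    Pareto(shape=1.5) has mean 3 and infinite variance, so dividing
    by 3 gives mean-1 ratios whose variance still diverges.
    """
    rng = random.Random(seed)
    return [rng.paretovariate(1.5) / 3.0 for _ in range(n)]


ratios = sample_ratios(100_000)
gated = [gate(r) for r in ratios]

# The empirical variance of the raw ratios is dominated by a few
# extreme samples; the gated values cannot leave the interval [0, 2).
raw_var = statistics.pvariance(ratios)
gated_var = statistics.pvariance(gated)

print(f"max raw ratio:   {max(ratios):.1f}")
print(f"max gated value: {max(gated):.4f}")
print(f"raw variance:    {raw_var:.3f}")
print(f"gated variance:  {gated_var:.3f}")
```

Rerunning with different seeds should show the raw variance jumping around with the largest sample, while the gated variance stays below 1, the maximum possible for any quantity confined to an interval of width 2.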
Method
RGPO replaces the importance sampling ratio r_theta with a smooth, differentiable acceptance gate g(r_theta). Because the gate is a function of r_theta, it is updated implicitly whenever the policy parameters change, and it participates directly in gradient computation.
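One way to read "participates directly in gradient computation" is to substitute the gate for the ratio in a PPO-style surrogate, giving a batch objective of the form mean(g(r_theta) * A). The sketch below follows that reading, which is an assumption about the objective rather than a statement of the paper's exact loss. The gate g(r) = 2r / (1 + r), the function names, and the sample ratios are all hypothetical. Since r = exp(log pi_theta - log pi_old), the chain rule gives a per-sample gradient weight of g'(r) * r on grad log pi_theta, compared with a weight of r under plain importance sampling.

```python
import math


def gate(r):
    """Hypothetical bounded gate (an assumption, not the paper's g)."""
    return 2.0 * r / (1.0 + r)


def gate_grad_weight(r):
    """d gate(r) / d log pi_theta = g'(r) * r, because dr/dlogp = r.

    For this gate g'(r) = 2 / (1 + r)^2, so the weight is
    2r / (1 + r)^2, which never exceeds 0.5 (attained at r = 1).
    """
    return 2.0 * r / (1.0 + r) ** 2


def gated_surrogate(logp_new, logp_old, advantages):
    """Batch mean of gate(r) * A, with r = exp(logp_new - logp_old)."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        total += gate(math.exp(ln - lo)) * adv
    return total / len(advantages)


ratios = [0.1, 1.0, 10.0, 1000.0]
is_weights = list(ratios)  # plain importance sampling weights grad log pi by r
gated_weights = [gate_grad_weight(r) for r in ratios]

# Check the analytic weight against a central finite difference in log pi.
eps = 1e-6
max_fd_error = 0.0
for r in ratios:
    x = math.log(r)
    fd = (gate(math.exp(x + eps)) - gate(math.exp(x - eps))) / (2 * eps)
    max_fd_error = max(max_fd_error, abs(fd - gate_grad_weight(r)))

for r, w_is, w_gate in zip(ratios, is_weights, gated_weights):
    print(f"r={r:8.1f}  IS weight={w_is:8.1f}  gated weight={w_gate:.6f}")
```

With this particular gate, samples with very large or very small ratios contribute almost nothing to the gradient, which is the "selective trust" behaviour described above. A different choice of g would change the shape of that weighting.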
In practice
- Consider RGPO when heavy-tailed importance ratios make policy updates unstable.
- Use RGPO in RLHF for preference alignment.
Topics
- Rejection-Gated Policy Optimization
- Policy Optimization
- Importance Sampling
- Reinforcement Learning from Human Feedback
- Gradient Variance
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.