sGPO: Trading Inference FLOPs for Training Efficiency in RLVR
Summary
Sorted Group Policy Optimization (sGPO) is a new compute-efficient strategy for Reinforcement Learning with Verifiable Rewards (RLVR) training. Standard RLVR wastes training FLOPs on queries that are either too easy (producing near-zero advantage) or unsolvable (producing no signal). sGPO trades a small budget of inference FLOPs for a significant reduction in that waste. Its core insight is that cheap inference compute can serve as an offline proxy for query difficulty: a small batch of parallel samples per query yields an empirical success rate, which then sets the training rollout group size to maximize sample efficiency. The same profiling pass supports data filtering, adaptive group size allocation, and curriculum construction, with queries scheduled from easy to hard. The method matches or exceeds baseline performance while reducing total training compute by a factor of three, including the upfront inference profiling cost.
Key takeaway
Machine Learning Engineers optimizing RLVR training should consider implementing sGPO to reduce computational waste. By profiling query difficulty with a small inference budget, you can adaptively size rollout groups and filter data, potentially cutting total training compute by a factor of three while achieving comparable or better performance.
Key insights
sGPO uses cheap inference to profile query difficulty, optimizing RLVR training by reducing wasted FLOPs.
Principles
- Fixed rollout budgets waste compute.
- Inference can proxy query difficulty.
- Adaptive group size maximizes efficiency.
Method
sGPO generates a small batch of parallel samples per query under the initial policy to obtain an empirical success rate, then sets the training rollout group size to the inverse of this rate.
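The inverse-rate rule can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the 0/1 reward encoding, the rounding up to an integer, the clamp to `[min_group, max_group]`, and the fallback for a zero success rate (where the inverse is undefined) are all assumptions made for the example.

```python
import math
from typing import Sequence


def empirical_success_rate(rewards: Sequence[int]) -> float:
    """Fraction of profiling samples the verifier marked correct (1 = correct, 0 = incorrect)."""
    if not rewards:
        raise ValueError("need at least one profiling sample")
    return sum(rewards) / len(rewards)


def rollout_group_size(success_rate: float, min_group: int = 2, max_group: int = 64) -> int:
    """Group size set to the inverse of the success rate.

    Assumptions (not from the source): round up to an integer,
    clamp to [min_group, max_group], and fall back to max_group
    when no profiling sample succeeded.
    """
    if success_rate <= 0.0:
        return max_group  # assumption: 1 / 0 is undefined, so use the cap
    raw = math.ceil(1.0 / success_rate)
    return max(min_group, min(max_group, raw))


# One correct answer out of eight profiling samples.
rate = empirical_success_rate([1, 0, 0, 0, 0, 0, 0, 0])
print(rate, rollout_group_size(rate))
```

Under these assumptions, a query solved once in eight profiling samples gets a group of eight rollouts, while a query solved every time falls to the minimum group size.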
In practice
- Filter trivial RLVR queries.
- Sub-sample unsolvable queries.
- Construct curricula from easy to hard.
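The three practices above can be combined into a single pass over the profiled success rates. The sketch below is hypothetical: `build_curriculum`, the `trivial_threshold` and `unsolved_keep_fraction` parameters, and their default values are assumptions for illustration, and the source does not specify how many unsolvable queries to keep or where the trivial cutoff lies.

```python
import random
from typing import Dict, List


def build_curriculum(
    success_rates: Dict[str, float],
    trivial_threshold: float = 1.0,       # assumption: rates at or above this count as trivial
    unsolved_keep_fraction: float = 0.25,  # assumption: share of zero-success queries to keep
    seed: int = 0,
) -> List[str]:
    """Return query ids ordered from easy to hard after filtering and sub-sampling."""
    rng = random.Random(seed)
    kept = []
    for query_id, rate in success_rates.items():
        if rate >= trivial_threshold:
            continue  # filter trivial queries
        if rate <= 0.0 and rng.random() >= unsolved_keep_fraction:
            continue  # sub-sample queries with no observed success
        kept.append(query_id)
    # Easy to hard: highest empirical success rate first.
    return sorted(kept, key=lambda q: success_rates[q], reverse=True)


rates = {"a": 1.0, "b": 0.5, "c": 0.125, "d": 0.0}
print(build_curriculum(rates, unsolved_keep_fraction=0.0))
```

Ordering by descending success rate is one simple reading of "easy to hard"; other difficulty measures derived from the same profiling pass would fit the same structure.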
Topics
- Reinforcement Learning
- Policy Optimization
- Computational Efficiency
- Verifiable Rewards
- Curriculum Learning
- Inference Profiling
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.