HPO: Hysteretic Policy Optimization for Stable and Efficient Training under Sparse-Reward Regime
Summary
Hysteretic Policy Optimization (HPO) and its adaptive variant, A-HPO, address a common failure mode of GRPO-style reinforcement learning under sparse verifiable rewards. The failure arises when early updates contain more negative advantages than positive ones while response-level length normalization ties update magnitude to output length. HPO modifies GRPO in two ways: it reduces the weight of negative-advantage updates, and it replaces per-response length normalization with mean-length normalization. Adaptive HPO (A-HPO) sets the hysteretic weight from batch-level advantage-sign statistics, which removes the need to tune it by hand. In TeleLogs and Countdown experiments, A-HPO significantly improved reward per update, especially in the early sparse-reward regime. On TeleLogs, A-HPO achieved a final reward of 0.84, surpassing SAPO by 5%, GSPO by 11%, and GRPO by 15%, while maintaining comparable response length. On Countdown, A-HPO showed the largest gains across 1.5B-7B models in initial and difficult configurations.
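To make the sign imbalance concrete, the sketch below computes group-relative advantages in the usual GRPO style (reward minus group mean, divided by group standard deviation) for a group with a single success. The function names and the `eps` constant are illustrative choices, not taken from the original article.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps (an assumed constant) avoids division by zero when all rewards match.
    return [(r - mu) / (sigma + eps) for r in rewards]


def sign_counts(advantages):
    """Count strictly positive and strictly negative advantages."""
    n_pos = sum(1 for a in advantages if a > 0)
    n_neg = sum(1 for a in advantages if a < 0)
    return n_pos, n_neg


# Sparse binary reward: one correct response out of eight.
rewards = [1, 0, 0, 0, 0, 0, 0, 0]
advantages = group_relative_advantages(rewards)
print(sign_counts(advantages))
```

With one success in eight samples, seven of the eight responses receive a negative advantage, which is the imbalance the summary describes for early training.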
Key takeaway
For machine learning engineers training GRPO-style reinforcement learning agents in sparse-reward environments, Adaptive Hysteretic Policy Optimization (A-HPO) is worth adopting. It targets the early training instability caused by an excess of negative-advantage updates and by length normalization that ties update magnitude to output length. In the reported experiments, including Countdown runs on 1.5B-7B models, it improved reward per update and final reward. Because the hysteretic weight is set adaptively, that weight does not need manual tuning.
Key insights
A-HPO improves sparse-reward reinforcement learning stability and efficiency by adaptively balancing positive and negative advantage contributions.
Principles
- Negative advantage updates can destabilize sparse-reward RL.
- Adaptive weighting of advantages enhances training stability.
- Mean-length normalization improves update magnitude consistency.
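The last principle can be illustrated with a small sketch of per-token weights under the two normalization schemes. It assumes that mean-length normalization means dividing each response's summed token loss by the batch's average response length; the summary does not spell out the exact formula, so treat the function names and this reading as assumptions.

```python
def token_weights(lengths, mode):
    """Return the weight one token of each response gets in the batch loss."""
    if not lengths or any(n <= 0 for n in lengths):
        raise ValueError("lengths must be non-empty and positive")
    group_size = len(lengths)
    mean_len = sum(lengths) / group_size
    if mode == "per_response":
        # Each response is averaged over its own length.
        return [1.0 / (group_size * n) for n in lengths]
    if mode == "mean_length":
        # Assumed reading: every response is divided by the batch mean length.
        return [1.0 / (group_size * mean_len) for _ in lengths]
    raise ValueError(f"unknown mode: {mode}")


def response_weights(lengths, mode):
    """Total weight of each response: per-token weight times token count."""
    return [w * n for w, n in zip(token_weights(lengths, mode), lengths)]


lengths = [2, 8]  # one short and one long response
print(token_weights(lengths, "per_response"))
print(token_weights(lengths, "mean_length"))
```

Under per-response normalization, a token in the 2-token response is weighted four times as heavily as a token in the 8-token response. Under the assumed mean-length scheme, every token carries the same weight, so the per-token update magnitude no longer depends on how long the response happened to be.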
Method
HPO modifies GRPO by reducing negative-advantage update weight and using mean-length normalization. A-HPO adaptively sets this weight based on batch-level advantage-sign statistics.
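A minimal sketch of the weighting idea follows. The hysteretic step (scaling negative advantages by a weight in [0, 1]) follows the description above. The adaptive rule, by contrast, is a hypothetical stand-in: the summary says only that A-HPO uses batch-level advantage-sign statistics, so the positive-to-negative count ratio and the `floor` parameter here are assumptions, not the paper's formula.

```python
def hysteretic_advantages(advantages, beta):
    """Scale negative advantages by beta; leave non-negative ones unchanged."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must be in [0, 1]")
    return [a if a >= 0 else beta * a for a in advantages]


def adaptive_beta(advantages, floor=0.1):
    """Hypothetical adaptive rule (not from the source).

    Uses the ratio of positive to negative advantage counts in the batch,
    clipped to [floor, 1], so negatives are damped more when they dominate.
    """
    n_pos = sum(1 for a in advantages if a > 0)
    n_neg = sum(1 for a in advantages if a < 0)
    if n_neg == 0:
        return 1.0  # nothing to down-weight
    return max(floor, min(1.0, n_pos / n_neg))


batch = [1.5, -0.5, -0.5, -0.5]
beta = adaptive_beta(batch)  # one positive, three negatives
print(beta, hysteretic_advantages(batch, beta))
```

With `beta = 1.0` the weighting leaves the advantages untouched, so this scheme reduces to the unweighted case when positives and negatives are balanced under the assumed rule.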
In practice
- Apply A-HPO to GRPO-style RL for sparse reward tasks.
- Use A-HPO for improved early training stability.
- Consider A-HPO for 1.5B-7B models in difficult configurations.
Topics
- Reinforcement Learning
- Sparse Rewards
- Policy Optimization
- GRPO
- Hysteretic Policy Optimization
- Adaptive HPO
Best for: Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.