HPO: Hysteretic Policy Optimization for Stable and Efficient Training under Sparse-Reward Regime

Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

Hysteretic Policy Optimization (HPO) and its adaptive variant, A-HPO, target a failure mode of GRPO-style reinforcement learning under sparse verifiable rewards: early updates contain more negative advantages than positive ones, and response-level length normalization ties update magnitude to output length. HPO modifies GRPO in two ways: it down-weights negative-advantage updates, and it replaces per-response length normalization with mean-length normalization. A-HPO sets the hysteretic weight from batch-level advantage-sign statistics, so the weight needs no manual tuning. In experiments on TeleLogs and Countdown, A-HPO improved reward per update, most clearly in the early sparse-reward phase. On TeleLogs, A-HPO reached a final reward of 0.84, surpassing SAPO by 5%, GSPO by 11%, and GRPO by 15% at comparable response length. A-HPO also showed the largest gains on Countdown across 1.5B-7B models in the initial and difficult configurations.
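
The imbalance between negative and positive advantages can be seen in a small worked example. The sketch below assumes the usual GRPO convention of standardizing rewards within a group of responses sampled for one prompt; the group size, the reward values, and the use of the population standard deviation are illustrative assumptions, not figures from the original article.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical sparse-reward group: 8 responses, only one verified correct.
rewards = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)

n_pos = sum(a > 0 for a in advantages)
n_neg = sum(a < 0 for a in advantages)
print(n_pos, n_neg)  # prints "1 7"
```

With a single success in the group, seven of the eight responses receive a negative advantage, so most of the gradient signal in such a batch pushes probability away from sampled outputs.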

Key takeaway

For machine learning engineers training GRPO-style reinforcement learning agents in sparse-reward environments, A-HPO is worth evaluating as a modification of the GRPO objective. It targets early training instability caused by negative-advantage updates and length-dependent normalization, and it sets the hysteretic weight automatically rather than by manual tuning. In the reported experiments on 1.5B-7B models, it improved reward per update and final reward.

Key insights

A-HPO improves sparse-reward reinforcement learning stability and efficiency by adaptively balancing positive and negative advantage contributions.

Method

HPO modifies GRPO by reducing negative-advantage update weight and using mean-length normalization. A-HPO adaptively sets this weight based on batch-level advantage-sign statistics.
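
The two modifications can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not give the loss formula or the adaptive rule, so the function names, the reading of "mean-length normalization" as dividing by the batch mean response length, and the positive-to-negative count ratio used in `adaptive_beta` are all hypothetical. Importance-ratio clipping is omitted for brevity.

```python
def hysteretic_weights(advantages, beta):
    """Weight 1.0 for non-negative advantages, beta (<= 1) for negative ones."""
    return [1.0 if a >= 0 else beta for a in advantages]

def adaptive_beta(advantages, floor=0.05):
    """Hypothetical adaptive rule (NOT from the source): ratio of positive to
    negative advantage counts in the batch, clipped to [floor, 1]."""
    n_pos = sum(a > 0 for a in advantages)
    n_neg = sum(a < 0 for a in advantages)
    if n_neg == 0:
        return 1.0
    return max(floor, min(1.0, n_pos / n_neg))

def hpo_style_loss(token_scores, advantages, beta):
    """token_scores[i] holds per-token surrogate terms for response i.

    Each response's token sum is divided by the batch mean length (assumed
    meaning of mean-length normalization) instead of by its own length.
    """
    mean_len = sum(len(s) for s in token_scores) / len(token_scores)
    weights = hysteretic_weights(advantages, beta)
    total = 0.0
    for scores, adv, w in zip(token_scores, advantages, weights):
        total += w * adv * sum(scores) / mean_len
    return -total / len(token_scores)

# Hypothetical batch: three responses of lengths 4, 12, and 8 tokens.
token_scores = [[1.0] * 4, [1.0] * 12, [1.0] * 8]
advantages = [1.5, -0.5, -1.0]
beta = adaptive_beta(advantages)  # one positive, two negative
loss = hpo_style_loss(token_scores, advantages, beta)
```

Because every response shares the same divisor, each token carries the same weight regardless of which response it belongs to, while `beta` scales down only the responses with negative advantages.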

Best for: Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.