Selective-Advantage Entropy-Adaptive Horizon GRPO: Asymmetric Token-Level Discounting for Efficient Reinforcement Learning of Language Models
Summary
Selective-Advantage Entropy-Adaptive Horizon GRPO (SA-AH-GRPO) extends Group Relative Policy Optimisation (GRPO), a reinforcement learning algorithm for aligning language models on reasoning tasks. Whereas GRPO treats all token positions and rollouts symmetrically, SA-AH-GRPO adds two complementary mechanisms: Adaptive-Horizon GRPO (AH-GRPO), which uses a cumulative entropy-based discount to shorten the effective horizon when the model is uncertain, and Selective-Advantage AH-GRPO, which applies that discount only to negative-advantage rollouts. The method was evaluated on the GSM8K mathematical reasoning benchmark with Qwen 2.5-1.5B-Instruct and Qwen 2.5-3B-Instruct, both fine-tuned with LoRA. On the 3B model, SA-AH-GRPO reached a peak Pass@1 of 0.858 at step 30 and held 0.846 after 180 steps, with training variance reduced 3.6-fold to 0.0246. On the 1.5B model, it reached a peak Pass@1 of 0.686, above the zero-shot baseline of 0.637.
Key takeaway
For machine learning engineers aligning language models on reasoning tasks, SA-AH-GRPO offers a principled inductive bias that stabilizes training and improves performance. Teams should consider implementing this asymmetric token-level discounting, especially for structured generation tasks with verifiable rewards, to achieve higher accuracy and lower training variance than standard GRPO (a 3.6-fold variance reduction is reported for the 3B model).
Key insights
Applying token-level discounting asymmetrically in GRPO, only to negative-advantage rollouts, stabilizes reinforcement learning for language models and improves accuracy on GSM8K.
Principles
- Symmetric token/rollout treatment in GRPO can be suboptimal.
- Entropy-based discounting reduces effective horizon during uncertainty.
- Asymmetric discounting preserves full gradient signal on correct solutions.
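The principles above can be written compactly. The exact functional form is not given in this summary, so the following is only one plausible reading: H_{i,s} is the entropy of the policy's next-token distribution at position s of rollout i, A_i is the group-relative advantage of rollout i, and \lambda is a hypothetical discount-strength hyperparameter. The exponential form and the choice to sum over earlier positions only are both assumptions.

```latex
% Assumed form of the cumulative entropy-based discount (not taken from the source):
w_{i,t} = \exp\!\Big(-\lambda \sum_{s<t} H_{i,s}\Big), \qquad \lambda \ge 0

% Asymmetric (selective-advantage) application:
\tilde{A}_{i,t} =
\begin{cases}
  A_i,            & A_i \ge 0 \quad \text{(full signal on positive-advantage rollouts)} \\
  w_{i,t}\, A_i,  & A_i < 0   \quad \text{(discounted signal on negative-advantage rollouts)}
\end{cases}
```

Under this reading, high accumulated entropy drives w_{i,t} toward zero, so later tokens of an uncertain rollout contribute little gradient, which is what "reducing the effective horizon" means here.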
Method
AH-GRPO weights policy gradients with a cumulative entropy-based discount. SA-AH-GRPO applies this discount exclusively to negative-advantage rollouts, leaving positive trajectories unattenuated.
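As a rough illustration of the method described above, the sketch below computes per-token advantage weights for a group of rollouts. It is not the authors' implementation: the exponential discount, the `lam` hyperparameter, the choice to accumulate entropy over earlier positions only, and all function names are assumptions made for this example. Only the standard library is used.

```python
import math
from typing import List


def token_entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_discount(entropies: List[float], lam: float = 0.1) -> List[float]:
    """Assumed discount: w_t = exp(-lam * sum of entropies before position t)."""
    weights, cumulative = [], 0.0
    for h in entropies:
        weights.append(math.exp(-lam * cumulative))
        cumulative += h
    return weights


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantage: reward standardised within the rollout group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


def selective_token_advantages(
    rewards: List[float],
    entropies_per_rollout: List[List[float]],
    lam: float = 0.1,
) -> List[List[float]]:
    """Per-token advantages: discounted only for negative-advantage rollouts."""
    advantages = group_relative_advantages(rewards)
    result = []
    for adv, entropies in zip(advantages, entropies_per_rollout):
        if adv < 0:
            weights = entropy_discount(entropies, lam)  # attenuate later tokens
        else:
            weights = [1.0] * len(entropies)  # leave positive rollouts untouched
        result.append([adv * w for w in weights])
    return result


# Toy group: one correct rollout (reward 1) and one incorrect rollout (reward 0),
# each two tokens long with a uniform two-way distribution at every position.
h = token_entropy([0.5, 0.5])
token_advs = selective_token_advantages([1.0, 0.0], [[h, h], [h, h]])
```

In this toy group, both tokens of the correct rollout keep the same advantage, while the second token of the incorrect rollout receives a smaller-magnitude negative advantage than the first. Applying the discount to every rollout regardless of sign would correspond to the symmetric AH-GRPO variant.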
In practice
- Implement SA-AH-GRPO for robust LM alignment on reasoning tasks.
- Apply SA-AH-GRPO to structured generation with verifiable rewards.
- Combine SA-AH-GRPO with LoRA for efficient fine-tuning.
Topics
- Reinforcement Learning
- Language Model Alignment
- GRPO
- Token-Level Discounting
- Entropy-Adaptive Horizon
- GSM8K Benchmark
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.