Selective-Advantage Entropy-Adaptive Horizon GRPO: Asymmetric Token-Level Discounting for Efficient Reinforcement Learning of Language Models

2026-06-03 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Selective-Advantage Entropy-Adaptive Horizon GRPO (SA-AH-GRPO) is an extension of Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm for aligning language models on reasoning tasks. Whereas GRPO treats all token positions and rollouts symmetrically, SA-AH-GRPO adds two complementary mechanisms. Adaptive-Horizon GRPO (AH-GRPO) applies a cumulative entropy-based discount that shortens the effective horizon when the model is uncertain. The selective-advantage variant (SA-AH-GRPO) applies that discount only to negative-advantage rollouts. The method was evaluated on the GSM8K mathematical reasoning benchmark using Qwen 2.5-1.5B-Instruct and Qwen 2.5-3B-Instruct fine-tuned with LoRA. On the 3B model, SA-AH-GRPO reached a peak Pass@1 of 0.858 at step 30 and held 0.846 at 180 steps, with training variance reduced by a factor of 3.6, to 0.0246. On the 1.5B model, it reached a peak Pass@1 of 0.686, above the 0.637 zero-shot baseline.

Key takeaway

For machine learning engineers aligning language models on reasoning tasks, SA-AH-GRPO offers a principled inductive bias that stabilizes training and improves performance. Teams should consider implementing this asymmetric token-level discounting, especially for structured generation tasks with verifiable rewards, where it can raise accuracy and substantially reduce training variance compared with standard GRPO.

Key insights

Asymmetric token-level discounting in GRPO stabilizes training and improves accuracy when language models are fine-tuned with reinforcement learning.

Principles

Symmetric token/rollout treatment in GRPO can be suboptimal.
Entropy-based discounting reduces effective horizon during uncertainty.
Asymmetric discounting preserves full gradient signal on correct solutions.
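One way to write these principles down is sketched below. The group-relative advantage is the standard GRPO form. The exponential shape of the discount, the rate \lambda, and the choice to include the current token in the cumulative sum are assumptions made for illustration, since this summary does not give the exact formula.

```latex
% Group-relative advantage for rollout i in a group of G rollouts (standard GRPO)
A_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}

% Policy entropy at token position t of rollout i (prompt q, vocabulary V)
H_{i,t} = -\sum_{v \in V} \pi_\theta(v \mid q, o_{i,<t}) \, \log \pi_\theta(v \mid q, o_{i,<t})

% ASSUMED discount: exponential decay in cumulative entropy, with rate \lambda > 0
d_{i,t} = \exp\!\Big( -\lambda \sum_{s \le t} H_{i,s} \Big)

% Asymmetric token weight: discount only negative-advantage rollouts
w_{i,t} =
\begin{cases}
  d_{i,t} & \text{if } A_i < 0, \\
  1       & \text{otherwise.}
\end{cases}
```

Because entropy is non-negative, d_{i,t} never increases with t, and it falls faster after high-entropy tokens. That is the sense in which the effective horizon shrinks under uncertainty, while rollouts with non-negative advantage keep weight 1 at every position.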

Method

AH-GRPO weights policy gradients with a cumulative entropy-based discount. SA-AH-GRPO applies this discount exclusively to negative-advantage rollouts, leaving positive trajectories unattenuated.
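The method can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the exponential discount form, and the rate `lam` are all hypothetical choices made here, and padding masks, clipping, and KL terms are left out.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: z-score each rollout's reward within its group."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def entropy_discount(token_entropies, lam=0.1):
    """ASSUMED discount form: exp(-lam * cumulative entropy up to each token)."""
    entropies = np.asarray(token_entropies, dtype=np.float64)
    return np.exp(-lam * np.cumsum(entropies, axis=-1))


def sa_ah_token_advantages(rewards, token_entropies, lam=0.1):
    """Per-token advantages, discounted only for negative-advantage rollouts.

    rewards:         shape (G,), one scalar reward per rollout in the group
    token_entropies: shape (G, T), policy entropy at each token position
    """
    adv = group_relative_advantages(rewards)        # shape (G,)
    disc = entropy_discount(token_entropies, lam)   # shape (G, T)
    # Asymmetry: positive-advantage rollouts keep weight 1 at every token.
    weights = np.where(adv[:, None] < 0, disc, 1.0)
    return adv[:, None] * weights


# Toy group of 4 rollouts with 3 tokens each (made-up numbers).
rewards = [1.0, 0.0, 1.0, 0.0]
entropies = np.ones((4, 3))
token_adv = sa_ah_token_advantages(rewards, entropies, lam=0.5)
print(token_adv)
```

In the toy group, rows for the rewarded rollouts carry the same advantage at every token, while rows for the unrewarded rollouts shrink in magnitude toward later positions. The resulting array would multiply the per-token log-probability gradients in the policy loss.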

In practice

Implement SA-AH-GRPO for robust LM alignment on reasoning tasks.
Apply SA-AH-GRPO to structured generation with verifiable rewards.
Combine SA-AH-GRPO with LoRA for efficient fine-tuning.

Topics

Reinforcement Learning
Language Model Alignment
GRPO
Token-Level Discounting
Entropy-Adaptive Horizon
GSM8K Benchmark

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.