STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability

2026-06-18 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, extended

Summary

STARE (Surprisal-guided Token-level Advantage Reweighting for policy Entropy stability) is a mechanism for mitigating policy entropy collapse in Group Relative Policy Optimization (GRPO) style algorithms used in large language model (LLM) post-training. A first-order gradient analysis reveals a token-level credit assignment mismatch in GRPO, which arises because every token in a response shares one trajectory-level advantage. The analysis yields an advantage–surprisal four-quadrant structure for entropy evolution and a near-criticality property: a mild token-level weight perturbation is enough to change the direction in which entropy evolves. Motivated by this, STARE identifies entropy-critical token subsets using batch-internal surprisal quantiles, selectively reweights their effective advantages, and adds a target-entropy closed-loop gate for stable entropy regulation. Evaluated on models from 1.5B to 32B parameters and three task families (Short CoT, Long CoT, Multi-Turn Tool Use), STARE sustains stable RL training over thousands of steps while keeping policy entropy within a target band. It consistently outperforms DAPO and other baselines by 4%–8% in average accuracy on AIME24 and AIME25.

Key takeaway

AI scientists and machine learning engineers who post-train LLMs with GRPO-style algorithms should consider integrating STARE to counter policy entropy collapse. Its surprisal-guided token-level advantage reweighting and closed-loop entropy gating enable stable, long-horizon RL training, with reported gains of 4%–8% in average accuracy on AIME24 and AIME25 over DAPO and other baselines. Holding entropy within a target band may also improve the exploration-exploitation balance and leave room for further optimization.

Key insights

STARE rebalances token-level credit in GRPO via surprisal-guided reweighting and a closed-loop gate to prevent policy entropy collapse.

Principles

GRPO's shared trajectory-level advantages cause a token-level credit assignment mismatch.
Policy entropy evolution exhibits an advantage–surprisal four-quadrant structure.
A mild token-level weight perturbation suffices to alter the entropy evolution direction.
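
The first two principles can be illustrated with a minimal Python sketch. It assumes the usual GRPO convention of standardizing rewards within a sampled group and defines surprisal as -log pi(token | context). The function names, the fixed surprisal threshold, and the toy numbers are hypothetical and are not taken from the paper or its code. The sketch shows how one trajectory-level advantage is shared by every token in a response, so tokens with very different surprisal receive identical credit and fall into four advantage/surprisal quadrants.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one sampled group (GRPO-style)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def quadrant(advantage, surprisal, surprisal_threshold):
    """Label a token by advantage sign and surprisal level.

    A zero advantage is treated as negative here for simplicity.
    """
    sign = "pos" if advantage > 0 else "neg"
    level = "high" if surprisal >= surprisal_threshold else "low"
    return f"{sign}-adv/{level}-surprisal"

# Two sampled responses for one prompt: one rewarded, one not.
rewards = [1.0, 0.0]
advantages = group_relative_advantages(rewards)

# Hypothetical per-token surprisals in nats, one list per response.
surprisals = [[0.1, 2.5, 0.3], [0.2, 3.0]]

threshold = 1.0  # illustrative cut-off, not a value from the paper

# Every token in a response inherits that response's single advantage.
labels = [
    [quadrant(adv, s, threshold) for s in token_s]
    for adv, token_s in zip(advantages, surprisals)
]

for row in labels:
    print(row)
```

In this toy batch all three tokens of the rewarded response get the same positive advantage even though only one of them was a surprising choice, which is the mismatch the first principle describes.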

Method

STARE identifies entropy-critical tokens via batch-internal surprisal quantiles, selectively reweights their effective advantages, and uses a target-entropy closed-loop gate for stable regulation.
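
A rough sketch of the selection and reweighting steps follows, under stated assumptions. The summary does not give the exact rule, so this version takes the batch-internal surprisal quantile as a threshold and multiplies the advantage of positive-advantage tokens above it. The quantile level `q`, the multiplier `boost`, the scalar `gate`, and both function names are hypothetical knobs chosen for illustration. It should not be read as the implementation in hp-luo/STARE.

```python
import math

def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, 0 <= q <= 1."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def reweight_advantages(advantages, surprisals, q=0.6, boost=1.5, gate=1.0):
    """Scale the advantage of positive-advantage, high-surprisal tokens.

    advantages, surprisals: flat per-token lists for one batch.
    q, boost, gate: hypothetical knobs, not values from the paper.
    gate in [0, 1] interpolates between no reweighting (0) and full boost (1).
    """
    threshold = quantile(surprisals, q)  # computed from this batch only
    weight = 1.0 + gate * (boost - 1.0)
    return [
        a * weight if (a > 0 and s >= threshold) else a
        for a, s in zip(advantages, surprisals)
    ]

# Flattened toy batch: three tokens from a rewarded response, two from a failed one.
token_adv = [1.0, 1.0, 1.0, -1.0, -1.0]
token_surprisal = [0.1, 2.5, 0.3, 0.2, 3.0]

effective = reweight_advantages(token_adv, token_surprisal)
print(effective)
```

Because the threshold is a quantile of the current batch rather than a fixed constant, the fraction of tokens selected stays roughly the same as the policy's surprisal distribution shifts during training. Setting `gate=0.0` recovers the unmodified advantages.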

In practice

Implement surprisal-guided advantage reweighting for GRPO-style RL training.
Utilize a target-entropy closed-loop gate to prevent both entropy collapse and over-exploration.
Prioritize amplifying positive-advantage high-surprisal tokens for best performance.
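
One way to picture the target-entropy closed-loop gate is as a small feedback controller. The sketch below is an assumption-heavy illustration: it uses a clipped proportional update, and it assumes that amplifying positive-advantage, high-surprisal tokens pushes entropy upward, so the gate opens when measured entropy falls below target and closes when it rises above. The function name `update_gate`, the step size, the target value, and the entropy readings are all made up for the example. The paper's actual gate may differ.

```python
def update_gate(gate, entropy, target, step_size=0.5):
    """One proportional-control step for a gate kept in [0, 1].

    Entropy below target opens the gate (stronger reweighting).
    Entropy above target closes it, limiting over-exploration.
    """
    gate += step_size * (target - entropy)
    return min(1.0, max(0.0, gate))

gate = 0.0
history = []
# Made-up policy entropy measurements (nats) over five training steps.
for entropy in [0.30, 0.20, 0.10, 0.35, 0.60]:
    gate = update_gate(gate, entropy, target=0.30)
    history.append(round(gate, 3))

print(history)
```

The resulting gate value could then scale the strength of the advantage reweighting at each step, so the intervention fades out once entropy is back inside the target band.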

Topics

RL with Verifiable Rewards
Policy Entropy Stability
GRPO Algorithms
Token-level Credit Assignment
Surprisal Reweighting
LLM Post-training

Code references

hp-luo/STARE

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.