STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability
Summary
STARE (Surprisal-guided Token-level Advantage Reweighting for policy Entropy stability) is a mechanism for mitigating policy entropy collapse in Group Relative Policy Optimization (GRPO) style algorithms used for large language model (LLM) post-training. A first-order gradient analysis shows that GRPO suffers from a token-level credit assignment mismatch, and that entropy evolution follows an advantage–surprisal four-quadrant structure with a near-criticality property. Motivated by this, STARE identifies entropy-critical token subsets using batch-internal surprisal quantiles, selectively reweights their effective advantages, and adds a target-entropy closed-loop gate for stable entropy regulation. Evaluated on models from 1.5B to 32B parameters and three task families (Short CoT, Long CoT, Multi-Turn Tool Use), STARE sustains stable RL training over thousands of steps while keeping policy entropy within a target band. It consistently outperforms DAPO and other baselines by 4%–8% in average accuracy on AIME24 and AIME25.
Key takeaway
AI scientists and machine learning engineers who post-train LLMs with GRPO-style algorithms should consider integrating STARE to counter policy entropy collapse. Its surprisal-guided token-level advantage reweighting and closed-loop entropy gating enable stable, long-horizon RL training, with reported accuracy gains of 4%–8% on reasoning tasks. Because training stays stable for longer, STARE can also leave more room for further optimization and a better exploration-exploitation balance.
Key insights
STARE rebalances token-level credit in GRPO via surprisal-guided reweighting and a closed-loop gate to prevent policy entropy collapse.
Principles
- GRPO's shared trajectory-level advantages cause a token-level credit assignment mismatch.
- Policy entropy evolution exhibits an advantage–surprisal four-quadrant structure.
- A mild perturbation of token-level weights suffices to change the direction of entropy evolution.
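The first principle can be made concrete with a minimal sketch of how GRPO-style training assigns credit. It assumes the standard group-normalized advantage (reward minus group mean, divided by group standard deviation); the function name `grpo_token_advantages` and the toy rewards and lengths are illustrative and not taken from the paper.

```python
import numpy as np

def grpo_token_advantages(rewards, lengths, eps=1e-6):
    """Group-relative advantages, broadcast to every token of each response.

    rewards: one scalar reward per sampled response in the group.
    lengths: number of generated tokens in each response.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    # One scalar advantage per response, normalized within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + eps)
    # Every token of response i receives the same scalar adv[i],
    # regardless of how surprising or decisive that token was.
    return [np.full(n, a) for a, n in zip(adv, lengths)]

# Toy group of four responses: two correct (reward 1), two incorrect (reward 0).
token_adv = grpo_token_advantages(rewards=[1.0, 0.0, 0.0, 1.0],
                                  lengths=[3, 5, 2, 4])
```

Because each response contributes a single number, a rare decisive token and a routine filler token in the same response are credited identically, which is the mismatch the principles describe.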
Method
STARE identifies entropy-critical tokens via batch-internal surprisal quantiles, selectively reweights their effective advantages, and uses a target-entropy closed-loop gate for stable regulation.
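A rough sketch of the reweighting step follows, under stated assumptions. The summary does not give the exact formula, so the quantile level `q`, the strength `alpha`, the multiplicative form `1 + alpha * gate`, and the choice to touch only positive-advantage high-surprisal tokens are all hypothetical choices for illustration.

```python
import numpy as np

def reweight_advantages(advantages, logprobs, gate, q=0.8, alpha=0.2):
    """Scale the advantages of positive-advantage, high-surprisal tokens.

    advantages: per-token advantages (flattened over the batch).
    logprobs:   per-token log-probabilities under the current policy.
    gate:       scalar in [-1, 1] supplied by an entropy controller (assumed).
    q, alpha:   illustrative hyperparameters, not values from the paper.
    """
    advantages = np.asarray(advantages, dtype=np.float64)
    surprisal = -np.asarray(logprobs, dtype=np.float64)
    # Batch-internal threshold: the q-th quantile of surprisal in this batch.
    threshold = np.quantile(surprisal, q)
    # Tokens treated as entropy-critical in this sketch.
    critical = (surprisal >= threshold) & (advantages > 0)
    weights = np.where(critical, 1.0 + alpha * gate, 1.0)
    return advantages * weights, critical

adv = [1.0, 1.0, -1.0, -1.0, 1.0]
logp = [-0.1, -3.0, -0.2, -2.5, -0.05]
new_adv, critical = reweight_advantages(adv, logp, gate=1.0)
```

In this toy batch only the second token is both high-surprisal and positively rewarded, so it alone is amplified; with `gate=0.0` the advantages pass through unchanged.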
In practice
- Implement surprisal-guided advantage reweighting for GRPO-style RL training.
- Use a target-entropy closed-loop gate to prevent both entropy collapse and over-exploration.
- Prioritize amplifying positive-advantage high-surprisal tokens for best performance.
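One way to picture the closed-loop gate is as a clipped proportional controller on the gap between target and measured entropy. This is an assumed form rather than the paper's definition; `entropy_gate`, `target`, and `band` are hypothetical names and values.

```python
def entropy_gate(entropy, target, band=0.05):
    """Clipped proportional gate in [-1, 1].

    Positive when measured entropy is below target (push entropy up),
    negative when it is above target (pull it back down), zero at target.
    band sets how far from target the gate saturates (assumed value).
    """
    error = (target - entropy) / band
    return max(-1.0, min(1.0, error))

# Toy readings of batch policy entropy against a target of 0.30.
collapsing = entropy_gate(0.10, target=0.30)   # far below target
on_target = entropy_gate(0.30, target=0.30)    # at target
exploring = entropy_gate(0.50, target=0.30)    # far above target
```

A signed output lets a single controller address both failure modes named above: the gate strengthens the reweighting when entropy falls and reverses it when entropy overshoots.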
Topics
- RL with Verifiable Rewards
- Policy Entropy Stability
- GRPO Algorithms
- Token-level Credit Assignment
- Surprisal Reweighting
- LLM Post-training
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.