STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability

Source: cs.CL updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Natural Language Processing) · Depth: Expert, extended

Summary

STARE (Surprisal-guided Token-level Advantage Reweighting for policy Entropy stability) is a mechanism for mitigating policy entropy collapse in Group Relative Policy Optimization (GRPO) style algorithms used for large language model (LLM) post-training. A first-order gradient analysis reveals a token-level credit assignment mismatch in GRPO, which gives rise to a four-quadrant structure over advantage and surprisal and to a near-criticality property. Motivated by this analysis, STARE identifies entropy-critical token subsets using batch-internal surprisal quantiles, selectively reweights their effective advantages, and adds a target-entropy closed-loop gate for stable entropy regulation. Evaluated across model scales from 1.5B to 32B parameters and three task families (Short CoT, Long CoT, Multi-Turn Tool Use), STARE sustains stable RL training over thousands of steps while keeping policy entropy within a target band. It consistently outperforms DAPO and other baselines by 4%–8% in average accuracy on AIME24 and AIME25.
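To make the credit assignment mismatch concrete, here is a minimal sketch of the standard GRPO group-normalized advantage alongside per-token surprisal. The function names, rewards, and token probabilities are illustrative assumptions, not taken from the paper. The point is that every token in a response inherits the same sequence-level advantage, whatever its own surprisal.

```python
import math
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each reward within its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def token_surprisals(token_logprobs):
    """Surprisal of each sampled token: -log pi(token | context)."""
    return [-lp for lp in token_logprobs]

# Four sampled responses to one prompt with 0/1 correctness rewards (made-up values).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Log-probabilities of the sampled tokens in the first response (made-up values).
logprobs = [math.log(0.9), math.log(0.05), math.log(0.6)]
surprisals = token_surprisals(logprobs)

# Each token is paired with the SAME sequence-level advantage,
# even though the second token is far more surprising than the first.
per_token = [(advantages[0], s) for s in surprisals]
```

Pairing a shared advantage with widely varying surprisal is what produces the four quadrants (sign of advantage by low or high surprisal) that the summary refers to.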

Key takeaway

AI scientists and machine learning engineers who post-train LLMs with GRPO-style algorithms should consider integrating STARE to counter policy entropy collapse. Its surprisal-guided token-level advantage reweighting and closed-loop entropy gating enable stable, long-horizon RL training, with reported gains of 4%–8% in average accuracy on AIME24 and AIME25. Keeping entropy within a target band preserves exploration late in training, which leaves room for further optimization.

Key insights

STARE rebalances token-level credit in GRPO via surprisal-guided reweighting and a closed-loop gate to prevent policy entropy collapse.

Method

STARE identifies entropy-critical tokens via batch-internal surprisal quantiles, selectively reweights their effective advantages, and uses a target-entropy closed-loop gate for stable regulation.
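The three steps above can be sketched as follows. This is a hypothetical illustration, not the paper's exact rule: the quantile cutoffs (`q_low`, `q_high`), the damping factor `scale`, the tolerance `band`, and the choice of which quadrant to damp are all assumptions made for this example. It relies on the general observation that raising the probability of already-likely tokens, or lowering that of unlikely ones, tends to reduce entropy, while the opposite pairings tend to increase it.

```python
def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, q in [0, 1]."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def reweight_advantages(advantages, surprisals, entropy, target_entropy,
                        band=0.05, q_low=0.2, q_high=0.8, scale=0.5):
    """Hypothetical surprisal-gated advantage reweighting (all defaults assumed)."""
    # Step 1: batch-internal surprisal quantiles define the critical regions.
    low = quantile(surprisals, q_low)
    high = quantile(surprisals, q_high)

    # Step 3 (gate): act only when measured entropy leaves the target band.
    if entropy < target_entropy - band:
        damp_reducing = True       # entropy too low: damp entropy-reducing tokens
    elif entropy > target_entropy + band:
        damp_reducing = False      # entropy too high: damp entropy-raising tokens
    else:
        return list(advantages)    # inside the band: leave advantages untouched

    out = []
    for a, s in zip(advantages, surprisals):
        # Likely token pushed up, or unlikely token pushed down: lowers entropy.
        reduces = (s <= low and a > 0) or (s >= high and a < 0)
        # Unlikely token pushed up, or likely token pushed down: raises entropy.
        raises = (s >= high and a > 0) or (s <= low and a < 0)
        critical = reduces if damp_reducing else raises
        # Step 2: scale the effective advantage of critical tokens only.
        out.append(a * scale if critical else a)
    return out

# Toy batch of five tokens (made-up values).
adv = [1.0, 1.0, -1.0, -1.0, 1.0]
sur = [0.1, 3.0, 0.1, 3.0, 1.0]
collapsing = reweight_advantages(adv, sur, entropy=0.1, target_entropy=0.3)
in_band = reweight_advantages(adv, sur, entropy=0.3, target_entropy=0.3)
```

In an actual training loop the reweighted values would replace the per-token advantages inside the policy-gradient loss, and `entropy` would be the policy entropy measured on the current batch.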

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.