STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability

2026-06-17 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computation and Language · Depth: Expert, quick

Summary

STARE (Surprisal-guided Token-level Advantage Reweighting for policy Entropy stability) is an algorithm that addresses policy entropy collapse in Reinforcement Learning with Verifiable Rewards (RLVR) for Large Language Models, a failure mode that methods such as GRPO often exhibit during training. A first-order gradient analysis traces the collapse to a token-level credit assignment mismatch, characterized by an advantage-surprisal four-quadrant structure and a near-criticality property. STARE counters this in three steps: it identifies entropy-critical token subsets via batch-internal surprisal quantiles, selectively reweights their effective advantages, and applies a target-entropy closed-loop gate to regulate entropy. Evaluated on models from 1.5B to 32B parameters and three task families (Short CoT, Long CoT, Multi-Turn Tool Use), STARE sustains stable RL training over thousands of steps while keeping policy entropy within the target band. It outperforms DAPO and other baselines by 4%-8% in average accuracy on AIME24 and AIME25, indicating a sustained exploration-exploitation balance.

Key takeaway

For machine learning engineers training reasoning LLMs with RL, STARE targets policy entropy collapse directly. If training with GRPO-like algorithms is unstable or the exploration-exploitation balance degrades, consider integrating STARE's surprisal-guided advantage reweighting and closed-loop entropy regulation. In the reported experiments, this approach sustained stable RL training over thousands of steps and improved average accuracy by 4%-8% on AIME24 and AIME25.

Key insights

Policy entropy in LLM RL can be kept stable, and collapse avoided, by reweighting token-level advantages based on surprisal and regulating the reweighting with a closed-loop entropy gate.

Principles

Token-level credit assignment mismatch causes entropy collapse.
Surprisal quantiles identify entropy-critical token subsets.
Stable entropy regulation requires a closed-loop gate.
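
To make the first two principles concrete, here is a minimal sketch that computes per-token surprisal (the negative log-probability of the sampled token) and tags each token with its advantage-surprisal quadrant. The function names, the NumPy-based setup, and the choice of the batch median (`q=0.5`) as the surprisal split are illustrative assumptions. The summary does not say how the quadrants are delimited or which of them drive collapse, so the sketch only labels tokens.

```python
import numpy as np

def token_surprisal(logits, token_ids):
    """Return -log pi(token) for each sampled token.

    logits: (T, V) pre-softmax scores; token_ids: (T,) sampled token ids.
    """
    logits = np.asarray(logits, dtype=np.float64)
    token_ids = np.asarray(token_ids)
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(token_ids)), token_ids]

def quadrant_labels(surprisal, advantage, q=0.5):
    """Tag tokens by advantage sign and by surprisal relative to a batch quantile.

    The split at quantile q is an assumption made for illustration.
    """
    surprisal = np.asarray(surprisal, dtype=np.float64)
    advantage = np.asarray(advantage, dtype=np.float64)
    threshold = np.quantile(surprisal, q)  # computed inside the batch
    high = surprisal > threshold
    pos = advantage > 0
    labels = np.empty(len(surprisal), dtype=object)
    labels[pos & high] = "A+/high-surprisal"
    labels[pos & ~high] = "A+/low-surprisal"
    labels[~pos & high] = "A-/high-surprisal"
    labels[~pos & ~high] = "A-/low-surprisal"
    return labels
```

Counting labels per batch is a cheap diagnostic for seeing how advantage mass is distributed across confident and surprising tokens during training.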

Method

STARE identifies entropy-critical token subsets using batch-internal surprisal quantiles, selectively reweights their effective advantages, and incorporates a target-entropy closed-loop gate for stable entropy regulation.
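
The selection and reweighting steps can be sketched as follows. This is not the authors' implementation: the summary does not specify which tokens count as entropy-critical or how their advantages are rescaled. The sketch assumes, purely for illustration, that the critical subset is the positive-advantage tokens whose surprisal falls at or below a batch quantile `q_low`, and that their advantages are shrunk toward `scale` in proportion to a gate value in [0, 1]. The names `reweight_advantages`, `q_low`, `scale`, and `gate` are all hypothetical.

```python
import numpy as np

def reweight_advantages(advantage, surprisal, gate, q_low=0.2, scale=0.5):
    """Shrink advantages of an assumed entropy-critical token subset.

    gate = 0 leaves advantages untouched; gate = 1 multiplies the selected
    tokens' advantages by `scale`. The selection rule is an assumption.
    """
    advantage = np.asarray(advantage, dtype=np.float64)
    surprisal = np.asarray(surprisal, dtype=np.float64)
    # Batch-internal threshold: recomputed from the current batch only.
    threshold = np.quantile(surprisal, q_low)
    critical = (surprisal <= threshold) & (advantage > 0)
    weights = np.where(critical, 1.0 - gate * (1.0 - scale), 1.0)
    return advantage * weights, critical

# Example: only the first token is both low-surprisal and positively rewarded.
adv = [1.0, 1.0, -1.0, 1.0, 1.0]
sur = [0.05, 0.5, 0.01, 1.0, 2.0]
new_adv, mask = reweight_advantages(adv, sur, gate=1.0, q_low=0.4)
```

Because the threshold is a quantile of the current batch, the fraction of candidate tokens stays roughly constant as the policy sharpens, which a fixed surprisal cutoff would not guarantee.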

In practice

Apply surprisal-guided reweighting for RL stability.
Use target-entropy gates for LLM policy control.
Consider STARE for complex LLM reasoning tasks.
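
As a rough illustration of what a target-entropy gate might look like in a training loop, the sketch below adjusts a gate value in [0, 1] whenever measured policy entropy leaves a target band. The controller form (a fixed-step update with clamping), the band limits, and the name `update_gate` are assumptions; the summary states only that the gate is closed-loop and tied to a target entropy.

```python
def update_gate(gate, entropy, target_low, target_high, step=0.1):
    """Move the gate toward stronger intervention when entropy is too low.

    Hypothetical fixed-step controller: inside the band the gate is held.
    """
    if entropy < target_low:
        gate += step   # entropy is collapsing: intervene more strongly
    elif entropy > target_high:
        gate -= step   # entropy is above the band: back off
    return min(1.0, max(0.0, gate))

# Example: replay a short, made-up entropy trace against a band of [0.3, 0.5].
gate = 0.0
history = []
for measured_entropy in [0.45, 0.28, 0.25, 0.35, 0.55]:
    gate = update_gate(gate, measured_entropy, target_low=0.3, target_high=0.5)
    history.append(round(gate, 2))
```

Holding the gate constant inside the band avoids reacting to small fluctuations, while the clamp keeps the intervention strength bounded.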

Topics

Reinforcement Learning
Large Language Models
Policy Entropy Stability
Token-level Advantage Reweighting
Surprisal
GRPO

Code references

hp-luo/STARE

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.