Which Tokens Matter? Adaptive Token Selection for RLVR with the Relative Surprisal Index
Summary
The Relative Surprisal Index (RSI) is a new information-theoretic metric for Reinforcement Learning with Verifiable Rewards (RLVR) on Large Language Models (LLMs). It addresses a tension between existing RLVR token selection approaches: some prioritize high-entropy tokens, while others caution against low-probability tokens, yet both yield performance gains despite their apparent contradiction. RSI couples a token's entropy with the probability of the selected token, giving a more complete view of policy optimization dynamics. Building on this, RSI Selection (RSI-S) is proposed as an entropy-adaptive token filtering method that retains tokens within a stable RSI interval. RSI-S reconciles the two paradigms by filtering out both redundant low-surprisal tokens and unstable high-surprisal tail tokens. Empirical evaluations show that RSI-S improves avg@32 accuracy by 2-3 percentage points over GRPO across Qwen2.5-1.5B, 3B, and 7B models on the AIME and AMC benchmarks.
Key takeaway
Machine Learning Engineers optimizing Large Language Models with Reinforcement Learning with Verifiable Rewards (RLVR) should consider integrating the new RSI Selection (RSI-S) method. Based on the Relative Surprisal Index, it offers a principled way to filter tokens that reconciles previously contradictory strategies. In the reported experiments, RSI-S improved avg@32 accuracy by 2-3 percentage points over GRPO on the AIME and AMC benchmarks across the Qwen2.5-1.5B, 3B, and 7B models.
Key insights
The Relative Surprisal Index (RSI) unifies conflicting RLVR token selection strategies by coupling token entropy with selected token probability.
Principles
- Token probability or entropy alone is insufficient for deciding which tokens to train on.
- RSI couples token entropy with the selected token's probability.
- RSI relates to logit-gradient norm and predictive entropy.
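To make the first principle concrete, here is a minimal NumPy sketch that computes the quantities named above for a single token position. It relies on the standard softmax identity that the gradient of log p(y) with respect to the logits is onehot(y) - p; how the paper ties this norm to RSI is not stated in this summary, so the helper `token_stats` and the toy distributions are illustrative assumptions only.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over a 1-D logit vector.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def token_stats(logits, selected):
    """Return (probability, surprisal, entropy, logit-gradient norm) for one position.

    Hypothetical helper for illustration; not taken from the paper.
    """
    p = softmax(np.asarray(logits, dtype=float))
    prob = p[selected]
    surprisal = -np.log(prob)
    entropy = -np.sum(p * np.log(p + 1e-12))
    # Gradient of log p[selected] w.r.t. the logits is onehot(selected) - p.
    grad = -p.copy()
    grad[selected] += 1.0
    return prob, surprisal, entropy, np.linalg.norm(grad)

# Two toy distributions in which the selected token has the same probability (0.5).
logits_a = np.log([0.5, 0.5])
logits_b = np.log([0.5, 0.125, 0.125, 0.125, 0.125])

prob_a, surp_a, ent_a, gn_a = token_stats(logits_a, 0)
prob_b, surp_b, ent_b, gn_b = token_stats(logits_b, 0)

print(prob_a, ent_a, gn_a)
print(prob_b, ent_b, gn_b)
```

Both positions assign the selected token the same probability, yet their entropies and gradient norms differ, so a filter based on probability alone would treat them identically.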
Method
RSI Selection (RSI-S) is an entropy-adaptive token filtering method. It retains tokens within a stable Relative Surprisal Index (RSI) interval, filtering both low- and high-surprisal tokens.
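The filtering step can be sketched as a per-token mask. This summary does not give the exact RSI formula or the interval bounds, so the sketch below assumes a hypothetical definition (surprisal of the selected token divided by the predictive entropy) and hypothetical bounds `low` and `high`; treat it as an illustration of interval-based token filtering, not as the authors' implementation.

```python
import numpy as np

def rsi_select_mask(logits, selected, low=0.5, high=2.0, eps=1e-8):
    """Return (mask, rsi) for a sequence of token positions.

    logits:   array of shape (T, V), one logit vector per position
    selected: array of shape (T,), the sampled token id at each position
    low/high: ASSUMED interval bounds, chosen only for illustration
    """
    logits = np.asarray(logits, dtype=float)
    selected = np.asarray(selected)
    # Stable log-softmax along the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    p = np.exp(log_p)
    entropy = -(p * log_p).sum(axis=-1)
    surprisal = -log_p[np.arange(len(selected)), selected]
    # ASSUMED definition: surprisal relative to entropy. The paper may differ.
    rsi = surprisal / (entropy + eps)
    mask = (rsi >= low) & (rsi <= high)
    return mask, rsi

logits = np.array([
    [10.0, 0.0, 0.0, 0.0],  # near-deterministic, top token sampled
    [0.0, 0.0, 0.0, 0.0],   # uniform, any token sampled
    [3.0, 0.0, 0.0, 0.0],   # peaked, but a tail token sampled
])
selected = np.array([0, 2, 1])

mask, rsi = rsi_select_mask(logits, selected)
print(mask, rsi)
```

In an RLVR training loop, a mask like this would typically multiply the per-token policy-gradient loss so that only retained tokens contribute to the update. Under the assumed ratio form, the surprisal threshold scales with entropy, which is one way a filter can be entropy-adaptive.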
In practice
- Improve avg@32 accuracy by 2-3 percentage points over GRPO.
- Apply RSI-S to Qwen2.5-1.5B, 3B, 7B models.
- Enhance reasoning on AIME and AMC benchmarks.
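For readers unfamiliar with the metric, avg@32 is commonly computed by sampling 32 responses per problem, scoring each as correct or incorrect, and averaging the per-problem accuracy over the benchmark. The sketch below assumes that common definition; the function name `avg_at_k` and the toy data (with 4 samples per problem for brevity) are hypothetical.

```python
def avg_at_k(correct_by_problem):
    """Mean over problems of the fraction of correct samples per problem.

    correct_by_problem: list of lists of booleans, one inner list of k
    sampled-response outcomes per problem.
    """
    per_problem = [sum(samples) / len(samples) for samples in correct_by_problem]
    return sum(per_problem) / len(per_problem)

# Toy example: 2 problems, 4 samples each (the metric in the text uses 32).
toy_results = [
    [True, False, True, True],     # 3 of 4 correct
    [False, False, True, False],   # 1 of 4 correct
]
score = avg_at_k(toy_results)
print(score)
```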
Topics
- Reinforcement Learning
- Large Language Models
- Token Selection
- Relative Surprisal Index
- RL with Verifiable Rewards
- Qwen2.5
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.