Which Tokens Matter? Adaptive Token Selection for RLVR with the Relative Surprisal Index

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, quick

Summary

The Relative Surprisal Index (RSI) is a new information-theoretic metric for Reinforcement Learning with Verifiable Rewards (RLVR) on Large Language Models (LLMs). It addresses a tension between two existing RLVR token selection approaches: one prioritizes high-entropy tokens, the other cautions against low-probability tokens, and both yield performance gains despite their apparent contradiction. RSI couples a token's entropy with the probability of the token actually selected, giving a more complete view of policy optimization dynamics than either quantity alone. Building on this, RSI Selection (RSI-S) is an entropy-adaptive token filtering method that retains only tokens within a stable RSI interval. By filtering out both the redundant low-surprisal tail and the unstable high-surprisal tail, RSI-S reconciles the two earlier paradigms. In empirical evaluations, RSI-S improves avg@32 accuracy by 2-3 percentage points over GRPO across Qwen2.5-1.5B, 3B, and 7B models on the AIME and AMC benchmarks.

Key takeaway

Machine learning engineers optimizing LLMs with Reinforcement Learning with Verifiable Rewards (RLVR) should consider integrating RSI Selection (RSI-S). The method, based on the Relative Surprisal Index, offers a principled way to filter tokens and reconciles two previously contradictory strategies. In the reported evaluations, RSI-S improved avg@32 accuracy by 2-3 percentage points over GRPO on the AIME and AMC benchmarks across Qwen2.5 models from 1.5B to 7B parameters.

Key insights

The Relative Surprisal Index (RSI) unifies conflicting RLVR token selection strategies by coupling token entropy with selected token probability.

Principles

Method

RSI Selection (RSI-S) is an entropy-adaptive token filtering method. It retains tokens within a stable Relative Surprisal Index (RSI) interval, filtering both low- and high-surprisal tokens.
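
The filtering step described above can be sketched roughly as follows. This summary does not give the exact RSI formula or the interval bounds, so the sketch makes two loudly labeled assumptions: RSI is taken to be the surprisal of the selected token divided by the entropy of the token distribution, and the bounds `low` and `high` are hypothetical placeholders. The function names are illustrative and not taken from the original article.

```python
import numpy as np


def relative_surprisal_index(logits, selected_ids, eps=1e-8):
    """ASSUMED form of RSI: -log p(selected token) / H(distribution).

    The source only says RSI couples entropy with the selected token's
    probability; this ratio is one plausible reading, not the paper's formula.
    """
    logits = np.asarray(logits, dtype=np.float64)
    selected_ids = np.asarray(selected_ids)
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    probs = np.exp(log_probs)
    # Entropy of each per-token distribution (expected surprisal).
    entropy = -(probs * log_probs).sum(axis=-1)
    # Surprisal of the token that was actually sampled.
    surprisal = -np.take_along_axis(
        log_probs, selected_ids[..., None], axis=-1
    ).squeeze(-1)
    return surprisal / (entropy + eps)


def rsi_selection_mask(logits, selected_ids, low=0.5, high=2.0):
    """Keep tokens whose assumed RSI lies in [low, high].

    `low` and `high` are hypothetical bounds chosen for illustration.
    """
    rsi = relative_surprisal_index(logits, selected_ids)
    return (rsi >= low) & (rsi <= high)


# Three token positions over a toy vocabulary of size 4.
logits = np.array([
    [0.0, 0.0, 0.0, 0.0],   # uniform distribution, any token is "typical"
    [10.0, 0.0, 0.0, 0.0],  # peaked, and the likely token was sampled
    [10.0, 0.0, 0.0, 0.0],  # peaked, but an unlikely token was sampled
])
selected_ids = np.array([0, 0, 1])
mask = rsi_selection_mask(logits, selected_ids)
# Only the first position is kept: the second falls in the low-surprisal
# tail and the third in the high-surprisal tail under the assumed bounds.
```

In a training loop, a mask like this would typically multiply the per-token policy-gradient loss so that filtered tokens contribute no gradient. Because the surprisal is measured relative to the entropy, the same bounds adapt to how uncertain the policy is at each position.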

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.