Beyond Entropy: Learning from Token-Level Distributional Deviations for LLM Reasoning

Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

The Independent Combinatorial Tokens (ICT) framework addresses optimization instability in Reinforcement Learning with Verifiable Rewards (RLVR) for Large Language Model (LLM) reasoning, specifically entropy collapse and entropy explosion. Instead of optimizing against a scalar uncertainty measure, ICT works with the distributional properties of token logits, using Jensen-Shannon (JS) divergence to identify critical branching points for exploration, and selectively updates only tokens with distinctive distributional patterns. A theoretical analysis based on Shannon entropy and second-order Rényi entropy shows that ICT regulates policy concentration: it reduces overall distribution uncertainty while controlling probability concentration, which prevents over-concentrated token generation and stabilizes training. In experiments on Qwen2.5 models (0.5B, 1.5B, and 7B), updating only the top 10% of unique tokens yields an average pass@4 improvement of 4.58%, with a maximum gain of 14.9%, over the GRPO, 20-Entropy, and STAPO baselines across seven benchmarks including math, commonsense, and Olympiad-level problems.
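For reference, the three quantities named in the summary have the following standard textbook definitions for discrete distributions p and q over a vocabulary V. This summary does not specify how the paper instantiates them (for example, which pair of distributions enters the JS divergence), so the block below restates only the general definitions.

```latex
% Shannon entropy of a token distribution p
H(p) = -\sum_{v \in V} p(v) \log p(v)

% Second-order Renyi (collision) entropy; it satisfies H_2(p) \le H(p)
H_2(p) = -\log \sum_{v \in V} p(v)^2

% Jensen-Shannon divergence, with mixture m = (p + q) / 2
\mathrm{JS}(p \,\|\, q) = \tfrac{1}{2}\,\mathrm{KL}(p \,\|\, m) + \tfrac{1}{2}\,\mathrm{KL}(q \,\|\, m),
\qquad
\mathrm{KL}(p \,\|\, m) = \sum_{v \in V} p(v) \log \frac{p(v)}{m(v)}
```

Because H_2 depends on the sum of squared probabilities, it is more sensitive than H to mass concentrating on a few tokens, which is consistent with the summary's distinction between "overall distribution uncertainty" and "probability concentration".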

Key takeaway

Machine learning engineers optimizing LLM reasoning with RLVR should consider the Independent Combinatorial Tokens (ICT) framework. It stabilizes training by preventing entropy collapse and explosion, both of which often lead to suboptimal strategies. Selectively updating only the top 10% of unique tokens is reported to give a 4.58% average pass@4 improvement on Qwen2.5 models, improving reasoning across diverse problem sets.
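To make "selectively updating" concrete, here is a minimal sketch of how a boolean token mask could restrict which tokens contribute to a policy-gradient loss. It uses a plain REINFORCE-style surrogate, not GRPO's clipped objective and not the paper's actual loss. The function name `masked_policy_loss` and the toy values are hypothetical.

```python
import numpy as np

def masked_policy_loss(logprobs, advantages, update_mask):
    """Mean REINFORCE-style surrogate over the selected tokens only.

    ASSUMPTION: this is a generic masked loss for illustration; the
    summary does not describe the exact objective used by ICT.
    """
    weights = update_mask.astype(logprobs.dtype)  # 1.0 for selected tokens
    n_selected = weights.sum()
    if n_selected == 0:
        return 0.0  # nothing selected, so nothing to optimize
    per_token = -advantages * logprobs            # standard surrogate term
    return float((per_token * weights).sum() / n_selected)

# Toy example: four tokens, only the second one is selected for update.
logprobs = np.array([-0.1, -2.3, -0.05, -1.2])
advantages = np.array([1.0, 1.0, 1.0, 1.0])
update_mask = np.array([False, True, False, False])
loss = masked_policy_loss(logprobs, advantages, update_mask)
```

In an autodiff framework the same masking would zero the gradient contribution of every unselected token, so only the selected positions move the policy.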

Key insights

ICT stabilizes LLM reasoning by focusing on token-level distributional deviations, preventing entropy collapse and explosion.

Principles

Method

The ICT framework identifies tokens with distinctive distributional patterns using Jensen-Shannon divergence between token logit distributions, then selectively updates only these critical tokens to guide LLM exploration.
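The selection step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the summary does not say which distributions are compared, so the code scores each position by its JS divergence from the sequence-average distribution, and it reads "top 10%" as the top 10% of positions by that score. The names `js_divergence` and `select_distinctive_tokens` are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def js_divergence(p, q, eps=1e-12):
    """Row-wise Jensen-Shannon divergence (natural log, at most ln 2)."""
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * (np.log(p + eps) - np.log(m + eps)), axis=-1)
    kl_qm = np.sum(q * (np.log(q + eps) - np.log(m + eps)), axis=-1)
    return 0.5 * (kl_pm + kl_qm)

def select_distinctive_tokens(logits, top_fraction=0.10):
    """Return a boolean mask over positions plus the per-position scores.

    logits: array of shape (T, V), one row of vocabulary logits per token.
    ASSUMPTION: the reference distribution is the mean over positions;
    the source summary does not specify the comparison used by ICT.
    """
    probs = softmax(logits)                         # (T, V)
    reference = probs.mean(axis=0, keepdims=True)   # (1, V)
    scores = js_divergence(probs, reference)        # (T,)
    k = max(1, int(np.ceil(top_fraction * len(scores))))
    top_idx = np.argsort(scores)[-k:]               # k largest scores
    mask = np.zeros(len(scores), dtype=bool)
    mask[top_idx] = True
    return mask, scores

# Toy usage with random logits: 50 positions, vocabulary of 32.
rng = np.random.default_rng(0)
logits = rng.normal(size=(50, 32))
mask, scores = select_distinctive_tokens(logits)
```

The resulting mask marks the positions whose distributions deviate most from the reference; only those positions would then receive policy updates.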

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.