Beyond Entropy: Learning from Token-Level Distributional Deviations for LLM Reasoning
Summary
The Independent Combinatorial Tokens (ICT) framework addresses optimization instability in Reinforcement Learning with Verifiable Rewards (RLVR) for Large Language Model (LLM) reasoning, specifically entropy collapse and entropy explosion. ICT shifts optimization from scalar uncertainty measures to the distributional properties of token logits, using Jensen-Shannon (JS) divergence to identify critical branching points for exploration, and selectively updates only the tokens with distinctive distributional patterns. A theoretical analysis based on Shannon entropy and second-order Rényi entropy shows that ICT regulates policy concentration: it reduces overall distributional uncertainty while keeping probability concentration in check, which prevents over-concentrated token generation and stabilizes training. Empirically, on Qwen2.5 models (0.5B, 1.5B, and 7B), updating only the top 10% of unique tokens yields an average pass@4 improvement of 4.58%, with a maximum gain of 14.9%, over the GRPO, 20-Entropy, and STAPO baselines across seven benchmarks including math, commonsense, and Olympiad-level problems.
Key takeaway
Machine learning engineers optimizing LLM reasoning with RLVR should consider the Independent Combinatorial Tokens (ICT) framework. It stabilizes training by preventing entropy collapse and explosion, both of which often lead to suboptimal strategies. Selectively updating only the top 10% of unique tokens produced a reported 4.58% average pass@4 improvement on Qwen2.5 models, with gains in reasoning across diverse problem sets.
Key insights
ICT stabilizes LLM reasoning by focusing on token-level distributional deviations, preventing entropy collapse and explosion.
Principles
- Shift from scalar uncertainty to distributional properties.
- Regulate policy concentration via dual entropy control.
- Identify critical branching points using JS divergence.
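The two entropies named in the summary have standard definitions, and a small sketch may help make "dual entropy control" concrete. Shannon entropy, H(p) = -sum_i p_i log p_i, measures overall uncertainty, while second-order Rényi entropy, H_2(p) = -log sum_i p_i^2, is more sensitive to how much probability mass sits on the most likely tokens. The function names below are illustrative assumptions, and how ICT actually combines the two quantities is not described in this summary.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def shannon_entropy(probs):
    """Shannon entropy in nats: -sum p * log(p)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def renyi2_entropy(probs):
    """Second-order Renyi (collision) entropy in nats: -log(sum p^2)."""
    return -math.log(sum(p * p for p in probs))


if __name__ == "__main__":
    # Hypothetical logits for one token position (illustration only).
    probs = softmax([5.0, 1.0, 0.0, 0.0])
    print("Shannon:", shannon_entropy(probs))
    print("Renyi-2:", renyi2_entropy(probs))
```

For a uniform distribution the two values coincide (both equal the log of the vocabulary size); for any other distribution the Rényi-2 value is strictly smaller, which is why it is a useful signal of probability concentration.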
Method
The ICT framework identifies tokens with distinctive distributional patterns using Jensen-Shannon divergence between token logit distributions, then selectively updates only these critical tokens to guide LLM exploration.
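The selection step described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the summary does not say which distributions are compared, so the sketch assumes each token position's distribution is scored by its JS divergence from the mean distribution over the sequence. The names `js_divergence` and `select_distinctive_tokens` and the `fraction` parameter are hypothetical.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (in nats) between two distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # KL(x || y); terms with x_i == 0 contribute nothing.
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0.0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def select_distinctive_tokens(dists, fraction=0.10):
    """Return indices of the most distinctive token positions and all scores.

    ASSUMPTION: "distinctive" is measured as JS divergence from the
    mean distribution over the sequence; the paper may use a different
    reference distribution.
    """
    n = len(dists)
    vocab = len(dists[0])
    mean = [sum(d[v] for d in dists) / n for v in range(vocab)]
    scores = [js_divergence(d, mean) for d in dists]
    k = max(1, int(fraction * n))  # keep at least one position
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k]), scores


if __name__ == "__main__":
    # Toy sequence: nine similar positions and one outlier at index 4.
    dists = [[0.7, 0.2, 0.1] for _ in range(10)]
    dists[4] = [0.1, 0.2, 0.7]
    selected, _ = select_distinctive_tokens(dists, fraction=0.10)
    print(selected)
```

JS divergence is symmetric and bounded above by log 2 in nats, which keeps scores comparable across positions regardless of vocabulary size.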
In practice
- Apply ICT to stabilize RLVR training for LLMs.
- Update only the top 10% of unique tokens for efficiency.
- Improve reasoning performance on math and commonsense tasks.
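To illustrate what "updating only the selected tokens" could mean in a training loop, the sketch below averages a per-token loss over the chosen positions only, so unselected positions contribute nothing to the update. It is a framework-free illustration under stated assumptions: `masked_mean_loss` is a hypothetical helper, the per-token losses are made-up numbers, and the actual ICT objective is not given in this summary.

```python
def masked_mean_loss(token_losses, selected):
    """Average the per-token loss over selected positions only.

    token_losses: list of floats, one loss value per token position.
    selected: iterable of position indices chosen for the update.
    Positions outside `selected` are ignored entirely.
    """
    chosen = set(selected)
    kept = [loss for i, loss in enumerate(token_losses) if i in chosen]
    if not kept:
        return 0.0  # nothing selected, so there is no update signal
    return sum(kept) / len(kept)


if __name__ == "__main__":
    # Hypothetical per-token losses for a four-token response.
    losses = [1.0, 2.0, 3.0, 4.0]
    print(masked_mean_loss(losses, selected=[1, 3]))
```

In a tensor framework the same idea is typically expressed by multiplying the per-token loss by a 0/1 mask before reduction, so gradients flow only through the selected positions.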
Topics
- Large Language Models
- Reinforcement Learning
- Token-level Optimization
- Jensen-Shannon Divergence
- Entropy Regularization
- Qwen2.5
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.