TIP: Token Importance in On-Policy Distillation
Summary
The paper introduces Token Importance in On-Policy Distillation (TIP), a two-axis taxonomy that categorizes token importance by student entropy and teacher-student divergence. It asks which tokens provide the most useful learning signal in on-policy knowledge distillation (OPD), where a student model learns from a teacher's token-level corrections on the student's own generated rollouts. The authors identify two critical regions: tokens with high student entropy (uncertain tokens), and tokens with low student entropy but high teacher-student divergence (overconfident but wrong tokens). Empirically, retaining 50% of tokens with entropy-only sampling matches or exceeds all-token training while reducing peak memory by up to 47%. Training on fewer than 10% of tokens, chosen from the low-entropy, high-divergence region, nearly matches full-token baselines, which shows how dense the corrective signal is in these "overconfident" tokens. The proposed Soft-OR score combines both axes and consistently improves over entropy-only selection. The results are validated on Qwen3, Llama, and Qwen2.5 models across mathematical reasoning and agentic planning tasks.
Key takeaway
For AI engineers optimizing large language model distillation, student entropy alone is an incomplete measure of token importance. Consider type-aware token selection with the Soft-OR score, which combines student entropy and teacher-student divergence. This approach reduces memory footprint by up to 58% and also captures the "overconfident" error signal that entropy-only selection misses. It can potentially surpass full-token training, especially in agentic planning tasks, where early confident errors are costly.
Key insights
Token importance in on-policy distillation is best understood through student uncertainty and teacher-student disagreement.
Principles
- High student entropy tokens are informative.
- Low entropy, high divergence tokens carry dense corrective signal.
- Teacher entropy is generally uninformative for token selection.
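The two axes behind these principles can be illustrated on toy data. The sketch below is not the paper's implementation: it assumes the divergence is the KL divergence from student to teacher, KL(student || teacher), which is one common choice in on-policy distillation, and the probability vectors and function names are invented for illustration.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against log(0) when q has zeros."""
    return sum(x * math.log(x / max(y, eps)) for x, y in zip(p, q) if x > 0.0)

# Hypothetical distributions over a 4-token vocabulary at three positions.
student = [
    [0.25, 0.25, 0.25, 0.25],  # uncertain student
    [0.97, 0.01, 0.01, 0.01],  # confident student, teacher agrees
    [0.97, 0.01, 0.01, 0.01],  # confident student, teacher disagrees
]
teacher = [
    [0.70, 0.10, 0.10, 0.10],
    [0.97, 0.01, 0.01, 0.01],
    [0.01, 0.97, 0.01, 0.01],
]

# Axis 1: student entropy per position (assumed divergence direction noted above).
h = [entropy(s) for s in student]
# Axis 2: teacher-student divergence per position.
d = [kl_divergence(s, t) for s, t in zip(student, teacher)]

for t, (h_t, d_t) in enumerate(zip(h, d)):
    print(f"position {t}: entropy={h_t:.3f}  divergence={d_t:.3f}")
```

Positions 1 and 2 have identical student entropy, so an entropy-only filter cannot tell them apart. Only the divergence axis separates the agreed-upon token from the overconfident error.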
Method
TIP uses a parameter-free Soft-OR score, $s_{t}=\hat{h}_{t}+\hat{\delta}_{t}-\hat{h}_{t}\cdot\hat{\delta}_{t}$, to select top-K tokens based on normalized student entropy (uncertainty) and teacher-student divergence (disagreement).
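A minimal sketch of the Soft-OR score follows. The formula is taken from the text above, but the normalization is an assumption: the summary does not say how $\hat{h}_{t}$ and $\hat{\delta}_{t}$ are normalized, so this example uses min-max scaling to [0, 1] within a single sequence. The helper names and input values are hypothetical.

```python
def min_max(values):
    """Scale values to [0, 1]; assumed normalization, not confirmed by the source."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant input carries no ranking signal
    return [(v - lo) / (hi - lo) for v in values]

def soft_or(h_hat, d_hat):
    """Soft-OR score s_t = h + d - h*d for normalized entropy h and divergence d."""
    return [h + d - h * d for h, d in zip(h_hat, d_hat)]

# Hypothetical per-token student entropies and teacher-student divergences.
entropies = [1.2, 0.05, 0.05, 0.6]
divergences = [0.3, 0.0, 2.0, 1.0]

scores = soft_or(min_max(entropies), min_max(divergences))
for t, s in enumerate(scores):
    print(f"token {t}: soft-or score = {s:.3f}")
```

The score is algebraically equal to $1-(1-\hat{h}_{t})(1-\hat{\delta}_{t})$, so it is high when either axis is high and low only when both are low. In this toy input, token 0 (uncertain) and token 2 (confident but divergent) both score near 1, while token 1 (confident and agreed upon) scores 0.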
In practice
- Retain 50% of tokens by entropy for memory savings.
- Prioritize low-entropy, high-divergence tokens for concentrated correction.
- Use Soft-OR for comprehensive token selection.
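The selection step in the list above could be sketched as a top-K mask over per-token scores, followed by a loss averaged over the kept tokens only. This is an illustration under assumptions rather than the paper's training code: the function names, the `keep_ratio` parameter, and the score and loss values are all hypothetical.

```python
import math

def top_k_mask(scores, keep_ratio):
    """Return a boolean mask that keeps the highest-scoring fraction of tokens."""
    n = len(scores)
    k = max(1, math.ceil(n * keep_ratio))  # always keep at least one token
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(n)]

def masked_mean(losses, mask):
    """Average the per-token losses over the kept positions only."""
    kept = [loss for loss, m in zip(losses, mask) if m]
    return sum(kept) / len(kept)

# Hypothetical importance scores (for example Soft-OR) and distillation losses.
scores = [0.9, 0.1, 0.8, 0.2, 0.05, 0.7, 0.3, 0.6]
losses = [2.0, 0.1, 1.5, 0.2, 0.1, 1.0, 0.3, 0.9]

mask = top_k_mask(scores, keep_ratio=0.5)  # retain 50% of tokens
loss = masked_mean(losses, mask)
print(mask)
print(f"loss over kept tokens: {loss:.3f}")
```

In a real training loop, memory savings would come from skipping the teacher and student logit computation for dropped positions, not from the masking itself. This sketch shows only the selection logic.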
Topics
- On-Policy Distillation
- Token Importance
- Student Entropy
- Teacher-Student Divergence
- Soft-OR Score
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.