TIP: Token Importance in On-Policy Distillation

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Software Development & Engineering · Depth: Expert, extended

Summary

The paper introduces TIP (Token Importance in On-Policy Distillation), a two-axis taxonomy that categorizes tokens by student entropy and teacher-student divergence. It asks which tokens carry the most useful learning signal in on-policy knowledge distillation (OPD), where a student model learns from a teacher's token-level corrections on the student's own rollouts. Two regions matter most: tokens with high student entropy (the student is uncertain) and tokens with low student entropy but high teacher-student divergence (the student is overconfident but wrong). Empirically, retaining 50% of tokens with entropy-only sampling matches or exceeds all-token training while reducing peak memory by up to 47%. Training on fewer than 10% of low-entropy, high-divergence tokens nearly matches full-token baselines, which shows how dense the corrective signal in these "overconfident" tokens is. The proposed Soft-OR score combines both axes and consistently improves over entropy-only selection across Qwen3, Llama, and Qwen2.5 models on mathematical reasoning and agentic planning tasks.
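The two regions described above can be illustrated with a small sketch. It is not the paper's implementation: the function names, the region labels, the fixed thresholds `h_thresh` and `d_thresh`, and the choice of KL(student || teacher) as the divergence are all assumptions made for illustration.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against log(0) in q."""
    return sum(x * math.log(x / max(y, eps)) for x, y in zip(p, q) if x > 0.0)

def token_region(student_probs, teacher_probs, h_thresh=0.5, d_thresh=0.5):
    """Place one token position on the two axes.

    Thresholds and labels are hypothetical; the divergence here is
    KL(student || teacher), which is an assumption, not a stated choice.
    """
    h = entropy(student_probs)
    d = kl_divergence(student_probs, teacher_probs)
    if h >= h_thresh:
        return "uncertain"        # high student entropy
    if d >= d_thresh:
        return "overconfident"    # low entropy, high divergence
    return "low-signal"           # low entropy, low divergence

# A confident student that disagrees with the teacher.
print(token_region([0.97, 0.02, 0.01], [0.05, 0.90, 0.05]))
```

A student that puts 97% of its mass on a token the teacher rates at 5% has low entropy but high divergence, so it falls in the "overconfident" region that entropy alone would not flag.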

Key takeaway

For AI engineers optimizing large language model distillation, student entropy alone is not a sufficient measure of token importance. Consider type-aware token selection with the Soft-OR score, which combines student entropy and teacher-student divergence. This approach reduces memory footprint by up to 58% and also captures the crucial "overconfident" error signals, potentially surpassing full-token training, especially in agentic planning tasks where early confident errors are costly.

Key insights

Token importance in on-policy distillation is best understood through student uncertainty and teacher-student disagreement.

Method

TIP uses a parameter-free Soft-OR score, $s_{t}=\hat{h}_{t}+\hat{\delta}_{t}-\hat{h}_{t}\cdot\hat{\delta}_{t}$, to select the top-K tokens, where $\hat{h}_{t}$ is the normalized student entropy (uncertainty) and $\hat{\delta}_{t}$ is the normalized teacher-student divergence (disagreement) at token $t$.
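The Soft-OR formula above can be sketched in a few lines. The summary does not say how the two quantities are normalized or how K is chosen, so per-sequence min-max normalization, the `keep_ratio` parameter, and all function names below are assumptions for illustration only.

```python
import math

def min_max_normalize(values):
    """Scale values to [0, 1]; a constant sequence maps to all zeros.

    Min-max scaling is an assumed normalization, not one given in the source.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def soft_or_scores(entropies, divergences):
    """Compute s_t = h_t + d_t - h_t * d_t on the normalized axes."""
    h_hat = min_max_normalize(entropies)
    d_hat = min_max_normalize(divergences)
    return [h + d - h * d for h, d in zip(h_hat, d_hat)]

def select_top_k(scores, keep_ratio=0.5):
    """Return the positions of the top-K scores, in sequence order."""
    k = max(1, math.ceil(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Per-token student entropy and teacher-student divergence (made-up values).
entropies = [0.10, 1.20, 0.05, 0.60]
divergences = [2.00, 0.10, 0.05, 0.30]

scores = soft_or_scores(entropies, divergences)
kept = select_top_k(scores, keep_ratio=0.5)
print(kept)
```

Because $s_t = 1 - (1-\hat{h}_t)(1-\hat{\delta}_t)$, the score is high when either axis is high. In the made-up values, position 0 (low entropy, high divergence) and position 1 (high entropy, low divergence) are both kept, whereas entropy-only selection would drop position 0.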

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.