TIP: Token Importance in On-Policy Distillation

2026-04-16 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Software Development & Engineering · Depth: Expert, extended

Summary

The paper introduces Token Importance in On-Policy Distillation (TIP), a two-axis taxonomy that classifies tokens by student entropy and teacher-student divergence. It asks which tokens provide the most useful learning signal in on-policy knowledge distillation (OPD), where a student model learns from a teacher's token-level corrections on the student's own rollouts. The research identifies two informative regions: tokens with high student entropy (the student is uncertain) and tokens with low student entropy but high teacher-student divergence (the student is overconfident and wrong). Empirically, retaining 50% of tokens with entropy-only sampling matches or exceeds all-token training while reducing peak memory by up to 47%. Training on low-entropy, high-divergence tokens alone, fewer than 10% of all tokens, nearly matches full-token baselines, which shows how dense the corrective signal in these "overconfident" tokens is. The proposed Soft-OR score combines both axes and consistently improves over entropy-only selection; the results are validated across Qwen3, Llama, and Qwen2.5 models on mathematical reasoning and agentic planning tasks.

Key takeaway

For AI engineers optimizing large language model distillation, student entropy alone is an incomplete measure of token importance. Teams should consider type-aware token selection with the Soft-OR score, which combines student entropy and teacher-student divergence. This approach reduces memory footprint by up to 58% and also captures the "overconfident" error signals that entropy-only selection misses. It can potentially surpass full-token training, especially in agentic planning tasks where early confident errors are costly.

Key insights

Token importance in on-policy distillation is best understood through student uncertainty and teacher-student disagreement.

Principles

High student entropy tokens are informative.
Low entropy, high divergence tokens carry dense corrective signal.
Teacher entropy is generally uninformative for token selection.
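
The two axes behind these principles can be illustrated with a minimal, standard-library-only sketch. It assumes the divergence is KL(student || teacher), which the summary does not specify, and the function names, region labels, and 0.5 thresholds are hypothetical choices made for illustration only.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats. ASSUMPTION: p is the student, q is the teacher."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

def token_region(student_logits, teacher_logits, h_thresh=0.5, d_thresh=0.5):
    """Place one token position in the two-axis taxonomy.

    The thresholds and label strings are hypothetical, not taken from the paper.
    """
    p_student = softmax(student_logits)
    p_teacher = softmax(teacher_logits)
    h = entropy(p_student)
    d = kl_divergence(p_student, p_teacher)
    if h >= h_thresh:
        return "uncertain"       # high student entropy
    if d >= d_thresh:
        return "overconfident"   # low entropy, high divergence
    return "low-signal"          # low entropy, low divergence

# A student that is confident in a token the teacher rejects.
print(token_region([10.0, 0.0, 0.0], [0.0, 10.0, 0.0]))
```

Teacher entropy does not appear anywhere in this sketch, which is consistent with the third principle above.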

Method

TIP uses a parameter-free Soft-OR score, $s_{t}=\hat{h}_{t}+\hat{\delta}_{t}-\hat{h}_{t}\cdot\hat{\delta}_{t}$, where $\hat{h}_{t}$ is the normalized student entropy (uncertainty) and $\hat{\delta}_{t}$ is the normalized teacher-student divergence (disagreement) at token $t$. Training then uses the top-K tokens ranked by $s_{t}$.
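
The score can equivalently be written as $s_{t}=1-(1-\hat{h}_{t})(1-\hat{\delta}_{t})$, so it is high whenever either axis is high. The sketch below shows one way this scoring and top-K selection might look. The summary does not say how the two quantities are normalized, so per-sequence min-max scaling is an assumption here, and all function names are hypothetical.

```python
def min_max_normalize(values):
    """Scale values into [0, 1]. ASSUMPTION: the paper's normalization may differ."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def soft_or_scores(entropies, divergences):
    """Soft-OR per token: s = h + d - h * d, on normalized inputs."""
    h_hat = min_max_normalize(entropies)
    d_hat = min_max_normalize(divergences)
    return [h + d - h * d for h, d in zip(h_hat, d_hat)]

def top_k_indices(scores, k):
    """Return the positions of the k highest scores, in sequence order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Illustrative values only: token 0 is uncertain, token 1 is overconfident,
# token 2 is low on both axes, token 3 is moderately uncertain.
entropies = [2.0, 0.1, 0.1, 1.0]
divergences = [0.2, 3.0, 0.1, 0.1]

scores = soft_or_scores(entropies, divergences)
selected = top_k_indices(scores, k=2)
print(selected)
```

With these made-up inputs, both the uncertain token and the overconfident token rank at the top, while the token that is low on both axes scores zero.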

In practice

Retain 50% of tokens, selected by student entropy, for memory savings.
Prioritize low-entropy, high-divergence tokens for concentrated correction.
Use Soft-OR for comprehensive token selection.
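
As a rough illustration of the first two recommendations, the sketch below keeps the top half of tokens by student entropy and averages a per-token loss over the kept positions only. It assumes a deterministic top-fraction rule rather than stochastic sampling, the helper names are hypothetical, and the numbers are invented for the example rather than taken from the paper.

```python
import math

def keep_top_fraction(scores, fraction):
    """Boolean mask that keeps the highest-scoring fraction of positions."""
    k = max(1, math.ceil(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = set(ranked[:k])
    return [i in kept for i in range(len(scores))]

def masked_mean_loss(per_token_losses, keep_mask):
    """Average the per-token distillation loss over kept positions only."""
    kept = [loss for loss, keep in zip(per_token_losses, keep_mask) if keep]
    return sum(kept) / len(kept) if kept else 0.0

# Invented values: token 1 has low entropy but a large loss (overconfident).
student_entropies = [2.0, 0.1, 0.1, 1.0]
per_token_losses = [0.5, 4.0, 0.2, 0.3]

mask = keep_top_fraction(student_entropies, fraction=0.5)
loss = masked_mean_loss(per_token_losses, mask)
print(mask, loss)
```

In this toy case, entropy-only selection drops token 1 even though it has the largest loss, which is the gap that a score combining entropy with divergence is meant to close.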

Topics

On-Policy Distillation
Token Importance
Student Entropy
Teacher-Student Divergence
Soft-OR Score

Code references

HJSang/OPSD_OnPolicyDistillation

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.