Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning
Summary
The paper introduces Entropy-Guided Reinforced Self-Distillation (EGRSD) and its causal-lookahead variant, CL-EGRSD, to make Large Language Model (LLM) reasoning more efficient. On-policy self-distillation usually weights token-level supervision uniformly, ignoring how the confidence (entropy) of the teacher model's predictions varies across positions. EGRSD builds each token-level update from three factors: a reward-grounded direction, a teacher–student likelihood-ratio magnitude, and a novel teacher-entropy confidence gate. The gate down-weights high-entropy token positions while keeping a non-zero lower bound. CL-EGRSD refines this further by distinguishing sustained high-entropy spans from transient high-entropy "pivot" positions whose uncertainty quickly resolves. Experiments with Qwen3-4B and Qwen3-8B in thinking mode show that EGRSD and CL-EGRSD improve the accuracy–length trade-off and outperform the other trainable methods compared.
Key takeaway
For AI engineers optimizing LLM inference costs, integrating Entropy-Guided Reinforced Self-Distillation (EGRSD) or CL-EGRSD into the training pipeline can improve the accuracy–length trade-off. Weighting token-level supervision by teacher confidence lets you cut redundant reasoning tokens without sacrificing accuracy. Consider the causal-lookahead variant, CL-EGRSD, especially for larger models, to better handle transient high-entropy pivot positions and further improve efficiency.
Key insights
Teacher predictive entropy is a crucial, often overlooked, signal for efficient LLM self-distillation.
Principles
- Not all dense supervision is equally reliable.
- Down-weight high-entropy teacher tokens, but keep a non-zero floor so they still contribute.
- Distinguish sustained forks from transient pivots.
Method
EGRSD normalizes the privileged teacher's entropy within each minibatch and turns it into a multiplicative confidence gate $\omega_{i,t}\in[0.1,1]$ that down-weights high-entropy positions. CL-EGRSD instead uses the minimum entropy over a causal future window, so a transient pivot whose uncertainty quickly resolves is down-weighted less than a sustained high-entropy span.
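The gate described above can be illustrated with a minimal sketch. The summary does not give the exact normalization, so this example assumes min-max normalization over the minibatch followed by a linear map onto [0.1, 1]; the function names `token_entropy` and `confidence_gates` are hypothetical and not taken from the paper.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def confidence_gates(entropies, floor=0.1):
    """Map per-token teacher entropies in a minibatch to gates in [floor, 1].

    Assumption: min-max normalization over the minibatch, then a linear map
    so the lowest-entropy token gets 1.0 and the highest-entropy token
    gets `floor`. The paper's actual normalization may differ.
    """
    lo, hi = min(entropies), max(entropies)
    if hi == lo:
        # All tokens are equally confident: apply no down-weighting.
        return [1.0 for _ in entropies]
    gates = []
    for h in entropies:
        normalized = (h - lo) / (hi - lo)  # 0 = most confident, 1 = least
        gates.append(floor + (1.0 - floor) * (1.0 - normalized))
    return gates

# Toy minibatch of three teacher distributions over a 4-token vocabulary.
teacher_dists = [
    [0.97, 0.01, 0.01, 0.01],  # confident teacher
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain teacher
    [0.70, 0.10, 0.10, 0.10],  # in between
]
entropies = [token_entropy(d) for d in teacher_dists]
gates = confidence_gates(entropies)
```

In this toy batch the confident position keeps full weight, the uniform position is reduced to the floor rather than to zero, and the third position lands in between. Each gate would then multiply the corresponding token-level update.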
In practice
- Apply entropy-guided weighting in self-distillation.
- Use causal lookahead to identify strategy-shift pivots.
- Maintain a non-zero lower bound on token weights.
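To make the causal-lookahead idea concrete, here is a small sketch of taking the minimum entropy over a future window. It assumes the window includes the current position and is truncated at the end of the sequence; the function name `lookahead_entropy` and the `window` size are hypothetical choices for illustration, not values from the paper.

```python
def lookahead_entropy(entropies, window=2):
    """Replace each entropy with the minimum over positions t..t+window.

    Assumption: the window includes the current position t and is cut off
    at the end of the sequence. `window` is an illustrative hyperparameter.
    """
    n = len(entropies)
    return [min(entropies[t:min(n, t + window + 1)]) for t in range(n)]

# Toy per-token teacher entropies for one sequence:
# position 1 is a transient pivot (uncertainty resolves right after it),
# positions 4..7 form a sustained high-entropy span.
raw = [0.1, 1.2, 0.1, 0.1, 1.3, 1.2, 1.4, 1.3]
smoothed = lookahead_entropy(raw, window=2)
```

After the lookahead, the pivot at position 1 carries a low entropy value because the teacher becomes confident again within the window, while the sustained span stays high. Feeding the smoothed values into an entropy-based gate would therefore down-weight only the sustained span.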
Topics
- On-Policy Self-Distillation
- LLM Reasoning Efficiency
- Entropy-Guided Distillation
- Teacher Predictive Entropy
- EGRSD
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.