Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning

2026-05-15 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, medium

Summary

The paper introduces Entropy-Guided Reinforced Self-Distillation (EGRSD) and a causal-lookahead variant, CL-EGRSD, to make Large Language Model (LLM) reasoning more efficient. On-policy self-distillation typically weights token-level supervision uniformly, ignoring how confident the teacher is at each position, as measured by its predictive entropy. EGRSD composes each token-level update from three factors: a reward-grounded direction, a teacher–student likelihood-ratio magnitude, and a teacher-entropy confidence gate. The gate down-weights high-entropy token positions while keeping a non-zero lower bound on their weight. CL-EGRSD refines the gate by distinguishing sustained high-entropy spans from transient high-entropy "pivot" positions whose uncertainty quickly resolves. In experiments with Qwen3-4B and Qwen3-8B in thinking mode, EGRSD and CL-EGRSD improve the accuracy–length trade-off and outperform the other trainable methods compared.

Key takeaway

For AI engineers optimizing LLM inference costs, integrating EGRSD or CL-EGRSD into a self-distillation training pipeline can improve the accuracy–length trade-off. Weighting token-level supervision by teacher confidence lets the model shed redundant reasoning tokens while preserving accuracy. Consider the causal-lookahead variant, CL-EGRSD, especially for larger models, to avoid penalizing transient high-entropy pivot positions and to refine efficiency further.

Key insights

Teacher predictive entropy is a crucial, often overlooked, signal for efficient LLM self-distillation.

Principles

Not all dense supervision is equally reliable.
Down-weight high-entropy teacher tokens, but keep a non-zero floor on their weight.
Distinguish sustained forks from transient pivots.

Method

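EGRSD normalizes the privileged teacher's entropy within each minibatch and maps it to a multiplicative confidence gate $\omega_{i,t}\in[0.1,1]$ that down-weights high-entropy positions; the floor of 0.1 keeps every token's weight non-zero. CL-EGRSD instead uses the minimum entropy over a causal future window, so a pivot whose uncertainty resolves within the window is treated differently from a sustained high-entropy span.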
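The gate described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the summary gives only the range $[0.1,1]$ and the fact that entropy is normalized within a minibatch, so the min-max normalization, the linear mapping, and the names `token_entropy` and `entropy_gate` are all assumptions made for this example.

```python
import numpy as np

def token_entropy(probs, eps=1e-12):
    """Shannon entropy (in nats) of each token distribution along the last axis."""
    safe = np.clip(probs, eps, 1.0)  # avoid log(0)
    return -(probs * np.log(safe)).sum(axis=-1)

def entropy_gate(entropy, mask, floor=0.1):
    """Map teacher entropies to weights in [floor, 1].

    ASSUMPTION: min-max normalization over the valid tokens of the
    minibatch followed by a linear map; the source does not give the
    exact formula.
    """
    valid = entropy[mask]
    lo, hi = valid.min(), valid.max()
    span = hi - lo
    if span <= 0:
        normed = np.zeros_like(entropy)  # all entropies equal: no down-weighting
    else:
        normed = np.clip((entropy - lo) / span, 0.0, 1.0)
    gate = floor + (1.0 - floor) * (1.0 - normed)  # low entropy -> 1, high -> floor
    return np.where(mask, gate, 0.0)  # padded positions get zero weight

# Toy minibatch: 2 sequences, 3 positions, vocabulary of 4 (illustrative values).
probs = np.array([
    [[0.97, 0.01, 0.01, 0.01], [0.25, 0.25, 0.25, 0.25], [0.70, 0.10, 0.10, 0.10]],
    [[0.40, 0.40, 0.10, 0.10], [0.90, 0.05, 0.03, 0.02], [0.25, 0.25, 0.25, 0.25]],
])
mask = np.array([[True, True, True], [True, True, False]])  # last token is padding

omega = entropy_gate(token_entropy(probs), mask)
```

In this toy batch the most confident position receives weight 1 and the uniform (maximum-entropy) position receives the floor of 0.1, so uncertain tokens still contribute a small gradient signal rather than being dropped.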

In practice

Apply entropy-guided weighting in self-distillation.
Use causal lookahead to identify strategy-shift pivots.
Maintain a non-zero lower bound on token weights.
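To illustrate how a lookahead can separate pivots from sustained forks, the sketch below takes the minimum teacher entropy over a short window of upcoming positions. It is an assumption-laden example: the window size, the choice to include the current position in the window, and the name `lookahead_min_entropy` are hypothetical and not taken from the source.

```python
import numpy as np

def lookahead_min_entropy(entropy, window=2):
    """Minimum entropy over positions t .. t+window of one sequence.

    ASSUMPTION: the window includes position t and is truncated at the
    end of the sequence; the source only says "minimum entropy over a
    causal future window".
    """
    entropy = np.asarray(entropy, dtype=float)
    out = np.empty_like(entropy)
    for t in range(len(entropy)):
        out[t] = entropy[t:t + window + 1].min()
    return out

# Illustrative per-token teacher entropies (nats):
# position 1 is a transient pivot, positions 4..7 form a sustained span.
H = [0.1, 1.2, 0.2, 0.1, 1.3, 1.2, 1.1, 1.25]
H_eff = lookahead_min_entropy(H, window=2)
```

Here the spike at position 1 collapses to 0.1 because the entropy falls again within the window, while positions 4 to 7 stay at 1.1 or above. Feeding this effective entropy into a confidence gate would keep the pivot's weight high and down-weight only the sustained span.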

Topics

On-Policy Self-Distillation
LLM Reasoning Efficiency
Entropy-Guided Distillation
Teacher Predictive Entropy
EGRSD

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.