Skill-Conditioned Gated Self-Distillation for LLM Reasoning
Summary
Skill-Conditioned Gated Self-Distillation (SGSD) improves LLM reasoning by using an experience-derived skill bank as privileged information (PI) for on-policy self-distillation (SD). Existing methods assume trusted PI, such as reference answers; SGSD instead formulates skill-based SD as teacher hypothesis validation. It retrieves skill-mistake pairs, constructs a multi-teacher pool, and lets these skill-conditioned teachers score a plain-prompt student rollout. A verifier validates each teacher's polarity, and a robust gated objective distills informative teacher-student disagreements while suppressing uncertain signals. Experiments on mathematical reasoning benchmarks show SGSD consistently outperforms GRPO. For example, on Qwen3-1.7B, SGSD achieved an average improvement of 6.2% over GRPO and 1.7% over OPSD across AIME24, AIME25, and HMMT25, while operating under a weaker PI assumption.
Key takeaway
For AI scientists and machine learning engineers using self-distillation to improve LLM reasoning, SGSD offers a robust way to use imperfect skill-bank data as privileged information. It consistently outperforms GRPO and remains competitive with answer-conditioned OPSD, even under weaker PI assumptions. Consider SGSD's skill-conditioned, gated distillation framework for mathematical reasoning, especially when high-quality reference answers are scarce.
Key insights
SGSD enhances LLM reasoning by validating skill-conditioned teacher hypotheses from an experience-derived skill bank.
Principles
- Experience-derived skill banks can serve as privileged information.
- Validate teacher hypotheses rather than unconditionally imitating them.
- Gated objectives effectively filter uncertain distillation signals.
Method
SGSD retrieves skill-mistake pairs, forms a multi-teacher pool, and uses a verifier to validate teacher polarity on student rollouts, distilling informative disagreements via a robust gated objective.
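The summary does not give the paper's formulas, so the following is only a minimal sketch of one possible reading of the validate-then-gate step. Everything in it is an assumption made for illustration: the names `TeacherScore`, `teacher_polarity`, `validated_teachers`, and `gated_token_weights`, the rule that a teacher's polarity is the sign of its total log-probability gap to the student, and the fixed `threshold` used as the gate. It shows teachers scoring the same student rollout, a verifier outcome filtering out teachers whose polarity contradicts it, and small disagreements being zeroed out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TeacherScore:
    """Hypothetical container: one skill-conditioned teacher's scores."""
    name: str
    token_logprobs: List[float]  # teacher log-prob of each student token


def teacher_polarity(teacher: TeacherScore, student_logprobs: List[float]) -> int:
    """Assumed rule: +1 if the teacher rates the rollout higher than the
    student does, -1 if lower, 0 on a tie."""
    gap = sum(teacher.token_logprobs) - sum(student_logprobs)
    return (gap > 0) - (gap < 0)


def validated_teachers(
    teachers: List[TeacherScore],
    student_logprobs: List[float],
    rollout_correct: bool,
) -> List[TeacherScore]:
    """Keep teachers whose polarity agrees with the verifier's verdict:
    favour a correct rollout, disfavour an incorrect one."""
    expected = 1 if rollout_correct else -1
    return [
        t for t in teachers
        if teacher_polarity(t, student_logprobs) == expected
    ]


def gated_token_weights(
    teacher: TeacherScore,
    student_logprobs: List[float],
    threshold: float = 0.5,
) -> List[float]:
    """Assumed gate: pass a token's teacher-student gap through only when
    its magnitude reaches the threshold; otherwise suppress it to zero."""
    weights = []
    for t_lp, s_lp in zip(teacher.token_logprobs, student_logprobs):
        gap = t_lp - s_lp
        weights.append(gap if abs(gap) >= threshold else 0.0)
    return weights


# Toy example: two teachers score one three-token student rollout.
student = [-1.0, -2.0, -0.5]
teachers = [
    TeacherScore("A", [-0.5, -0.5, -0.5]),  # rates the rollout higher
    TeacherScore("B", [-2.0, -3.0, -0.5]),  # rates the rollout lower
]
kept = validated_teachers(teachers, student, rollout_correct=True)
weights = gated_token_weights(kept[0], student)
```

In this toy case the verifier marks the rollout correct, so only teacher A survives validation, and the third token (where teacher and student agree) contributes nothing to the distillation signal. A real objective would likely use a softer gate and per-token distributions rather than scalar log-probabilities.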
In practice
- Explore skill-mistake pairs for LLM self-distillation.
- Implement multi-teacher validation for robust supervision.
- Apply gated objectives to filter noisy teacher signals.
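As a starting point for the first two items, the sketch below shows how skill-mistake pairs might be retrieved and turned into a multi-teacher pool of prompts while the student keeps the plain prompt. It is a hedged illustration only: the `SkillEntry` structure, the prompt template, and the word-overlap (Jaccard) retrieval are assumptions swapped in for simplicity, not the retrieval method or prompt format used by SGSD.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class SkillEntry:
    """Hypothetical skill-bank record: a skill and a mistake it guards against."""
    skill: str
    mistake: str


def _tokens(text: str) -> Set[str]:
    # Lowercased whitespace tokens; a stand-in for a real embedding model.
    return set(text.lower().split())


def retrieve(bank: List[SkillEntry], problem: str, k: int = 2) -> List[SkillEntry]:
    """Return the k entries with the highest Jaccard word overlap with the
    problem (an assumed, deliberately simple similarity measure)."""
    query = _tokens(problem)

    def overlap(entry: SkillEntry) -> float:
        entry_tokens = _tokens(entry.skill + " " + entry.mistake)
        union = query | entry_tokens
        return len(query & entry_tokens) / len(union) if union else 0.0

    return sorted(bank, key=overlap, reverse=True)[:k]


def build_teacher_prompts(problem: str, entries: List[SkillEntry]) -> List[str]:
    """One prompt per retrieved pair, so each pair defines its own teacher."""
    return [
        f"Skill: {e.skill}\nCommon mistake: {e.mistake}\nProblem: {problem}"
        for e in entries
    ]


bank = [
    SkillEntry("use modular arithmetic to find the remainder",
               "forgetting to reduce modulo at each step"),
    SkillEntry("apply the triangle inequality",
               "mixing up side lengths"),
    SkillEntry("count cases with inclusion exclusion",
               "double counting overlaps"),
]
problem = "find the remainder of 7^100 modulo 5"

student_prompt = problem  # the student sees the plain prompt only
teacher_prompts = build_teacher_prompts(problem, retrieve(bank, problem, k=2))
```

Each teacher prompt conditions the same model on a different skill-mistake pair, which is what lets the pool disagree and makes a later validation step meaningful.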
Topics
- Large Language Models
- Self-Distillation
- LLM Reasoning
- Skill Learning
- Mathematical Reasoning
- Reinforcement Learning
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.