Skill-Conditioned Gated Self-Distillation for LLM Reasoning

2026-05-27 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, medium

Summary

Skill-Conditioned Gated Self-Distillation (SGSD) improves LLM reasoning by using an experience-derived skill bank as privileged information (PI) for on-policy self-distillation (SD). Existing methods assume trusted PI, such as reference answers; SGSD instead formulates skill-based SD as teacher hypothesis validation. It retrieves skill-mistake pairs, constructs a multi-teacher pool, and lets these skill-conditioned teachers score a plain-prompt student rollout. A verifier validates each teacher's polarity, and a robust gated objective distills informative teacher-student disagreements while suppressing uncertain signals. Experiments on mathematical reasoning benchmarks show SGSD consistently outperforms GRPO. For example, with Qwen3-1.7B, SGSD achieves an average improvement of 6.2% over GRPO and 1.7% over OPSD across AIME24, AIME25, and HMMT25, while operating under a weaker PI assumption.

Key takeaway

For AI Scientists and Machine Learning Engineers using self-distillation to improve LLM reasoning, SGSD offers a robust way to use an imperfect skill bank as privileged information. The method consistently outperforms GRPO and remains competitive with answer-conditioned OPSD, even under weaker PI assumptions. Consider SGSD's skill-conditioned, gated distillation framework for mathematical reasoning, especially when high-quality reference answers are scarce.

Key insights

SGSD enhances LLM reasoning by validating skill-conditioned teacher hypotheses from an experience-derived skill bank.

Principles

Experience-derived skill banks can serve as privileged information.
Validate teacher hypotheses rather than unconditionally imitating them.
Gated objectives effectively filter uncertain distillation signals.

Method

SGSD retrieves skill-mistake pairs, forms a multi-teacher pool, and uses a verifier to validate teacher polarity on student rollouts, distilling informative disagreements via a robust gated objective.
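The validation and gating steps described above can be illustrated with a minimal sketch. It is not the paper's objective: the polarity rule (a teacher is kept only if it rates a verified-correct rollout above the student, or a verified-incorrect rollout below it), the threshold `tau`, the bound `clip`, and all function and class names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TeacherScore:
    """Hypothetical container: one skill-conditioned teacher's scores."""
    name: str
    token_logprobs: List[float]  # teacher log-probs of the student's rollout tokens


def teacher_polarity(teacher_lp: List[float], student_lp: List[float]) -> int:
    """Return +1 if the teacher rates the rollout above the student on average, else -1."""
    gap = sum(t - s for t, s in zip(teacher_lp, student_lp)) / len(student_lp)
    return 1 if gap > 0 else -1


def validated_teachers(teachers, student_lp, rollout_correct: bool):
    """Keep teachers whose polarity agrees with the verifier's verdict (assumed rule)."""
    target = 1 if rollout_correct else -1
    return [t for t in teachers
            if teacher_polarity(t.token_logprobs, student_lp) == target]


def gated_token_signal(teachers, student_lp, rollout_correct: bool,
                       tau: float = 0.5, clip: float = 2.0) -> List[float]:
    """Per-token distillation signal: mean teacher-student log-prob gap over
    validated teachers, zeroed where |gap| < tau and clipped to [-clip, clip]."""
    kept = validated_teachers(teachers, student_lp, rollout_correct)
    if not kept:
        # No teacher passed validation: suppress the signal entirely.
        return [0.0] * len(student_lp)
    signal = []
    for i, s in enumerate(student_lp):
        gap = sum(t.token_logprobs[i] - s for t in kept) / len(kept)
        if abs(gap) < tau:
            signal.append(0.0)  # small disagreement, treated as uninformative
        else:
            signal.append(max(-clip, min(clip, gap)))  # bounded for robustness
    return signal
```

In a training loop, a signal like this would weight a per-token distillation loss on the student's own rollout; how the paper actually combines teachers and defines the gate may differ from this sketch.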

In practice

Explore skill-mistake pairs for LLM self-distillation.
Implement multi-teacher validation for robust supervision.
Apply gated objectives to filter noisy teacher signals.
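As a starting point for the first two items, the sketch below shows one way to retrieve skill-mistake pairs and turn them into a pool of skill-conditioned teacher prompts, while the student keeps the plain prompt. The `SkillEntry` structure, the Jaccard word-overlap retriever, and the prompt template are all hypothetical; the source does not specify how the skill bank is stored or queried.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SkillEntry:
    """Hypothetical skill-bank record: a skill and a mistake it guards against."""
    skill: str
    mistake: str


def _tokens(text: str) -> set:
    """Lowercased alphanumeric word set, used for a simple overlap score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(bank: List[SkillEntry], problem: str, k: int = 2) -> List[SkillEntry]:
    """Return the k entries with the highest Jaccard overlap with the problem.
    A real system would likely use an embedding retriever instead."""
    query = _tokens(problem)

    def score(entry: SkillEntry) -> float:
        doc = _tokens(entry.skill + " " + entry.mistake)
        union = query | doc
        return len(query & doc) / len(union) if union else 0.0

    return sorted(bank, key=score, reverse=True)[:k]


def build_teacher_prompts(problem: str, entries: List[SkillEntry]) -> List[str]:
    """One teacher prompt per retrieved entry; the student sees only `problem`."""
    return [
        f"Skill: {e.skill}\nCommon mistake: {e.mistake}\nProblem: {problem}"
        for e in entries
    ]
```

Each prompt in the returned list conditions the same model as a different teacher, so the pool size equals the number of retrieved entries.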

Topics

Large Language Models
Self-Distillation
LLM Reasoning
Skill Learning
Mathematical Reasoning
Reinforcement Learning

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.