Learning from the Self-future: On-policy Self-distillation for dLLMs
Summary
d-OPSD introduces the first on-policy self-distillation (OPSD) framework designed specifically for diffusion LLMs (dLLMs). It addresses a conflict between existing OPSD methods, which were built for autoregressive models, and the arbitrary-order generation of dLLMs. Traditional OPSD relies on left-to-right prefix conditioning and token-level divergence supervision, both of which are incompatible with dLLMs. d-OPSD instead builds the self-teacher by using self-generated answers as suffix conditioning, so the student learns from "self future-experience" rather than from privileged prefixes. It also moves supervision from the token level to the step level, matching the iterative denoising process of dLLMs. Experiments on four reasoning benchmarks show that d-OPSD consistently outperforms RLVR and SFT baselines while needing only around 10% of the optimization steps required by RLVR. The code is publicly available at https://github.com/xingzhejun/d-OPSD.
Key takeaway
Machine learning engineers who post-train diffusion LLMs (dLLMs) for reasoning tasks should consider d-OPSD. The framework outperforms RLVR baselines while needing only about 10% of their optimization steps, which can shorten dLLM fine-tuning and improve results on reasoning benchmarks. The open-source code can be adapted to specific dLLM architectures and tasks.
Key insights
d-OPSD adapts self-distillation to dLLMs by using suffix conditioning and step-level supervision, and it needs only around 10% of the optimization steps required by RLVR.
Principles
- Self-distillation can adapt to non-autoregressive models.
- Suffix conditioning enables "self future-experience" learning.
- Step-level supervision aligns with iterative denoising.
Method
d-OPSD reframes self-teacher construction with suffix conditioning from self-generated answers and shifts supervision from token-level to step-level, aligning with dLLM iterative denoising.
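The two ideas in this section can be illustrated with a toy sketch. It is not the authors' implementation: the summary does not give the exact loss, so the names `build_views`, `step_level_loss`, the `<mask>` token, the hand-written probability tables, and the choice of averaging KL(teacher || student) over the positions unmasked in one denoising step are all assumptions made for illustration. The sketch shows the student seeing only the prompt and a partially masked response, the self-teacher seeing the same input plus the self-generated answer as a suffix, and the divergence being aggregated per denoising step rather than per token.

```python
import math

MASK = "<mask>"  # hypothetical mask token, chosen for this illustration


def build_views(prompt, noisy_response, self_answer):
    """Return (student_view, teacher_view) as token lists.

    Assumption: the self-teacher is the same model given extra context,
    namely the model's own generated answer appended as a suffix.
    """
    student_view = prompt + noisy_response
    teacher_view = prompt + noisy_response + self_answer
    return student_view, teacher_view


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def step_level_loss(teacher_probs, student_probs, step_positions):
    """Average KL(teacher || student) over the positions unmasked in
    one denoising step (one assumed reading of step-level supervision)."""
    if not step_positions:
        return 0.0
    total = sum(
        kl_divergence(teacher_probs[i], student_probs[i])
        for i in step_positions
    )
    return total / len(step_positions)


# Toy example: a 4-token response with two masked positions (1 and 3).
prompt = ["Q:", "2+3=?"]
noisy_response = ["The", MASK, "is", MASK]
self_answer = ["Answer:", "5"]  # produced by the student itself (on-policy)

student_view, teacher_view = build_views(prompt, noisy_response, self_answer)

# Hand-written distributions over a 3-token vocabulary at each masked
# position; in practice these would come from two forward passes.
teacher_probs = {1: [0.8, 0.1, 0.1], 3: [0.1, 0.1, 0.8]}
student_probs = {1: [0.4, 0.3, 0.3], 3: [0.3, 0.3, 0.4]}

loss = step_level_loss(teacher_probs, student_probs, step_positions=[1, 3])
print(f"step-level loss: {loss:.4f}")
```

In a real training loop the probability tables would be replaced by model outputs, and the teacher pass would typically be run without gradients so that only the student view is optimized; both points are assumptions about a reasonable setup, not details confirmed by the summary.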
In practice
- Apply d-OPSD for dLLM post-training.
- Explore suffix conditioning for non-autoregressive tasks.
- Consider step-level supervision for iterative models.
Topics
- Diffusion LLMs
- On-policy Self-distillation
- d-OPSD Framework
- Suffix Conditioning
- Step-level Supervision
- LLM Post-training
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.