Learning from the Self-future: On-policy Self-distillation for dLLMs

2026-06-16 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, quick

Summary

d-OPSD is presented as the first on-policy self-distillation (OPSD) framework designed specifically for diffusion LLMs (dLLMs). Existing OPSD methods target autoregressive models: they rely on left-to-right prefix conditioning and token-level divergence supervision, both of which conflict with dLLMs' arbitrary-order generation. d-OPSD instead builds the self-teacher by using self-generated answers as suffix conditioning, so the student learns from "self future-experience" rather than from privileged prefixes. It also moves supervision from the token level to the step level, matching the iterative denoising process of dLLMs. In experiments on four reasoning benchmarks, d-OPSD consistently outperforms RLVR and SFT baselines and is more sample-efficient, requiring only around 10% of the optimization steps needed by RLVR. The code is publicly available at https://github.com/xingzhejun/d-OPSD.

Key takeaway

Machine learning engineers post-training diffusion LLMs (dLLMs) for reasoning tasks should consider d-OPSD. In the reported experiments it outperforms RLVR and SFT baselines on four reasoning benchmarks while needing only about 10% of the optimization steps RLVR requires, which could shorten fine-tuning considerably. The open-source code is a starting point for adapting the method to your own dLLM architectures and tasks.

Key insights

d-OPSD adapts on-policy self-distillation to dLLMs through suffix conditioning and step-level supervision, needing only around 10% of the optimization steps that RLVR requires.

Principles

Self-distillation can adapt to non-autoregressive models.
Suffix conditioning enables "self future-experience" learning.
Step-level supervision aligns with iterative denoising.
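
To make the third principle concrete, here is a toy sketch of iterative denoising in which several positions are unmasked per step, in no fixed order. It is an illustration only: the function name `toy_denoise` and the random selection rule are assumptions made for this example, not taken from the paper or the d-OPSD repository (real dLLMs typically choose positions by model confidence or a similar criterion).

```python
import random


def toy_denoise(length, num_steps, seed=0):
    """Group the positions of a fully masked sequence into denoising steps.

    Returns a list of position lists, one per step (at most num_steps).
    The shuffle is a stand-in for whatever rule a real dLLM uses to decide
    which masked positions to reveal next (assumption for illustration).
    """
    rng = random.Random(seed)
    order = list(range(length))
    rng.shuffle(order)                      # arbitrary, not left-to-right
    per_step = -(-length // num_steps)      # ceiling division
    return [sorted(order[i:i + per_step]) for i in range(0, length, per_step)]


if __name__ == "__main__":
    for step, positions in enumerate(toy_denoise(length=8, num_steps=4)):
        print(f"step {step}: unmask positions {positions}")
```

Because each step reveals a set of positions rather than the single next token, the natural unit for a training signal is the step, which is the intuition behind step-level supervision.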

Method

d-OPSD reframes self-teacher construction with suffix conditioning from self-generated answers and shifts supervision from token-level to step-level, aligning with dLLM iterative denoising.
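
The summary does not give the exact input layout or loss, so the following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes the student sees the prompt plus a masked response, the self-teacher sees the same sequence with the self-generated answer appended as a suffix, and the step-level signal is a KL divergence averaged over the positions unmasked in one denoising step. The names `build_inputs`, `step_level_loss`, and `MASK` are hypothetical.

```python
import math

MASK = "<mask>"  # hypothetical mask token for illustration


def build_inputs(prompt, response_len, answer):
    """Return (student_input, teacher_input) as token lists.

    Assumed layout: the teacher is the same model, conditioned additionally
    on its own generated answer placed AFTER the masked response (suffix).
    """
    masked_response = [MASK] * response_len
    student_input = prompt + masked_response
    teacher_input = prompt + masked_response + answer
    return student_input, teacher_input


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of floats."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def step_level_loss(teacher_probs, student_probs, step_positions):
    """Average KL(teacher || student) over positions unmasked in one step.

    teacher_probs / student_probs map position -> distribution over the vocab.
    """
    total = sum(kl(teacher_probs[i], student_probs[i]) for i in step_positions)
    return total / len(step_positions)


if __name__ == "__main__":
    student, teacher = build_inputs(["Q:", "2+2?"], 3, ["Answer:", "4"])
    print(student)
    print(teacher)
    t = {0: [0.9, 0.1], 2: [0.2, 0.8]}
    s = {0: [0.6, 0.4], 2: [0.5, 0.5]}
    print(step_level_loss(t, s, step_positions=[0, 2]))
```

In a real training loop the distributions would come from two forward passes of the same dLLM, with gradients flowing only through the student pass; the direction of the divergence and the per-step weighting are design choices this sketch does not settle.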

In practice

Apply d-OPSD for dLLM post-training.
Explore suffix conditioning for non-autoregressive tasks.
Consider step-level supervision for iterative models.

Topics

Diffusion LLMs
On-policy Self-distillation
d-OPSD Framework
Suffix Conditioning
Step-level Supervision
LLM Post-training

Code references

xingzhejun/d-OPSD

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.