Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation

2026-06-08 · Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, short

Summary

A new method, On-Policy Diffusion Language Model (OPDLM), converts autoregressive language models (ARLMs) into diffusion language models (DLMs) with far less training data. Existing conversion approaches suffer from two distribution shifts: one caused by changing the training objective, and another from the train-inference mismatch inherent in standard DLMs. OPDLM addresses both with On-Policy Distillation (OPD). In this self-OPD process, the student is the ARLM modified to use bidirectional attention, and it generates its own trajectories; the original, frozen ARLM acts as the teacher, providing target logits on those generated sequences. Because the student trains on the sequences it produces itself, the train-inference mismatch is eliminated, and knowledge retention from the initial ARLM improves significantly. In empirical evaluations, OPDLM achieves strong performance across a range of tasks while using 15x to 7,000x fewer training tokens than pretraining DLMs from scratch, which effectively makes DLM conversion a form of ARLM post-training.
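To make the "bidirectional attention" modification concrete, here is a minimal sketch of the two attention patterns involved. It only illustrates the general difference between a causal and a bidirectional mask; the function names are hypothetical, and how OPDLM actually applies the change inside the model is not described in this summary.

```python
def causal_mask(n):
    """Autoregressive pattern: position i may attend only to positions j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Bidirectional pattern: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


# Illustrative comparison for a 3-token sequence (1 = attention allowed).
ar_pattern = causal_mask(3)          # lower-triangular
dlm_pattern = bidirectional_mask(3)  # all ones
```

The causal pattern is what lets an ARLM generate left to right; removing it lets each position condition on tokens on both sides, which a diffusion-style model needs when it fills in tokens in arbitrary order.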

Key takeaway

For machine learning engineers building diffusion language models or converting existing autoregressive models, OPDLM offers a highly data-efficient path. It is reported to need 15x to 7,000x fewer training tokens than pretraining a DLM from scratch, which avoids the prohibitive cost of DLM pretraining. Consider On-Policy Distillation when converting an existing ARLM: it targets both the objective-change distribution shift and the train-inference mismatch, and it helps the resulting DLM retain what the ARLM already knows. This positions DLM conversion as a practical post-training step.
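As a rough sense of scale, the reported reduction factors can be applied to a from-scratch token budget. The 1-trillion-token baseline below is a purely hypothetical figure chosen for the arithmetic, not a number from the paper, and the helper name is illustrative.

```python
def converted_token_budget(scratch_tokens, reduction_factor):
    """Tokens needed if training uses `reduction_factor` times fewer tokens."""
    return scratch_tokens / reduction_factor


# Hypothetical from-scratch DLM pretraining budget (assumption, not from the source).
scratch_tokens = 1_000_000_000_000  # 1 trillion tokens

# The two ends of the reported 15x to 7,000x range.
budget_at_15x = converted_token_budget(scratch_tokens, 15)      # roughly 66.7 billion
budget_at_7000x = converted_token_budget(scratch_tokens, 7000)  # roughly 143 million
```

Under that assumed baseline, the conversion budget lands somewhere between tens of billions and hundreds of millions of tokens, which is the range usually associated with post-training rather than pretraining.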

Key insights

On-Policy Distillation converts ARLMs into DLMs efficiently by addressing two distribution shifts: the change in training objective and the train-inference mismatch.

Principles

Address distribution shifts when transforming one model type into another.
Training on the student's own generated trajectories aligns training with inference.
Distillation from the frozen source model retains its knowledge.

Method

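OPDLM trains a student, the ARLM modified to use bidirectional attention, with self-OPD: the student generates its own trajectories, and the original, frozen ARLM serves as the teacher, providing target logits on those student-generated trajectories.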
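The core of such a distillation step can be sketched as a per-position divergence between student and teacher distributions over a sequence the student generated. This is a minimal sketch under stated assumptions, not the paper's implementation: the summary does not say which divergence OPDLM uses, so the reverse KL, KL(student || teacher), is an assumption here, and the logits are toy values standing in for real model outputs.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one position (assumed divergence, see lead-in)."""
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(s, t))


def opd_loss(student_logits_seq, teacher_logits_seq):
    """Mean per-position divergence over one student-generated sequence."""
    per_position = [
        reverse_kl(s, t) for s, t in zip(student_logits_seq, teacher_logits_seq)
    ]
    return sum(per_position) / len(per_position)


# Toy logits for a 2-token student-generated sequence over a 3-word vocabulary.
# In a real setup these would come from the bidirectional student and from the
# frozen ARLM teacher scoring the same sequence (hypothetical values here).
student_seq = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
teacher_seq = [[1.5, 0.7, -0.5], [0.1, 0.2, 0.3]]

loss = opd_loss(student_seq, teacher_seq)
```

In an actual training loop, gradients would flow only through the student's logits while the teacher stays frozen, and the sequence would be sampled from the student at each step rather than taken from a fixed dataset, which is what makes the procedure on-policy.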

In practice

Convert ARLMs to DLMs with minimal data.
Reduce DLM pretraining computational costs.
Improve DLM inference consistency.

Topics

Autoregressive Language Models
Diffusion Language Models
On-Policy Distillation
Model Transformation
Data Efficiency
Knowledge Distillation

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.