Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation
Summary
A new method, On-Policy Diffusion Language Model (OPDLM), enables data-efficient transformation of autoregressive language models (ARLMs) into diffusion language models (DLMs). Traditional approaches to this conversion suffer from two key distribution shifts: one from changing the training objective and another from a train-inference mismatch inherent in standard DLMs. OPDLM addresses both with On-Policy Distillation (OPD). In this self-OPD process, a student initialized from the ARLM and modified with bidirectional attention generates its own trajectories, while the original, frozen ARLM acts as a teacher, distilling knowledge by providing target logits on these generated sequences. This on-policy training eliminates the train-inference mismatch and significantly improves knowledge retention from the initial ARLM. Empirical evaluations show that OPDLM achieves strong performance across various tasks while requiring 15x to 7,000x fewer training tokens than pretraining DLMs from scratch, effectively making DLM transformation a form of ARLM post-training.
Key takeaway
For machine learning engineers developing diffusion language models or converting existing autoregressive models, OPDLM offers a highly data-efficient pathway. The reported results use 15x to 7,000x fewer training tokens than pretraining a DLM from scratch, which avoids much of the cost of DLM pretraining. On-Policy Distillation is worth considering as a way to mitigate both the shift from changing the training objective and the train-inference mismatch when building DLMs from existing ARLMs. This approach positions DLM transformation as a practical post-training step.
Key insights
On-Policy Distillation efficiently transforms ARLMs into DLMs by resolving distribution shifts and train-inference mismatch.
Principles
- Address distribution shifts in model transformations.
- On-policy training improves inference alignment.
- Distillation retains knowledge from source models.
Method
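OPDLM trains a student that is initialized from the ARLM and modified to use bidirectional attention. Training uses self-OPD: the student generates its own trajectories, and the original, frozen ARLM acts as the teacher by providing target logits on those student-generated sequences.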
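The loop described above can be illustrated with a toy sketch. This is not the paper's implementation: the vocabulary, the rule-based teacher, the table-based student, the random unmasking order, and the forward KL loss are all assumptions chosen to keep the example small and dependency-free. It shows only the on-policy structure, in which the student produces the sequence and the frozen teacher scores that same sequence.

```python
import math
import random

VOCAB, SEQ_LEN, MASK = 5, 6, -1   # toy sizes (hypothetical); MASK is a sentinel id
NO_CTX = VOCAB                    # context index for "masked or out of range"

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def teacher_probs(prefix):
    """Frozen AR teacher (toy rule): favours (previous token + 1) mod VOCAB."""
    prev = prefix[-1] if prefix else 0
    return softmax([2.0 if t == (prev + 1) % VOCAB else 0.0 for t in range(VOCAB)])

def new_student():
    """Student parameters: one logit table per neighbour direction."""
    return {side: [[0.0] * VOCAB for _ in range(VOCAB + 1)] for side in ("left", "right")}

def context(seq, pos):
    """Left and right neighbour ids, a stand-in for bidirectional attention."""
    left = seq[pos - 1] if pos > 0 and seq[pos - 1] != MASK else NO_CTX
    right = seq[pos + 1] if pos + 1 < len(seq) and seq[pos + 1] != MASK else NO_CTX
    return left, right

def student_probs(student, seq, pos):
    left, right = context(seq, pos)
    logits = [student["left"][left][t] + student["right"][right][t] for t in range(VOCAB)]
    return softmax(logits)

def rollout(student, rng):
    """On-policy trajectory: the student unmasks one position per step, in random order."""
    seq, steps = [MASK] * SEQ_LEN, []
    order = list(range(SEQ_LEN))
    rng.shuffle(order)
    for pos in order:
        steps.append((list(seq), pos))           # partially masked state the student saw
        probs = student_probs(student, seq, pos)
        seq[pos] = rng.choices(range(VOCAB), weights=probs)[0]
    return seq, steps

def distill_step(student, rng, lr=0.5):
    """Teacher scores the student's own sequence; student is nudged toward the teacher."""
    seq, steps = rollout(student, rng)
    loss = 0.0
    for state, pos in steps:
        p_t = teacher_probs(seq[:pos])            # teacher conditions on the generated prefix
        p_s = student_probs(student, state, pos)
        loss += sum(a * math.log(a / b) for a, b in zip(p_t, p_s))  # KL(teacher || student)
        left, right = context(state, pos)
        for t in range(VOCAB):                    # per-step SGD; d KL / d logit = p_s - p_t
            grad = p_s[t] - p_t[t]
            student["left"][left][t] -= lr * grad
            student["right"][right][t] -= lr * grad
    return seq, loss / len(steps)

rng = random.Random(0)
student = new_student()
losses = [distill_step(student, rng)[1] for _ in range(200)]
```

The sampled tokens are treated as fixed when computing the update, so no gradient flows through the sampling step. In a real setting the tables would be replaced by the two language models, and the choice of divergence and unmasking schedule would follow the original paper rather than this sketch.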
In practice
- Convert ARLMs to DLMs with minimal data.
- Reduce DLM pretraining computational costs.
- Improve DLM inference consistency.
Topics
- Autoregressive Language Models
- Diffusion Language Models
- On-Policy Distillation
- Model Transformation
- Data Efficiency
- Knowledge Distillation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.