Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation
Summary
On-Policy Diffusion Language Model (OPDLM) is a data-efficient method for transforming autoregressive language models (ARLMs) into diffusion language models (DLMs). It uses On-Policy Distillation (OPD) to tackle two distribution shifts inherent in prior ARLM-to-DLM transformations. First, it mitigates the knowledge loss that occurs when a model moves from an ARLM's next-token prediction objective to a DLM objective. Second, it resolves the train-inference mismatch common in standard DLMs, which typically train on randomly masked sequences but decode at inference time along confidence-based trajectories. In self-OPD training, the student (the ARLM given bidirectional attention) generates its own trajectories, while the frozen original ARLM acts as the teacher and distills knowledge by providing target logits. The method needs 15x to 7,000x fewer training tokens and achieves strong performance across various tasks, effectively positioning DLM transformation as a form of ARLM post-training.
Key takeaway
Machine Learning Engineers developing or deploying language models should consider On-Policy Distillation (OPD) as an efficient way to transform existing autoregressive models into diffusion language models. The approach significantly reduces the data and computational resources typically required for DLM pretraining, needing 15x to 7,000x fewer training tokens. A pre-trained ARLM can serve as the teacher so that its knowledge is retained, making DLM capabilities more accessible and cost-effective for various applications.
Key insights
On-Policy Distillation efficiently transforms ARLMs into DLMs by addressing distribution shifts and retaining knowledge.
Principles
- Address train-inference mismatch directly.
- Distill knowledge from original ARLM.
- Self-generated trajectories improve training.
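To make the train-inference mismatch named in the principles above concrete, here is a minimal toy sketch in plain Python. It contrasts random masking (what standard DLM training sees) with confidence-based decoding (what inference produces). The names `random_mask`, `confidence_decode`, and `toy_predict` are hypothetical, the `<mask>` symbol is illustrative, and `toy_predict` is a fixed stand-in for a real DLM forward pass; none of this is taken from the paper's implementation.

```python
import random

MASK = "<mask>"  # illustrative mask symbol (assumption)

def random_mask(tokens, mask_ratio, rng):
    """Training-style corruption: hide a uniformly random subset of positions."""
    n_mask = round(len(tokens) * mask_ratio)
    hidden = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in hidden else tok for i, tok in enumerate(tokens)]

def confidence_decode(length, predict):
    """Inference-style trajectory: reveal the most confident masked position each step."""
    seq = [MASK] * length
    trajectory = [list(seq)]
    while MASK in seq:
        preds = predict(seq)  # one (token, confidence) pair per position
        masked = [i for i, tok in enumerate(seq) if tok == MASK]
        best = max(masked, key=lambda i: preds[i][1])
        seq[best] = preds[best][0]
        trajectory.append(list(seq))
    return trajectory

def toy_predict(seq):
    """Hypothetical stand-in for a model: fixed tokens and fixed confidences."""
    words = ["the", "cat", "sat", "down"]
    confidences = [0.9, 0.2, 0.4, 0.1]
    return list(zip(words, confidences))

trajectory = confidence_decode(4, toy_predict)
randomly_masked = random_mask(["the", "cat", "sat", "down"], 0.5, random.Random(0))
```

The intermediate states in `trajectory` follow a structured order set by the model's confidence, whereas `randomly_masked` hides positions with no such structure. Training on the student's own trajectories, as OPD does, exposes the model to the first kind of state.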
Method
OPDLM trains a student ARLM (with bidirectional attention) using self-OPD, where the original frozen ARLM distills knowledge by providing target logits on the student's generated trajectories.
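The distillation signal described above can be sketched as a per-position divergence between teacher and student distributions on a student-generated trajectory. This is a minimal illustration under stated assumptions, not the paper's loss: the summary only says the teacher provides target logits, so the choice of KL(teacher || student), the averaging over positions, and the function names are all assumptions made here for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one position's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_logits, student_logits):
    """Mean KL(teacher || student) over positions.

    Assumption: both arguments are lists of per-position logit vectors,
    scored on the same student-generated sequence. The KL direction is
    an illustrative choice, not confirmed by the source.
    """
    per_position = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_position) / len(per_position)

# Toy logits: 2 positions, vocabulary of 3 tokens.
teacher = [[2.0, 0.5, -1.0], [0.0, 1.0, 0.0]]
student = [[1.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
loss = distillation_loss(teacher, student)
```

In a real setup the teacher would be run without gradient tracking, since it is frozen, and only the student's parameters would be updated to reduce this loss.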
In practice
- Convert ARLMs to DLMs with less data.
- Avoid costly DLM pretraining.
Topics
- Autoregressive Language Models
- Diffusion Language Models
- On-Policy Distillation
- Model Transformation
- Data Efficiency
- Knowledge Distillation
Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.