Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation

Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, quick

Summary

On-Policy Diffusion Language Model (OPDLM) is a data-efficient method for transforming autoregressive language models (ARLMs) into diffusion language models (DLMs). It uses On-Policy Distillation (OPD) to address two distribution shifts in prior ARLM-to-DLM transformations. First, it mitigates the knowledge loss that occurs when switching from an ARLM's next-token prediction objective to a DLM objective. Second, it resolves the train-inference mismatch of standard DLMs, which train on randomly masked sequences but decode along confidence-based trajectories at inference time. In self-OPD training, the student (the ARLM given bidirectional attention) generates its own trajectories, while the original ARLM, kept frozen, acts as the teacher and supplies target logits for distillation. The method needs 15x to 7,000x fewer training tokens and achieves strong performance across various tasks, which effectively positions DLM transformation as a form of ARLM post-training.
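To make the train-inference mismatch concrete, here is a minimal Python sketch, not the paper's implementation. It contrasts random masking (training-style corruption) with confidence-based decoding (the inference-style trajectory). Every name in it (`MASK`, `random_mask`, `confidence_decode`, `toy_predict`) is hypothetical, and the toy predictor with its distance-based confidence rule is an assumption made purely for illustration.

```python
import random

MASK = "[MASK]"  # hypothetical mask token used only in this sketch


def random_mask(tokens, mask_ratio, rng):
    """Training-style corruption: mask a uniformly random subset of positions."""
    n_mask = round(mask_ratio * len(tokens))
    masked = set(rng.sample(range(len(tokens)), n_mask))
    return [MASK if i in masked else tok for i, tok in enumerate(tokens)]


def confidence_decode(predict, length, tokens_per_step=1):
    """Inference-style decoding: start fully masked and repeatedly commit
    the masked positions where the predictor is most confident."""
    seq = [MASK] * length
    trajectory = [list(seq)]
    while MASK in seq:
        proposals = predict(seq)  # {position: (token, confidence)}
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:tokens_per_step]:
            seq[i] = proposals[i][0]
        trajectory.append(list(seq))
    return trajectory


def toy_predict(seq, target=("the", "cat", "sat", "down")):
    """Toy stand-in for a DLM (assumption): confidence decays with the
    distance to the nearest revealed token, or to the sequence start."""
    revealed = [i for i, tok in enumerate(seq) if tok != MASK]
    proposals = {}
    for i, tok in enumerate(seq):
        if tok == MASK:
            dist = min((abs(i - j) for j in revealed), default=i + 1)
            proposals[i] = (target[i], 1.0 / (1 + dist))
    return proposals


rng = random.Random(0)
train_state = random_mask(["the", "cat", "sat", "down"], 0.5, rng)
trajectory = confidence_decode(toy_predict, 4)
```

In this toy setting the decoding trajectory only visits states with a revealed prefix, whereas `random_mask` can produce any subset of masked positions. That gap between the states seen in training and the states reached at inference is the mismatch the summary describes.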

Key takeaway

Machine Learning Engineers developing or deploying language models should consider On-Policy Distillation (OPD) as an efficient way to transform existing autoregressive models into diffusion language models. The approach needs 15x to 7,000x fewer training tokens, which significantly reduces the data and computational resources typically required for DLM pretraining. Your pre-trained ARLMs can serve as teachers so that their knowledge is retained, making DLM capabilities more accessible and cost-effective for various applications.

Key insights

On-Policy Distillation efficiently transforms ARLMs into DLMs by addressing distribution shifts and retaining knowledge.

Method

OPDLM trains a student ARLM (with bidirectional attention) using self-OPD, where the original frozen ARLM distills knowledge by providing target logits on the student's generated trajectories.
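The distillation signal described above can be sketched as a per-position divergence between teacher and student distributions, evaluated on states the student itself produced. This is a sketch under stated assumptions, not the paper's objective: the summary does not say which divergence OPDLM uses, so the forward KL (teacher relative to student) is an assumption, and `softmax`, `kl_divergence`, and `opd_loss` are hypothetical helper names. Logits are plain Python lists to keep the example dependency-free.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one position's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def opd_loss(student_logits, teacher_logits, masked_positions):
    """Mean KL(teacher || student) over the positions that were still masked
    in a state taken from the student's own decoding trajectory.
    The teacher logits are treated as fixed targets (frozen teacher)."""
    if not masked_positions:
        return 0.0
    total = 0.0
    for i in masked_positions:
        teacher_probs = softmax(teacher_logits[i])
        student_probs = softmax(student_logits[i])
        total += kl_divergence(teacher_probs, student_probs)
    return total / len(masked_positions)


# Toy logits over a 3-token vocabulary for a 2-position sequence.
teacher = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
student = [[0.0, 0.0, 2.0], [0.0, 2.0, 0.0]]

loss_disagree = opd_loss(student, teacher, masked_positions=[0])
loss_agree = opd_loss(student, teacher, masked_positions=[1])
```

In a real training loop only the student would receive gradients from this loss, and the masked positions would come from states sampled along the student's own generation rather than from random masking.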

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.