DSL-LLaDA: Scaling Continuous Denoising to 8B Masked Diffusion LMs

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Natural Language Processing · Depth: Expert, quick

Summary

DSL-LLaDA scales continuous denoising to an 8B masked diffusion language model (DLM), targeting the length-quality tradeoff that arises in few-step decoding of discrete masked DLMs. The researchers lightly adapt the pretrained LLaDA-8B-Instruct model with Discrete Stochastic Localization (DSL), using 1,000 steps of continued pretraining. The adaptation replaces binary masking with continuous per-token Gaussian noise, which acts as a soft mask. The resulting DSL-LLaDA-SDE model supports continuous inference: all token positions evolve jointly in embedding space, and hard token commitment is deferred until the final step. In zero-shot summarization, it achieves the best ROUGE-1 scores across four benchmarks at low step budgets (up to 16 forward passes) while preventing premature termination and repetitive output. The model also exhibits selective noisy-state robustness, correcting corrupted tokens while preserving uncorrupted ones, a capability not observed with standard masked diffusion training.
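To make the idea of continuous inference concrete, here is a toy sketch of a decoding loop in which every position is updated on each forward pass and tokens are only chosen at the very end. Everything in it is an assumption for illustration: the three-token `VOCAB` table, the `toy_denoiser` stand-in for the network, and the interpolation-plus-noise update rule are hypothetical and are not the DSL-LLaDA-SDE sampler.

```python
import math
import random

# Hypothetical toy embedding table (assumption, not from the source).
VOCAB = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [-1.0, 0.0]}
EMB = list(VOCAB.values())


def sq_dist(x, e):
    return sum((xi - ei) ** 2 for xi, ei in zip(x, e))


def toy_denoiser(state):
    """Stand-in for the model: per position, a softmax-weighted mean of vocab embeddings."""
    out = []
    for x in state:
        logits = [-sq_dist(x, e) for e in EMB]
        m = max(logits)
        weights = [math.exp(l - m) for l in logits]
        z = sum(weights)
        out.append([sum(w * e[d] for w, e in zip(weights, EMB)) / z
                    for d in range(len(x))])
    return out


def commit(state):
    """Hard token commitment: pick the nearest vocab embedding per position."""
    return [min(VOCAB, key=lambda t: sq_dist(x, VOCAB[t])) for x in state]


def continuous_decode(length, steps, rng):
    # Start every position from pure Gaussian noise in embedding space.
    state = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(length)]
    for k in range(steps):
        alpha = (k + 1) / steps          # assumed linear schedule
        sigma = 0.1 * (1.0 - alpha)      # assumed shrinking noise scale
        pred = toy_denoiser(state)       # one forward pass over ALL positions
        state = [[(1 - alpha) * x + alpha * p + sigma * rng.gauss(0.0, 1.0)
                  for x, p in zip(xs, ps)]
                 for xs, ps in zip(state, pred)]
    # Tokens are chosen only here, after the last step.
    return commit(state)


tokens = continuous_decode(length=5, steps=16, rng=random.Random(0))
```

The point of the sketch is structural: the step budget (`steps=16` here, matching the low-budget setting mentioned above) bounds the number of forward passes, and no position is frozen to a discrete token before the loop ends.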

Key takeaway

For machine learning engineers building efficient text generation systems: consider adapting a pretrained masked diffusion LM with continuous denoising. With Discrete Stochastic Localization, the adapted model is reported to achieve the best ROUGE-1 scores in zero-shot summarization at low step budgets (up to 16 forward passes). The approach mitigates the length-quality tradeoff and corrects corrupted tokens while preserving uncorrupted ones, which makes it relevant to high-throughput, quality-sensitive NLP applications.

Key insights

Adapting pretrained masked diffusion LMs with continuous denoising via soft masking addresses the length-quality tradeoff in few-step generation.

Principles

Method

Adapt a pretrained masked diffusion language model through approximately 1,000 steps of continued pretraining with Discrete Stochastic Localization (DSL), substituting continuous per-token Gaussian noise for binary masking.
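The difference between binary masking and a Gaussian soft mask can be sketched in a few lines. This is a minimal illustration under stated assumptions: the additive-noise form with one standard deviation per token, and the helper names `binary_mask` and `soft_mask`, are hypothetical, and the actual DSL noise parameterization and schedule are not specified in this summary.

```python
import random


def binary_mask(embeddings, mask_vec, mask_prob, rng):
    """Hard masking: each token embedding is either kept intact or replaced by the mask vector."""
    return [list(mask_vec) if rng.random() < mask_prob else list(e)
            for e in embeddings]


def soft_mask(embeddings, noise_levels, rng):
    """Soft masking (assumed form): add Gaussian noise with a separate std per token.

    A std of 0 leaves the token clean; a large std makes it close to
    uninformative, so the corruption level varies continuously.
    """
    return [[x + sigma * rng.gauss(0.0, 1.0) for x in e]
            for e, sigma in zip(embeddings, noise_levels)]


rng = random.Random(0)
emb = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # toy token embeddings
noisy = soft_mask(emb, [0.0, 0.5, 2.0], rng)  # clean, lightly noised, heavily noised
masked = binary_mask(emb, [0.0, 0.0], 1.0, rng)
```

Under this reading, binary masking is the two-point special case (a token is either fully visible or fully hidden), while the soft mask lets each token sit anywhere in between.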

In practice

Topics

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer


Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.