DSL-LLaDA: Scaling Continuous Denoising to 8B Masked Diffusion LMs
Summary
DSL-LLaDA scales continuous denoising to an 8B masked diffusion language model (DLM), addressing the length-quality tradeoff inherent in few-step decoding of discrete masked DLMs. The authors lightly adapt the pretrained LLaDA-8B-Instruct model by applying Discrete Stochastic Localization (DSL) for 1,000 steps of continued pretraining. The adaptation replaces traditional binary masking with continuous per-token Gaussian noise, which functions as a soft mask. The resulting DSL-LLaDA-SDE model supports continuous inference: all token positions evolve jointly in embedding space, and hard token commitment is delayed until the final step. This improves zero-shot summarization, achieving the best ROUGE-1 scores across four benchmarks at low step budgets (up to 16 forward passes) while preventing premature termination and repetitive output. The model also exhibits selective noisy-state robustness, correcting corrupted tokens while preserving uncorrupted ones, a capability not observed with standard masked diffusion training.
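The continuous inference idea described above can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the DSL-LLaDA-SDE sampler: the linear noise schedule, the re-noising rule, and the names `continuous_decode` and `toy_denoiser` are all hypothetical, and a small random embedding table stands in for the 8B model. It only shows the two properties the summary names: every position is updated jointly in embedding space, and tokens are committed with an argmax only at the final step.

```python
import numpy as np

def toy_denoiser(embed_table):
    """Hypothetical stand-in for the model: softmax over embedding similarity."""
    def fn(x, t):
        logits = x @ embed_table.T                      # (seq_len, vocab)
        logits -= logits.max(axis=-1, keepdims=True)    # numerical stability
        p = np.exp(logits)
        return p / p.sum(axis=-1, keepdims=True)
    return fn

def continuous_decode(denoise_fn, embed_table, seq_len, num_steps, rng):
    """Evolve all positions jointly in embedding space; commit tokens at the end."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    dim = embed_table.shape[1]
    x = rng.standard_normal((seq_len, dim))             # every position starts as noise
    for step in range(num_steps):
        t = (step + 1) / num_steps                      # assumed linear schedule in (0, 1]
        probs = denoise_fn(x, t)                        # soft prediction, no commitment
        x_hat = probs @ embed_table                     # expected clean embedding
        if step < num_steps - 1:
            # Assumed re-noising rule: blend the estimate with fresh Gaussian noise.
            x = t * x_hat + (1.0 - t) * rng.standard_normal((seq_len, dim))
    return probs.argmax(axis=-1)                        # hard commitment, final step only

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    table = rng.standard_normal((50, 8))                # toy vocab of 50, dim 8
    print(continuous_decode(toy_denoiser(table), table, seq_len=12, num_steps=16, rng=rng))
```

With `num_steps=16` the loop makes 16 calls to the denoiser, matching the low step budget discussed above; no position is fixed to a token before the last call.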
Key takeaway
Machine learning engineers building efficient text generation systems should consider adapting pretrained masked diffusion LMs with continuous denoising. Applying a method such as Discrete Stochastic Localization can yield higher ROUGE-1 scores in zero-shot summarization at low step budgets (e.g., at most 16 forward passes). The approach mitigates the length-quality tradeoff and adds robust correction of corrupted tokens, which makes it valuable for high-throughput, quality-sensitive NLP applications.
Key insights
Adapting pretrained masked diffusion LMs to continuous denoising via soft masking mitigates the length-quality tradeoff in few-step generation.
Principles
- Continuous denoising improves few-step text generation quality.
- Pretrained DLMs can be efficiently adapted for new capabilities.
- Soft masking with Gaussian noise enhances model robustness.
Method
Adapt a pretrained masked diffusion language model through approximately 1,000 steps of continued pretraining with Discrete Stochastic Localization (DSL), substituting continuous per-token Gaussian noise for binary masking.
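The difference between the two corruption schemes can be sketched in a few lines of NumPy. This is an illustrative sketch only: the uniform sampling of the per-token signal level and the variance-preserving mix are assumptions, not the DSL noise schedule, and the function names `binary_mask` and `soft_mask` are hypothetical.

```python
import numpy as np

def binary_mask(token_ids, mask_id, mask_prob, rng):
    """Binary masking: each token is either kept intact or replaced by mask_id."""
    drop = rng.random(token_ids.shape) < mask_prob
    return np.where(drop, mask_id, token_ids), drop

def soft_mask(token_embeds, rng):
    """Soft masking: each token embedding gets its own Gaussian noise level."""
    seq_len, dim = token_embeds.shape
    # Assumed: per-token signal level drawn uniformly from [0, 1).
    # Near 1 the token is almost clean; near 0 it is almost pure noise.
    alpha = rng.random((seq_len, 1))
    noise = rng.standard_normal((seq_len, dim))
    # Assumed variance-preserving mix of signal and noise.
    noisy = alpha * token_embeds + np.sqrt(1.0 - alpha**2) * noise
    return noisy, alpha

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ids = np.arange(8)
    print(binary_mask(ids, mask_id=-1, mask_prob=0.5, rng=rng)[0])
    embeds = rng.standard_normal((8, 4))
    noisy, alpha = soft_mask(embeds, rng)
    print(alpha.ravel().round(2))
```

Under binary masking a position carries either all of its information or none; under the soft mask each position retains a graded amount, which is what lets the model be trained on partially corrupted tokens.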
In practice
- Adapt LLaDA-8B-Instruct for continuous denoising tasks.
- Apply DSL-LLaDA for efficient zero-shot summarization.
- Utilize noisy-state robustness for text correction.
Topics
- Masked Diffusion LMs
- Continuous Denoising
- Text Generation
- LLaDA-8B-Instruct
- Discrete Stochastic Localization
- Zero-shot Summarization
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.