DMax: Aggressive Parallel Decoding for dLLMs

2026-04-09 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

DMax is a decoding and training paradigm for diffusion language models (dLLMs) that targets error accumulation in parallel decoding, raising generation speed while preserving quality. Where conventional masked dLLMs decode through binary mask-to-token transitions, DMax recasts decoding as progressive self-refinement from mask embeddings to token embeddings. Its core component is On-Policy Uniform Training, a strategy that unifies masked and uniform dLLMs so the model learns to correct errors arising from both masked inputs and its own predictions. DMax pairs this with Soft Parallel Decoding, which represents each intermediate decoding state as an interpolation between the predicted token embedding and the mask embedding, enabling iterative self-revision. Relative to the LLaDA-2.0-mini baseline, DMax raises Tokens Per Forward (TPF) from 2.04 to 5.47 on GSM8K and from 2.71 to 5.86 on MBPP, and reaches 1,338 tokens per second (TPS) on two H200 GPUs at batch size 1.
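The interpolation behind Soft Parallel Decoding can be illustrated with a minimal sketch. It assumes a simple linear blend weighted by a per-position confidence in [0, 1]; this summary does not state the exact weighting DMax uses, so the function name `soft_state`, the weighting rule, and the three-dimensional embeddings below are all hypothetical.

```python
def soft_state(token_emb, mask_emb, confidence):
    """Blend a predicted token embedding with the mask embedding.

    ASSUMPTION: a linear blend weighted by `confidence`; the actual DMax
    weighting may differ. confidence = 1.0 returns the token embedding,
    confidence = 0.0 returns the mask embedding.
    """
    if len(token_emb) != len(mask_emb):
        raise ValueError("embeddings must have the same dimension")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    return [confidence * t + (1.0 - confidence) * m
            for t, m in zip(token_emb, mask_emb)]


# Hypothetical toy embeddings, chosen only for illustration.
mask_emb = [0.0, 0.0, 1.0]
token_emb = [1.0, 2.0, 0.0]

# A half-confident position sits midway between "still masked" and "committed".
half = soft_state(token_emb, mask_emb, 0.5)
```

Under this assumption, a position the model is unsure about stays close to the mask embedding and can still be revised on the next pass, whereas a hard mask-to-token transition would commit to the token immediately.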

Key takeaway

For AI engineers optimizing dLLM inference, DMax raises the number of tokens decoded per forward pass without sacrificing generation quality. Because aggressive parallel decoding is paired with self-refinement, fewer forward passes are needed per sequence, which lowers inference latency and suits applications that require high throughput. Consider adopting DMax's training and decoding strategies in your own diffusion language models, especially for code generation and mathematical reasoning, where accumulated errors are most damaging.

Key insights

DMax enhances dLLM efficiency by reformulating decoding as progressive self-refinement and unifying training strategies.

Principles

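Progressive self-refinement from mask embeddings to token embeddings limits error accumulation in parallel decoding.
Training on both masked inputs and the model's own predictions improves error recovery.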

Method

DMax uses On-Policy Uniform Training to unify masked and uniform dLLMs, followed by Soft Parallel Decoding for iterative self-revision in embedding space.
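One way to picture the training side is a toy routine that builds a noisy input containing both mask symbols and the model's own guesses, with the clean sequence as the target. This is a sketch under stated assumptions, not the DMax recipe: the function `build_training_input`, the two ratios, the `<mask>` symbol, and the stand-in predictor `toy_predict` are all hypothetical, and a real setup would operate on token ids with a neural model.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol


def build_training_input(clean, predict, mask_ratio, keep_pred_ratio, rng):
    """Return (noisy_input, targets) for one toy training example.

    clean:            list of ground-truth tokens
    predict:          callable mapping a token list to one prediction per
                      position (stand-in for the current model)
    mask_ratio:       probability that a position is masked
    keep_pred_ratio:  probability that a masked position is replaced by the
                      model's own prediction (ASSUMPTION, for illustration)
    """
    masked_positions = [i for i in range(len(clean)) if rng.random() < mask_ratio]

    masked = list(clean)
    for i in masked_positions:
        masked[i] = MASK

    # On-policy step: the current model fills in the masked positions.
    predictions = predict(masked)

    noisy = list(masked)
    for i in masked_positions:
        if rng.random() < keep_pred_ratio:
            # The prediction may be wrong; the model must learn to fix it.
            noisy[i] = predictions[i]

    # The target is always the clean sequence.
    return noisy, list(clean)


def toy_predict(tokens):
    """Deliberately weak stand-in model: guesses 'the' for every mask."""
    return ["the" if t == MASK else t for t in tokens]


rng = random.Random(0)
clean = "two plus two equals four".split()
noisy, targets = build_training_input(clean, toy_predict, 0.6, 0.5, rng)
```

The point of the sketch is that the corrupted positions come in two kinds: some are still masked, and some hold the model's own (possibly wrong) output. Training against the clean target on both kinds is what lets a decoder revise earlier guesses instead of treating them as final.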

In practice

Achieves 1,338 TPS on two H200 GPUs at batch size 1.
Improves TPF on GSM8K from 2.04 to 5.47 over LLaDA-2.0-mini.
Increases TPF on MBPP from 2.71 to 5.86 over LLaDA-2.0-mini.
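To put the TPF figures in perspective, the short calculation below converts the reported values (2.04 to 5.47 on GSM8K, 2.71 to 5.86 on MBPP) into ratios and an estimated forward-pass count. The 256-token response length is a hypothetical example, and because TPF is an average the pass counts are estimates only. The ratio also measures decoding steps, not wall-clock time.

```python
import math


def tpf_ratio(baseline_tpf, new_tpf):
    """How many times more tokens are produced per forward pass."""
    return new_tpf / baseline_tpf


def estimated_forward_passes(num_tokens, tpf):
    """Rough number of forward passes to emit num_tokens at an average TPF."""
    return math.ceil(num_tokens / tpf)


gsm8k_ratio = tpf_ratio(2.04, 5.47)  # about 2.68
mbpp_ratio = tpf_ratio(2.71, 5.86)   # about 2.16

# Hypothetical 256-token response, using the GSM8K averages.
baseline_passes = estimated_forward_passes(256, 2.04)  # 126
dmax_passes = estimated_forward_passes(256, 5.47)      # 47
```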

Topics

Diffusion Language Models
Parallel Decoding
On-Policy Uniform Training
Soft Parallel Decoding
Model Efficiency

Code references

czg1225/DMax

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.