Fast-dLLM++: Fréchet Profile Decoding for Faster Diffusion LLM Inference
Summary
Fast-dLLM++ is a training-free extension of Fast-dLLM that speeds up inference for diffusion large language models through Fréchet profile decoding. It targets a bottleneck in parallel token generation: the earlier selection rule relies on a homogeneous high-confidence assumption, which effectively reduces each candidate set to its weakest token. Fast-dLLM++ instead uses the full sorted confidence profile, so parallel commit sets can be chosen under heterogeneous confidence. The approach generalizes Fast-dLLM's factor selector: it recovers the earlier rule when all confidences are equal and adds a provable "heterogeneity bonus" when they are uneven. As a drop-in replacement, it requires no changes to the underlying model, diffusion process, or cache implementation. Evaluations with LLaDA-8B on benchmarks including GSM8K, MATH, HumanEval, and MBPP show up to 37% higher throughput at comparable accuracy.
Key takeaway
For machine learning engineers optimizing diffusion LLM inference, Fast-dLLM++ improves throughput without retraining the model. If you already use Fast-dLLM, consider adopting this training-free extension: it reports up to 37% higher throughput at comparable accuracy on code generation and mathematical reasoning benchmarks, which makes parallel token generation cheaper to deploy.
Key insights
Fast-dLLM++ accelerates diffusion LLM inference by using heterogeneous confidence profiles to decide which tokens to commit in parallel.
Principles
- Homogeneous confidence assumptions limit parallel decoding speed.
- Exploiting heterogeneous confidence profiles improves throughput.
- A "heterogeneity bonus" is provable with profile-aware selection.
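One way to see why a bonus of this kind can exist is the classical Fréchet lower bound on a joint probability. This is a sketch under an assumption: the summary does not state the exact rule Fast-dLLM++ uses, and the symbols below are introduced only for illustration. Let A_i be the event that token i in a commit set of size n is correct, with confidence c_i = P(A_i), and let c_min be the smallest of the c_i.

```latex
\[
P\Big(\bigcap_{i=1}^{n} A_i\Big)
\;\ge\; 1 - \sum_{i=1}^{n} (1 - c_i)
\;\ge\; 1 - n\,(1 - c_{\min}),
\qquad c_{\min} = \min_{1 \le i \le n} c_i .
\]
```

The first inequality is the Fréchet bound and uses the full confidence profile. The second replaces every token by the weakest one. The two coincide when all c_i are equal, and the gap between them grows as the confidences become more uneven.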
Method
Fast-dLLM++ uses Fréchet profile decoding: it selects parallel commit sets from the full sorted confidence profile, which generalizes Fast-dLLM's factor selector to heterogeneous confidence.
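The difference between a weakest-token rule and a profile-aware rule can be illustrated with a minimal Python sketch. This is not the paper's selector. It assumes a hypothetical error `budget` and compares two ways of sizing a commit set from sorted confidences: charging every token the deficit of the weakest one, or summing the per-token deficits `1 - c_i`. The function names, the `budget` parameter, and the fallback of always committing at least one token are illustrative assumptions.

```python
from typing import Sequence


def commit_size_weakest(confidences: Sequence[float], budget: float) -> int:
    """Hypothetical homogeneous rule: a set of size n costs n * (1 - weakest)."""
    ranked = sorted(confidences, reverse=True)
    size = 0
    for n in range(1, len(ranked) + 1):
        # ranked[n - 1] is the weakest token in the top-n set.
        if n * (1.0 - ranked[n - 1]) > budget:
            break  # the cost is non-decreasing in n, so stop at the first failure
        size = n
    # Assumption: commit at least one token so that decoding always progresses.
    return max(size, 1) if ranked else 0


def commit_size_profile(confidences: Sequence[float], budget: float) -> int:
    """Hypothetical profile-aware rule: a set costs the sum of its deficits."""
    ranked = sorted(confidences, reverse=True)
    size = 0
    deficit = 0.0
    for c in ranked:
        deficit += 1.0 - c
        if deficit > budget:
            break
        size += 1
    # Assumption: commit at least one token so that decoding always progresses.
    return max(size, 1) if ranked else 0


if __name__ == "__main__":
    uneven = [0.99, 0.98, 0.97, 0.80]
    print(commit_size_weakest(uneven, budget=0.3))  # 3
    print(commit_size_profile(uneven, budget=0.3))  # 4
```

In this sketch the two selectors coincide when all confidences are equal (up to floating-point rounding), and the profile-aware one can commit more tokens under the same budget when confidences are uneven.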
In practice
- Drop-in replacement for existing Fast-dLLM decoding.
- Improves the accuracy-throughput frontier for diffusion LLMs.
- Evaluated with LLaDA-8B on benchmarks such as GSM8K.
Topics
- Diffusion LLMs
- Parallel Decoding
- Fréchet Profile Decoding
- LLaDA-8B
- Inference Optimization
- Throughput Improvement
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.