Boost Inference Performance up to 15x on NVIDIA Blackwell Using DFlash Speculative Decoding
Summary
DFlash is an open-source block-diffusion model designed for speculative decoding that significantly boosts large language model (LLM) inference performance on NVIDIA Blackwell and Hopper GPUs. It improves on traditional speculative decoding by using a block-diffusion drafter that generates an entire block of candidate tokens in a single forward pass, converting sequential drafting into parallel GPU work while maintaining output quality. DFlash increases inference performance for gpt-oss-120b on NVIDIA Blackwell by up to 15x at the same interactivity level and nearly doubles interactivity for Llama 3.1 8B compared to EAGLE-3. The research team released 20 DFlash checkpoints on Hugging Face, with recipes for NVIDIA Blackwell and Hopper GPUs and support for TensorRT-LLM, SGLang, and vLLM. For instance, DFlash delivers up to 5.8x higher throughput for Gemma 4 31B on vLLM and 5.1x for Qwen3 8B on SGLang.
Key takeaway
For MLOps engineers optimizing LLM inference on NVIDIA GPUs, DFlash is a practical way to improve throughput and interactivity. Evaluate its block-diffusion speculative decoding, which delivers up to 15x higher inference performance on Blackwell, by integrating the open-source checkpoints into your existing vLLM, SGLang, or TensorRT-LLM deployments. This lets you scale agentic and interactive AI workloads more efficiently without application refactoring.
Key insights
DFlash uses block-diffusion speculative decoding to parallelize token drafting, achieving up to 15x LLM inference speedup on NVIDIA Blackwell GPUs.
Principles
- Parallelize sequential tasks for GPU efficiency.
- Preserve output quality via target model verification.
- Condition drafters on target model hidden states.
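The first principle can be made concrete with a back-of-envelope cost model. The sketch below is not taken from the original article: the function names and all timing values are hypothetical, and it only illustrates why drafting a block in one forward pass helps when an autoregressive drafter would need one pass per drafted token.

```python
def sequential_draft_cost(block_size: int, per_token_cost: float) -> float:
    """Cost of an autoregressive drafter: one drafter pass per drafted token."""
    return block_size * per_token_cost


def parallel_draft_cost(single_pass_cost: float) -> float:
    """Cost of a block drafter: the whole block comes from one forward pass."""
    return single_pass_cost


def estimated_speedup(mean_accepted: float, target_step: float,
                      draft_cost: float) -> float:
    """Estimated speedup over plain autoregressive decoding.

    Baseline: one target step per generated token.
    Speculative: one drafting phase plus one target verification pass
    yields `mean_accepted` tokens on average.
    Assumption: verifying a block costs about one target step.
    """
    return mean_accepted * target_step / (target_step + draft_cost)


# Hypothetical numbers in arbitrary time units, for illustration only.
seq = estimated_speedup(4.0, 10.0, sequential_draft_cost(8, 1.0))
par = estimated_speedup(4.0, 10.0, parallel_draft_cost(1.5))
print(f"sequential drafter: {seq:.2f}x, block drafter: {par:.2f}x")
```

With the same acceptance length, the only difference between the two estimates is the drafting cost, which is the term that parallel drafting shrinks.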
Method
DFlash replaces autoregressive drafters with a block-diffusion drafter that predicts masked future tokens in parallel. It uses target hidden-state conditioning and KV injection for high acceptance rates, with the target model verifying the block.
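The draft-then-verify loop described above can be sketched with toy models. This is a minimal illustration of greedy speculative decoding in general, not DFlash's implementation: `toy_target_next` and `toy_draft_block` are hypothetical stand-ins, the drafter is a simple rule rather than a diffusion model, and hidden-state conditioning and KV injection are omitted. A real system would also verify the whole block in one target forward pass instead of one call per position.

```python
from typing import Callable, List

Token = int


def verify_block(target_next: Callable[[List[Token]], Token],
                 prefix: List[Token], draft: List[Token]) -> List[Token]:
    """Keep drafted tokens while they match the target's greedy choice.

    At the first mismatch, substitute the target's token and stop. If the
    whole block is accepted, append one extra token from the target.
    """
    accepted: List[Token] = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    accepted.append(target_next(prefix + accepted))
    return accepted


def speculative_generate(target_next, draft_block, prompt: List[Token],
                         max_new_tokens: int, block_size: int) -> List[Token]:
    """Alternate block drafting and target verification until done."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        draft = draft_block(out, block_size)
        out.extend(verify_block(target_next, out, draft))
    return out[:len(prompt) + max_new_tokens]


def greedy_generate(target_next, prompt: List[Token],
                    max_new_tokens: int) -> List[Token]:
    """Reference: plain token-by-token decoding with the target only."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target_next(out))
    return out


def toy_target_next(seq: List[Token]) -> Token:
    """Hypothetical target model: a fixed deterministic rule."""
    return (seq[-1] * 3 + 1) % 17


def toy_draft_block(seq: List[Token], block_size: int) -> List[Token]:
    """Hypothetical drafter: mimics the target but errs at some positions."""
    block, last = [], seq[-1]
    for i in range(block_size):
        guess = (last * 3 + 1) % 17
        if (len(seq) + i) % 4 == 3:
            guess = (guess + 1) % 17  # deliberate drafting error
        block.append(guess)
        last = guess
    return block


spec = speculative_generate(toy_target_next, toy_draft_block, [1], 20, 8)
assert spec == greedy_generate(toy_target_next, [1], 20)
```

Because every emitted token is either confirmed or supplied by the target, the speculative output matches plain greedy decoding exactly; the drafter only affects how many tokens each verification step yields.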
In practice
- Deploy the DFlash checkpoints released on Hugging Face.
- Integrate DFlash into vLLM or SGLang.
- Use NVIDIA Blackwell GPUs for DFlash acceleration.
Topics
- DFlash
- Speculative Decoding
- LLM Inference Optimization
- NVIDIA Blackwell
- TensorRT-LLM
- vLLM
- SGLang
Best for: NLP Engineer, Machine Learning Engineer, AI Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.