Boost Inference Performance up to 15x on NVIDIA Blackwell Using DFlash Speculative Decoding

Source: NVIDIA Technical Blog · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering) · Depth: Advanced, medium

Summary

DFlash is an open-source block-diffusion model designed for speculative decoding that significantly boosts large language model (LLM) inference performance on NVIDIA Blackwell and Hopper GPUs. Where traditional speculative decoding drafts candidate tokens one at a time, DFlash's block-diffusion drafter generates an entire block of candidate tokens in a single forward pass, converting sequential drafting into parallel GPU work while maintaining output quality. On NVIDIA Blackwell, DFlash increases inference performance for gpt-oss-120b by up to 15x at the same interactivity level, and it nearly doubles interactivity for Llama 3.1 8B compared to EAGLE-3. It also delivers up to 5.8x higher throughput for Gemma 4 31B on vLLM and 5.1x for Qwen3 8B on SGLang. The research team released 20 DFlash checkpoints on Hugging Face with recipes for NVIDIA Blackwell and Hopper GPUs, supporting frameworks like TensorRT-LLM, SGLang, and vLLM.

Key takeaway

For MLOps engineers optimizing LLM inference on NVIDIA GPUs, DFlash is a way to raise both throughput and interactivity. Evaluate its block-diffusion speculative decoding, which delivers up to 15x higher inference performance on Blackwell (reported for gpt-oss-120b at the same interactivity level), by integrating the open-source checkpoints into your existing vLLM, SGLang, or TensorRT-LLM deployments. This lets you scale agentic and interactive AI workloads more efficiently without application refactoring.

Key insights

DFlash uses block-diffusion speculative decoding to parallelize token drafting, achieving up to a 15x LLM inference speedup on NVIDIA Blackwell GPUs (reported for gpt-oss-120b at the same interactivity level).

Principles

Method

DFlash replaces the autoregressive drafter with a block-diffusion drafter that predicts a block of masked future tokens in parallel. To keep acceptance rates high, the drafter is conditioned on the target model's hidden states and uses KV injection; the target model then verifies the drafted block.
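
The draft-then-verify loop described above can be illustrated with a toy sketch. This is not DFlash's implementation or any framework's API: `target_next`, `draft_block`, and the toy models are hypothetical stand-ins, greedy (exact-match) acceptance is assumed, and hidden-state conditioning and KV injection are not modeled. The sketch only shows why a whole drafted block can be accepted in one step while the output stays identical to what the target model alone would produce.

```python
from typing import Callable, List, Tuple

Token = int
NextFn = Callable[[List[Token]], Token]              # hypothetical: target's greedy next token
DraftFn = Callable[[List[Token], int], List[Token]]  # hypothetical: drafter returns a whole block


def verify_block(target_next: NextFn, prefix: List[Token], draft: List[Token]) -> List[Token]:
    """Accept the longest draft prefix the target agrees with, plus one target token.

    A real system would score every drafted position in one target forward pass;
    the per-position calls here are only for readability.
    """
    accepted: List[Token] = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # target's token replaces the first mismatch
            return accepted
        accepted.append(tok)
    # Whole block accepted: the target still contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def speculative_decode(target_next: NextFn, draft_block: DraftFn, prompt: List[Token],
                       max_new_tokens: int, block_size: int) -> Tuple[List[Token], int]:
    """Return the generated sequence and the number of draft/verify steps used."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new_tokens:
        draft = draft_block(out, block_size)  # one drafter call yields the whole block
        out.extend(verify_block(target_next, out, draft))
        steps += 1
    return out[:len(prompt) + max_new_tokens], steps


def greedy_decode(target_next: NextFn, prompt: List[Token], max_new_tokens: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target_next(out))
    return out


def toy_target_next(seq: List[Token]) -> Token:
    """Toy target model: counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def toy_draft_block(seq: List[Token], k: int) -> List[Token]:
    """Toy drafter: follows the target's rule but gets the last token of each block wrong."""
    block: List[Token] = []
    last = seq[-1]
    for _ in range(k):
        last = (last + 1) % 10
        block.append(last)
    block[-1] = (block[-1] + 5) % 10  # deliberate error to exercise rejection
    return block
```

Calling `speculative_decode(toy_target_next, toy_draft_block, [0], 8, 4)` yields the same eight new tokens as `greedy_decode(toy_target_next, [0], 8)`, but in two draft/verify steps instead of eight sequential target steps. How much a real deployment gains depends on how many drafted tokens the target accepts per block.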

Best for: NLP Engineer, Machine Learning Engineer, AI Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.