Boost Inference Performance up to 15x on NVIDIA Blackwell Using DFlash Speculative Decoding

2026-06-23 · Source: NVIDIA Technical Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, medium

Summary

DFlash is an open-source block-diffusion model designed for speculative decoding that significantly boosts large language model (LLM) inference performance on NVIDIA Blackwell and Hopper GPUs. It improves on traditional speculative decoding by using a block-diffusion drafter that generates an entire block of candidate tokens in a single forward pass, turning sequential drafting into parallel GPU work while the target model's verification maintains output quality. DFlash increases inference performance for gpt-oss-120b on NVIDIA Blackwell by up to 15x at the same interactivity level and nearly doubles interactivity for Llama 3.1 8B compared to EAGLE-3. It also delivers up to 5.8x higher throughput for Gemma 4 31B on vLLM and 5.1x for Qwen3 8B on SGLang. The research team released 20 DFlash checkpoints on Hugging Face with recipes for NVIDIA Blackwell and Hopper GPUs, supporting frameworks like TensorRT-LLM, SGLang, and vLLM.

Key takeaway

For MLOps engineers optimizing LLM inference on NVIDIA GPUs, DFlash is a practical way to raise throughput and interactivity. Evaluate its block-diffusion speculative decoding, reported at up to 15x higher inference performance for gpt-oss-120b on Blackwell at the same interactivity level, by loading the open-source checkpoints into your existing vLLM, SGLang, or TensorRT-LLM deployments. This lets you scale agentic and interactive AI workloads more efficiently without refactoring the application.

Key insights

DFlash uses block-diffusion speculative decoding to parallelize token drafting, achieving up to 15x LLM inference speedup on NVIDIA Blackwell GPUs.

Principles

Parallelize sequential tasks for GPU efficiency.
Preserve output quality via target model verification.
Condition drafters on target model hidden states.
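
The principles above meet in the acceptance rate: the more drafted tokens the target model accepts, the more tokens each verification pass commits. A rough back-of-the-envelope sketch, assuming each drafted token is accepted independently with the same probability `alpha` (a simplifying assumption that is not from the source, as are the function and parameter names):

```python
def expected_tokens_per_pass(alpha: float, block_size: int) -> float:
    """Expected tokens committed per target verification pass.

    Assumptions (illustrative, not from the source article):
      - each drafted token is accepted independently with probability alpha
      - verification stops at the first rejected token
      - the target always contributes one token per pass
        (a correction after a rejection, or a bonus token if all are accepted)
    Under these assumptions the expectation is the geometric sum
    1 + alpha + alpha**2 + ... + alpha**block_size.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(block_size + 1)
    return (1.0 - alpha ** (block_size + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    # Compare a few hypothetical acceptance rates for a block of 8 drafted tokens.
    for alpha in (0.5, 0.7, 0.9):
        print(alpha, round(expected_tokens_per_pass(alpha, block_size=8), 2))
```

Under this simplified model, larger blocks only pay off when acceptance is high, which is why conditioning the drafter on the target model's hidden states matters.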

Method

DFlash replaces autoregressive drafters with a block-diffusion drafter that predicts masked future tokens in parallel. It uses target hidden-state conditioning and KV injection for high acceptance rates, with the target model verifying the block.
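
The draft-then-verify loop described above can be sketched with toy stand-ins. This is a minimal illustration of greedy speculative decoding with a block drafter, not DFlash's implementation: `target_next`, `draft_block`, and the token rule are hypothetical, and the toy drafter fills its block with a Python loop where a real block-diffusion drafter would predict all masked positions in one forward pass.

```python
from typing import List, Tuple

VOCAB = 50  # toy vocabulary size (hypothetical)


def target_next(prefix: List[int]) -> int:
    """Toy stand-in for the target model's greedy next token."""
    return (sum(prefix) * 31 + len(prefix)) % VOCAB


def draft_block(prefix: List[int], block_size: int) -> List[int]:
    """Toy stand-in for a block drafter: proposes block_size tokens.

    It mostly agrees with the target but is deliberately wrong at some
    positions so that the verifier has something to reject.
    """
    ctx, block = list(prefix), []
    for _ in range(block_size):
        tok = target_next(ctx)
        if len(ctx) % 4 == 3:  # inject a drafting error
            tok = (tok + 1) % VOCAB
        block.append(tok)
        ctx.append(tok)
    return block


def speculative_decode(prompt: List[int], max_new: int, block_size: int) -> Tuple[List[int], int]:
    """Greedy draft-then-verify loop; returns (sequence, verification passes)."""
    seq, passes = list(prompt), 0
    while len(seq) - len(prompt) < max_new:
        block = draft_block(seq, block_size)
        passes += 1  # one target verification pass per drafted block
        accepted: List[int] = []
        for tok in block:
            expected = target_next(seq + accepted)
            if tok != expected:
                accepted.append(expected)  # target's correction replaces the reject
                break
            accepted.append(tok)
        else:
            accepted.append(target_next(seq + accepted))  # bonus token, all accepted
        seq.extend(accepted)
    return seq[: len(prompt) + max_new], passes


def greedy_decode(prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    seq = list(prompt)
    for _ in range(max_new):
        seq.append(target_next(seq))
    return seq


if __name__ == "__main__":
    out, passes = speculative_decode([1, 2, 3], max_new=20, block_size=4)
    print(out == greedy_decode([1, 2, 3], max_new=20), passes)
```

Because every committed token is either confirmed or supplied by the target, the output matches plain greedy decoding while using fewer target passes than generated tokens. In a real system the per-position checks inside one pass run as a single batched forward pass of the target model.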

In practice

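Download the DFlash checkpoints released on Hugging Face and follow the accompanying Blackwell or Hopper recipes.
Integrate DFlash into vLLM, SGLang, or TensorRT-LLM.
Run on NVIDIA Blackwell GPUs for the largest reported speedups.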

Topics

DFlash
Speculative Decoding
LLM Inference Optimization
NVIDIA Blackwell
TensorRT-LLM
vLLM
SGLang

Best for: NLP Engineer, Machine Learning Engineer, AI Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.