z-lab / dflash
Summary
DFlash is a lightweight block diffusion model used as the draft model in speculative decoding for large language models (LLMs). It drafts a block of tokens in parallel rather than one token at a time, which speeds up token generation. The project provides pre-trained DFlash draft models for various LLMs, including Qwen3.6-35B-A3B, Kimi-K2.5, Qwen3.5 series (4B, 9B, 27B, 35B-A3B, 122B-A10B, 397B-A17B), Qwen3-Coder models, gpt-oss (20b, 120b), Qwen3 (4B, 8B), and Llama-3.1-8B-Instruct. DFlash integrates with the Transformers, SGLang, vLLM, and MLX (for Apple Silicon) backends, and the repository gives installation commands for each. Benchmarking scripts are also available for evaluating performance on gsm8k, math500, humaneval, mbpp, and mt-bench.
Key takeaway
For AI Engineers optimizing LLM inference, DFlash offers a concrete way to speed up token generation through speculative decoding. Consider pairing a DFlash draft model with your existing Qwen or Llama-3.1 deployment, especially if you already serve through vLLM, SGLang, or MLX. The gain matters most for applications that need low-latency LLM responses.
Key insights
DFlash uses block diffusion for speculative decoding, enabling efficient, high-quality parallel drafting in LLMs.
Principles
- Parallel drafting accelerates LLM inference.
- Block diffusion models can optimize speculative decoding.
Method
DFlash plugs into LLM serving backends (vLLM, SGLang, Transformers, MLX) as the draft model for speculative decoding. The lightweight block diffusion model proposes several tokens in parallel, and the larger target model then verifies them.
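The draft-then-verify loop described above can be illustrated with a minimal sketch. This is not DFlash's implementation: `toy_target_next` and `toy_draft_block` are hypothetical stand-ins for the target and draft models, acceptance is simple greedy matching, and the target is called once per position for readability, whereas a real backend would verify the whole block in a single forward pass.

```python
from typing import Callable, List, Tuple

Token = int
TargetFn = Callable[[List[Token]], Token]             # prefix -> next token (greedy)
DraftFn = Callable[[List[Token], int], List[Token]]   # prefix, k -> k proposed tokens


def speculative_step(prefix: List[Token], draft_block: DraftFn,
                     target_next: TargetFn, block_size: int) -> List[Token]:
    """Run one draft-then-verify round and return the tokens to append."""
    proposal = draft_block(prefix, block_size)  # drafted "in parallel" by the draft model
    accepted: List[Token] = []
    for tok in proposal:
        expected = target_next(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest of the block.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    # Whole block accepted: the target contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(prompt: List[Token], draft_block: DraftFn, target_next: TargetFn,
             block_size: int, max_new_tokens: int) -> Tuple[List[Token], int]:
    """Generate tokens with speculative decoding; also return the number of rounds."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(out, draft_block, target_next, block_size))
        steps += 1
    return out[:len(prompt) + max_new_tokens], steps


# Hypothetical toy models, for illustration only.
def toy_target_next(prefix: List[Token]) -> Token:
    """Toy target: counts upward modulo 10."""
    return (prefix[-1] + 1) % 10


def toy_draft_block(prefix: List[Token], k: int) -> List[Token]:
    """Toy draft: mimics the target but deliberately gets the token 5 wrong."""
    block: List[Token] = []
    last = prefix[-1]
    for _ in range(k):
        nxt = (last + 1) % 10
        if nxt == 5:
            nxt = 0  # deliberate mistake so that verification has something to reject
        block.append(nxt)
        last = nxt
    return block


if __name__ == "__main__":
    tokens, rounds = generate([0], toy_draft_block, toy_target_next,
                              block_size=8, max_new_tokens=12)
    print(tokens, "in", rounds, "verification rounds")
```

Under greedy verification the output matches what the target model would produce on its own; the draft model only changes how many verification rounds are needed.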
In practice
- Use DFlash draft models to accelerate Qwen and Llama-3.1 inference.
- Integrate DFlash with vLLM or SGLang for server-side acceleration.
- Benchmark DFlash performance on common NLP datasets.
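For the benchmarking step, a throughput comparison can be sketched as below. This is a generic timing helper, not the repository's benchmark scripts: `measure_throughput`, `speedup`, and the `generate_fn` callable are hypothetical names, and `generate_fn` is assumed to return the list of newly generated tokens for one prompt.

```python
import time
from typing import Callable, List, Sequence


def measure_throughput(generate_fn: Callable[[str], List[int]],
                       prompts: Sequence[str]) -> float:
    """Return generated tokens per second over all prompts."""
    total_tokens = 0
    start = time.perf_counter()
    for prompt in prompts:
        total_tokens += len(generate_fn(prompt))
    # Guard against a zero interval when generate_fn is trivially fast.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return total_tokens / elapsed


def speedup(baseline_tps: float, speculative_tps: float) -> float:
    """Ratio of speculative-decoding throughput to baseline throughput."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return speculative_tps / baseline_tps
```

Running `measure_throughput` once with plain decoding and once with the draft model enabled, on the same prompts and hardware, gives the two numbers that `speedup` compares.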
Topics
- Speculative Decoding
- Block Diffusion Model
- LLM Acceleration
- vLLM Integration
- SGLang Backend
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by GitHub Trending: All languages.