z-lab / dflash

2026-01-04 · Source: Github Trending: All languages · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, short

Summary

DFlash is a lightweight block diffusion model that serves as the draft model in speculative decoding for large language models (LLMs). It drafts a block of tokens in parallel rather than one token at a time, so the target model can verify several tokens per step, which speeds up generation. The project provides pre-trained DFlash draft models for Qwen3.6-35B-A3B, Kimi-K2.5, the Qwen3.5 series (4B, 9B, 27B, 35B-A3B, 122B-A10B, 397B-A17B), Qwen3-Coder models, gpt-oss (20b, 120b), Qwen3 (4B, 8B), and Llama-3.1-8B-Instruct. It integrates with the Transformers, SGLang, vLLM, and MLX (Apple Silicon) backends, with installation commands given for each. Benchmarking scripts cover gsm8k, math500, humaneval, mbpp, and mt-bench.

Key takeaway

For AI engineers optimizing LLM inference, DFlash offers a concrete way to speed up token generation through speculative decoding. Consider pairing a DFlash draft model with an existing Qwen or Llama-3.1 deployment, especially on vLLM, SGLang, or MLX. The actual throughput gain depends on the target model, backend, and workload, so measure it with the provided benchmarking scripts before relying on it for latency-sensitive applications.

Key insights

DFlash uses block diffusion for speculative decoding, enabling efficient, high-quality parallel drafting in LLMs.

Principles

Parallel drafting accelerates LLM inference.
Block diffusion models can optimize speculative decoding.

Method

DFlash plugs into LLM serving backends (vLLM, SGLang, Transformers, MLX) as the draft model for speculative decoding: a lightweight block diffusion model drafts a block of tokens in parallel, and the larger target model then verifies them.
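
The draft-then-verify loop described above can be sketched with toy stand-ins for both models. This is a minimal illustration of generic greedy speculative decoding, not DFlash's code: the names `speculative_decode`, `toy_target`, and `toy_draft` are hypothetical, the exact-match acceptance rule is an assumed simplification, and the "models" are plain functions over integer tokens. A real backend scores all drafted positions in one batched forward pass of the target; here that pass is simulated position by position and counted once.

```python
from typing import Callable, List, Tuple

Token = int


def speculative_decode(
    target_next: Callable[[List[Token]], Token],
    draft_block: Callable[[List[Token], int], List[Token]],
    prompt: List[Token],
    max_new_tokens: int,
    block_size: int = 4,
) -> Tuple[List[Token], int]:
    """Greedy speculative decoding with a block drafter (toy sketch).

    Returns the generated sequence and the number of simulated
    target verification passes.
    """
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The draft model proposes a whole block at once.
        draft = draft_block(tokens, block_size)
        # 2. The target checks the block; counted as one pass.
        target_passes += 1
        accepted: List[Token] = []
        for proposed in draft:
            expected = target_next(tokens + accepted)
            if proposed != expected:
                # First mismatch: keep the target's token, drop the rest.
                accepted.append(expected)
                break
            accepted.append(proposed)
        else:
            # Whole block accepted: the same pass yields one extra token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], target_passes


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token], k: int) -> List[Token]:
    """Hypothetical drafter: agrees with the target except at token 7."""
    out: List[Token] = []
    last = ctx[-1]
    for _ in range(k):
        nxt = (last + 1) % 10
        if nxt == 7:
            nxt = 0  # deliberate drafting error
        out.append(nxt)
        last = nxt
    return out


if __name__ == "__main__":
    seq, passes = speculative_decode(toy_target, toy_draft, [0], 12)
    print(seq, passes)
```

Because every accepted token is either confirmed or supplied by the target, the output matches what the target alone would produce under greedy decoding; the drafter only changes how many target passes are needed.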

In practice

Use DFlash draft models to accelerate Qwen and Llama-3.1 inference.
Integrate DFlash with vLLM or SGLang for server-side acceleration.
Benchmark DFlash with the provided scripts on gsm8k, math500, humaneval, mbpp, and mt-bench.
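
When comparing a baseline run against a speculative-decoding run, two numbers are commonly reported: throughput in tokens per second and the average number of tokens produced per target pass. The sketch below computes both from per-request records. It is not part of the repository's benchmarking scripts; `RunRecord`, `summarize`, and `speedup` are hypothetical names, and the records in the usage example are made-up placeholders rather than measured results.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class RunRecord:
    """One generation request (hypothetical record layout)."""
    new_tokens: int      # tokens generated for this request
    seconds: float       # wall-clock generation time
    target_passes: int   # target-model forward passes used


def summarize(records: Iterable[RunRecord]) -> Dict[str, float]:
    """Aggregate throughput and tokens per target pass over all requests."""
    records = list(records)
    total_tokens = sum(r.new_tokens for r in records)
    total_seconds = sum(r.seconds for r in records)
    total_passes = sum(r.target_passes for r in records)
    return {
        "tokens_per_second": total_tokens / total_seconds,
        "tokens_per_target_pass": total_tokens / total_passes,
    }


def speedup(baseline: Iterable[RunRecord], speculative: Iterable[RunRecord]) -> float:
    """Throughput ratio of the speculative run over the baseline run."""
    base = summarize(baseline)["tokens_per_second"]
    spec = summarize(speculative)["tokens_per_second"]
    return spec / base


if __name__ == "__main__":
    # Placeholder numbers for illustration only, not benchmark results.
    baseline_runs = [RunRecord(100, 4.0, 100), RunRecord(100, 6.0, 100)]
    speculative_runs = [RunRecord(100, 2.0, 25), RunRecord(100, 3.0, 25)]
    print(summarize(speculative_runs))
    print(speedup(baseline_runs, speculative_runs))
```

Aggregating totals before dividing weights long requests more heavily than averaging per-request rates would; either choice is defensible as long as the baseline and speculative runs are summarized the same way.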

Topics

Speculative Decoding
Block Diffusion Model
LLM Acceleration
vLLM Integration
SGLang Backend

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Github Trending: All languages.