Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding
Summary
Domino is a speculative decoding framework that accelerates Large Language Model (LLM) inference by addressing the trade-off between draft quality and drafting cost. Autoregressive drafters model token dependencies well but pay a sequential cost for every drafted token; parallel drafters are cheap but model dependencies within the block weakly. Domino decouples causal dependency modeling from expensive autoregressive draft execution: a parallel draft backbone generates preliminary draft distributions for an entire block, and a lightweight Domino head then refines them with prefix-dependent causal information. To stabilize training under teacher-forced causal encoding, the framework uses a base-anchored curriculum that first strengthens the parallel backbone and then gradually shifts optimization toward the causally corrected final distribution. In experiments on Qwen3 models, Domino achieves up to 5.49× end-to-end speedup with the Transformers backend and up to 5.8× throughput speedup under SGLang serving.
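Why draft quality matters can be illustrated with a minimal sketch of greedy verification, a common speculative decoding rule in which the target model keeps the longest draft prefix it agrees with. The summary does not say which verification rule Domino uses, so the rule and the function names `accepted_length` and `tokens_per_target_pass` are assumptions made for illustration only.

```python
from typing import List


def accepted_length(draft: List[int], target: List[int]) -> int:
    """Count the leading draft tokens that match the target model's own
    greedy choice at the same position (assumed greedy verification)."""
    n = 0
    for d, t in zip(draft, target):
        if d != t:
            break  # first disagreement ends the accepted prefix
        n += 1
    return n


def tokens_per_target_pass(draft: List[int], target: List[int]) -> int:
    """Tokens emitted by one target forward pass: the accepted draft
    tokens plus the one token the target model supplies itself."""
    return accepted_length(draft, target) + 1


# Hypothetical token ids: the third draft token disagrees with the target.
example_draft = [5, 7, 9, 2]
example_target = [5, 7, 1, 2]
```

With these example values, two draft tokens are accepted and one target pass yields three tokens. A drafter that gets more of the block right raises this number, while a drafter that is expensive to run eats into the gain; that is the trade-off described above.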
Key takeaway
For Machine Learning Engineers and AI Architects optimizing LLM inference, Domino offers a way to raise throughput and cut latency. The reported gains, up to 5.8× throughput under SGLang serving, were measured on Qwen3 models, so results on other architectures may differ. If you serve Qwen3 or similar models, Domino's decoupled causal modeling approach is worth evaluating for your serving infrastructure.
Key insights
Domino accelerates LLM inference by decoupling causal modeling from autoregressive drafting, using parallel generation refined by causal information.
Principles
- Decouple causal modeling from draft execution.
- Refine parallel drafts with prefix-dependent causal information.
- Stabilize training with a base-anchored curriculum.
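The curriculum principle can be sketched as a loss that starts on the parallel backbone alone and then shifts weight toward the causally corrected output. The summary does not give the actual schedule, so the linear ramp, the step counts, and the names `final_loss_weight` and `curriculum_loss` below are all assumptions, not the authors' recipe.

```python
def final_loss_weight(step: int, warmup_steps: int, ramp_steps: int) -> float:
    """Weight placed on the causally corrected (final) loss at a given step.

    Assumed schedule: zero during warmup, then a linear ramp up to 1.
    """
    if step < warmup_steps:
        return 0.0  # phase 1: optimize the parallel backbone only
    progress = (step - warmup_steps) / max(ramp_steps, 1)
    return min(progress, 1.0)  # phase 2: shift toward the final loss


def curriculum_loss(base_loss: float, final_loss: float, step: int,
                    warmup_steps: int = 1000, ramp_steps: int = 4000) -> float:
    """Blend the backbone (base) loss and the final loss for this step."""
    w = final_loss_weight(step, warmup_steps, ramp_steps)
    return (1.0 - w) * base_loss + w * final_loss
```

In this sketch, step 3000 sits halfway through the ramp, so the two losses are weighted equally. A real schedule might keep some weight on the base loss throughout, which would be one way to read "base-anchored".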
Method
Domino uses a parallel draft backbone for initial distributions, then a lightweight head refines them with causal information. A base-anchored training curriculum stabilizes teacher-forced causal encoding.
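A toy sketch of the two-stage idea: per-position logits come from a single parallel pass, and a cheap correction that depends on the already chosen prefix adjusts each position before a token is picked. The bias table keyed on the previous token stands in for the Domino head, which the summary does not specify in detail; `refine_block`, `prefix_bias`, and the example values are hypothetical.

```python
from typing import List


def refine_block(base_logits: List[List[float]],
                 prefix_bias: List[List[float]],
                 last_token: int) -> List[int]:
    """Pick draft tokens left to right, correcting each parallel
    distribution with a bias that depends on the previous token.

    base_logits[k][t]: backbone score for token t at block position k.
    prefix_bias[p][t]: assumed correction for token t after token p.
    """
    draft: List[int] = []
    prev = last_token
    for logits in base_logits:  # one entry per block position
        corrected = [score + prefix_bias[prev][t]
                     for t, score in enumerate(logits)]
        prev = max(range(len(corrected)), key=corrected.__getitem__)
        draft.append(prev)
    return draft


# Vocabulary of 3 tokens, block of 2 positions (made-up numbers).
example_logits = [[2.0, 1.0, 0.0], [1.0, 0.9, 0.0]]
no_bias = [[0.0, 0.0, 0.0] for _ in range(3)]
after_zero_prefer_one = [[0.0, 0.5, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
```

With no correction the parallel scores alone give the draft `[0, 0]`; with the prefix-dependent bias the second position flips and the draft becomes `[0, 1]`. The heavy computation (the logits) happens once for the whole block, and only the lightweight correction runs position by position.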
In practice
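- Accelerate LLM inference: up to 5.49× end-to-end speedup with the Transformers backend.
- Improve LLM serving throughput: up to 5.8× under SGLang.
- Speed up Qwen3 inference, the model family used in the reported experiments.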
Topics
- Speculative Decoding
- LLM Inference
- Causal Modeling
- Autoregressive Drafting
- Qwen3 Models
- SGLang Serving
Best for: MLOps Engineer, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.