Domino: Decoupling Causal Modeling from Autoregressive Drafting in Speculative Decoding

Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Natural Language Processing · Depth: Expert, quick

Summary

Domino is a speculative decoding framework that accelerates Large Language Model (LLM) inference by addressing the trade-off between draft quality and drafting cost. Autoregressive drafters incur sequential overhead, while parallel drafters model dependencies between drafted tokens only weakly. Domino decouples causal dependency modeling from autoregressive draft execution: a parallel draft backbone generates preliminary draft distributions for an entire block, and a lightweight Domino head then refines them with prefix-dependent causal information. To stabilize training under teacher-forced causal encoding, the framework uses a base-anchored training curriculum that first strengthens the parallel backbone and then gradually shifts optimization toward the causally corrected final distribution. In experiments on Qwen3 models, Domino achieves up to 5.49× end-to-end speedup with the Transformers backend and up to 5.8× throughput speedup under SGLang serving.
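The curriculum described above can be pictured as a loss whose weighting moves from the backbone's base distribution to the refined final distribution. The sketch below is illustrative only: the summary does not give the actual schedule, so the linear ramp, the two-term loss, and the names `curriculum_weights`, `combined_loss`, `warmup_steps`, and `ramp_steps` are all assumptions.

```python
def curriculum_weights(step, warmup_steps, ramp_steps):
    """Return (base_weight, final_weight) for a hypothetical two-term loss.

    Assumption: a linear ramp. The schedule used by Domino is not
    specified in this summary.
    """
    if ramp_steps <= 0:
        raise ValueError("ramp_steps must be positive")
    if step < warmup_steps:
        # Phase 1: train only the parallel backbone's base distribution.
        return 1.0, 0.0
    # Phase 2: shift weight toward the causally corrected final distribution.
    progress = min(1.0, (step - warmup_steps) / ramp_steps)
    return 1.0 - progress, progress


def combined_loss(base_loss, final_loss, step, warmup_steps=1000, ramp_steps=4000):
    """Blend the two loss terms according to the current curriculum weights."""
    w_base, w_final = curriculum_weights(step, warmup_steps, ramp_steps)
    return w_base * base_loss + w_final * final_loss
```

With these hypothetical defaults, the first 1000 steps optimize only the base term, and the weight reaches the final term entirely by step 5000.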

Key takeaway

For Machine Learning Engineers and AI Architects optimizing LLM inference, Domino offers a way to raise throughput and reduce latency. The reported gains, up to 5.8× throughput speedup under SGLang serving, were measured on Qwen3 models. If you deploy Qwen3 or similar architectures, Domino's decoupled causal modeling approach is worth evaluating for your serving infrastructure.

Key insights

Domino accelerates LLM inference by decoupling causal modeling from autoregressive drafting, using parallel generation refined by causal information.

Principles

Method

Domino uses a parallel draft backbone for initial distributions, then a lightweight head refines them with causal information. A base-anchored training curriculum stabilizes teacher-forced causal encoding.
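A toy sketch can make the draft-then-refine idea concrete. Everything here is a stand-in rather than Domino's actual architecture: the backbone output is a hard-coded list of per-position logits, the "head" is a lookup table of additive logit corrections keyed on the previous drafted token, and verification is simple greedy prefix matching. The names `refine_block`, `accept_prefix`, `BASE_LOGITS`, and `CORRECTION` are hypothetical.

```python
def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def refine_block(base_logits, prev_token, correction):
    """Refine parallel draft logits with prefix-dependent corrections.

    base_logits: one logit list per block position, produced in a single
                 parallel pass (toy stand-in for the draft backbone).
    correction:  correction[t] is an additive logit bias applied when the
                 preceding token is t (toy stand-in for the lightweight head).
    """
    draft = []
    for logits in base_logits:
        corrected = [l + c for l, c in zip(logits, correction[prev_token])]
        prev_token = argmax(corrected)  # greedy pick conditions the next position
        draft.append(prev_token)
    return draft


def accept_prefix(draft, target_tokens):
    """Greedy verification: count drafted tokens that match the target model."""
    accepted = 0
    for d, t in zip(draft, target_tokens):
        if d != t:
            break
        accepted += 1
    return accepted


# Toy vocabulary of 3 tokens, block of 3 positions.
BASE_LOGITS = [[2.0, 1.0, 0.0], [1.0, 1.1, 0.5], [0.0, 1.0, 1.2]]
CORRECTION = [[0.0, -1.0, 1.0], [0.0, 0.0, 0.0], [0.0, 1.0, -1.0]]

parallel_only = [argmax(l) for l in BASE_LOGITS]
refined = refine_block(BASE_LOGITS, prev_token=1, correction=CORRECTION)
```

In this toy setup the uncorrected parallel draft and the refined draft differ from the second position onward, which is the kind of dependency error a prefix-aware correction is meant to repair. The expensive pass still runs once per block, and only the cheap correction is applied position by position.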

Best for: MLOps Engineer, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.