Bastion: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting

2026-05-28 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

BASTION is a budget-aware speculative decoding framework for accelerating large language model inference. It uses tree-structured block-diffusion drafting: rather than relying on a static tree topology, it builds a query-dependent draft tree that balances draft quality against hardware constraints. The framework combines three components: an acceptance surrogate that estimates expected accepted length, an online latency estimator that calibrates a hardware-aware roofline model, and an adaptive best-first expansion procedure. BASTION is training-free, preserves the target model's distribution, and requires no per-setting tuning. It achieves up to a 6.61x speedup over standard autoregressive decoding and outperforms existing block-diffusion baselines by 39% across diverse benchmarks and GPU architectures.

Key takeaway

Machine learning engineers optimizing LLM inference should consider evaluating BASTION. Its dynamic, hardware-aware tree construction delivers up to a 6.61x speedup over standard autoregressive decoding and a 39% gain over existing block-diffusion baselines, without altering the target model's distribution or requiring training or per-setting tuning.

Key insights

BASTION dynamically constructs query-dependent trees for speculative decoding, balancing draft quality and hardware constraints.

Principles

Balance draft quality with hardware constraints.
Preserve the target model's distribution.
Achieve efficiency without training or tuning.
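
To make "preserve the target model's distribution" concrete, here is a minimal Python sketch of the standard single-token accept/reject rule from speculative sampling. It is an illustration only: the summary does not state which verification rule BASTION applies to its trees, and the function name verify_token and the dictionary-based distributions are hypothetical.

```python
import random


def verify_token(token, p_target, q_draft, rng=random):
    """Standard speculative-sampling check for one drafted token.

    p_target and q_draft map token id -> probability. The token is assumed
    to have been sampled from q_draft, so q_draft[token] > 0.
    Returns (emitted_token, accepted).
    """
    # Accept the drafted token with probability min(1, p / q).
    accept_prob = min(1.0, p_target[token] / q_draft[token])
    if rng.random() < accept_prob:
        return token, True

    # On rejection, resample from the normalized residual max(0, p - q).
    # A rejection implies p != q, so the residual mass is strictly positive.
    residual = {t: max(0.0, p_target[t] - q_draft.get(t, 0.0)) for t in p_target}
    total = sum(residual.values())
    tokens = list(residual)
    weights = [residual[t] / total for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0], False
```

Under this rule the emitted token follows the target distribution regardless of how good the draft is; draft quality only changes how often a drafted token is accepted, which is what determines the speedup.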

Method

BASTION combines an acceptance surrogate, an online latency estimator, and adaptive best-first expansion, growing the draft tree until the marginal gain in expected accepted length no longer justifies the added verification cost.
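
The stopping rule described above can be sketched in Python as follows. This is a toy illustration under stated assumptions, not BASTION's implementation: the acceptance surrogate is taken to be the product of draft probabilities along a path, the drafter is assumed to supply one per-position distribution for the whole block, the roofline model is a fixed max(memory floor, per-token compute) rather than one calibrated online, and all names and constants (Candidate, verify_latency, grow_tree, t_memory, t_per_token, t_draft) are hypothetical.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Candidate:
    neg_path_prob: float                  # heap key: most likely path pops first
    tokens: tuple = field(compare=False)  # token ids from the root to this node


def verify_latency(n_tokens, t_memory=1.0, t_per_token=0.05):
    """Roofline-style cost: a memory-bound floor until compute dominates."""
    return max(t_memory, n_tokens * t_per_token)


def grow_tree(draft_probs, max_depth, t_draft=0.2, latency_fn=verify_latency):
    """Grow a draft tree best-first until throughput stops improving.

    draft_probs[d] maps token id -> assumed draft probability at depth d.
    Returns the chosen paths and the surrogate expected accepted length.
    """
    heap = [Candidate(-p, (tok,)) for tok, p in draft_probs[0].items()]
    heapq.heapify(heap)
    tree = []
    expected_len = 1.0  # the target model always contributes one token

    while heap:
        best = heapq.heappop(heap)
        gain = -best.neg_path_prob  # surrogate: chance this whole path is accepted
        n = len(tree)
        rate_now = expected_len / (t_draft + latency_fn(n))
        rate_next = (expected_len + gain) / (t_draft + latency_fn(n + 1))
        if rate_next <= rate_now:
            break  # marginal gain no longer justifies the verification cost
        tree.append(best.tokens)
        expected_len += gain
        depth = len(best.tokens)
        if depth < max_depth:
            for tok, p in draft_probs[depth].items():
                heapq.heappush(heap, Candidate(-gain * p, best.tokens + (tok,)))
    return tree, expected_len
```

Because a child's path probability never exceeds its parent's, nodes leave the heap in non-increasing order of gain, so stopping at the first node that fails to raise expected tokens per unit time is a reasonable greedy cutoff. While verification stays under the memory-bound floor, extra nodes are effectively free and the tree keeps growing; once the per-token compute term dominates, low-probability branches are cut.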

In practice

Achieve up to a 6.61x speedup over standard autoregressive decoding.
Outperform existing block-diffusion baselines by 39%.
Utilize dynamic tree topologies for efficiency.

Topics

Speculative Decoding
Large Language Models
Block Diffusion Drafting
Inference Optimization
Hardware Constraints
Tree-based Decoding

Best for: NLP Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.