Enabling Speculative Speculative Decoding on MI300X
Summary
AMD has enabled Speculative Speculative Decoding (SSD) on its Instinct MI300X GPUs using ROCm 7.2, significantly accelerating large language model (LLM) inference. SSD is an advanced speculative decoding algorithm that eliminates the sequential dependency in conventional SD by pre-computing next-round speculations on separate hardware while the target model verifies current tokens. This approach hides draft latency, crucial for interactive LLM systems. The enablement involved extensive engineering, including identifying and upstreaming correctness fixes in FlashInfer for bf16/fp16 MMA type-confusion and custom-mask attention, adapting the attention stack for ROCm, adding dual tree-decode backends, patching JIT and graph runtime differences, and automating the setup flow. Benchmarks using Llama-3.1-70B-Instruct and Llama-3.2-1B-Instruct showed SSD achieving 225.86 tokens/s with TP=4, a 4.32x speedup over autoregressive decoding and 1.64x over standard speculative decoding on MI300X.
Key takeaway
For AI Engineers deploying LLMs on AMD Instinct MI300X GPUs, adopting Speculative Speculative Decoding (SSD) can significantly boost inference throughput. Your systems can achieve up to 4.32x speedup over autoregressive decoding and 1.64x over standard speculative decoding, improving latency-sensitive interactive applications. Consider integrating the provided ROCm-optimized SSD implementation to leverage asynchronous multi-device scheduling and advanced decoding algorithms for superior performance.
Key insights
Speculative Speculative Decoding (SSD) hides draft latency by asynchronously pre-computing future token speculations on separate hardware.
Principles
- Overlap draft and verify work.
- Disaggregate drafting and verification.
- Prepare likely future paths in advance.
Method
SSD uses a smaller draft model on separate hardware to pre-compute next-round speculations for multiple verification outcomes, storing them in a cache for immediate return if matched.
In practice
- Use "SSD_TREE_DECODE_BACKEND" for debugging.
- Automate environment setup with "setup_rocm.sh".
- Implement custom attention masks for tree decoding.
Topics
- Speculative Speculative Decoding
- LLM Inference Acceleration
- AMD Instinct MI300X
- ROCm
- FlashInfer
- Asynchronous Decoding
Code references
Best for: Machine Learning Engineer, AI Engineer, AI Scientist
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.