Enabling Speculative Speculative Decoding on MI300X

Source: AMD ROCm Blogs · Field: Technology & Digital (Artificial Intelligence & Machine Learning; Software Development & Engineering) · Depth: Expert, long

Summary

AMD has enabled Speculative Speculative Decoding (SSD) on Instinct MI300X GPUs with ROCm 7.2 to accelerate large language model (LLM) inference. In conventional speculative decoding (SD), drafting and verification alternate: the draft model cannot start the next round until the target model has finished verifying the current one. SSD removes this sequential dependency by pre-computing next-round speculations on separate hardware while the target model verifies the current tokens. This hides draft latency, which matters most for interactive LLM systems.

The enablement required several pieces of engineering work: identifying and upstreaming FlashInfer correctness fixes for a bf16/fp16 MMA type confusion and for custom-mask attention, adapting the attention stack for ROCm, adding dual tree-decode backends, patching JIT and graph runtime differences, and automating the setup flow.

In benchmarks with Llama-3.1-70B-Instruct and Llama-3.2-1B-Instruct, SSD reached 225.86 tokens/s with tensor parallelism TP=4 on MI300X, a 4.32x speedup over autoregressive decoding and 1.64x over standard speculative decoding.
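To put the reported speedups in absolute terms, the short calculation below back-derives the baseline throughputs they imply. These are approximations computed from the rounded ratios quoted in this summary, not figures reported by the source.

```python
# Figures quoted in the summary.
ssd_tokens_per_s = 225.86
speedup_vs_autoregressive = 4.32
speedup_vs_standard_sd = 1.64

# Back-calculated, approximate baselines (assumption: derived from the
# rounded speedup ratios, so they are not reported measurements).
implied_autoregressive = ssd_tokens_per_s / speedup_vs_autoregressive
implied_standard_sd = ssd_tokens_per_s / speedup_vs_standard_sd

# Implied speedup of standard speculative decoding over autoregressive decoding.
implied_sd_over_ar = speedup_vs_autoregressive / speedup_vs_standard_sd

print(f"implied autoregressive: ~{implied_autoregressive:.1f} tokens/s")
print(f"implied standard SD:    ~{implied_standard_sd:.1f} tokens/s")
print(f"implied SD vs. autoregressive: ~{implied_sd_over_ar:.2f}x")
```

Under these assumptions, most of the gain over autoregressive decoding comes from speculation itself, and SSD adds a further 1.64x on top by overlapping drafting with verification.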

Key takeaway

For AI engineers deploying LLMs on AMD Instinct MI300X GPUs, Speculative Speculative Decoding (SSD) can substantially raise inference throughput. In the reported benchmark it delivered up to a 4.32x speedup over autoregressive decoding and 1.64x over standard speculative decoding, which benefits latency-sensitive interactive applications. Consider integrating the ROCm-optimized SSD implementation to take advantage of its asynchronous multi-device scheduling.

Key insights

Speculative Speculative Decoding (SSD) hides draft latency by asynchronously pre-computing future token speculations on separate hardware.
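A simplified timing model can make "hiding draft latency" concrete. The sketch below is an illustration, not the source's analysis: it assumes the pre-computed speculation always finishes within the verification window, that a cache miss falls back to drafting synchronously, and it ignores communication overhead. The function names and the example timings are hypothetical.

```python
def sd_round_time(t_draft: float, t_verify: float) -> float:
    """Standard speculative decoding: draft, then verify, strictly in sequence."""
    return t_draft + t_verify


def ssd_expected_round_time(t_draft: float, t_verify: float, hit_rate: float) -> float:
    """Simplified SSD model (assumption): on a cache hit the next draft is
    already available, so the round costs only verification; on a miss the
    draft model must run before the next verification can start."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return t_verify + (1.0 - hit_rate) * t_draft


# Hypothetical per-round timings in milliseconds.
t_draft, t_verify = 10.0, 30.0
print(sd_round_time(t_draft, t_verify))                 # sequential round
print(ssd_expected_round_time(t_draft, t_verify, 0.5))  # half the lookups hit
print(ssd_expected_round_time(t_draft, t_verify, 1.0))  # every lookup hits
```

In this model the benefit scales with the cache hit rate: at a hit rate of 0 the round time equals that of standard speculative decoding, and at a hit rate of 1 the draft cost disappears from the critical path entirely.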

Method

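SSD runs a smaller draft model on separate hardware to pre-compute next-round speculations for multiple possible verification outcomes. The results are stored in a cache, so when the actual verification outcome matches a cached entry, the corresponding speculation is returned immediately.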
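The caching idea can be sketched in a few lines. This is a minimal, single-process illustration and not the ROCm implementation: the `SpeculationCache` class, the `toy_draft` stand-in for a draft model, and the choice to key the cache by (number of accepted draft tokens, bonus token) are all assumptions made for the example. A real system would run `precompute` asynchronously on separate hardware while the target model verifies.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical cache key: (accepted draft tokens, bonus token chosen by the target).
Outcome = Tuple[int, int]


def toy_draft(prefix: List[int], k: int) -> List[int]:
    """Deterministic stand-in for a draft model; returns k 'next tokens'."""
    ctx, out = list(prefix), []
    for _ in range(k):
        nxt = (sum(ctx) * 31 + len(ctx)) % 100
        out.append(nxt)
        ctx.append(nxt)
    return out


class SpeculationCache:
    def __init__(self, draft_fn: Callable[[List[int], int], List[int]], k: int):
        self.draft_fn = draft_fn
        self.k = k
        self.cache: Dict[Outcome, List[int]] = {}
        self.hits = 0
        self.misses = 0

    def precompute(self, prefix: List[int], draft: List[int], bonus_candidates: List[int]) -> None:
        """Speculate the next round for each anticipated verification outcome."""
        self.cache.clear()
        for n_accepted in range(len(draft) + 1):
            for bonus in bonus_candidates:
                next_prefix = prefix + draft[:n_accepted] + [bonus]
                self.cache[(n_accepted, bonus)] = self.draft_fn(next_prefix, self.k)

    def lookup(self, prefix: List[int], draft: List[int], outcome: Outcome) -> List[int]:
        """Return the cached speculation on a match, else draft synchronously."""
        if outcome in self.cache:
            self.hits += 1
            return self.cache[outcome]
        self.misses += 1
        n_accepted, bonus = outcome
        return self.draft_fn(prefix + draft[:n_accepted] + [bonus], self.k)


prefix = [1, 2, 3]
draft = toy_draft(prefix, 3)
cache = SpeculationCache(toy_draft, k=3)
cache.precompute(prefix, draft, bonus_candidates=[7, 8])
hit_result = cache.lookup(prefix, draft, (2, 7))    # anticipated outcome
miss_result = cache.lookup(prefix, draft, (1, 99))  # unanticipated bonus token
print(cache.hits, cache.misses)
```

Either path returns the same speculation the draft model would have produced for the verified prefix; the cache only changes when that work happens, which is what keeps it off the critical path on a hit.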

Best for: Machine Learning Engineer, AI Engineer, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.