Productionizing TurboQuant on AMD GPUs for KV-Cache-Bound LLM Inference
Summary
AMD has productionized TurboQuant (TQ), a KV-cache compression algorithm, on its GPUs for large language model inference, specifically targeting KV-cache-bound agentic and long-context workloads. This implementation, deployed via vLLM, refines the original TQ algorithm by incorporating Walsh-Hadamard rotation, asymmetric key/value treatment, boundary-layer skipping for full-attention models, and removing Quantized Johnson-Lindenstrauss (QJL). Through custom Triton, HIP, and FlyDSL kernels, the optimized TQ achieved up to a 3.6x end-to-end speedup over the open-source vLLM TQ baseline. In agentic workloads with 100 conversations and ~25K prefixes on AMD Instinct MI355X GPUs, TQ4/4 (4-bit Key/Value) boosted KV cache hit rates from 5.3% to 67.7% and reduced P50 Time-to-First-Token from 13.9 seconds to 0.89 seconds. The FlyDSL TQ kernel reached 95% of BF16 and 88% of FP8 KV throughput, with TQ4/4 recommended for its balance of compression, accuracy, and performance, especially for full attention models.
Key takeaway
For MLOps Engineers deploying LLMs on AMD GPUs for agentic or long-context workloads, consider implementing the optimized TurboQuant (TQ) 4/4 configuration. This approach significantly improves KV-cache hit rates and reduces Time-to-First-Token, especially when memory capacity is a bottleneck. You should prioritize kernel-optimized TQ implementations, like those using FlyDSL, to achieve near-BF16 performance while benefiting from substantial memory compression. Evaluate your model's attention architecture to determine optimal TQ settings.
Key insights
Optimized TurboQuant on AMD GPUs significantly boosts LLM inference throughput and reduces latency for KV-cache-bound agentic workloads.
Principles
- KV cache capacity often bottlenecks long-context LLM inference.
- Asymmetric quantization (K vs. V) improves accuracy.
- Kernel-level optimizations are crucial for low-bit quantization performance.
Method
The productionized TurboQuant algorithm applies Walsh-Hadamard rotation to keys, uses standard uniform quantization for values, skips boundary layers for full-attention models, and omits QJL for improved accuracy and performance.
In practice
- Default to TQ4/4 for balanced compression, accuracy, and performance.
- Skip boundary layer quantization for full-attention models.
- Use Walsh-Hadamard rotation over random rotation for keys.
Topics
- TurboQuant
- KV Cache Compression
- AMD GPUs
- LLM Inference
- Agentic AI
- vLLM
Code references
Best for: Machine Learning Engineer, AI Engineer, MLOps Engineer
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.