Productionizing TurboQuant on AMD GPUs for KV-Cache-Bound LLM Inference

2026-06-11 · Source: AMD ROCm Blogs · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Cloud Computing & IT Infrastructure · Depth: Expert, extended

Summary

AMD has productionized TurboQuant (TQ), a KV-cache compression algorithm, for large language model inference on its GPUs, targeting KV-cache-bound agentic and long-context workloads. The implementation, deployed via vLLM, refines the original TQ algorithm in four ways: it adds Walsh-Hadamard rotation, treats keys and values asymmetrically, skips boundary layers for full-attention models, and removes the Quantized Johnson-Lindenstrauss (QJL) step. With custom Triton, HIP, and FlyDSL kernels, the optimized TQ achieved up to a 3.6x end-to-end speedup over the open-source vLLM TQ baseline. In an agentic workload with 100 conversations and ~25K prefixes on AMD Instinct MI355X GPUs, TQ4/4 (4-bit keys, 4-bit values) raised the KV-cache hit rate from 5.3% to 67.7% and cut P50 Time-to-First-Token from 13.9 seconds to 0.89 seconds. The FlyDSL TQ kernel reached 95% of BF16 and 88% of FP8 KV throughput. TQ4/4 is recommended for its balance of compression, accuracy, and performance, especially for full-attention models.

Key takeaway

For MLOps engineers deploying LLMs on AMD GPUs for agentic or long-context workloads, the optimized TurboQuant TQ4/4 configuration is worth evaluating. It raises KV-cache hit rates and reduces Time-to-First-Token when memory capacity is the bottleneck. Prioritize kernel-optimized TQ implementations, such as the FlyDSL kernel, to stay close to BF16 throughput while gaining the memory compression. Check your model's attention architecture before choosing TQ settings, since boundary-layer skipping applies to full-attention models.

Key insights

Optimized TurboQuant on AMD GPUs delivers up to a 3.6x end-to-end speedup over the open-source vLLM TQ baseline and cuts P50 Time-to-First-Token from 13.9 seconds to 0.89 seconds on KV-cache-bound agentic workloads.

Principles

KV cache capacity often bottlenecks long-context LLM inference.
Treating keys and values asymmetrically (Walsh-Hadamard rotation for K, standard uniform quantization for V) improves accuracy.
Kernel-level optimizations are crucial for low-bit quantization performance.
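
To make the capacity principle concrete, here is a rough back-of-the-envelope sketch of how many bytes of KV cache one token occupies at BF16 versus 4-bit. The model shape (32 layers, 8 KV heads, head dimension 128) is hypothetical, and the metadata layout (one 16-bit scale per group of 128 elements) is an assumption made for illustration, not the layout the AMD implementation uses.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bits,
                       group_size=None, scale_bits=16):
    """Bytes of KV cache that a single token occupies across all layers."""
    # Keys and values each store num_kv_heads * head_dim elements per layer.
    elements = 2 * num_layers * num_kv_heads * head_dim
    total_bits = elements * bits
    if group_size is not None:
        # Assumed overhead: one scale of `scale_bits` bits per quantization group.
        total_bits += (elements // group_size) * scale_bits
    return total_bits // 8


# Hypothetical model shape, chosen only to make the arithmetic easy to follow.
MODEL = dict(num_layers=32, num_kv_heads=8, head_dim=128)

bf16_bytes = kv_bytes_per_token(bits=16, **MODEL)
tq4_bytes = kv_bytes_per_token(bits=4, group_size=128, **MODEL)

print(f"BF16 : {bf16_bytes} bytes per token")
print(f"4-bit: {tq4_bytes} bytes per token")
print(f"ratio: {bf16_bytes / tq4_bytes:.2f}x more tokens in the same memory")
```

Under these assumptions the same memory budget holds a little under four times as many cached tokens, which is the mechanism behind higher prefix-cache hit rates when many long conversations compete for the cache.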

Method

The productionized TurboQuant algorithm applies Walsh-Hadamard rotation to keys, uses standard uniform quantization for values, skips boundary layers for full-attention models, and omits QJL for improved accuracy and performance.
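
The effect of rotating before quantizing can be sketched in a few lines of pure Python. This is a simplified illustration, not AMD's kernel code: it uses a normalized fast Walsh-Hadamard transform and, as an assumption, a plain min/max 4-bit uniform quantizer on both the rotated and unrotated vector. The function names and the test vector with a single outlier channel are hypothetical.

```python
import math
import random


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # normalization makes the transform its own inverse
    return [x * scale for x in out]


def quantize_uniform(vec, bits=4):
    """Min/max uniform quantization; returns (integer codes, scale, offset)."""
    levels = (1 << bits) - 1
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / levels or 1.0  # avoid a zero scale for constant vectors
    codes = [min(levels, max(0, round((x - lo) / scale))) for x in vec]
    return codes, scale, lo


def dequantize_uniform(codes, scale, lo):
    return [c * scale + lo for c in codes]


def roundtrip_sq_error(vec, rotate, bits=4):
    """Squared error after quantize/dequantize, optionally in the rotated basis."""
    work = fwht(vec) if rotate else list(vec)
    restored = dequantize_uniform(*quantize_uniform(work, bits))
    if rotate:
        restored = fwht(restored)  # rotate back to the original basis
    return sum((a - b) ** 2 for a, b in zip(vec, restored))


# Hypothetical key vector: small values plus one large outlier channel.
rng = random.Random(0)
key = [rng.uniform(-1.0, 1.0) for _ in range(128)]
key[7] = 50.0

print("squared error without rotation:", roundtrip_sq_error(key, rotate=False))
print("squared error with rotation   :", roundtrip_sq_error(key, rotate=True))
```

The rotation spreads the outlier's energy across all channels, so the quantization grid no longer has to stretch to cover one extreme value, and the reconstruction error drops. Because the Walsh-Hadamard transform needs only additions and subtractions in a butterfly pattern, it is also cheaper to apply than a dense random rotation matrix.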

In practice

Default to TQ4/4 for balanced compression, accuracy, and performance.
Skip boundary-layer quantization for full-attention models.
Use Walsh-Hadamard rotation over random rotation for keys.
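
As one way to picture the boundary-layer rule, the sketch below selects which layers would have their KV cache quantized. It assumes that "boundary layers" means the first and last few layers of the stack and that two layers are skipped at each end; both are assumptions for illustration, and `layers_to_quantize` is a hypothetical helper, not a vLLM or ROCm API.

```python
def layers_to_quantize(num_layers, is_full_attention, boundary=2):
    """Return the layer indices whose KV cache would be quantized.

    Assumption: for full-attention models, the first and last `boundary`
    layers are left in higher precision; other models quantize every layer.
    """
    if not is_full_attention or boundary <= 0:
        return list(range(num_layers))
    if 2 * boundary >= num_layers:
        return []  # the model is too shallow; every layer counts as a boundary
    return list(range(boundary, num_layers - boundary))


# Hypothetical 8-layer full-attention model: layers 0-1 and 6-7 are skipped.
print(layers_to_quantize(8, is_full_attention=True))
# Hypothetical model without full attention: every layer is quantized.
print(layers_to_quantize(8, is_full_attention=False))
```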

Topics

TurboQuant
KV Cache Compression
AMD GPUs
LLM Inference
Agentic AI
vLLM

Code references

TheTom/turboquant_plus

Best for: Machine Learning Engineer, AI Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.