Accelerating LLM Inference on AMD GPUs with Low-Latency GEMMs
Summary
AMD has introduced the LDS-Pipelined Split-K GEMM technique, implemented as an AITER FlyDSL kernel family, to accelerate large language model (LLM) inference on AMD GPUs. The optimization targets decode-time General Matrix Multiply (GEMM) operations with small "M" (active tokens), large "N" and "K" (hidden sizes), BF16/FP16 inputs, and optional bias. Conventional GEMM tiling often underutilizes the GPU on these shapes. The method combines inter-CTA Split-K, intra-CTA K-slice splitting, and a multi-stage LDS pipeline to increase parallelism and overlap memory traffic with compute. Benchmarks on an AMD Instinct™ MI355X GPU ("gfx950") show an average 1.64x latency improvement over hipBLASLt, AITER Triton, and AITER ASM on the K = 7168 decode grid, and a 1.49x average improvement on additional BF16 model-shape tests.
Key takeaway
For AI Engineers and Architects deploying LLMs on AMD GPUs, the LDS-Pipelined Split-K GEMM technique is a direct lever on interactive decode latency. Teams should evaluate integrating AITER FlyDSL-generated kernels for the small-M, large-K GEMMs that are prevalent in LLM decode paths. In AMD's MI355X benchmarks the kernels averaged 1.64x lower latency on the K = 7168 decode grid and 1.49x on additional BF16 model shapes, gains that matter most for latency-sensitive workloads such as real-time copilots and chatbots.
Key insights
LDS-Pipelined Split-K GEMM significantly accelerates LLM decode inference on AMD GPUs by optimizing small-M, large-K operations.
Principles
- Split K-reduction across CTAs for GPU occupancy.
- Distribute K-slices across warp groups for parallelism.
- Pipeline K blocks through LDS for memory overlap.
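The first principle can be illustrated with a minimal pure-Python sketch of Split-K. It is not the AITER kernel: each K range merely stands in for one CTA, and the helper names (`gemm_reference`, `split_k_ranges`, `gemm_split_k`) are hypothetical. It shows that summing per-range partial products reproduces the full GEMM, which is what lets a small-M problem be spread over more CTAs.

```python
def gemm_reference(a, b):
    """Plain C = A @ B for A (M x K) and B (K x N), given as lists of lists."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def split_k_ranges(k, num_splits):
    """Partition [0, k) into num_splits contiguous, near-equal K ranges."""
    base, extra = divmod(k, num_splits)
    ranges, start = [], 0
    for s in range(num_splits):
        stop = start + base + (1 if s < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def gemm_split_k(a, b, num_splits):
    """Each K range stands in for one CTA; partial tiles are summed at the end."""
    m, n = len(a), len(b[0])
    partials = []
    for k0, k1 in split_k_ranges(len(b), num_splits):
        partials.append([[sum(a[i][p] * b[p][j] for p in range(k0, k1))
                          for j in range(n)] for i in range(m)])
    # Final reduction over the Split-K dimension.
    return [[sum(part[i][j] for part in partials) for j in range(n)]
            for i in range(m)]


# Small-M, larger-K example with integer data so the comparison is exact.
M, K, N = 2, 12, 3
A = [[(i + p) % 5 for p in range(K)] for i in range(M)]
B = [[(p * j + 1) % 7 for j in range(N)] for p in range(K)]
assert gemm_split_k(A, B, num_splits=4) == gemm_reference(A, B)
```

With floating-point inputs such as BF16 or FP16, the split changes the summation order, so results would match the reference only up to rounding.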
Method
LDS-Pipelined Split-K GEMM splits K across CTAs, further slices K within CTAs, and pipelines K blocks via multi-stage LDS, synchronized by a single-launch signal/semaphore protocol.
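The multi-stage pipeline part of the method can be sketched as a ring of staging buffers. This is a sequential simulation under stated assumptions, not the FlyDSL kernel: the list `slots` stands in for LDS stages, `pipelined_dot` is a hypothetical name, and on a real GPU the load of a later K block would run asynchronously while the current block is being computed. The returned schedule shows which block is loaded and which is computed at each step.

```python
def pipelined_dot(a_row, b_col, block_k, num_stages):
    """Dot product over K blocks using a ring of `num_stages` staging buffers.

    Returns (result, schedule); each schedule entry is
    (block_loaded_or_None, block_computed_or_None) for one step.
    """
    k = len(a_row)
    num_blocks = (k + block_k - 1) // block_k
    slots = [None] * num_stages          # stand-in for LDS stages
    schedule = []

    def load(block):
        lo, hi = block * block_k, min((block + 1) * block_k, k)
        slots[block % num_stages] = (a_row[lo:hi], b_col[lo:hi])

    # Prologue: fill all but one stage before any compute starts.
    for block in range(min(num_stages - 1, num_blocks)):
        load(block)
        schedule.append((block, None))

    acc = 0
    for block in range(num_blocks):
        # Prefetch the block that is (num_stages - 1) ahead, if it exists.
        nxt = block + num_stages - 1
        if nxt < num_blocks:
            load(nxt)
        a_blk, b_blk = slots[block % num_stages]
        acc += sum(x * y for x, y in zip(a_blk, b_blk))
        schedule.append((nxt if nxt < num_blocks else None, block))
    return acc, schedule


a_row = list(range(1, 9))                # K = 8
b_col = [1] * 8
result, schedule = pipelined_dot(a_row, b_col, block_k=2, num_stages=3)
assert result == sum(a_row)
```

In this sketch the slot being refilled always holds a block that has already been consumed, which is the property a real multi-stage pipeline has to preserve with its own synchronization.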
In practice
- Target small-M, large-K GEMMs for decode optimization.
- Use FlyDSL for shape-specialized kernel generation.
- Implement single-launch synchronization for Split-K.
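One common way to get single-launch Split-K synchronization is the "last arriver reduces" pattern, sketched below with Python threads standing in for CTAs. The summary does not describe the actual signal/semaphore protocol used by the AITER kernels, so this is an assumption for illustration only; `split_k_single_launch` and its internals are hypothetical names.

```python
import threading


def split_k_single_launch(a_row, b_col, num_splits):
    """Dot product where simulated CTAs each reduce one K range.

    Every worker writes its partial sum to a workspace slot and bumps an
    arrival counter; whichever worker arrives last performs the final
    reduction, so no second "reduce" launch is needed.
    """
    k = len(a_row)
    chunk = (k + num_splits - 1) // num_splits
    workspace = [0] * num_splits     # one partial-sum slot per simulated CTA
    arrived = 0                      # arrival counter (semaphore stand-in)
    lock = threading.Lock()
    result = {}

    def cta(split_id):
        nonlocal arrived
        lo, hi = split_id * chunk, min((split_id + 1) * chunk, k)
        workspace[split_id] = sum(a_row[p] * b_col[p] for p in range(lo, hi))
        with lock:                   # atomic increment-and-test
            arrived += 1
            is_last = arrived == num_splits
        if is_last:
            # All partials are visible here: every other worker wrote its
            # slot before taking the lock.
            result["value"] = sum(workspace)

    threads = [threading.Thread(target=cta, args=(s,)) for s in range(num_splits)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result["value"]


assert split_k_single_launch([1, 2, 3, 4, 5, 6], [1] * 6, num_splits=3) == 21
```

On a GPU the counter would typically be a global-memory atomic with appropriate memory fences, and the workspace would hold partial output tiles rather than scalars.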
Topics
- LLM Inference
- AMD GPUs
- GEMM Optimization
- Low Latency
- FlyDSL
- AITER
- Decode Performance
Best for: NLP Engineer, MLOps Engineer, AI Scientist, Machine Learning Engineer, AI Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.