Accelerating LLM Inference on AMD GPUs with Low-Latency GEMMs

2026-06-29 · Source: AMD ROCm Blogs · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

AMD has introduced LDS-Pipelined Split-K GEMM, a technique implemented as an AITER FlyDSL kernel family, to reduce large language model (LLM) inference latency on AMD GPUs. It targets decode-time General Matrix Multiply (GEMM) operations with small M (active tokens), large N and K (hidden sizes), BF16/FP16 inputs, and optional bias, shapes for which conventional GEMM tiling often underutilizes the GPU. The method combines inter-CTA Split-K, intra-CTA K-slice splitting, and a multi-stage LDS (Local Data Share) pipeline to increase parallelism and memory efficiency. Benchmarks on an AMD Instinct™ MI355X GPU (gfx950) show an average 1.64x latency improvement over hipBLASLt, AITER Triton, and AITER ASM on the K = 7168 decode grid, and a 1.49x average improvement on additional BF16 model-shape tests.

Key takeaway

For AI engineers and architects deploying LLMs on AMD GPUs, LDS-Pipelined Split-K GEMM is a direct lever on interactive decode latency. Teams should evaluate AITER FlyDSL-generated kernels for the small-M, large-K GEMMs that are prevalent in LLM decode paths. AMD's benchmarks report an average 1.64x latency improvement on the K = 7168 decode grid, a gain that benefits responsiveness and throughput for real-time copilots and chatbots.

Key insights

LDS-Pipelined Split-K GEMM accelerates LLM decode inference on AMD GPUs by optimizing small-M, large-K operations, with a reported 1.64x average latency improvement on the K = 7168 decode grid.

Principles

Split K-reduction across CTAs for GPU occupancy.
Distribute K-slices across warp groups for parallelism.
Pipeline K blocks through LDS for memory overlap.
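
The three principles above can be sketched as a plain-Python reference model. This is only an illustration of the decomposition, not the AITER FlyDSL kernel: the function names, the `stages` ring buffer standing in for LDS, and all parameters are assumptions made for this sketch, and the loops run sequentially where real hardware would run CTAs and slices in parallel.

```python
from collections import deque


def split_ranges(lo, hi, parts):
    """Split [lo, hi) into `parts` contiguous, near-equal ranges."""
    span = hi - lo
    bounds = [lo + (span * p) // parts for p in range(parts + 1)]
    return [(bounds[p], bounds[p + 1]) for p in range(parts)]


def load_tile(a, b, k0, k1):
    """Copy one K-block of A and B (stand-in for a global-to-LDS copy)."""
    return [row[k0:k1] for row in a], b[k0:k1]


def mma(a_tile, b_tile, acc):
    """Accumulate a_tile @ b_tile into acc in place."""
    for acc_row, a_row in zip(acc, a_tile):
        for a_ik, b_row in zip(a_row, b_tile):
            for j, b_kj in enumerate(b_row):
                acc_row[j] += a_ik * b_kj


def add_into(dst, src):
    """Element-wise dst += src."""
    for d_row, s_row in zip(dst, src):
        for j, v in enumerate(s_row):
            d_row[j] += v


def pipelined_slice(a, b, k_lo, k_hi, block_k, stages):
    """Reduce one K-slice, keeping up to `stages` K-blocks staged at once."""
    m, n = len(a), len(b[0])
    acc = [[0.0] * n for _ in range(m)]
    blocks = [(k, min(k + block_k, k_hi)) for k in range(k_lo, k_hi, block_k)]
    staged = deque()  # hypothetical ring of stage buffers
    next_load = 0
    # Prologue: prefetch stages-1 blocks before any compute starts.
    for _ in range(min(stages - 1, len(blocks))):
        staged.append(load_tile(a, b, *blocks[next_load]))
        next_load += 1
    # Steady state: issue the next load, then consume the oldest staged block.
    while staged or next_load < len(blocks):
        if next_load < len(blocks):
            staged.append(load_tile(a, b, *blocks[next_load]))
            next_load += 1
        a_tile, b_tile = staged.popleft()
        mma(a_tile, b_tile, acc)
    return acc


def split_k_reference(a, b, split_k, slices_per_cta, block_k, stages):
    """C = A @ B with K split across CTAs, then across slices within each CTA."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for cta_lo, cta_hi in split_ranges(0, k, split_k):  # inter-CTA Split-K
        cta_acc = [[0.0] * n for _ in range(m)]
        for s_lo, s_hi in split_ranges(cta_lo, cta_hi, slices_per_cta):  # intra-CTA slices
            add_into(cta_acc, pipelined_slice(a, b, s_lo, s_hi, block_k, stages))
        add_into(out, cta_acc)  # cross-CTA reduction
    return out
```

Calling `split_k_reference(a, b, split_k=4, slices_per_cta=2, block_k=3, stages=3)` on small list-of-lists matrices should match a naive triple-loop GEMM, which makes the sketch useful as a correctness oracle when experimenting with different split factors.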

Method

LDS-Pipelined Split-K GEMM splits K across CTAs, further slices K within CTAs, and pipelines K blocks via multi-stage LDS, synchronized by a single-launch signal/semaphore protocol.
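
The summary does not spell out the signal/semaphore protocol, so the following is a hedged sketch of one common single-launch pattern rather than AMD's actual design: every CTA adds its partial result into the output, bumps an arrival counter, and the last CTA to arrive runs the epilogue (here, the optional bias add). Python threads and a `threading.Lock` stand in for CTAs and device atomics; `split_k_single_launch` and its return values are hypothetical names.

```python
import threading


def split_k_single_launch(a, b, bias, split_k):
    """Simulate Split-K where the last-arriving worker applies the epilogue."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    lock = threading.Lock()   # stand-in for atomic adds plus a semaphore
    arrivals = [0]            # arrival counter shared by all workers
    finisher = []             # records which worker ran the epilogue

    def cta(cta_id):
        # Each worker owns one contiguous K range.
        k_lo = (k * cta_id) // split_k
        k_hi = (k * (cta_id + 1)) // split_k
        partial = [
            [sum(a[i][kk] * b[kk][j] for kk in range(k_lo, k_hi)) for j in range(n)]
            for i in range(m)
        ]
        with lock:
            # Reduce this worker's partial sums into the shared output.
            for i in range(m):
                for j in range(n):
                    out[i][j] += partial[i][j]
            arrivals[0] += 1
            # Only the last arrival sees the full count and runs the epilogue,
            # so no second launch is needed for the reduction or the bias.
            if arrivals[0] == split_k:
                for i in range(m):
                    for j in range(n):
                        out[i][j] += bias[j]
                finisher.append(cta_id)

    threads = [threading.Thread(target=cta, args=(c,)) for c in range(split_k)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out, finisher[0]
```

The design point the sketch illustrates is that the bias is applied exactly once, by whichever worker finishes last, regardless of scheduling order. With floating-point inputs the summation order can vary between runs, which is a general property of atomic-style Split-K reductions.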

In practice

Target small-M, large-K GEMMs for decode optimization.
Use FlyDSL for shape-specialized kernel generation.
Implement single-launch synchronization for Split-K.
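
To make the shape-targeting step concrete, here is a minimal heuristic for choosing a Split-K factor from the problem shape. It is an illustrative assumption, not the selection logic used by AITER or FlyDSL: the tile sizes, the `num_cus=256` default, and the `max_split` cap are placeholder values that should be replaced with figures for the actual kernel and device.

```python
import math


def pick_split_k(m, n, k, tile_m=16, tile_n=64, block_k=64, num_cus=256, max_split=16):
    """Return (split_k, total_ctas) so that small-M shapes launch enough CTAs.

    All default parameters are hypothetical placeholders for this sketch.
    """
    # Output tiles available without Split-K; small M means few tiles.
    tiles = math.ceil(m / tile_m) * math.ceil(n / tile_n)
    # K cannot be split into more pieces than it has K-blocks.
    k_blocks = math.ceil(k / block_k)
    # Aim for roughly one CTA per compute unit, never fewer than one split.
    split = max(1, num_cus // tiles)
    split = min(split, max_split, k_blocks)
    return split, tiles * split


# A decode-like shape: few output tiles, so K is split to fill the device.
print(pick_split_k(4, 2048, 7168))   # (8, 256)
# A wide shape already yields enough tiles, so no split is chosen.
print(pick_split_k(4, 32768, 7168))  # (1, 512)
```

A real kernel family would also weigh the cost of the cross-CTA reduction against the occupancy gain, which this sketch ignores.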

Topics

LLM Inference
AMD GPUs
GEMM Optimization
Low Latency
FlyDSL
AITER
Decode Performance

Best for: NLP Engineer, MLOps Engineer, AI Scientist, Machine Learning Engineer, AI Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.