FP8 GEMM Optimization on AMD CDNA™4 Architecture

2026-03-10 · Source: AMD ROCm Blogs · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

This post details the step-by-step optimization of an FP8 General Matrix Multiply (GEMM) kernel on the AMD Instinct MI355X GPU, which is built on the CDNA™4 architecture. The work, done entirely in HIP/C++, raised throughput from 1.15 TFLOPS for a naive implementation to 2680.33 TFLOPS at M=N=K=4096, about 97% of hipBLASLt's 2750.42 TFLOPS. Key techniques included LDS (Local Data Share) tiling, Matrix Core instructions (MFMA), vectorized and direct global-to-LDS loads, LDS swizzling to mitigate bank conflicts, software pipelining with double buffering, and multi-wave scheduling, culminating in an 8-wave ping-pong scheduling pattern. The CDNA™4 architecture's increased LDS capacity (160 KB), higher LDS read bandwidth (256 B/clk), and expanded low-precision MFMA support were critical.
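The throughput figures above follow from the usual GEMM operation count of 2·M·N·K floating-point operations (one multiply and one add per inner-product term). The sketch below is a small host-side calculator for that arithmetic; the function names are illustrative and do not come from the original post.

```cpp
#include <cassert>
#include <cstdint>

// Floating-point operations in one M x N x K GEMM:
// each of the M*N outputs needs K multiplies and K adds.
double gemm_flops(std::uint64_t m, std::uint64_t n, std::uint64_t k) {
    return 2.0 * static_cast<double>(m) * static_cast<double>(n) *
           static_cast<double>(k);
}

// Achieved TFLOPS for a measured kernel time given in seconds.
double achieved_tflops(double flops, double seconds) {
    assert(seconds > 0.0);
    return flops / seconds / 1e12;
}

// Kernel time in microseconds implied by a reported TFLOPS figure.
double implied_time_us(double flops, double reported_tflops) {
    assert(reported_tflops > 0.0);
    return flops / (reported_tflops * 1e12) * 1e6;
}
```

For M=N=K=4096 this gives about 1.37e11 operations, so the reported 2680.33 TFLOPS corresponds to a kernel time of roughly 51 microseconds.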

Key takeaway

For deep learning engineers optimizing FP8 GEMM kernels on AMD CDNA™4 GPUs, reaching performance comparable to highly optimized libraries like hipBLASLt depends on systematically applying LDS tiling, Matrix Core instructions, double buffering, and 8-wave ping-pong scheduling. Focus on minimizing global memory access, maximizing compute unit utilization, and carefully managing instruction scheduling and memory access patterns.

Key insights

Optimizing FP8 GEMM on AMD CDNA™4 GPUs requires a systematic approach to memory access and instruction scheduling.

Principles

Maximize data reuse via LDS tiling.
Utilize architecture-specific matrix core instructions.
Overlap memory operations with compute.
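To make the data-reuse principle concrete, here is a minimal host-side C++ sketch of tiling. It is an analogy rather than the post's kernel: two small local arrays stand in for LDS buffers, `float` stands in for FP8, and the tile edge `kTile` and the function name `gemm_tiled` are assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kTile = 16;  // hypothetical tile edge, not from the post

// Computes C (M x N) = A (M x K) * B (K x N); all matrices are row-major.
void gemm_tiled(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C, std::size_t M, std::size_t N,
                std::size_t K) {
    assert(A.size() == M * K && B.size() == K * N && C.size() == M * N);
    std::fill(C.begin(), C.end(), 0.0f);
    float tileA[kTile][kTile];  // stands in for an LDS buffer holding a block of A
    float tileB[kTile][kTile];  // stands in for an LDS buffer holding a block of B

    for (std::size_t i0 = 0; i0 < M; i0 += kTile) {
        for (std::size_t j0 = 0; j0 < N; j0 += kTile) {
            for (std::size_t k0 = 0; k0 < K; k0 += kTile) {
                // Stage one block of A and B; out-of-range entries are zero-padded.
                for (std::size_t r = 0; r < kTile; ++r) {
                    for (std::size_t c = 0; c < kTile; ++c) {
                        const bool inA = (i0 + r < M) && (k0 + c < K);
                        const bool inB = (k0 + r < K) && (j0 + c < N);
                        tileA[r][c] = inA ? A[(i0 + r) * K + (k0 + c)] : 0.0f;
                        tileB[r][c] = inB ? B[(k0 + r) * N + (j0 + c)] : 0.0f;
                    }
                }
                // Each staged value is read kTile times from the fast buffers.
                for (std::size_t r = 0; r < kTile && i0 + r < M; ++r) {
                    for (std::size_t c = 0; c < kTile && j0 + c < N; ++c) {
                        float acc = 0.0f;
                        for (std::size_t k = 0; k < kTile; ++k) {
                            acc += tileA[r][k] * tileB[k][c];
                        }
                        C[(i0 + r) * N + (j0 + c)] += acc;
                    }
                }
            }
        }
    }
}
```

In this model every element fetched from the large arrays is used `kTile` times before being replaced, which is the same reuse argument that motivates staging tiles in LDS on the GPU.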

Method

The workflow starts from a naive kernel and layers on optimizations in order: LDS tiling, Matrix Core instructions, vectorized and direct global-to-LDS loads, swizzling to resolve LDS bank conflicts, and finally software pipelining with double buffering and multi-wave scheduling.
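The swizzling step can be illustrated with a simplified bank model. The sketch below assumes 32 banks of 4-byte words and a tile row of 32 words; both constants are placeholders for illustration and are not CDNA™4 figures from the post. It uses a plain XOR swizzle, which is one common scheme and not necessarily the exact pattern the original kernel uses, and it compares the worst-case bank collisions when each thread reads the same column of a different row.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kBanks = 32;     // assumed bank count for this model
constexpr std::size_t kRowWords = 32;  // assumed tile row length in 4-byte words

// Plain row-major placement of element (row, col) in the shared buffer.
std::size_t linear_index(std::size_t row, std::size_t col) {
    return row * kRowWords + col;
}

// XOR swizzle: permute the columns inside each row by the row index.
std::size_t swizzled_index(std::size_t row, std::size_t col) {
    return row * kRowWords + (col ^ (row % kRowWords));
}

// Largest number of threads mapped to one bank when thread t reads (row = t, col).
template <typename IndexFn>
std::size_t max_bank_conflict(IndexFn index, std::size_t col,
                              std::size_t threads) {
    assert(col < kRowWords);
    std::array<std::size_t, kBanks> hits{};
    for (std::size_t t = 0; t < threads; ++t) {
        ++hits[index(t, col) % kBanks];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

With the linear layout, a column read by 32 threads lands entirely in one bank in this model; with the XOR swizzle, the same read spreads across all 32 banks. Real kernels typically swizzle at the granularity of their vector loads rather than single words.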

In practice

Use `__builtin_amdgcn_s_barrier()` to synchronize the waves within a workgroup.
Raise or lower a wave's scheduling priority with `__builtin_amdgcn_s_setprio(x)`.
Limit loop unrolling with `#pragma unroll 2` to reduce register pressure.
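As a rough illustration of where these three calls might sit in a double-buffered loop, the sketch below models the structure on the host. Everything apart from the two builtins and the pragma is an assumption: the macro names, the function `dot_double_buffered`, and the tile size are hypothetical, and the macros compile to no-ops unless the code is built for an AMD GPU target. In a real kernel this logic would live in a `__global__` function with the buffers in LDS.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

#if defined(__AMDGCN__)  // device compilation for an AMD GPU target
#define WG_BARRIER() __builtin_amdgcn_s_barrier()
#define SET_PRIO(p) __builtin_amdgcn_s_setprio(p)
#else  // host build: the hardware hints become no-ops
#define WG_BARRIER() ((void)0)
#define SET_PRIO(p) ((void)0)
#endif

#if defined(__clang__)
#define UNROLL_2 _Pragma("unroll 2")  // cap unrolling at a factor of 2
#else
#define UNROLL_2
#endif

// Dot product over K in tiles, using two staging buffers (ping-pong).
float dot_double_buffered(const std::vector<float>& a,
                          const std::vector<float>& b) {
    assert(a.size() == b.size());
    constexpr std::size_t kTile = 4;  // hypothetical tile length
    float bufA[2][kTile];
    float bufB[2][kTile];
    const std::size_t tiles = (a.size() + kTile - 1) / kTile;

    // Copy one tile into the given slot, zero-padding past the end.
    auto load = [&](std::size_t tile, std::size_t slot) {
        for (std::size_t i = 0; i < kTile; ++i) {
            const std::size_t idx = tile * kTile + i;
            bufA[slot][i] = idx < a.size() ? a[idx] : 0.0f;
            bufB[slot][i] = idx < b.size() ? b[idx] : 0.0f;
        }
    };

    float acc = 0.0f;
    if (tiles == 0) return acc;
    load(0, 0);    // prologue: fill the first buffer
    WG_BARRIER();  // staged data must be visible before it is consumed
    UNROLL_2
    for (std::size_t t = 0; t < tiles; ++t) {
        const std::size_t cur = t % 2;
        const std::size_t nxt = 1 - cur;
        if (t + 1 < tiles) load(t + 1, nxt);  // prefetch the next tile
        SET_PRIO(1);  // favor this wave while it runs the math
        for (std::size_t i = 0; i < kTile; ++i) {
            acc += bufA[cur][i] * bufB[cur][i];
        }
        SET_PRIO(0);
        WG_BARRIER();  // do not overwrite a buffer that is still being read
    }
    return acc;
}
```

On the host the loads and the math run one after another, so this only shows the buffer rotation and the synchronization points; the overlap itself comes from the GPU issuing the prefetch and the compute concurrently.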

Topics

FP8 GEMM Optimization
AMD CDNA4 Architecture
Matrix Core Instructions
GPU Kernel Optimization
Software Pipelining

Best for: Machine Learning Engineer, Deep Learning Engineer, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.