Deep Dive Into 4-Wave Interleave FP8 GEMM

Source: AMD ROCm Blogs · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering) · Depth: Expert, long

Summary

AMD's 4-wave interleave FP8 GEMM algorithm, introduced by HipKittens, outperforms the 8-wave ping-pong design on AMD Instinct™ MI355X GPUs. It assigns one wave per SIMD unit, so each wave gets the full 512-VGPR budget and can hold a 128x128 output register tile, compared with 64x128 in the 8-wave design. The core innovation is a fine-grained software pipeline that uses `__builtin_amdgcn_sched_barrier(0)` to interleave 16 MFMA instructions with 8 LDS→register loads and 4 global→LDS loads at the 16x16 sub-tile level, hiding memory latency behind compute. Further optimizations include an early accumulator store, which writes out each `c[i][j]` sub-tile as soon as it becomes available, and LDS swizzling, which eliminates the 8-way bank conflicts caused by the column-major access pattern of `16x16x128` MFMA instructions. Chiplet-aware grid swizzling additionally improves L2 and LLC cache hit rates on the MI355X's 8-XCD topology. Benchmarks on ROCm 7.2.2 show higher performance than the 8-wave kernel and greater robustness across compiler versions.
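The register-budget argument can be checked with a little arithmetic. The sketch below is a host-side model, not kernel code, and rests on assumptions the summary does not state: accumulators are FP32 (4 bytes), a wave has 64 lanes, and the 8-wave design places two waves on each SIMD so that each gets half of the 512-VGPR budget. The helper name `accumulator_vgprs` is hypothetical.

```python
WAVE_SIZE = 64          # lanes per wave (assumed wave64)
VGPR_BUDGET = 512       # per-SIMD budget quoted in the summary
VGPR_BYTES = 4          # one VGPR holds 4 bytes per lane

def accumulator_vgprs(tile_m, tile_n, bytes_per_elem=4):
    """VGPRs per lane needed to hold a tile_m x tile_n accumulator tile.

    Assumes the tile is spread evenly over the 64 lanes of one wave and
    that accumulators are FP32 unless bytes_per_elem says otherwise.
    """
    elems_per_lane = (tile_m * tile_n) // WAVE_SIZE
    return elems_per_lane * bytes_per_elem // VGPR_BYTES

# 4-wave design: one wave per SIMD, so the wave owns the whole budget.
four_wave_acc = accumulator_vgprs(128, 128)
four_wave_left = VGPR_BUDGET - four_wave_acc

# 8-wave design: assumed two waves per SIMD, so half the budget each.
eight_wave_acc = accumulator_vgprs(64, 128)
eight_wave_left = VGPR_BUDGET // 2 - eight_wave_acc

print(four_wave_acc, four_wave_left)    # accumulators vs. room for operands
print(eight_wave_acc, eight_wave_left)
```

Under these assumptions the 128x128 tile fills half of a 512-VGPR budget with accumulators and leaves the other half for operands and addressing, while the same tile would consume an entire 256-VGPR budget on its own.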

Key takeaway

For Machine Learning Engineers optimizing FP8 GEMM kernels on AMD Instinct™ MI355X GPUs, the 4-wave interleave design is the stronger starting point. Implement fine-grained software pipelining with `__builtin_amdgcn_sched_barrier(0)` and apply LDS swizzling to remove bank conflicts. Also consider chiplet-aware grid swizzling, tuning `WINDOW_SIZE` and `CHUNK_SIZE` to your hardware and problem size, to improve L2 and LLC cache hit rates. Relative to the 8-wave kernel, this design offers better performance and greater compiler robustness.
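One way to picture chiplet-aware grid swizzling is as a remapping from the hardware workgroup ID to an output tile coordinate. The sketch below is a host-side model under explicit assumptions, not the kernel's actual mapping: workgroups are dispatched round-robin across the 8 XCDs, `CHUNK_SIZE` is the number of consecutive tiles handed to one XCD, and `WINDOW_SIZE` is the number of tile rows traversed together. The summary names these two parameters but does not define them, so the semantics here, along with the function name `remap`, are hypothetical.

```python
NUM_XCDS = 8  # MI355X topology quoted in the summary

def remap(wgid, tiles_m, tiles_n, window_size, chunk_size):
    """Map a hardware workgroup ID to an output tile (row, col).

    Step 1 (assumed): regroup IDs so each XCD works on chunk_size
    consecutive tiles, instead of every 8th tile.
    Step 2 (assumed): walk the tile grid in windows of window_size rows,
    column by column, so neighbouring IDs share operand tiles.
    """
    total = tiles_m * tiles_n
    block = NUM_XCDS * chunk_size
    assert total % block == 0, "sketch assumes an evenly divisible grid"

    # Step 1: XCD-aware regrouping.
    xcd = wgid % NUM_XCDS              # XCD under round-robin dispatch
    local = wgid // NUM_XCDS           # index of this workgroup on its XCD
    chunk_idx, pos = divmod(local, chunk_size)
    lin = chunk_idx * block + xcd * chunk_size + pos

    # Step 2: windowed traversal of the tile grid.
    per_window = window_size * tiles_n
    first_row = (lin // per_window) * window_size
    rows = min(window_size, tiles_m - first_row)
    in_window = lin % per_window
    return first_row + in_window % rows, in_window // rows

# Workgroups 0, 8, 16, 24 all land on XCD 0 under round-robin dispatch.
xcd0_tiles = [remap(w, 8, 8, window_size=4, chunk_size=4) for w in (0, 8, 16, 24)]
all_tiles = {remap(w, 8, 8, window_size=4, chunk_size=4) for w in range(64)}
```

In this model the first four workgroups on XCD 0 compute tiles in the same tile column, so they reuse the same column-side operand tile from that XCD's L2. Which values of `WINDOW_SIZE` and `CHUNK_SIZE` work best depends on the problem shape and would need to be measured.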

Key insights

The 4-wave interleave FP8 GEMM improves GPU performance through fine-grained software pipelining and conflict-free, cache-friendly memory access patterns.
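The effect of LDS swizzling on bank conflicts can be illustrated with a toy bank model. Everything numeric below is an assumption chosen for illustration rather than a statement about the MI355X: 64 banks of 4 bytes each, tile rows of 128 FP8 values (128 bytes), and 16 lanes that each read the same column offset from 16 different rows. The XOR pattern is one common swizzle and is not necessarily the one the kernel uses.

```python
from collections import Counter

NUM_BANKS = 64                            # assumed bank count
BANK_BYTES = 4                            # assumed bank width
ROW_BYTES = 128                           # 128 FP8 values per tile row
WORDS_PER_ROW = ROW_BYTES // BANK_BYTES   # 32 four-byte words per row

def bank(row, word, swizzle):
    """Bank hit when reading 4-byte word `word` of tile row `row`."""
    if swizzle:
        # XOR the word index with a row-dependent offset. Shifting by 2
        # keeps 16-byte groups intact (an illustrative choice).
        word ^= ((row >> 1) & 7) << 2
    return (row * WORDS_PER_ROW + word) % NUM_BANKS

def max_conflict(word, swizzle):
    """Worst-case number of the 16 rows that hit the same bank."""
    hits = Counter(bank(r, word, swizzle) for r in range(16))
    return max(hits.values())

plain = max_conflict(0, swizzle=False)
swizzled = max_conflict(0, swizzle=True)
print(plain, swizzled)
```

In this model, rows two apart alias to the same bank because two 128-byte rows span all 64 banks, so a column read over 16 rows serializes on just two banks. The XOR offset spreads those rows over 16 distinct banks while remaining a permutation within each row, so no data is lost or duplicated.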

Method

The 4-wave kernel uses `__builtin_amdgcn_sched_barrier(0)` to interleave 16 MFMA instructions with 8 LDS→register loads and 4 global→LDS loads for 16x16 sub-tiles, alongside double buffering and early accumulator stores.
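The interleaving described above can be sketched as an issue order. This is a host-side model that only generates a list of labels. It assumes the loads are spaced evenly among the MFMA instructions (one LDS→register load every 2 MFMAs, one global→LDS load every 4), which the summary does not specify. In the kernel, a `__builtin_amdgcn_sched_barrier(0)` between groups tells the compiler not to move instructions across that point, so the hand-written order survives scheduling. The function name `interleave` is hypothetical.

```python
def interleave(n_mfma=16, n_lds=8, n_global=4):
    """Return an illustrative issue order for one pipeline stage.

    Each MFMA is followed by at most one LDS->register load and at most
    one global->LDS load, then a scheduling barrier that pins the group.
    Even spacing is an assumption of this sketch.
    """
    lds_every = n_mfma // n_lds         # 2 with the default counts
    global_every = n_mfma // n_global   # 4 with the default counts
    order = []
    for i in range(n_mfma):
        order.append(f"mfma[{i}]")
        if i % lds_every == 0:
            order.append(f"lds_load[{i // lds_every}]")
        if i % global_every == 1:
            order.append(f"global_load[{i // global_every}]")
        order.append("sched_barrier(0)")
    return order

schedule = interleave()
for op in schedule[:6]:
    print(op)
```

Spreading the loads this way means each memory operation is issued while matrix instructions are still in flight, which is the latency-hiding effect the pipeline aims for.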

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.