Deep Dive Into 4-Wave Interleave FP8 GEMM
Summary
AMD's 4-wave interleave FP8 GEMM algorithm, introduced by HipKittens, outperforms the 8-wave ping-pong design on AMD Instinct™ MI355X GPUs. It assigns one wave per SIMD unit, so each wave gets the full 512-VGPR budget and can hold a 128x128 output register tile, compared with 64x128 in the 8-wave design. The core innovation is a fine-grained software pipeline that uses `__builtin_amdgcn_sched_barrier(0)` to interleave 16 MFMA instructions with 8 LDS→register loads and 4 global→LDS loads at the 16x16 sub-tile level, which hides memory latency. Two further optimizations are an early accumulator store, which writes out each `c[i][j]` sub-tile as soon as it becomes available, and LDS swizzling, which eliminates the 8-way bank conflicts caused by the column-major access pattern of `16x16x128` MFMA instructions. Chiplet-aware grid swizzling additionally improves L2 and LLC cache hit rates on the MI355X's 8-XCD topology. Benchmarks on ROCm 7.2.2 show performance gains and better robustness across compiler versions relative to the 8-wave kernel.
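The register-budget argument can be checked with a little arithmetic. The sketch below is a host-side illustration, not code from the original article. It assumes 64 lanes per wave, FP32 accumulators occupying one 32-bit VGPR per element per lane, and that the 8-wave design places two waves on each SIMD so the 512-VGPR budget is split evenly. The function names are hypothetical.

```cpp
#include <cassert>

// Assumption: 64 lanes per wave, one 32-bit VGPR per FP32 element per lane.
const int kLanesPerWave = 64;
// From the summary: a wave that owns a SIMD alone gets 512 VGPRs.
const int kVgprsPerSimd = 512;

// VGPRs per lane needed to hold a rows x cols FP32 accumulator tile
// that is spread evenly across the lanes of one wave.
inline int accumulator_vgprs(int rows, int cols) {
    assert((rows * cols) % kLanesPerWave == 0);
    return (rows * cols) / kLanesPerWave;
}

// Per-wave VGPR budget when `waves_per_simd` waves share one SIMD evenly
// (an assumed even split, for illustration).
inline int vgpr_budget(int waves_per_simd) {
    assert(waves_per_simd > 0 && kVgprsPerSimd % waves_per_simd == 0);
    return kVgprsPerSimd / waves_per_simd;
}
```

Under these assumptions a 128x128 tile needs 256 accumulator VGPRs, which leaves half of a 512-VGPR budget for operands. With two waves per SIMD the same tile would consume the entire 256-VGPR budget, whereas a 64x128 tile needs only 128.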
Key takeaway
For machine learning engineers optimizing FP8 GEMM kernels on AMD Instinct™ MI355X GPUs, the 4-wave interleave design is the one to adopt for peak performance. Build the fine-grained software pipeline with `__builtin_amdgcn_sched_barrier(0)` and apply LDS swizzling to remove bank conflicts. Also consider chiplet-aware grid swizzling, tuning `WINDOW_SIZE` and `CHUNK_SIZE` to your hardware and problem size, to raise L2 and LLC cache hit rates. Compared with the 8-wave kernel, this approach is faster and more robust across compiler versions.
Key insights
The 4-wave interleave FP8 GEMM improves GPU performance through fine-grained software pipelining and conflict-free, cache-aware memory access patterns.
Principles
- One wave per SIMD gives each wave the full 512-VGPR budget.
- Interleave compute and memory to hide latency.
- Swizzle memory access to avoid bank conflicts.
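The swizzling principle can be illustrated with a toy bank model. This is a sketch under stated assumptions, not the layout used by the kernel described here. It assumes 4-byte words, a bank index equal to the linear word index modulo the bank count, and a simple XOR swizzle. `ToyLds`, `swizzled_col`, and `max_bank_conflict` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy LDS model (an assumption, not the MI355X layout): a row-major tile
// of 4-byte words, where a word's bank is its linear index % num_banks.
struct ToyLds {
    int words_per_row;  // must be a power of two for the XOR swizzle
    int num_banks;
};

// Hypothetical XOR swizzle: permute the column inside each row using the
// row index, so equal columns in different rows land in different banks.
inline int swizzled_col(int row, int col, int words_per_row) {
    assert((words_per_row & (words_per_row - 1)) == 0);
    return col ^ (row % words_per_row);
}

// Worst-case number of lanes hitting one bank when `lanes` lanes each
// read column `col` of rows 0..lanes-1 (a column-major access).
inline int max_bank_conflict(const ToyLds& lds, int lanes, int col,
                             bool swizzle) {
    std::vector<int> hits(lds.num_banks, 0);
    for (int row = 0; row < lanes; ++row) {
        const int c =
            swizzle ? swizzled_col(row, col, lds.words_per_row) : col;
        const int word = row * lds.words_per_row + c;
        ++hits[word % lds.num_banks];
    }
    return *std::max_element(hits.begin(), hits.end());
}
```

In this toy setup with 32 words per row and 32 banks, eight lanes reading one column all hit the same bank without swizzling, and each hits a distinct bank with it. The real kernel's bank count, element width, and swizzle function may differ.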
Method
The 4-wave kernel uses `__builtin_amdgcn_sched_barrier(0)` to interleave 16 MFMA instructions with 8 LDS→register loads and 4 global→LDS loads for 16x16 sub-tiles, alongside double buffering and early accumulator stores.
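The 16 : 8 : 4 ratio can be made concrete with a small host-side model of one pipeline step. The sketch below only generates an issue order; it is not the kernel's actual schedule. The even spacing (one LDS load after every second MFMA, one global load after every fourth) is an assumption made for illustration, and `interleaved_schedule` is a hypothetical helper.

```cpp
#include <cassert>
#include <string>

// Build one pipeline step as a string of issue slots:
//   'M' = MFMA, 'L' = LDS->register load, 'G' = global->LDS load.
// Assumption: loads are spaced evenly between the MFMA instructions.
inline std::string interleaved_schedule(int mfma, int lds_loads,
                                        int global_loads) {
    assert(mfma > 0 && lds_loads > 0 && global_loads > 0);
    assert(mfma % lds_loads == 0 && mfma % global_loads == 0);
    const int lds_every = mfma / lds_loads;        // 2 for 16 : 8
    const int global_every = mfma / global_loads;  // 4 for 16 : 4
    std::string slots;
    for (int i = 1; i <= mfma; ++i) {
        slots += 'M';
        if (i % lds_every == 0) slots += 'L';
        if (i % global_every == 0) slots += 'G';
        // In a HIP kernel, a __builtin_amdgcn_sched_barrier(0) placed at
        // this point would keep the compiler from moving instructions
        // across it, preserving the hand-written order.
    }
    return slots;
}
```

Calling `interleaved_schedule(16, 8, 4)` yields 28 slots in a repeating `MMLMMLG` pattern, so every load has MFMA work issued around it to cover its latency.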
In practice
- Insert `__builtin_amdgcn_sched_barrier(0)` calls to pin instruction ordering.
- Apply LDS swizzling for bank conflict avoidance.
- Tune `WINDOW_SIZE`/`CHUNK_SIZE` for cache reuse.
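Grid swizzling is easiest to see as two index remaps. The sketch below is one plausible reading, not the article's implementation. It assumes workgroups are dispatched round-robin across XCDs (workgroup `wgid` runs on XCD `wgid % num_xcds`), that `CHUNK_SIZE` is the number of consecutive logical IDs given to one XCD, and that `WINDOW_SIZE` is the number of tile rows grouped together for reuse. All function and type names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Step 1 (assumed semantics): remap a hardware workgroup ID so that each
// XCD receives runs of `chunk_size` consecutive logical IDs.
inline int chiplet_remap(int wgid, int num_wgs, int num_xcds,
                         int chunk_size) {
    assert(num_xcds > 0 && chunk_size > 0 && wgid >= 0 && wgid < num_wgs);
    const int block = num_xcds * chunk_size;
    const int limit = (num_wgs / block) * block;
    if (wgid >= limit) return wgid;      // tail past full blocks: unchanged
    const int xcd = wgid % num_xcds;     // XCD under round-robin dispatch
    const int local = wgid / num_xcds;   // position in that XCD's queue
    const int chunk = local / chunk_size;
    const int pos = local % chunk_size;
    return chunk * block + xcd * chunk_size + pos;
}

struct Tile {
    int m;  // output tile row
    int n;  // output tile column
};

// Step 2 (assumed semantics): walk the output in windows of `window_size`
// tile rows, column by column, so nearby IDs share A and B panels.
inline Tile window_tile(int id, int tiles_m, int tiles_n, int window_size) {
    assert(window_size > 0 && id >= 0 && id < tiles_m * tiles_n);
    const int per_window = window_size * tiles_n;
    const int first_m = (id / per_window) * window_size;
    const int rows = std::min(tiles_m - first_m, window_size);
    const int in_window = id % per_window;
    return Tile{first_m + in_window % rows, in_window / rows};
}
```

With 64 workgroups, 8 XCDs, and a chunk size of 4, hardware IDs 0, 8, 16, and 24 (all on XCD 0 under the round-robin assumption) map to logical IDs 0 through 3, so neighboring tiles share one XCD's L2. Good values for both parameters depend on the hardware and problem size, as noted above.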
Topics
- FP8 GEMM
- AMD Instinct MI355X
- GPU Optimization
- Software Pipelining
- LDS Swizzling
- Cache Hierarchy
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Hardware Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.