Deep Dive Into 4-Wave Interleave FP8 GEMM

2026-05-27 · Source: AMD ROCm Blogs · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, long

Summary

The 4-wave interleave FP8 GEMM algorithm, introduced by HipKittens, outperforms the 8-wave ping-pong design on AMD Instinct™ MI355X GPUs. It assigns one wave per SIMD unit, so each wave gets the full 512-VGPR budget and can hold a 128x128 output register tile instead of the 8-wave design's 64x128. The core innovation is a fine-grained software pipeline that uses `__builtin_amdgcn_sched_barrier(0)` to interleave 16 MFMA instructions with 8 LDS→register loads and 4 global→LDS loads at the 16x16 sub-tile level, which hides memory latency behind compute. Two further optimizations apply inside the kernel: an early accumulator store that writes each `c[i][j]` sub-tile as soon as it is complete, and LDS swizzling that eliminates the 8-way bank conflicts caused by the column-major access pattern of `16x16x128` MFMA instructions. At the grid level, chiplet-aware swizzling raises L2 and LLC cache hit rates on the MI355X's 8-XCD topology. Benchmarks on ROCm 7.2.2 show the 4-wave kernel running faster than the 8-wave kernel and holding its performance more consistently across compiler versions.
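
The register argument can be checked with a little arithmetic. The sketch below assumes 64-lane waves, FP32 accumulators (one VGPR per element per lane), four SIMDs per compute unit so that an 8-wave workgroup places two waves on each SIMD, and the 512-VGPR per-lane budget quoted above. The function names are illustrative and do not come from the article.

```python
WAVE_SIZE = 64        # lanes per wave (assumed)
VGPRS_PER_SIMD = 512  # per-lane VGPR budget of one SIMD, as quoted in the summary

def accumulator_vgprs(tile_m, tile_n):
    """VGPRs per lane needed for a tile_m x tile_n FP32 accumulator tile."""
    return (tile_m * tile_n) // WAVE_SIZE

def vgpr_budget(waves_per_simd):
    """VGPRs per lane available to each wave when the SIMD is shared."""
    return VGPRS_PER_SIMD // waves_per_simd

# (accumulator registers, total budget) for each design
four_wave = (accumulator_vgprs(128, 128), vgpr_budget(1))
eight_wave = (accumulator_vgprs(64, 128), vgpr_budget(2))

print("4-wave:", four_wave)
print("8-wave:", eight_wave)
```

Under these assumptions, a 128x128 accumulator tile alone would consume the entire budget of a wave that shares its SIMD with another wave, leaving no registers for A and B operands. That is consistent with the 8-wave design stopping at 64x128.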

Key takeaway

Machine learning engineers optimizing FP8 GEMM kernels on AMD Instinct™ MI355X GPUs should prefer the 4-wave interleave design over the 8-wave ping-pong design. Pin the fine-grained software pipeline with `__builtin_amdgcn_sched_barrier(0)` and apply LDS swizzling to remove bank conflicts. Add chiplet-aware grid swizzling, tuning `WINDOW_SIZE` and `CHUNK_SIZE` to your hardware and problem size, to raise L2 and LLC cache hit rates. The resulting kernel is both faster and less sensitive to compiler version than the 8-wave kernel.

Key insights

The 4-wave interleave FP8 GEMM improves GPU performance through fine-grained software pipelining and conflict-free memory access patterns.

Principles

One wave per SIMD maximizes VGPR utilization.
Interleave compute and memory to hide latency.
Swizzle memory access to avoid bank conflicts.
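
The swizzling principle can be illustrated with a toy bank model. The sketch below assumes 32 LDS banks of 4 bytes each and a tile whose rows are 128 bytes of FP8 data (32 bank words); the bank count and the XOR pattern are assumptions chosen for illustration, not the MI355X hardware layout or the article's swizzle function. It counts how many of 8 lanes land in the same bank when each reads the same column from a different row.

```python
from collections import Counter

NUM_BANKS = 32      # assumed bank count for this toy model
WORDS_PER_ROW = 32  # 128 bytes of FP8 per row / 4-byte bank words

def linear(row, col):
    """Plain row-major word address."""
    return row * WORDS_PER_ROW + col

def swizzled(row, col):
    """XOR the low 3 column bits with the row index (hypothetical pattern)."""
    return row * WORDS_PER_ROW + (col ^ (row % 8))

def worst_conflict(addr_fn, rows, col):
    """Largest number of lanes that hit one bank in a column read."""
    banks = Counter(addr_fn(r, col) % NUM_BANKS for r in rows)
    return max(banks.values())

rows = range(8)
before = worst_conflict(linear, rows, col=3)
after = worst_conflict(swizzled, rows, col=3)
print("conflict degree before:", before, "after:", after)
```

In this model the unswizzled column read serializes 8 ways, because a row stride of 32 words maps every row to the same bank, while the XOR spreads the 8 rows over 8 distinct banks. The XOR only permutes words within a row, so no two elements collide in storage.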

Method

The 4-wave kernel uses `__builtin_amdgcn_sched_barrier(0)` to interleave 16 MFMA instructions with 8 LDS→register loads and 4 global→LDS loads for 16x16 sub-tiles, alongside double buffering and early accumulator stores.
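
The instruction mix described above can be modeled as an ordered list. The sketch below spreads the memory operations evenly between the MFMAs (one LDS load per 2 MFMAs, one global load per 4); the even spacing, the function name, and the placement of a barrier after every group are assumptions for illustration, since the summary gives only the instruction counts and not the exact order.

```python
def interleave(n_mfma=16, n_lds=8, n_global=4):
    """Return one pipeline step as an ordered list of instruction labels.

    'BARRIER' marks where a real kernel would place a scheduling barrier
    to keep the compiler from regrouping the instructions.
    """
    lds_every = n_mfma // n_lds        # MFMAs between LDS->register loads
    global_every = n_mfma // n_global  # MFMAs between global->LDS loads
    schedule = []
    lds_count = 0
    global_count = 0
    for i in range(1, n_mfma + 1):
        schedule.append(f"mfma{i - 1}")
        if i % lds_every == 0:
            schedule.append(f"lds_load{lds_count}")
            lds_count += 1
        if i % global_every == 0:
            schedule.append(f"global_load{global_count}")
            global_count += 1
        schedule.append("BARRIER")
    return schedule

schedule = interleave()
print(schedule[:8])
```

In a HIP kernel, each `BARRIER` would correspond to a `__builtin_amdgcn_sched_barrier(0)` call. A mask of 0 tells the LLVM scheduler that no instruction may be moved across that point, which is what keeps the hand-written interleaving intact.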

In practice

Use `__builtin_amdgcn_sched_barrier(0)` to pin instruction ordering.
Apply LDS swizzling for bank conflict avoidance.
Tune `WINDOW_SIZE`/`CHUNK_SIZE` for cache reuse.
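
One way to picture the grid swizzle is as two index remaps. The sketch below assumes workgroups are dispatched round-robin across 8 XCDs, interprets `CHUNK_SIZE` as the number of consecutive logical IDs given to one XCD, and interprets `WINDOW_SIZE` as the number of tile rows walked together. These interpretations and both function names are assumptions for illustration, not the article's code.

```python
NUM_XCDS = 8     # XCD count on the MI355X, per the summary
CHUNK_SIZE = 4   # consecutive logical ids per XCD (assumed meaning)
WINDOW_SIZE = 2  # tile rows traversed together (assumed meaning)

def xcd_remap(wgid):
    """Give workgroups on the same XCD contiguous logical ids.

    Assumes hardware id wgid runs on XCD (wgid % NUM_XCDS) and the grid
    size is a multiple of NUM_XCDS * CHUNK_SIZE.
    """
    xcd = wgid % NUM_XCDS
    local = wgid // NUM_XCDS
    chunk, pos = divmod(local, CHUNK_SIZE)
    return chunk * NUM_XCDS * CHUNK_SIZE + xcd * CHUNK_SIZE + pos

def windowed_tile(pid, tiles_m, tiles_n):
    """Map a logical id to (tile_row, tile_col), column-major within a window."""
    per_window = WINDOW_SIZE * tiles_n
    first_row = (pid // per_window) * WINDOW_SIZE
    rows = min(WINDOW_SIZE, tiles_m - first_row)
    local = pid % per_window
    return first_row + local % rows, local // rows

# Output tiles handled by the first four workgroups that land on XCD 0
# in an 8x8 grid of tiles.
tiles_m = tiles_n = 8
xcd0_tiles = [windowed_tile(xcd_remap(w), tiles_m, tiles_n) for w in (0, 8, 16, 24)]
print(xcd0_tiles)
```

With these settings, the first four workgroups on XCD 0 cover a 2x2 block of output tiles, so they reuse the same two row panels of A and two column panels of B through one L2 cache. Larger `WINDOW_SIZE` and `CHUNK_SIZE` values change the shape of that block, which is why they are worth tuning per problem size.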

Topics

FP8 GEMM
AMD Instinct MI355X
GPU Optimization
Software Pipelining
LDS Swizzling
Cache Hierarchy

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.