Debugging NaN Results in CK Tile GEMM: A rocgdb Detective Story
Summary
This post walks through debugging NaN (Not-a-Number) outputs in the Tile GEMM implementation of AMD's Composable Kernel (CK), running on an AMD MI300X GPU with ROCm 6.4. Up to 98% of the output values were wrong, and the failure appeared only when both prefetch and instruction scheduling were enabled. Systematic debugging with `rocgdb` traced the problem from incorrect MFMA outputs back to a single-character typo in the type declaration of the `bWarpTile` variable: `ALdsTile` where `BLdsTile` was intended. The typo caused a tensor distribution mismatch, so B-matrix data was loaded and then interpreted with the wrong memory layout, corrupting the values fed into the computation. Changing that one line eliminated the NaN errors and restored correct results with similar performance.
Key takeaway
For deep learning engineers optimizing GPU kernels, this case shows how much depends on exact type declarations in template-heavy code. If you encounter unexplained NaN or incorrect outputs, simplify your test case and trace the data flow systematically with `rocgdb`. A subtle type mismatch, like `ALdsTile` vs. `BLdsTile`, can silently corrupt data distribution, especially when optimizations like instruction scheduling are enabled, so validate memory layouts carefully at every stage.
Key insights
A single-character typo in a type declaration caused a subtle tensor distribution bug leading to NaN outputs in a GPU kernel.
Principles
- Simplify complex problems before debugging.
- Work backwards from symptoms to pinpoint corruption.
- Type aliases can mask critical type mismatches.
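To make the last principle concrete, here is a minimal host-side sketch of how two aliases that differ by one letter can describe buffers of identical size but different layout, so the wrong one compiles and silently reads the wrong element. Everything in it (`LdsTile`, `Operand`, `read_as`, `is_b_tile`, the 4x4 shape, and the row-major vs. column-major split) is a hypothetical stand-in chosen for illustration, not the real CK Tile types or their distribution encoding.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for illustration; these are NOT the real CK Tile types.
enum class Operand { A, B };

template <Operand Op, std::size_t Rows, std::size_t Cols, bool RowMajor>
struct LdsTile {
    static constexpr Operand operand = Op;
    static constexpr std::size_t rows = Rows;
    static constexpr std::size_t cols = Cols;
    static constexpr std::size_t size = Rows * Cols;

    // Map a logical (row, col) coordinate to a linear buffer offset.
    static constexpr std::size_t offset(std::size_t r, std::size_t c) {
        return RowMajor ? r * Cols + c : c * Rows + r;
    }
};

// Same element count, different layout: the two aliases differ by one letter.
using ALdsTile = LdsTile<Operand::A, 4, 4, true>;
using BLdsTile = LdsTile<Operand::B, 4, 4, false>;

// Read element (r, c) from buf, interpreting the buffer with Tile's layout.
// Because both aliases have the same size, either one is accepted here.
template <typename Tile>
float read_as(const std::array<float, Tile::size>& buf, std::size_t r, std::size_t c) {
    assert(r < Tile::rows && c < Tile::cols);
    return buf[Tile::offset(r, c)];
}

// Compile-time predicate that a B-side declaration can check with static_assert.
template <typename Tile>
constexpr bool is_b_tile() {
    return Tile::operand == Operand::B;
}
```

In this sketch, reading the same buffer at the same coordinate through `ALdsTile` and `BLdsTile` returns different elements without any compiler diagnostic. A check such as `static_assert(is_b_tile<BLdsTile>(), "B warp tile needs a B LDS tile")` next to the B-side declaration is one possible way to turn that kind of typo into a compile error.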
Method
The debugging method involved reducing problem size, simplifying inputs, systematically tracing data flow through GPU memory stages (global, LDS, tile, warp registers), and using `rocgdb` to inspect variables at each stage until data corruption was identified.
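The stage-by-stage trace can be pictured as a search for the first stage whose data no longer matches what it should be. The sketch below is a host-side illustration only: it assumes the values seen at each stage have already been copied out by hand (for example, from what the debugger printed) into plain vectors, and the names `StageSnapshot` and `first_corrupted_stage` are hypothetical helpers, not part of `rocgdb` or CK.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// One pipeline stage: its name plus the expected and observed values.
// How the observed values are captured is outside this sketch.
struct StageSnapshot {
    std::string name;             // e.g. "global", "LDS", "tile", "warp registers"
    std::vector<float> expected;  // reference values for this stage
    std::vector<float> actual;    // values observed at this stage
};

// Return the name of the first stage whose data diverges from the reference
// (size mismatch, NaN, or difference above tol), or "" if every stage matches.
std::string first_corrupted_stage(const std::vector<StageSnapshot>& stages,
                                  float tol = 1e-6f) {
    assert(tol >= 0.0f);
    for (const auto& s : stages) {
        if (s.expected.size() != s.actual.size()) {
            return s.name;
        }
        for (std::size_t i = 0; i < s.actual.size(); ++i) {
            if (std::isnan(s.actual[i]) ||
                std::fabs(s.actual[i] - s.expected[i]) > tol) {
                return s.name;
            }
        }
    }
    return "";
}
```

Stages are checked in pipeline order, so the returned name marks the earliest point where corruption is visible; the bug then lies in that stage or in the step that feeds it.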
In practice
- Use `rocgdb` for GPU kernel variable inspection.
- Set all inputs to 1 for deterministic error reproduction.
- Validate data at each pipeline stage.
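Setting every input to 1 is useful because the expected result becomes trivial: for C = A x B with an inner dimension of K, every element of C should equal K. The following is a minimal host-side sketch of that check, assuming row-major storage and a K small enough to be represented exactly in `float`; `reference_gemm`, `CheckResult`, and `check_all_ones_output` are hypothetical names introduced here, not functions from CK.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Naive row-major reference GEMM: C (m x n) = A (m x k) * B (k x n).
std::vector<float> reference_gemm(const std::vector<float>& a,
                                  const std::vector<float>& b,
                                  std::size_t m, std::size_t n, std::size_t k) {
    assert(a.size() == m * k && b.size() == k * n);
    std::vector<float> c(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                acc += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = acc;
        }
    }
    return c;
}

struct CheckResult {
    std::size_t nan_count = 0;    // elements that are NaN
    std::size_t wrong_count = 0;  // elements != k, NaNs included
    double wrong_fraction = 0.0;  // wrong_count / total elements
};

// With all-ones inputs, every output element should equal k exactly.
CheckResult check_all_ones_output(const std::vector<float>& c, std::size_t k) {
    CheckResult r;
    const float expected = static_cast<float>(k);
    for (float v : c) {
        if (std::isnan(v)) {
            ++r.nan_count;
            ++r.wrong_count;
        } else if (v != expected) {
            ++r.wrong_count;
        }
    }
    if (!c.empty()) {
        r.wrong_fraction = static_cast<double>(r.wrong_count) /
                           static_cast<double>(c.size());
    }
    return r;
}
```

Running `check_all_ones_output` on the kernel's output, copied back to the host, gives the NaN count and the fraction of wrong values in one pass, which makes a failure easy to spot and to compare across kernel configurations.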
Topics
- GPU Kernel Debugging
- Composable Kernel
- GEMM Optimization
- Tensor Distribution
- rocgdb
Best for: Machine Learning Engineer, Deep Learning Engineer, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.