Debugging NaN Results in CK Tile GEMM: A rocgdb Detective Story

Source: AMD ROCm Blogs · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering) · Depth: Advanced, long

Summary

This post walks through debugging NaN (Not-a-Number) outputs in AMD's Composable Kernel (CK) Tile GEMM implementation on an AMD MI300X GPU with ROCm 6.4. The failure showed up as incorrect results, with up to 98% of output values wrong, and occurred only when both prefetch and instruction scheduling were enabled. Systematic debugging with `rocgdb` traced the problem from incorrect MFMA outputs back to a single-character typo in a type declaration: the `bWarpTile` variable was declared with `ALdsTile` instead of `BLdsTile`. The typo caused a tensor distribution mismatch, so B-matrix data was loaded and then interpreted with the wrong memory layout, which corrupted the values fed into the computation. The fix was a one-line change that eliminated the NaN errors and restored correct results with similar performance.

Key takeaway

For Deep Learning Engineers optimizing GPU kernels, this case highlights the critical importance of meticulous type declarations in template-heavy code. If you encounter mysterious NaN or incorrect outputs, simplify your test case and systematically trace data flow with `rocgdb`. A subtle type mismatch, like `ALdsTile` vs. `BLdsTile`, can silently corrupt data distribution, especially when optimizations like instruction scheduling are enabled, making careful validation of memory layouts essential.
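
To make the failure mode concrete, here is a minimal host-side sketch of how a wrong tile type can compile cleanly and still scramble data. Everything in it is an assumption for illustration: the `Tile`, `RowMajor`, `ColMajor`, `load_raw`, and `load_checked` names are hypothetical, and the `ALdsTile` and `BLdsTile` aliases are toy stand-ins that only borrow the names from the article. They are not CK Tile's actual types or API. The point is only that two tile types with identical storage but different layouts can be swapped without a compiler error, and that a `static_assert` on the layout is one possible guard.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical layouts: how a logical (row, col) maps to a storage slot.
struct RowMajor {
    static constexpr std::size_t index(std::size_t r, std::size_t c,
                                       std::size_t /*rows*/, std::size_t cols) {
        return r * cols + c;
    }
};
struct ColMajor {
    static constexpr std::size_t index(std::size_t r, std::size_t c,
                                       std::size_t rows, std::size_t /*cols*/) {
        return c * rows + r;
    }
};

// Hypothetical tile: identical storage for every layout, different addressing.
template <typename Layout, std::size_t Rows, std::size_t Cols>
struct Tile {
    using layout = Layout;
    std::array<float, Rows * Cols> data{};

    float& at(std::size_t r, std::size_t c) {
        assert(r < Rows && c < Cols);
        return data[Layout::index(r, c, Rows, Cols)];
    }
    float at(std::size_t r, std::size_t c) const {
        assert(r < Rows && c < Cols);
        return data[Layout::index(r, c, Rows, Cols)];
    }
};

// Toy stand-ins that only reuse the names from the article.
using ALdsTile = Tile<RowMajor, 2, 2>;
using BLdsTile = Tile<ColMajor, 2, 2>;

// Copies raw storage and reinterprets it under Dst's layout.
// Compiles for any pair of tiles with the same element count.
template <typename Dst, typename Src>
Dst load_raw(const Src& src) {
    Dst dst;
    dst.data = src.data;
    return dst;
}

// Same copy, but refuses to compile when the layouts differ.
template <typename Dst, typename Src>
Dst load_checked(const Src& src) {
    static_assert(std::is_same<typename Dst::layout, typename Src::layout>::value,
                  "destination tile layout must match source tile layout");
    return load_raw<Dst>(src);
}
```

In this sketch, writing `5.0f` to `b.at(0, 1)` on a `BLdsTile` and then calling `load_raw<ALdsTile>(b)` yields a tile whose `at(0, 1)` reads a different slot, while `load_checked<ALdsTile>(b)` would fail at compile time instead of at run time.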

Key insights

A single-character typo in a type declaration caused a subtle tensor distribution bug leading to NaN outputs in a GPU kernel.

Method

The debugging method involved reducing problem size, simplifying inputs, systematically tracing data flow through GPU memory stages (global, LDS, tile, warp registers), and using `rocgdb` to inspect variables at each stage until data corruption was identified.
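
The stage-by-stage idea can also be sketched on the host side. The article did this interactively in `rocgdb`; the code below is not that workflow, only an analogous check under stated assumptions: each stage (global, LDS, tile, warp registers) has been dumped to a host buffer and mapped back to one common logical element order, so an uncorrupted operand should look identical at every stage. The `StageSnapshot`, `mismatch_fraction`, and `first_corrupt_stage` names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record: one operand's values as observed at a named stage,
// already rearranged into a common logical order.
struct StageSnapshot {
    std::string name;
    std::vector<float> values;
};

// Fraction of elements that are NaN or differ from the reference by more
// than tol. A size mismatch or empty reference counts as fully corrupt.
double mismatch_fraction(const std::vector<float>& got,
                         const std::vector<float>& ref,
                         float tol = 1e-3f) {
    assert(tol >= 0.0f);
    if (got.size() != ref.size() || ref.empty()) {
        return 1.0;
    }
    std::size_t bad = 0;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        if (std::isnan(got[i]) || std::fabs(got[i] - ref[i]) > tol) {
            ++bad;
        }
    }
    return static_cast<double>(bad) / static_cast<double>(ref.size());
}

// Walks the stages in pipeline order and returns the name of the first one
// whose snapshot deviates from the reference, or "" if all stages match.
std::string first_corrupt_stage(const std::vector<StageSnapshot>& stages,
                                const std::vector<float>& ref) {
    for (const StageSnapshot& stage : stages) {
        if (mismatch_fraction(stage.values, ref) > 0.0) {
            return stage.name;
        }
    }
    return "";
}
```

Using small problem sizes and simple inputs, as the method suggests, keeps each snapshot short enough to compare by eye once the first deviating stage is known.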

Best for: Machine Learning Engineer, Deep Learning Engineer, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.