Customizing Kernels with hipBLASLt TensileLite GEMM Tuning - Advanced User Guide
Summary
This guide covers TensileLite Tuning, an advanced optimization method in AMD's hipBLASLt library for General Matrix Multiply (GEMM) operations on AMD Instinct MI300X GPUs. Offline and Online tuning select from existing kernel pools; TensileLite Tuning instead generates new, custom GEMM kernels tailored to specific problem sizes (M, N, K). The workflow is: capture the GEMM shapes, generate a configuration YAML, customize its parameters (thread organization, loop control, parallelism strategies such as Split-K, memory access, and instruction configuration), then build and run the tuning. In benchmarks on an AMD Instinct MI300X with ROCm 6.4.3 and hipBLASLt 1.1.0, TensileLite Tuning delivered an average 119% speedup over Offline Tuning and 225% over the Baseline, with gains across small, medium, and large workloads.
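To make the Split-K idea mentioned above concrete, here is a minimal pure-Python sketch. It is not hipBLASLt or TensileLite code; the function names `matmul` and `split_k_matmul` are hypothetical and chosen only for illustration. It shows that splitting the K dimension into chunks, computing a partial product per chunk, and summing the partials gives the same result as a single GEMM.

```python
def matmul(a, b):
    """Reference GEMM: C[i][j] = sum over p of A[i][p] * B[p][j]."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def split_k_matmul(a, b, splits):
    """Split K into `splits` chunks, compute one partial product per
    chunk, and accumulate the partial products into C."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    chunk = (k + splits - 1) // splits  # ceil(k / splits)
    for start in range(0, k, chunk):
        stop = min(start + chunk, k)
        # On a GPU, each K chunk could be assigned to a different workgroup;
        # here the chunks simply run one after another.
        for i in range(m):
            for j in range(n):
                c[i][j] += sum(a[i][p] * b[p][j] for p in range(start, stop))
    return c


A = [[1, 2, 3, 4], [5, 6, 7, 8]]      # M=2, K=4
B = [[1, 0], [0, 1], [2, 0], [0, 2]]  # K=4, N=2
assert split_k_matmul(A, B, splits=2) == matmul(A, B)
```

Splitting K tends to help when M and N are too small to fill the GPU on their own, at the cost of an extra reduction step over the partial results.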
Key takeaway
For AI engineers optimizing LLM inference on AMD Instinct MI300X GPUs, TensileLite Tuning is the option to reach for when GEMM performance matters most. If your workloads have unique or critical (M, N, K) shapes, use this advanced workflow to generate custom kernels: it can yield up to 3.2x speedups over baseline performance and outperform the pre-compiled kernels.
Key insights
TensileLite Tuning generates custom GEMM kernels for AMD GPUs, outperforming both Offline Tuning and the Baseline in the reported benchmarks.
Principles
- Custom kernel generation maximizes hardware utilization.
- Parameter tuning balances GPU resource usage.
- Optimized kernels reduce latency and boost throughput.
Method
The TensileLite Tuning workflow involves capturing GEMM shapes, generating and customizing a configuration YAML, executing the tuning script to benchmark kernel candidates, and merging the best-performing kernel into the hipBLASLt library.
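The "benchmark candidates, keep the best" step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the tuning script itself: the `Result` record, the kernel names, and the timings are all made up, and the throughput formula is the standard 2·M·N·K floating-point operation count for a GEMM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    shape: tuple    # (M, N, K)
    kernel: str     # hypothetical candidate name
    time_us: float  # measured time in microseconds (made-up values below)


def tflops(shape, time_us):
    """GEMM throughput in TFLOP/s, counting 2*M*N*K operations."""
    m, n, k = shape
    return 2 * m * n * k / (time_us * 1e-6) / 1e12


def pick_winners(results):
    """Return the fastest candidate for each (M, N, K) shape."""
    best = {}
    for r in results:
        if r.shape not in best or r.time_us < best[r.shape].time_us:
            best[r.shape] = r
    return best


results = [
    Result((1024, 1024, 1024), "candidate_a", 120.0),
    Result((1024, 1024, 1024), "candidate_b", 100.0),
    Result((16, 4096, 4096), "candidate_a", 40.0),
    Result((16, 4096, 4096), "candidate_c", 55.0),
]
for shape, r in pick_winners(results).items():
    print(shape, r.kernel, round(tflops(shape, r.time_us), 2))
```

Keeping one winner per shape mirrors the idea of merging only the best-performing kernel for each captured problem size back into the library.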
In practice
- Use `hipblaslt-bench` to capture target GEMM shapes.
- Adjust `WorkGroup`, `ThreadTile`, `LoopUnroll`, and `SplitU` parameters.
- Rebuild hipBLASLt after merging new kernel logic.
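Each value added to one of these parameters multiplies the number of kernel candidates the tuning run must benchmark. The sketch below enumerates such a search space as a Cartesian product. The parameter names are taken from the list above, but the values are hypothetical placeholders and the dictionary layout is an assumption for illustration, not the TensileLite YAML schema.

```python
from itertools import product

# Hypothetical candidate values; real tuning configs will differ.
search_space = {
    "WorkGroup": [(16, 16), (32, 8)],
    "ThreadTile": [(4, 4), (8, 8)],
    "LoopUnroll": [8, 16],
    "SplitU": [1, 4],
}


def enumerate_candidates(space):
    """Expand a dict of parameter -> list of values into one dict per
    combination (the Cartesian product of all value lists)."""
    names = list(space)
    return [dict(zip(names, combo)) for combo in product(*space.values())]


candidates = enumerate_candidates(search_space)
print(len(candidates))  # 2 * 2 * 2 * 2 = 16
```

Because the candidate count grows multiplicatively, it usually pays to start with a narrow range per parameter and widen only where the benchmarks suggest headroom.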
Topics
- hipBLASLt
- TensileLite Tuning
- GEMM Optimization
- AMD Instinct MI300X
- Kernel Generation
Best for: Machine Learning Engineer, AI Engineer, AI Hardware Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.