Customizing Kernels with hipBLASLt TensileLite GEMM Tuning - Advanced User Guide

· Source: AMD ROCm Blogs · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Hardware Acceleration · Depth: Advanced, extended

Summary

This guide details TensileLite Tuning, an advanced optimization method within AMD's hipBLASLt library for General Matrix Multiply (GEMM) operations on AMD Instinct MI300X GPUs. Offline Tuning and Online Tuning select from existing kernel pools, whereas TensileLite Tuning generates entirely new, custom GEMM kernels tailored to specific problem sizes (M, N, K). The process has four stages: capture the GEMM shapes, generate a configuration YAML, customize its parameters (thread organization, loop control, parallelism strategies such as Split-K, memory access, and instruction configuration), then build and execute the tuning workflow. Benchmarks on an AMD Instinct MI300X with ROCm 6.4.3 and hipBLASLt 1.1.0 show TensileLite Tuning delivering an average 119% speedup over Offline Tuning and 225% over the Baseline, with gains across small, medium, and large workloads.
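To make the Split-K strategy mentioned above concrete, here is a minimal pure-Python sketch of the idea: the K dimension is cut into chunks, each chunk produces a partial product, and the partials are summed. It is an illustration of the arithmetic only, not TensileLite's implementation; the function names `gemm` and `split_k_gemm` and the `splits` parameter are hypothetical.

```python
def gemm(a, b):
    """Reference GEMM on nested lists: C[M][N] = A[M][K] x B[K][N]."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def split_k_gemm(a, b, splits):
    """Split K into up to `splits` chunks and sum the partial products.

    Hypothetical helper for illustration; on a GPU each chunk would be
    handled by a separate workgroup, followed by a reduction step.
    """
    m, k, n = len(a), len(b), len(b[0])
    chunk = -(-k // splits)  # ceiling division: size of each K chunk
    c = [[0] * n for _ in range(m)]
    for start in range(0, k, chunk):
        stop = min(start + chunk, k)
        a_part = [row[start:stop] for row in a]  # M x chunk slice of A
        b_part = b[start:stop]                   # chunk x N slice of B
        partial = gemm(a_part, b_part)
        for i in range(m):
            for j in range(n):
                c[i][j] += partial[i][j]         # reduce the partials
    return c


a = [[1, 2, 3, 4], [5, 6, 7, 8]]      # M=2, K=4
b = [[1, 0], [0, 1], [2, 3], [4, 5]]  # K=4, N=2
assert split_k_gemm(a, b, splits=2) == gemm(a, b)
```

Splitting K exposes extra parallelism when M and N are small relative to K, at the cost of a reduction step. With floating-point inputs the changed summation order can alter results in the last bits, so comparisons should use a tolerance rather than exact equality.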

Key takeaway

For AI engineers optimizing LLM inference on AMD Instinct MI300X GPUs, TensileLite Tuning is the route to the highest GEMM performance reported in the guide. If your workloads have unique or critical (M, N, K) shapes, consider implementing this advanced workflow to generate custom kernels: it can yield up to 3.2x speedups over baseline performance and significantly surpass pre-compiled binaries.

Key insights

TensileLite Tuning generates custom GEMM kernels for AMD GPUs rather than selecting from existing kernel pools, and in the cited benchmarks it outperforms both Offline Tuning and the Baseline.

Method

The TensileLite Tuning workflow involves capturing GEMM shapes, generating and customizing a configuration YAML, executing the tuning script to benchmark kernel candidates, and merging the best-performing kernel into the hipBLASLt library.
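The benchmarking step of this workflow amounts to timing each kernel candidate for a given (M, N, K) shape and keeping the fastest. A minimal sketch of that selection logic follows, assuming timings have already been collected; the function names and the candidate labels are hypothetical and do not reflect the tuning script's actual output format. The only fixed fact used is that a GEMM performs 2*M*N*K floating-point operations.

```python
def gemm_tflops(m, n, k, seconds):
    """Throughput in TFLOP/s for one GEMM (2*M*N*K operations)."""
    return 2 * m * n * k / seconds / 1e12


def pick_best(candidates):
    """Return the (name, seconds) pair with the lowest measured time."""
    return min(candidates, key=lambda item: item[1])


def speedup(baseline_seconds, tuned_seconds):
    """Ratio of baseline time to tuned time; above 1.0 means faster."""
    return baseline_seconds / tuned_seconds


# Placeholder timings for illustration only, not measurements.
shape = (4096, 4096, 4096)
candidates = [("kernel_a", 0.0012), ("kernel_b", 0.0009), ("kernel_c", 0.0010)]

best_name, best_time = pick_best(candidates)
print(best_name, round(gemm_tflops(*shape, best_time), 1), "TFLOP/s")
```

In the real workflow the winning kernel is then merged into the hipBLASLt library, which this sketch does not attempt to model.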

Best for: Machine Learning Engineer, AI Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.