hipBLASLt Online GEMM Tuning

2026-03-19 · Source: AMD ROCm Blogs · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, medium

Summary

AMD has integrated hipBLASLt Online GEMM Tuning into LLM frameworks, exemplified by RTP-LLM, to improve General Matrix Multiply (GEMM) performance. The feature, developed by the AMD Quark Team, optimizes at runtime by dynamically selecting the best-performing GEMM kernel, with no prior offline tuning required. When a new GEMM configuration appears, the online tuning mechanism benchmarks multiple candidate algorithms, caches the fastest one, and reuses it for subsequent identical operations. In experiments on the Qwen3-30B model using an AMD MI308 GPU, online tuning averaged 105.98% of baseline performance and 100.79% of offline-tuned performance, at a one-time benchmarking cost of 31 seconds. The approach prioritizes ease of use, adaptability to dynamic inputs, and resilience to evolving workloads and hardware.

Key takeaway

For NLP engineers optimizing LLM inference on AMD GPUs, hipBLASLt Online GEMM Tuning offers a user-transparent way to improve GEMM performance: in the reported Qwen3-30B experiment it averaged about 6% over the untuned baseline. You enable it with a single environment variable (HIP_ONLINE_TUNING=1), which removes the need for a separate offline benchmarking pass. Because tuning happens at runtime, your models adapt to dynamic inputs and evolving hardware, with performance comparable to, or slightly better than, traditional offline tuning and minimal setup overhead.

Key insights

hipBLASLt Online GEMM Tuning dynamically optimizes matrix multiplication at runtime for improved LLM performance.

Principles

Dynamic kernel selection improves GEMM performance.
Caching optimal solutions amortizes tuning overhead.
Online tuning adapts to evolving workloads and hardware.
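
To make the amortization principle concrete, the sketch below estimates how much baseline runtime is needed before the one-time tuning cost pays for itself. It assumes the reported 105.98% figure is a throughput ratio against the baseline and that the 31-second overhead is paid once; both readings are assumptions about the summary's numbers, and the function name is hypothetical.

```python
def break_even_seconds(overhead_s: float, relative_perf: float) -> float:
    """Baseline runtime (in seconds) after which a one-time tuning cost is recovered.

    relative_perf is tuned throughput divided by baseline throughput,
    e.g. 1.0598 for "105.98% of baseline" (assumed interpretation).
    """
    if relative_perf <= 1.0:
        raise ValueError("tuning never pays off unless relative_perf > 1.0")
    # Work that takes T seconds untuned takes T / relative_perf seconds tuned,
    # so each baseline second saves (1 - 1 / relative_perf) seconds.
    saved_per_baseline_second = 1.0 - 1.0 / relative_perf
    return overhead_s / saved_per_baseline_second


if __name__ == "__main__":
    # Figures taken from the summary: 31 s overhead, 105.98% relative performance.
    print(f"{break_even_seconds(31.0, 1.0598):.0f} s of baseline work to break even")
```

Under these assumptions the overhead is recovered after roughly nine minutes of GEMM-bound baseline work, which is short compared with a long-running inference service.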

Method

The hipBLASLt GEMM wrapper dynamically invokes heuristic and tuning APIs, benchmarks candidate algorithms for new GEMM shapes, selects the best, and caches the result for reuse.
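
The benchmark-select-cache loop described above can be sketched in plain Python. This is a minimal illustration, not the hipBLASLt wrapper itself: the class name, the cache key of (M, N, K, dtype), and the stand-in kernels are all hypothetical, and a real implementation would time GPU kernels returned by the library's heuristic API rather than Python callables.

```python
import time
from typing import Callable, Dict, Tuple

GemmKey = Tuple[int, int, int, str]      # hypothetical cache key: (M, N, K, dtype)
Kernel = Callable[[int, int, int], None]  # stand-in for a candidate GEMM algorithm


class OnlineGemmTuner:
    """Benchmarks candidates the first time a shape is seen, then reuses the winner."""

    def __init__(self, candidates: Dict[str, Kernel], repeats: int = 3) -> None:
        self.candidates = candidates
        self.repeats = repeats
        self.cache: Dict[GemmKey, str] = {}
        self.tuning_runs = 0  # how many shapes triggered benchmarking

    def _benchmark(self, kernel: Kernel, m: int, n: int, k: int) -> float:
        # Keep the best of several timed runs to reduce timing noise.
        best = float("inf")
        for _ in range(self.repeats):
            start = time.perf_counter()
            kernel(m, n, k)
            best = min(best, time.perf_counter() - start)
        return best

    def select(self, m: int, n: int, k: int, dtype: str = "bf16") -> str:
        key = (m, n, k, dtype)
        if key not in self.cache:  # new GEMM configuration: tune once
            self.tuning_runs += 1
            timings = {name: self._benchmark(fn, m, n, k)
                       for name, fn in self.candidates.items()}
            self.cache[key] = min(timings, key=timings.get)
        return self.cache[key]     # cached winner for repeated shapes

    def run(self, m: int, n: int, k: int, dtype: str = "bf16") -> str:
        name = self.select(m, n, k, dtype)
        self.candidates[name](m, n, k)
        return name


def slow_kernel(m: int, n: int, k: int) -> None:
    time.sleep(0.01)  # pretend this algorithm is a poor fit for the shape


def fast_kernel(m: int, n: int, k: int) -> None:
    pass              # pretend this algorithm is the best fit


if __name__ == "__main__":
    tuner = OnlineGemmTuner({"slow": slow_kernel, "fast": fast_kernel})
    print(tuner.run(4096, 4096, 1024), tuner.tuning_runs)
    print(tuner.run(4096, 4096, 1024), tuner.tuning_runs)  # same shape: no re-tuning
```

Keying the cache on the full GEMM configuration is what lets the tuning cost be paid once per distinct shape rather than once per call.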

In practice

Set HIP_ONLINE_TUNING=1 to enable.
Use VLLM_ROCM_USE_AITER_HIP_ONLINE_TUNING for vLLM.
Integrates into LLM frameworks like RTP-LLM.
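
As a sketch of the first two points, the helper below builds an environment with the tuning variable set and passes it to a child process. The variable names come from the list above; the helper name, the value "1" for the vLLM variable, and the idea that the variable should be set before the serving process starts are assumptions, and the child command here only echoes the variable rather than launching a real server.

```python
import os
import subprocess
import sys
from typing import Dict, Mapping, Optional

HIP_VAR = "HIP_ONLINE_TUNING"
VLLM_VAR = "VLLM_ROCM_USE_AITER_HIP_ONLINE_TUNING"


def tuning_env(var_name: str = HIP_VAR,
               base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with online tuning switched on."""
    env = dict(os.environ if base is None else base)
    env[var_name] = "1"  # "1" is given for HIP_ONLINE_TUNING; assumed for the vLLM variable
    return env


if __name__ == "__main__":
    # Stand-in for launching an inference server: the child just reports the variable.
    child = [sys.executable, "-c",
             f"import os; print(os.environ.get('{HIP_VAR}'))"]
    result = subprocess.run(child, env=tuning_env(), capture_output=True, text=True)
    print(result.stdout.strip())
```

For a vLLM deployment, the same helper would be called as tuning_env(VLLM_VAR) and the result passed to whatever command starts the server.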

Topics

hipBLASLt
Online GEMM Tuning
Large Language Models
GPU Performance Optimization
Matrix Multiplication

Code references

ROCm/aiter

Best for: NLP Engineer, Machine Learning Engineer, Deep Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.