hipBLASLt Online GEMM Tuning
Summary
AMD has integrated hipBLASLt Online GEMM Tuning into LLM frameworks, with RTP-LLM as the worked example, to improve General Matrix Multiply (GEMM) performance. The feature, developed by the AMD Quark Team, optimizes at runtime by selecting the best-performing GEMM kernel dynamically, with no prior offline tuning required. When a new GEMM configuration appears, the online tuning mechanism benchmarks multiple candidate algorithms, caches the fastest one, and reuses it for every later operation with the same configuration. In experiments on the Qwen3-30B model on an AMD MI308 GPU, online tuning averaged 105.98% of baseline performance and 100.79% of offline-tuned performance (about 6% and 0.8% faster, respectively), at a one-time benchmarking cost of 31 seconds. The approach prioritizes ease of use, adaptability to dynamic inputs, and continued applicability as workloads and hardware change.
Key takeaway
For NLP engineers optimizing LLM inference on AMD GPUs, hipBLASLt Online GEMM Tuning offers a user-transparent way to improve GEMM performance, about 6% on average over the baseline in the reported Qwen3-30B experiment. You enable it by setting a single environment variable (HIP_ONLINE_TUNING=1), with no offline benchmarking pass. Because tuning happens at runtime, it adapts to dynamic input shapes and new hardware, and in the reported results it matched or slightly exceeded offline tuning at a one-time cost of 31 seconds.
Key insights
hipBLASLt Online GEMM Tuning dynamically optimizes matrix multiplication at runtime for improved LLM performance.
Principles
- Dynamic kernel selection improves GEMM performance.
- Caching optimal solutions amortizes tuning overhead.
- Online tuning adapts to evolving workloads and hardware.
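The amortization principle can be illustrated with the figures quoted in the summary (31 seconds of one-time overhead, 105.98% of baseline performance). The following is a minimal sketch, assuming the 105.98% figure can be read as a throughput ratio that applies to the whole workload; that reading, and the helper name `break_even_seconds`, are illustrative assumptions rather than anything stated in the original article.

```python
def break_even_seconds(tuning_overhead_s: float, relative_performance: float) -> float:
    """Baseline-speed runtime after which tuning has paid for itself.

    Assumption: `relative_performance` is a throughput ratio (tuned / baseline),
    so work that takes T seconds at baseline speed takes T / ratio when tuned.
    """
    if relative_performance <= 1.0:
        raise ValueError("tuning never pays off unless it is faster than baseline")
    # Fraction of each baseline second saved once the tuned kernels are in use.
    saved_fraction = 1.0 - 1.0 / relative_performance
    return tuning_overhead_s / saved_fraction


# Figures taken from the summary: 31 s overhead, 105.98% of baseline performance.
baseline_seconds = break_even_seconds(31.0, 1.0598)
print(f"break-even after about {baseline_seconds / 60:.1f} minutes of baseline-speed work")
```

Under these assumptions the overhead is recovered after roughly nine minutes of baseline-speed work, which is short compared with the lifetime of a typical long-running inference service.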
Method
The hipBLASLt GEMM wrapper calls the heuristic and tuning APIs at runtime. When it encounters a new GEMM shape, it benchmarks the candidate algorithms, selects the fastest, and caches the result so that later calls with the same shape skip the benchmark.
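The benchmark-select-cache loop described above can be sketched in plain Python. This is a conceptual illustration only: it does not call hipBLASLt, and the names `OnlineTuner`, `time_kernel`, `shape_key`, and the two toy matmul "kernels" are hypothetical stand-ins for the wrapper and its candidate algorithms.

```python
import time


def matmul_ijk(a, b):
    """Toy candidate 1: row-by-column dot products."""
    inner, n_cols = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(inner)) for j in range(n_cols)]
            for row in a]


def matmul_ikj(a, b):
    """Toy candidate 2: same result, different loop order."""
    n_cols = len(b[0])
    out = [[0] * n_cols for _ in a]
    for i, row in enumerate(a):
        for p, a_ip in enumerate(row):
            b_row = b[p]
            for j in range(n_cols):
                out[i][j] += a_ip * b_row[j]
    return out


def shape_key(a, b):
    """Cache key for a GEMM configuration: (M, N, K)."""
    return (len(a), len(b[0]), len(b))


def time_kernel(kernel, args, repeats=3):
    """Best-of-N wall-clock time for one candidate."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        kernel(*args)
        best = min(best, time.perf_counter() - start)
    return best


class OnlineTuner:
    def __init__(self, candidates, measure=time_kernel):
        self.candidates = list(candidates)
        self.measure = measure
        self.cache = {}       # shape key -> fastest candidate seen for that shape
        self.tuned_count = 0  # how many shapes triggered a benchmark

    def run(self, key, *args):
        kernel = self.cache.get(key)
        if kernel is None:
            # New shape: benchmark every candidate once and keep the fastest.
            kernel = min(self.candidates, key=lambda c: self.measure(c, args))
            self.cache[key] = kernel
            self.tuned_count += 1
        # Known shape: reuse the cached choice with no benchmarking.
        return kernel(*args)


tuner = OnlineTuner([matmul_ijk, matmul_ikj])
a, b = [[1, 2], [3, 4]], [[5, 6], [7, 8]]
tuner.run(shape_key(a, b), a, b)  # first call for this shape pays the tuning cost
tuner.run(shape_key(a, b), a, b)  # second call hits the cache
```

Keying the cache on the shape alone keeps the sketch small; a real wrapper would likely also need data type, layout, and similar parameters in the key, which is an assumption here rather than a detail from the source.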
In practice
- Set HIP_ONLINE_TUNING=1 to enable.
- Use VLLM_ROCM_USE_AITER_HIP_ONLINE_TUNING for vLLM.
- Integrates into LLM frameworks like RTP-LLM.
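As a minimal sketch of the first bullet, the variable can be set in the environment of the process that runs the workload. The helper names `env_with_online_tuning` and `run_with_online_tuning` are hypothetical, and the child command below is a placeholder that only echoes the variable; substitute your own launch command. Whether the variable must be set before the process starts, and what value the vLLM-specific variable expects, are not stated in this summary, so the sketch sets only `HIP_ONLINE_TUNING=1` and sets it before launch.

```python
import os
import subprocess
import sys


def env_with_online_tuning(base_env=None):
    """Return a copy of the environment with HIP_ONLINE_TUNING=1 added."""
    env = dict(os.environ if base_env is None else base_env)
    env["HIP_ONLINE_TUNING"] = "1"
    return env


def run_with_online_tuning(command):
    """Launch `command` (a list of arguments) with online tuning enabled."""
    return subprocess.run(
        command,
        env=env_with_online_tuning(),
        capture_output=True,
        text=True,
        check=True,
    )


# Placeholder child process: prints the variable it inherited.
result = run_with_online_tuning(
    [sys.executable, "-c", "import os; print(os.environ['HIP_ONLINE_TUNING'])"]
)
print(result.stdout.strip())
```

Passing the environment to the child explicitly, rather than mutating `os.environ` in the parent, keeps the setting scoped to the one workload that should be tuned.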
Topics
- hipBLASLt
- Online GEMM Tuning
- Large Language Models
- GPU Performance Optimization
- Matrix Multiplication
Best for: NLP Engineer, Machine Learning Engineer, Deep Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.