PyTorch Offline Tuning with TunableOp

2026-02-24 · Source: AMD ROCm Blogs · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, long

Summary

PyTorch TunableOp now supports offline tuning, a method that decouples the benchmarking of Basic Linear Algebra Subprograms (BLAS) kernels from the main model execution. This contrasts with online tuning, which benchmarks kernels during model runtime, introducing overhead. Offline tuning allows users to collect GEMM operations during an initial workload run and then tune them separately, potentially on different hardware or with updated software stacks like AMD ROCm. Key updates to TunableOp include support for TF32 and FP8 datatypes on AMD Instinct MI300 series GPUs, improved tuning quality via a rotating buffer (up to 512 MB) and instruction cache flushing, real-time saving of tuning results, and an enhanced numerical check with absolute and relative tolerance. Additionally, TunableOp now supports tuning for batch GEMM and GEMM with bias operations. A practical example using an OPT-125m language model demonstrated a 15% performance improvement on an AMD Radeon PRO W7900 (gfx1100) GPU.

Key takeaway

For MLOps engineers deploying PyTorch models on AMD GPUs, TunableOp's offline tuning can improve inference throughput by optimizing BLAS kernels independently of the model run; the article's OPT-125m example reports a 15% performance improvement on a Radeon PRO W7900. Integrate the three-phase workflow (collection, tuning, and deployment) into your CI/CD pipeline so the gains are reproducible. Because the recorded GEMM operations can be re-tuned on their own, a new ROCm version does not require re-running full model training or inference, which saves compute resources and time.

Key insights

Offline tuning separates BLAS kernel optimization from model execution, improving flexibility and efficiency.

Principles

Decouple tuning from model execution.
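Benchmark under conditions that resemble runtime, using a rotating buffer and instruction cache flushing so kernels are not timed against warm caches.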
Validate numerical stability of optimized kernels.
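
To illustrate what a check with both absolute and relative tolerance means, here is a minimal sketch in plain Python. It uses the same inequality as `torch.allclose`, |candidate - reference| <= atol + rtol * |reference|. The helper name `within_tolerance` and the default tolerances are assumptions for illustration; TunableOp's internal check may differ in detail.

```python
import math


def within_tolerance(candidate, reference, atol=1e-5, rtol=1e-5):
    """Return True if every element satisfies |c - r| <= atol + rtol * |r|.

    Hypothetical helper: compares a tuned kernel's output (candidate)
    against a trusted baseline (reference), element by element.
    """
    if len(candidate) != len(reference):
        return False
    for c, r in zip(candidate, reference):
        # NaN never compares equal, so treat it as a failed check.
        if math.isnan(c) or math.isnan(r):
            return False
        if abs(c - r) > atol + rtol * abs(r):
            return False
    return True
```

The absolute term keeps the check meaningful for values near zero, where a purely relative bound collapses; the relative term lets the allowed error grow with the magnitude of large values.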

Method

The offline tuning workflow involves recording GEMM operations during a workload run, performing separate benchmarking of these operations, and then deploying the optimized results for accelerated inference.

In practice

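Use `PYTORCH_TUNABLEOP_RECORD_UNTUNED=1` for collection, with `PYTORCH_TUNABLEOP_ENABLED=1` and `PYTORCH_TUNABLEOP_TUNING=0` so GEMMs are recorded but not tuned.
Set `PYTORCH_TUNABLEOP_TUNING=1` (and `PYTORCH_TUNABLEOP_RECORD_UNTUNED=0`) for the tuning phase.
Configure `PYTORCH_TUNABLEOP_ROTATING_BUFFER_SIZE` (the article cites up to 512 MB) so kernels are benchmarked with colder caches.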
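
The three phases can be sketched as a small Python driver. This is a sketch under stated assumptions, not the article's own script: the `workload.py` script name and the helpers `phase_env`, `run_workload`, and `tune_offline` are hypothetical, and the default file name `tunableop_untuned0.csv` follows the PyTorch TunableOp documentation for device 0. The only TunableOp API call used is `torch.cuda.tunable.tune_gemm_in_file`.

```python
import os
import subprocess
import sys

# Environment settings for each phase of the offline workflow.
COLLECT_ENV = {
    "PYTORCH_TUNABLEOP_ENABLED": "1",
    "PYTORCH_TUNABLEOP_TUNING": "0",          # do not tune during the model run
    "PYTORCH_TUNABLEOP_RECORD_UNTUNED": "1",  # record GEMMs that have no tuned entry
}
TUNE_ENV = {
    "PYTORCH_TUNABLEOP_ENABLED": "1",
    "PYTORCH_TUNABLEOP_TUNING": "1",
    "PYTORCH_TUNABLEOP_RECORD_UNTUNED": "0",
}
DEPLOY_ENV = {
    "PYTORCH_TUNABLEOP_ENABLED": "1",         # apply previously saved results
    "PYTORCH_TUNABLEOP_TUNING": "0",
    "PYTORCH_TUNABLEOP_RECORD_UNTUNED": "0",
}


def phase_env(phase_vars, base=None):
    """Return a copy of the base environment with the phase variables applied."""
    env = dict(os.environ if base is None else base)
    env.update(phase_vars)
    return env


def run_workload(script, phase_vars):
    """Run a workload script (hypothetical, e.g. workload.py) in a child process."""
    return subprocess.run([sys.executable, script], env=phase_env(phase_vars), check=True)


def tune_offline(untuned_csv="tunableop_untuned0.csv", rotating_buffer_mb=None):
    """Tune the recorded GEMMs without loading the model. Requires PyTorch and a GPU."""
    os.environ.update(TUNE_ENV)
    if rotating_buffer_mb is not None:
        os.environ["PYTORCH_TUNABLEOP_ROTATING_BUFFER_SIZE"] = str(rotating_buffer_mb)
    # Imported lazily so the environment is set before PyTorch reads it.
    import torch.cuda.tunable as tunable
    tunable.tune_gemm_in_file(untuned_csv)


# Typical sequence (not executed here):
#   run_workload("workload.py", COLLECT_ENV)   # 1. collection
#   tune_offline(rotating_buffer_mb=512)       # 2. tuning
#   run_workload("workload.py", DEPLOY_ENV)    # 3. deployment
```

Running the workload in a child process keeps each phase's settings isolated, which is convenient in a CI/CD job where the tuning step may run on a different schedule than collection.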

Topics

PyTorch TunableOp
Offline Tuning
GEMM Optimization
AMD ROCm
Model Acceleration

Code references

ROCm/pytorch

Best for: Machine Learning Engineer, Deep Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.