PyTorch Offline Tuning with TunableOp
Summary
PyTorch TunableOp now supports offline tuning, which decouples the benchmarking of Basic Linear Algebra Subprograms (BLAS) kernels from the main model execution. Online tuning, by contrast, benchmarks kernels while the model is running, which adds overhead to that run. With offline tuning, users record the GEMM operations seen during an initial workload run and tune them separately, potentially on different hardware or with an updated software stack such as a newer AMD ROCm release. Other updates to TunableOp include support for TF32 and FP8 datatypes on AMD Instinct MI300 series GPUs, improved tuning quality via a rotating buffer (up to 512 MB) and instruction cache flushing, real-time saving of tuning results, and an enhanced numerical check with absolute and relative tolerances. TunableOp also now supports tuning for batch GEMM and GEMM with bias operations. In a practical example using the OPT-125m language model, offline tuning delivered a 15% performance improvement on an AMD Radeon PRO W7900 (gfx1100) GPU.
Key takeaway
For MLOps engineers deploying PyTorch models on AMD GPUs, TunableOp's offline tuning can improve inference throughput by optimizing BLAS kernels independently of model execution; the original article reports a 15% gain for OPT-125m on a Radeon PRO W7900. Consider integrating the three-phase workflow (collection, tuning, and deployment) into your CI/CD pipeline so that tuned results are regenerated and shipped consistently. Because tuning only needs the recorded list of GEMM operations, you can re-tune for a new ROCm version without re-running the full model workload, which saves compute resources and time.
Key insights
Offline tuning separates BLAS kernel optimization from model execution, improving flexibility and efficiency.
Principles
- Decouple tuning from model execution.
- Benchmark under conditions that resemble real runtime (rotating buffer, instruction cache flushing).
- Validate numerical stability of optimized kernels.
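A numerical check with absolute and relative tolerances can be illustrated with a short sketch. The rule below is the common allclose-style convention, `|c - r| <= atol + rtol * |r|`; it is an assumption for illustration and not necessarily the exact comparison TunableOp performs. The function name `within_tolerance` and the default tolerances are hypothetical.

```python
import math


def within_tolerance(candidate, reference, atol=1e-5, rtol=1e-5):
    """Return True if every candidate value is close to its reference value.

    Uses the allclose-style rule |c - r| <= atol + rtol * |r|.
    The default tolerances are illustrative, not TunableOp's values.
    """
    if len(candidate) != len(reference):
        return False
    return all(
        math.isfinite(c) and abs(c - r) <= atol + rtol * abs(r)
        for c, r in zip(candidate, reference)
    )


# Outputs of a baseline kernel and of two hypothetical tuned kernels.
reference = [1.0, 1000.0, 1e-8]
good = [1.000001, 1000.005, 0.0]
bad = [1.0, 1001.0, 0.0]

print(within_tolerance(good, reference))
print(within_tolerance(bad, reference))
```

The absolute term keeps the check meaningful for values near zero, while the relative term scales the allowed error with the magnitude of large values.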
Method
The offline tuning workflow involves recording GEMM operations during a workload run, performing separate benchmarking of these operations, and then deploying the optimized results for accelerated inference.
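The three phases can be sketched as three sets of environment flags. This is a minimal sketch: the variable names are TunableOp environment variables, but the exact combination per phase is an assumption drawn from the workflow description, and `PHASE_FLAGS` and `env_for_phase` are hypothetical helper names.

```python
import os

# Flag combinations per phase (assumed from the three-phase description).
PHASE_FLAGS = {
    # Run the workload and record GEMMs without benchmarking them.
    "collect": {
        "PYTORCH_TUNABLEOP_ENABLED": "1",
        "PYTORCH_TUNABLEOP_TUNING": "0",
        "PYTORCH_TUNABLEOP_RECORD_UNTUNED": "1",
    },
    # Benchmark the recorded GEMMs separately from the workload.
    "tune": {
        "PYTORCH_TUNABLEOP_ENABLED": "1",
        "PYTORCH_TUNABLEOP_TUNING": "1",
        "PYTORCH_TUNABLEOP_RECORD_UNTUNED": "0",
    },
    # Use the saved results; no recording and no further tuning.
    "deploy": {
        "PYTORCH_TUNABLEOP_ENABLED": "1",
        "PYTORCH_TUNABLEOP_TUNING": "0",
        "PYTORCH_TUNABLEOP_RECORD_UNTUNED": "0",
    },
}


def env_for_phase(phase, base_env=None):
    """Return a copy of the environment with the flags for one phase applied."""
    if phase not in PHASE_FLAGS:
        raise ValueError(f"unknown phase: {phase!r}")
    env = dict(os.environ if base_env is None else base_env)
    env.update(PHASE_FLAGS[phase])
    return env


collect_env = env_for_phase("collect", base_env={})
print(collect_env)
```

Returning a fresh dictionary rather than mutating `os.environ` makes it easier to launch each phase as its own process in a pipeline.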
In practice
- Use `PYTORCH_TUNABLEOP_RECORD_UNTUNED=1` for collection.
- Set `PYTORCH_TUNABLEOP_TUNING=1` for the tuning phase.
- Configure `PYTORCH_TUNABLEOP_ROTATING_BUFFER_SIZE` (up to 512 MB) to improve tuning quality.
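As a rough illustration of applying these settings, the sketch below launches a command with the tuning flags and a rotating buffer size. It assumes the size is given in MB and caps it at the 512 MB figure quoted in the summary; `run_with_rotating_buffer` is a hypothetical helper, and the child process is only a stand-in that echoes the setting, not a real tuning script.

```python
import os
import subprocess
import sys

MAX_ROTATING_BUFFER_MB = 512  # upper bound quoted in the summary


def run_with_rotating_buffer(command, size_mb):
    """Run `command` with tuning enabled and a rotating buffer of `size_mb`."""
    if not 0 < size_mb <= MAX_ROTATING_BUFFER_MB:
        raise ValueError(f"size_mb must be in 1..{MAX_ROTATING_BUFFER_MB}")
    env = dict(os.environ)
    env["PYTORCH_TUNABLEOP_ENABLED"] = "1"
    env["PYTORCH_TUNABLEOP_TUNING"] = "1"
    # Assumption: the value is interpreted as a size in MB.
    env["PYTORCH_TUNABLEOP_ROTATING_BUFFER_SIZE"] = str(size_mb)
    return subprocess.run(
        command, env=env, capture_output=True, text=True, check=True
    )


# Stand-in for a real tuning script: a child process that echoes the setting.
probe = [
    sys.executable,
    "-c",
    "import os; print(os.environ['PYTORCH_TUNABLEOP_ROTATING_BUFFER_SIZE'])",
]
result = run_with_rotating_buffer(probe, 256)
print(result.stdout.strip())
```

In a real pipeline, `probe` would be replaced by the command that runs the tuning step.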
Topics
- PyTorch TunableOp
- Offline Tuning
- GEMM Optimization
- AMD ROCm
- Model Acceleration
Best for: Machine Learning Engineer, Deep Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.