Profiling in PyTorch (Part 1): A Beginner's Guide to torch.profiler
Summary
The "Profiling in PyTorch (Part 1)" guide, published May 29, 2026, introduces `torch.profiler` for optimizing PyTorch workloads. It demonstrates how to set up the profiler and interpret its tables and traces for a basic matrix multiplication followed by an addition. The article explains how to tell overhead-bound from compute-bound scenarios: on an NVIDIA A100-SXM4-80GB GPU, a 64x64 matrix operation is CPU-bound (2.314 ms CPU, 23.104 us CUDA), while a 4096x4096 operation becomes GPU-bound (4.908 ms CPU, 4.495 ms CUDA). It also covers the CPU-GPU dispatch chain, kernel runtime variance, and the effects of `torch.compile` on operator fusion and CPU overhead, noting that `torch.compile` can increase CPU cost for single operations.
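To make the gap between the two cases concrete, the short sketch below divides the CPU time by the CUDA time for each matrix size, using only the figures quoted above. The ratio is an illustrative heuristic chosen for this summary, not a metric from the original article, and the names `timings_us` and `cpu_to_cuda_ratio` are hypothetical.

```python
# Timings quoted in the summary (NVIDIA A100-SXM4-80GB), in microseconds.
timings_us = {
    "64x64": {"cpu": 2314.0, "cuda": 23.104},
    "4096x4096": {"cpu": 4908.0, "cuda": 4495.0},
}


def cpu_to_cuda_ratio(cpu_us: float, cuda_us: float) -> float:
    """Return how many times larger the CPU time is than the CUDA time.

    Illustrative heuristic only: a large ratio suggests the GPU finishes
    long before the CPU-side work (dispatch, launch overhead) is done.
    """
    return cpu_us / cuda_us


ratios = {
    size: cpu_to_cuda_ratio(t["cpu"], t["cuda"])
    for size, t in timings_us.items()
}

for size, ratio in ratios.items():
    print(f"{size}: CPU time is {ratio:.1f}x the CUDA time")
```

For the 64x64 case the CPU time is roughly 100 times the CUDA time, so the GPU spends most of the interval idle. For the 4096x4096 case the two are nearly equal (about 1.1x), which matches the summary's description of the kernel work dominating.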
Key takeaway
For AI Engineers optimizing PyTorch model performance, understanding `torch.profiler` is essential. Use it to diagnose whether your operations are CPU-overhead or GPU-compute bound, especially for small matrix operations. Pay close attention to trace gaps and kernel runtime variances, and be aware that `torch.compile` might increase CPU overhead for single operations while fusing at the dispatcher level, not always at the kernel level.
Key insights
PyTorch's `torch.profiler` reveals CPU/GPU bottlenecks and execution details, crucial for optimizing deep learning workloads.
Principles
- Profile to identify bottlenecks; optimize what is measurable.
- GPU kernel runtimes are variable, not constant.
- `torch.compile` can fuse operations at the dispatcher level.
Method
To profile PyTorch code, label the target region with the `torch.profiler.record_function` context manager, run it inside a `torch.profiler.profile` context manager with the `CPU` and `CUDA` activities enabled, and then export the results as a summary table and a Chrome trace.
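The method above can be sketched as follows. This is a minimal example rather than the article's exact code: the function name `matmul_add`, the region label, the 64x64 size, and the trace file name are assumptions made for illustration. The CUDA activity is only enabled when a GPU is available, so the sketch also runs on CPU-only machines.

```python
import os
import tempfile

import torch
from torch.profiler import ProfilerActivity, profile, record_function


def matmul_add(a, b, c):
    # Label this region so it appears by name in the table and the trace.
    with record_function("matmul_add"):
        return a @ b + c


device = "cuda" if torch.cuda.is_available() else "cpu"
activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

n = 64  # assumed size; the summary contrasts 64x64 with 4096x4096
a, b, c = (torch.randn(n, n, device=device) for _ in range(3))

# A few untimed runs so one-off initialization costs are not profiled.
for _ in range(3):
    matmul_add(a, b, c)

with profile(activities=activities) as prof:
    out = matmul_add(a, b, c)
    if device == "cuda":
        # Wait for the queued kernels so they are captured in the profile.
        torch.cuda.synchronize()

# Summary table, sorted by total CPU time.
table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)
print(table)

# Chrome trace; open it in chrome://tracing or Perfetto.
trace_path = os.path.join(tempfile.gettempdir(), "matmul_add_trace.json")
prof.export_chrome_trace(trace_path)
```

The table gives aggregate times per operator, while the trace shows when each CPU call and GPU kernel ran, which is where gaps between kernels become visible.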
In practice
- Use `warmup` iterations to avoid profiling cold-start overheads.
- Increase matrix sizes or batch operations to shift from overhead-bound to compute-bound.
- Check `cudaOccupancyMaxActiveBlocksPerMultiprocessor` for heavyweight kernels.
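One way to apply the warmup advice is `torch.profiler.schedule`, which skips and warms up a set number of steps before recording. The sketch below is an assumption about how this could be wired up, not the article's code: the step counts, the 256x256 size, and the names `on_ready` and `collected` are hypothetical, and only CPU activity is recorded so that it runs without a GPU.

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule

collected = []  # summaries gathered each time an active window completes


def on_ready(prof):
    # Called by the profiler when a recording window finishes.
    collected.append(prof.key_averages())


# Skip 1 step, warm up for 2 steps (results discarded), record 3 steps, once.
sched = schedule(wait=1, warmup=2, active=3, repeat=1)

x = torch.randn(256, 256)

with profile(
    activities=[ProfilerActivity.CPU],
    schedule=sched,
    on_trace_ready=on_ready,
) as prof:
    for _ in range(6):
        y = x @ x + x
        prof.step()  # tell the profiler that one iteration has finished

for summary in collected:
    print(summary.table(sort_by="cpu_time_total", row_limit=5))
```

Only the three active steps contribute to the printed table, so first-call costs such as lazy initialization stay out of the measurements.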
Topics
- PyTorch Profiling
- torch.profiler
- CUDA Kernels
- GPU Optimization
- torch.compile
- Matrix Multiplication
Best for: Machine Learning Engineer, AI Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Hugging Face - Blog.