Profiling in PyTorch (Part 1): A Beginner's Guide to torch.profiler

2026-05-25 · Source: Hugging Face - Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, extended

Summary

The "Profiling in PyTorch (Part 1)" guide, published May 29, 2026, introduces `torch.profiler` for optimizing PyTorch workloads. It demonstrates how to set up the profiler and interpret its tables and traces for a basic matrix multiplication followed by an addition. The article explains how to tell overhead-bound from compute-bound scenarios: on an NVIDIA A100-SXM4-80GB GPU, a 64x64 matrix operation is CPU-bound (2.314 ms CPU, 23.104 us CUDA), while a 4096x4096 operation becomes GPU-bound (4.908 ms CPU, 4.495 ms CUDA). It also covers the CPU-GPU dispatch chain, kernel runtime variance, and the effects of `torch.compile` on operator fusion and CPU overhead, noting that `torch.compile` can increase CPU cost for single operations.
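The two measurements quoted above can be compared with a little arithmetic. The sketch below divides CUDA time by CPU time for each case; the `classify` helper and the 0.5 cutoff are assumptions made for illustration and do not come from the article.

```python
# Timings quoted in the summary (NVIDIA A100-SXM4-80GB), in milliseconds.
measurements = {
    "64x64": {"cpu_ms": 2.314, "cuda_ms": 23.104e-3},  # 23.104 us
    "4096x4096": {"cpu_ms": 4.908, "cuda_ms": 4.495},
}

# Hypothetical cutoff chosen for this sketch; it is not from the article.
GPU_BUSY_THRESHOLD = 0.5


def classify(cpu_ms, cuda_ms, threshold=GPU_BUSY_THRESHOLD):
    """Return (ratio, label), where ratio is CUDA time divided by CPU time."""
    ratio = cuda_ms / cpu_ms
    label = "compute-bound" if ratio >= threshold else "overhead-bound"
    return ratio, label


results = {name: classify(**m) for name, m in measurements.items()}
for name, (ratio, label) in results.items():
    print(f"{name}: CUDA/CPU = {ratio:.1%} -> {label}")
```

With these numbers the GPU accounts for roughly 1% of the CPU time in the 64x64 case and roughly 92% in the 4096x4096 case, which matches the overhead-bound versus compute-bound reading above.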

Key takeaway

For AI engineers optimizing PyTorch model performance, understanding `torch.profiler` is essential. Use it to diagnose whether an operation is bound by CPU overhead or by GPU compute, especially for small matrix operations. Pay close attention to gaps in the trace and to variance in kernel runtimes. Be aware that `torch.compile` can increase CPU overhead for a single operation, and that its fusion happens at the dispatcher level, not always at the kernel level.

Key insights

PyTorch's `torch.profiler` reveals CPU/GPU bottlenecks and execution details, crucial for optimizing deep learning workloads.

Principles

Profile to identify bottlenecks; optimize what is measurable.
GPU kernel runtimes are variable, not constant.
`torch.compile` can fuse operations at the dispatcher level.
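The second principle, that kernel runtimes vary from run to run, can be checked with a minimal sketch like the one below. It times the same matrix multiplication repeatedly and reports the spread. The matrix size, the repetition counts, and the `time_once_ms` helper are assumptions for illustration; on a machine without CUDA the sketch falls back to wall-clock CPU timing, which measures something different from GPU kernel time.

```python
import statistics
import time

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
n = 512  # hypothetical size, not taken from the article
a = torch.randn(n, n, device=device)
b = torch.randn(n, n, device=device)


def time_once_ms():
    """Time one matmul in milliseconds (CUDA events on GPU, perf_counter on CPU)."""
    if device == "cuda":
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        a @ b
        end.record()
        end.synchronize()  # wait until the kernel has finished
        return start.elapsed_time(end)
    t0 = time.perf_counter()
    a @ b
    return (time.perf_counter() - t0) * 1e3


# Untimed warmup runs so one-time setup costs do not skew the samples.
for _ in range(5):
    time_once_ms()

times_ms = [time_once_ms() for _ in range(20)]
print(f"min={min(times_ms):.4f} ms  max={max(times_ms):.4f} ms  "
      f"mean={statistics.mean(times_ms):.4f} ms  "
      f"stdev={statistics.stdev(times_ms):.4f} ms")
```

A non-zero standard deviation across identical runs is the point here: a single measurement is a sample, not a constant.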

Method

To profile PyTorch code, label the target function with `torch.profiler.record_function`, run it inside the `torch.profiler.profile` context manager with the `ProfilerActivity.CPU` and `ProfilerActivity.CUDA` activities enabled, and export the results as a summary table and a Chrome trace.
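A minimal sketch of this method might look like the following. The function name `matmul_add`, the label string, the 64x64 size, and the trace file name are assumptions for illustration rather than the article's exact code. The CUDA activity is only enabled when a GPU is available, so the sketch also runs on a CPU-only machine.

```python
import os
import tempfile

import torch
from torch.profiler import ProfilerActivity, profile, record_function


def matmul_add(a, b, c):
    # record_function labels this region in the profiler table and trace.
    with record_function("matmul_add"):
        return a @ b + c


device = "cuda" if torch.cuda.is_available() else "cpu"
activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

n = 64  # small case; a larger n shifts the work toward the GPU
a, b, c = (torch.randn(n, n, device=device) for _ in range(3))

# Untimed warmup so cold-start costs stay out of the profile.
for _ in range(3):
    matmul_add(a, b, c)
if device == "cuda":
    torch.cuda.synchronize()

with profile(activities=activities) as prof:
    out = matmul_add(a, b, c)
    if device == "cuda":
        torch.cuda.synchronize()  # make sure the kernels finish inside the profile

# Summary table, sorted by total CPU time.
table = prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)
print(table)

# Chrome trace, viewable in chrome://tracing or Perfetto.
trace_path = os.path.join(tempfile.gettempdir(), "matmul_add_trace.json")
prof.export_chrome_trace(trace_path)
```

The table gives per-operator totals, while the trace shows when each CPU call and GPU kernel ran, which is where gaps between kernels become visible.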

In practice

Use `warmup` iterations to avoid profiling cold-start overheads.
Increase matrix sizes or batch operations to shift from overhead-bound to compute-bound.
Check `cudaOccupancyMaxActiveBlocksPerMultiprocessor` for heavyweight kernels.
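One way to apply the warmup advice is the profiler's own `torch.profiler.schedule`, sketched below. It is an assumption that this matches the article's setup; a few untimed calls before profiling achieve a similar effect. The step counts, tensor size, and the `on_trace_ready` callback body are illustrative choices.

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule

device = "cuda" if torch.cuda.is_available() else "cpu"
activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

n = 256  # hypothetical size
a, b, c = (torch.randn(n, n, device=device) for _ in range(3))

# Skip 1 step, warm up for 2 steps (results discarded), record 3 steps, once.
sched = schedule(wait=1, warmup=2, active=3, repeat=1)

summaries = []


def on_trace_ready(prof):
    # Called when the active window ends; keep the table for later inspection.
    summaries.append(prof.key_averages().table(row_limit=5))


with profile(activities=activities, schedule=sched,
             on_trace_ready=on_trace_ready) as prof:
    for _ in range(6):  # 1 wait + 2 warmup + 3 active
        y = a @ b + c
        if device == "cuda":
            torch.cuda.synchronize()
        prof.step()  # tell the profiler that one iteration has finished

print(summaries[-1])
```

Only the three active steps contribute to the reported numbers, so one-time costs from the first iterations do not inflate the averages.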

Topics

PyTorch Profiling
torch.profiler
CUDA Kernels
GPU Optimization
torch.compile
Matrix Multiplication

Code references

pytorch/pytorch

Best for: Machine Learning Engineer, AI Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Hugging Face - Blog.