GPU Scaling from 1 to 261120 threads Part 1
Summary
This analysis of GPU scaling on an NVIDIA RTX 5090 (Blackwell architecture, CC 12.0) measures matrix multiplication performance from 1 to 261,120 threads. Profiling with ncu showed that scaling from 1 to 32 threads (one warp) yielded only a 10x speedup, not the ideal 32x. The shortfall stems from memory latency: an L2 cache hit takes ~150 cycles, which a single active warp cannot fully hide. Launching fewer than 32 threads also wastes resources, because a full 128-byte L2 cache line is fetched regardless of how many threads use it, so utilization is low (e.g., 3% for 1 thread). Stall analysis showed L1TEX (memory latency) stalls dominating at 80% for 32 threads, which points to a need for warp overlap or tiled memory access.
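The utilization and scaling figures above follow from simple arithmetic. Here is a minimal sketch, assuming each thread loads one 4-byte float (an assumption: the summary does not state the element size, though 4 bytes matches the 3% figure) and using the 128-byte fetch granularity quoted above. The function names are illustrative only.

```python
FETCH_BYTES = 128    # fetch granularity quoted in the summary
ELEMENT_BYTES = 4    # assumption: one float32 load per thread
WARP_SIZE = 32       # threads per warp


def fetch_utilization(active_threads: int) -> float:
    """Fraction of one 128-byte fetch that the active threads actually use."""
    used_bytes = min(active_threads, WARP_SIZE) * ELEMENT_BYTES
    return used_bytes / FETCH_BYTES


def scaling_efficiency(speedup: float, threads: int) -> float:
    """Measured speedup as a fraction of ideal linear scaling."""
    return speedup / threads


for n in (1, 8, 16, 32):
    print(f"{n:2d} threads: {fetch_utilization(n):.1%} of the fetch is used")

# The reported 10x speedup at 32 threads, relative to the ideal 32x.
print(f"1 -> 32 threads: {scaling_efficiency(10, 32):.0%} of ideal scaling")
```

Under these assumptions a single thread uses 4 of 128 bytes, which rounds to the 3% quoted in the summary, and a 10x speedup on 32 threads is roughly a third of linear scaling.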
Key takeaway
For AI/ML engineers optimizing GPU kernels, understanding thread scaling is crucial. Always launch kernels with thread counts that are multiples of 32, so that every warp is full and no lanes sit idle. To reduce memory latency stalls, use coalesced memory access patterns, so that a warp consumes each fetched 128-byte L2 cache line in full. Together, these practices cut wasted lanes and wasted memory bandwidth in matrix multiplication workloads.
Key insights
GPU performance scales non-linearly with thread count because of memory latency and partially filled warps.
Principles
- Launch GPU kernels with thread counts as multiples of 32.
- Launching fewer than 32 threads per block wastes resources.
- L2 traffic is algorithm-dependent, not thread-count dependent.
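The first two principles can be sketched as a small host-side helper that rounds a requested block size up to a whole number of warps and derives a matching grid size. This is an illustrative sketch rather than the original article's code. The names `round_up_to_warp`, `launch_config`, and `idle_lanes` are hypothetical, and the default block size of 256 is an assumption.

```python
WARP_SIZE = 32  # threads per warp


def round_up_to_warp(threads: int) -> int:
    """Round a thread count up to the next multiple of the warp size."""
    if threads <= 0:
        raise ValueError("thread count must be positive")
    return ((threads + WARP_SIZE - 1) // WARP_SIZE) * WARP_SIZE


def idle_lanes(threads_per_block: int) -> int:
    """Lanes left unused in the last warp of a block of the given size."""
    return round_up_to_warp(threads_per_block) - threads_per_block


def launch_config(n_elements: int, requested_block: int = 256) -> tuple[int, int]:
    """Return (grid, block) with block a multiple of 32 and grid covering n_elements."""
    block = round_up_to_warp(requested_block)
    grid = (n_elements + block - 1) // block  # ceiling division
    return grid, block


print(launch_config(1000, requested_block=100))
print(idle_lanes(20))
```

Because the grid is rounded up, a kernel launched this way still needs a bounds check (for example, comparing the global thread index against `n_elements`) so that the extra threads in the last block do no work.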
In practice
- Pad input or batch small problems for full warp utilization.
- Ensure threads in a warp access consecutive addresses so that each fetched cache line is fully used.
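Both practices can be illustrated with a simplified model that counts how many distinct 128-byte lines one warp touches for a given access stride, and that pads a problem size to a whole number of warps. This is a sketch under stated assumptions: 4-byte elements, a 128-byte line, and no modelling of finer-grained hardware behaviour or caching between accesses. The names `lines_touched` and `padded_length` are hypothetical.

```python
LINE_BYTES = 128  # line size used in the summary
WARP_SIZE = 32    # threads per warp


def lines_touched(stride_elements: int, element_bytes: int = 4,
                  base_address: int = 0) -> int:
    """Count distinct 128-byte lines hit when lane i reads element i * stride."""
    addresses = (base_address + lane * stride_elements * element_bytes
                 for lane in range(WARP_SIZE))
    return len({address // LINE_BYTES for address in addresses})


def padded_length(n: int) -> int:
    """Round a problem size up to a multiple of the warp size."""
    return -(-n // WARP_SIZE) * WARP_SIZE


for stride in (1, 2, 32):
    print(f"stride {stride:2d}: {lines_touched(stride)} line(s) per warp")
print(f"padded length for 1000 elements: {padded_length(1000)}")
```

In this model, consecutive 4-byte accesses from an aligned base fit in a single line, while a stride of 32 elements spreads the same 32 loads over 32 lines, so most of each fetched line goes unused.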
Topics
- GPU Scaling
- NVIDIA RTX 5090
- CUDA Programming
- Matrix Multiplication
- Performance Profiling
- Memory Latency
- Warp Scheduling
Best for: Machine Learning Engineer, AI Engineer, AI Hardware Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning on Medium.