GPU Scaling from 1 to 261120 threads Part 1

· Source: Machine Learning on Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, medium

Summary

This analysis examines how matrix multiplication performance scales from 1 to 261,120 threads on an NVIDIA RTX 5090 (Blackwell architecture, compute capability 12.0). Profiling with Nsight Compute (ncu) showed that going from 1 thread to 32 threads (one warp) yielded only a 10x speedup, not 32x. The shortfall stems from memory latency: an L2 cache hit takes ~150 cycles, and a single active warp cannot fully hide that delay. Launching fewer than 32 threads also wastes bandwidth, because a full 128-byte cache line is fetched regardless of how many threads consume it, so utilization is low (e.g., about 3% for 1 thread). Stall analysis showed L1TEX (memory latency) stalls dominating at 80% for 32 threads, which points to the need for overlapping warps or tiled memory access.
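The utilization and speedup figures in the summary follow from simple arithmetic. The sketch below is a minimal model, not code from the original article: it assumes each thread loads one 4-byte float, that the threads read consecutive addresses, and that the fetch granularity is the 128 bytes quoted in the summary. The function names are illustrative.

```python
WARP_SIZE = 32     # threads per warp on NVIDIA GPUs
FETCH_BYTES = 128  # fetch granularity quoted in the summary
ELEM_BYTES = 4     # assumption: one float32 loaded per thread


def fetch_utilization(active_threads: int) -> float:
    """Fraction of one 128-byte fetch consumed by the active threads of a warp."""
    if not 1 <= active_threads <= WARP_SIZE:
        raise ValueError("active_threads must be between 1 and 32")
    return active_threads * ELEM_BYTES / FETCH_BYTES


def parallel_efficiency(speedup: float, threads: int) -> float:
    """Measured speedup divided by the ideal (linear) speedup."""
    return speedup / threads


if __name__ == "__main__":
    for n in (1, 8, 16, 32):
        print(f"{n:2d} threads use {fetch_utilization(n):.1%} of the fetch")
    # The summary reports a 10x speedup when going from 1 to 32 threads.
    print(f"1 -> 32 threads: {parallel_efficiency(10, 32):.1%} of ideal scaling")
```

Under these assumptions a single thread consumes 4 of 128 bytes, which matches the roughly 3% figure above, and a 10x speedup on 32 threads is a bit under a third of linear scaling.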

Key takeaway

For AI/ML engineers optimizing GPU kernels, understanding thread scaling is crucial. Launch kernels with block sizes that are multiples of 32 so that every warp is fully populated and no lanes sit idle. To mitigate memory latency stalls, use coalesced memory access patterns, in which the 32 threads of a warp read adjacent addresses so that each 128-byte cache line fetched is fully consumed. Together, these steps can significantly improve performance, especially for matrix multiplication workloads.
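Both recommendations can be checked with a small host-side model. This is a sketch under stated assumptions (4-byte elements, 128-byte aligned fetches); `round_up_to_warp` and `lines_touched` are hypothetical helper names, not part of CUDA or the original article.

```python
WARP_SIZE = 32    # threads per warp on NVIDIA GPUs
LINE_BYTES = 128  # assumption: 128-byte aligned fetch granularity
ELEM_BYTES = 4    # assumption: float32 elements


def round_up_to_warp(threads: int) -> int:
    """Smallest multiple of 32 that is >= threads."""
    if threads < 1:
        raise ValueError("threads must be positive")
    return (threads + WARP_SIZE - 1) // WARP_SIZE * WARP_SIZE


def lines_touched(stride_elems: int, base_byte: int = 0) -> int:
    """Distinct 128-byte lines one warp touches when lane i reads element i * stride_elems."""
    lines = {
        (base_byte + lane * stride_elems * ELEM_BYTES) // LINE_BYTES
        for lane in range(WARP_SIZE)
    }
    return len(lines)


if __name__ == "__main__":
    print(round_up_to_warp(100))          # block size padded to whole warps
    for stride in (1, 2, 32):
        print(stride, lines_touched(stride))
```

In a row-major matrix, lanes that walk along a row correspond to a stride of 1, while lanes that walk down a column have a stride equal to the row width, which in this model spreads a single warp's loads across many more lines.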

Key insights

GPU performance scales non-linearly with threads due to memory latency and inefficient warp utilization.

Best for: Machine Learning Engineer, AI Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning on Medium.