Accelerating Data Processing with NVIDIA Multi-Instance GPU and NUMA Node Localization
Summary
NVIDIA's Ampere, Hopper, and Blackwell data center GPUs exhibit non-uniform memory access (NUMA) behavior even though they expose a single memory space, and this can hurt performance and power efficiency as bandwidth increases. This analysis explores the GPU memory hierarchy and details how data transfers over die-to-die links affect power and performance, in particular through added latency and through power limits on larger GPUs. It then reviews NVIDIA's Multi-Instance GPU (MIG) mode as a way to localize data: a single GPU is partitioned into multiple instances, which minimizes data transfers between NUMA nodes. Experimental results using the Wilson-Dslash stencil operator show that MIG mode can achieve speedups of up to 2.25x at lower GPU power limits (e.g., 400 W) compared to unlocalized approaches, though its benefits diminish or reverse at higher power limits due to communication overhead.
Key takeaway
For AI engineers optimizing GPU workloads under strict power limits, MIG-based NUMA node localization can yield substantial performance gains: up to 2.25x at 400 W. Evaluate the trade-offs carefully, however, because the benefit shrinks or turns into a slowdown at higher power limits as inter-process communication overhead grows. Consider this approach for smaller workloads that fit within MIG instances and need minimal cross-instance communication.
Key insights
Data localization via NVIDIA MIG can speed up GPU workloads under power constraints, by up to 2.25x at a 400 W limit in the Wilson-Dslash experiments.
Principles
- Minimize inter-NUMA node data transfers.
- Coherent L2 caching reduces data refetching.
Method
Partition a GPU into multiple MIG instances, assigning one instance per NUMA node to isolate memory access and eliminate L2 fabric transfers. Use `CUDA_VISIBLE_DEVICES` for process-to-instance mapping.
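The process-to-instance mapping described above can be sketched as follows. This is a minimal illustration, not the article's implementation: it assumes MIG instances already exist, that their UUIDs are read from the output of `nvidia-smi -L`, and that the launcher exports a per-node rank through `OMPI_COMM_WORLD_LOCAL_RANK` (Open MPI) or `SLURM_LOCALID` (Slurm). The helper names and the sample listing with placeholder UUIDs are hypothetical.

```python
import re
from typing import List, Mapping

# Matches MIG device UUIDs in `nvidia-smi -L` output, e.g. "(UUID: MIG-...)".
_MIG_UUID = re.compile(r"\(UUID:\s*(MIG-[^)\s]+)\)")


def parse_mig_uuids(listing: str) -> List[str]:
    """Return MIG device UUIDs in the order they appear in the listing."""
    return _MIG_UUID.findall(listing)


def local_rank(env: Mapping[str, str]) -> int:
    """Read the per-node rank from common launcher variables (assumed set)."""
    for name in ("OMPI_COMM_WORLD_LOCAL_RANK", "SLURM_LOCALID"):
        if name in env:
            return int(env[name])
    return 0  # single-process fallback


def select_mig_device(listing: str, env: Mapping[str, str]) -> str:
    """Pick one MIG instance per local rank, wrapping around if needed."""
    uuids = parse_mig_uuids(listing)
    if not uuids:
        raise RuntimeError("no MIG devices found in listing")
    return uuids[local_rank(env) % len(uuids)]


# Hypothetical sample of `nvidia-smi -L` output; the UUIDs are placeholders.
SAMPLE_LISTING = """\
GPU 0: NVIDIA GPU (UUID: GPU-00000000-0000-0000-0000-000000000000)
  MIG 3g.40gb     Device  0: (UUID: MIG-aaaaaaaa-0000-0000-0000-000000000000)
  MIG 3g.40gb     Device  1: (UUID: MIG-bbbbbbbb-0000-0000-0000-000000000000)
"""
```

In a real run, the listing would come from executing `nvidia-smi -L`, and the selected UUID would be written to `CUDA_VISIBLE_DEVICES` before any CUDA library initializes in the process, so that each rank sees only its own instance.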
In practice
- Use MIG for power-constrained GPU workloads.
- Create one MIG instance per NUMA node.
- Map MPI processes to specific MIG instances.
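As a way to check the mapping before involving an MPI launcher, the sketch below starts one worker process per MIG instance from a parent script, giving each child its own `CUDA_VISIBLE_DEVICES`. It is an assumption-laden illustration rather than the article's method: `launch_per_instance` and `ECHO_WORKER` are hypothetical names, and the worker here only echoes the variable instead of running a GPU kernel.

```python
import os
import subprocess
import sys
from typing import List, Sequence


def launch_per_instance(mig_uuids: Sequence[str],
                        worker_cmd: Sequence[str]) -> List[str]:
    """Start one worker per MIG UUID and return each worker's stdout."""
    procs = []
    for uuid in mig_uuids:
        # Copy the parent environment and restrict the child to one instance.
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=uuid)
        procs.append(
            subprocess.Popen(list(worker_cmd), env=env,
                             stdout=subprocess.PIPE, text=True)
        )
    # Collect output in launch order once all workers have been started.
    return [p.communicate()[0].strip() for p in procs]


# Stand-in worker (hypothetical): prints the device it was restricted to.
ECHO_WORKER = [
    sys.executable, "-c",
    "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])",
]
```

Replacing `ECHO_WORKER` with the actual application command would give each process exclusive use of one instance; any data exchanged between instances still has to go through inter-process communication, which is the overhead the summary warns about at higher power limits.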
Topics
- NVIDIA GPUs
- NUMA Architecture
- Multi-Instance GPU
- Data Locality
- Performance Optimization
Best for: AI Engineer, Machine Learning Engineer, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.