Accelerating Data Processing with NVIDIA Multi-Instance GPU and NUMA Node Localization

2026-02-19 · Source: NVIDIA Technical Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, medium

Summary

NVIDIA's Ampere, Hopper, and Blackwell data center GPUs expose a single memory space but still exhibit non-uniform memory access (NUMA) behavior, which can affect performance and power efficiency as bandwidth increases. This analysis explores the GPU memory hierarchy and details how data transfers over die-to-die links affect power and performance, particularly on larger GPUs, where latency increases and power limits come into play. It then reviews NVIDIA's Multi-Instance GPU (MIG) mode as a way to localize data: partitioning a single GPU into multiple instances minimizes data transfers between NUMA nodes. Experiments with the Wilson-Dslash stencil operator show that MIG mode can achieve speedups of up to 2.25x over unlocalized approaches at lower GPU power limits (e.g., 400 W), though the benefit diminishes or reverses at higher power limits because of communication overhead.

Key takeaway

If you are an AI engineer optimizing GPU workloads under strict power limits, MIG-based NUMA node localization can yield substantial performance gains, up to 2.25x at 400 W. Evaluate the trade-offs carefully, though: MIG's benefit shrinks or turns negative at higher power limits because of inter-process communication overhead. The approach best suits smaller workloads that fit within MIG instances and need little cross-instance communication.
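The communication caveat can be made a little more concrete with a rough estimate. The sketch below is not taken from the original article or its measurements. It assumes a nearest-neighbor stencil (Wilson-Dslash couples each site of a 4D lattice to its neighbors one step away in each direction) and a lattice cut along a single dimension, and it computes how many halo sites each sub-domain must receive relative to the sites it owns. The function name `halo_to_volume_ratio` and the lattice sizes are hypothetical, chosen only for illustration.

```python
def halo_to_volume_ratio(lattice, split_dim, parts):
    """Halo sites received per sub-domain divided by the sites it owns.

    Assumes a nearest-neighbor stencil and a cut along one dimension only.
    Each sub-domain then has two boundary faces along the split dimension,
    and each face holds (local volume / local extent) sites, so the ratio
    reduces to 2 / local_extent.
    """
    extent = lattice[split_dim]
    if parts < 2:
        raise ValueError("need at least two sub-domains")
    if extent % parts != 0:
        raise ValueError("split dimension must divide evenly")
    local_extent = extent // parts
    return 2 / local_extent


if __name__ == "__main__":
    # Hypothetical lattice sizes, each split across two instances along
    # the last dimension.
    for lattice in [(32, 32, 32, 64), (16, 16, 16, 16)]:
        ratio = halo_to_volume_ratio(lattice, split_dim=3, parts=2)
        print(lattice, ratio)
```

Under these assumptions the ratio depends only on the local extent along the split dimension, so thin sub-domains exchange proportionally more data. The estimate ignores everything else that determines real overhead, such as how the instances exchange data and how much of it overlaps with compute.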

Key insights

Data localization via NVIDIA MIG can boost GPU performance under power constraints, by up to 2.25x at a 400 W limit in the Wilson-Dslash experiments.

Principles

Minimize inter-NUMA node data transfers.
Coherent L2 caching reduces data refetching.

Method

Partition a GPU into multiple MIG instances, assigning one instance per NUMA node to isolate memory access and eliminate L2 fabric transfers. Use `CUDA_VISIBLE_DEVICES` for process-to-instance mapping.
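A minimal sketch of the process-to-instance mapping, assuming the MIG instances already exist and their UUIDs are known. The helper names `local_rank_from_env` and `pin_to_mig_instance` are hypothetical, as are the placeholder UUIDs, and nothing here comes from the original article's code. It reads the per-node rank from variables that Open MPI (`OMPI_COMM_WORLD_LOCAL_RANK`) and Slurm (`SLURM_LOCALID`) set, and other launchers use different names.

```python
import os

# Environment variables that common launchers set to the per-node rank.
# Open MPI and Slurm are covered here; other launchers use other names.
LOCAL_RANK_VARS = ("OMPI_COMM_WORLD_LOCAL_RANK", "SLURM_LOCALID")


def local_rank_from_env(environ):
    """Return the per-node rank from the first launcher variable found, else 0."""
    for name in LOCAL_RANK_VARS:
        if name in environ:
            return int(environ[name])
    return 0


def pin_to_mig_instance(environ, mig_uuids):
    """Return a copy of environ with CUDA_VISIBLE_DEVICES set to one MIG instance."""
    if not mig_uuids:
        raise ValueError("no MIG instance UUIDs supplied")
    rank = local_rank_from_env(environ)
    pinned = dict(environ)
    # Round-robin so that ranks beyond the instance count still get a device.
    pinned["CUDA_VISIBLE_DEVICES"] = mig_uuids[rank % len(mig_uuids)]
    return pinned


if __name__ == "__main__":
    # Placeholder UUIDs for illustration only; real ones come from `nvidia-smi -L`.
    example_uuids = ["MIG-aaaaaaaa-0000", "MIG-bbbbbbbb-0001"]
    env = pin_to_mig_instance(os.environ, example_uuids)
    print(env["CUDA_VISIBLE_DEVICES"])
```

The variable has to be in place before the process initializes CUDA, so a mapping like this usually lives in a small wrapper that sets the environment and then starts the real application.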

In practice

Use MIG for power-constrained GPU workloads.
Create one MIG instance per NUMA node.
Map MPI processes to specific MIG instances.
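Before processes can be mapped, the MIG instance identifiers have to be discovered. One way, sketched here under assumptions, is to parse the device listing printed by `nvidia-smi -L`. The function names and the sample listing are hypothetical: the sample uses placeholder UUIDs and an illustrative layout, and the exact text can differ between driver versions.

```python
import re
import subprocess

# Matches the "(UUID: MIG-...)" suffix that `nvidia-smi -L` prints for
# each MIG device; whole-GPU lines carry a "GPU-" UUID and are skipped.
MIG_UUID_PATTERN = re.compile(r"\(UUID:\s*(MIG-[^)\s]+)\)")


def parse_mig_uuids(listing):
    """Extract MIG instance UUIDs, in listing order, from `nvidia-smi -L` text."""
    return MIG_UUID_PATTERN.findall(listing)


def discover_mig_uuids():
    """Run `nvidia-smi -L` and return the MIG UUIDs it reports."""
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
    )
    return parse_mig_uuids(result.stdout)


# Illustrative sample with placeholder UUIDs, not output from a real system.
SAMPLE_LISTING = """\
GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-11111111-2222-3333-4444-555555555555)
  MIG 3g.40gb     Device  0: (UUID: MIG-aaaaaaaa-bbbb-cccc-dddd-000000000000)
  MIG 3g.40gb     Device  1: (UUID: MIG-aaaaaaaa-bbbb-cccc-dddd-000000000001)
"""

if __name__ == "__main__":
    for uuid in parse_mig_uuids(SAMPLE_LISTING):
        print(uuid)
```

Each returned UUID can be used directly as a `CUDA_VISIBLE_DEVICES` value for one process.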

Topics

NVIDIA GPUs
NUMA Architecture
Multi-Instance GPU
Data Locality
Performance Optimization

Code references

lattice/quda

Best for: AI Engineer, Machine Learning Engineer, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.