Scaling AI Inference Across Multiple GPUs Using NVIDIA TensorRT with Multi-Device Inference Support

· Source: NVIDIA Technical Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Cloud Computing & IT Infrastructure · Depth: Advanced, long

Summary

NVIDIA TensorRT 11.0 introduces multi-device inference support, which lets generative AI workloads scale beyond the memory and compute limits of a single GPU. The feature integrates with NVIDIA NCCL and uses its collective operations, such as AllReduce and AllGather, for distributed inference. The article details two parallelism strategies: tensor parallelism, which partitions model weights across GPUs, and context parallelism, which partitions the input sequence. Three context-parallel implementations are discussed: AllGather KV, Ring Attention, and DeepSpeed Ulysses. In benchmarks on a single 8-GPU node, using NVIDIA Cosmos 3 for video generation and FLUX.1 for image generation, Ulysses consistently delivered the lowest end-to-end latency, especially for models with very long context lengths; Ring Attention also scaled well on FLUX.1. With multi-device support, developers can deploy large PyTorch models converted via Torch-TensorRT into C++ production environments.
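To make the tensor-parallel idea concrete, here is a minimal sketch in plain Python. It is a simulation only: it uses no GPUs, NCCL, or TensorRT, and every function name in it is hypothetical. It assumes the weight matrix is split by columns across two simulated ranks, so each rank produces a partial result and an AllReduce-style sum recovers the full output.

```python
# Simulated tensor parallelism in pure Python (no GPUs, NCCL, or TensorRT).
# All names are hypothetical and exist only for illustration.

def matvec(weight, x):
    """Dense y = W @ x for a list-of-rows matrix and a list vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def split_columns(weight, world_size):
    """Give each simulated rank a contiguous block of W's columns."""
    step = len(weight[0]) // world_size  # assumes an even split
    return [[row[r * step:(r + 1) * step] for row in weight]
            for r in range(world_size)]

def split_vector(x, world_size):
    """Split the input so each rank gets the entries matching its columns."""
    step = len(x) // world_size
    return [x[r * step:(r + 1) * step] for r in range(world_size)]

def all_reduce_sum(per_rank):
    """Stand-in for AllReduce: every rank ends up with the elementwise sum."""
    total = [sum(vals) for vals in zip(*per_rank)]
    return [list(total) for _ in per_rank]

W = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
x = [1, 0, 2, 1]
world_size = 2

w_shards = split_columns(W, world_size)
x_shards = split_vector(x, world_size)

# Each rank multiplies only its own weight shard, yielding a partial sum.
partials = [matvec(w, xs) for w, xs in zip(w_shards, x_shards)]

# After the reduction every rank holds the same full result.
reduced = all_reduce_sum(partials)
reference = matvec(W, x)
print(reduced[0], reference)
```

Each rank stores only half of the weights, which is the memory saving that tensor parallelism is after. The cost is one collective per partitioned layer.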

Key takeaway

For AI engineers whose generative models exceed single-GPU capacity, the multi-device inference support in NVIDIA TensorRT 11.0 offers a way to scale out. It relies on NCCL-backed collective operations, and context parallelism strategies such as DeepSpeed Ulysses are worth evaluating first for long-sequence workloads, where the article's benchmarks showed the lowest latency. Large PyTorch models converted via Torch-TensorRT can then be deployed in C++ production environments.
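The data movement behind a Ulysses-style strategy can be sketched in plain Python. This is a simulation with hypothetical function names, not code from the article or from DeepSpeed. It assumes each rank starts with a slice of the sequence for all attention heads, and that an all-to-all exchange regroups the data so each rank holds the full sequence for a subset of heads. Attention can then run locally per head, and a second all-to-all restores the sequence sharding.

```python
# Simulated Ulysses-style resharding in pure Python (no GPUs or NCCL).
# All names are hypothetical; values are (token, head) labels, not real activations.

def seq_to_head_sharding(seq_shards, world_size):
    """All-to-all: [tokens/P][heads] per rank -> [tokens][heads/P] per rank."""
    heads = len(seq_shards[0][0])
    heads_per_rank = heads // world_size  # assumes an even split
    out = []
    for dst in range(world_size):
        rows = []
        for src in range(world_size):  # visit source ranks in sequence order
            for token in seq_shards[src]:
                rows.append(token[dst * heads_per_rank:(dst + 1) * heads_per_rank])
        out.append(rows)
    return out

def head_to_seq_sharding(head_shards, world_size):
    """Inverse all-to-all: [tokens][heads/P] per rank -> [tokens/P][heads] per rank."""
    tokens = len(head_shards[0])
    tokens_per_rank = tokens // world_size
    out = []
    for dst in range(world_size):
        rows = []
        for t in range(dst * tokens_per_rank, (dst + 1) * tokens_per_rank):
            row = []
            for src in range(world_size):  # reassemble heads in rank order
                row.extend(head_shards[src][t])
            rows.append(row)
        out.append(rows)
    return out

world_size, num_tokens, num_heads = 2, 4, 2
step = num_tokens // world_size

# Rank r starts with tokens [r*step, (r+1)*step) for every head.
seq_shards = [[[(t, h) for h in range(num_heads)]
               for t in range(r * step, (r + 1) * step)]
              for r in range(world_size)]

head_shards = seq_to_head_sharding(seq_shards, world_size)
round_trip = head_to_seq_sharding(head_shards, world_size)
print(head_shards[0])
```

After the first exchange, rank 0 sees every token of head 0, so its attention needs no further communication. This layout only works when the head count divides evenly across ranks, which the sketch assumes.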

Key insights

TensorRT 11.0 scales generative AI inference across multiple GPUs using NCCL and parallelism strategies.

Principles

Method

Create a TensorRT network that uses "IDistCollectiveLayer" for the collective operations, build the engine, set the NCCL communicator, and enqueue inference. Launch the processes with OpenMPI.
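This summary does not show the TensorRT calls themselves, so the following is not TensorRT code. It is a plain Python simulation, with hypothetical function names, of what an AllGather collective inside an attention block is meant to compute under the AllGather KV strategy. It assumes non-causal attention, and that each rank keeps its own slice of the queries while gathering the full keys and values.

```python
import math

# Simulated AllGather KV context parallelism (no GPUs, NCCL, or TensorRT).
# All names are hypothetical and exist only for illustration.

def attention(queries, keys, values):
    """Plain non-causal softmax attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        norm = sum(weights)
        out.append([sum(w * v[d] for w, v in zip(weights, values)) / norm
                    for d in range(len(values[0]))])
    return out

def shard(seq, world_size):
    """Split a sequence into contiguous, equally sized per-rank slices."""
    step = len(seq) // world_size  # assumes an even split
    return [seq[r * step:(r + 1) * step] for r in range(world_size)]

def all_gather(per_rank):
    """Stand-in for AllGather: every rank receives all shards in rank order."""
    gathered = [item for part in per_rank for item in part]
    return [list(gathered) for _ in per_rank]

Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
world_size = 2

q_shards = shard(Q, world_size)               # queries stay local
k_full = all_gather(shard(K, world_size))     # keys are gathered on every rank
v_full = all_gather(shard(V, world_size))     # values are gathered on every rank

# Each rank attends from its own queries to the full key/value sequence.
local_out = [attention(q_shards[r], k_full[r], v_full[r])
             for r in range(world_size)]
context_parallel = [row for part in local_out for row in part]
reference = attention(Q, K, V)
```

Concatenating the per-rank outputs in rank order reproduces single-device attention. The trade-off is that every rank must hold the full keys and values, which Ring Attention and Ulysses avoid in different ways.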

Best for: AI Engineer, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.