AI in Multiple GPUs: Point-to-Point and Collective Operations

2026-02-13 · Source: Towards Data Science · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, medium

Summary

This article, part two of a series on distributed AI, covers PyTorch's `torch.distributed` module for multi-GPU communication, focusing on point-to-point and collective operations. It explains how PyTorch delegates data transfer to backends such as NVIDIA's `NCCL`, which detects the interconnect topology and routes traffic efficiently over links such as PCIe, NVLink, and InfiniBand. The discussion covers both synchronous (blocking) and asynchronous (non-blocking) communication, and notes that asynchronous calls let computation overlap with communication. The collective operations illustrated include one-to-all (broadcast, scatter), all-to-one (reduce, gather), and all-to-all (all_reduce, all_gather, reduce_scatter) patterns, along with the synchronization calls `request.wait()` and `torch.cuda.synchronize()`.

Key takeaway

For Machine Learning Engineers building multi-GPU models, understanding `torch.distributed` operations is crucial for optimizing performance. Prefer asynchronous communication with `isend`/`irecv`, calling `request.wait()` only when the data is needed, so that computation overlaps with communication and GPUs spend less time idle. Account for `NCCL`'s "warm-up" behavior when benchmarking, and keep `request.wait()` and `torch.cuda.synchronize()` distinct: the first waits on a single communication request, while the second blocks the host until all queued GPU work has finished. Confusing the two can lead to deadlocks or misleading timings.
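
The benchmarking caveat above can be sketched as a small timing helper. This is a minimal sketch, not the article's code: `time_op`, `warmup`, and `iters` are hypothetical names. It assumes the first few calls carry one-time setup cost (the "warm-up" behavior) and that GPU work is queued asynchronously, so the host clock is only meaningful after `torch.cuda.synchronize()`.

```python
import time

import torch


def time_op(fn, warmup=3, iters=10):
    """Return average wall-clock seconds per call of fn, excluding warm-up calls.

    Hypothetical helper for illustration only.
    """
    use_cuda = torch.cuda.is_available()

    # Warm-up calls: absorb one-time setup cost so it does not skew the average.
    for _ in range(warmup):
        fn()
    if use_cuda:
        torch.cuda.synchronize()  # drain pending GPU work before starting the clock

    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if use_cuda:
        torch.cuda.synchronize()  # wait until the queued GPU work has actually finished
    return (time.perf_counter() - start) / iters
```

On a machine without CUDA the helper simply skips the synchronization calls. On a GPU, leaving them out would typically measure only how long it takes to enqueue the work.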

Key insights

PyTorch's `torch.distributed` module orchestrates multi-GPU communication via point-to-point and collective operations.

Principles

Asynchronous communication enables computation-communication overlap.
NCCL optimizes data transfer based on GPU interconnect topology.
Synchronization is critical for correct distributed program execution.

Method

PyTorch's `torch.distributed` module exposes synchronous and asynchronous point-to-point and collective operations (broadcast, scatter, reduce, gather, all_reduce, all_gather, and reduce_scatter) and, on NVIDIA GPUs, relies on `NCCL` to carry them out.
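
Three of the collectives named above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the article's code: `demo_collectives` is a hypothetical function, the tensors live on the CPU, and the process group is assumed to be initialized already (for example with the `gloo` backend so the sketch runs without GPUs; on NVIDIA GPUs one would typically pass `backend="nccl"` and move the tensors to each rank's device).

```python
import torch
import torch.distributed as dist


def demo_collectives():
    """Run broadcast, all_reduce, and all_gather on small CPU tensors.

    Assumes dist.init_process_group(...) has already been called.
    """
    rank, world = dist.get_rank(), dist.get_world_size()

    # broadcast: after the call, every rank holds rank 0's values.
    weights = torch.arange(3.0) if rank == 0 else torch.zeros(3)
    dist.broadcast(weights, src=0)

    # all_reduce: in-place sum across ranks; every rank ends up with the total.
    total = torch.tensor([float(rank + 1)])
    dist.all_reduce(total, op=dist.ReduceOp.SUM)

    # all_gather: gathered[i] receives the tensor contributed by rank i.
    gathered = [torch.zeros(1) for _ in range(world)]
    dist.all_gather(gathered, torch.tensor([float(rank)]))

    return weights, total, gathered
```

One way to try it is to call `dist.init_process_group(backend="gloo")` at the start of a script, call `demo_collectives()`, and launch the script with `torchrun --nproc_per_node=2`. With two ranks, `total` should hold 3.0 on both.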

In practice

Use `isend`/`irecv` with `request.wait()` for non-blocking transfers.
Use `torch.distributed.broadcast` to copy a tensor from one source rank to every other rank.
Use `torch.distributed.all_reduce` to sum a tensor across ranks and leave the result on every GPU.
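
The non-blocking pattern from the first item can be sketched as a ring exchange. This is a minimal sketch, not the article's code: `ring_exchange` is a hypothetical function, and it assumes an already initialized process group with at least two ranks and CPU tensors (for example the `gloo` backend).

```python
import torch
import torch.distributed as dist


def ring_exchange(payload):
    """Send payload to the next rank and receive from the previous one.

    Hypothetical helper; assumes dist.init_process_group(...) was called
    and that the world size is at least 2.
    """
    rank, world = dist.get_rank(), dist.get_world_size()
    recv_buf = torch.empty_like(payload)

    # Both calls return immediately with a request handle.
    send_req = dist.isend(payload, dst=(rank + 1) % world)
    recv_req = dist.irecv(recv_buf, src=(rank - 1) % world)

    # Independent work can overlap with the transfer. It must not write to
    # payload or read recv_buf until the requests have completed.
    local = payload * 2

    # Wait on each request before the buffers are used again.
    send_req.wait()
    recv_req.wait()
    return local, recv_buf
```

Because every rank posts its send and its receive before waiting, no rank blocks on a peer that has not yet posted the matching call, which is the usual source of deadlock with blocking `send`/`recv` in a ring.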

Topics

PyTorch Distributed
Multi-GPU Communication
NCCL
Collective Operations
Point-to-Point Communication

Best for: Machine Learning Engineer, Deep Learning Engineer, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.