Multi-Node Distributed Inference for Diffusion Models with xDiT
Summary
AMD's blog post details how to implement multi-node distributed inference for diffusion models, specifically HunyuanVideo, on AMD Instinct MI300X accelerators. It addresses the computational demands of generative AI models such as text-to-video systems, which often suffer from high inference latency. The solution uses the xDiT library and Unified Sequence Parallelism (USP), which combines DeepSpeed-Ulysses and Ring Attention to distribute attention-dominated workloads across multiple GPUs and nodes. Efficient communication is critical: RCCL provides the GPU collectives, and RoCE (RDMA over Converged Ethernet) carries the traffic between nodes, while AITER and FlashAttention v3 further optimize kernel performance. The article provides practical steps for host and container setup, including driver validation and launching inference with `torchrun`.
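As a rough illustration of the `torchrun` launch step, the sketch below assembles a per-node command line. The `torchrun` flags are standard PyTorch options; the script name `run_inference.py`, the address `10.0.0.1`, and the helper `torchrun_command` are hypothetical placeholders, not taken from the original article.

```python
import shlex


def torchrun_command(nnodes, gpus_per_node, node_rank, master_addr,
                     master_port=29500, script="run_inference.py",
                     script_args=()):
    """Build the torchrun argv for one node of a multi-node job.

    `script` and `script_args` are placeholders for whatever inference
    entry point is actually used; they are assumptions of this sketch.
    """
    return [
        "torchrun",
        f"--nnodes={nnodes}",                # total number of nodes
        f"--nproc_per_node={gpus_per_node}",  # one process per GPU
        f"--node_rank={node_rank}",          # 0 on the first node, 1 on the second, ...
        f"--master_addr={master_addr}",      # rendezvous host reachable by all nodes
        f"--master_port={master_port}",
        script,
        *script_args,
    ]


# Example: print the command to run on the first of two 8-GPU nodes.
print(shlex.join(torchrun_command(2, 8, 0, "10.0.0.1")))
```

The same command is run on every node, changing only the node rank.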
Key takeaway
For AI Engineers deploying large diffusion models like HunyuanVideo on AMD Instinct MI300X, choose the parallelization strategy carefully. Prefer Ulysses when the number of attention heads divides evenly across the GPUs, since that gives the largest latency reduction. For GPU counts that do not divide the head count, combine Ulysses with Ring Attention via USP, and expect higher communication overhead. Verify the RoCE driver setup and network configuration so that inter-node traffic does not bottleneck or silently fall back to TCP/IP.
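One way to reason about this choice is a small heuristic that picks the largest Ulysses degree compatible with the head count and hands the remainder to Ring Attention. This is a minimal sketch, not xDiT's own selection logic; the function name and the head count of 24 are illustrative assumptions.

```python
def choose_usp_degrees(num_heads: int, world_size: int) -> tuple[int, int]:
    """Return (ulysses_degree, ring_degree) whose product is world_size.

    Heuristic assumption: use the largest Ulysses degree that divides both
    the GPU count and the attention-head count, because Ulysses splits
    work across heads. Ring Attention covers the remaining factor.
    """
    if num_heads < 1 or world_size < 1:
        raise ValueError("num_heads and world_size must be positive")
    ulysses = max(
        d for d in range(1, world_size + 1)
        if world_size % d == 0 and num_heads % d == 0
    )
    return ulysses, world_size // ulysses


# Illustrative head count of 24 (an assumed value for this example).
for gpus in (8, 16, 32):
    print(gpus, choose_usp_degrees(24, gpus))
```

With 24 heads, 8 GPUs can run pure Ulysses, while 16 or 32 GPUs need a ring degree of 2 or 4 on top of a Ulysses degree of 8.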
Key insights
Multi-node inference for diffusion models on AMD MI300X reduces latency by distributing computation and memory.
Principles
- Efficient communication is critical for multi-node inference.
- Ulysses scales effectively for low-latency inference.
- Ring Attention enables broader scale-out configurations.
Method
Distribute diffusion Transformer workloads using xDiT's Unified Sequence Parallelism (USP), combining Ulysses and Ring Attention, with RCCL and RoCE for communication, and AITER for kernel optimization.
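To make the combination of the two schemes concrete, the sketch below arranges ranks into a 2D layout: consecutive ranks form a Ulysses group, and ranks at the same position across those groups form a ring group. The layout and the helper `usp_groups` are illustrative assumptions; the actual group construction in xDiT may differ.

```python
def usp_groups(world_size: int, ulysses_degree: int):
    """Split ranks 0..world_size-1 into Ulysses groups and ring groups.

    Assumed layout: each Ulysses group holds consecutive ranks (so it can
    stay inside one node when ulysses_degree equals the GPUs per node),
    and each ring group strides across the Ulysses groups.
    """
    if world_size % ulysses_degree != 0:
        raise ValueError("ulysses_degree must divide world_size")
    ring_degree = world_size // ulysses_degree
    ulysses = [
        list(range(r * ulysses_degree, (r + 1) * ulysses_degree))
        for r in range(ring_degree)
    ]
    ring = [
        list(range(u, world_size, ulysses_degree))
        for u in range(ulysses_degree)
    ]
    return ulysses, ring


# Example: two 8-GPU nodes, Ulysses inside each node, ring across nodes.
ulysses_groups, ring_groups = usp_groups(16, 8)
print(ulysses_groups)
print(ring_groups)
```

In this layout every rank belongs to exactly one Ulysses group and one ring group, so the bandwidth-heavy all-to-all of Ulysses can stay within a node while the ring exchange crosses the network.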
In practice
- Use `ibv_devices` to verify RDMA device visibility.
- Set `NCCL_IB_HCA` and `NCCL_SOCKET_IFNAME` to select the RDMA device and network interface used for RoCE.
- Enable `NCCL_DEBUG=INFO` to confirm RDMA usage.
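The checks above could be scripted roughly as follows. This is a sketch under stated assumptions: the device and interface names (`mlx5_0`, `eth0`) are placeholders, both helper functions are hypothetical, and the "Using network ..." line that the parser looks for is assumed wording that can vary between NCCL/RCCL versions.

```python
import os


def roce_env(hca: str, ifname: str, base=None) -> dict:
    """Return a copy of the environment with the RoCE-related variables set."""
    env = dict(os.environ if base is None else base)
    env.update({
        "NCCL_IB_HCA": hca,            # RDMA device(s) to use, e.g. as listed by ibv_devices
        "NCCL_SOCKET_IFNAME": ifname,  # IP interface used for bootstrap traffic
        "NCCL_DEBUG": "INFO",          # emit transport selection in the log
    })
    return env


def detect_transport(log_text: str) -> str:
    """Classify a debug log as 'rdma', 'socket', or 'unknown'.

    Assumption: the log contains a line such as 'Using network IB' or
    'Using network Socket'; adjust the marker if your version differs.
    """
    for line in log_text.splitlines():
        if "Using network" in line:
            name = line.split("Using network", 1)[1].strip().lower()
            if name.startswith("ib"):
                return "rdma"
            if name.startswith("socket"):
                return "socket"
    return "unknown"


env = roce_env("mlx5_0", "eth0", base={})
print(env)
```

A result of `socket` from `detect_transport` on a multi-node run would indicate the TCP/IP fallback that the takeaway warns about.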
Topics
- Multi-Node Inference
- Diffusion Models
- Unified Sequence Parallelism
- AMD Instinct MI300X
- RoCE Networking
Best for: MLOps Engineer, AI Engineer, Deep Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.