Streamlining Recommendation Model Training on AMD Instinct™ GPUs

2026-03-02 · Source: AMD ROCm Blogs · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Intermediate, medium

Summary

AMD has streamlined recommendation model training on its Instinct™ GPUs by integrating the required libraries into the ROCm training docker, targeting workloads such as DLRMv2. Unlike LLMs, recommendation models rely on large sparse embedding tables, which often produce complex, imbalanced communication patterns across GPUs and heavier load on the CPU-GPU interconnect. The ROCm docker ships with TorchRec and FBGEMM pre-installed to provide high-performance computation and communication. A key step is configuring TorchRec's sharding planner, which uses a performance model built from system specifications (e.g., 192GB of HBM and 5.3 TB/s of HBM bandwidth on MI300X GPUs) to decide how embedding tables are distributed. With that much memory per GPU, a higher fraction of embedding tables can be placed locally via data parallel (DP) sharding, which reduces communication bottlenecks and improves end-to-end training performance, including in multi-node deployments.

Key takeaway

For MLOps engineers deploying recommendation systems on AMD Instinct GPUs, the ROCm training docker with TorchRec and FBGEMM pre-installed is the recommended starting point. Configure the TorchRec sharding planner with your system's actual HBM capacity, HBM bandwidth, and interconnect bandwidths so that it can place as many embedding tables locally as memory allows. Doing so reduces communication overhead and supports efficient training of DLRMv2 and similar models, including in multi-node environments.

Key insights

The ROCm training docker simplifies recommendation model training on AMD Instinct GPUs by shipping TorchRec and FBGEMM pre-installed for sparse embedding workloads.

Principles

Large HBM capacity lets more embedding tables stay local to each GPU, which reduces communication bottlenecks.
System-aware sharding optimizes distributed training performance.
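
The first principle comes down to simple arithmetic: an embedding table occupies roughly rows × dimension × bytes per element, and a table can only be replicated on every GPU if that footprint fits in local HBM. The sketch below works through this for the 192GB MI300X figure quoted in the summary. The table shapes, the 50% memory reserve for activations and optimizer state, and the 80GB comparison device are all illustrative assumptions, not values from the original article.

```python
GB = 10**9  # decimal gigabytes, matching the 192GB HBM figure


def table_bytes(num_rows: int, dim: int, bytes_per_elem: int = 4) -> int:
    """Weight footprint of one embedding table (optimizer state excluded)."""
    return num_rows * dim * bytes_per_elem


def fits_replicated(tables, hbm_bytes, reserve_fraction=0.5):
    """True if one GPU can hold a full copy of every table.

    reserve_fraction is an assumed share of HBM kept free for
    activations, dense layers, and optimizer state.
    """
    budget = hbm_bytes * (1 - reserve_fraction)
    return sum(table_bytes(rows, dim) for rows, dim in tables) <= budget


# Hypothetical (rows, dim) shapes, fp32 weights.
tables = [(100_000_000, 128), (40_000_000, 128), (10_000_000, 128)]
total_bytes = sum(table_bytes(rows, dim) for rows, dim in tables)

fits_on_192gb = fits_replicated(tables, 192 * GB)  # MI300X capacity from the summary
fits_on_80gb = fits_replicated(tables, 80 * GB)    # assumed smaller device for contrast

print(f"total footprint: {total_bytes / GB:.1f} GB")
print(f"replicable with 192GB HBM: {fits_on_192gb}")
print(f"replicable with 80GB HBM:  {fits_on_80gb}")
```

Under these assumptions the same set of tables can be kept fully local on the larger device but would have to be split across GPUs on the smaller one, which is where the extra communication comes from.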

Method

Configure TorchRec's sharding planner with system topology (HBM capacity, bandwidths) to optimize embedding table distribution across GPUs, favoring local placement to minimize communication.
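
To make the idea of favoring local placement concrete, here is a deliberately simplified greedy planner in plain Python. It is not TorchRec's algorithm or API: the real planner uses a performance model that also accounts for bandwidths and compute, whereas this toy only checks memory and ignores the gradient synchronization cost of replicated tables. The `Table` class, `plan_sharding` function, table sizes, and HBM budgets are all hypothetical, and the strategy labels are plain strings.

```python
from dataclasses import dataclass

GB = 10**9


@dataclass
class Table:
    name: str
    size_bytes: int


def plan_sharding(tables, world_size, hbm_budget_bytes):
    """Toy memory-only planner.

    Starts with every table split evenly across GPUs, then promotes
    tables to full replication (smallest first) while the per-GPU
    budget allows it. Returns (plan, per_gpu_bytes_used).
    """
    # Per-GPU cost of a table when split evenly (ceiling division).
    shard_cost = {t.name: -(-t.size_bytes // world_size) for t in tables}
    used = sum(shard_cost.values())
    if used > hbm_budget_bytes:
        raise ValueError("tables do not fit even when fully sharded")

    plan = {t.name: "row_wise" for t in tables}
    for t in sorted(tables, key=lambda t: t.size_bytes):
        extra = t.size_bytes - shard_cost[t.name]  # added cost of a full local copy
        if used + extra <= hbm_budget_bytes:
            plan[t.name] = "data_parallel"
            used += extra
    return plan, used


# Hypothetical table sizes on an 8-GPU node.
tables = [
    Table("user", 64 * GB),
    Table("item", 16 * GB),
    Table("category", 1 * GB),
]

# Assumed budgets: half of 192GB HBM versus a tighter 40GB budget.
plan_large, used_large = plan_sharding(tables, world_size=8, hbm_budget_bytes=96 * GB)
plan_small, used_small = plan_sharding(tables, world_size=8, hbm_budget_bytes=40 * GB)

print(plan_large, used_large / GB)
print(plan_small, used_small / GB)
```

With the larger budget every table ends up replicated locally, while the tighter budget forces the largest table to stay split across GPUs. This mirrors the effect described above: the more accurately the planner knows the available HBM, the more tables it can keep local.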

In practice

Use the ROCm training docker for DLRMv2 model training.
Clone AMD-AGI/DLRMBenchmark for DLRM training examples.
Update train_config.sh for the Criteo-1B dataset.

Topics

Deep Learning Recommendation Models
AMD Instinct GPUs
ROCm
Sparse Embeddings
Embedding Sharding

Best for: Machine Learning Engineer, Deep Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by AMD ROCm Blogs.