Achieving Near-Linear Training Scalability for Pinterest’s Foundation Models
Summary
Pinterest achieved near-linear training scalability for its foundation models, which power recommendations for over 600 million monthly active users. Initially, multi-node training was inefficient, with a 2-node scaling factor of 0.2x without AWS EFA, and still poor at 1.13x (2 nodes) and 1.21x (4 nodes) even with EFA. Through a series of optimizations, Pinterest improved 2-node scaling to 2.0x and 4-node scaling to 3.9x (97.5% of ideal), extending to 8 nodes at 7.5x (93.75% of ideal) with 490k examples/sec throughput. Key techniques included Quantized Communications (QComms) reducing NCCL communication by 75%, Balanced Sharding, Bandwidth-Aware Embedding Optimization, and a "2D Parallel (All-to-All Optimized)" topology that reduced all-to-all latency from 78ms to 13ms. These improvements enabled larger models, driving significant engagement gains on Pinterest's Homefeed and Related Pins.
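The "percent of ideal" figures above are the reported speedup divided by the node count. A minimal sketch of that arithmetic, using only the speedups quoted in this summary (the function name and the labels are illustrative, not from the original article):

```python
def scaling_efficiency(speedup: float, nodes: int) -> float:
    """Return the speedup as a percentage of ideal linear scaling."""
    if nodes <= 0:
        raise ValueError("nodes must be positive")
    return 100.0 * speedup / nodes


# (speedup, node count) pairs as quoted in the summary above.
reported = {
    "2 nodes, no EFA": (0.2, 2),
    "2 nodes, EFA, before": (1.13, 2),
    "4 nodes, EFA, before": (1.21, 4),
    "2 nodes, after": (2.0, 2),
    "4 nodes, after": (3.9, 4),
    "8 nodes, after": (7.5, 8),
}

for label, (speedup, nodes) in reported.items():
    print(f"{label}: {scaling_efficiency(speedup, nodes):.2f}% of ideal")
```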
Key takeaway
For Machine Learning Engineers optimizing large-scale foundation models, prioritize profiling communication bottlenecks, especially for embedding-heavy architectures. You should systematically apply techniques like quantized communications, balanced sharding, and 2D parallel topologies to reduce cross-node data transfer. This approach can transform multi-node training from inefficient to near-linear, significantly shortening experimentation cycles and enabling larger, more performant models.
Key insights
For embedding-heavy models, communication, not compute, dominates multi-node scaling, requiring direct optimization of data transfer.
Principles
- Profile to identify bottlenecks.
- Optimizations compound across layers.
- Communication dominates at scale.
Method
Pinterest achieved near-linear scaling by applying Quantized Communications, Balanced Sharding, Bandwidth-Aware Embedding Optimization, and a 2D Parallel (All-to-All Optimized) topology.
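As a rough illustration of why quantized communications can cut traffic by 75%, the sketch below compares payload sizes by element width. It assumes the uncompressed baseline is FP32, which this summary does not state, and it ignores any scale or metadata bytes a real codec would add. The helper names and the example tensor shape are hypothetical.

```python
# Bytes per element for common floating-point widths.
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "bf16": 2, "fp8": 1}


def payload_bytes(num_elements: int, dtype: str) -> int:
    """Size of a tensor payload on the wire, excluding any codec metadata."""
    return num_elements * BYTES_PER_ELEMENT[dtype]


def comm_reduction(baseline: str, compressed: str) -> float:
    """Fraction of communication volume saved by sending `compressed` instead of `baseline`."""
    return 1.0 - BYTES_PER_ELEMENT[compressed] / BYTES_PER_ELEMENT[baseline]


# Hypothetical embedding exchange: 4096 rows of dimension 256 per step.
elements = 4096 * 256
print(payload_bytes(elements, "fp32"), "bytes as FP32")
print(payload_bytes(elements, "fp8"), "bytes as FP8")
print(f"{comm_reduction('fp32', 'fp8'):.0%} less data under the FP32-baseline assumption")
```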
In practice
- Use FBGEMM's QComms for FP8 compression.
- Match hash partitions to GPUs for balance.
- Flip the 2D parallel topology so that All-to-All runs within a node.
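The last two bullets can be sketched in plain Python. This is an illustration of the ideas only, not Pinterest's implementation. It assumes that partitions are assigned to GPUs round-robin and that ranks are numbered node-major (`rank = node * gpus_per_node + local_rank`). All function names are hypothetical, and the summary does not say which collective runs on the cross-node axis.

```python
from collections import Counter


def partition_load(num_partitions: int, num_gpus: int) -> list[int]:
    """Assign hash partitions to GPUs round-robin; return partitions per GPU."""
    counts = Counter(p % num_gpus for p in range(num_partitions))
    return [counts.get(g, 0) for g in range(num_gpus)]


def imbalance(load: list[int]) -> float:
    """Ratio of the busiest GPU's load to the mean load (1.0 means balanced)."""
    return max(load) / (sum(load) / len(load))


def intra_node_groups(num_nodes: int, gpus_per_node: int) -> list[list[int]]:
    """Rank groups whose members all sit on the same node."""
    return [
        [node * gpus_per_node + local for local in range(gpus_per_node)]
        for node in range(num_nodes)
    ]


def cross_node_groups(num_nodes: int, gpus_per_node: int) -> list[list[int]]:
    """Rank groups with one member per node, all sharing a local rank."""
    return [
        [node * gpus_per_node + local for node in range(num_nodes)]
        for local in range(gpus_per_node)
    ]


# Partition counts that are not a multiple of the GPU count leave some GPUs heavier.
print(imbalance(partition_load(10, 8)))  # uneven split
print(imbalance(partition_load(16, 8)))  # even split

# Flipped layout: All-to-All groups stay inside a node; the other axis spans nodes.
print(intra_node_groups(2, 4))
print(cross_node_groups(2, 4))
```

Keeping the All-to-All groups inside a node means that exchange never leaves the node, while the groups along the other axis are the only ones that span nodes.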
Topics
- Foundation Models
- Distributed Training
- Training Scalability
- Embedding Optimization
- Quantized Communications
- 2D Parallelism
- AWS EFA
Best for: AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Pinterest Engineering Blog - Medium.