Achieving Near-Linear Training Scalability for Pinterest’s Foundation Models

2026-06-25 · Source: Pinterest Engineering Blog - Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Software Development & Engineering · Depth: Advanced, long

Summary

Pinterest achieved near-linear training scalability for its foundation models, which power recommendations for over 600 million monthly active users. Multi-node training was initially inefficient: the 2-node scaling factor was 0.2x without AWS EFA, and even with EFA it reached only 1.13x on 2 nodes and 1.21x on 4 nodes. A series of optimizations raised 2-node scaling to 2.0x and 4-node scaling to 3.9x (97.5% of ideal), and extended to 8 nodes at 7.5x (93.75% of ideal) with a throughput of 490k examples/sec. The key techniques were Quantized Communications (QComms), which reduced NCCL communication by 75%; Balanced Sharding; Bandwidth-Aware Embedding Optimization; and a "2D Parallel (All-to-All Optimized)" topology that cut all-to-all latency from 78ms to 13ms. These improvements made larger models practical to train, driving significant engagement gains on Pinterest's Homefeed and Related Pins.
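
The "percent of ideal" figures above follow directly from dividing the measured speedup by the node count. The short Python sketch below reproduces that arithmetic for the numbers quoted in the summary; the function name and the dictionary layout are illustrative only and do not come from the original article.

```python
def scaling_efficiency(speedup: float, nodes: int) -> float:
    """Return measured speedup as a percentage of ideal linear scaling."""
    return 100.0 * speedup / nodes


# Speedups quoted in the summary, keyed by node count.
before_with_efa = {2: 1.13, 4: 1.21}
after_optimizations = {2: 2.0, 4: 3.9, 8: 7.5}

efficiency_before = {n: scaling_efficiency(s, n) for n, s in before_with_efa.items()}
efficiency_after = {n: scaling_efficiency(s, n) for n, s in after_optimizations.items()}

for label, table in (("before", efficiency_before), ("after", efficiency_after)):
    for nodes, pct in table.items():
        print(f"{label}: {nodes} nodes at {pct:.2f}% of ideal")
```

Read this way, the starting point with EFA was roughly 56% of ideal on 2 nodes and about 30% on 4 nodes, which is the gap the optimizations closed.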

Key takeaway

Machine learning engineers optimizing large-scale foundation models should start by profiling communication bottlenecks, especially in embedding-heavy architectures, and then systematically apply techniques such as quantized communications, balanced sharding, and 2D parallel topologies to reduce cross-node data transfer. This approach can turn inefficient multi-node training into near-linear scaling, significantly shortening experimentation cycles and enabling larger, more performant models.

Key insights

For embedding-heavy models, communication rather than compute dominates multi-node scaling, so data transfer itself must be optimized directly.

Principles

Profile to identify bottlenecks.
Optimizations compound across layers.
Communication dominates at scale.

Method

Pinterest achieved near-linear scaling by applying Quantized Communications, Balanced Sharding, Bandwidth-Aware Embedding Optimization, and a 2D Parallel (All-to-All Optimized) topology.
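
One way to see where a 75% communication reduction can come from is to compare bytes per element: an FP8 value occupies 1 byte, whereas an FP32 value occupies 4. The sketch below is a back-of-the-envelope calculation, assuming the uncompressed baseline is FP32 (the summary does not state the baseline precision). The batch size, feature count, and embedding dimension are hypothetical, and the calculation ignores any scale factors or metadata a real quantized collective may send alongside the payload.

```python
# Bytes occupied by one element at each precision.
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "fp8": 1}


def payload_bytes(num_values: int, dtype: str) -> int:
    """Raw payload size for num_values elements, excluding any metadata."""
    return num_values * BYTES_PER_ELEMENT[dtype]


def reduction_pct(baseline_dtype: str, compressed_dtype: str) -> float:
    """Percentage of payload bytes saved by sending compressed_dtype instead."""
    base = BYTES_PER_ELEMENT[baseline_dtype]
    comp = BYTES_PER_ELEMENT[compressed_dtype]
    return 100.0 * (base - comp) / base


# Hypothetical workload: these three numbers are illustrative, not Pinterest's.
batch_size, num_features, embedding_dim = 4096, 50, 128
num_values = batch_size * num_features * embedding_dim

fp32_bytes = payload_bytes(num_values, "fp32")
fp8_bytes = payload_bytes(num_values, "fp8")
saved = reduction_pct("fp32", "fp8")

print(f"FP32 payload: {fp32_bytes / 1e6:.1f} MB")
print(f"FP8 payload:  {fp8_bytes / 1e6:.1f} MB ({saved:.0f}% smaller)")
```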

In practice

Use FBGEMM's QComms for FP8 compression.
Match hash partitions to GPUs for balance.
Flip 2D parallel topology for intra-node All-to-All.
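
The last two items can be sketched in plain Python without any distributed runtime. The first function assumes "matching" means choosing a hash-partition count that divides evenly across GPUs; the second builds the rank groups for a 2D layout in which each all-to-all (sharding) group either stays inside one node or is strided across nodes. All function names here are hypothetical and are not taken from TorchRec, FBGEMM, or the original article.

```python
from typing import List


def partitions_per_gpu(num_partitions: int, num_gpus: int) -> List[int]:
    """Assign hash partitions to GPUs round-robin and count the load on each GPU."""
    counts = [0] * num_gpus
    for partition in range(num_partitions):
        counts[partition % num_gpus] += 1
    return counts


def sharding_groups(num_nodes: int, gpus_per_node: int, intra_node: bool) -> List[List[int]]:
    """Rank groups in which the embedding all-to-all runs."""
    if intra_node:
        # One group per node: all-to-all traffic never leaves the node.
        return [list(range(n * gpus_per_node, (n + 1) * gpus_per_node))
                for n in range(num_nodes)]
    # Strided layout: each group takes one GPU from every node.
    world_size = num_nodes * gpus_per_node
    return [list(range(local, world_size, gpus_per_node))
            for local in range(gpus_per_node)]


def replica_groups(groups: List[List[int]]) -> List[List[int]]:
    """Replica groups are the transpose of the sharding groups."""
    return [list(column) for column in zip(*groups)]


def nodes_spanned(group: List[int], gpus_per_node: int) -> int:
    """Number of distinct nodes that a rank group touches."""
    return len({rank // gpus_per_node for rank in group})


# Partition balance on 8 GPUs: 12 partitions leave half the GPUs with double load.
uneven = partitions_per_gpu(12, 8)
even = partitions_per_gpu(16, 8)

# Two nodes with four GPUs each.
flipped = sharding_groups(num_nodes=2, gpus_per_node=4, intra_node=True)
strided = sharding_groups(num_nodes=2, gpus_per_node=4, intra_node=False)
print("intra-node all-to-all groups:", flipped)
print("their replica groups:", replica_groups(flipped))
print("strided all-to-all groups:", strided)
```

In the intra-node layout every all-to-all group spans a single node, so the latency-sensitive exchange uses the fast intra-node links, while the replica groups that synchronize copies are the ones that cross nodes.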

Topics

Foundation Models
Distributed Training
Training Scalability
Embedding Optimization
Quantized Communications
2D Parallelism
AWS EFA

Best for: AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Pinterest Engineering Blog - Medium.