Scaling Recommendation Systems with Request-Level Deduplication
Summary
Pinterest implemented request-level deduplication to manage the infrastructure demands of scaling its recommendation models, such as the Foundation Model, which saw a 100x increase in transformer dense parameter counts and a 10x increase in model dimension. This family of techniques ensures request-level data is processed and stored once, significantly reducing costs across the ML lifecycle. For storage, leveraging Apache Iceberg with user ID and request ID sorting achieved 10-50x compression on user-heavy feature columns. In training, solutions like Synchronized Batch Normalization (SyncBatchNorm) and user-level masking addressed issues arising from non-IID data, leading to a 4x end-to-end training speedup for retrieval and a ~2.8x speedup for ranking. For ranking serving, the Deduplicated Cross-Attention Transformer (DCAT) architecture, implemented with custom Triton kernels, delivered a 7x increase in throughput.
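To illustrate why sorting by user ID and request ID helps compression, here is a minimal sketch. It uses `zlib` from the Python standard library as a stand-in for a columnar file format's encoding. It is not Apache Iceberg or Pinterest's pipeline, and the row layout, sizes, and function names are assumptions made up for the example. When rows carrying the same user feature sit next to each other, the compressor sees the repetition; when they are scattered, much of it falls outside its window.

```python
import random
import zlib


def build_rows(num_users=200, requests_per_user=4, rows_per_request=5,
               feature_bytes=256, seed=0):
    """Build (user_id, request_id, user_feature) rows in shuffled order.

    Every row for a given user repeats the same user_feature blob, which
    mimics a user-heavy feature column logged once per candidate item.
    """
    rng = random.Random(seed)
    rows = []
    for user_id in range(num_users):
        user_feature = bytes(rng.getrandbits(8) for _ in range(feature_bytes))
        for request_id in range(requests_per_user):
            for _ in range(rows_per_request):
                rows.append((user_id, request_id, user_feature))
    rng.shuffle(rows)  # simulate rows arriving in arbitrary order
    return rows


def compressed_column_size(rows):
    """Compress only the user-feature column and return its size in bytes."""
    column = b"".join(feature for _, _, feature in rows)
    return len(zlib.compress(column))


rows = build_rows()
unsorted_size = compressed_column_size(rows)
sorted_rows = sorted(rows, key=lambda row: (row[0], row[1]))
sorted_size = compressed_column_size(sorted_rows)

print("unsorted bytes:", unsorted_size)
print("sorted bytes:  ", sorted_size)
```

The ratio printed here depends entirely on the synthetic data and on `zlib`, so it should not be read as a reproduction of the 10-50x figure reported in the article.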
Key takeaway
MLOps engineers and AI architects scaling recommendation systems should prioritize request-level deduplication. It reduces storage, training, and serving costs, which lets models grow without a proportional increase in infrastructure. Consider sorting data by user ID and request ID in Apache Iceberg tables for storage, using SyncBatchNorm and user-level masking for training, and adopting the DCAT architecture for ranking models to gain throughput and cut costs.
Key insights
Request-level deduplication is a cross-cutting technique that optimizes storage, training, and serving for large-scale recommendation systems.
Principles
- Request-level deduplication is a cross-cutting technique.
- Simple fixes can unlock significant performance gains.
- Impact compounds across the ML stack.
Method
Implement request-level deduplication by sorting data by user ID and request ID in Apache Iceberg tables, applying SyncBatchNorm and user-level masking during training, and using a two-tower architecture for retrieval and DCAT for ranking and serving.
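The core idea behind the method, processing request-level data once and reusing it for every candidate in that request, can be sketched in a few lines. This is a toy illustration and not the DCAT architecture or Pinterest's code. `CountingUserEncoder`, the row layout, and the scoring rule are hypothetical stand-ins for an expensive user-side computation.

```python
class CountingUserEncoder:
    """Hypothetical stand-in for an expensive user-side model call."""

    def __init__(self):
        self.calls = 0

    def __call__(self, user_history):
        self.calls += 1
        # Toy "embedding": the mean of the user's history values.
        return sum(user_history) / len(user_history)


def score_naive(rows, encoder):
    """Re-encode the user for every (request, candidate) row."""
    return [encoder(history) * item for _, history, item in rows]


def score_deduplicated(rows, encoder):
    """Encode the user once per request and reuse it for each candidate."""
    user_repr_by_request = {}
    scores = []
    for request_id, history, item in rows:
        if request_id not in user_repr_by_request:
            user_repr_by_request[request_id] = encoder(history)
        scores.append(user_repr_by_request[request_id] * item)
    return scores


# Each row: (request_id, user_history, candidate_item_feature).
rows = [
    ("req-1", (1.0, 2.0, 3.0), 0.5),
    ("req-1", (1.0, 2.0, 3.0), 2.0),
    ("req-1", (1.0, 2.0, 3.0), 4.0),
    ("req-2", (4.0, 6.0), 1.0),
    ("req-2", (4.0, 6.0), 3.0),
]

naive_encoder = CountingUserEncoder()
dedup_encoder = CountingUserEncoder()
naive_scores = score_naive(rows, naive_encoder)
dedup_scores = score_deduplicated(rows, dedup_encoder)

print("naive encoder calls:", naive_encoder.calls)
print("dedup encoder calls:", dedup_encoder.calls)
```

Both paths produce the same scores; the deduplicated path calls the encoder once per request rather than once per row, which is where the savings come from as the number of candidates per request grows.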
In practice
- Sort Apache Iceberg tables by user ID and request ID for 10-50x compression on user-heavy feature columns.
- Apply SyncBatchNorm to restore IID statistics.
- Implement user-level masking for InfoNCE loss.
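As a rough sketch of what user-level masking for an in-batch InfoNCE loss could look like, the pure-Python example below drops negatives that come from the same user as the anchor row. The summary does not spell out the exact masking rule, so this is an assumption: when a batch holds several rows from one user, that user's other positives would otherwise be treated as negatives. The function name, vectors, and temperature are illustrative only.

```python
import math


def masked_info_nce(query_vecs, item_vecs, user_ids, temperature=0.1):
    """In-batch InfoNCE where same-user rows are excluded as negatives.

    Row i's positive is item i. Item j (j != i) is used as a negative
    only if it belongs to a different user than row i (assumed rule).
    """
    n = len(query_vecs)
    losses = []
    for i in range(n):
        positive_logit = None
        logits = []
        for j in range(n):
            if j != i and user_ids[j] == user_ids[i]:
                continue  # masked: same user, so not a trustworthy negative
            dot = sum(q * v for q, v in zip(query_vecs[i], item_vecs[j]))
            logit = dot / temperature
            logits.append(logit)
            if j == i:
                positive_logit = logit
        # Numerically stable log-sum-exp over the unmasked logits.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(value - max_logit) for value in logits)
        )
        losses.append(log_denominator - positive_logit)
    return sum(losses) / n


queries = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
items = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

# Rows 0 and 1 come from the same user in a deduplicated batch.
loss_masked = masked_info_nce(queries, items, ["u1", "u1", "u2"])
# Distinct user IDs disable masking and give plain in-batch InfoNCE.
loss_unmasked = masked_info_nce(queries, items, ["a", "b", "c"])

print("masked loss:  ", loss_masked)
print("unmasked loss:", loss_unmasked)
```

Masking removes terms from the softmax denominator, so the masked loss is lower here: the model is no longer penalized for scoring a same-user item highly.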
Topics
- Recommendation Systems
- Data Deduplication
- ML Infrastructure
- Apache Iceberg
- Batch Normalization
- Transformer Architectures
- MLOps
Best for: Machine Learning Engineer, MLOps Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Pinterest Engineering Blog - Medium.