Scaling Recommendation Systems with Request-Level Deduplication

· Source: Pinterest Engineering Blog - Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics, Cloud Computing & IT Infrastructure · Depth: Advanced, medium

Summary

Pinterest implemented request-level deduplication to manage the infrastructure demands of scaling its recommendation models, such as the Foundation Model, which saw a 100x increase in transformer dense parameter count and a 10x increase in model dimension. This family of techniques ensures that request-level data is processed and stored only once, reducing cost across the ML lifecycle. For storage, sorting Apache Iceberg tables by user ID and request ID achieved 10-50x compression on user-heavy feature columns. In training, Synchronized Batch Normalization (SyncBatchNorm) and user-level masking addressed issues arising from non-IID data, leading to a 4x end-to-end training speedup for retrieval and a ~2.8x speedup for ranking. For ranking serving, the Deduplicated Cross-Attention Transformer (DCAT) architecture, implemented with custom Triton kernels, delivered a 7x increase in throughput.
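To illustrate why sorting helps compression, here is a minimal sketch using plain run-length encoding. It is not Pinterest's pipeline: the row layout, the `make_rows` helper, and the feature values are hypothetical, and real columnar files use more elaborate encodings. The point is only that sorting by user ID and request ID places identical user-feature values next to each other, so they collapse into a few long runs.

```python
from itertools import groupby
import random


def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


def make_rows(num_users=20, requests_per_user=3, candidates_per_request=10, seed=0):
    """Build hypothetical rows, one per (request, candidate) pair."""
    rows = []
    for user_id in range(num_users):
        # Stand-in for a heavy user-side column repeated on every row.
        user_feature = f"history-of-user-{user_id}"
        for request_idx in range(requests_per_user):
            request_id = f"{user_id}-{request_idx}"
            for candidate_idx in range(candidates_per_request):
                rows.append((user_id, request_id, candidate_idx, user_feature))
    # Shuffle to simulate rows arriving in an arbitrary logging order.
    random.Random(seed).shuffle(rows)
    return rows


rows = make_rows()
unsorted_runs = run_length_encode([row[3] for row in rows])

# Sort by (user ID, request ID), mirroring the sort order described above.
sorted_rows = sorted(rows, key=lambda row: (row[0], row[1]))
sorted_runs = run_length_encode([row[3] for row in sorted_rows])

print("rows:", len(rows))
print("runs before sorting:", len(unsorted_runs))
print("runs after sorting:", len(sorted_runs))
```

After sorting, the user-feature column holds one run per user, which is the property that run-length and dictionary style encodings exploit.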

Key takeaway

MLOps engineers and AI architects scaling recommendation systems should prioritize request-level deduplication. The approach reduces storage, training, and serving costs, enabling larger models without a proportional increase in infrastructure. Consider storing data in Apache Iceberg tables sorted by user ID and request ID, using SyncBatchNorm and user-level masking in training, and adopting the DCAT architecture for ranking models to gain throughput and cost efficiency.
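The role of SyncBatchNorm can be sketched with a toy example. When batches are grouped by user, each device may see values dominated by one user, so per-device normalization statistics diverge. Synchronized batch normalization instead combines per-device counts, sums, and sums of squares into one shared mean and variance. The function names and numbers below are hypothetical, and the code simulates devices with plain lists rather than calling any training framework.

```python
from statistics import fmean, pvariance


def local_stats(device_batches):
    """Mean and variance computed separately on each device's batch."""
    return [(fmean(batch), pvariance(batch)) for batch in device_batches]


def synchronized_stats(device_batches):
    """Mean and variance from statistics aggregated across all devices.

    Each device contributes only its count, sum, and sum of squares,
    which is the kind of reduction a synchronized layer would perform.
    """
    count = sum(len(batch) for batch in device_batches)
    total = sum(sum(batch) for batch in device_batches)
    total_sq = sum(sum(v * v for v in batch) for batch in device_batches)
    mean = total / count
    variance = total_sq / count - mean * mean
    return mean, variance


# Hypothetical activations: each device holds rows from a single user.
device_batches = [[1.0, 1.0, 1.0, 1.0], [5.0, 5.0, 5.0, 5.0]]

print("per-device stats:", local_stats(device_batches))
print("synchronized stats:", synchronized_stats(device_batches))
```

In this toy case every device reports zero variance locally, while the synchronized statistics reflect the spread across users.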

Key insights

Request-level deduplication is a cross-cutting technique that optimizes storage, training, and serving for large-scale recommendation systems.
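The core idea on the compute side can be sketched as follows: work that depends only on the request (for example, encoding the user context) runs once per request and is reused for every candidate in that request. This is a simplified illustration, not the DCAT architecture itself. `encode_user_context`, the row layout, and the scoring rule are all hypothetical stand-ins.

```python
def encode_user_context(user_features):
    """Hypothetical stand-in for an expensive user-side model pass."""
    return sum(user_features) / len(user_features)


def score_naive(rows):
    """Re-encode the user context for every (request, candidate) row."""
    calls = 0
    scores = []
    for request_id, user_features, candidate_value in rows:
        user_repr = encode_user_context(user_features)
        calls += 1
        scores.append(user_repr * candidate_value)
    return scores, calls


def score_deduplicated(rows):
    """Encode the user context once per request and reuse it."""
    cache = {}
    calls = 0
    scores = []
    for request_id, user_features, candidate_value in rows:
        if request_id not in cache:
            cache[request_id] = encode_user_context(user_features)
            calls += 1
        scores.append(cache[request_id] * candidate_value)
    return scores, calls


# Two hypothetical requests with three candidates each.
rows = [
    ("req-a", [1.0, 2.0, 3.0], 0.5),
    ("req-a", [1.0, 2.0, 3.0], 1.5),
    ("req-a", [1.0, 2.0, 3.0], 2.5),
    ("req-b", [4.0, 6.0], 1.0),
    ("req-b", [4.0, 6.0], 2.0),
    ("req-b", [4.0, 6.0], 3.0),
]

naive_scores, naive_calls = score_naive(rows)
dedup_scores, dedup_calls = score_deduplicated(rows)
print("encoder calls:", naive_calls, "vs", dedup_calls)
```

Both paths return the same scores. Only the number of user-side encoder calls changes, from one per row to one per request.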

Principles

Method

Implement request-level deduplication by sorting data by user ID and request ID in Apache Iceberg tables, applying SyncBatchNorm and user-level masking during training, and using a two-tower architecture for retrieval and DCAT for ranking and serving.
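The summary does not spell out how user-level masking works, so the following is a sketch under one plausible assumption: in a two-tower retrieval model trained with in-batch negatives, candidates that belong to the same user as the anchor row are excluded from that row's negatives, because user-grouped batches would otherwise contain many of them. The function names, the logits, and the loss form are illustrative, not taken from the original article.

```python
import math


def keep_mask(user_ids):
    """mask[i][j] is True when column j stays in row i's softmax.

    The diagonal (the row's own positive) is always kept. Other columns
    are kept only if they come from a different user.
    """
    n = len(user_ids)
    return [
        [i == j or user_ids[i] != user_ids[j] for j in range(n)]
        for i in range(n)
    ]


def masked_in_batch_loss(logits, user_ids):
    """Mean softmax cross-entropy over rows, ignoring masked columns."""
    mask = keep_mask(user_ids)
    losses = []
    for i, row in enumerate(logits):
        kept = [logit for logit, keep in zip(row, mask[i]) if keep]
        # Subtract the max before exponentiating for numerical stability.
        peak = max(kept)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in kept))
        losses.append(log_denominator - row[i])
    return sum(losses) / len(losses)


# Hypothetical batch: rows 0 and 1 come from the same user.
user_ids = ["u1", "u1", "u2"]
logits = [
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]

print(keep_mask(user_ids))
print(masked_in_batch_loss(logits, user_ids))
```

Rows 0 and 1 each drop the other as a negative, while row 2 keeps all three columns.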

In practice

Topics

Best for: Machine Learning Engineer, MLOps Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Pinterest Engineering Blog - Medium.