How Snapchat Serves a Billion Predictions Per Second
Summary
Snapchat's machine learning platform, Bento, serves over a billion predictions per second for 477 million daily users, making ranking decisions for content feeds, ad auctions, friend suggestions, and AR lenses within 100 milliseconds. The platform is engineered around four pressures: latency, scale, freshness, and iteration speed. Bento's training half uses Kubeflow for orchestration, a layered code structure for hundreds of daily experiments, and a model export step that splits compute graphs for GPU/CPU optimization. The serving half, powered by Robusta (Apache Spark), processes 10 trillion events/day and serves 1 TB/second of reads from 800 TB of online feature data. To handle high fanout, it either collocates features with inference or uses a dedicated Retrieval service. Inference optimizations include compute graph splitting and data plane optimization, the latter of which delivered 2x lower latency and 10x lower data plane costs. A continuous feedback loop with robust monitoring and a Kubernetes-like deployment control plane keeps the system stable while models keep improving.
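To put the quoted figures on a common scale, here is a back-of-envelope sketch that converts them to per-second and per-prediction rates. It assumes decimal terabytes and treats every feature read as attributable to a prediction, which the source does not state; the function name `derived_rates` is hypothetical.

```python
# Figures quoted in the summary above.
PREDICTIONS_PER_SEC = 1_000_000_000
DAILY_USERS = 477_000_000
EVENTS_PER_DAY = 10 * 10**12
READ_BYTES_PER_SEC = 10**12  # 1 TB/s, assuming decimal terabytes
SECONDS_PER_DAY = 86_400


def derived_rates():
    """Convert the headline numbers into averaged, comparable rates."""
    return {
        # Averaged over all daily users, not only those active right now.
        "predictions_per_user_per_sec": PREDICTIONS_PER_SEC / DAILY_USERS,
        "events_per_sec": EVENTS_PER_DAY / SECONDS_PER_DAY,
        # Assumption: all online feature reads are spread evenly over predictions.
        "read_bytes_per_prediction": READ_BYTES_PER_SEC / PREDICTIONS_PER_SEC,
    }


if __name__ == "__main__":
    for name, value in derived_rates().items():
        print(f"{name}: {value:,.1f}")
```

Under these assumptions the averages come out to about 2.1 predictions per daily user per second, roughly 116 million events per second, and 1,000 bytes of feature reads per prediction. Real traffic is bursty and unevenly distributed, so these are orders of magnitude rather than capacity targets.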
Key takeaway
MLOps engineers scaling high-volume recommendation or ranking systems should architect for asymmetric workloads, where one request expands into many model evaluations. Prioritize robust feature store synchronization to prevent train/serve skew, a common operational pitfall. Consider optimizations such as splitting compute graphs across GPU and CPU and redesigning inference APIs to transfer features as raw bytes, which in Snapchat's case yielded 2x lower latency and 10x lower data plane costs. Gains of that size give a platform the headroom to absorb significant growth.
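The asymmetry described above, one request expanding into many model evaluations, can be illustrated with a minimal sketch. It is not Bento's implementation: `rank_request`, `CountingScorer`, the batch size, and the toy scoring rule are all hypothetical and exist only to make the fanout visible.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass
class Scored:
    candidate_id: str
    score: float


def rank_request(
    user_features: Dict[str, float],
    candidate_ids: Sequence[str],
    score_batch: Callable[[Dict[str, float], Sequence[str]], List[float]],
    batch_size: int = 256,
    top_k: int = 10,
) -> List[Scored]:
    """Score every candidate for one request in batches, return the top_k."""
    scored: List[Scored] = []
    for start in range(0, len(candidate_ids), batch_size):
        batch = candidate_ids[start:start + batch_size]
        scores = score_batch(user_features, batch)
        scored.extend(Scored(c, s) for c, s in zip(batch, scores))
    scored.sort(key=lambda item: item.score, reverse=True)
    return scored[:top_k]


class CountingScorer:
    """Toy stand-in for a model server that counts calls and evaluations."""

    def __init__(self) -> None:
        self.calls = 0
        self.evaluations = 0

    def __call__(self, user_features, batch):
        self.calls += 1
        self.evaluations += len(batch)
        # Hypothetical scoring rule: a user bias plus a value derived from the id.
        return [user_features["bias"] + int(cid.split("-")[1]) % 7 for cid in batch]


if __name__ == "__main__":
    scorer = CountingScorer()
    ids = [f"doc-{i}" for i in range(1000)]
    top = rank_request({"bias": 0.5}, ids, scorer)
    print(f"1 request -> {scorer.calls} batches, {scorer.evaluations} evaluations")
    print([item.candidate_id for item in top[:3]])
```

A single user request here triggers 1,000 model evaluations, so request rate alone understates the load on the inference tier. Capacity planning has to multiply request rate by candidates per request.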
Key insights
Snapchat's Bento platform scales ML ranking to a billion predictions/second by optimizing for asymmetric workloads and continuous iteration.
Principles
- Asymmetric ranking workloads demand specialized architectural solutions.
- Feature store consistency prevents train/serve skew.
- Data plane optimization can significantly reduce latency and cost.
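As one illustration of the feature consistency principle, the sketch below compares the feature values logged at serving time with the values recomputed for training and reports any disagreement. The source does not describe how Snapchat detects skew; `find_skew` and the feature names are hypothetical, and the tolerance is an arbitrary choice.

```python
import math
from typing import Dict, Optional, Tuple


def find_skew(
    served: Dict[str, float],
    recomputed: Dict[str, float],
    rel_tol: float = 1e-6,
) -> Dict[str, Tuple[Optional[float], Optional[float]]]:
    """Return features whose serving-time and training-time values differ.

    A feature present on only one side is reported with None for the other.
    """
    mismatches = {}
    for name in sorted(set(served) | set(recomputed)):
        a = served.get(name)
        b = recomputed.get(name)
        if a is None or b is None or not math.isclose(a, b, rel_tol=rel_tol):
            mismatches[name] = (a, b)
    return mismatches


if __name__ == "__main__":
    served = {"views_7d": 12.0, "ctr_1h": 0.031}
    recomputed = {"views_7d": 12.0, "ctr_1h": 0.045, "friend_count": 88.0}
    for name, (a, b) in find_skew(served, recomputed).items():
        print(f"{name}: served={a} recomputed={b}")
```

Run on a sample of logged requests, a check like this surfaces both stale values and features that exist in only one of the two paths.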
Method
Bento's training pipeline uses Kubeflow to orchestrate data generation, model training (GPU/TPU), evaluation, and a split compute graph export.
In practice
- Collocate document features on inference instances for low latency.
- Split compute graphs for GPU/CPU to optimize resource use.
- Redesign inference APIs for raw byte feature transfer.
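The last item, raw byte feature transfer, can be sketched as packing a dense feature vector into a fixed binary layout instead of a structured text format. The wire layout below (a little-endian `uint32` count followed by `float32` values) is an assumption for illustration, as are the function names; the source does not specify Snapchat's format.

```python
import struct
from typing import List, Sequence


def encode_features(values: Sequence[float]) -> bytes:
    """Pack values as: uint32 count, then that many float32, little-endian."""
    return struct.pack(f"<I{len(values)}f", len(values), *values)


def decode_features(payload: bytes) -> List[float]:
    """Inverse of encode_features; reads the count, then the float32 values."""
    (count,) = struct.unpack_from("<I", payload, 0)
    return list(struct.unpack_from(f"<{count}f", payload, 4))


if __name__ == "__main__":
    features = [0.5, 1.25, -3.0, 8.0]
    payload = encode_features(features)
    print(len(payload), "bytes on the wire")
    print(decode_features(payload))
```

Each value costs exactly four bytes and decoding needs no parsing beyond a fixed-offset read, which is the kind of saving a raw byte API is after. The trade-off is that `float32` loses precision for values that are not exactly representable, and the sender and receiver must agree on the layout out of band.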
Topics
- Machine Learning Platforms
- Recommendation Systems
- Feature Stores
- Real-time Inference
- MLOps
- Data Plane Optimization
Best for: AI Architect, CTO, VP of Engineering/Data, AI Engineer, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by ByteByteGo Newsletter.