How Snapchat Serves a Billion Predictions Per Second

2025-12-15 · Source: ByteByteGo Newsletter · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Data Science & Analytics · Depth: Advanced, long

Summary

Snapchat's machine learning platform, Bento, serves over a billion predictions per second for 477 million daily users. It makes the ranking decisions behind content feeds, ad auctions, friend suggestions, and AR lenses, each within 100 milliseconds. The platform is built around four pressures: latency, scale, feature freshness, and iteration speed.

Bento's training half combines Kubeflow orchestration, a layered code structure that supports hundreds of daily experiments, and a model export step that splits the compute graph into GPU and CPU parts. The serving half, powered by Robusta (built on Apache Spark), processes 10 trillion events per day and serves 1 TB/second of reads from 800 TB of online feature data. For high-fanout requests, it either collocates features with the inference instances or uses a dedicated Retrieval service. Inference optimizations, namely compute graph splitting and data plane optimization, halved latency and made the data plane 10x cheaper.

A continuous feedback loop, backed by monitoring and a Kubernetes-like deployment control plane, keeps the system stable while models keep improving.

Key takeaway

MLOps engineers scaling high-volume recommendation or ranking systems should architect for asymmetric workloads, where one request expands into many model evaluations. Keep the feature store synchronized between training and serving to prevent train/serve skew, a common operational pitfall. Consider splitting compute graphs across GPU and CPU and redesigning inference APIs to transfer features as raw bytes; at Snapchat these optimizations yielded 2x lower latency and 10x cheaper data plane costs, leaving the platform room to absorb significant growth.

Key insights

Snapchat's Bento platform scales ML ranking to a billion predictions/second by optimizing for asymmetric workloads and continuous iteration.

Principles

Asymmetric ranking workloads demand specialized architectural solutions.
Feature store consistency prevents train/serve skew.
Data plane optimization can significantly reduce latency and cost.
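
To make the asymmetric-workload principle concrete, here is a minimal sketch of how one ranking request fans out into one model input row per candidate. The names `RankingRequest`, `build_batch`, and `doc_feature_table` are hypothetical and are not taken from Bento; the sketch only assumes that user features are fetched once per request while document features are looked up per candidate.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RankingRequest:
    """One incoming request: a single user and many candidates to score."""
    user_id: str
    candidate_ids: List[str]


def build_batch(
    request: RankingRequest,
    user_features: List[float],
    doc_feature_table: Dict[str, List[float]],
) -> List[List[float]]:
    """Expand one request into one model input row per candidate.

    The user features are reused for every row; only the document
    features change from row to row.
    """
    rows = []
    for doc_id in request.candidate_ids:
        # With collocation this would be a local, in-memory lookup.
        doc_features = doc_feature_table[doc_id]
        rows.append(user_features + doc_features)
    return rows


request = RankingRequest(user_id="u1", candidate_ids=["d1", "d2", "d3"])
user_features = [0.2, 0.7]  # fetched once per request
doc_feature_table = {"d1": [1.0], "d2": [0.5], "d3": [0.1]}

batch = build_batch(request, user_features, doc_feature_table)
print(f"1 request -> {len(batch)} model evaluations")
```

The cost of a request therefore grows with the number of candidates, which is why the document-feature lookup path matters far more than the user-feature path.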

Method

Bento's training pipeline uses Kubeflow to orchestrate data generation, model training (GPU/TPU), evaluation, and a split compute graph export.
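
The split export can be pictured with a toy example. The source does not describe how Bento decides where to cut the graph, so the sketch below makes a simplifying assumption: the graph is a linear list of ops, each tagged with a preferred device, and the cut falls at the first GPU op. `split_graph`, `run`, and the op names are hypothetical.

```python
from typing import Callable, List, Tuple

# (name, preferred device, function); a stand-in for a real graph node.
Op = Tuple[str, str, Callable[[List[float]], List[float]]]


def split_graph(ops: List[Op]) -> Tuple[List[Op], List[Op]]:
    """Partition a linear op list into a CPU prefix and a GPU suffix.

    Everything before the first GPU op goes to the CPU part; the first
    GPU op and everything after it goes to the GPU part.
    """
    first_gpu = next(
        (i for i, (_, device, _) in enumerate(ops) if device == "gpu"),
        len(ops),
    )
    return ops[:first_gpu], ops[first_gpu:]


def run(ops: List[Op], values: List[float]) -> List[float]:
    """Apply each op in order to the running value."""
    for _, _, fn in ops:
        values = fn(values)
    return values


ops: List[Op] = [
    ("hash_ids", "cpu", lambda xs: [x % 1000 for x in xs]),
    ("normalize", "cpu", lambda xs: [x / 1000 for x in xs]),
    ("dense_layer", "gpu", lambda xs: [2.0 * x + 1.0 for x in xs]),
]

cpu_part, gpu_part = split_graph(ops)
intermediate = run(cpu_part, [1234, 5678])  # cheap preprocessing on CPU
scores = run(gpu_part, intermediate)        # dense math on GPU
```

Running the two parts back to back gives the same result as running the unsplit graph, which is the property a split export has to preserve.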

In practice

Collocate document features on inference instances for low latency.
Split compute graphs for GPU/CPU to optimize resource use.
Redesign inference APIs for raw byte feature transfer.
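
As a rough illustration of the last item, the sketch below packs a feature vector into raw little-endian float32 bytes and compares the payload with a JSON encoding of the same values. The wire format is an assumption made for illustration; the source does not specify the format Snapchat uses, and `encode_features` and `decode_features` are hypothetical helpers.

```python
import json
import struct
from typing import List


def encode_features(values: List[float]) -> bytes:
    """Pack floats as consecutive little-endian float32 values."""
    return struct.pack(f"<{len(values)}f", *values)


def decode_features(payload: bytes) -> List[float]:
    """Unpack a buffer produced by encode_features."""
    count = len(payload) // 4  # 4 bytes per float32
    return list(struct.unpack(f"<{count}f", payload))


# Values chosen to be exactly representable in float32.
features = [0.5, 1.25, -3.0, 42.0]

payload = encode_features(features)
json_payload = json.dumps({"features": features}).encode("utf-8")

print(f"raw bytes: {len(payload)}, JSON bytes: {len(json_payload)}")
```

Beyond the smaller payload, a fixed binary layout lets the receiver read values without parsing text, which is where much of the data plane saving in this kind of redesign usually comes from.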

Topics

Machine Learning Platforms
Recommendation Systems
Feature Stores
Real-time Inference
MLOps
Data Plane Optimization

Best for: AI Architect, CTO, VP of Engineering/Data, AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by ByteByteGo Newsletter.