Snap’s GPU-Accelerated Secret to Processing 10 Petabytes a Day | NVIDIA AI Podcast Ep. 298
Summary
Snap, a social platform with over 940 million active users, cut the cost and resource footprint of its data processing pipelines by migrating to NVIDIA Spark RAPIDS on Google Cloud. Prudhvi Vatala, Snap's Head of Engineering Platforms, reported a 76% reduction in job costs, a 62% decrease in required CPU cores, and an 80% drop in memory footprint. The migration required zero code changes for PySpark workloads and used GPU capacity on Google Kubernetes Engine (GKE) that sits idle during off-peak hours. Snap's experimentation platform processes over ten petabytes of data daily and must meet strict SLAs for delivering A/B testing results. To manage the shared GPU capacity, the company built a new data platform with preemption logic that prioritizes online inference over batch processing.
Key takeaway
AI engineers and ML platform leads who manage large-scale data pipelines should consider integrating NVIDIA Spark RAPIDS to reduce operational costs and resource consumption. Snap's results suggest that a team can achieve substantial performance gains and cost savings by using idle GPU capacity, even without extensive code changes. Build robust fallback mechanisms so that pipelines stay reliable when GPU resources are temporarily unavailable.
Key insights
Leveraging idle GPU capacity with NVIDIA Spark RAPIDS dramatically cuts data processing costs and resource consumption.
Principles
- Prioritize statistical rigor in A/B testing.
- Optimize resource utilization by repurposing idle capacity.
Method
Migrate PySpark workloads to GPU-accelerated environments such as Google Dataproc or GKE with NVIDIA Spark RAPIDS, and build in fallbacks to CPU or Dataproc clusters for when GPU capacity is constrained.
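The "zero code changes" claim follows from the fact that the RAPIDS plugin is enabled through Spark configuration rather than through changes to the job itself. The sketch below is a minimal illustration of that idea, not Snap's actual setup: the helper names `spark_conf` and `to_submit_args` are hypothetical, the executor sizes and GPU fractions are placeholder values, and it assumes the RAPIDS Accelerator jar is already available on the cluster.

```python
from __future__ import annotations


def spark_conf(use_gpu: bool) -> dict[str, str]:
    """Return Spark settings for a CPU run, plus the RAPIDS plugin when use_gpu is True."""
    # Placeholder sizing; real values depend on the workload and node shape.
    conf = {
        "spark.executor.cores": "8",
        "spark.executor.memory": "16g",
    }
    if use_gpu:
        conf.update({
            # Loads the RAPIDS Accelerator (its jar must be on the classpath).
            "spark.plugins": "com.nvidia.spark.SQLPlugin",
            "spark.rapids.sql.enabled": "true",
            # One GPU per executor, shared by 8 concurrent tasks (assumed values).
            "spark.executor.resource.gpu.amount": "1",
            "spark.task.resource.gpu.amount": "0.125",
        })
    return conf


def to_submit_args(conf: dict[str, str]) -> list[str]:
    """Flatten a config dict into repeated `--conf key=value` arguments for spark-submit."""
    args: list[str] = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args
```

Because the PySpark script is the same in both modes, switching a job between GPU and CPU execution comes down to which set of arguments is passed at submit time.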
In practice
- Explore NVIDIA Spark RAPIDS for PySpark acceleration.
- Utilize NVIDIA Aether for Spark tuning across environments.
- Implement graceful fallbacks for GPU capacity fluctuations.
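A graceful fallback for GPU capacity fluctuations can be sketched as an ordered list of backends that are tried in turn. Everything here is an illustrative assumption rather than Snap's implementation: `CapacityUnavailable`, `run_with_fallback`, the backend names, and the two stub submit functions are hypothetical stand-ins for real job-submission calls.

```python
from typing import Callable, Optional, Sequence, Tuple


class CapacityUnavailable(Exception):
    """Hypothetical error raised when a backend cannot provide capacity right now."""


def run_with_fallback(
    backends: Sequence[Tuple[str, Callable[[], str]]],
) -> Tuple[str, str]:
    """Try each (name, submit) pair in order; return the first backend that accepts the job."""
    last_error: Optional[Exception] = None
    for name, submit in backends:
        try:
            return name, submit()
        except CapacityUnavailable as exc:
            # Only capacity errors trigger a fallback; other failures propagate.
            last_error = exc
    raise RuntimeError("no backend had capacity") from last_error


def submit_to_gpu_pool() -> str:
    # Stub: simulates the shared GPU pool being reclaimed for online inference.
    raise CapacityUnavailable("GPU pool preempted")


def submit_to_cpu_cluster() -> str:
    # Stub: simulates a successful submission to a CPU cluster.
    return "job-cpu-001"


backend_used, job_id = run_with_fallback([
    ("gke-gpu", submit_to_gpu_pool),
    ("dataproc-cpu", submit_to_cpu_cluster),
])
print(backend_used, job_id)
```

Catching only the capacity error keeps genuine job failures visible instead of silently rerunning a broken job on a more expensive backend.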
Topics
- Snap Inc.
- Data Processing
- GPU Acceleration
- NVIDIA Spark RAPIDS
- Google Cloud
Best for: CTO, Director of AI/ML, AI Engineer, Machine Learning Engineer, Data Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA.