Optimizing ML Workload Network Efficiency (Part I): Feature Trimmer
Summary
Pinterest's "Feature Trimmer" system addresses a critical network bottleneck in its online ML serving root-leaf architecture, where the root component processes features and fans them out to leaf partitions for model inference. Initially, LZ4 compression reduced root-leaf network usage by 20% but increased p90 latency by 5ms. The Feature Trimmer implements a "Send What You Use" strategy, using model signatures ("module_info.json") to create feature allowlists. These allowlists are aggregated into bundle-level artifacts and deployed to root clusters in sync with model rollouts. The system significantly reduced network bandwidth: Ads root cluster usage dropped from 4GBPS to under 1.5GBPS, enabling a 27% cluster downsizing, and Homefeed root outbound usage decreased by 50-60%, leading to a 33% fleet size reduction. Client p90 latency for Ads dropped from over 90ms to below 80ms, and Related Pins model p99 latency decreased by 25-30%. Overall, this optimization saved Pinterest over $4M in annual infrastructure costs.
Key takeaway
MLOps engineers and AI architects optimizing large-scale ML serving systems should evaluate a "Send What You Use" feature trimming strategy. Using model signatures to define and transmit only the features each model needs between services can significantly reduce network bandwidth, improve inference latency, and yield substantial infrastructure cost savings. Consider integrating the approach into your model deployment pipeline so that allowlists stay synchronized with model rollouts, and pair it with a robust fallback mechanism.
Key insights
Sending only necessary features, guided by model signatures, drastically cuts network overhead in ML serving.
Principles
- Model signatures define required features.
- Synchronize feature allowlists with model rollouts.
- Allowlisting is superior to blocklisting for dynamic ML.
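The allowlist principle can be illustrated with a minimal Python sketch. The function names, the feature names, and the pass-through fallback when no allowlist is known are all assumptions made for illustration, not details taken from the original article. The blocklist variant is included only to show one plausible reason allowlisting holds up better as features change: a newly added feature that no model reads is dropped by the allowlist automatically, while a stale blocklist lets it through.

```python
from typing import AbstractSet, Any, Dict, Mapping, Optional


def trim_features(
    features: Mapping[str, Any],
    allowlist: Optional[AbstractSet[str]],
) -> Dict[str, Any]:
    """Keep only the features named in the allowlist."""
    if allowlist is None:
        # Assumed fallback: with no allowlist available, send the full
        # payload rather than risk dropping a feature the model needs.
        return dict(features)
    return {name: value for name, value in features.items() if name in allowlist}


def trim_with_blocklist(
    features: Mapping[str, Any],
    blocklist: AbstractSet[str],
) -> Dict[str, Any]:
    """Drop only the features named in the blocklist (shown for contrast)."""
    return {name: value for name, value in features.items() if name not in blocklist}


# Hypothetical payload: "debug_blob" was added upstream and no model reads it.
payload = {"user_age": 31, "pin_embedding": [0.1, 0.2], "debug_blob": "x" * 10}

trimmed = trim_features(payload, {"user_age", "pin_embedding"})
leaky = trim_with_blocklist(payload, {"legacy_feature"})
```

Here `trimmed` contains only the two allowlisted features, while `leaky` still carries `debug_blob` because nobody has added it to the blocklist yet.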
Method
Export model signatures as standalone artifacts, aggregate them into bundle-level mappings, and deploy to root clusters via staged delivery. Use file watchers and atomic map updates with read-write locks for continuous refresh.
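The export-and-aggregate step might look roughly like the following sketch. The summary does not describe the layout of "module_info.json", so the `input_features` key, the model names, and the shape of the bundle-level mapping are hypothetical placeholders.

```python
import json
from typing import Dict, List, Mapping


def allowlist_from_signature(signature_json: str) -> List[str]:
    """Extract a sorted, de-duplicated feature list from one model signature."""
    signature = json.loads(signature_json)
    # "input_features" is a hypothetical key; the real signature schema
    # is not described in the source.
    return sorted(set(signature["input_features"]))


def build_bundle_mapping(signatures: Mapping[str, str]) -> Dict[str, List[str]]:
    """Aggregate per-model signatures into one bundle-level mapping."""
    return {model: allowlist_from_signature(text) for model, text in signatures.items()}


def bundle_union(bundle_mapping: Mapping[str, List[str]]) -> List[str]:
    """Every feature that at least one model in the bundle uses."""
    return sorted({name for names in bundle_mapping.values() for name in names})


# Hypothetical signatures for two models served from the same bundle.
signatures = {
    "ads_ranker": json.dumps({"input_features": ["user_age", "ad_ctr", "user_age"]}),
    "ads_filter": json.dumps({"input_features": ["ad_ctr", "ad_category"]}),
}

bundle = build_bundle_mapping(signatures)
artifact = json.dumps(bundle, sort_keys=True)  # what would be shipped to root clusters
```

Sorting the feature names and the JSON keys keeps the artifact byte-stable across rebuilds, which makes it easier to tell whether a rollout actually changed an allowlist.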
In practice
- Implement versioned feature allowlists.
- Integrate trimming into model deployment.
- Use read-write locks for config updates.
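The three practices above can be combined in a small in-memory store, sketched below under stated assumptions. Python's standard library has no read-write lock, so this sketch swaps in a plain `threading.Lock` plus a whole-map replacement: readers hold the lock only long enough to grab a reference to the current map. The class name, the `(model, version)` key, and the idea that a file watcher would call `replace_all` are illustrative choices, not the original system's design.

```python
import threading
from typing import Dict, FrozenSet, Iterable, Mapping, Optional, Tuple

ModelKey = Tuple[str, str]  # (model name, model version); a hypothetical key shape


class AllowlistStore:
    """Holds versioned allowlists and replaces them as one atomic map swap."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._entries: Dict[ModelKey, FrozenSet[str]] = {}

    def replace_all(self, new_entries: Mapping[ModelKey, Iterable[str]]) -> None:
        # Build the new map outside the lock so readers are blocked
        # only for the duration of a single reference assignment.
        fresh = {key: frozenset(names) for key, names in new_entries.items()}
        with self._lock:
            self._entries = fresh

    def lookup(self, model: str, version: str) -> Optional[FrozenSet[str]]:
        # Take a reference under the lock, then read without holding it.
        # The map is never mutated in place, so the snapshot stays consistent.
        with self._lock:
            current = self._entries
        return current.get((model, version))


store = AllowlistStore()

# During a rollout, both the old and the new model version resolve.
store.replace_all({
    ("ads_ranker", "v1"): ["user_age", "ad_ctr"],
    ("ads_ranker", "v2"): ["user_age", "ad_ctr", "ad_category"],
})
during_rollout = store.lookup("ads_ranker", "v1")

# After the rollout completes, the old version's allowlist is retired.
store.replace_all({("ads_ranker", "v2"): ["user_age", "ad_ctr", "ad_category"]})
after_rollout = store.lookup("ads_ranker", "v1")
```

A `None` result from `lookup` signals that no allowlist is known for that model version, which is the point where a caller would apply whatever fallback it has chosen.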
Topics
- ML Serving Architecture
- Feature Trimming
- Network Optimization
- Model Deployment
- Infrastructure Cost Savings
- Latency Reduction
Best for: MLOps Engineer, AI Architect, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by Pinterest Engineering Blog - Medium.