Optimizing ML Workload Network Efficiency (Part I): Feature Trimmer

· Source: Pinterest Engineering Blog - Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Software Development & Engineering · Depth: Advanced, long

Summary

Pinterest's "Feature Trimmer" system addresses a network bottleneck in its online ML serving root-leaf architecture, where the root component processes features and fans them out to leaf partitions for model inference. A first attempt with LZ4 compression reduced root-leaf network usage by 20% but increased p90 latency by 5ms. The Feature Trimmer instead implements a "Send What You Use" strategy: it reads each model's signature ("module_info.json") to build an allowlist of the features that model actually consumes. These allowlists are aggregated into bundle-level artifacts and deployed to root clusters in sync with model rollouts. The result was a large drop in network bandwidth: Ads root cluster usage fell from 4GBPS to under 1.5GBPS, enabling a 27% cluster downsizing, and Homefeed root outbound usage decreased by 50-60%, leading to a 33% fleet size reduction. Latency improved as well: client p90 latency for Ads dropped from over 90ms to below 80ms, and Related Pins model p99 latency decreased by 25-30%. Overall, the optimization saved Pinterest over $4M in annual infrastructure costs.
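To make the "Send What You Use" idea concrete, here is a minimal Python sketch of building a bundle-level allowlist and trimming a request against it. It is an illustration under stated assumptions, not Pinterest's implementation: the summary does not describe the contents of "module_info.json", so the signature is modeled as a plain list of feature names per model, and the function names, model names, and feature names are all hypothetical. Sending every feature when no allowlist is available is also an assumed fallback policy.

```python
from typing import Dict, Iterable, Mapping, Optional, Set


def build_bundle_allowlist(signatures: Mapping[str, Iterable[str]]) -> Set[str]:
    """Union the feature names required by every model in one bundle.

    `signatures` maps a (hypothetical) model name to the feature names
    listed in that model's signature.
    """
    allowlist: Set[str] = set()
    for feature_names in signatures.values():
        allowlist.update(feature_names)
    return allowlist


def trim_features(
    features: Mapping[str, object], allowlist: Optional[Set[str]]
) -> Dict[str, object]:
    """Keep only allowlisted features before fanning out to the leaf.

    Assumed fallback: if no allowlist is available, send everything
    rather than risk dropping a feature the model needs.
    """
    if allowlist is None:
        return dict(features)
    return {name: value for name, value in features.items() if name in allowlist}


# Hypothetical bundle with two models that share one feature.
signatures = {
    "ctr_model": ["user_age", "pin_embedding"],
    "cvr_model": ["pin_embedding", "ad_category"],
}
allowlist = build_bundle_allowlist(signatures)

request = {
    "user_age": 31,
    "pin_embedding": [0.1, 0.2],
    "ad_category": "home",
    "unused_feature": 1,
}
trimmed = trim_features(request, allowlist)
```

Here `trimmed` keeps the three features that at least one model in the bundle reads and drops `unused_feature`, which is the payload reduction the strategy relies on.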

Key takeaway

MLOps engineers and AI architects who run large-scale ML serving systems should evaluate a "Send What You Use" feature trimming strategy. Using model signatures to define exactly which features each model needs, and transmitting only those features between services, can significantly reduce network bandwidth, improve inference latency, and cut infrastructure costs. Integrate the allowlist delivery into the model deployment pipeline so that allowlists stay synchronized with model rollouts, and provide a fallback for cases where an allowlist is missing or stale.

Key insights

Sending only necessary features, guided by model signatures, drastically cuts network overhead in ML serving.

Method

Export model signatures as standalone artifacts, aggregate them into bundle-level mappings, and deploy to root clusters via staged delivery. Use file watchers and atomic map updates with read-write locks for continuous refresh.
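The refresh step above can be sketched in Python as follows. This is a hedged illustration, not the system's actual code: the class name `AllowlistStore`, the JSON layout (bundle name mapped to a list of feature names), and the file path are assumptions. Two substitutions are made plainly: the file watcher is approximated by checking the file's modification time on each `refresh()` call, and because Python's standard library has no read-write lock, a plain `threading.Lock` guards the map reference instead.

```python
import json
import os
import threading
from typing import Dict, FrozenSet, Optional


class AllowlistStore:
    """Holds bundle -> allowlist and swaps in a whole new map on refresh."""

    def __init__(self, path: str) -> None:
        self._path = path
        self._lock = threading.Lock()  # stand-in for a read-write lock
        self._mtime_ns: Optional[int] = None
        self._allowlists: Dict[str, FrozenSet[str]] = {}

    def refresh(self) -> bool:
        """Reload the file if it changed; return True if a new map was installed.

        Intended to be called from a single watcher thread. On any read or
        parse error the previously loaded map keeps being served.
        """
        try:
            mtime_ns = os.stat(self._path).st_mtime_ns
            if mtime_ns == self._mtime_ns:
                return False
            with open(self._path, encoding="utf-8") as handle:
                raw = json.load(handle)
            # Build the complete new map before touching shared state.
            new_map = {bundle: frozenset(names) for bundle, names in raw.items()}
        except (OSError, ValueError, AttributeError, TypeError):
            return False
        with self._lock:
            # Readers see either the old map or the new one, never a mix.
            self._allowlists = new_map
            self._mtime_ns = mtime_ns
        return True

    def get(self, bundle: str) -> Optional[FrozenSet[str]]:
        """Return the allowlist for a bundle, or None if none is loaded."""
        with self._lock:
            return self._allowlists.get(bundle)
```

A background thread would call `refresh()` on a timer or in response to a file-change notification, while request threads call `get()`. Building the new map outside the lock keeps the critical section down to a reference swap, so readers are blocked only briefly.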

Best for: MLOps Engineer, AI Architect, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by Pinterest Engineering Blog - Medium.