Pinterest Engineers Eliminate CPU Zombies to Resolve Production Bottlenecks
Summary
Pinterest engineers resolved intermittent CPU starvation and network failures affecting machine learning training jobs on their Kubernetes-based PinCompute platform. The issue, which caused training job success rates to drop by more than 25% for some use cases, was traced to "zombie" memory cgroups (memcgs) leaked by a crashlooping Amazon ECS agent that is enabled by default in their AWS Deep Learning AMI. Aggregate CPU utilization looked healthy, but per-core analysis with mpstat showed individual cores hitting 100% system CPU, which starved Elastic Network Adapter (ENA) network interrupt handling. Using perf captures visualized in Netflix's Flamescope, the engineers found that the kubelet process was spiking inside the kernel function mem_cgroup_nr_lru_pages, and ultimately linked the spikes to nearly 70,000 accumulated zombie memcgs. Disabling the unused ECS agent and rebooting nodes restored stability.
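The summary does not say how the zombie memcgs were counted. One common way to estimate them, sketched here under the assumption of a cgroup v1 host with the memory controller mounted at /sys/fs/cgroup/memory, is to compare the kernel's own count in /proc/cgroups (which includes offline cgroups) with the number of directories still visible in the hierarchy. The function names below are hypothetical helpers, not part of any tool named in the article.

```python
import os


def memcg_count_from_proc(text):
    """Return num_cgroups for the memory controller from /proc/cgroups text.

    Columns are: subsys_name, hierarchy, num_cgroups, enabled.
    Returns None if no memory line is present.
    """
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # header line
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "memory":
            return int(parts[2])
    return None


def count_live_memcgs(root="/sys/fs/cgroup/memory"):
    """Count directories under the memory hierarchy (assumed cgroup v1 path).

    os.walk yields one entry per directory, including the root itself,
    and each directory corresponds to one live cgroup.
    """
    return sum(1 for _ in os.walk(root))


def estimate_zombies(kernel_count, live_count):
    """Approximate zombie memcgs as kernel-tracked minus visible cgroups."""
    return max(kernel_count - live_count, 0)
```

On a node, this could be used as `estimate_zombies(memcg_count_from_proc(open("/proc/cgroups").read()), count_live_memcgs())`. On cgroup v2 hosts, the `nr_dying_descendants` field in `cgroup.stat` reports dying cgroups directly, so this comparison would not be needed.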
Key takeaway
For MLOps engineers managing distributed computing platforms, this case shows why base image configurations deserve scrutiny. Disable unused default agents proactively, and run continuous, time-indexed profiling with tools like gProfiler or Parca so that hidden kernel-level resource contention can be found and resolved before it affects critical workloads such as Ray clusters.
Key insights
Unused default agents can leak kernel state, causing hidden CPU bottlenecks and system instability.
Principles
- Aggregate metrics can mask critical per-core performance issues.
- System defaults can significantly impact production performance.
- Abstractions can obscure root causes of system failures.
Method
Diagnose intermittent CPU starvation by moving from aggregate to per-core analysis, using tools like mpstat and perf captures visualized in Flamescope to pinpoint kernel function hotspots.
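The move from aggregate to per-core analysis can be illustrated with a minimal sketch that reads /proc/stat twice and reports the share of time each core spent in the system column, which is roughly what mpstat shows as %sys. This is not the tooling the engineers used; the function names and the 90% threshold are assumptions chosen for illustration.

```python
import time


def parse_proc_stat(text):
    """Return {core_name: (system_ticks, total_ticks)} for per-core lines.

    The bare "cpu" line is the aggregate across all cores and is skipped.
    """
    cores = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts or not parts[0].startswith("cpu") or parts[0] == "cpu":
            continue
        # Fields: user nice system idle iowait irq softirq steal
        values = [int(v) for v in parts[1:9]]
        cores[parts[0]] = (values[2], sum(values))
    return cores


def system_pct(before, after):
    """Percentage of elapsed ticks each core spent in system mode."""
    pcts = {}
    for name, (sys_after, total_after) in after.items():
        if name not in before:
            continue
        sys_before, total_before = before[name]
        delta_total = total_after - total_before
        if delta_total > 0:
            pcts[name] = 100.0 * (sys_after - sys_before) / delta_total
    return pcts


def hot_cores(pcts, threshold=90.0):
    """Names of cores at or above the (assumed) system-CPU threshold."""
    return sorted(name for name, pct in pcts.items() if pct >= threshold)


def sample(interval=1.0, path="/proc/stat"):
    """Take two snapshots of /proc/stat and return per-core system %."""
    with open(path) as f:
        before = parse_proc_stat(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_proc_stat(f.read())
    return system_pct(before, after)
```

Calling `hot_cores(sample())` on a Linux node lists any cores saturated in system mode during the interval, even when the average across all cores looks healthy.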
In practice
- Audit base images for unused, default-enabled agents.
- Implement continuous profiling for fleet-wide observability.
- Master low-level diagnostic tools like mpstat and perf.
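As a rough illustration of the first item, the sketch below lists systemd services that are enabled on a node and flags any that are not on an expected allowlist. It assumes a systemd-based image; the allowlist contents and the unit name `ecs.service` in the usage note are assumptions for illustration and should be checked against the actual image.

```python
import subprocess


def parse_enabled_units(output):
    """Extract service unit names from `systemctl list-unit-files` output."""
    units = set()
    for line in output.splitlines():
        parts = line.split()
        if parts and parts[0].endswith(".service"):
            units.add(parts[0])
    return units


def unexpected_units(enabled, allowlist):
    """Enabled units that are not on the allowlist, sorted for stable output."""
    return sorted(enabled - set(allowlist))


def list_enabled_services():
    """Query systemd for enabled service units on the current host."""
    result = subprocess.run(
        [
            "systemctl", "list-unit-files",
            "--type=service", "--state=enabled",
            "--no-legend", "--no-pager",
        ],
        capture_output=True, text=True, check=True,
    )
    return parse_enabled_units(result.stdout)
```

For example, `unexpected_units(list_enabled_services(), ["kubelet.service", "sshd.service"])` would surface an enabled `ecs.service` on a node that is not supposed to run it.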
Topics
- CPU Starvation
- Memory Cgroups
- Kubernetes
- Machine Learning Workloads
- Performance Profiling
Best for: MLOps Engineer, AI Engineer, Software Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by InfoQ.