Pinterest Engineers Eliminate CPU Zombies to Resolve Production Bottlenecks

Source: InfoQ · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Software Development & Engineering) · Depth: Advanced, quick

Summary

Pinterest engineers resolved intermittent CPU starvation and network failures affecting machine learning training jobs on their Kubernetes-based PinCompute platform. The issue, which cut training job success rates by more than 25% for some use cases, was traced to "zombie" memory cgroups (memcgs): memory cgroups that had been deleted but whose kernel state was never released. They were leaked by a crashlooping Amazon ECS agent that is enabled by default in the AWS Deep Learning AMI. Aggregate CPU utilization looked healthy, but per-core analysis with mpstat showed individual cores hitting 100% system CPU, which starved Elastic Network Adapter (ENA) network interrupt handling. Using perf captures visualized in Netflix's Flamescope, the engineers found that the kubelet process periodically spiked inside the kernel function mem_cgroup_nr_lru_pages, and ultimately linked those spikes to nearly 70,000 accumulated zombie memcgs. Disabling the unused ECS agent and rebooting the nodes restored stability.

Key takeaway

For MLOps engineers managing distributed computing platforms, this case shows why base image configurations need scrutiny. Disable default agents you do not use, and run continuous, time-indexed profiling with tools like gProfiler or Parca so that hidden kernel-level resource contention can be found and fixed before it affects critical workloads such as Ray clusters.

Key insights

Unused default agents can leak kernel state, causing hidden CPU bottlenecks and system instability.
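One rough way to check a node for this kind of leaked state is to compare the kernel's own count of memory cgroups with the number of cgroup directories still visible in the filesystem. The sketch below is illustrative and not taken from the Pinterest write-up. It assumes a cgroup v1 host where the memory controller is mounted at /sys/fs/cgroup/memory, and it assumes that the num_cgroups column in /proc/cgroups still counts deleted-but-pinned memcgs. The function names are hypothetical.

```python
import os


def memcg_count_from_proc(text):
    """Return the num_cgroups value for the memory controller.

    `text` is the content of /proc/cgroups, whose columns are:
    subsys_name, hierarchy, num_cgroups, enabled.
    """
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # header line
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "memory":
            return int(parts[2])
    raise ValueError("no memory controller line found")


def count_cgroup_dirs(root):
    """Count directories under `root` (root included); each one is a live cgroup."""
    return sum(1 for _ in os.walk(root))


def estimate_zombies(kernel_count, visible_count):
    """Estimate memcgs the kernel still tracks that have no directory left."""
    return max(0, kernel_count - visible_count)
```

On a node, this could be used by passing the contents of /proc/cgroups to memcg_count_from_proc and the assumed mount point /sys/fs/cgroup/memory to count_cgroup_dirs. A gap in the tens of thousands would be consistent with the situation described in the summary, though the number is an estimate and not an exact zombie count.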

Method

Diagnose intermittent CPU starvation by moving from aggregate to per-core analysis, using tools like mpstat and perf captures visualized in Flamescope to pinpoint kernel function hotspots.
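The move from aggregate to per-core analysis can be sketched without mpstat by diffing two snapshots of /proc/stat. This is a minimal illustration and not the engineers' tooling: the function names and the 90% threshold are assumptions, and only the "system" column is counted, which roughly corresponds to what mpstat reports as %sys.

```python
def parse_proc_stat(text):
    """Return {core: (system_jiffies, total_jiffies)} for each per-core line."""
    cores = {}
    for line in text.splitlines():
        parts = line.split()
        # Per-core lines look like "cpu0 ..."; the bare "cpu" line is the aggregate.
        if not parts or not parts[0].startswith("cpu") or parts[0] == "cpu":
            continue
        # Fields: user nice system idle iowait irq softirq steal
        fields = [int(x) for x in parts[1:9]]
        cores[parts[0]] = (fields[2], sum(fields))
    return cores


def per_core_system_pct(before, after):
    """Percentage of time each core spent in system mode between two snapshots."""
    first, second = parse_proc_stat(before), parse_proc_stat(after)
    result = {}
    for core, (sys_after, total_after) in second.items():
        sys_before, total_before = first.get(core, (0, 0))
        delta_total = total_after - total_before
        if delta_total > 0:
            result[core] = 100.0 * (sys_after - sys_before) / delta_total
        else:
            result[core] = 0.0
    return result


def hot_cores(pcts, threshold=90.0):
    """Return the cores whose system CPU share meets the (assumed) threshold."""
    return sorted(core for core, pct in pcts.items() if pct >= threshold)
```

To use it, read /proc/stat twice a second or so apart and pass both strings to per_core_system_pct. A single core near 100% while the average across cores stays moderate is the pattern that aggregate dashboards hide.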

Best for: MLOps Engineer, AI Engineer, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by InfoQ.