Finding zombies in our systems: A real-world story of CPU bottlenecks

· Source: Pinterest Engineering Blog - Medium · Field: Technology & Digital — Cloud Computing & IT Infrastructure, Software Development & Engineering, Artificial Intelligence & Machine Learning · Depth: Advanced, long

Summary

In early 2025, Pinterest's PinCompute team resolved intermittent network connectivity loss and crashes affecting Ray-based ML training jobs on AWS EC2 instances. The failures, which hit GPU machines, were linked to resets of the AWS ENA network driver. Initial observations of high system CPU usage and page faulting prompted mitigation attempts with huge pages, jemalloc, and CPU affinity, none of which worked. The problem appeared in only one AWS Availability Zone (us-east-1a), and rebooting a machine relieved it temporarily. Per-core profiling with mpstat showed individual CPU cores reaching 100% utilization. Temporal analysis with perf and Flamescope showed kubelet consuming ~6.5% of CPU before resets, with the time spent in the kernel function mem_cgroup_nr_lru_pages. This pointed to "zombie memory cgroups": the kernel was tracking nearly 70,000 memory cgroups while only 240 were in use. The root cause was the Amazon ECS Agent, pre-installed on the AWS Deep Learning AMI (Ubuntu 20.04), which crashed repeatedly and left a growing pile of these cgroups behind. Disabling the ECS agent's systemd unit and rebooting the machines resolved the issue and restored high success rates for ML training jobs.

Key takeaway

MLOps engineers managing distributed ML workloads on Kubernetes should scrutinize base OS images for unintended background processes. A repeatedly crashing agent, such as the Amazon ECS Agent on AWS Deep Learning AMIs, can accumulate "zombie memory cgroups" that lead to CPU starvation and, in turn, network driver resets. Disable unnecessary systemd units proactively, and use temporal profiling to diagnose sporadic performance bottlenecks that averaged metrics hide. Both steps help keep training environments stable and prevent costly job failures.

Key insights

Intermittent network issues on Ray ML jobs stemmed from CPU starvation caused by zombie memory cgroups from a crashing ECS agent.
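
The gap between tracked and in-use memory cgroups can be checked on a host with a short script. The following is a minimal sketch, not the method the Pinterest team used: it assumes a cgroup v1 layout with the memory hierarchy mounted at /sys/fs/cgroup/memory, and the function names are hypothetical. It reads the kernel's count from /proc/cgroups and compares it with the number of cgroup directories that are actually visible.

```python
from pathlib import Path


def tracked_memory_cgroups(proc_cgroups_text: str) -> int:
    """Return the num_cgroups value for the memory controller.

    Expects the contents of /proc/cgroups, whose columns are:
    subsys_name, hierarchy, num_cgroups, enabled.
    """
    for line in proc_cgroups_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip the header and blank lines
        name, _hierarchy, num_cgroups, _enabled = line.split()
        if name == "memory":
            return int(num_cgroups)
    raise ValueError("memory controller not listed")


def live_memory_cgroups(root: str = "/sys/fs/cgroup/memory") -> int:
    """Count visible cgroup directories under root, including root itself.

    The default path is an assumption (cgroup v1 memory hierarchy).
    """
    root_path = Path(root)
    return 1 + sum(1 for p in root_path.rglob("*") if p.is_dir())


def zombie_estimate(proc_cgroups_text: str, root: str = "/sys/fs/cgroup/memory") -> int:
    """Estimate cgroups the kernel still tracks but that no longer have a directory."""
    return tracked_memory_cgroups(proc_cgroups_text) - live_memory_cgroups(root)
```

On a Linux host, calling `zombie_estimate(Path("/proc/cgroups").read_text())` gives a rough figure. A result in the tens of thousands, as in the roughly 70,000 versus 240 case described here, would suggest that removed cgroups are not being released.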

Method

The team used mpstat for per-core CPU utilization, then perf with a custom script for 2-minute snapshots, and Flamescope for temporal visualization to pinpoint kubelet's CPU spikes.
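
The per-core view that mpstat provides can be approximated from two snapshots of /proc/stat. The sketch below is an illustration under stated assumptions rather than the team's tooling: the function names and the 95% threshold are hypothetical, and it assumes a modern Linux kernel that reports at least the first eight CPU time fields. It shows why a per-core breakdown matters: one saturated core can hide behind a low machine-wide average.

```python
from typing import Dict, List, Tuple


def parse_proc_stat(text: str) -> Dict[str, Tuple[int, int]]:
    """Map each 'cpuN' line of /proc/stat to (busy_jiffies, total_jiffies)."""
    result: Dict[str, Tuple[int, int]] = {}
    for line in text.splitlines():
        fields = line.split()
        # Keep per-core lines only; skip the aggregate "cpu" line.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        # user nice system idle iowait irq softirq steal
        # (guest time is already included in user, so it is not added again)
        values = [int(v) for v in fields[1:9]]
        idle = values[3] + values[4]  # idle + iowait
        total = sum(values)
        result[fields[0]] = (total - idle, total)
    return result


def per_core_busy_percent(before: str, after: str) -> Dict[str, float]:
    """Busy percentage per core between two /proc/stat snapshots."""
    first, second = parse_proc_stat(before), parse_proc_stat(after)
    usage: Dict[str, float] = {}
    for core, (busy_after, total_after) in second.items():
        busy_before, total_before = first.get(core, (0, 0))
        total_delta = total_after - total_before
        if total_delta > 0:
            usage[core] = 100.0 * (busy_after - busy_before) / total_delta
    return usage


def saturated_cores(usage: Dict[str, float], threshold: float = 95.0) -> List[str]:
    """Return cores at or above the threshold (hypothetical default of 95%)."""
    return sorted(core for core, pct in usage.items() if pct >= threshold)
```

To use it, read /proc/stat twice a few seconds apart and pass both texts to `per_core_busy_percent`. Sampling repeatedly over time, rather than once, is what makes short spikes such as those preceding a driver reset visible.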

Best for: MLOps Engineer, AI Engineer, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Pinterest Engineering Blog - Medium.