Finding zombies in our systems: A real-world story of CPU bottlenecks

2026-04-15 · Source: Pinterest Engineering Blog - Medium · Field: Technology & Digital — Cloud Computing & IT Infrastructure, Software Development & Engineering, Artificial Intelligence & Machine Learning · Depth: Advanced, long

Summary

In early 2025, Pinterest's PinCompute team resolved intermittent network connectivity loss and crashes affecting Ray-based ML training jobs on AWS EC2 instances. The failures, which hit GPU hardware, were linked to resets of the AWS ENA network driver. Early observations of high system CPU usage and page faulting prompted mitigation attempts with huge pages, jemalloc, and CPU affinity, all of which failed. The problem appeared in only one AWS Availability Zone (us-east-1a) and was temporarily relieved by rebooting machines. Per-core profiling with mpstat showed individual CPU cores reaching 100% utilization. Temporal analysis with perf and Flamescope then showed kubelet consuming ~6.5% CPU before resets, with the time spent in the kernel function mem_cgroup_nr_lru_pages. This pointed to "zombie memory cgroups": the kernel was tracking nearly 70,000 memory cgroups while only 240 were in use. The root cause was the Amazon ECS Agent, pre-installed on the AWS Deep Learning AMI (Ubuntu 20.04), which crashed repeatedly and left these cgroups behind. Disabling the ECS agent systemd unit and rebooting the machines resolved the issue and restored high success rates for ML training jobs.
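The gap between tracked and in-use memory cgroups can be estimated on a single host. The following is a minimal Python sketch, not the team's tooling: it assumes a cgroup v1 host where the memory hierarchy is mounted at /sys/fs/cgroup/memory (the default on Ubuntu 20.04, but an assumption here), and it compares the num_cgroups column of /proc/cgroups with the number of cgroup directories that still exist. The function names are hypothetical.

```python
import os


def tracked_memory_cgroups(proc_cgroups_text: str) -> int:
    """Return num_cgroups for the 'memory' controller from /proc/cgroups text."""
    for line in proc_cgroups_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        # Columns: subsys_name, hierarchy, num_cgroups, enabled
        if fields[0] == "memory":
            return int(fields[2])
    raise ValueError("memory controller not listed")


def live_memory_cgroups(root: str = "/sys/fs/cgroup/memory") -> int:
    """Count cgroup directories (root included) under the memory hierarchy.

    The mount point is an assumption (cgroup v1 layout). os.walk yields one
    entry per directory and yields nothing if the path does not exist.
    """
    return sum(1 for _ in os.walk(root))


def zombie_estimate(tracked: int, live: int) -> int:
    """Cgroups the kernel still tracks that no longer have a directory."""
    return max(tracked - live, 0)


def report() -> str:
    """Read the live host and summarize the gap (call this manually)."""
    with open("/proc/cgroups") as handle:
        tracked = tracked_memory_cgroups(handle.read())
    live = live_memory_cgroups()
    return f"tracked={tracked} live={live} zombies~{zombie_estimate(tracked, live)}"
```

Calling report() on an affected node would be expected to show a tracked count far above the live count; on a healthy node the two should be close.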

Key takeaway

MLOps engineers running distributed ML workloads on Kubernetes should scrutinize base OS images for unintended background processes. A repeatedly crashing agent, such as the Amazon ECS Agent on the AWS Deep Learning AMI, can accumulate "zombie memory cgroups" that starve CPUs and trigger network driver resets. Disable unnecessary systemd units proactively and use temporal profiling to diagnose sporadic bottlenecks; this keeps training environments stable at high throughput and avoids costly job failures.
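One way to spot a crash-looping agent on a base image is to look at systemd's per-unit restart counter. The sketch below is an illustration rather than the method from the original article: it parses the output of `systemctl show '*.service' --property=Id,NRestarts`, which prints one block of `Key=Value` lines per loaded unit, separated by blank lines. The threshold and function names are assumptions.

```python
from typing import Dict, List


def parse_restart_counts(show_output: str) -> Dict[str, int]:
    """Map unit name to NRestarts from `systemctl show` property blocks."""
    counts: Dict[str, int] = {}
    for block in show_output.strip().split("\n\n"):
        props = dict(
            line.split("=", 1) for line in block.splitlines() if "=" in line
        )
        restarts = props.get("NRestarts", "")
        # Units without a numeric NRestarts value are skipped.
        if "Id" in props and restarts.isdigit():
            counts[props["Id"]] = int(restarts)
    return counts


def crash_loop_suspects(counts: Dict[str, int], threshold: int = 5) -> List[str]:
    """Return units restarted at least `threshold` times (arbitrary cutoff)."""
    return sorted(name for name, n in counts.items() if n >= threshold)
```

A unit that shows up here and is not needed on the node can then be stopped and disabled with `systemctl disable --now <unit>`. The ECS agent's unit name is not given in this summary, so it is left as a placeholder.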

Key insights

Intermittent network issues on Ray ML jobs stemmed from CPU starvation caused by zombie memory cgroups from a crashing ECS agent.

Principles

CPU starvation can cause network driver resets.
System defaults may introduce hidden performance issues.
Temporal profiling is crucial for sporadic issues.

Method

The team used mpstat for per-core CPU utilization, then perf with a custom script for 2-minute snapshots, and Flamescope for temporal visualization to pinpoint kubelet's CPU spikes.
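The team's custom script is not reproduced in this summary. As a rough sketch of the same idea, the Python below builds the commands for back-to-back 2-minute system-wide perf captures and converts each one with `perf script --header`, the text format Flamescope reads. The 99 Hz sampling rate, file naming, and function names are assumptions.

```python
import subprocess
from typing import List


def perf_record_cmd(out_path: str, seconds: int = 120, freq_hz: int = 99) -> List[str]:
    """System-wide (-a) stack sampling (-g) for `seconds`, written to out_path."""
    return ["perf", "record", "-F", str(freq_hz), "-a", "-g",
            "-o", out_path, "--", "sleep", str(seconds)]


def perf_script_cmd(in_path: str) -> List[str]:
    """Dump a capture as text with the header block that Flamescope expects."""
    return ["perf", "script", "--header", "-i", in_path]


def snapshot_name(prefix: str, index: int) -> str:
    """Zero-padded capture name so snapshots sort in time order."""
    return f"{prefix}.{index:04d}.data"


def run_snapshots(count: int, prefix: str = "perf") -> None:
    """Take `count` consecutive snapshots (needs perf and root; not run here)."""
    for index in range(count):
        data_path = snapshot_name(prefix, index)
        subprocess.run(perf_record_cmd(data_path), check=True)
        with open(data_path + ".stacks", "w") as out:
            subprocess.run(perf_script_cmd(data_path), stdout=out, check=True)
```

Keeping each capture short bounds file size and means a reset can be matched to the snapshot that covers it, which is what makes a sub-second heatmap view useful for a sporadic spike.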

In practice

Track fleet-wide metrics for transient issues.
Profile base OS images for unintended processes.
Use temporal profiling tools like gProfiler.
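For the fleet-wide metric, a per-core view matters because an average can hide a single saturated core. The sketch below is one possible way to derive an mpstat-like per-core busy percentage from two reads of /proc/stat; it is an illustration, and the 95% threshold and function names are assumptions.

```python
from typing import Dict, List, Tuple


def parse_proc_stat(text: str) -> Dict[str, Tuple[int, int]]:
    """Map each 'cpuN' line to (total_jiffies, idle_jiffies)."""
    cores: Dict[str, Tuple[int, int]] = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip the aggregate 'cpu' line and all non-CPU lines.
        if not parts or not parts[0].startswith("cpu") or parts[0] == "cpu":
            continue
        # Fields: user nice system idle iowait irq softirq steal
        values = [int(v) for v in parts[1:9]]
        idle = values[3] + (values[4] if len(values) > 4 else 0)
        cores[parts[0]] = (sum(values), idle)
    return cores


def per_core_busy_pct(before: Dict[str, Tuple[int, int]],
                      after: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Busy percentage per core between two /proc/stat samples."""
    result: Dict[str, float] = {}
    for core, (total1, idle1) in after.items():
        if core not in before:
            continue
        total0, idle0 = before[core]
        elapsed = total1 - total0
        if elapsed <= 0:
            continue
        result[core] = 100.0 * (elapsed - (idle1 - idle0)) / elapsed
    return result


def saturated_cores(pcts: Dict[str, float], threshold: float = 95.0) -> List[str]:
    """Cores at or above the threshold, worth exporting as a metric."""
    return sorted(core for core, pct in pcts.items() if pct >= threshold)
```

Sampling /proc/stat a few seconds apart and exporting the maximum per-core value, or the saturated-core count, gives a signal that can be tracked across the fleet.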

Topics

Kubernetes
Ray ML
AWS ENA Driver
CPU Bottlenecks
Performance Profiling
Memory Cgroups

Best for: MLOps Engineer, AI Engineer, Software Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Pinterest Engineering Blog - Medium.