Fixing GPU Starvation in Large-Scale Distributed Training

2026-04-10 · Source: MLOps.community · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Software Development & Engineering · Depth: Intermediate, extended

Summary

Uber's ML infrastructure team, led by Kash, tackled a severe GPU underutilization problem: models were running at 15-20% utilization on A100 GPUs, wasting significant time and money. As an initial diagnostic, the team loaded all training data into RAM, which raised utilization to 85% and confirmed that the bottleneck was data feeding rather than model architecture. Standard fixes such as increasing threads or parallelism did not help. Tracing showed that reading remote Parquet files into the input queue was slow, leaving the GPUs starved. The team then cached the data locally on the host machine attached to the GPUs, but this surprisingly yielded no improvement. The root cause was a hidden bottleneck: the on-the-fly conversion from PyArrow format (optimized for Parquet) to the NumPy arrays that the training pipeline feeds to the GPUs. By caching the converted NumPy output directly, the team reached 85% utilization and cut training times from a day to an hour or two on the same resources.

Key takeaway

For MLOps engineers optimizing GPU-intensive workloads: look at data pipeline efficiency before tweaking model architecture. If GPU utilization is low, investigate data I/O and format conversion steps, since these often starve GPUs. Caching pre-transformed data locally (e.g., NumPy arrays) keeps the GPUs continuously fed and can substantially reduce training time and wasted resources.

Key insights

GPU underutilization often stems from data I/O and format transformation bottlenecks, not model complexity.

Principles

Data I/O is a primary bottleneck in ML scalability.
GPU utilization is key to cost-efficiency and productivity.
Data format translation can be a hidden performance killer.

Method

Profile ML workloads by tracing data flow from producer to consumer. Isolate bottlenecks by eliminating variables (e.g., loading data into RAM). Implement caching for transformed data to ensure GPUs receive ready-to-process tensors.
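
The profiling step can be approximated with a very small timing harness. The sketch below is illustrative only and is not the team's tooling: `timed_batches`, `data_wait_fraction`, and `slow_loader` are hypothetical names, and the loader and training step are stand-ins. It measures what share of wall time the consumer spends waiting for the next batch versus processing it; a high share points at the data pipeline rather than the model.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def timed_batches(batches: Iterable) -> Iterator[Tuple[object, float]]:
    """Yield (batch, seconds the consumer waited for that batch)."""
    it = iter(batches)
    while True:
        start = time.perf_counter()
        try:
            batch = next(it)
        except StopIteration:
            return
        yield batch, time.perf_counter() - start


def data_wait_fraction(batches: Iterable, train_step: Callable[[object], object]) -> float:
    """Return the share of wall time spent waiting on data (0.0 to 1.0).

    Caveat: on a real GPU, work is usually launched asynchronously, so
    train_step would need to synchronize before returning for the compute
    timing to be meaningful.
    """
    wait = compute = 0.0
    for batch, waited in timed_batches(batches):
        wait += waited
        start = time.perf_counter()
        train_step(batch)
        compute += time.perf_counter() - start
    total = wait + compute
    return wait / total if total else 0.0


# Hypothetical stand-ins: a slow loader and a near-instant training step.
def slow_loader(n_batches: int = 5, delay_s: float = 0.02):
    for i in range(n_batches):
        time.sleep(delay_s)  # simulates a slow remote read or format conversion
        yield [i] * 8


fraction = data_wait_fraction(slow_loader(), train_step=lambda batch: sum(batch))
print(f"share of wall time spent waiting on data: {fraction:.0%}")
```

Swapping the loader for an in-memory list is the same "eliminate a variable" experiment described above: if the wait share collapses, the bottleneck sits upstream of the model.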

In practice

Cache transformed NumPy arrays so batches reach the GPU ready to use.
Warm-start inference models with synthetic QPS to avoid cold starts.
Add TL;DR comments and READMEs to codebases for agent context.
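
The first item above, caching the already-transformed output rather than the raw source data, might look roughly like the following. This is a minimal sketch under stated assumptions, not the team's implementation: `cached_array` and `expensive_transform` are hypothetical names, the cache is a local directory of `.npy` files, and the transform stands in for whatever is slow in a given pipeline (in the case described, reading Parquet and converting from PyArrow to NumPy).

```python
import os
import tempfile
from pathlib import Path
from typing import Callable

import numpy as np


def cached_array(key: str, cache_dir: Path, produce: Callable[[], np.ndarray]) -> np.ndarray:
    """Return the array for `key`, running `produce` only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{key}.npy"
    if not path.exists():
        array = np.ascontiguousarray(produce())
        # Write to a temporary file first, then rename, so a reader never
        # sees a half-written cache entry.
        tmp = cache_dir / f"{key}.npy.tmp"
        with open(tmp, "wb") as fh:
            np.save(fh, array)
        os.replace(tmp, path)
    # Memory-map the cached file so later epochs skip the transform entirely.
    return np.load(path, mmap_mode="r")


# Hypothetical transform that counts how often it actually runs.
calls = {"n": 0}


def expensive_transform() -> np.ndarray:
    calls["n"] += 1
    return np.arange(12, dtype=np.float32).reshape(3, 4)


cache_dir = Path(tempfile.mkdtemp())
first = cached_array("shard-0", cache_dir, expensive_transform)
second = cached_array("shard-0", cache_dir, expensive_transform)
print("transform ran", calls["n"], "time(s) for 2 reads")
```

The design point is where the cache sits: storing raw Parquet locally still pays the conversion cost on every read, whereas storing the converted array pays it once.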

Topics

GPU Starvation
Data I/O Bottlenecks
Distributed ML Training
PyArrow to NumPy Transformation
Model Reproducibility

Best for: Machine Learning Engineer, MLOps Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by MLOps.community.