GPU Time-Slicing for Concurrent LLM Agents on Kubernetes
Summary
Time-slicing a GPU across concurrent LLM agents on Kubernetes leaves medians and throughput largely unchanged (FFT throughput down 7.3%, GEMM down 1.4%) but hurts the tail: the p99 latency of the small, latency-sensitive FFT agent rose from 3.68 ms to 6.10 ms (1.66x), and its jitter (p99/p50) rose from 1.02 to 1.70. The cause is CUDA time-slicing. With `replicas: 4` set in the NVIDIA device plugin's ConfigMap, several pods can each request `nvidia.com/gpu: 1` and be scheduled onto a single physical GTX 1080, with no hardware partitioning or isolation between them. Kubernetes still reports every pod as `Running`, so the cost stays hidden, and it falls hardest on the small, latency-critical agent.
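The ratios quoted above can be checked with a few lines of arithmetic. The sketch below uses only the rounded figures from the summary. The medians it prints are back-calculated from p99 and jitter, so treat them as approximations implied by those figures, not as values reported by the article.

```python
# Figures quoted in the summary for the FFT agent (milliseconds, rounded).
P99_SOLO_MS = 3.68
P99_SHARED_MS = 6.10
JITTER_SOLO = 1.02    # p99 / p50 when the agent runs alone
JITTER_SHARED = 1.70  # p99 / p50 under time-slicing

# Degradation factor of the tail latency.
p99_factor = P99_SHARED_MS / P99_SOLO_MS

# Jitter is p99 / p50, so p50 = p99 / jitter. These medians are implied
# by the rounded inputs above; they are not taken from the article.
p50_solo_ms = P99_SOLO_MS / JITTER_SOLO
p50_shared_ms = P99_SHARED_MS / JITTER_SHARED

print(f"p99 degradation: {p99_factor:.2f}x")
print(f"implied p50 solo: {p50_solo_ms:.2f} ms, shared: {p50_shared_ms:.2f} ms")
```

Both implied medians land close to 3.6 ms, which is consistent with the claim that the median barely moves while the tail grows.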
Key takeaway
For MLOps engineers deploying concurrent LLM agents on shared Kubernetes GPUs, a `Running` pod status and average throughput numbers are not enough. Latency-sensitive agents can suffer significant p99 latency degradation that stays hidden while medians look stable. Time GPU execution directly with CUDA events and monitor tail latencies to assess real performance and catch silent service level objective (SLO) breaches.
Key insights
GPU time-slicing on Kubernetes hides significant tail latency degradation for latency-sensitive agents, despite stable medians and throughput.
Principles
- GPU sharing via time-slicing offers capacity, not isolation.
- Median metrics obscure critical tail latency degradation.
- Latency-sensitive workloads suffer most from GPU contention.
Method
Measure GPU execution time using CUDA events and `torch.cuda.synchronize()` to capture true kernel retirement, then aggregate into percentiles (p50/p95/p99) and degradation factors.
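The method above could be sketched roughly as follows. This is a minimal illustration, not the article's code: the helper names (`time_gpu_op`, `percentile`, `summarize`, `degradation`), the warmup and iteration counts, and the nearest-rank percentile definition are all assumptions. `time_gpu_op` needs PyTorch and a CUDA device; the aggregation helpers are plain Python.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples_ms):
    """Aggregate per-iteration latencies into p50/p95/p99 and jitter."""
    p50 = percentile(samples_ms, 50)
    p95 = percentile(samples_ms, 95)
    p99 = percentile(samples_ms, 99)
    return {"p50": p50, "p95": p95, "p99": p99, "jitter": p99 / p50}


def degradation(solo, shared):
    """Ratio of shared-GPU to solo latency at each percentile."""
    return {k: shared[k] / solo[k] for k in ("p50", "p95", "p99")}


def time_gpu_op(op, iters=200, warmup=20):
    """Time a GPU operation with CUDA events; returns milliseconds per call.

    `op` is any zero-argument callable that launches GPU work.
    """
    import torch  # imported lazily so the helpers above work without a GPU

    for _ in range(warmup):
        op()
    torch.cuda.synchronize()

    samples_ms = []
    for _ in range(iters):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        op()
        end.record()
        # Wait until the queued kernels have finished before reading the timer.
        torch.cuda.synchronize()
        samples_ms.append(start.elapsed_time(end))
    return samples_ms
```

In use, one would collect samples for the same operation once with the agent alone on the GPU and once under time-slicing, pass each list through `summarize`, and compare the two results with `degradation`.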
In practice
- Use CUDA events for accurate GPU kernel timing.
- Monitor p99 latency, not just averages, for shared GPUs.
- Deploy `Kube-TimeSlice-Profiler` to quantify GPU contention.
Topics
- GPU Time-Slicing
- Kubernetes Scheduling
- LLM Agent Performance
- Tail Latency Monitoring
- NVIDIA Device Plugin
- CUDA Events
Best for: MLOps Engineer, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.