GPU Time-Slicing for Concurrent LLM Agents on Kubernetes

2026-06-14 · Source: Towards Data Science · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Software Development & Engineering · Depth: Advanced, extended

Summary

An analysis of GPU time-slicing for concurrent LLM agents on Kubernetes finds that median latencies and throughput barely move under sharing (FFT throughput down 7.3%, GEMM down 1.4%), while the tail does: p99 latency for the small, latency-sensitive FFT agent rose from 3.68 ms to 6.10 ms (1.66x the baseline), and its jitter (p99/p50) rose from 1.02 to 1.70. The cause is CUDA time-slicing. With `replicas: 4` set in the NVIDIA device plugin's ConfigMap, one physical GTX 1080 is advertised as four `nvidia.com/gpu` resources, so several pods that each request `nvidia.com/gpu: 1` can be scheduled onto the same card with no hardware partitioning or isolation between them. Kubernetes still reports every pod as `Running`, so the cost stays hidden; it shows up in tail latency, and the small, latency-critical agent pays the most.
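
The ratios quoted in this summary can be checked directly from the reported figures. The short Python sketch below recomputes the p99 degradation factor and, as a derived estimate that the source does not state, backs out the implied medians from the p99 and jitter values. The variable names are illustrative, and the implied medians inherit rounding error from the two-decimal inputs.

```python
# Figures as reported in the summary (FFT agent, milliseconds).
baseline_p99_ms = 3.68
shared_p99_ms = 6.10
baseline_jitter = 1.02  # p99 / p50 with the GPU not shared
shared_jitter = 1.70    # p99 / p50 with the GPU time-sliced

# Degradation factor: shared p99 relative to baseline p99.
p99_degradation = shared_p99_ms / baseline_p99_ms

# Implied medians (derived here, not reported): p50 = p99 / jitter.
baseline_p50_ms = baseline_p99_ms / baseline_jitter
shared_p50_ms = shared_p99_ms / shared_jitter

print(f"p99 degradation: {p99_degradation:.2f}x")
print(f"implied p50: {baseline_p50_ms:.2f} ms -> {shared_p50_ms:.2f} ms")
```

Both implied medians come out near 3.6 ms, which is consistent with the claim that the median stays stable while the tail moves.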

Key takeaway

For MLOps engineers deploying concurrent LLM agents on shared Kubernetes GPUs, a `Running` pod status and average throughput are not enough to judge performance. Latency-sensitive agents can suffer large p99 latency degradation that medians do not reveal (1.66x in the article's test). Time GPU execution with CUDA events and monitor tail latencies to catch silent service level objective (SLO) breaches.

Key insights

GPU time-slicing on Kubernetes hides significant tail latency degradation for latency-sensitive agents, despite stable medians and throughput.

Principles

GPU sharing via time-slicing offers capacity, not isolation.
Median metrics obscure critical tail latency degradation.
Latency-sensitive workloads suffer most from GPU contention.

Method

Measure GPU execution time using CUDA events and `torch.cuda.synchronize()` to capture true kernel retirement, then aggregate into percentiles (p50/p95/p99) and degradation factors.
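
A minimal sketch of this method in Python follows. It is not the article's or the profiler's code: the function names (`time_gpu_ms`, `percentile`, `summarize`, `degradation`), the iteration counts, and the nearest-rank percentile definition are assumptions chosen for illustration. `torch` is imported inside the timing function so that the aggregation helpers can run on a machine without a GPU.

```python
def time_gpu_ms(fn, iters=200, warmup=20):
    """Time fn() on the GPU with CUDA events; returns per-call milliseconds."""
    import torch  # lazy import: only the timing path needs PyTorch and CUDA

    for _ in range(warmup):  # warm-up calls are not recorded
        fn()
    torch.cuda.synchronize()

    samples = []
    for _ in range(iters):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn()
        end.record()
        torch.cuda.synchronize()  # block until the recorded work has finished
        samples.append(start.elapsed_time(end))  # elapsed time in ms
    return samples


def percentile(samples, q):
    """Nearest-rank percentile for an integer q in 1..100 (assumed definition)."""
    ordered = sorted(samples)
    rank = max(1, (q * len(ordered) + 99) // 100)  # integer ceil(q * n / 100)
    return ordered[rank - 1]


def summarize(samples):
    """Aggregate samples into p50/p95/p99 and jitter (p99 / p50)."""
    p50, p95, p99 = (percentile(samples, q) for q in (50, 95, 99))
    return {"p50": p50, "p95": p95, "p99": p99, "jitter": p99 / p50}


def degradation(shared, baseline):
    """Ratio of each shared-GPU statistic to its exclusive-GPU counterpart."""
    return {key: shared[key] / baseline[key] for key in baseline}
```

On a CUDA machine, `summarize(time_gpu_ms(lambda: torch.fft.fft(x)))` with `x` a CUDA tensor would yield one agent's statistics. Running it once with the GPU to itself and once under time-slicing gives the two inputs to `degradation`.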

In practice

Use CUDA events for accurate GPU kernel timing.
Monitor p99 latency, not just averages, for shared GPUs.
Deploy `Kube-Timeslice-Profiler` to quantify GPU contention.
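
To illustrate why the p99 matters more than the average on a shared GPU, the sketch below checks the same latency samples against an SLO twice, once by mean and once by p99. The samples are synthetic and purely illustrative (they are not measurements from the article), and the 5.0 ms SLO, the `check_slo` helper, and the nearest-rank percentile are assumptions.

```python
import statistics


def nearest_rank(samples, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(samples)
    rank = max(1, (q * len(ordered) + 99) // 100)  # integer ceil(q * n / 100)
    return ordered[rank - 1]


def check_slo(samples_ms, slo_ms):
    """Compare the mean and the p99 of the samples against one latency SLO."""
    mean_ms = statistics.fmean(samples_ms)
    p99_ms = nearest_rank(samples_ms, 99)
    return {
        "mean_ms": mean_ms,
        "p99_ms": p99_ms,
        "mean_ok": mean_ms <= slo_ms,
        "p99_ok": p99_ms <= slo_ms,
    }


# Synthetic samples: most calls are fast, a few are delayed by contention.
samples_ms = [3.6] * 97 + [6.1] * 3
report = check_slo(samples_ms, slo_ms=5.0)  # hypothetical 5.0 ms SLO
print(report)
```

With these synthetic samples the mean passes the hypothetical SLO while the p99 fails it, which is the pattern an average-only dashboard would miss.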

Topics

GPU Time-Slicing
Kubernetes Scheduling
LLM Agent Performance
Tail Latency Monitoring
NVIDIA Device Plugin
CUDA Events

Code references

AnubhabBanerjee/Kube-Timeslice-Profiler

Best for: MLOps Engineer, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.