Service-Induced Congestion in Memory-Constrained LLM Serving
Summary
A new study identifies "service-induced congestion" in large language model (LLM) serving: key-value (KV) caches accumulate in GPU memory throughout autoregressive decoding, so the act of serving requests itself creates capacity pressure. Under high concurrency, memory demand exceeds capacity and the system must evict active requests, which wastes completed computation and reduces throughput. The research develops a discrete-time dynamical model of this process. For homogeneous workloads, the eviction-free equilibrium is unstable and the system converges to a worst-case limit cycle with up to 50% throughput loss (when decoding lengths are large relative to input lengths). For heterogeneous workloads, stability depends on whether the decoding lengths are coprime: coprime lengths stabilize the system, while non-coprime lengths cause synchronized instability. The work proposes rate-limited admission and request mixing as scheduling design principles, validated with model-based simulations, Vidur simulations, and experiments on real GPUs.
Key takeaway
For MLOps engineers optimizing LLM serving: continuous KV cache growth creates a dynamic memory constraint that stateless inference does not have. Admission policies must anticipate future memory pressure, not just check whether a request fits at the moment it arrives. Avoid homogeneous workloads where possible, as they are structurally unstable and can lose up to 50% of throughput. Instead, mix heterogeneous requests with coprime decoding lengths to desynchronize memory release, or use rate-limited admission to prevent eviction cascades.
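To make the difference between instantaneous fit and anticipated pressure concrete, here is a minimal Python sketch. It assumes each request's remaining decode length is known in advance, which real schedulers usually have to estimate. The names `projected_peak` and `can_admit` are hypothetical and are not taken from the paper or from any serving framework.

```python
def projected_peak(requests):
    """Peak future KV usage in tokens.

    `requests` is a list of (kv_tokens_now, remaining_decode_steps) pairs.
    Assumption: every request adds one KV token per step and frees all of
    its memory right after its final step.
    """
    peak = sum(kv for kv, _ in requests)  # usage right now
    # Usage only drops when a request completes, so it is enough to
    # inspect the steps at which some request takes its final step.
    for t in sorted({rem for _, rem in requests}):
        usage = sum(kv + t for kv, rem in requests if rem >= t)
        peak = max(peak, usage)
    return peak


def can_admit(active, candidate, capacity):
    """Admit only if the projected peak, not just current usage, fits."""
    return projected_peak(active + [candidate]) <= capacity
```

For example, with two active requests at `(10, 5)` each and a capacity of 40 tokens, a new `(10, 5)` request fits right now (30 tokens in use) but `can_admit` rejects it, because all three would peak at 45 tokens five steps later.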
Key insights
LLM KV cache growth creates service-induced congestion, with stability determined by workload homogeneity and decoding length coprimality.
Principles
- LLM requests progressively consume GPU memory, unlike stateless inference.
- Homogeneous LLM workloads are prone to synchronized memory growth and throughput collapse.
- Coprime decoding lengths desynchronize memory release, stabilizing heterogeneous LLM systems.
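One simple way to build intuition for the coprimality principle is to count how often two request streams finish on the same step. The sketch below is an illustration only and does not reproduce the paper's spectral analysis. It assumes each stream restarts a new request immediately after the previous one completes, and it checks the greatest common divisor of the whole set of lengths; whether the paper's condition is pairwise or set-wise is not stated in this summary. Both function names are hypothetical.

```python
from functools import reduce
from math import gcd


def common_factor(decode_lengths):
    """Greatest common divisor of all decoding lengths (1 means coprime)."""
    return reduce(gcd, decode_lengths)


def simultaneous_completions(len_a, len_b, horizon):
    """Count steps in 1..horizon where both back-to-back streams complete.

    Stream A completes every `len_a` steps and stream B every `len_b`
    steps, so they coincide once per least common multiple.
    """
    return sum(
        1 for t in range(1, horizon + 1)
        if t % len_a == 0 and t % len_b == 0
    )
```

Over 126 steps, lengths 6 and 9 (common factor 3) complete together 7 times, while lengths 7 and 9 (coprime) complete together only twice, so the coprime pair releases memory in a more spread-out pattern.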
Method
A discrete-time dynamical model captures LLM admission, KV cache growth, and eviction under continuous batching, analyzed via linear recurrence and spectral theory.
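The kind of dynamics described here can be explored with a toy simulation. The sketch below is not the paper's model: it is a simplified stand-in that assumes a homogeneous workload, an unlimited queue of waiting requests, one generated token per active request per step, and eviction of the least-progressed requests whenever the cache overflows after a decode step. The `simulate` function and its parameters are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_len: int     # input tokens, cached at admission
    decode_len: int     # output tokens this request will generate
    generated: int = 0  # output tokens generated so far

    @property
    def kv_tokens(self) -> int:
        # The KV cache footprint grows by one token per decode step.
        return self.prompt_len + self.generated


def simulate(capacity, prompt_len, decode_len, steps, admit_per_step=None):
    """Return (completed_tokens, wasted_tokens) for a homogeneous workload.

    `admit_per_step=None` means greedy admission: admit whenever the
    prompt fits in the memory that is free at that instant.
    """
    if prompt_len <= 0:
        raise ValueError("prompt_len must be positive")
    active = []
    completed_tokens = 0  # decode tokens of requests that finished
    wasted_tokens = 0     # decode tokens thrown away by eviction
    for _ in range(steps):
        # 1. Admission based on instantaneous fit (optionally rate-limited).
        used = sum(r.kv_tokens for r in active)
        admitted = 0
        while used + prompt_len <= capacity and (
            admit_per_step is None or admitted < admit_per_step
        ):
            active.append(Request(prompt_len, decode_len))
            used += prompt_len
            admitted += 1
        # 2. One decode step: every active request generates one token.
        for r in active:
            r.generated += 1
        # 3. Finished requests release their memory.
        finished = [r for r in active if r.generated >= decode_len]
        completed_tokens += sum(r.generated for r in finished)
        active = [r for r in active if r.generated < decode_len]
        # 4. On overflow, evict the least-progressed requests first.
        active.sort(key=lambda r: r.generated, reverse=True)
        while sum(r.kv_tokens for r in active) > capacity:
            victim = active.pop()
            wasted_tokens += victim.generated
    return completed_tokens, wasted_tokens
```

In this toy setting, `simulate(8, 1, 3, steps=3)` returns `(6, 8)`: greedy admission discards more decode work than it completes in the first three steps. Capping admission with `admit_per_step=1` returns `(3, 0)` over the same window, with no evictions.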
In practice
- Implement rate-limited admission to regulate concurrency and prevent memory overflow.
- Mix heterogeneous requests with coprime decoding lengths to desynchronize completions.
- When eviction is unavoidable, evict the least-progressed requests first (Least-Progressed-First rule) so that later-stage requests keep their accumulated work.
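Two of the practices above can be sketched in a few lines of Python. This is an illustrative interpretation under stated assumptions, not the paper's scheduler: the rate limit is modeled as a fixed cap on admissions per scheduler step, the waiting queue is FIFO, and the names `Running`, `admit_rate_limited`, and `pick_victims` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Running:
    request_id: str
    kv_tokens: int   # current KV cache footprint in tokens
    generated: int   # decode tokens produced so far


def admit_rate_limited(waiting_prompt_lens, free_tokens, max_per_step):
    """Admit from the front of the queue, at most `max_per_step` per step."""
    admitted = []
    for prompt_len in waiting_prompt_lens:
        if len(admitted) >= max_per_step or prompt_len > free_tokens:
            break  # FIFO: stop at the first request that cannot be admitted
        admitted.append(prompt_len)
        free_tokens -= prompt_len
    return admitted


def pick_victims(running, capacity):
    """Least-Progressed-First: evict the requests with the fewest generated
    tokens until the remaining requests fit within `capacity`."""
    survivors = sorted(running, key=lambda r: r.generated, reverse=True)
    used = sum(r.kv_tokens for r in survivors)
    victims = []
    while survivors and used > capacity:
        victim = survivors.pop()  # least-progressed request is last
        used -= victim.kv_tokens
        victims.append(victim)
    return victims
```

Evicting the least-progressed request first discards the smallest amount of already generated output, which matches the goal of retaining later-stage requests.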
Topics
- LLM Serving
- GPU Memory Management
- Continuous Batching
- Dynamical Systems
- Admission Control
- Workload Heterogeneity
- KV Cache
Best for: Research Scientist, AI Architect, AI Engineer, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.