Service-Induced Congestion in Memory-Constrained LLM Serving

2026-06-16 · Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Software Development & Engineering · Depth: Expert, extended

Summary

A new study identifies "service-induced congestion" in large language model (LLM) serving: key-value (KV) caches accumulate in GPU memory throughout autoregressive decoding, so the act of serving requests itself generates capacity pressure. Under high concurrency, memory demand can exceed capacity and force the eviction of active requests, which wastes computation and reduces throughput. The authors develop a discrete-time dynamical model and show that, for homogeneous workloads, the eviction-free equilibrium is unstable: the system converges to a worst-case limit cycle with up to 50% throughput loss (when decoding lengths are large relative to input lengths). For heterogeneous workloads, stability depends on whether the decoding lengths are coprime; coprime lengths stabilize the system, while non-coprime lengths cause synchronized instability. The work proposes rate-limited admission and request mixing as scheduling design principles, validated by model-based, Vidur, and real-GPU simulations.

Key takeaway

MLOps engineers optimizing LLM serving should recognize that continuous KV cache growth creates a dynamic memory constraint: a request that fits in memory now keeps growing until it completes. Admission policies must therefore anticipate future memory pressure, not just instantaneous fit. Avoid homogeneous workloads where possible, as they are structurally unstable and can lose up to 50% of throughput. Instead, mix heterogeneous requests with coprime decoding lengths to desynchronize memory release, or implement rate-limited admission to prevent eviction cascades.
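The difference between instantaneous fit and anticipated memory pressure can be sketched as a projected-peak check. This is a minimal illustration, not the paper's admission policy: the `Request` type, `projected_peak`, and `can_admit` are hypothetical names, and the sketch assumes each request's remaining decode length is known or predicted, that every request gains one KV token per step, and that memory is released on completion.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    tokens_held: int      # KV-cache tokens resident now (prompt + decoded so far)
    remaining_steps: int  # decode steps left; assumed known or predicted


def projected_peak(requests: List[Request]) -> int:
    """Peak total KV tokens from now until every request has finished."""
    horizon = max((r.remaining_steps for r in requests), default=0)
    peak = sum(r.tokens_held for r in requests)
    for t in range(1, horizon + 1):
        # A request still occupies memory at step t if it has not finished yet.
        usage = sum(r.tokens_held + t for r in requests if r.remaining_steps >= t)
        peak = max(peak, usage)
    return peak


def can_admit(active: List[Request], candidate: Request, capacity: int) -> bool:
    """Admit only if the projected peak, not just current usage, fits."""
    return projected_peak(active + [candidate]) <= capacity


active = [Request(tokens_held=10, remaining_steps=5),
          Request(tokens_held=20, remaining_steps=2)]
candidate = Request(tokens_held=8, remaining_steps=6)
```

In this example the candidate fits instantaneously (38 tokens in use), but the projected peak is 44 tokens, so a capacity of 43 would pass a naive check and still overflow two steps later.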

Key insights

LLM KV cache growth creates service-induced congestion, with stability determined by workload homogeneity and decoding length coprimality.

Principles

LLM requests progressively consume GPU memory, unlike stateless inference.
Homogeneous LLM workloads are prone to synchronized memory growth and throughput collapse.
Coprime decoding lengths desynchronize memory release, stabilizing heterogeneous LLM systems.
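The arithmetic intuition behind the coprimality principle can be illustrated with a toy example. This is not the paper's stability analysis: it assumes two hypothetical streams of back-to-back requests, each with a fixed decoding length, and lists the steps at which both streams complete (and release memory) simultaneously.

```python
from math import gcd
from typing import List


def coincident_completions(len_a: int, len_b: int, horizon: int) -> List[int]:
    """Steps up to `horizon` at which both back-to-back streams finish a request."""
    return [t for t in range(1, horizon + 1)
            if t % len_a == 0 and t % len_b == 0]


def coincidence_period(len_a: int, len_b: int) -> int:
    """Completions coincide every lcm(len_a, len_b) steps."""
    return len_a * len_b // gcd(len_a, len_b)
```

Under these assumptions, lengths 4 and 8 release memory together every 8 steps, while the coprime lengths 4 and 7 coincide only every 28 steps, the largest period possible for that pair of magnitudes.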

Method

A discrete-time dynamical model captures LLM admission, KV cache growth, and eviction under continuous batching, analyzed via linear recurrence and spectral theory.
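A toy discrete-time simulation gives a feel for the kind of dynamics such a model describes. This is a simplified sketch under stated assumptions, not the paper's model: the `simulate` function and its parameters are hypothetical, all requests are homogeneous, memory is counted in tokens, and each step runs admission, one-token decoding, least-progressed-first eviction, then completion.

```python
from typing import Dict, List, Optional


def simulate(capacity: int, prompt_len: int, decode_len: int, steps: int,
             max_admit_per_step: Optional[int] = None) -> Dict[str, int]:
    active: List[int] = []  # decoded-token count of each in-flight request
    completed = evicted = wasted_tokens = 0
    for _ in range(steps):
        # Admission: admit while a new prompt fits in memory right now,
        # optionally capped per step (a simple form of rate limiting).
        used = sum(prompt_len + d for d in active)
        admitted = 0
        while used + prompt_len <= capacity and (
            max_admit_per_step is None or admitted < max_admit_per_step
        ):
            active.append(0)
            used += prompt_len
            admitted += 1
        # Decoding: every in-flight request appends one KV token.
        active = [d + 1 for d in active]
        used += len(active)
        # Eviction: on overflow, drop the least-progressed requests first.
        active.sort(reverse=True)
        while used > capacity:
            d = active.pop()
            used -= prompt_len + d
            evicted += 1
            wasted_tokens += d  # decoded tokens thrown away
        # Completion: finished requests release their memory.
        completed += sum(1 for d in active if d >= decode_len)
        active = [d for d in active if d < decode_len]
    return {"completed": completed, "evicted": evicted,
            "wasted_tokens": wasted_tokens}


greedy = simulate(capacity=100, prompt_len=2, decode_len=8, steps=80)
limited = simulate(capacity=100, prompt_len=2, decode_len=8, steps=80,
                   max_admit_per_step=1)
```

With these illustrative parameters, greedy admission fills memory with prompts and triggers evictions on the first decode step, while admitting one request per step staggers completions and never overflows. The numbers are properties of this toy setup only and say nothing about the throughput figures reported in the study.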

In practice

Implement rate-limited admission to regulate concurrency and prevent memory overflow.
Mix heterogeneous requests with coprime decoding lengths to desynchronize completions.
When eviction is unavoidable, evict the least-progressed requests first and retain later-stage ones (Least-Progressed-First rule).
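The eviction rule in the list above can be sketched as a victim-selection function. This is one plausible reading of Least-Progressed-First, not the authors' implementation: the `InFlight` type and `select_victims` are hypothetical names, and memory is counted in KV tokens.

```python
from typing import List, NamedTuple


class InFlight(NamedTuple):
    request_id: str
    prompt_len: int      # KV tokens held for the prompt
    decoded_tokens: int  # KV tokens generated so far (progress)


def select_victims(requests: List[InFlight], capacity: int) -> List[str]:
    """Return IDs to evict, least-progressed first, until usage fits capacity."""
    used = sum(r.prompt_len + r.decoded_tokens for r in requests)
    victims: List[str] = []
    for r in sorted(requests, key=lambda r: r.decoded_tokens):
        if used <= capacity:
            break
        victims.append(r.request_id)
        used -= r.prompt_len + r.decoded_tokens
    return victims


batch = [InFlight("a", 10, 50), InFlight("b", 10, 5), InFlight("c", 10, 20)]
```

Evicting the requests with the fewest decoded tokens discards the least completed work per eviction, which is the motivation for retaining later-stage requests.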

Topics

LLM Serving
GPU Memory Management
Continuous Batching
Dynamical Systems
Admission Control
Workload Heterogeneity
KV Cache

Best for: Research Scientist, AI Architect, AI Engineer, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.