Service-Induced Congestion in Memory-Constrained LLM Serving

Source: stat.ML updates on arXiv.org · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Software Development & Engineering) · Depth: Expert, extended

Summary

A new study identifies "service-induced congestion" in large language model (LLM) serving: key-value (KV) caches accumulate in GPU memory throughout autoregressive decoding, so the act of serving requests itself creates capacity pressure. Under high concurrency, demand can exceed memory capacity and force the eviction of active requests, which wastes the computation already spent on them and reduces throughput. The authors develop a discrete-time dynamical model and show that, for homogeneous workloads, the eviction-free equilibrium is unstable: the system converges to a worst-case limit cycle with up to 50% throughput loss when decoding lengths are large relative to input lengths. For heterogeneous workloads, stability depends on whether the decoding lengths are coprime: coprime lengths stabilize the system, while non-coprime lengths cause synchronized instability. The work proposes rate-limited admission and request mixing as scheduling design principles and validates them with model-based, Vidur, and real-GPU simulations.

Key takeaway

For MLOps engineers optimizing LLM serving: continuous KV cache growth creates a dynamic memory constraint, so admission policies must anticipate future memory pressure, not just whether a request fits at the moment of admission. Avoid homogeneous workloads where possible; they are structurally unstable and can lose up to 50% of throughput. Instead, mix heterogeneous requests with coprime decoding lengths to desynchronize memory release, or use rate-limited admission to prevent eviction cascades.
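As a rough illustration of the coprimality advice, the sketch below checks a set of decoding lengths before they are mixed in one batch. The summary does not say whether the paper's condition concerns the overall greatest common divisor or every pair of lengths, so both checks are shown; the function names `common_period` and `pairwise_coprime` are hypothetical and not taken from the paper.

```python
from functools import reduce
from itertools import combinations
from math import gcd


def common_period(decode_lengths):
    """Return the greatest common divisor of all decoding lengths (in tokens).

    Assumption (not from the source): a value greater than 1 is read here as a
    sign that memory release across request classes can line up periodically.
    Requires a non-empty sequence of positive integers.
    """
    return reduce(gcd, decode_lengths)


def pairwise_coprime(decode_lengths):
    """Stricter check: every pair of decoding lengths shares no factor above 1."""
    return all(gcd(a, b) == 1 for a, b in combinations(decode_lengths, 2))


if __name__ == "__main__":
    for lengths in ([128, 192], [127, 192], [6, 10, 15]):
        print(lengths, common_period(lengths), pairwise_coprime(lengths))
```

The two checks can disagree: a set of lengths can have an overall divisor of 1 while some pairs still share a factor, so which one matters for a given scheduler should be confirmed against the original paper.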

Key insights

LLM KV cache growth creates service-induced congestion, with stability determined by workload homogeneity and decoding length coprimality.

Principles

Method

A discrete-time dynamical model captures request admission, KV cache growth, and eviction under continuous batching; it is analyzed using linear recurrences and spectral theory.
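The interaction between admission, KV cache growth, and eviction can be explored with a toy simulation. The sketch below is not the paper's model: it assumes a homogeneous workload, an unlimited backlog of waiting requests, one token of KV growth per active request per step, greedy admission based on instantaneous fit, and eviction of the most recently admitted request first. The names `Request`, `simulate`, and `max_admit_per_step` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    input_len: int    # prompt tokens, cached at admission
    decode_len: int   # tokens to generate before the request completes
    decoded: int = 0  # tokens generated so far

    @property
    def kv_tokens(self) -> int:
        # KV cache footprint grows by one token per decode step.
        return self.input_len + self.decoded


def simulate(capacity, input_len, decode_len, steps, max_admit_per_step=None):
    """Toy discrete-time loop: admit, evict if growth will not fit, decode, complete."""
    if input_len < 1 or decode_len < 1:
        raise ValueError("input_len and decode_len must be positive")
    active = []
    completed = evictions = wasted_tokens = peak = 0
    for _ in range(steps):
        used = sum(r.kv_tokens for r in active)
        # Admission: accept while the prompt fits right now (assumed greedy
        # policy), optionally capped at max_admit_per_step admissions per step.
        admitted = 0
        while used + input_len <= capacity and (
            max_admit_per_step is None or admitted < max_admit_per_step
        ):
            active.append(Request(input_len, decode_len))
            used += input_len
            admitted += 1
        # Eviction: every active request needs one more token this step.
        # Drop the newest requests (assumed policy) until the growth fits.
        while active and used + len(active) > capacity:
            victim = active.pop()
            used -= victim.kv_tokens
            wasted_tokens += victim.decoded  # decoded tokens thrown away
            evictions += 1
        # Decode: each surviving request generates one token.
        for r in active:
            r.decoded += 1
        used += len(active)
        peak = max(peak, used)
        # Completion: finished requests release their KV cache.
        completed += sum(r.decoded >= r.decode_len for r in active)
        active = [r for r in active if r.decoded < r.decode_len]
    return {"completed": completed, "evictions": evictions,
            "wasted_tokens": wasted_tokens, "peak_memory": peak}


if __name__ == "__main__":
    greedy = simulate(capacity=20, input_len=1, decode_len=5, steps=50)
    limited = simulate(capacity=20, input_len=1, decode_len=5, steps=50,
                       max_admit_per_step=1)
    print("greedy :", greedy)
    print("limited:", limited)
```

In this small configuration the greedy run repeatedly evicts requests, while capping admission at one request per step avoids evictions and completes more requests. The exact numbers depend entirely on the assumed policies and should not be read as reproducing the paper's results.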

Best for: Research Scientist, AI Architect, AI Engineer, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.