Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints

2025-04-15 · Source: stat.ML updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Expert, quick

Summary

A new scheduling framework, "Fluid-Guided Online Scheduling with Memory Constraints," targets inefficient GPU scheduling during token-by-token inference, a problem tied to costs exceeding $700,000 per day for large language model providers. The core challenge is endogenous memory growth: the Key-Value (KV) cache expands with every generated token, which can force evictions and waste computation. Researchers Ruicheng Ao, Gan Luo, David Simchi-Levi, and Xinshang Wang formulate this as a multi-stage online scheduling problem and develop a fluid model that characterizes equilibrium batch composition, memory requirements, and stability regions. Guided by this model, they design two algorithms: WAIT, for requests with known output lengths, and Nested WAIT, which handles unknown output lengths by regulating how requests advance across decode stages and by reserving a moderate safety buffer. Both algorithms asymptotically approximate the fluid benchmark. Vidur simulations using Llama-2-7B on an A100 GPU, supported by real-GPU validation, show that these policies expand the stable operating range and reduce latency, particularly in near-overloaded and overloaded conditions.

Key takeaway

For AI Architects and Machine Learning Engineers optimizing LLM inference, fluid-guided online scheduling is a candidate approach for controlling KV cache memory growth. Consider WAIT when output lengths are known and Nested WAIT when they are not. In the reported simulations and real-GPU validation, these policies expanded the stable operating range and reduced inference latency, with the largest gains under near-overloaded and overloaded conditions.

Key insights

Fluid-guided online scheduling optimizes LLM inference by managing KV cache memory growth to reduce latency and expand stable operation.

Principles

Endogenous memory growth is central to LLM inference scheduling.
Fluid models can characterize optimal batch composition and stability.
Threshold-based admission rules improve GPU utilization.
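
To make the first principle concrete, the sketch below estimates how much KV cache memory a batch occupies as decoding proceeds. It is an illustration, not code from the paper. The default dimensions (32 layers, hidden size 4096, 2 bytes per value) are assumptions that match a standard Llama-2-7B configuration in fp16, and the function names are hypothetical.

```python
def kv_bytes_per_token(num_layers=32, hidden_size=4096, bytes_per_value=2):
    """KV cache bytes added per token held in memory.

    Defaults are assumed values for Llama-2-7B in fp16.
    The factor 2 covers one key vector and one value vector per layer.
    """
    return 2 * num_layers * hidden_size * bytes_per_value


def batch_kv_bytes(prompt_lens, tokens_decoded, per_token=None):
    """Total KV cache bytes for a batch.

    Each request holds its prompt tokens plus every token decoded so far,
    so the footprint grows by one token per request per decode step.
    """
    if per_token is None:
        per_token = kv_bytes_per_token()
    return sum((p + d) * per_token for p, d in zip(prompt_lens, tokens_decoded))
```

Under these assumptions each token costs 0.5 MiB of cache, so a batch of two requests with 100-token and 200-token prompts starts at about 150 MiB and gains 1 MiB per decode step. That growth is what makes the memory constraint depend on the scheduler's own past decisions.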

Method

Formulate LLM inference as a multi-stage online scheduling problem with KV-cache constraints. Design a fluid model to guide WAIT (known output lengths) and Nested WAIT (unknown output lengths with safety buffer) algorithms.

In practice

Implement WAIT for LLM inference with known output lengths.
Use Nested WAIT when output lengths are unknown at arrival.
Apply a safety buffer to mitigate KV cache overflow risks.
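
The practices above can be illustrated with a simplified threshold-style admission check. This is a minimal sketch under stated assumptions, not the paper's WAIT or Nested WAIT algorithm: the paper derives its thresholds from the fluid model, whereas here `threshold`, `capacity_tokens`, and `safety_buffer_tokens` are hypothetical inputs, and the `Request` type and `admit_batch` function are invented for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int
    max_output_tokens: int  # known length, or an upper bound if unknown


def peak_tokens(req):
    """Largest number of KV cache tokens the request can ever hold."""
    return req.prompt_tokens + req.max_output_tokens


def admit_batch(queue, running_peak_tokens, capacity_tokens,
                threshold, safety_buffer_tokens=0):
    """Return the requests to start now, or [] to keep waiting.

    Waits until at least `threshold` requests are queued, then admits
    requests in FIFO order while their peak footprint fits in the memory
    left after subtracting running requests and the safety buffer.
    """
    if len(queue) < threshold:
        return []
    budget = capacity_tokens - safety_buffer_tokens - running_peak_tokens
    admitted = []
    while queue and peak_tokens(queue[0]) <= budget:
        req = queue.popleft()
        budget -= peak_tokens(req)
        admitted.append(req)
    return admitted
```

Budgeting by peak footprint rather than current footprint is a conservative design choice in this sketch: it avoids mid-decode evictions at the cost of some idle memory, and the safety buffer absorbs error when `max_output_tokens` is only an estimate.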

Topics

LLM Inference Optimization
GPU Scheduling
Key-Value Cache
Fluid Models
Online Scheduling Algorithms
Llama-2-7B

Best for: MLOps Engineer, Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by stat.ML updates on arXiv.org.