Prism: Cost-Efficient Multi-LLM Serving via GPU Memory Ballooning

2026-06-12 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Cloud Computing & IT Infrastructure · Depth: Expert, extended

Summary

Prism is a multi-LLM serving system that improves cost efficiency and Service Level Objective (SLO) attainment by sharing GPUs more effectively. Existing systems struggle with dynamic workloads and lack cross-model memory coordination. Prism addresses both problems with two core designs: on-demand memory allocation, which dynamically maps physical memory pages to virtual ones so that memory can be redistributed between models, and a two-level scheduling policy that adjusts sharing strategies to runtime demand. In evaluations on real-world traces, Prism delivers more than 2x cost savings and 3.3x higher SLO attainment than state-of-the-art baselines such as MuxServe++ and static partitioning, while supporting up to 3.5x more requests. It activates 8B models in 0.7 s and 70B models in 1.5 s.

Key takeaway

MLOps engineers who manage multi-LLM inference infrastructure should consider systems like Prism to reduce operational costs while still meeting strict latency SLOs. Its dynamic memory coordination and two-level scheduling keep GPUs from sitting underutilized during idle periods and let the system adapt to rapid workload fluctuations. On-demand memory allocation combined with SLO-aware scheduling can yield substantial cost savings and improve service reliability across diverse LLM deployments.

Key insights

Dynamic, demand-aware cross-model memory coordination is crucial for cost-efficient multi-LLM serving with strict SLOs.

Principles

Flexibly combine spatial and temporal GPU sharing.
Adjust resource allocation dynamically as the workload changes.
Prioritize resource allocation according to the differing SLOs of the models being served.

Method

Prism uses a shim layer, kvcached, for on-demand memory allocation; it decouples virtual GPU memory from physical GPU memory. On top of this, Prism employs a two-level scheduler: a global scheduler that places models on GPUs based on the KV pressure ratio, and a local scheduler that dispatches requests by priority.
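The decoupling of virtual and physical memory described above can be illustrated with a small toy model. This is a minimal sketch, not Prism's or kvcached's actual implementation: the class names (`SharedPagePool`, `ModelKVSpace`), the methods `touch` and `release`, and the page counts are all hypothetical, and plain Python lists stand in for GPU memory pages. It shows only the general idea that each model reserves a large virtual range, while physical pages are mapped on first use and returned to a shared pool when no longer needed.

```python
from dataclasses import dataclass, field


class OutOfPhysicalPages(Exception):
    """Raised when the shared pool has no free physical page left."""


@dataclass
class SharedPagePool:
    """Physical pages shared by all models on one GPU (toy stand-in)."""
    total_pages: int
    free_pages: list = field(init=False)

    def __post_init__(self):
        self.free_pages = list(range(self.total_pages))


@dataclass
class ModelKVSpace:
    """A model's virtual KV-cache range; pages are backed only on demand."""
    name: str
    virtual_pages: int  # reserved virtual range, uses no physical memory by itself
    pool: SharedPagePool
    mapping: dict = field(default_factory=dict)  # virtual page -> physical page

    def touch(self, vpage: int) -> int:
        """Back a virtual page with a physical page the first time it is used."""
        if not 0 <= vpage < self.virtual_pages:
            raise IndexError(vpage)
        if vpage not in self.mapping:
            if not self.pool.free_pages:
                raise OutOfPhysicalPages(self.name)
            self.mapping[vpage] = self.pool.free_pages.pop()
        return self.mapping[vpage]

    def release(self, vpage: int) -> None:
        """Unmap a virtual page and return its physical page to the pool."""
        ppage = self.mapping.pop(vpage, None)
        if ppage is not None:
            self.pool.free_pages.append(ppage)


# Two models reserve 8 virtual pages each, but only 4 physical pages exist.
pool = SharedPagePool(total_pages=4)
a = ModelKVSpace("model-a", virtual_pages=8, pool=pool)
b = ModelKVSpace("model-b", virtual_pages=8, pool=pool)

for v in range(3):
    a.touch(v)       # model-a bursts and takes 3 of the 4 physical pages
b.touch(0)           # model-b takes the last one
for v in range(3):
    a.release(v)     # model-a goes idle and gives its pages back
for v in range(1, 4):
    b.touch(v)       # model-b grows into the pages model-a released
```

The combined virtual reservation (16 pages) exceeds the physical pool (4 pages), which is what allows memory to move between models as their load shifts.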

In practice

Implement on-demand memory allocation for KV caches.
Use two-level scheduling for LLM placement.
Prioritize requests by TTFT SLOs.
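The scheduling steps above can be sketched in a few lines. This is an illustration under stated assumptions rather than Prism's actual policy: the summary does not define the KV pressure ratio, so the code assumes it is KV demand divided by KV capacity; the local queue uses earliest-deadline-first ordering on `arrival + TTFT SLO`, which is a swapped-in stand-in for priority-based dispatch; and all names (`Gpu`, `place`, `LocalQueue`, the model and request identifiers) are hypothetical.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class Gpu:
    name: str
    kv_capacity_gb: float
    kv_demand_gb: float = 0.0
    models: list = field(default_factory=list)

    def pressure(self, extra_gb: float = 0.0) -> float:
        # Assumed definition: KV demand divided by KV capacity.
        return (self.kv_demand_gb + extra_gb) / self.kv_capacity_gb


def place(model: str, kv_demand_gb: float, gpus: list) -> Gpu:
    """Global level: pick the GPU with the lowest pressure after placement."""
    target = min(gpus, key=lambda g: g.pressure(kv_demand_gb))
    target.kv_demand_gb += kv_demand_gb
    target.models.append(model)
    return target


class LocalQueue:
    """Local level: dispatch the request whose TTFT deadline is nearest."""

    def __init__(self):
        self._heap = []
        self._seq = 0  # tie-breaker so equal deadlines stay in arrival order

    def submit(self, request_id: str, arrival_s: float, ttft_slo_s: float) -> None:
        deadline = arrival_s + ttft_slo_s
        heapq.heappush(self._heap, (deadline, self._seq, request_id))
        self._seq += 1

    def next_request(self) -> str:
        return heapq.heappop(self._heap)[2]


# Global placement across two GPUs (hypothetical models and demands).
gpus = [Gpu("gpu0", kv_capacity_gb=40.0), Gpu("gpu1", kv_capacity_gb=40.0)]
place("chat-8b", 10.0, gpus)
place("code-8b", 30.0, gpus)
place("summarize-8b", 5.0, gpus)

# Local dispatch: tight-SLO chat requests overtake a relaxed batch request.
queue = LocalQueue()
queue.submit("batch-1", arrival_s=0.0, ttft_slo_s=5.0)
queue.submit("chat-1", arrival_s=0.2, ttft_slo_s=0.5)
queue.submit("chat-2", arrival_s=0.1, ttft_slo_s=0.5)
order = [queue.next_request() for _ in range(3)]
```

Splitting the decision this way keeps the slower, cluster-wide placement choice separate from the per-request ordering that has to run on every dispatch.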

Topics

LLM Serving
GPU Sharing
Memory Management
SLO Attainment
Dynamic Scheduling
KV Cache

Best for: AI Architect, NLP Engineer, CTO, AI Engineer, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.