Prism: Cost-Efficient Multi-LLM Serving via GPU Memory Ballooning
Summary
Prism is a multi-LLM serving system that improves cost efficiency and Service Level Objective (SLO) attainment through better GPU sharing. Existing systems struggle with dynamic workloads and lack cross-model memory coordination. Prism addresses both with two core designs: on-demand memory allocation, which maps physical memory pages to virtual ones at runtime so memory can be redistributed across models, and a two-level scheduling policy that adjusts sharing strategies to runtime demand. In evaluations on real-world traces, Prism achieves more than 2x cost savings and 3.3x higher SLO attainment than baselines such as MuxServe++ and static partitioning, and supports up to 3.5x more requests. It activates 8B models in 0.7 s and 70B models in 1.5 s.
Key takeaway
MLOps engineers who manage multi-LLM inference infrastructure should consider systems like Prism to reduce operational costs while still meeting strict latency SLOs. Its dynamic memory coordination and two-level scheduling keep GPUs from sitting underutilized during idle periods and adapt to rapid workload fluctuations. On-demand memory allocation and SLO-aware scheduling can yield substantial cost savings and improve service reliability across diverse LLM deployments.
Key insights
Dynamic, demand-aware cross-model memory coordination is crucial for cost-efficient multi-LLM serving with strict SLOs.
Principles
- Combine space and time GPU sharing flexibly.
- Dynamically adjust resource allocation to workload.
- Prioritize resource allocation according to each model's SLO requirements.
Method
Prism uses a kvcached shim layer for on-demand memory allocation, decoupling virtual and physical GPU memory. It employs a two-level scheduler: a global scheduler for model-to-GPU placement using KV pressure ratio, and a local scheduler for priority-based request dispatch.
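The idea of decoupling virtual from physical GPU memory can be illustrated with a small simulation. This is a minimal sketch, not Prism's or kvcached's actual API: the class names `SharedPagePool` and `ModelKVSpace`, the page counts, and the methods are all hypothetical, and plain Python lists stand in for real GPU virtual-memory calls. Each model reserves a large virtual range that costs nothing, and physical pages are attached only when a request needs them and released when the model goes idle.

```python
from dataclasses import dataclass, field


class SharedPagePool:
    """Toy stand-in for one GPU's physical memory, shared by several models."""

    def __init__(self, total_pages: int) -> None:
        self.free_pages = list(range(total_pages))


@dataclass
class ModelKVSpace:
    """A model's virtual KV-cache range; pages are backed only when used."""

    name: str
    virtual_pages: int  # reserved address range; consumes no physical memory
    mapping: dict = field(default_factory=dict)  # virtual page -> physical page

    def map_page(self, pool: SharedPagePool, vpage: int) -> bool:
        """Back one virtual page with a free physical page, if any remains."""
        if not 0 <= vpage < self.virtual_pages:
            raise IndexError(f"virtual page {vpage} is outside the reservation")
        if vpage in self.mapping:
            return True  # already backed
        if not pool.free_pages:
            return False  # pool exhausted; caller must wait or reclaim
        self.mapping[vpage] = pool.free_pages.pop()
        return True

    def unmap_all(self, pool: SharedPagePool) -> int:
        """Return every physical page to the pool, e.g. when the model idles."""
        released = len(self.mapping)
        pool.free_pages.extend(self.mapping.values())
        self.mapping.clear()
        return released


# Two models share a 4-page pool, yet each reserves 8 virtual pages.
pool = SharedPagePool(total_pages=4)
model_a = ModelKVSpace("model-a", virtual_pages=8)
model_b = ModelKVSpace("model-b", virtual_pages=8)

a_ok = [model_a.map_page(pool, v) for v in range(3)]  # model-a takes 3 pages
b_first = model_b.map_page(pool, 0)    # model-b takes the last free page
b_blocked = model_b.map_page(pool, 1)  # nothing left to back this page
released = model_a.unmap_all(pool)     # model-a idles and releases its pages
b_retry = model_b.map_page(pool, 1)    # the same request now succeeds
```

The point of the sketch is that the combined virtual reservations (16 pages) exceed physical capacity (4 pages), so memory freed by an idle model becomes immediately usable by a busy one without restarting either.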
In practice
- Implement on-demand memory allocation for KV caches.
- Use two-level scheduling for LLM placement.
- Prioritize requests by TTFT SLOs.
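The two scheduling levels named above can be sketched as follows. Several things here are assumptions rather than details from the summary: the KV pressure ratio is taken to be expected KV-cache demand divided by the GPU memory available for KV cache, placement is a simple greedy heuristic, and the local priority rule is earliest TTFT deadline first. All class, function, and field names are hypothetical.

```python
import heapq
from dataclasses import dataclass


@dataclass
class GPU:
    name: str
    kv_capacity_gb: float      # memory available for KV cache
    kv_demand_gb: float = 0.0  # summed expected KV demand of placed models

    def pressure_if_added(self, demand_gb: float) -> float:
        """Assumed KV pressure ratio: demand / capacity after adding a model."""
        return (self.kv_demand_gb + demand_gb) / self.kv_capacity_gb


def place_models(model_demands: dict, gpus: list) -> dict:
    """Global level: greedily place each model on the GPU whose resulting
    pressure ratio is lowest, handling the most demanding models first."""
    placement = {}
    ordered = sorted(model_demands.items(), key=lambda item: -item[1])
    for model, demand_gb in ordered:
        target = min(gpus, key=lambda gpu: gpu.pressure_if_added(demand_gb))
        target.kv_demand_gb += demand_gb
        placement[model] = target.name
    return placement


class LocalScheduler:
    """Local level: dispatch the request with the earliest TTFT deadline."""

    def __init__(self) -> None:
        self._heap = []
        self._seq = 0  # tie-breaker that keeps submission order stable

    def submit(self, request_id: str, arrival_s: float, ttft_slo_s: float) -> None:
        deadline_s = arrival_s + ttft_slo_s
        heapq.heappush(self._heap, (deadline_s, self._seq, request_id))
        self._seq += 1

    def next_request(self):
        """Pop the most urgent request, or return None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


gpus = [GPU("gpu0", kv_capacity_gb=40.0), GPU("gpu1", kv_capacity_gb=40.0)]
placement = place_models({"model-a": 30.0, "model-b": 20.0, "model-c": 10.0}, gpus)

local = LocalScheduler()
local.submit("chat-1", arrival_s=0.0, ttft_slo_s=0.5)
local.submit("batch-1", arrival_s=0.0, ttft_slo_s=5.0)
local.submit("chat-2", arrival_s=0.2, ttft_slo_s=0.2)
dispatch_order = [local.next_request() for _ in range(3)]
```

In this example the later-arriving `chat-2` is dispatched ahead of `chat-1` because its deadline falls sooner, while the loose-SLO batch request waits. A production scheduler would also need to re-run placement as demand shifts, which this sketch omits.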
Topics
- LLM Serving
- GPU Sharing
- Memory Management
- SLO Attainment
- Dynamic Scheduling
- KV Cache
Best for: AI Architect, NLP Engineer, CTO, AI Engineer, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.