CrossPool: Efficient Multi-LLM Serving for Cold MoE Models through KV-Cache and Weight Disaggregation
Summary
CrossPool is a serving engine designed to host multiple sparse Mixture-of-Experts (MoE) models efficiently, particularly "cold" models that receive infrequent requests, by addressing GPU memory inefficiencies. Traditional systems struggle because static model weights compete with transient KV-cache demand for the same GPU memory, leading to low GPU utilization and poor long-context support. CrossPool tackles this by disaggregating FFN weights and KV-cache into two distinct GPU memory pools: a weights pool that consolidates FFN weights across cold models, and a dynamic KV-cache pool. It integrates a KV-cache planner and virtualizer, a layer-wise pipeline scheduler that hides hidden-state transfers, and persistent kernels with control lowering. This architecture enables efficient GPU memory pooling, supports bursty long-context requests, and outperforms kvcached-based multi-LLM serving systems, reducing P99 time-between-tokens (TBT) latency by up to $10.4\times$.
Key takeaway
For MLOps engineers managing multi-LLM serving infrastructure with sparse MoE models, consider architectures that disaggregate KV-cache and model weights. Monolithic per-model GPU memory allocations tend to waste resources on cold models, because static weights stay resident while their KV-cache reservation sits mostly idle. A system like CrossPool, which pools FFN weights and manages KV-cache dynamically in a separate pool, can improve GPU memory utilization and reduce P99 time between tokens (TBT) by up to $10.4\times$ compared with kvcached-based serving, especially for bursty long-context requests. Evaluate disaggregated memory pooling to optimize your serving costs and performance.
Key insights
Disaggregating KV-cache and FFN weights into separate GPU pools optimizes multi-LLM serving for cold MoE models.
Principles
- Separate static weights from dynamic KV-cache for efficiency.
- Pool KV-cache globally for aggregate active demand.
- Keep attention computation local to the KV-cache it reads for better utilization.
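The second principle can be illustrated with a small sizing calculation. The sketch below is not taken from CrossPool; the demand trace, model names, and block counts are hypothetical, and it only assumes that cold models rarely peak at the same time. It compares reserving each model's own peak KV-cache against sizing one shared pool for the peak of the summed demand.

```python
def static_reservation(peak_blocks_per_model):
    """Blocks needed when every model reserves its own peak KV-cache."""
    return sum(peak_blocks_per_model.values())


def pooled_reservation(demand_trace):
    """Blocks needed when one pool serves all models: peak of the summed demand."""
    return max(sum(step.values()) for step in demand_trace)


# Hypothetical KV-cache demand (in blocks) per time step for three cold models.
trace = [
    {"model_a": 800, "model_b": 0, "model_c": 50},
    {"model_a": 0, "model_b": 900, "model_c": 0},
    {"model_a": 100, "model_b": 0, "model_c": 700},
]

# Per-model peak over the whole trace.
peaks = {name: max(step[name] for step in trace) for name in trace[0]}

static_total = static_reservation(peaks)
pooled_total = pooled_reservation(trace)
```

With this made-up trace, the per-model reservations add up to 2400 blocks while the shared pool needs 900, because the bursts do not coincide. If bursts did coincide, the two numbers would converge.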
Method
CrossPool uses a KV-cache planner/virtualizer, a layer-wise pipeline scheduler, and persistent kernels to manage disaggregated weight and KV-cache pools.
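To see why a layer-wise pipeline can hide hidden-state transfers, consider a toy timing model. This is a simplified sketch, not CrossPool's scheduler: it assumes attention runs on a KV-cache pool GPU, the FFN runs on a weights pool GPU, the link between them is full-duplex, and a batch can be split into equal micro-batches. The function name, stage names, and durations are all hypothetical.

```python
def simulate(num_layers, num_microbatches, t_attn, t_xfer, t_ffn):
    """Return the finish time of a greedy layer-wise schedule.

    Each layer runs four stages in order, each on its own resource:
    attention, hidden-state transfer out, FFN, hidden-state transfer back.
    A stage starts once its micro-batch is ready and its resource is free.
    """
    stages = [
        ("attn_gpu", t_attn),
        ("link_out", t_xfer),
        ("ffn_gpu", t_ffn),
        ("link_back", t_xfer),
    ]
    resource_free = {name: 0.0 for name, _ in stages}
    ready = [0.0] * num_microbatches  # time each micro-batch finished its last stage

    for _ in range(num_layers):
        for resource, duration in stages:
            for mb in range(num_microbatches):
                start = max(ready[mb], resource_free[resource])
                ready[mb] = start + duration
                resource_free[resource] = ready[mb]
    return max(ready)


# One full batch, no overlap: every stage waits for the previous one.
serial_time = simulate(num_layers=4, num_microbatches=1,
                       t_attn=2.0, t_xfer=2.0, t_ffn=2.0)

# Same work split into two micro-batches that interleave across resources.
pipelined_time = simulate(num_layers=4, num_microbatches=2,
                          t_attn=1.0, t_xfer=1.0, t_ffn=1.0)

# Lower bound for the serial case if transfers were free.
compute_only_time = 4 * (2.0 + 2.0)
```

In this toy setting the pipelined schedule finishes close to the compute-only time, because one micro-batch's transfer overlaps with the other's attention or FFN. Real overlap depends on link bandwidth, kernel launch costs, and how evenly the stages are balanced.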
In practice
- Implement separate GPU memory pools for weights and KV-cache.
- Design a scheduler to hide hidden-state transfer overheads.
- Utilize persistent kernels to reduce CPU-GPU control overhead.
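The first item above can be sketched as a block allocator shared by all models. This is a minimal illustration under stated assumptions, not CrossPool's KV-cache planner or virtualizer: the class name `KVBlockPool`, its methods, and the fixed-size block abstraction are hypothetical, and the weights pool is assumed to live in a separate, statically sized region that this allocator never touches.

```python
class KVBlockPool:
    """Shared pool of fixed-size KV-cache blocks, handed out on demand."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.owned = {}  # (model, request_id) -> list of block ids

    def allocate(self, model, request_id, num_blocks):
        """Grant blocks to a request; return False if the pool cannot cover it."""
        if num_blocks > len(self.free_blocks):
            return False
        blocks = [self.free_blocks.pop() for _ in range(num_blocks)]
        self.owned.setdefault((model, request_id), []).extend(blocks)
        return True

    def release(self, model, request_id):
        """Return all blocks held by a finished request to the pool."""
        self.free_blocks.extend(self.owned.pop((model, request_id), []))


pool = KVBlockPool(num_blocks=10)
first_ok = pool.allocate("model_a", "req-1", 6)    # long-context burst on model_a
second_ok = pool.allocate("model_b", "req-2", 5)   # does not fit yet
pool.release("model_a", "req-1")                   # burst ends, blocks return
retry_ok = pool.allocate("model_b", "req-2", 5)    # now model_b can use them
```

Because blocks are not tied to a model, capacity freed by one model's finished request is immediately usable by another, which is the behavior a per-model static reservation cannot provide.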
Topics
- MoE Models
- LLM Serving
- KV-Cache Optimization
- Weight Disaggregation
- GPU Memory Pooling
- Performance Optimization
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.