CrossPool: Efficient Multi-LLM Serving for Cold MoE Models through KV-Cache and Weight Disaggregation

2026-06-23 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Expert, quick

Summary

CrossPool is a serving engine for hosting multiple sparse Mixture-of-Experts (MoE) models, particularly "cold" models that receive infrequent requests. Traditional systems struggle because static model weights compete with transient KV-cache demand for GPU memory, leading to low GPU utilization and poor long-context support. CrossPool disaggregates FFN weights and KV-cache into two distinct GPU memory pools: a weights pool that consolidates FFN weights across cold models, and a dynamic KV-cache pool. It combines a KV-cache planner and virtualizer, a layer-wise pipeline scheduler that hides hidden-state transfers, and persistent kernels with control lowering. This architecture enables efficient GPU memory pooling, supports bursty long-context requests, and outperforms kvcached-based multi-LLM serving systems, reducing P99 time between tokens (TBT) by up to $10.4\times$.

Key takeaway

For MLOps engineers managing multi-LLM serving infrastructure with sparse MoE models: consider architectures that disaggregate KV-cache and model weights. Monolithic GPU memory pools likely waste resources on cold models. A system like CrossPool, which pools FFN weights across models and manages KV-cache dynamically in a separate pool, can improve GPU memory utilization and reduce P99 time between tokens (TBT) by up to $10.4\times$ relative to kvcached-based serving, especially for bursty long-context requests. Evaluate disaggregated memory pooling to optimize serving cost and performance.

Key insights

Disaggregating KV-cache and FFN weights into separate GPU pools optimizes multi-LLM serving for cold MoE models.

Principles

Separate static weights from dynamic KV-cache for efficiency.
Pool KV-cache globally for aggregate active demand.
Keep attention computation local to the KV-cache pool for better utilization.
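
To make the second principle concrete, here is a minimal back-of-the-envelope sketch comparing per-model KV-cache reservations with one shared pool sized for aggregate active demand. The model names, peak sizes, and the `max_concurrent_bursts` parameter are hypothetical illustrations, not figures or interfaces from the paper.

```python
from dataclasses import dataclass


@dataclass
class ColdModel:
    name: str
    peak_kv_gib: float  # hypothetical worst-case KV-cache for one long-context burst


def static_reservation_gib(models):
    """Per-model reservation: every model holds its own peak, busy or idle."""
    return sum(m.peak_kv_gib for m in models)


def pooled_reservation_gib(models, max_concurrent_bursts):
    """Shared pool sized for the largest peaks that can be active at once."""
    peaks = sorted((m.peak_kv_gib for m in models), reverse=True)
    return sum(peaks[:max_concurrent_bursts])


# Illustrative numbers only, not measurements from the paper.
MODELS = [
    ColdModel("moe-a", 40.0),
    ColdModel("moe-b", 24.0),
    ColdModel("moe-c", 16.0),
    ColdModel("moe-d", 16.0),
]

STATIC_GIB = static_reservation_gib(MODELS)     # 96.0
POOLED_GIB = pooled_reservation_gib(MODELS, 2)  # 64.0
```

Under the assumption that at most two cold models burst at the same time, the shared pool needs 64 GiB instead of 96 GiB. The saving shrinks as more models become active concurrently and vanishes when all of them peak together.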

Method

CrossPool uses a KV-cache planner/virtualizer, a layer-wise pipeline scheduler, and persistent kernels to manage disaggregated weight and KV-cache pools.
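
The idea of hiding hidden-state transfers with a layer-wise pipeline can be illustrated with a toy timeline model. This is a sketch under stated assumptions, not CrossPool's scheduler: it assumes each layer runs attention on the KV-cache pool, ships hidden states to the weights pool, runs the FFN there, and ships the result back, with each of those four stages modeled as an independent resource. The function names and stage durations are hypothetical.

```python
def pipelined_makespan(num_layers, num_microbatches, attn, ffn, xfer):
    """Finish time when micro-batches interleave across the four stages."""
    # Assumed stage order per layer; each stage is its own exclusive resource.
    stages = [("attn", attn), ("to_weights", xfer), ("ffn", ffn), ("to_kv", xfer)]
    resource_free = {name: 0 for name, _ in stages}
    batch_ready = [0] * num_microbatches
    for _layer in range(num_layers):
        for mb in range(num_microbatches):
            for name, duration in stages:
                # A stage starts once its input is ready and its resource is idle.
                start = max(batch_ready[mb], resource_free[name])
                end = start + duration
                resource_free[name] = end
                batch_ready[mb] = end
    return max(batch_ready, default=0)


def serial_makespan(num_layers, num_microbatches, attn, ffn, xfer):
    """Finish time when nothing overlaps: every stage runs back to back."""
    return num_layers * num_microbatches * (attn + ffn + 2 * xfer)


# Hypothetical durations in arbitrary time units.
PIPELINED = pipelined_makespan(3, 2, attn=2, ffn=2, xfer=1)  # 20
SERIAL = serial_makespan(3, 2, attn=2, ffn=2, xfer=1)        # 36
```

With two micro-batches, one can compute attention while the other is in transfer or in the FFN, so most of the transfer time overlaps with compute. With a single micro-batch there is nothing to overlap and both functions return the same value.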

In practice

Implement separate GPU memory pools for weights and KV-cache.
Design a scheduler to hide hidden-state transfer overheads.
Utilize persistent kernels to reduce CPU-GPU control overhead.
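
As a rough illustration of the first item, the sketch below models a KV-cache pool shared by several models as a simple free list of fixed-size blocks. It is a host-side toy in plain Python, assuming block-granular allocation; the `SharedKVPool` class and its methods are hypothetical and do not reflect CrossPool's planner or virtualizer, which the summary does not describe in detail.

```python
class SharedKVPool:
    """Toy block allocator: all models draw KV-cache blocks from one pool."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.owned = {}  # (model, request_id) -> list of block ids

    def allocate(self, model, request_id, n_blocks):
        """Grant n_blocks to a request, or refuse if the pool is too full."""
        if n_blocks > len(self.free_blocks):
            return False
        blocks = [self.free_blocks.pop() for _ in range(n_blocks)]
        self.owned.setdefault((model, request_id), []).extend(blocks)
        return True

    def release(self, model, request_id):
        """Return a finished request's blocks to the pool for any model to use."""
        blocks = self.owned.pop((model, request_id), [])
        self.free_blocks.extend(blocks)
        return len(blocks)

    def free_count(self):
        return len(self.free_blocks)


# Hypothetical usage: a burst on one model borrows most of the pool,
# and another model reuses the same blocks once the burst finishes.
pool = SharedKVPool(num_blocks=8)
burst_ok = pool.allocate("moe-a", "req-1", 6)
blocked = pool.allocate("moe-b", "req-2", 4)
returned = pool.release("moe-a", "req-1")
retry_ok = pool.allocate("moe-b", "req-2", 4)
```

The point of the design is that no block belongs to a model permanently: capacity follows whichever model is active, which is what lets a bursty long-context request use memory that idle models would otherwise hold.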

Topics

MoE Models
LLM Serving
KV-Cache Optimization
Weight Disaggregation
GPU Memory Pooling
Performance Optimization

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.