Towards Multi-Model LLM Schedulers: Empirical Insights into Offloading and Preemption

Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Cloud Computing & IT Infrastructure · Depth: Expert, extended

Summary

An empirical study examines the challenges of scheduling multiple Large Language Models (LLMs) on shared, heterogeneous hardware, focusing on CPU-GPU offloading and preemption. The experiments ran on NVIDIA RTX 5000 Ada Generation (32 GB VRAM) and RTX A6000 (48 GB VRAM) GPUs paired with an AMD Threadripper PRO 5995WX CPU. The offloading experiments used Llama 3 8B, Qwen3-32B, and Llama 2 70B in Q4 format; the preemption experiments used Qwen2.5-3B, Qwen3-8B, and Qwen2.5-14B in FP16. Offloading degrades decode throughput in a non-linear, model-dependent way, with smaller models showing sharper sensitivity. Preemption overhead ranges from approximately 2.6 s to 7.3 s and is dominated by model weight reload (over 98.5%) rather than KV cache transfer; the cost is largely constant regardless of the preemption point. The study identifies features that next-generation schedulers should take into account: model-specific offloading sensitivity, workload characteristics, and hardware-aware preemption cost structures.
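
To make the overhead decomposition concrete, the short sketch below works out what the quoted figures imply for the part of the preemption cost that is not weight reload. It assumes the 98.5% reload share applies at both ends of the 2.6 s to 7.3 s range, which the summary does not state explicitly; the function name is illustrative.

```python
# Bounds implied by the summary's figures: total preemption overhead of roughly
# 2.6 s to 7.3 s, with weight reload accounting for more than 98.5% of it.
RELOAD_SHARE_MIN = 0.985  # lower bound on the reload share, taken from the summary


def non_reload_upper_bound(total_overhead_s: float,
                           reload_share_min: float = RELOAD_SHARE_MIN) -> float:
    """Upper bound, in seconds, on everything that is not weight reload
    (KV cache transfer plus any other bookkeeping)."""
    if not 0.0 <= reload_share_min <= 1.0:
        raise ValueError("reload_share_min must be in [0, 1]")
    return total_overhead_s * (1.0 - reload_share_min)


# Assumption: the same share holds at both ends of the reported range.
for total_s in (2.6, 7.3):
    bound_s = non_reload_upper_bound(total_s)
    print(f"total {total_s:.1f} s -> non-reload part below {bound_s:.3f} s")
```

Under that assumption, KV cache transfer and all other non-reload work stay below roughly 0.04 s and 0.11 s respectively, which is why a scheduler can largely ignore KV cache size when estimating preemption cost.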

Key takeaway

If you are an AI architect designing a multi-model LLM serving system, prioritize hardware-aware scheduling that accounts for model-specific offloading sensitivity. Treat the preemption cost as a fixed penalty per model and hardware pair, because model weight reload, not KV cache transfer, dominates the overhead. This makes resource allocation more predictable and helps manage heterogeneous workloads efficiently, even under memory constraints.
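
One way to picture the fixed-penalty idea is a lookup table keyed by model and hardware, consulted by a simple preemption rule. The sketch below is illustrative only: the model and GPU names, the penalty values (placeholders picked inside the reported 2.6 s to 7.3 s range), and the decision rule itself are assumptions, not the policy proposed in the paper.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Hypothetical reload penalties in seconds per (model, GPU) pair. Placeholder
# values inside the 2.6 s to 7.3 s range from the summary, not measurements.
PREEMPTION_PENALTY_S: Dict[Tuple[str, str], float] = {
    ("small-model", "gpu-a"): 2.6,
    ("large-model", "gpu-a"): 7.3,
}


@dataclass(frozen=True)
class RunningJob:
    model: str
    gpu: str
    remaining_s: float  # estimated time until the job would finish on its own


def should_preempt(job: RunningJob,
                   penalties: Dict[Tuple[str, str], float] = PREEMPTION_PENALTY_S) -> bool:
    """Illustrative rule: preempt only when the wait saved for the incoming
    request (the victim's remaining time) exceeds the victim's fixed penalty.

    No progress or KV cache size argument is needed, because the penalty is
    treated as constant regardless of where the job is preempted.
    """
    penalty_s = penalties[(job.model, job.gpu)]
    return job.remaining_s > penalty_s


print(should_preempt(RunningJob("small-model", "gpu-a", remaining_s=10.0)))
print(should_preempt(RunningJob("large-model", "gpu-a", remaining_s=5.0)))
```

Because the penalty depends only on the pair, the table can be filled once by profiling each model on each GPU and then reused for every scheduling decision.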

Key insights

Multi-model LLM schedulers must account for non-linear offloading sensitivity and model reload-dominated preemption costs.

Method

The study evaluated LLM behavior empirically in two ways: it swept the number of layers allocated to the GPU to measure how offloading affects decode throughput, and it instrumented the preempt-resume cycle to decompose the preemption overhead. Both measurements were repeated across several models and hardware configurations.
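
A layer-allocation sweep of this kind can be sketched as a small harness. This is a minimal sketch, not the study's tooling: `measure_decode_tps` is a hypothetical callable that would wrap a real benchmark run, and `fake_measure` is a synthetic stand-in used only to make the example runnable.

```python
from typing import Callable, List, Tuple

Curve = List[Tuple[int, float]]


def sweep_gpu_layers(measure_decode_tps: Callable[[int], float],
                     total_layers: int,
                     step: int = 4) -> Curve:
    """Measure decode throughput for GPU layer counts from total_layers down to 0."""
    if total_layers < 0 or step <= 0:
        raise ValueError("total_layers must be >= 0 and step must be > 0")
    counts = list(range(total_layers, -1, -step))
    if counts[-1] != 0:
        counts.append(0)  # always include the fully offloaded configuration
    return [(n, measure_decode_tps(n)) for n in counts]


def relative_throughput(results: Curve) -> Curve:
    """Normalise each point by the all-layers-on-GPU throughput (first entry)."""
    full_gpu_tps = results[0][1]
    if full_gpu_tps <= 0:
        raise ValueError("full-GPU throughput must be positive")
    return [(n, tps / full_gpu_tps) for n, tps in results]


def fake_measure(n_gpu_layers: int) -> float:
    """Synthetic, deliberately non-linear curve; NOT data from the study."""
    return 5.0 + 45.0 * (n_gpu_layers / 32) ** 3


curve = relative_throughput(sweep_gpu_layers(fake_measure, total_layers=32, step=8))
for n_layers, fraction in curve:
    print(f"{n_layers:2d} layers on GPU -> {fraction:.2f} of full-GPU throughput")
```

Normalising by the full-GPU point makes curves from different models comparable, which is what a scheduler would need in order to rank models by offloading sensitivity.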

Best for: MLOps Engineer, AI Scientist, Research Scientist, Machine Learning Engineer, AI Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.