Deploying Disaggregated LLM Inference Workloads on Kubernetes
Summary
Disaggregated serving for large language models (LLMs) splits the inference pipeline into distinct stages such as prefill, decode, and routing, each running as an independent, scalable service. In traditional aggregated serving, by contrast, a single process handles the entire inference lifecycle. Because prefill is compute-intensive while decode is memory-bandwidth-bound, one process on one GPU configuration cannot suit both phases, and GPUs end up underutilized. Disaggregation lets each stage be matched to appropriate GPU resources and scaled independently, which improves utilization. Orchestrating this on Kubernetes requires advanced scheduling capabilities: gang scheduling, hierarchical gang scheduling, and topology-aware placement, which schedulers such as KAI Scheduler provide. Higher-level abstractions such as LeaderWorkerSet (LWS) and NVIDIA Grove let these multi-role inference applications be expressed declaratively, so that their components are deployed and scaled in a coordinated way.
Key takeaway
MLOps engineers deploying LLM inference should consider disaggregated serving architectures to improve GPU utilization and scaling flexibility. Evaluate which approach better suits your operational model: managing a separate LeaderWorkerSet resource for each stage, or using NVIDIA Grove's integrated PodCliqueSet API. The choice matters most when scaling and topology requirements must be coordinated across the prefill, decode, and router components.
Key insights
Disaggregated LLM inference optimizes resource use and scaling by separating prefill, decode, and routing stages.
Principles
- Match GPU resources to stage-specific needs.
- Scale inference stages independently based on demand.
- Keep each stage's bottleneck resource saturated (compute for prefill, memory bandwidth for decode) to raise GPU utilization.
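The principles above can be illustrated with a back-of-the-envelope sizing calculation. This is a minimal sketch, not a method from the original article: the helper names (`StageCapacity`, `replicas_needed`, `size_stages`) and every throughput figure are hypothetical, chosen only to show that prefill and decode replica counts come out differently when each stage is sized against its own load.

```python
import math
from dataclasses import dataclass


@dataclass
class StageCapacity:
    """Sustained throughput of ONE replica of a stage (hypothetical figure)."""
    tokens_per_second: float


def replicas_needed(token_rate: float, capacity: StageCapacity) -> int:
    """Smallest replica count whose combined throughput covers token_rate."""
    return max(1, math.ceil(token_rate / capacity.tokens_per_second))


def size_stages(requests_per_second: float,
                prompt_tokens: int,
                output_tokens: int,
                prefill: StageCapacity,
                decode: StageCapacity) -> dict:
    """Size prefill and decode independently from the same request stream."""
    # Prefill processes every prompt token; decode generates every output token.
    prefill_rate = requests_per_second * prompt_tokens
    decode_rate = requests_per_second * output_tokens
    return {
        "prefill": replicas_needed(prefill_rate, prefill),
        "decode": replicas_needed(decode_rate, decode),
    }


# Assumed workload: 20 req/s, 2000-token prompts, 300-token completions.
plan = size_stages(
    requests_per_second=20,
    prompt_tokens=2000,
    output_tokens=300,
    prefill=StageCapacity(tokens_per_second=25_000),  # assumption
    decode=StageCapacity(tokens_per_second=1_500),    # assumption
)
print(plan)  # {'prefill': 2, 'decode': 4}
```

In an aggregated deployment both numbers would be forced to be equal, so one stage would be over-provisioned whenever the prompt-to-output ratio shifts.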
Method
Deploy disaggregated LLM inference on Kubernetes by using the LeaderWorkerSet or NVIDIA Grove APIs to define roles, scaling, and topology constraints, and rely on an advanced scheduler such as KAI Scheduler for gang-scheduled, topology-aware placement.
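As a rough illustration of the "one LeaderWorkerSet per role" option, the sketch below builds one manifest per stage as a Python dictionary and prints it as JSON, which `kubectl apply -f -` accepts. The top-level field names follow the LeaderWorkerSet API as commonly documented (`replicas`, `leaderWorkerTemplate.size`, `workerTemplate`) and should be checked against the CRD version installed in your cluster. The function name `lws_manifest`, the resource names, the image, and the replica and GPU counts are all hypothetical.

```python
import json


def lws_manifest(role: str, replicas: int, group_size: int,
                 image: str, gpus_per_pod: int) -> dict:
    """Build a LeaderWorkerSet manifest for one inference role.

    replicas   -> number of leader/worker groups (scaled per role)
    group_size -> pods per group, e.g. one multi-node tensor-parallel instance
    """
    pod_template = {
        "spec": {
            "containers": [{
                "name": role,
                "image": image,  # hypothetical image reference
                "resources": {"limits": {"nvidia.com/gpu": gpus_per_pod}},
            }]
        }
    }
    return {
        "apiVersion": "leaderworkerset.x-k8s.io/v1",
        "kind": "LeaderWorkerSet",
        "metadata": {"name": f"llm-{role}"},
        "spec": {
            "replicas": replicas,
            "leaderWorkerTemplate": {
                "size": group_size,
                "workerTemplate": pod_template,
            },
        },
    }


# Each role gets its own resource, so each can be scaled on its own.
# All counts below are assumptions for illustration.
manifests = [
    lws_manifest("prefill", replicas=2, group_size=1,
                 image="example.com/llm-server:latest", gpus_per_pod=8),
    lws_manifest("decode", replicas=4, group_size=2,
                 image="example.com/llm-server:latest", gpus_per_pod=8),
]

for manifest in manifests:
    print(json.dumps(manifest, indent=2))
```

With separate resources, nothing in the manifests ties the prefill and decode groups together. Cross-role coordination of that kind is what an integrated API such as Grove's PodCliqueSet is meant to express.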
In practice
- Use LWS for independent role management.
- Employ Grove for integrated cross-role coordination.
- Configure `topologyConstraint` for rack-level colocation.
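To make gang scheduling and rack-level colocation concrete, here is a toy placement model. It is a simplified sketch and not the algorithm used by KAI Scheduler or any other real scheduler: the `Cluster` structure, the `place_gang` function, and the rack and node names are hypothetical. It shows the two properties that matter: either every pod in the group is placed or none is, and all pods in the group land in the same rack.

```python
from typing import Dict, List, Optional

# rack name -> node name -> free GPUs on that node
Cluster = Dict[str, Dict[str, int]]


def place_gang(cluster: Cluster, pods: List[str],
               gpus_per_pod: int) -> Optional[Dict[str, str]]:
    """Place ALL pods inside a single rack, or place none of them.

    Returns a pod -> "rack/node" mapping, or None if no single rack can
    host the whole gang. The input cluster is not modified.
    """
    for rack, nodes in cluster.items():
        free = dict(nodes)  # work on a copy so a failed attempt leaves no trace
        assignment: Dict[str, str] = {}
        for pod in pods:
            # First fit: any node in this rack with enough free GPUs.
            node = next((n for n, g in free.items() if g >= gpus_per_pod), None)
            if node is None:
                break  # this rack cannot host the full gang; try the next
            free[node] -= gpus_per_pod
            assignment[pod] = node
        else:
            return {pod: f"{rack}/{node}" for pod, node in assignment.items()}
    return None


cluster = {
    "rack-a": {"node-1": 4, "node-2": 2},
    "rack-b": {"node-3": 8, "node-4": 8},
}

# Four 4-GPU pods fit only in rack-b, so the whole gang lands there.
fits = place_gang(cluster, [f"decode-{i}" for i in range(4)], gpus_per_pod=4)
print(fits)

# Five such pods would fit cluster-wide (1 in rack-a, 4 in rack-b) but not
# within one rack, so the all-or-nothing, same-rack rule places nothing.
too_big = place_gang(cluster, [f"decode-{i}" for i in range(5)], gpus_per_pod=4)
print(too_big)  # None
```

A scheduler without gang semantics could start four of the five pods and leave the group stuck holding GPUs while it waits for the fifth. Rejecting the partial placement avoids that.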
Topics
- Disaggregated LLM Serving
- Kubernetes Scheduling
- GPU Optimization
- Tensor Parallelism
- NVIDIA Grove
Best for: MLOps Engineer, AI Architect, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.