Deploying Disaggregated LLM Inference Workloads on Kubernetes

2026-03-23 · Source: NVIDIA Technical Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Software Development & Engineering · Depth: Advanced, long

Summary

Disaggregated serving for large language models (LLMs) splits the inference pipeline into distinct stages, such as prefill, decode, and routing, each running as an independent, scalable service. Traditional aggregated serving instead runs the entire inference lifecycle in a single process, which underutilizes GPUs because prefill is compute-intensive while decode is memory-bandwidth-bound. Disaggregation lets each stage be matched to suitable GPU resources and scaled independently, improving utilization. Orchestrating this on Kubernetes requires advanced scheduling capabilities such as gang scheduling, hierarchical gang scheduling, and topology-aware placement, which schedulers like KAI Scheduler provide. Higher-level abstractions such as LeaderWorkerSet (LWS) and NVIDIA Grove express these application structures declaratively, enabling coordinated deployment and scaling.
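The compute-versus-bandwidth contrast between prefill and decode can be illustrated with a rough roofline-style estimate. This is a minimal sketch under simplifying assumptions: about 2 FLOPs per parameter per token, weights read once per forward pass, and KV-cache and activation traffic ignored. The hardware figures and function names are hypothetical placeholders, not numbers from the original article.

```python
# Back-of-envelope roofline check. All hardware numbers are illustrative assumptions.


def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs performed per byte of weights read in one forward pass.

    Approximation: ~2 FLOPs per parameter per token, weights read once per pass.
    The parameter count cancels out, so it does not appear here.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def bound(intensity: float, peak_flops: float, mem_bandwidth: float) -> str:
    """Classify a workload against the hardware ridge point (FLOPs per byte)."""
    ridge = peak_flops / mem_bandwidth
    return "compute-bound" if intensity >= ridge else "memory-bandwidth-bound"


PEAK_FLOPS = 1.0e15     # assumed accelerator peak, FLOP/s
MEM_BANDWIDTH = 3.0e12  # assumed memory bandwidth, bytes/s

# Prefill processes the whole prompt in one pass; decode emits one token per sequence.
prefill = arithmetic_intensity(tokens_per_pass=2048)  # one 2048-token prompt
decode = arithmetic_intensity(tokens_per_pass=8)      # batch of 8 sequences, 1 token each

print("prefill:", bound(prefill, PEAK_FLOPS, MEM_BANDWIDTH))
print("decode: ", bound(decode, PEAK_FLOPS, MEM_BANDWIDTH))
```

Under these assumed numbers the prefill pass lands above the ridge point and the decode pass far below it, which is the mismatch that motivates running the two stages on separately sized pools.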

Key takeaway

MLOps engineers deploying LLM inference should consider disaggregated serving architectures to improve GPU utilization and scaling flexibility. Evaluate whether managing a separate LeaderWorkerSet resource for each stage or using NVIDIA Grove's integrated PodCliqueSet API better suits your operational model, especially when coordinating scaling and topology requirements across prefill, decode, and router components.

Key insights

Disaggregated LLM inference optimizes resource use and scaling by separating prefill, decode, and routing stages.

Principles

Match GPU resources to stage-specific needs.
Scale inference stages independently based on demand.
Saturate each stage's bottleneck resource (compute for prefill, memory bandwidth for decode) to raise GPU utilization.
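Independent scaling can be sketched as a per-stage replica calculation. The example below is illustrative only: the `Stage` type, the throughput figures, and the 80% target utilization are assumptions made for the sketch, not values from the original article or from any autoscaler.

```python
import math
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    demand: float                # tokens/s the stage must handle (assumed)
    capacity_per_replica: float  # tokens/s one replica sustains (assumed)


def replicas_needed(stage: Stage, target_utilization: float = 0.8) -> int:
    """Replicas required to serve the demand while leaving some headroom."""
    usable = stage.capacity_per_replica * target_utilization
    return max(1, math.ceil(stage.demand / usable))


stages = [
    Stage("prefill", demand=120_000, capacity_per_replica=40_000),
    Stage("decode", demand=9_000, capacity_per_replica=1_500),
]

# Each stage gets its own replica count; neither is tied to the other.
plan = {s.name: replicas_needed(s) for s in stages}
print(plan)
```

Because each stage is sized from its own demand and capacity, a spike in long prompts grows only the prefill pool, while a spike in long generations grows only the decode pool.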

Method

Deploy disaggregated LLM inference on Kubernetes by defining roles, scaling, and topology constraints with the LeaderWorkerSet or NVIDIA Grove APIs, and rely on an advanced scheduler such as KAI Scheduler for gang-scheduled, topology-aware placement.
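As a rough illustration of the LWS route, the sketch below builds one LeaderWorkerSet manifest per role as a plain Python dictionary and prints it as JSON. The field layout follows the LeaderWorkerSet API as commonly documented (`spec.replicas`, `spec.leaderWorkerTemplate.size`, `workerTemplate`), but treat it as an assumption to check against the version installed in your cluster. The resource names, image, and sizes are hypothetical.

```python
import json


def lws_manifest(role: str, replicas: int, group_size: int,
                 gpus_per_pod: int, image: str) -> dict:
    """Build a LeaderWorkerSet manifest for one inference role.

    replicas   : number of leader/worker groups (model instances)
    group_size : pods per group, leader included
    """
    pod_template = {
        "spec": {
            "containers": [{
                "name": role,
                "image": image,  # hypothetical image name
                "resources": {"limits": {"nvidia.com/gpu": gpus_per_pod}},
            }]
        }
    }
    return {
        "apiVersion": "leaderworkerset.x-k8s.io/v1",
        "kind": "LeaderWorkerSet",
        "metadata": {"name": f"llm-{role}"},
        "spec": {
            "replicas": replicas,
            "leaderWorkerTemplate": {
                "size": group_size,
                "workerTemplate": pod_template,
            },
        },
    }


IMAGE = "example.com/llm-server:latest"  # placeholder, not a real image

# One resource per role, so each role scales on its own.
manifests = [
    lws_manifest("prefill", replicas=2, group_size=2, gpus_per_pod=8, image=IMAGE),
    lws_manifest("decode", replicas=4, group_size=1, gpus_per_pod=4, image=IMAGE),
]

for manifest in manifests:
    print(json.dumps(manifest, indent=2))
```

Keeping one resource per role makes independent scaling straightforward, at the cost of coordinating the roles yourself. The integrated alternative named in the text is Grove's PodCliqueSet, which is not sketched here.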

In practice

Use LWS for independent role management.
Employ Grove for integrated cross-role coordination.
Configure `topologyConstraint` for rack-level colocation.
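To make rack-level colocation and gang scheduling concrete, here is a toy placement routine: it places every pod of a group inside a single rack or places none at all. It is a simplified first-fit heuristic written for illustration, not the algorithm used by KAI Scheduler or Grove, and the cluster layout and names are made up.

```python
from typing import Dict, List, Optional


def place_gang(free_gpus: Dict[str, Dict[str, int]],
               pods: List[int]) -> Optional[Dict[int, str]]:
    """All-or-nothing placement of a pod group within one rack.

    free_gpus : rack name -> node name -> free GPU count
    pods      : GPUs requested by each pod in the group
    Returns pod index -> node name for the first rack that fits every pod,
    or None if no single rack can host the whole group.
    """
    for nodes in free_gpus.values():
        remaining = dict(nodes)  # work on a copy so failed attempts leave no trace
        assignment: Dict[int, str] = {}
        # Largest pods first (first-fit decreasing); a heuristic, not an exact solver.
        for index, need in sorted(enumerate(pods), key=lambda p: -p[1]):
            node = next((n for n, free in remaining.items() if free >= need), None)
            if node is None:
                break  # this rack cannot host the full group; try the next rack
            remaining[node] -= need
            assignment[index] = node
        else:
            return assignment  # every pod fit in this rack
    return None


cluster = {
    "rack-a": {"a1": 4, "a2": 2},
    "rack-b": {"b1": 8, "b2": 8},
}

print(place_gang(cluster, [4, 4, 4]))  # fits entirely in rack-b
print(place_gang(cluster, [8, 8, 8]))  # no rack fits the whole group
```

The second call returns `None` instead of placing two of the three pods, which is the property gang scheduling provides: a partially started group would hold GPUs without being able to serve traffic.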

Topics

Disaggregated LLM Serving
Kubernetes Scheduling
GPU Optimization
Tensor Parallelism
NVIDIA Grove

Best for: MLOps Engineer, AI Architect, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.