Deploying Disaggregated LLM Inference Workloads on Kubernetes

Source: NVIDIA Technical Blog · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Software Development & Engineering) · Depth: Advanced, long

Summary

Disaggregated serving for large language models (LLMs) splits inference into distinct components, such as prefill, decode, and routing, each running as an independent, scalable service. Traditional aggregated serving instead handles the entire inference lifecycle in a single process, which tends to underutilize GPUs because the two phases stress hardware differently: prefill is compute-intensive, while decode is memory-bandwidth-bound. Disaggregation lets each stage be matched to suitable GPU resources and scaled independently, improving utilization. Orchestrating this on Kubernetes requires advanced scheduling capabilities such as gang scheduling (placing all pods of a group together or not at all), hierarchical gang scheduling, and topology-aware placement, which schedulers like KAI Scheduler provide. Higher-level abstractions like LeaderWorkerSet (LWS) and NVIDIA Grove express these application structures declaratively, which makes coordinated deployment and scaling easier.
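The all-or-nothing behavior behind gang scheduling can be illustrated with a toy model. The sketch below is not KAI Scheduler's algorithm; it assumes a simple first-fit placement over a hypothetical map of free GPUs per node, and the function name `gang_schedule` and the node names are invented for illustration.

```python
from typing import Optional


def gang_schedule(free_gpus: dict[str, int], pod_gpu_requests: list[int]) -> Optional[list[str]]:
    """Place every pod in the gang, or none of them (all-or-nothing).

    Returns the node chosen for each pod, in request order, or None if the
    whole gang cannot fit. Toy first-fit logic for illustration only.
    """
    remaining = dict(free_gpus)  # work on a copy so a failed attempt leaves no trace
    placement = []
    for request in pod_gpu_requests:
        # First node (in insertion order) with enough free GPUs for this pod.
        node = next((name for name, free in remaining.items() if free >= request), None)
        if node is None:
            return None  # one pod does not fit, so the whole gang is rejected
        remaining[node] -= request
        placement.append(node)
    free_gpus.update(remaining)  # commit only after every pod found a node
    return placement


cluster = {"node-a": 8, "node-b": 4}  # hypothetical free GPUs per node
print(gang_schedule(cluster, [8, 8]))     # None: the second pod cannot fit, so nothing is placed
print(cluster)                            # {'node-a': 8, 'node-b': 4} (unchanged)
print(gang_schedule(cluster, [4, 4, 4]))  # ['node-a', 'node-a', 'node-b']
```

Without the all-or-nothing rule, the first request would have held eight GPUs on `node-a` while its partner pod stayed pending, which is the kind of partial placement that gang scheduling is meant to avoid for multi-pod inference groups.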

Key takeaway

MLOps engineers deploying LLM inference should consider disaggregated serving architectures to improve GPU utilization and scaling flexibility. Evaluate whether managing a separate LeaderWorkerSet resource for each stage or using NVIDIA Grove's integrated PodCliqueSet API better suits your operational model, especially when scaling and topology requirements must be coordinated across prefill, decode, and router components.

Key insights

Disaggregated LLM inference optimizes resource use and scaling by separating prefill, decode, and routing stages.
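As a rough illustration of why separate stages help with scaling, the sketch below sizes prefill and decode independently from their own load. All throughput figures, and the helper names `replicas_needed` and `plan`, are hypothetical assumptions chosen for the arithmetic, not measurements from the source.

```python
import math


def replicas_needed(tokens_per_second: float, capacity_per_replica: float) -> int:
    """Smallest replica count whose combined capacity covers the load (at least 1)."""
    return max(1, math.ceil(tokens_per_second / capacity_per_replica))


def plan(prompt_tps: float, output_tps: float,
         prefill_capacity: float, decode_capacity: float) -> dict[str, int]:
    """Size each stage from its own load: prompt tokens drive prefill, output tokens drive decode."""
    return {
        "prefill": replicas_needed(prompt_tps, prefill_capacity),
        "decode": replicas_needed(output_tps, decode_capacity),
    }


# Hypothetical per-replica capacities: 40,000 prompt tokens/s and 2,000 output tokens/s.
baseline = plan(prompt_tps=120_000, output_tps=9_000,
                prefill_capacity=40_000, decode_capacity=2_000)
longer_prompts = plan(prompt_tps=240_000, output_tps=9_000,
                      prefill_capacity=40_000, decode_capacity=2_000)
print(baseline)        # {'prefill': 3, 'decode': 5}
print(longer_prompts)  # {'prefill': 6, 'decode': 5}
```

In this example, doubling the prompt load changes only the prefill replica count. In an aggregated deployment the same change would mean scaling whole replicas that bundle both phases.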

Principles

Method

Deploy disaggregated LLM inference on Kubernetes by using the LeaderWorkerSet or NVIDIA Grove APIs to define roles, scaling, and topology constraints, and rely on an advanced scheduler such as KAI Scheduler to place the resulting pods.
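One way to picture the LeaderWorkerSet option is a small generator that emits one manifest per stage. This is a minimal sketch, not the article's configuration: the helper `lws_manifest`, the image references, the replica counts, and the GPU counts are all hypothetical. The `scheduler_name` parameter is optional, and any value passed to it should be checked against your scheduler's installation. The Grove PodCliqueSet API is not shown here.

```python
import json
from typing import Optional


def lws_manifest(role: str, replicas: int, group_size: int, image: str,
                 gpus_per_pod: int, scheduler_name: Optional[str] = None) -> dict:
    """Build a LeaderWorkerSet manifest (as a dict) for one inference role."""
    pod_spec = {
        "containers": [{
            "name": role,
            "image": image,  # hypothetical image reference
            "resources": {"limits": {"nvidia.com/gpu": gpus_per_pod}},
        }]
    }
    if scheduler_name:
        # Assumption: the cluster runs a scheduler registered under this name.
        pod_spec["schedulerName"] = scheduler_name
    return {
        "apiVersion": "leaderworkerset.x-k8s.io/v1",
        "kind": "LeaderWorkerSet",
        "metadata": {"name": f"llm-{role}"},
        "spec": {
            "replicas": replicas,  # number of leader/worker groups
            "leaderWorkerTemplate": {
                "size": group_size,  # pods per group, leader included
                # No leaderTemplate is set, so the leader uses the worker template.
                "workerTemplate": {"spec": pod_spec},
            },
        },
    }


# Hypothetical sizing: each stage gets its own resource and scales on its own.
manifests = [
    lws_manifest("prefill", replicas=2, group_size=2,
                 image="example.com/llm-prefill:dev", gpus_per_pod=4),
    lws_manifest("decode", replicas=4, group_size=1,
                 image="example.com/llm-decode:dev", gpus_per_pod=8),
]
for manifest in manifests:
    print(json.dumps(manifest, indent=2))
```

Since JSON is valid YAML, the printed documents could be saved to files and applied with `kubectl apply -f`. With one resource per stage, any coordination between prefill and decode (for example, keeping their replica counts in a fixed ratio) has to be handled outside these manifests.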

In practice

Best for: MLOps Engineer, AI Architect, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.