NVIDIA Dynamo on AKS - Autoscaling LLM Inference
Summary
Microsoft's Azure Kubernetes Service (AKS) integrates with NVIDIA Dynamo to provide scalable, production-ready inference for Large Language Models (LLMs). The integration targets three challenges: variable compute cost per request, multi-phase inference pipelines, and GPU autoscaling constraints. NVIDIA Dynamo provides smart routing, KV cache management, low-latency communication, and the Dynamo Planner, which coordinates GPU capacity decisions. The architecture relies on AKS GPU node pool autoscaling and uses Azure Managed Prometheus and Grafana for observability. It supports several autoscaling strategies: Kubernetes HPA, KEDA, and Dynamo Planner. Dynamo Planner is designed specifically for LLM-aware scaling, with adaptive capacity and SLA-driven goals such as time to first token (TTFT) < 500 ms. An example demonstrates KEDA-driven autoscaling for aggregated serving, using TTFT p95 latency as the signal to scale Qwen3-0.6B workers on NC-H100 GPUs.
Key takeaway
MLOps Engineers deploying LLM inference with NVIDIA Dynamo on AKS should prioritize LLM-aware autoscaling strategies, such as Dynamo Planner or KEDA with latency-based metrics (e.g., TTFT), over generic CPU/memory signals. Account for the cold-start penalty of new GPU nodes, and configure stabilization windows to prevent thrashing, so that scaling balances cost efficiency against user experience.
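To make the stabilization-window idea concrete, here is a minimal Python sketch of scale-down stabilization. It assumes the behavior documented for the Kubernetes HPA, where the applied replica count is the highest recommendation seen within the window; the class name `ScaleDownStabilizer`, the 300-second window, and the sample recommendations are hypothetical and are not taken from the original article.

```python
from collections import deque


class ScaleDownStabilizer:
    """Hypothetical helper: applies the highest recommendation seen in a time window."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._history = deque()  # (timestamp_seconds, recommended_replicas)

    def stabilize(self, now, recommendation):
        # Record the newest recommendation, then drop entries older than the window.
        self._history.append((now, recommendation))
        while self._history[0][0] < now - self.window_seconds:
            self._history.popleft()
        # Scale-ups take effect immediately; scale-downs wait until every
        # higher recommendation has aged out of the window.
        return max(replicas for _, replicas in self._history)


# Illustrative data only: (timestamp in seconds, raw recommended replicas).
recommendations = [(0, 4), (60, 2), (120, 2), (240, 2), (301, 2), (360, 5)]
stabilizer = ScaleDownStabilizer(window_seconds=300)
applied = [stabilizer.stabilize(t, r) for t, r in recommendations]
print(applied)
```

In this sketch the drop from 4 to 2 replicas is held back until the earlier recommendation of 4 leaves the window, while the later jump to 5 is applied at once. Since a fresh GPU node carries a cold-start penalty, delaying scale-down this way trades some idle cost for fewer slow scale-ups.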
Key insights
Effective LLM inference autoscaling requires specialized tools like NVIDIA Dynamo on AKS to manage variable loads and GPU constraints.
Principles
- GPU utilization alone is insufficient for LLM load.
- Metric alignment is crucial for stable scaling.
- Warm capacity reduces scale-up latency.
Method
NVIDIA Dynamo integrates with Kubernetes scale subresources, so HPA, KEDA, or Dynamo Planner can adjust the number of Dynamo workers using metrics such as TTFT, while AKS GPU node pool autoscaling adds or removes the GPU nodes those workers need.
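As a rough illustration of latency-driven scaling, the Python sketch below turns a window of TTFT samples into a replica count. It assumes the proportional rule documented for the Kubernetes HPA, desired = ceil(current * observed / target). The real behavior of KEDA or Dynamo Planner depends on their configuration and may differ. The function names, the sample latencies, and the replica bounds are hypothetical; only the 500 ms TTFT target comes from the summary above.

```python
import math


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def desired_replicas(current, observed_ms, target_ms, min_replicas=1, max_replicas=8):
    """Proportional scaling rule, clamped to hypothetical replica bounds."""
    raw = math.ceil(current * observed_ms / target_ms)
    return max(min_replicas, min(max_replicas, raw))


# Illustrative TTFT samples in milliseconds (not measured data).
ttft_ms = [220, 240, 260, 300, 310, 330, 350, 400, 420, 450,
           480, 520, 560, 600, 640, 700, 760, 820, 900, 1000]

observed = p95(ttft_ms)
replicas = desired_replicas(current=2, observed_ms=observed, target_ms=500)
print(observed, replicas)
```

With these made-up samples the p95 is 900 ms against a 500 ms target, so two workers become four. New workers may stay pending until the GPU node pool has added capacity for them.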
In practice
- Use KEDA with TTFT p95 for latency-sensitive LLMs.
- Implement Dynamo Planner for adaptive LLM-aware scaling.
- Use a single autoscaler per service; multiple autoscalers targeting the same service conflict.
Topics
- NVIDIA Dynamo
- Azure Kubernetes Service
- LLM Inference Autoscaling
- GPU Node Autoscaling
- KEDA
Best for: MLOps Engineer, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Microsoft Foundry Blog articles.