NVIDIA Dynamo on AKS - Autoscaling LLM Inference

2026-05-13 · Source: Microsoft Foundry Blog articles · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Intermediate, medium

Summary

Microsoft's Azure Kubernetes Service (AKS) integrates with NVIDIA Dynamo to provide scalable, production-ready inference for Large Language Models (LLMs). The combination addresses three challenges: variable compute cost per request, multi-phase inference pipelines, and GPU autoscaling constraints. NVIDIA Dynamo offers smart routing, KV cache management, low-latency communication, and a GPU Planner to coordinate capacity decisions. The architecture relies on AKS GPU node pool autoscaling and integrates with Azure Managed Prometheus and Grafana for observability. It supports several autoscaling strategies: Kubernetes HPA, KEDA, and Dynamo Planner, which is designed specifically for LLM-aware scaling with adaptive capacity and SLA-driven goals such as time to first token (TTFT) under 500 ms. An example demonstrates KEDA-driven autoscaling for aggregated serving, using TTFT p95 latency as the signal to scale Qwen3-0.6B workers on NC-H100 GPUs.

Key takeaway

MLOps engineers deploying LLM inference with NVIDIA Dynamo on AKS should prioritize LLM-aware autoscaling strategies, such as Dynamo Planner or KEDA with latency-based metrics (e.g., TTFT), over generic CPU and memory signals. Account for the cold-start penalty of GPU nodes and configure stabilization windows to prevent thrashing, so that scaling balances cost efficiency against user experience.
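To illustrate why a stabilization window prevents thrashing, here is a minimal sketch of the idea: keep recent replica recommendations and apply the highest one seen within the window, so a short dip in load does not remove GPU workers that are slow to bring back. The class name, the 300-second window, and the timeline values are assumptions made for illustration; they are not taken from the original article.

```python
from collections import deque


class ScaleDownStabilizer:
    """Apply the highest replica recommendation seen within a time window.

    Hypothetical helper: it models the idea of a scale-down stabilization
    window, not the code of any particular autoscaler.
    """

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._history: deque[tuple[float, int]] = deque()

    def recommend(self, now: float, desired_replicas: int) -> int:
        self._history.append((now, desired_replicas))
        # Drop recommendations that have aged out of the window.
        while self._history and now - self._history[0][0] > self.window_seconds:
            self._history.popleft()
        # Scaling down is delayed until every recent recommendation agrees.
        return max(replicas for _, replicas in self._history)


stabilizer = ScaleDownStabilizer(window_seconds=300)
# (timestamp in seconds, raw recommendation) with load flapping between 2 and 4.
timeline = [(0, 4), (60, 2), (120, 4), (180, 2), (600, 2)]
applied = [stabilizer.recommend(t, r) for t, r in timeline]
print(applied)
```

In this sketch the worker count holds at 4 while the raw recommendation flaps, and drops to 2 only once the higher recommendations have aged out of the window.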

Key insights

Effective LLM inference autoscaling requires specialized tools like NVIDIA Dynamo on AKS to manage variable loads and GPU constraints.

Principles

GPU utilization alone is an insufficient signal of LLM load.
Metric alignment is crucial for stable scaling.
Warm capacity reduces scale-up latency.

Method

NVIDIA Dynamo exposes its workers through Kubernetes scale subresources, so HPA, KEDA, or Dynamo Planner can adjust worker replica counts using metrics like TTFT, while AKS GPU node pool autoscaling adds or removes the GPU nodes those workers run on.
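As a rough sketch of what a KEDA setup for this method could look like, the snippet below builds a ScaledObject manifest as a Python dictionary and prints it as JSON. The ScaledObject fields are standard KEDA fields; everything else is an assumption for illustration: the Prometheus metric name, the server address, the target API version, kind, and name, and the replica bounds are hypothetical placeholders. The 0.5-second threshold mirrors the TTFT < 500 ms goal mentioned in the summary.

```python
import json


def build_scaled_object(target_api_version: str, target_kind: str,
                        target_name: str, ttft_threshold_seconds: float) -> dict:
    """Build a KEDA ScaledObject that scales on TTFT p95 from Prometheus."""
    # Hypothetical metric name: substitute whatever TTFT histogram
    # the deployment actually exports.
    query = (
        "histogram_quantile(0.95, sum(rate("
        "example_time_to_first_token_seconds_bucket[2m])) by (le))"
    )
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": f"{target_name}-ttft"},
        "spec": {
            # Any resource exposing a scale subresource can be targeted.
            "scaleTargetRef": {
                "apiVersion": target_api_version,
                "kind": target_kind,
                "name": target_name,
            },
            "minReplicaCount": 1,  # assumed bounds
            "maxReplicaCount": 4,
            "triggers": [{
                "type": "prometheus",
                # Compare the raw latency value to the threshold
                # instead of averaging it across replicas.
                "metricType": "Value",
                "metadata": {
                    "serverAddress": "http://prometheus.example:9090",  # placeholder
                    "query": query,
                    "threshold": str(ttft_threshold_seconds),
                },
            }],
        },
    }


manifest = build_scaled_object(
    target_api_version="example.com/v1",  # hypothetical
    target_kind="ExampleWorkerSet",       # hypothetical
    target_name="decode-worker",          # hypothetical
    ttft_threshold_seconds=0.5,
)
print(json.dumps(manifest, indent=2))
```

A managed Prometheus endpoint would typically also need authentication configured on the trigger, which this sketch leaves out.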

In practice

Use KEDA with TTFT p95 for latency-sensitive LLMs.
Implement Dynamo Planner for adaptive LLM-aware scaling.
Use a single autoscaler per service to prevent conflicts.
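The first recommendation above can be made concrete with a small worked calculation. The sketch below computes a nearest-rank p95 over TTFT samples and applies the proportional rule documented for the Kubernetes HPA, desired = ceil(current × metric / target), clamped to replica bounds. The sample latencies, the 0.5-second target, and the bounds are illustrative assumptions, and a real autoscaler applies tolerances and stabilization on top of this rule.

```python
import math


def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample list."""
    ordered = sorted(samples)
    rank = math.ceil(len(ordered) * 95 / 100)
    return ordered[rank - 1]


def desired_replicas(current: int, metric: float, target: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Proportional scaling rule, clamped to the allowed replica range."""
    raw = math.ceil(current * metric / target)
    return max(min_replicas, min(max_replicas, raw))


# Assumed TTFT samples in seconds: mostly fast, with a slow tail.
ttft_samples = [0.2] * 18 + [0.9, 1.2]
ttft_p95 = p95(ttft_samples)

replicas = desired_replicas(
    current=2, metric=ttft_p95, target=0.5,
    min_replicas=1, max_replicas=4,
)
print(f"TTFT p95 = {ttft_p95:.2f}s, desired replicas = {replicas}")
```

Here the mean TTFT would look healthy, but the p95 exceeds the target, which is why a tail-latency signal triggers scale-up earlier than an average would.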

Topics

NVIDIA Dynamo
Azure Kubernetes Service
LLM Inference Autoscaling
GPU Node Autoscaling
KEDA

Code references

maljazaery/Dynamo_on_AKS

Best for: MLOps Engineer, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Microsoft Foundry Blog articles.