NVIDIA Dynamo on AKS - Autoscaling LLM Inference

Source: Microsoft Foundry Blog articles · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure) · Depth: Intermediate, medium

Summary

Microsoft's Azure Kubernetes Service (AKS) integrates with NVIDIA Dynamo to provide scalable, production-ready inference for large language models (LLMs). The combination addresses three challenges: variable compute costs, multi-phase inference pipelines, and GPU autoscaling constraints. NVIDIA Dynamo contributes smart routing, KV cache management, low-latency communication, and a GPU Planner that coordinates capacity decisions. The architecture relies on AKS GPU node pool autoscaling and integrates with Azure Managed Prometheus and Grafana for observability. Three autoscaling strategies are supported: the Kubernetes Horizontal Pod Autoscaler (HPA), KEDA, and Dynamo Planner, which is designed specifically for LLM-aware scaling with adaptive capacity and SLA-driven goals such as a time to first token (TTFT) under 500 ms. An example demonstrates KEDA-driven autoscaling for aggregated serving, using p95 TTFT latency as the signal to scale Qwen3-0.6B workers on NC-H100 GPUs.

Key takeaway

MLOps engineers deploying LLM inference need to understand how GPU autoscaling behaves with NVIDIA Dynamo on AKS. Prioritize LLM-aware autoscaling strategies, such as Dynamo Planner or KEDA with latency-based metrics (e.g., TTFT), over generic CPU/memory signals. Account for the cold-start penalty of GPU nodes and configure stabilization windows to prevent thrashing, so that cost efficiency and user experience stay in balance.
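The effect of a stabilization window can be illustrated with a small simulation. The sketch below is not taken from the original article. It assumes the proportional rule documented for the Kubernetes HPA (desired = ceil(current × observed / target)) and simplifies the scale-down window to a fixed number of evaluation cycles rather than a time span. The names `desired_replicas`, `ScaleDownStabilizer`, and `simulate`, and all numeric values except the 500 ms TTFT target, are hypothetical.

```python
import math
from collections import deque


def desired_replicas(current: int, observed_ttft_ms: float, target_ttft_ms: float,
                     min_replicas: int = 1, max_replicas: int = 8) -> int:
    """HPA-style proportional rule, clamped to the allowed replica range."""
    raw = math.ceil(current * observed_ttft_ms / target_ttft_ms)
    return max(min_replicas, min(max_replicas, raw))


class ScaleDownStabilizer:
    """Returns the highest recommendation seen in the last `window_size` cycles.

    Scale-up takes effect immediately; scale-down only happens once every
    recommendation still inside the window agrees on the lower value.
    Assumption: the window is counted in cycles here, not in seconds.
    """

    def __init__(self, window_size: int):
        self.recent = deque(maxlen=window_size)

    def recommend(self, desired: int) -> int:
        self.recent.append(desired)
        return max(self.recent)


def simulate(ttft_samples_ms, target_ttft_ms=500.0, window_size=3, start_replicas=2):
    """Replays a fixed trace of p95 TTFT readings and records the replica count."""
    stabilizer = ScaleDownStabilizer(window_size)
    replicas = start_replicas
    history = []
    for ttft in ttft_samples_ms:
        want = desired_replicas(replicas, ttft, target_ttft_ms)
        replicas = stabilizer.recommend(want)
        history.append(replicas)
    return history


if __name__ == "__main__":
    # One latency spike followed by calm readings (hypothetical trace).
    print(simulate([1000, 250, 250, 250, 250]))  # [4, 4, 4, 2, 2]
```

In this trace the spike doubles the replica count at once, but the count only drops after the spike has aged out of the window. That delay is the trade-off the takeaway describes: a longer window avoids repeated GPU cold starts at the cost of paying for idle capacity a little longer.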

Key insights

Effective LLM inference autoscaling requires specialized tools like NVIDIA Dynamo on AKS to manage variable loads and GPU constraints.

Method

NVIDIA Dynamo integrates with the Kubernetes scale subresource, so HPA, KEDA, or Dynamo Planner can adjust worker replica counts using metrics like TTFT for adaptive scaling, while AKS GPU node pool autoscaling supplies the underlying GPU capacity.
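As a rough illustration of the KEDA path, the sketch below builds a KEDA `ScaledObject` manifest as a Python dictionary, with a Prometheus trigger on p95 TTFT. It is not the manifest from the original article. The `ScaledObject` fields (`scaleTargetRef`, `minReplicaCount`, `maxReplicaCount`, `advanced.horizontalPodAutoscalerConfig`, and the `prometheus` trigger metadata) follow the KEDA specification. The histogram metric name, the scale target's API version, kind, and name, the Prometheus address, and the replica bounds are all assumptions chosen for illustration. Authentication to Azure Managed Prometheus is omitted.

```python
import json


def build_ttft_scaled_object(name, target_api_version, target_kind, target_name,
                             prometheus_url, ttft_threshold_seconds=0.5,
                             min_replicas=1, max_replicas=4,
                             scale_down_window_seconds=300):
    """Returns a KEDA ScaledObject manifest that scales on p95 TTFT."""
    # Assumption: the frontend exports a TTFT histogram under this metric name.
    query = (
        "histogram_quantile(0.95, sum(rate("
        "dynamo_frontend_time_to_first_token_seconds_bucket[1m])) by (le))"
    )
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": name},
        "spec": {
            # Any resource that exposes the scale subresource can be targeted.
            "scaleTargetRef": {
                "apiVersion": target_api_version,
                "kind": target_kind,
                "name": target_name,
            },
            "minReplicaCount": min_replicas,
            "maxReplicaCount": max_replicas,
            "advanced": {
                "horizontalPodAutoscalerConfig": {
                    "behavior": {
                        "scaleDown": {
                            # Damp scale-down to avoid repeated GPU cold starts.
                            "stabilizationWindowSeconds": scale_down_window_seconds
                        }
                    }
                }
            },
            "triggers": [
                {
                    "type": "prometheus",
                    "metadata": {
                        "serverAddress": prometheus_url,
                        "query": query,
                        # KEDA trigger metadata values are strings.
                        "threshold": str(ttft_threshold_seconds),
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = build_ttft_scaled_object(
        name="ttft-scaler",                     # hypothetical
        target_api_version="example.com/v1",    # hypothetical placeholder
        target_kind="ExampleWorkerScaler",      # hypothetical placeholder
        target_name="qwen3-workers",            # hypothetical
        prometheus_url="http://prometheus.example:9090",  # hypothetical
    )
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be passed to `kubectl apply -f -` as it stands, since Kubernetes accepts JSON manifests. The placeholder target would first need to be replaced with the actual scalable resource in the cluster.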

Best for: MLOps Engineer, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Microsoft Foundry Blog articles.