NVIDIA Dynamo Snapshot: Fast Startup for Inference Workloads on Kubernetes

Source: NVIDIA Technical Blog · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure) · Depth: Expert, long

Summary

NVIDIA Dynamo Snapshot is a new checkpoint/restore solution for reducing the cold-start latency of AI inference workloads on Kubernetes. A cold start typically takes several minutes for a single-GPU vLLM (v0.20.0) workload. Dynamo Snapshot combines CRIU for host-side state with "cuda-checkpoint" for GPU device state, both managed by a "snapshot-agent" DaemonSet. Workloads use quiesce/resume hooks to prepare for checkpointing after engine initialization but before distributed runtime startup. A key optimization is unmapping and releasing the KV cache before the checkpoint, which reduced the Qwen3-0.6B artifact size from ~190 GiB to ~6 GiB. Enhancements to CRIU, such as parallel memfd restore and Linux native AIO, further accelerate memory restoration. The GPU Memory Service (GMS) decouples large model weights so that they can be restored concurrently, giving a 21x start-time reduction for gpt-oss-120b and sub-5-second restores when using striped local NVMe SSDs. The release is experimental and supports single-GPU vLLM and SGLang.

Key takeaway

For MLOps engineers running elastic AI inference workloads on Kubernetes, NVIDIA Dynamo Snapshot targets the cold-start problem. If your deployments face minutes-long startup delays and SLA risk during traffic spikes, it is worth evaluating. Its checkpoint/restore mechanism, especially with GMS, reduced gpt-oss-120b startup time by 21x and reached sub-5-second restores on striped local NVMe SSDs, so newly scheduled GPUs spend less time idle while a replica starts. The release is experimental and currently supports single-GPU vLLM and SGLang workloads, so treat it as a candidate for improving responsiveness and cost-effectiveness rather than a drop-in production fix.

Key insights

NVIDIA Dynamo Snapshot significantly reduces AI inference cold-start times on Kubernetes through optimized checkpoint/restore.

Method

A "snapshot-agent" DaemonSet orchestrates "cuda-checkpoint" and CRIU. A workload signals readiness once its engine is initialized (and before the distributed runtime starts), then polls for restore completion. This lets the agent checkpoint the process externally, and a restored copy resumes from the same point.
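
The signal-then-poll pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the marker files `ready` and `restored`, the control directory, and the function names are hypothetical and are not Dynamo Snapshot's actual interface, which this summary does not specify.

```python
import time
from pathlib import Path


def signal_ready(control_dir: Path) -> None:
    """Tell an external agent that engine init is done (hypothetical 'ready' marker)."""
    (control_dir / "ready").touch()


def wait_for_restore(control_dir: Path, poll_s: float, timeout_s: float) -> bool:
    """Poll for a hypothetical 'restored' marker; return False if the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if (control_dir / "restored").exists():
            return True
        time.sleep(poll_s)
    return False


def run_workload(control_dir: Path, poll_s: float = 0.05, timeout_s: float = 5.0) -> str:
    # 1. Engine initialization (model load, warmup) would happen here.
    signal_ready(control_dir)
    # 2. The process sits in the polling loop while it is checkpointed
    #    externally; a restored copy wakes up inside the same loop.
    if not wait_for_restore(control_dir, poll_s, timeout_s):
        return "timeout"
    # 3. Only after restore would the distributed runtime be started.
    return "resumed"
```

Because the workload blocks in a simple polling loop, the checkpoint is taken at a well-defined point where no distributed connections exist yet, which is presumably why the hooks sit between engine initialization and runtime startup.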

Best for: MLOps Engineer, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.