NVIDIA Dynamo Snapshot: Fast Startup for Inference Workloads on Kubernetes

2026-05-27 · Source: NVIDIA Technical Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Expert, long

Summary

NVIDIA Dynamo Snapshot is a new checkpoint/restore solution designed to reduce cold-start latency for AI inference workloads on Kubernetes; a cold start typically takes several minutes for a single-GPU vLLM (v0.20.0) workload. It combines CRIU for host-side state with "cuda-checkpoint" for GPU device state, both managed by a "snapshot-agent" DaemonSet. Quiesce/resume hooks let a workload prepare for checkpointing after engine initialization but before distributed runtime startup. A key optimization is unmapping and releasing the KV cache before the checkpoint, which reduced the Qwen3-0.6B artifact size from ~190 GiB to ~6 GiB. Enhancements to CRIU, such as parallel memfd restore and Linux native AIO, accelerate memory restoration. The GPU Memory Service (GMS) decouples large model weights from the process checkpoint so that weights and process state restore concurrently, achieving a 21x start-time reduction for gpt-oss-120b, with sub-5-second restores using striped local NVMe SSDs. The experimental release supports single-GPU vLLM and SGLang.

Key takeaway

MLOps engineers managing elastic AI inference workloads on Kubernetes should evaluate NVIDIA Dynamo Snapshot if their deployments face minutes-long startup delays and SLA risk during traffic spikes. Its checkpoint/restore mechanism, combined with GMS, reduced gpt-oss-120b start time by 21x, with sub-5-second restores using striped local NVMe SSDs. The release is experimental and supports single-GPU vLLM and SGLang workloads, so treat it as a candidate for improving responsiveness and cost-effectiveness rather than a drop-in production component.

Key insights

NVIDIA Dynamo Snapshot cuts AI inference cold-start times on Kubernetes from minutes to seconds by restoring a checkpoint taken after engine initialization instead of initializing from scratch.

Principles

Checkpoint host and device state with separate tools: CRIU for host-side state, "cuda-checkpoint" for GPU device state.
Quiesce workloads before checkpointing to optimize artifact size.
Parallelize memory restoration to maximize I/O throughput.
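
The parallel-restore principle can be illustrated with a minimal Python sketch. This is not CRIU's implementation (the summary attributes parallel memfd restore and Linux native AIO to CRIU itself); it only shows the general idea of splitting one large file into chunks and issuing positional reads concurrently. The function name, chunk size, and worker count are assumptions chosen for illustration.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def parallel_read(path, chunk_size=1 << 20, workers=8):
    """Read a whole file into memory using concurrent positional reads.

    Illustrative only: thread-based, not CRIU's memfd/AIO restore path.
    """
    size = os.path.getsize(path)
    buf = bytearray(size)
    view = memoryview(buf)
    fd = os.open(path, os.O_RDONLY)

    def read_chunk(offset):
        end = min(offset + chunk_size, size)
        pos = offset
        while pos < end:
            # os.pread does not move the shared file offset, so threads
            # can safely read from the same descriptor.
            data = os.pread(fd, end - pos, pos)
            if not data:
                raise IOError(f"unexpected end of file at offset {pos}")
            view[pos:pos + len(data)] = data
            pos += len(data)

    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # list() forces completion and re-raises any worker exception.
            list(pool.map(read_chunk, range(0, size, chunk_size)))
    finally:
        os.close(fd)
    return bytes(buf)
```

Whether concurrent reads help in practice depends on the storage device; the summary's sub-5-second figure is reported for striped local NVMe SSDs, where several outstanding requests can be serviced at once.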

Method

A "snapshot-agent" DaemonSet orchestrates "cuda-checkpoint" and CRIU. After engine initialization, the workload signals that it is ready to be checkpointed and then polls for restore completion. The agent checkpoints the process externally, and each restored copy resumes from that polling point.
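
The signal-then-poll handshake described above can be sketched from the workload's side. The summary does not say how the workload and agent actually communicate, so the marker files ("ready", "restored") and the control directory here are hypothetical stand-ins, not Dynamo Snapshot's interface.

```python
import os
import time
from pathlib import Path


def signal_ready(control_dir: Path) -> None:
    """Publish a 'ready' marker once engine initialization has finished.

    The marker name and location are hypothetical.
    """
    control_dir.mkdir(parents=True, exist_ok=True)
    tmp = control_dir / "ready.tmp"
    tmp.write_text(str(os.getpid()))
    # Rename atomically so an observer never sees a half-written marker.
    os.replace(tmp, control_dir / "ready")


def wait_for_restore(control_dir: Path, poll_interval=0.1, timeout=None) -> bool:
    """Poll until a hypothetical 'restored' marker appears.

    Returns True when the marker exists, False if the timeout expires first.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    marker = control_dir / "restored"
    while not marker.exists():
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
    return True
```

In this sketch the checkpoint would be taken while the process sits in the polling loop, so every restored copy wakes up inside `wait_for_restore` and continues with distributed runtime startup once it returns.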

In practice

Use "cuMemUnmap" and "cuMemRelease" to unmap and release the KV cache before checkpointing, which shrinks the artifact.
Implement quiesce/resume hooks to release resources that cannot be checkpointed and to re-acquire them after restore.
Explore GMS for large models to enable concurrent weight and process state restoration.
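
One way to organize quiesce/resume hooks is a small registry that releases resources before the checkpoint and re-acquires them after restore. The class and method names below are hypothetical and do not reflect Dynamo Snapshot's hook API; in a real workload, a KV cache hook would call into the CUDA driver (for example "cuMemUnmap" and "cuMemRelease") rather than the placeholder callables used here.

```python
from typing import Callable, List, Tuple

Hook = Callable[[], None]


class QuiesceHooks:
    """Hypothetical registry of paired quiesce/resume callbacks."""

    def __init__(self) -> None:
        self._hooks: List[Tuple[str, Hook, Hook]] = []
        self.quiesced = False

    def register(self, name: str, quiesce: Hook, resume: Hook) -> None:
        """Register a resource with its release and re-acquire callbacks."""
        self._hooks.append((name, quiesce, resume))

    def quiesce(self) -> None:
        """Release resources in reverse registration order."""
        for _name, quiesce, _resume in reversed(self._hooks):
            quiesce()
        self.quiesced = True

    def resume(self) -> None:
        """Re-acquire resources in registration order."""
        for _name, _quiesce, resume in self._hooks:
            resume()
        self.quiesced = False
```

Releasing in reverse order and re-acquiring in registration order mirrors ordinary setup/teardown discipline, so a resource registered later (and possibly depending on an earlier one) is torn down first.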

Topics

NVIDIA Dynamo Snapshot
Kubernetes
AI Inference
Cold Start Latency
Checkpoint/Restore
GPU Memory Service

Code references

checkpoint-restore/criu

Best for: MLOps Engineer, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.