NVIDIA Dynamo Snapshot: Fast Startup for Inference Workloads on Kubernetes
Summary
NVIDIA Dynamo Snapshot is a new checkpoint/restore solution designed to reduce cold-start latency for AI inference workloads on Kubernetes. A cold start typically takes several minutes, even for a single-GPU vLLM (v0.20.0) workload. Dynamo Snapshot combines CRIU for host-side state with "cuda-checkpoint" for GPU device state, both managed by a "snapshot-agent" DaemonSet. Quiesce/resume hooks let a workload prepare for checkpointing after engine initialization but before distributed runtime startup. Key optimizations include unmapping and releasing the KV cache, which reduced the Qwen3-0.6B artifact size from ~190 GiB to ~6 GiB. Enhancements to CRIU, such as parallel memfd restore and Linux native AIO, accelerate memory restoration. The GPU Memory Service (GMS) decouples large model weights from process state so that both can be restored concurrently, achieving a 21x start-time reduction for gpt-oss-120b, with sub-5-second restores using striped local NVMe SSDs. The experimental release supports single-GPU vLLM and SGLang.
Key takeaway
For MLOps engineers running elastic AI inference workloads on Kubernetes, NVIDIA Dynamo Snapshot targets the cold-start problem. If your deployments take minutes to start and risk SLA violations during traffic spikes, Dynamo Snapshot is worth evaluating. With GMS, its checkpoint/restore mechanism cut gpt-oss-120b start time by 21x and reached sub-5-second restores on striped local NVMe SSDs. The release is experimental and supports single-GPU vLLM and SGLang, so those workloads are the natural place to start.
Key insights
NVIDIA Dynamo Snapshot cuts AI inference cold-start time on Kubernetes by checkpointing a workload after engine initialization and restoring it from that artifact.
Principles
- Decouple device and host state for comprehensive checkpointing.
- Quiesce workloads before checkpointing to optimize artifact size.
- Parallelize memory restoration to maximize I/O throughput.
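The third principle can be illustrated with a minimal Python sketch. It is not CRIU's parallel memfd restore or its AIO path; it only shows the general idea of splitting one large artifact into chunks and issuing the reads concurrently. The function name `restore_parallel` and its parameters are hypothetical.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def restore_parallel(path: str, chunk_size: int = 1 << 20, workers: int = 4) -> bytes:
    """Read a file into memory using several concurrent positional reads."""
    size = os.path.getsize(path)
    buf = bytearray(size)
    view = memoryview(buf)
    fd = os.open(path, os.O_RDONLY)

    def read_chunk(offset: int) -> None:
        end = min(offset + chunk_size, size)
        pos = offset
        while pos < end:
            # os.pread reads at an explicit offset, so threads do not share a file cursor.
            data = os.pread(fd, end - pos, pos)
            if not data:
                raise EOFError(f"unexpected end of file at offset {pos}")
            view[pos:pos + len(data)] = data
            pos += len(data)

    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # list() forces completion and re-raises any worker exception.
            list(pool.map(read_chunk, range(0, size, chunk_size)))
    finally:
        os.close(fd)
    return bytes(buf)
```

Each worker writes into a disjoint slice of the destination buffer, so no locking is needed. Whether this helps in practice depends on the storage: it pays off when the device can serve many reads at once.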
Method
A "snapshot-agent" DaemonSet orchestrates "cuda-checkpoint" and CRIU. The workload signals readiness once engine initialization is complete, then polls for restore completion. This lets the agent checkpoint the process externally, and lets the restored process resume from the point where it was waiting.
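The signal-then-poll handshake can be sketched in Python as follows. This is an illustration under stated assumptions, not Dynamo Snapshot's actual interface: the marker files `ready` and `restored`, the control directory, and the `fake_agent` thread are all hypothetical stand-ins for whatever channel the real agent uses.

```python
import tempfile
import threading
import time
from pathlib import Path


def signal_ready(control_dir: Path) -> None:
    """Tell the (hypothetical) agent that engine initialization is done."""
    (control_dir / "ready").touch()


def wait_for_restore(control_dir: Path, timeout_s: float = 5.0,
                     interval_s: float = 0.01) -> bool:
    """Poll until a 'restored' marker appears; return False on timeout."""
    deadline = time.monotonic() + timeout_s
    marker = control_dir / "restored"
    while time.monotonic() < deadline:
        if marker.exists():
            return True
        time.sleep(interval_s)
    return False


def fake_agent(control_dir: Path) -> None:
    """Stand-in for the external agent: waits for 'ready', then marks 'restored'."""
    while not (control_dir / "ready").exists():
        time.sleep(0.01)
    # A real agent would checkpoint and later restore the process here.
    (control_dir / "restored").touch()


def run_demo() -> bool:
    """Run the workload side and the fake agent against a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        control_dir = Path(tmp)
        agent = threading.Thread(target=fake_agent, args=(control_dir,))
        agent.start()
        signal_ready(control_dir)
        resumed = wait_for_restore(control_dir)
        agent.join()
        return resumed
```

Because the workload only polls, it does not need to know whether it is still the original process or a restored copy; it simply continues with startup once the marker appears.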
In practice
- Utilize "cuMemUnmap" and "cuMemRelease" to shrink KV cache in checkpoints.
- Implement quiesce/resume hooks to release resources that cannot be checkpointed and re-create them after restore.
- Explore GMS for large models to enable concurrent weight and process state restoration.
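The first two items can be pictured with a toy Python model. It does not call the CUDA driver API; `ToyKVCache` and its methods are hypothetical names. In CUDA's virtual memory management API, "cuMemUnmap" and "cuMemRelease" give up the physical backing of a mapping while the address reservation can stay in place, and the sketch mimics that by keeping the size and dropping the buffer.

```python
from typing import Optional


class ToyKVCache:
    """Toy stand-in for a KV cache: a reserved size plus an optional backing buffer."""

    def __init__(self, num_bytes: int) -> None:
        # The recorded size plays the role of the address reservation.
        self.num_bytes = num_bytes
        self.backing: Optional[bytearray] = bytearray(num_bytes)

    def quiesce(self) -> None:
        """Drop the backing memory so a checkpoint would not need to store it."""
        self.backing = None

    def resume(self) -> None:
        """Re-create zeroed backing memory of the original size after restore."""
        if self.backing is None:
            self.backing = bytearray(self.num_bytes)

    def resident_bytes(self) -> int:
        """Bytes that a checkpoint taken right now would have to capture."""
        return 0 if self.backing is None else len(self.backing)
```

Dropping the cache is safe at this point because the checkpoint is taken before any requests are served, so the cache holds no state worth preserving.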
Topics
- NVIDIA Dynamo Snapshot
- Kubernetes
- AI Inference
- Cold Start Latency
- Checkpoint/Restore
- GPU Memory Service
Best for: MLOps Engineer, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by NVIDIA Technical Blog.