NVIDIA Introduces SANA-WM: A 2.6B-Parameter Open-Source World Model That Generates Minute-Scale 720p Video on a Single GPU
Summary
NVIDIA has released SANA-WM, a 2.6B-parameter open-source world model that generates one-minute 720p videos with precise 6-DoF camera control from a single image and a camera trajectory. Inference runs on a single GPU, whereas other open-source world models often require 8 GPUs or drop to 480p. SANA-WM gets this efficiency from a hybrid Gated DeltaNet + softmax backbone, which keeps the recurrent state at a constant size, and from a dual-branch camera control system. A second-stage refiner, built on a 17B LTX-2 model with rank-384 LoRA, significantly reduces visual drift; throughput reaches 22.0 videos/hour on 8 H100s, 36x that of LingBot-World. A distilled variant with NVFP4 quantization generates a 60-second 720p clip in 34 seconds on a single RTX 5090.
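To put the reported numbers in perspective, the short calculation below derives a few secondary figures from them. It is only arithmetic on the summary's values; the implied LingBot-World rate assumes the 36x comparison was made on the same hardware and clip length, which the summary does not state, and all variable names are illustrative.

```python
# Figures quoted in the summary.
SANA_WM_VIDEOS_PER_HOUR = 22.0   # on 8 H100s
SPEEDUP_OVER_LINGBOT = 36.0      # reported throughput ratio
CLIP_SECONDS = 60.0              # length of one generated clip
DISTILLED_WALL_SECONDS = 34.0    # distilled variant on one RTX 5090


def implied_baseline_videos_per_hour():
    """LingBot-World rate implied by the 36x ratio (assumes same setup)."""
    return SANA_WM_VIDEOS_PER_HOUR / SPEEDUP_OVER_LINGBOT


def minutes_per_video_on_node():
    """Average wall-clock minutes per video across the 8-GPU node."""
    return 60.0 / SANA_WM_VIDEOS_PER_HOUR


def realtime_factor():
    """Seconds of video produced per second of compute (distilled variant)."""
    return CLIP_SECONDS / DISTILLED_WALL_SECONDS


if __name__ == "__main__":
    print(f"implied baseline: {implied_baseline_videos_per_hour():.2f} videos/hour")
    print(f"node average:     {minutes_per_video_on_node():.2f} min/video")
    print(f"realtime factor:  {realtime_factor():.2f}x")
```

By this arithmetic the distilled variant produces video faster than real time, at roughly 1.76 seconds of footage per second of compute.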
Key takeaway
For computer vision engineers building high-resolution video generation models, SANA-WM shows that minute-long 720p generation is feasible on a single GPU, at higher throughput than comparable open-source world models. Consider adopting its design principles, particularly the hybrid Gated DeltaNet backbone and dual-branch camera control, to ease memory and performance bottlenecks in your own projects.
Key insights
SANA-WM is an efficient 2.6B-parameter open-source world model generating 720p video on a single GPU.
Principles
- A constant-size recurrent state keeps memory fixed as the video gets longer, avoiding the quadratic growth of full softmax attention.
- Dual-branch camera control improves trajectory precision.
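The constant-state principle can be illustrated with a toy recurrence. The sketch below uses the gated delta rule as it is commonly formulated, S_t = alpha_t * S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T; the article does not spell out SANA-WM's exact layer, so the update rule, the function names, and the plain nested-list representation are all assumptions made for illustration.

```python
def gated_delta_step(S, q, k, v, alpha, beta):
    """One recurrent step. S is a d_v x d_k matrix stored as nested lists."""
    d_v, d_k = len(S), len(S[0])
    # What the state currently returns for key k: S @ k.
    Sk = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    # Decay the old state (alpha), erase the old value stored under k,
    # then write the new value v under k (both scaled by beta).
    S_new = [
        [alpha * (S[i][j] - beta * Sk[i] * k[j]) + beta * v[i] * k[j]
         for j in range(d_k)]
        for i in range(d_v)
    ]
    # Read out with the query: S_new @ q.
    out = [sum(S_new[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
    return S_new, out


def run(tokens, d_k, d_v):
    """Process (q, k, v, alpha, beta) tuples; the state never changes shape."""
    S = [[0.0] * d_k for _ in range(d_v)]
    outputs = []
    for q, k, v, alpha, beta in tokens:
        S, out = gated_delta_step(S, q, k, v, alpha, beta)
        outputs.append(out)
    return S, outputs


def state_floats(d_k, d_v):
    """Memory held by the recurrent state: independent of sequence length."""
    return d_k * d_v


def kv_cache_floats(num_tokens, d_k, d_v):
    """Memory held by a softmax-attention KV cache: grows with every token."""
    return num_tokens * (d_k + d_v)
```

Writing twice under the same key replaces the stored value rather than accumulating it, which is the "delta" part of the rule. The state stays at d_v x d_k entries however many tokens are processed, while a softmax layer has to keep every past key and value and compare each new query against all of them.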
Method
SANA-WM combines a hybrid Gated DeltaNet + softmax backbone, which keeps the recurrent state at a constant size, with dual-branch camera control. A second-stage refiner built on a 17B LTX-2 model with rank-384 LoRA then reduces visual drift.
In practice
- Generate 720p video on a single GPU.
- Achieve 36x higher throughput than LingBot-World.
- Use NVFP4 quantization for faster generation.
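NVFP4 is NVIDIA's 4-bit floating-point format: E2M1 values grouped in blocks of 16 that share a scale factor. The sketch below simulates that rounding in plain Python to show where the precision goes. It is a simplification under stated assumptions: the block scale is kept in full precision rather than stored in FP8, the per-tensor scale is skipped, and `fake_quantize_nvfp4` is a hypothetical helper, not an NVIDIA API. It says nothing about how SANA-WM itself applies the format.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # values sharing one scale factor


def fake_quantize_nvfp4(values, block_size=BLOCK_SIZE):
    """Round values to the E2M1 grid per block and return the dequantized floats.

    Assumption: the block scale stays a Python float here; real NVFP4
    stores it in FP8 and adds a per-tensor scale on top.
    """
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        amax = max(abs(x) for x in block)
        # Map the largest magnitude in the block onto the top grid value.
        scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
        for x in block:
            target = abs(x) / scale
            nearest = min(E2M1_GRID, key=lambda g: abs(target - g))
            out.append(math.copysign(nearest * scale, x))
    return out


if __name__ == "__main__":
    weights = [0.9, -0.31, 0.05, 0.47, -0.88, 0.12]
    restored = fake_quantize_nvfp4(weights)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print("restored:", restored)
    print("worst absolute error:", worst)
```

With 4 bits per value plus one 8-bit scale per 16 values, storage comes to about 4.5 bits per weight (ignoring the per-tensor scale), compared with 16 bits for BF16.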
Topics
- SANA-WM
- World Models
- Video Generation
- Single GPU Inference
- Hybrid Gated DeltaNet
Best for: Research Scientist, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.