NVIDIA Introduces SANA-WM: A 2.6B-Parameter Open-Source World Model That Generates Minute-Scale 720p Video on a Single GPU

2026-05-16 · Source: Machine Learning ML & Generative AI News · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Advanced, quick

Summary

NVIDIA has released SANA-WM, a 2.6B-parameter open-source world model that generates one-minute 720p videos with precise 6-DoF camera control from a single image and a camera trajectory. Inference runs on a single GPU, whereas other open-source world models often require 8 GPUs or drop the resolution to 480p. The efficiency comes from two components: a hybrid Gated DeltaNet + softmax backbone, which keeps the recurrent state at a constant size, and a dual-branch camera control system. A second-stage refiner, a 17B LTX-2 model with rank-384 LoRA, significantly reduces visual drift. Reported throughput is 22.0 videos/hour on 8 H100s, a 36x increase over LingBot-World. A distilled variant generates a 60-second 720p clip in 34 seconds on a single RTX 5090 using NVFP4 quantization.

Key takeaway

For Computer Vision Engineers developing high-resolution video generation models, SANA-WM is a notable architectural advance: it runs 720p inference on a single GPU and reports 36x the throughput of LingBot-World. Consider adopting its design principles, particularly the hybrid Gated DeltaNet backbone and dual-branch camera control, to ease memory and performance bottlenecks in your own projects.

Key insights

SANA-WM is an efficient 2.6B-parameter open-source world model generating 720p video on a single GPU.

Principles

A constant-size recurrent state keeps memory from growing with video length.
Dual-branch camera control improves trajectory precision.
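
To make the first principle concrete, the sketch below counts stored elements for one softmax-attention layer with a key/value cache against one recurrent layer that keeps a fixed state matrix per head. The head count, head dimension, and token counts are illustrative assumptions, not SANA-WM's actual configuration, and the function names are hypothetical.

```python
def kv_cache_elems(num_tokens: int, num_heads: int, head_dim: int) -> int:
    """Elements in a softmax-attention KV cache: one key and one value
    vector per token, per head."""
    return 2 * num_tokens * num_heads * head_dim


def recurrent_state_elems(num_heads: int, head_dim: int) -> int:
    """Elements in a linear-recurrent layer: one head_dim x head_dim
    state matrix per head, independent of sequence length."""
    return num_heads * head_dim * head_dim


# Hypothetical sizes, chosen only for illustration.
NUM_HEADS, HEAD_DIM = 16, 128

for num_tokens in (10_000, 100_000, 1_000_000):
    kv = kv_cache_elems(num_tokens, NUM_HEADS, HEAD_DIM)
    state = recurrent_state_elems(NUM_HEADS, HEAD_DIM)
    print(f"{num_tokens:>9,} tokens: kv_cache={kv:>13,}  recurrent_state={state:,}")
```

The cache column grows in step with the token count, while the recurrent state column stays the same on every row, which is the property the principle relies on for minute-scale video.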

Method

SANA-WM employs a hybrid Gated DeltaNet + softmax backbone with a constant-size recurrent state, together with dual-branch camera control. A second-stage refiner (17B LTX-2 with rank-384 LoRA) then refines the output.
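
As a rough illustration of why a Gated DeltaNet layer carries a fixed-size state, here is a minimal NumPy sketch of the generic gated delta rule as it is commonly written in the Gated DeltaNet literature. It is not taken from the SANA-WM code: the function name, dimensions, and gate values are assumptions, and a real layer would learn the gates and run a chunked, batched kernel rather than a Python loop.

```python
import numpy as np


def gated_delta_step(S, q, k, v, alpha, beta):
    """One step of the gated delta rule (illustrative, single head).

    S: (d_v, d_k) state matrix; q, k: (d_k,) vectors; v: (d_v,) vector.
    alpha in [0, 1] is the decay gate, beta in [0, 1] the write strength.
    """
    S = alpha * S                                # gate: decay the old memory
    predicted_v = S @ k                          # value the memory returns for key k
    S = S + beta * np.outer(v - predicted_v, k)  # delta rule: correct toward v
    return S, S @ q                              # new state and read-out for query q


rng = np.random.default_rng(0)
d_k, d_v, num_steps = 8, 4, 1000                 # hypothetical sizes
S = np.zeros((d_v, d_k))
for _ in range(num_steps):
    k = rng.standard_normal(d_k)
    k /= np.linalg.norm(k)                       # unit-norm keys keep the update stable
    q = rng.standard_normal(d_k)
    v = rng.standard_normal(d_v)
    S, out = gated_delta_step(S, q, k, v, alpha=0.95, beta=0.5)

print(S.shape)  # still (4, 8) after 1000 steps
```

However many steps are processed, the layer only ever stores the `(d_v, d_k)` matrix `S`. A softmax layer in the hybrid would instead keep keys and values for the tokens it attends to.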

In practice

Generate 720p video on a single GPU.
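Achieve 36x higher throughput than LingBot-World (22.0 videos/hour on 8 H100s).
Use the distilled variant with NVFP4 quantization to generate a 60-second 720p clip in 34 seconds on a single RTX 5090.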
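
The figures reported in the summary can be unpacked with a little arithmetic. The sketch below derives the baseline throughput implied by the 36x claim, the average time per video on the 8-H100 setup, and the real-time factor of the distilled variant. It assumes both throughput numbers were measured on comparable hardware and clip lengths, which the summary does not state; the helper names are hypothetical.

```python
def implied_baseline(videos_per_hour: float, speedup: float) -> float:
    """Baseline throughput implied by a reported speedup factor."""
    return videos_per_hour / speedup


def minutes_per_video(videos_per_hour: float) -> float:
    """Average wall-clock minutes per generated video."""
    return 60.0 / videos_per_hour


def realtime_factor(clip_seconds: float, wall_seconds: float) -> float:
    """Seconds of video produced per second of wall-clock time."""
    return clip_seconds / wall_seconds


# Figures quoted in the summary.
print(f"{implied_baseline(22.0, 36.0):.2f} videos/hour")  # 0.61 (implied LingBot-World rate)
print(f"{minutes_per_video(22.0):.2f} min per video")     # 2.73 (on 8 H100s)
print(f"{realtime_factor(60.0, 34.0):.2f}x real time")    # 1.76 (distilled, RTX 5090)
```

A real-time factor above 1 means the distilled variant produces video faster than it plays back.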

Topics

SANA-WM
World Models
Video Generation
Single GPU Inference
Hybrid Gated DeltaNet

Code references

NVlabs/Sana

Best for: Research Scientist, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.