SANA-WM: Efficient Minute-Scale World Modeling with Hybrid Linear Diffusion Transformer
Summary
SANA-WM is an efficient 2.6B-parameter open-source world model designed for generating high-fidelity, 720p, minute-scale videos with precise camera control. It achieves visual quality comparable to industrial baselines like LingBot-World and HY-WorldPlay, while significantly enhancing efficiency. Key architectural innovations include Hybrid Linear Attention for memory-efficient long-context modeling, Dual-Branch Camera Control for 6-DoF trajectory adherence, and a Two-Stage Generation Pipeline with a long-video refiner. SANA-WM also utilizes a Robust Annotation Pipeline to extract accurate metric-scale 6-DoF camera poses from public videos. The model was trained in 15 days on 64 H100s using approximately 213K public video clips and can generate a 60s clip on a single GPU, with a distilled variant running on an RTX 5090 using NVFP4 quantization.
Key takeaway
For research scientists developing world models, SANA-WM demonstrates that combining hybrid linear attention with a two-stage generation pipeline can significantly improve efficiency and throughput without sacrificing visual quality. You should consider these architectural designs to reduce training compute and enable single-GPU inference for minute-scale video generation, especially when precise camera control is critical.
Key insights
SANA-WM offers efficient, high-fidelity minute-scale video generation with precise camera control via hybrid attention and a two-stage pipeline.
Principles
- Hybrid attention improves long-context modeling efficiency.
- Dual-branch control ensures precise camera trajectory.
- Two-stage refinement enhances video quality and consistency.
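To make the "6-DoF camera trajectory" in the principles above concrete, the following is a minimal sketch of how a single camera pose (three rotational plus three translational degrees of freedom) is commonly written as a 4x4 rigid transform, and how the relative motion between two frames can be derived from it. The function names `pose_matrix` and `relative_pose` and the axis-angle parameterization are assumptions made for illustration; the summary does not say how SANA-WM encodes poses internally.

```python
import numpy as np

def pose_matrix(rotvec, translation):
    """Build a 4x4 rigid transform from an axis-angle vector (3 DoF)
    and a translation vector (3 DoF). Hypothetical helper for illustration."""
    rotvec = np.asarray(rotvec, dtype=float)
    theta = np.linalg.norm(rotvec)
    if theta < 1e-12:
        rot = np.eye(3)  # no rotation
    else:
        kx, ky, kz = rotvec / theta
        # Skew-symmetric cross-product matrix of the unit rotation axis
        K = np.array([[0.0, -kz, ky],
                      [kz, 0.0, -kx],
                      [-ky, kx, 0.0]])
        # Rodrigues' rotation formula
        rot = np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)
    T = np.eye(4)
    T[:3, :3] = rot
    T[:3, 3] = np.asarray(translation, dtype=float)
    return T

def relative_pose(pose_a, pose_b):
    """Pose of frame b expressed in the coordinates of frame a."""
    return np.linalg.inv(pose_a) @ pose_b

# A two-frame toy trajectory: start at the origin, then yaw 90 degrees
# about the z-axis while moving one unit along x.
frame0 = pose_matrix([0.0, 0.0, 0.0], [0.0, 0.0, 0.0])
frame1 = pose_matrix([0.0, 0.0, np.pi / 2], [1.0, 0.0, 0.0])
delta = relative_pose(frame0, frame1)
```

A trajectory is then just a list of such matrices, one per frame. Metric-scale poses, as mentioned in the summary, mean the translation component is expressed in real-world units rather than up to an unknown scale factor.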
Method
SANA-WM employs Hybrid Linear Attention, Dual-Branch Camera Control, and a Two-Stage Generation Pipeline, supported by a Robust Annotation Pipeline for 6-DoF camera pose extraction from public videos.
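The efficiency argument behind linear attention can be illustrated with a small sketch. It shows generic, non-causal kernelized linear attention, not SANA-WM's actual hybrid layer: the ReLU-based `feature_map`, the function names, and the tensor shapes are all assumptions for illustration. The point is that by computing the key-value summary first, the cost grows linearly with sequence length `n` and the `n x n` attention matrix is never materialized.

```python
import numpy as np

def feature_map(x):
    # Assumed positive feature map (ReLU plus a small constant so the
    # normalizer below never reaches zero).
    return np.maximum(x, 0.0) + 1e-6

def linear_attention(q, k, v):
    """q, k: (n, d); v: (n, d_v). Cost is O(n * d * d_v)."""
    q_f, k_f = feature_map(q), feature_map(k)
    kv = k_f.T @ v             # (d, d_v) summary, independent of n
    k_sum = k_f.sum(axis=0)    # (d,) normalizer statistics
    return (q_f @ kv) / (q_f @ k_sum)[:, None]

def quadratic_reference(q, k, v):
    """Same kernel, computed the O(n^2) way via an explicit (n, n) matrix."""
    q_f, k_f = feature_map(q), feature_map(k)
    weights = q_f @ k_f.T
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
n, d, d_v = 128, 16, 32
q = rng.standard_normal((n, d))
k = rng.standard_normal((n, d))
v = rng.standard_normal((n, d_v))

out_linear = linear_attention(q, k, v)
out_quadratic = quadratic_reference(q, k, v)
```

Both functions return the same values up to floating-point error; only the order of the matrix products differs. The fixed-size `(d, d_v)` summary is what keeps memory flat as the token count grows, which is the property that matters for minute-scale video contexts.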
In practice
- Generate 60s 720p videos on a single GPU.
- Deploy distilled variant on RTX 5090 with NVFP4.
- Achieve 36x higher throughput for world modeling.
Topics
- SANA-WM
- World Modeling
- Hybrid Linear Diffusion Transformer
- Minute-Scale Video Generation
- Camera Control
Best for: Research Scientist, AI Scientist, Computer Vision Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.