minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models
Summary
minWM is a full-stack open-source framework that turns existing bidirectional text-to-video (T2V) and text-and-image-to-video (TI2V) diffusion foundation models into real-time interactive video world models. Interactive applications require rollout that is controllable, causal, and low-latency, and the framework targets all three. Its end-to-end pipeline has two stages. First, a bidirectional video diffusion model is fine-tuned for camera control. Second, the Causal Forcing / Causal Forcing++ pipeline (autoregressive (AR) diffusion training, then causal ODE or causal consistency distillation, then asymmetric DMD) distills that model into a few-step autoregressive generator for low-latency inference. minWM is modular and extensible: it is demonstrated on the Wan2.1-T2V-1.3B and HY1.5-TI2V-8B backbones, covering both cross-attention and MMDiT-style architectures. It also supports adapting existing world models such as HY-WorldPlay to new data distributions and latency targets, and it ships runnable scripts, checkpoints, documentation, and practical ablations.
Key takeaway
For AI engineers building interactive video applications, minWM provides a complete path to real-time, controllable video generation. Use its pipeline to fine-tune and distill an existing T2V/TI2V model, or to adapt a model such as HY-WorldPlay, when you need low latency and camera control. The released scripts, checkpoints, and documentation give you an open-source starting point instead of building the stack from scratch.
Key insights
minWM is an open-source framework that converts video diffusion models into real-time, interactive, camera-controllable world models through a full-stack pipeline.
Principles
- Interactive video models need controllable, causal, low-latency rollout.
- Full-stack pipelines are crucial for real-time interactive world models.
- Distillation turns a bidirectional diffusion model into an efficient few-step autoregressive generator.
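To make "causal" concrete, here is a minimal sketch of a block-causal attention mask, in which every token of a video chunk can attend to its own chunk and to earlier chunks but never to later ones. This is a generic illustration, not code from minWM: the function name `block_causal_mask` and the parameters `num_chunks` and `tokens_per_chunk` are hypothetical, and the actual masking scheme used by Causal Forcing may differ.

```python
from typing import List


def block_causal_mask(num_chunks: int, tokens_per_chunk: int) -> List[List[bool]]:
    """Return mask[q][k] == True when query token q may attend to key token k.

    Assumption for illustration: tokens are laid out chunk by chunk, and a
    token sees all tokens in its own chunk plus all tokens in earlier chunks.
    """
    total = num_chunks * tokens_per_chunk
    mask = []
    for q in range(total):
        q_chunk = q // tokens_per_chunk  # chunk index of the query token
        row = [(k // tokens_per_chunk) <= q_chunk for k in range(total)]
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Two chunks of two tokens each: the first chunk cannot see the second.
    for row in block_causal_mask(num_chunks=2, tokens_per_chunk=2):
        print(["x" if allowed else "." for allowed in row])
```

A bidirectional model corresponds to a mask that is `True` everywhere. Restricting attention this way is what lets earlier chunks be generated and displayed before later ones exist.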
Method
Fine-tune a bidirectional video diffusion model for camera control, then apply Causal Forcing / Causal Forcing++ (AR diffusion training, causal ODE or causal consistency distillation, asymmetric DMD) to obtain a few-step autoregressive generator.
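The inference loop that such a few-step autoregressive generator implies can be sketched as follows. This is a toy under stated assumptions, not minWM's API: `rollout`, `toy_denoiser`, the scalar "camera action", the timestep schedule, and the linear re-noising rule `(1 - t) * clean + t * noise` are all hypothetical stand-ins chosen so the example runs with the standard library alone.

```python
import random
from typing import Callable, List, Sequence

Chunk = List[float]  # toy stand-in for a chunk of video latents


def toy_denoiser(noisy: Chunk, t: float, history: Sequence[Chunk], action: float) -> Chunk:
    """Hypothetical stand-in for the distilled network.

    Predicts a clean chunk from the previous chunk plus the camera action;
    a real model would also use `noisy` and `t`.
    """
    previous = history[-1][0] if history else 0.0
    return [previous + action for _ in noisy]


def rollout(
    actions: Sequence[float],
    denoise: Callable[[Chunk, float, Sequence[Chunk], float], Chunk],
    chunk_size: int = 4,
    timesteps: Sequence[float] = (1.0, 0.5, 0.25),  # assumed few-step schedule
    seed: int = 0,
) -> List[Chunk]:
    """Generate one chunk per action, conditioning only on past chunks."""
    rng = random.Random(seed)
    history: List[Chunk] = []
    for action in actions:
        x = [rng.gauss(0.0, 1.0) for _ in range(chunk_size)]  # start from noise
        for i, t in enumerate(timesteps):
            clean = denoise(x, t, history, action)  # predict the clean chunk
            t_next = timesteps[i + 1] if i + 1 < len(timesteps) else 0.0
            # Re-noise to the next, lower noise level; at t_next == 0 keep the prediction.
            x = [(1.0 - t_next) * c + t_next * rng.gauss(0.0, 1.0) for c in clean]
        history.append(x)  # finished chunk becomes context for the next one
    return history


if __name__ == "__main__":
    video = rollout(actions=[1.0, 2.0, -0.5], denoise=toy_denoiser)
    print([chunk[0] for chunk in video])
```

The point of the sketch is the control flow: each chunk needs only `len(timesteps)` network calls and depends only on already generated chunks, so a new camera action can take effect on the very next chunk. That is the property that makes low-latency interaction possible.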
In practice
- Convert T2V/TI2V models into interactive world models.
- Adapt existing video world models like HY-WorldPlay.
- Utilize provided scripts, checkpoints, and documentation.
Topics
- Video World Models
- Real-time Video Generation
- Model Distillation
- Camera Control
- Open-source Frameworks
- Video Diffusion Models
Best for: AI Scientist, Research Scientist, AI Engineer, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.