minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Software Development & Engineering) · Depth: Expert, quick

Summary

minWM is a full-stack open-source framework for turning existing bidirectional text-to-video (T2V) and text-and-image-to-video (TI2V) diffusion foundation models into real-time interactive video world models. It targets the controllable, causal, and low-latency rollout that interactive applications require. The end-to-end pipeline first fine-tunes a bidirectional video diffusion model for camera control. It then applies the Causal Forcing / Causal Forcing++ pipeline (autoregressive (AR) diffusion training, causal ODE or causal consistency distillation, and asymmetric distribution matching distillation (DMD)) to distill the model into a few-step autoregressive generator for low-latency inference. minWM is modular and extensible: it is demonstrated with backbones such as Wan2.1-T2V-1.3B and HY1.5-TI2V-8B, covering both cross-attention and MMDiT-style architectures. It also supports adapting models such as HY-WorldPlay to new data distributions and latency targets, and ships runnable scripts, checkpoints, documentation, and practical ablations.
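To make "few-step autoregressive generator" concrete, here is a minimal sketch of what a causal, camera-conditioned rollout loop could look like. It is not taken from the minWM code: `rollout`, `toy_denoise`, the timestep schedule, and the latent size are all hypothetical, and the toy denoiser is plain arithmetic standing in for a distilled network. The point is only the control flow: each chunk starts from noise, is refined in a handful of denoiser calls, and sees only chunks that were already generated.

```python
import random
from typing import Callable, List, Sequence

Latent = List[float]  # toy stand-in for one chunk of video latents


def rollout(
    denoise: Callable[[Latent, float, Sequence[Latent], Sequence[float]], Latent],
    camera_actions: Sequence[Sequence[float]],
    timesteps: Sequence[float] = (1.0, 0.75, 0.5, 0.25),  # assumed 4-step schedule
    latent_dim: int = 4,
    seed: int = 0,
) -> List[Latent]:
    """Generate one chunk per camera action, using len(timesteps) denoiser calls each."""
    rng = random.Random(seed)
    history: List[Latent] = []  # causal context: only chunks generated so far
    for action in camera_actions:
        # Every chunk starts from fresh Gaussian noise.
        x = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        # Few-step refinement, conditioned on the past and the current camera action.
        for t in timesteps:
            x = denoise(x, t, history, action)
        history.append(x)  # later chunks may condition on this one, never the reverse
    return history


def toy_denoise(x, t, history, action):
    """Hypothetical denoiser: blends noise toward the previous chunk, shifted by the action."""
    target = history[-1] if history else [0.0] * len(x)
    shift = sum(action)
    return [t * xi + (1.0 - t) * (ti + shift) for xi, ti in zip(x, target)]


if __name__ == "__main__":
    actions = [(0.1, 0.0), (0.0, 0.1), (-0.1, 0.0)]  # made-up camera deltas
    chunks = rollout(toy_denoise, actions)
    print(len(chunks))  # one chunk per camera action
```

Per-chunk latency in a loop like this scales with the number of denoiser calls, which is why distilling to a few steps matters for interactive use. A real system would typically also cache attention state for past chunks rather than pass raw history, but the summary does not describe that detail.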

Key takeaway

For AI engineers building interactive video applications, minWM offers a complete pipeline for real-time, controllable video generation. Explore it for fine-tuning and distilling existing T2V/TI2V models, or for adapting models like HY-WorldPlay, to meet low-latency and camera-control requirements. It gives interactive world model development an open-source starting point with runnable scripts and checkpoints.

Key insights

minWM is an open-source framework that converts video diffusion models into real-time, interactive, camera-controllable world models through a full-stack pipeline.

Method

Fine-tune a bidirectional video diffusion model for camera control, then apply Causal Forcing / Causal Forcing++ (AR diffusion training, causal ODE or causal consistency distillation, asymmetric DMD) to obtain a few-step autoregressive generator.
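The stages named above can be written down as an ordered plan. The sketch below is an illustration only: `Stage`, `build_plan`, and the stage identifiers are hypothetical names, not minWM's configuration format, and the strict sequential order is an assumption based on the order in which the summary lists the stages. It does capture the one explicit choice in the description, between causal ODE and causal consistency distillation.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    name: str  # hypothetical identifier, not a minWM config key


# The two distillation variants the summary mentions as alternatives.
DISTILLATION_CHOICES = ("causal_ode", "causal_consistency")


def build_plan(distillation: str = "causal_ode") -> List[Stage]:
    """Return the training stages in the order the summary lists them."""
    if distillation not in DISTILLATION_CHOICES:
        raise ValueError(f"distillation must be one of {DISTILLATION_CHOICES}")
    return [
        Stage("camera_control_finetune"),       # fine-tune the bidirectional model
        Stage("ar_diffusion_training"),         # Causal Forcing: AR diffusion training
        Stage(f"{distillation}_distillation"),  # causal ODE or causal consistency
        Stage("asymmetric_dmd"),                # final few-step distillation
    ]


if __name__ == "__main__":
    for index, stage in enumerate(build_plan("causal_consistency"), start=1):
        print(index, stage.name)
```

Keeping the distillation variant as a single parameter mirrors the "or" in the description: the rest of the plan is the same whichever variant is chosen.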

Best for: AI Scientist, Research Scientist, AI Engineer, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.