minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models

2026-05-28 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Software Development & Engineering · Depth: Expert, quick

Summary

minWM is a full-stack open-source framework that turns existing bidirectional text-to-video (T2V) and text-image-to-video (TI2V) diffusion foundation models into real-time interactive video world models. Interactive applications need rollout that is controllable, causal, and low-latency, and minWM targets all three. Its end-to-end pipeline first fine-tunes a bidirectional video diffusion model for camera control. It then applies the Causal Forcing / Causal Forcing++ pipeline, which includes autoregressive (AR) diffusion training, causal ODE or causal consistency distillation, and asymmetric DMD, to distill the model into a few-step autoregressive generator suited to low-latency inference. The framework is modular and extensible: it is demonstrated on the Wan2.1-T2V-1.3B and HY1.5-TI2V-8B backbones, covering both cross-attention and MMDiT-style architectures. It also supports adapting existing world models such as HY-WorldPlay to new data distributions and latency targets, and it ships runnable scripts, checkpoints, documentation, and practical ablations.

Key takeaway

For AI engineers building interactive video applications, minWM covers the full path from a pretrained T2V/TI2V model to real-time, camera-controllable generation. Explore its pipeline for fine-tuning and distilling existing T2V/TI2V models, or for adapting models like HY-WorldPlay, when you need to meet low-latency and camera-control requirements. Because the scripts and checkpoints are open source, it can serve as a starting point for your own interactive world model work.

Key insights

minWM is an open-source, full-stack pipeline that converts video diffusion models into real-time, interactive, camera-controllable world models.

Principles

Interactive video models need controllable, causal, low-latency rollout.
Reaching real-time interaction takes a full-stack pipeline, from camera-control fine-tuning through distillation to inference.
Distillation turns a diffusion model into a few-step autoregressive generator.
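To make "causal, few-step rollout" concrete, here is a toy sketch in plain Python. It is not minWM code: `toy_denoise_step`, `rollout`, the scalar "latents", and the linear noise schedule are all hypothetical stand-ins chosen for illustration. It only shows the control flow that the principles describe: each chunk is produced with a small fixed number of refinement steps, conditioned on already-generated chunks and the current camera action.

```python
import random

def toy_denoise_step(x, context, action, t):
    """Hypothetical stand-in for one call to a few-step generator.

    x: current noisy scalar "latent"; context: previously generated chunks;
    action: scalar camera action; t: remaining noise level in [0, 1).
    """
    # Toy "clean" target: mean of the past chunks shifted by the action.
    past = sum(context) / len(context) if context else 0.0
    target = past + action
    # Move toward the target; the lower the noise level, the larger the move.
    return x + (target - x) * (1.0 - t)

def rollout(actions, num_steps=4, seed=0):
    """Generate one chunk per action, using only past chunks as context."""
    rng = random.Random(seed)
    history = []
    for action in actions:
        x = rng.gauss(0.0, 1.0)          # each chunk starts from noise
        for k in range(num_steps):       # few-step refinement
            t = 1.0 - (k + 1) / num_steps
            x = toy_denoise_step(x, history, action, t)
        history.append(x)                # becomes context for later chunks
    return history

if __name__ == "__main__":
    print(rollout([1.0, 0.0, -0.5]))
```

Because a chunk never reads future chunks or future actions, changing a later action leaves every earlier chunk unchanged. That property is what lets an interactive system emit frames before the user's next input is known.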

Method

Fine-tune a bidirectional video diffusion model for camera control, then apply Causal Forcing / Causal Forcing++ (AR diffusion training, causal ODE or causal consistency distillation, and asymmetric DMD) to obtain a few-step autoregressive generator.
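The stage order can be written down as a small data structure. The sketch below is illustrative only: the `Stage` class, `build_pipeline`, and the stage identifiers are hypothetical names, not the minWM API, and the ordering simply follows the order in which the summary lists the stages.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str  # hypothetical identifier, not a minWM script or flag name

def build_pipeline(distillation: str = "causal_ode") -> list:
    """Return the training stages in the order the summary lists them.

    `distillation` picks between the two options named in the summary:
    causal ODE distillation or causal consistency distillation.
    """
    if distillation not in ("causal_ode", "causal_consistency"):
        raise ValueError(f"unknown distillation variant: {distillation!r}")
    return [
        Stage("camera_control_finetune"),      # bidirectional model + camera control
        Stage("ar_diffusion_training"),        # autoregressive diffusion training
        Stage(f"{distillation}_distillation"), # causal ODE or consistency variant
        Stage("asymmetric_dmd"),               # final few-step distillation stage
    ]

if __name__ == "__main__":
    for i, stage in enumerate(build_pipeline("causal_consistency"), start=1):
        print(i, stage.name)
```

Keeping the distillation variant as a single parameter mirrors the "causal ODE or causal consistency" choice in the text; everything else in the sequence is fixed.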

In practice

Convert T2V/TI2V models into interactive world models.
Adapt existing video world models such as HY-WorldPlay to new data distributions and latency targets.
Use the provided scripts, checkpoints, documentation, and ablations.

Topics

Video World Models
Real-time Video Generation
Model Distillation
Camera Control
Open-source Frameworks
Video Diffusion Models

Code references

shengshu-ai/minWM

Best for: AI Scientist, Research Scientist, AI Engineer, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.