LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing

· Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, quick

Summary

LoomVideo is a 5B-parameter unified architecture for both video generation and editing, built to reduce the computational overhead of existing large models (typically 13B+ parameters). It targets token concatenation, which in current frameworks doubles the sequence length and therefore quadruples self-attention cost, since attention scales quadratically with sequence length. LoomVideo replaces the standard text encoder with a Multimodal Large Language Model (MLLM) and uses a Deepstack injection mechanism to align the MLLM's features with the Diffusion Transformer (DiT). Its central innovation for video editing is zero-overhead Scale-and-Add conditioning: the clean source video latent is scaled and added directly to the noised target latent, which eliminates token concatenation while still enabling complex, non-rigid edits. The model also uses a Negative Temporal RoPE strategy to handle multiple reference images. In the reported experiments, LoomVideo achieves state-of-the-art or competitive performance, with particular strength in e-commerce and fashion generation, and delivers at least a 5.41x acceleration in inference speed.
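The "doubles sequence length, quadruples attention cost" claim can be checked with a few lines of arithmetic. This is a minimal sketch, assuming full (dense) self-attention whose cost is proportional to the number of query-key pairs; `target_tokens` is a hypothetical token count chosen for illustration, not a figure from the paper.

```python
def attention_pair_count(num_tokens: int) -> int:
    """Number of query-key pairs scored by dense self-attention."""
    return num_tokens * num_tokens


# Hypothetical number of latent tokens in the target video (illustrative only).
target_tokens = 8_000

# Token concatenation: source-video tokens are appended to the target tokens.
concat_tokens = 2 * target_tokens

# Additive conditioning: the source latent is added element-wise, no new tokens.
added_tokens = target_tokens

ratio = attention_pair_count(concat_tokens) / attention_pair_count(added_tokens)
print(ratio)  # 4.0
```

The ratio is independent of the chosen token count, because the cost grows with the square of the sequence length.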

Key takeaway

For AI engineers evaluating video foundation models, LoomVideo is a smaller alternative to 13B+ architectures. Its 5B-parameter design and zero-overhead Scale-and-Add conditioning are reported to deliver at least a 5.41x inference speedup, which suits applications that need efficient video generation and complex, non-rigid editing. Consider it for e-commerce or fashion scenarios, where the reported results are strongest and fast, high-quality multimodal video output matters.

Key insights

LoomVideo unifies multimodal video generation and editing in a single 5B-parameter model and reports at least a 5.41x inference speedup via zero-overhead conditioning.

Method

LoomVideo replaces the text encoder with an MLLM and aligns the MLLM's multi-layer features with a Diffusion Transformer via Deepstack injection. It uses Scale-and-Add conditioning for editing and Negative Temporal RoPE for multiple reference images.
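The Scale-and-Add idea can be illustrated with a toy example. This is a minimal sketch, not the authors' implementation: the summary only says the clean source latent is scaled and added to the noised target latent, so the `scale` parameter, its value, the function names, and the list-of-lists latent layout are all assumptions made for illustration.

```python
from typing import List

Latent = List[List[float]]  # toy layout: tokens x channels


def scale_and_add(noised_target: Latent, clean_source: Latent, scale: float) -> Latent:
    """Add a scaled source latent to the noised target latent, token by token.

    `scale` is a hypothetical hyperparameter; the summary does not state its value.
    """
    if len(noised_target) != len(clean_source):
        raise ValueError("latents must have the same number of tokens")
    conditioned = []
    for target_token, source_token in zip(noised_target, clean_source):
        if len(target_token) != len(source_token):
            raise ValueError("tokens must have the same number of channels")
        conditioned.append(
            [t + scale * s for t, s in zip(target_token, source_token)]
        )
    return conditioned


def concat_condition(noised_target: Latent, clean_source: Latent) -> Latent:
    """Baseline for comparison: append source tokens to the target tokens."""
    return noised_target + clean_source


noised = [[0.5, -1.0], [2.0, 0.0]]
source = [[1.0, 1.0], [-2.0, 4.0]]

conditioned = scale_and_add(noised, source, scale=0.5)
concatenated = concat_condition(noised, source)

print(conditioned)        # [[1.0, -0.5], [1.0, 2.0]]
print(len(conditioned))   # 2 tokens: sequence length unchanged
print(len(concatenated))  # 4 tokens: sequence length doubled
```

The additive variant keeps the token count of the target latent, which is why the conditioning adds no attention cost, while the concatenation baseline doubles the sequence the transformer must attend over.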

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.