LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing
Summary
LoomVideo is a 5B-parameter unified architecture for both video generation and editing, designed to reduce the computational overhead of existing large models (typically 13B+ parameters). It targets token concatenation, which in current frameworks doubles the sequence length and therefore quadruples the cost of self-attention. LoomVideo replaces the standard text encoder with a Multimodal Large Language Model (MLLM) and uses a Deepstack injection mechanism to align MLLM features with the Diffusion Transformer (DiT). Its key innovation for video editing is zero-overhead Scale-and-Add conditioning: the clean source video latent is scaled and added directly to the noised target latent, which eliminates token concatenation and enables complex, non-rigid edits. The model also incorporates a Negative Temporal RoPE strategy for handling multiple reference images. Extensive experiments show LoomVideo achieves state-of-the-art or competitive performance, particularly in e-commerce and fashion generation, while delivering at least a 5.41x acceleration in inference speed.
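The "doubles sequence length, quadruples self-attention cost" claim follows from full self-attention scoring every token against every other token. A minimal sketch of that arithmetic, assuming the source video contributes as many tokens as the target (the function names and the token count are hypothetical, chosen only for illustration):

```python
def attention_pairs(num_tokens: int) -> int:
    """Number of query-key pairs scored by full self-attention."""
    return num_tokens * num_tokens


def concat_overhead(target_tokens: int) -> float:
    """Attention cost with source tokens concatenated, relative to target tokens alone.

    Assumption: the source video yields the same number of tokens as the target,
    so concatenation doubles the sequence length.
    """
    concatenated = attention_pairs(2 * target_tokens)
    baseline = attention_pairs(target_tokens)
    return concatenated / baseline


if __name__ == "__main__":
    # Hypothetical token count; the ratio does not depend on it.
    print(concat_overhead(10_000))
```

The ratio covers only the attention score computation; other layers scale linearly with sequence length, so the end-to-end slowdown from concatenation would be somewhere between 2x and 4x under these assumptions.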
Key takeaway
For AI engineers evaluating video foundation models, LoomVideo offers an alternative to larger, slower architectures. Its 5B-parameter design and zero-overhead Scale-and-Add conditioning deliver an inference speedup of at least 5.41x, which suits applications that need efficient video generation and complex, non-rigid editing. Consider it for e-commerce or fashion scenarios where fast, high-quality multimodal video output is critical.
Key insights
LoomVideo unifies multimodal video generation and editing in a 5B-parameter model and reports at least a 5.41x inference speedup from zero-overhead conditioning.
Principles
- Efficient conditioning reduces computational cost.
- Multimodal LLMs enhance video generation.
- Direct latent manipulation avoids token overhead.
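The last principle can be illustrated with a small sketch of Scale-and-Add next to a concatenation baseline. This is an illustration under stated assumptions, not the authors' implementation: latents are flattened to plain lists of floats (a real model would use tensors), and `scale` is a hypothetical hyperparameter whose actual value or schedule the summary does not give.

```python
from typing import List, Sequence


def scale_and_add(noised_target: Sequence[float],
                  clean_source: Sequence[float],
                  scale: float) -> List[float]:
    """Add a scaled clean source latent to a noised target latent, element-wise.

    The result has the same length as the target, so the transformer would
    see no extra tokens. `scale` is a hypothetical hyperparameter.
    """
    if len(noised_target) != len(clean_source):
        raise ValueError("source and target latents must have the same shape")
    return [t + scale * s for t, s in zip(noised_target, clean_source)]


def concat_condition(noised_target: Sequence[float],
                     clean_source: Sequence[float]) -> List[float]:
    """Baseline for comparison: concatenation doubles the sequence length."""
    return list(noised_target) + list(clean_source)


if __name__ == "__main__":
    target = [0.5, -1.0, 2.0]   # toy noised target latent
    source = [1.0, 1.0, -2.0]   # toy clean source latent
    print(len(scale_and_add(target, source, scale=0.5)))  # same length as target
    print(len(concat_condition(target, source)))          # twice the target length
```

Because the conditioned input keeps the target's shape, the attention cost is the same as for unconditioned generation, which is what "zero-overhead" refers to.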
Method
LoomVideo replaces the text encoder with an MLLM and aligns its multi-layer features with a Diffusion Transformer via Deepstack injection. It uses Scale-and-Add conditioning for editing and Negative Temporal RoPE for multiple reference images.
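The summary does not spell out how Negative Temporal RoPE assigns positions. One plausible reading, sketched here purely as an assumption, is that reference images receive negative temporal indices so they never collide with the video frames, which start at zero. The function name and the exact index scheme are hypothetical.

```python
from typing import List


def temporal_positions(num_reference_images: int, num_video_frames: int) -> List[int]:
    """Temporal position indices for [reference images..., video frames...].

    Assumed scheme (not confirmed by the source): K reference images take
    indices -K..-1 and T video frames take indices 0..T-1.
    """
    if num_reference_images < 0 or num_video_frames < 0:
        raise ValueError("counts must be non-negative")
    reference_ids = list(range(-num_reference_images, 0))
    frame_ids = list(range(num_video_frames))
    return reference_ids + frame_ids


if __name__ == "__main__":
    # Three reference images followed by four video frames.
    print(temporal_positions(3, 4))
```

Under this reading, each reference image gets a distinct temporal index, and adding more references does not shift the indices of the video frames.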
In practice
- Generate videos from diverse multimodal inputs.
- Perform complex, non-rigid video edits.
- Accelerate video inference by at least 5.41x.
Topics
- Video Generation
- Video Editing
- Multimodal LLM
- Diffusion Transformer
- Scale-and-Add Conditioning
- E-commerce Applications
Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.