VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion
Summary
VideoMLA targets long-rollout causal video diffusion, where traditional fixed-size sliding-window KV caches impose significant memory and latency overheads. It replaces per-head keys and values with a shared low-rank content latent and a shared decoupled 3D-RoPE positional key, cutting per-token KV memory by 92.7% at every cached layer. The research also examines why Multi-Head Latent Attention (MLA) is effective in video diffusion. The explanation is not the spectral assumption often used in language models, since pretrained video attention is not inherently low-rank; instead, the MLA bottleneck itself dictates the effective rank. On VBench, VideoMLA matches short-horizon streaming video diffusion baselines, achieves the best overall score at long horizons, and improves throughput by 1.23x on a single B200.
Key takeaway
For AI Architects and Machine Learning Engineers developing long-rollout video diffusion models, VideoMLA offers a substantial KV-cache memory optimization. Consider adopting its low-rank latent KV cache: the paper reports a 92.7% reduction in per-token KV memory and a 1.23x throughput improvement on a single B200. On VBench it matches short-horizon baselines and achieves the best overall score at long horizons, which points to more efficient and scalable deployments without a quality penalty.
Key insights
VideoMLA cuts per-token KV cache memory by 92.7% in video diffusion by using low-rank latent attention, while improving long-horizon performance and throughput.
Principles
- Low-rank latent attention can drastically cut KV memory.
- The MLA bottleneck, not the pretrained attention spectrum, determines the effective rank.
- A shared content latent and a shared decoupled positional key can replace per-head keys and values in video diffusion.
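The point about the bottleneck setting the effective rank can be checked numerically. The following is a minimal NumPy sketch, not taken from the paper: all sizes and weight names are hypothetical, and random matrices stand in for trained weights. It shows that keys rebuilt through a width-r latent have rank at most r, even when a dense projection of the same inputs is full-rank.

```python
import numpy as np

def bottleneck_keys(x, w_down, w_up):
    """Project tokens to a latent of width r, then up-project to key space."""
    return (x @ w_down) @ w_up

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
n_tokens, d_model, d_key, r = 64, 48, 48, 8

x = rng.standard_normal((n_tokens, d_model))
w_full = rng.standard_normal((d_model, d_key))  # dense projection with no low-rank structure
w_down = rng.standard_normal((d_model, r))      # down-projection into the latent
w_up = rng.standard_normal((r, d_key))          # up-projection back to key space

# Rank of keys from the dense projection versus keys routed through the bottleneck.
full_rank = np.linalg.matrix_rank(x @ w_full)
latent_rank = np.linalg.matrix_rank(bottleneck_keys(x, w_down, w_up))

print(full_rank, latent_rank)
```

The rank cap comes from the architecture (the width of the latent), not from any property of the dense weights' spectrum.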
Method
VideoMLA replaces per-head keys/values with a shared low-rank content latent and a shared decoupled 3D-RoPE positional key to reduce KV memory.
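The summary does not give the layer equations, so the sketch below follows the general MLA pattern as an assumption: down-project each token to a shared latent, rebuild per-head keys and values from that latent at attention time, and add a separate rotary key that all heads share. The weight names, the dimensions, the equal three-way split of rotary features over (t, y, x), and the score scaling are all hypothetical choices, and causal masking is omitted because the queries here attend only to earlier cached frames.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Rotate feature pairs of x (..., n, d) by angles set by positions pos (n,)."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)      # (d/2,)
    ang = pos[:, None] * freqs[None, :]            # (n, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(x, coords):
    """Split the feature dim into three chunks and rotate each by one of (t, y, x)."""
    chunks = np.split(x, 3, axis=-1)
    return np.concatenate(
        [rope_1d(c, coords[:, i]) for i, c in enumerate(chunks)], axis=-1
    )

def encode_for_cache(x, coords, w):
    """Per token, cache only a content latent and one positional key shared by all heads."""
    c = x @ w["dkv"]                               # (n, d_c)
    k_rope = rope_3d(x @ w["kr"], coords)          # (n, d_r)
    return c, k_rope

def mla_attention(x_q, coords_q, cache, w):
    """Attend from new tokens to the cached latents."""
    c, k_rope = cache
    k_content = np.einsum("nc,hcd->hnd", c, w["uk"])   # per-head keys rebuilt from latent
    v = np.einsum("nc,hcd->hnd", c, w["uv"])           # per-head values rebuilt from latent
    q_content = np.einsum("nm,hmd->hnd", x_q, w["q"])
    q_rope = rope_3d(np.einsum("nm,hmr->hnr", x_q, w["qr"]), coords_q)
    scores = np.einsum("hqd,hkd->hqk", q_content, k_content)
    scores += np.einsum("hqr,kr->hqk", q_rope, k_rope)  # positional term, shared key
    scores /= np.sqrt(k_content.shape[-1] + k_rope.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)        # stabilise the softmax
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return np.einsum("hqk,hkd->hqd", probs, v)

rng = np.random.default_rng(0)
d_model, H, d_h, d_c, d_r = 32, 4, 8, 6, 12        # hypothetical sizes
w = {
    "dkv": rng.standard_normal((d_model, d_c)),
    "kr": rng.standard_normal((d_model, d_r)),
    "uk": rng.standard_normal((H, d_c, d_h)),
    "uv": rng.standard_normal((H, d_c, d_h)),
    "q": rng.standard_normal((H, d_model, d_h)),
    "qr": rng.standard_normal((H, d_model, d_r)),
}

# Two cached frames on a 2x3 token grid, then one new frame at t = 2.
coords_kv = np.indices((2, 2, 3)).reshape(3, -1).T             # (12, 3)
coords_q = np.indices((1, 2, 3)).reshape(3, -1).T + [2, 0, 0]  # (6, 3)
x_kv = rng.standard_normal((len(coords_kv), d_model))
x_q = rng.standard_normal((len(coords_q), d_model))

cache = encode_for_cache(x_kv, coords_kv, w)
out = mla_attention(x_q, coords_q, cache, w)
print(out.shape, cache[0].shape, cache[1].shape)
```

In this toy configuration each cached token stores d_c + d_r = 18 values instead of the 2 * H * d_h = 64 that per-head keys and values would need. The rotary key is kept separate from the latent so that position-dependent rotation never has to pass through the up-projection.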
In practice
- Achieve a 92.7% per-token KV memory reduction at every cached layer in video diffusion.
- Improve throughput by 1.23x on a single B200 GPU.
- Reach the best overall VBench score at long horizons while matching short-horizon streaming baselines.
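As a rough guide to where a per-token saving of this kind comes from, the arithmetic can be sketched as follows. The head count, head dimension, latent width, and rotary width below are hypothetical and are not the paper's configuration, so the result is not meant to reproduce the 92.7% figure.

```python
def standard_kv_values_per_token(num_heads, head_dim):
    """Per-head keys plus per-head values stored for one token at one layer."""
    return 2 * num_heads * head_dim

def latent_kv_values_per_token(latent_dim, rope_dim):
    """One shared content latent plus one shared positional key for one token."""
    return latent_dim + rope_dim

def kv_reduction(num_heads, head_dim, latent_dim, rope_dim):
    """Fraction of per-token cache entries saved by the latent cache."""
    standard = standard_kv_values_per_token(num_heads, head_dim)
    latent = latent_kv_values_per_token(latent_dim, rope_dim)
    return 1.0 - latent / standard

# Hypothetical configuration, for illustration only.
saving = kv_reduction(num_heads=12, head_dim=128, latent_dim=192, rope_dim=64)
print(f"{saving:.1%}")
```

Because the saving is per token and per layer, it scales with rollout length: the longer the cached history, the larger the absolute memory difference.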
Topics
- VideoMLA
- Video Diffusion
- KV Cache
- Multi-Head Latent Attention
- Memory Optimization
- 3D-RoPE
Best for: Research Scientist, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.