VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

2026-05-28 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

VideoMLA targets long-rollout causal video diffusion, where conventional fixed-size sliding-window KV caches impose significant memory and latency overheads. It replaces per-head keys and values with a shared low-rank content latent and a shared, decoupled 3D-RoPE positional key, cutting per-token KV memory by 92.7% at every cached layer. The paper also examines why Multi-Head Latent Attention (MLA) works in video diffusion. The spectral assumption often invoked for language models does not explain it, because pretrained video attention is not inherently low-rank. Instead, the MLA bottleneck itself sets the effective rank. On VBench, VideoMLA matches streaming video diffusion baselines at short horizons, achieves the best overall score at long horizons, and improves throughput by 1.23x on a single B200.

Key takeaway

For AI architects and machine learning engineers building long-rollout video diffusion models, VideoMLA is a memory optimization worth evaluating. Its low-rank latent KV cache cuts per-token KV memory by 92.7% and is reported to improve throughput by 1.23x on a single B200. On VBench it matches baselines at short horizons and posts the best overall score at long horizons, so the memory savings do not come at the cost of measured quality.

Key insights

VideoMLA cuts per-token KV cache memory in video diffusion by 92.7% using low-rank latent attention, while improving long-horizon quality and throughput.

Principles

Low-rank latent attention can drastically cut KV memory.
MLA bottleneck determines effective rank, not pretrained spectrum.
Shared content and positional keys optimize video diffusion.
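
The second principle can be checked with a few lines of linear algebra. The following is a minimal sketch, not the paper's experiment: the projection names (`W_down`, `W_up_k`) and all sizes are illustrative assumptions. It shows that keys routed through an r-dimensional latent have rank at most r, even when the input features and an ordinary key projection are full rank.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not taken from the source).
n_tokens, d_model, d_key, r = 256, 64, 64, 8

# Dense Gaussian token features, which are full rank with probability one.
X = rng.standard_normal((n_tokens, d_model))

# Hypothetical MLA-style projections: compress to r dims, then expand to keys.
W_down = rng.standard_normal((d_model, r))
W_up_k = rng.standard_normal((r, d_key))

# Ordinary key projection versus keys routed through the latent bottleneck.
K_dense = X @ rng.standard_normal((d_model, d_key))
K_latent = (X @ W_down) @ W_up_k

rank_dense = int(np.linalg.matrix_rank(K_dense))
rank_latent = int(np.linalg.matrix_rank(K_latent))
print("dense key rank:", rank_dense)
print("latent key rank:", rank_latent)
```

The bound holds for any input: a product that passes through an r-column matrix cannot exceed rank r, so the bottleneck width caps the rank regardless of the spectrum of the pretrained weights.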

Method

VideoMLA replaces per-head keys/values with a shared low-rank content latent and a shared decoupled 3D-RoPE positional key to reduce KV memory.
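
A minimal NumPy sketch of this cache layout follows. It is an illustration under stated assumptions, not the authors' implementation: the class `LatentKVCache`, the functions `rope_3d` and `attend`, the weight names, the simplified RoPE channel split across (t, y, x), and every dimension are hypothetical. Only the shared latent `c` and the shared positional key `k_rope` are stored per token; per-head keys and values are re-expanded at attention time.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Rotate channel pairs of x (..., n, d) by angles derived from pos (n,)."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)      # (d/2,)
    ang = pos[:, None] * freqs[None, :]            # (n, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

def rope_3d(x, pos3):
    """Simplified 3D RoPE (assumption): one third of the channels per axis."""
    chunks = np.split(x, 3, axis=-1)
    return np.concatenate(
        [rope_1d(c, pos3[:, axis]) for axis, c in enumerate(chunks)], axis=-1)

class LatentKVCache:
    """Stores one content latent and one positional key per token, shared by all heads."""
    def __init__(self, rank, d_rope):
        self.c = np.zeros((0, rank))
        self.k_rope = np.zeros((0, d_rope))

    def append(self, c_new, k_rope_new):
        self.c = np.concatenate([self.c, c_new], axis=0)
        self.k_rope = np.concatenate([self.k_rope, k_rope_new], axis=0)

    def floats_per_token(self):
        return self.c.shape[1] + self.k_rope.shape[1]

def attend(q_content, q_rope, cache, W_uk, W_uv):
    """Queries attend to every cached token; keys/values are expanded from the latent."""
    k_content = np.einsum("tr,hrd->htd", cache.c, W_uk)   # per-head content keys
    v = np.einsum("tr,hrd->htd", cache.c, W_uv)           # per-head values
    scores = np.einsum("hqd,htd->hqt", q_content, k_content)
    scores = scores + np.einsum("hqd,td->hqt", q_rope, cache.k_rope)
    scores = scores / np.sqrt(q_content.shape[-1] + q_rope.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return np.einsum("hqt,htd->hqd", w, v)

# Illustrative dimensions (assumptions, not taken from the source).
rng = np.random.default_rng(0)
H, d_head, rank, d_rope, d_model = 4, 16, 8, 12, 32
n_past, n_q = 20, 5
W_down = rng.standard_normal((d_model, rank))
W_kr = rng.standard_normal((d_model, d_rope))
W_uk = rng.standard_normal((H, rank, d_head))
W_uv = rng.standard_normal((H, rank, d_head))

# Cache previously generated tokens with their (t, y, x) grid positions.
x_past = rng.standard_normal((n_past, d_model))
pos_past = rng.integers(0, 8, size=(n_past, 3)).astype(float)
cache = LatentKVCache(rank, d_rope)
cache.append(x_past @ W_down, rope_3d(x_past @ W_kr, pos_past))

# Queries for the tokens currently being generated.
q_content = rng.standard_normal((H, n_q, d_head))
pos_q = rng.integers(0, 8, size=(n_q, 3)).astype(float)
q_rope = rope_3d(rng.standard_normal((H, n_q, d_rope)), pos_q)
out = attend(q_content, q_rope, cache, W_uk, W_uv)
print(out.shape, cache.floats_per_token())
```

In this toy configuration the cache holds `rank + d_rope` floats per token, whereas a per-head cache would hold `2 * H * d_head`. Keeping the positional key separate from the content latent means the rotation is applied once per token and never has to pass through the low-rank up-projection.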

In practice

Achieve 92.7% KV memory reduction in video diffusion.
Improve throughput by 1.23x on B200 GPUs.
Enhance long-horizon video diffusion quality.
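
To see how a reduction of this size can arise, the per-token cache arithmetic can be sketched as below. The configuration is a hypothetical assumption, not the paper's reported setup: the head count, head width, latent rank, and RoPE key width were chosen only so that the arithmetic lands near the 92.7% figure.

```python
def dense_kv_floats(n_heads, d_head):
    """Per-token, per-layer cache size with separate keys and values for every head."""
    return 2 * n_heads * d_head

def latent_kv_floats(rank, d_rope):
    """Per-token, per-layer cache size with one shared latent plus one shared RoPE key."""
    return rank + d_rope

def kv_reduction(n_heads, d_head, rank, d_rope):
    """Fraction of per-token KV memory saved by the latent cache."""
    return 1.0 - latent_kv_floats(rank, d_rope) / dense_kv_floats(n_heads, d_head)

# Hypothetical configuration (assumption, not taken from the source).
cfg = dict(n_heads=12, d_head=128, rank=192, d_rope=32)

dense = dense_kv_floats(cfg["n_heads"], cfg["d_head"])   # 3072 floats
latent = latent_kv_floats(cfg["rank"], cfg["d_rope"])    # 224 floats
reduction = kv_reduction(**cfg)
print(f"{dense} -> {latent} floats per token, {reduction:.1%} saved")
```

Because the saving is per token and per layer, it scales linearly with both the number of cached tokens and the number of cached layers.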

Topics

VideoMLA
Video Diffusion
KV Cache
Multi-Head Latent Attention
Memory Optimization
3D-RoPE

Best for: Research Scientist, Computer Vision Engineer, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.