Veda: Scalable Video Diffusion via Distilled Sparse Attention

2026-05-28 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

Veda is a distilled sparse attention framework for scaling video diffusion transformers to high-resolution, long video generation. It targets two problems: the quadratic cost of self-attention and the quality degradation seen in existing sparse methods. Its central finding is that generation quality depends on how well the sparse mask aligns with the tile-wise geometry of full attention, not on the sparsity ratio alone. Veda formulates tile selection as an explicit reconstruction problem and combines statistics-aware tile scoring with head-aware tiling to enable aggressive sparsity. A hardware-efficient tile-skipping kernel turns this theoretical sparsity into wall-clock speedups. Experiments on large video diffusion models such as Waver and Wan2.1 show substantial acceleration without quality degradation. For 720P 10-second videos on Waver-T2V-12B, Veda achieves a 5.1x end-to-end speedup and a 10.5x self-attention speedup, reducing attention overhead from 92% to 50%.
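As a rough sanity check on the reported figures, the sketch below applies Amdahl's law under the assumption that only self-attention is accelerated and everything else is unchanged. The function name `amdahl` is hypothetical, and the model ignores any overhead added by tile scoring or mask construction, so it is an idealized estimate rather than a reproduction of the paper's measurements.

```python
def amdahl(attn_share: float, attn_speedup: float) -> tuple[float, float]:
    """Return (end_to_end_speedup, new_attention_share).

    Assumes only the attention portion of the runtime gets faster
    (an idealization; real pipelines add scoring and masking overhead).
    """
    other = 1.0 - attn_share                 # non-attention time, unchanged
    attn_after = attn_share / attn_speedup   # attention time after acceleration
    total_after = other + attn_after
    return 1.0 / total_after, attn_after / total_after


# Inputs taken from the summary: attention is 92% of runtime, sped up 10.5x.
speedup, share = amdahl(attn_share=0.92, attn_speedup=10.5)
print(f"predicted end-to-end speedup: {speedup:.2f}x")
print(f"predicted attention share:    {share:.1%}")
```

This idealized model predicts roughly a 6x end-to-end speedup and an attention share of about 52%, in the same range as the reported 5.1x and 50%. The gap may reflect rounding or costs the model leaves out; the summary does not say which.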

Key takeaway

For machine learning engineers building high-resolution video diffusion models, Veda addresses the quadratic cost of self-attention. The reported gains are 5.1x end-to-end and 10.5x for self-attention on 720P 10-second videos with Waver-T2V-12B, without loss of generation quality. Consider distilled sparse attention when scaling to longer or higher-resolution video outputs, where attention cost dominates.

Key insights

Generation quality in sparse attention depends on mask alignment with full attention's tile-wise geometry, not just sparsity ratio.
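To make this concrete, here is a toy sketch comparing two masks with the same sparsity ratio. The `tile_mass` values are invented for illustration, and "retained attention mass" is an assumed stand-in for alignment; the summary does not state which metric the paper uses.

```python
# Toy tile-level attention mass: row i is one query tile, columns are key tiles.
# Each row sums to 1. The numbers are made up for illustration only.
tile_mass = [
    [0.60, 0.05, 0.05, 0.30],
    [0.10, 0.50, 0.05, 0.35],
    [0.40, 0.05, 0.45, 0.10],
    [0.45, 0.05, 0.10, 0.40],
]
n = len(tile_mass)


def topk_mask(k):
    """Keep, for each query tile, the k key tiles that carry the most mass."""
    return [
        sorted(range(n), key=lambda j: tile_mass[i][j], reverse=True)[:k]
        for i in range(n)
    ]


def window_mask(k):
    """Keep a fixed window of k consecutive key tiles starting at the diagonal."""
    return [[(i + s) % n for s in range(k)] for i in range(n)]


def retained_mass(mask):
    """Average fraction of full-attention mass that survives the mask."""
    return sum(tile_mass[i][j] for i in range(n) for j in mask[i]) / n


aligned, fixed = topk_mask(2), window_mask(2)
# Both masks keep exactly 2 of 4 key tiles per query tile (same sparsity ratio).
print(f"aligned mask retains {retained_mass(aligned):.4f} of the mass")
print(f"fixed window retains {retained_mass(fixed):.4f} of the mass")
```

Both masks skip half of the tiles, yet the mask that follows the attention geometry retains about 0.86 of the mass while the fixed window retains 0.65.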

Principles

Sparse attention quality hinges on how well the mask aligns with full attention's tile-wise geometry.
Aggressive sparsity is achievable with proper alignment.
Speedup gains increase with sequence length.
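The last principle follows from a simple cost model, sketched here under stated assumptions: attention cost grows with the square of the token count, all other work grows linearly, and sparse attention computes a fixed fraction of the tiles. The constants `a`, `b`, and `keep_ratio` are hypothetical and not taken from the paper.

```python
def end_to_end_speedup(n_tokens, keep_ratio, a=1.0, b=10_000.0):
    """Dense-over-sparse runtime ratio under a quadratic/linear cost model.

    a * n^2 models attention, b * n models everything else (both assumed).
    keep_ratio is the fraction of attention tiles still computed.
    """
    dense = a * n_tokens**2 + b * n_tokens
    sparse = keep_ratio * a * n_tokens**2 + b * n_tokens
    return dense / sparse


lengths = (4_000, 16_000, 64_000, 256_000)
speedups = [end_to_end_speedup(n, keep_ratio=0.1) for n in lengths]
for n, s in zip(lengths, speedups):
    print(f"{n:>7} tokens: {s:.2f}x")
```

In this model the speedup rises with sequence length and approaches `1 / keep_ratio` as attention comes to dominate the runtime.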

Method

Veda formulates tile selection as an explicit reconstruction problem from full attention, integrating statistics-aware tile scoring with head-aware tiling to reduce estimation error and structural mismatch.
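The general shape of tile selection followed by tile skipping can be sketched as below. This is a single-head, pure-Python reference under loud assumptions: tiles are scored by the dot product of mean-pooled queries and keys (a common proxy, not Veda's statistics-aware scoring), the tile size is fixed (head-aware tiling is not modeled), and all function names are hypothetical. A real implementation would be a fused GPU kernel.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mean_pool(rows, tile):
    """Average each group of `tile` consecutive vectors into one vector."""
    pooled = []
    for start in range(0, len(rows), tile):
        block = rows[start:start + tile]
        pooled.append([sum(col) / len(block) for col in zip(*block)])
    return pooled


def select_tiles(q, k, tile, keep):
    """For each query tile, keep the `keep` key tiles with the highest pooled score."""
    q_pool, k_pool = mean_pool(q, tile), mean_pool(k, tile)
    selected = []
    for qp in q_pool:
        scores = [dot(qp, kp) for kp in k_pool]
        ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
        selected.append(sorted(ranked[:keep]))
    return selected


def tile_sparse_attention(q, k, v, tile, keep):
    """Softmax attention that visits only the selected key tiles; the rest are skipped."""
    scale = 1.0 / math.sqrt(len(q[0]))
    selected = select_tiles(q, k, tile, keep)
    out = []
    for i, qi in enumerate(q):
        # Token indices covered by the key tiles chosen for this query's tile.
        cols = [j for t in selected[i // tile]
                for j in range(t * tile, min((t + 1) * tile, len(k)))]
        logits = [scale * dot(qi, k[j]) for j in cols]
        m = max(logits)                      # subtract max for numerical stability
        weights = [math.exp(x - m) for x in logits]
        z = sum(weights)
        out.append([sum(w * v[j][c] for w, j in zip(weights, cols)) / z
                    for c in range(len(v[0]))])
    return out


random.seed(0)
n, d, tile = 16, 4, 4
q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
k = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
v = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

dense = tile_sparse_attention(q, k, v, tile, keep=n // tile)  # all tiles kept: full attention
sparse = tile_sparse_attention(q, k, v, tile, keep=2)         # half of the key tiles skipped
err = max(abs(a - b) for rd, rs in zip(dense, sparse) for a, b in zip(rd, rs))
print(f"max abs deviation from full attention: {err:.3f}")
```

The deviation printed at the end is the reconstruction gap that a better tile scorer would aim to shrink at the same `keep` budget.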

In practice

Generate 720P 10-second videos faster.
Accelerate Waver and Wan2.1 models.
Reduce attention overhead from 92% to 50% (720P 10-second videos on Waver-T2V-12B).

Topics

Video Diffusion
Sparse Attention
Diffusion Transformers
Model Acceleration
Waver
Wan2.1

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.