TeDiO: Temporal Diagonal Optimization for Training-Free Coherent Video Diffusion
Summary
TeDiO (Temporal Diagonal Optimization) is a training-free, inference-time method for improving temporal coherence in text-to-video diffusion transformers. Recent models such as Wan2.1 and CogVideoX can generate visually compelling individual frames yet still produce videos with flickering, drifting, or unstable motion. TeDiO builds on the observation that incoherent videos exhibit irregular, fragmented temporal diagonals in their intermediate self-attention maps, while stable motion corresponds to smooth, band-diagonal patterns. The method regularizes these internal attention patterns in three steps: it estimates diagonal smoothness, identifies unstable regions, and performs lightweight latent updates. This promotes coherent frame-to-frame dynamics without modifying model weights or requiring external motion supervision, yielding smoother motion while preserving per-frame visual quality.
Key takeaway
For research scientists developing or deploying text-to-video diffusion models, TeDiO offers a plug-and-play way to improve temporal coherence and reduce artifacts such as flickering. Because it is training-free, you can integrate it at inference time to obtain smoother motion in generated videos without model retraining or additional datasets. Note that TeDiO acts on latents during sampling rather than on finished videos, so treat it as an inference-time add-on for video generation systems, not a post-processing step.
Key insights
Temporal coherence in video diffusion models correlates with smooth, band-diagonal self-attention patterns.
Principles
- Incoherent video manifests as fragmented temporal diagonals.
- Regularizing internal attention patterns improves video stability.
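To make the band-diagonal idea concrete, here is a minimal toy sketch in plain Python. It is not taken from the paper: the frame-by-frame attention maps are synthetic, and `band_attention`, `band_mass`, and the choice of "fraction of attention within one frame of the diagonal" as a score are illustrative assumptions, not TeDiO's actual measure.

```python
import math

def band_attention(num_frames, width=1.0):
    """Synthetic row-normalized map whose weight decays with frame distance |i - j|."""
    rows = []
    for i in range(num_frames):
        weights = [math.exp(-((i - j) ** 2) / (2 * width ** 2))
                   for j in range(num_frames)]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows

def band_mass(attn, bandwidth=1):
    """Mean fraction of each row's attention lying within `bandwidth` frames of the diagonal."""
    n = len(attn)
    per_row = [sum(row[j] for j in range(n) if abs(i - j) <= bandwidth)
               for i, row in enumerate(attn)]
    return sum(per_row) / n

# A "coherent" toy map: every frame attends mostly to its temporal neighbors.
coherent = band_attention(8)

# A "fragmented" toy map: frames 1 and 6 attend to distant frames instead,
# which breaks the diagonal band at those rows (rows are simply mirrored).
fragmented = [list(row) for row in coherent]
for k in (1, 6):
    fragmented[k] = fragmented[k][::-1]

coherent_score = band_mass(coherent)
fragmented_score = band_mass(fragmented)
print("band mass, coherent:  ", round(coherent_score, 3))
print("band mass, fragmented:", round(fragmented_score, 3))
```

The coherent map keeps most of its attention mass near the diagonal, while the fragmented one loses it at the disrupted rows. In a real diffusion transformer the attention is over spatio-temporal tokens rather than whole frames, so any such score would need to be adapted to the model's token layout.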
Method
TeDiO estimates diagonal smoothness in self-attention maps, identifies unstable regions, and applies lightweight latent updates to promote coherent frame-to-frame dynamics, all without training or external supervision.
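The three steps named above (estimate diagonal smoothness, identify unstable regions, apply latent updates) can be sketched on a toy problem. Everything in this sketch is an assumption made for illustration: the distance-based `frame_attention` stands in for a real model's query/key attention, the median-based threshold in `unstable_frames` is a hypothetical detection rule, and the finite-difference gradient step with backtracking replaces whatever update rule the paper actually uses.

```python
import math

def frame_attention(latents, tau=1.0):
    """Toy frame-to-frame attention: softmax over negative squared latent distance."""
    attn = []
    for zi in latents:
        logits = [-sum((a - b) ** 2 for a, b in zip(zi, zj)) / tau for zj in latents]
        top = max(logits)
        weights = [math.exp(l - top) for l in logits]
        total = sum(weights)
        attn.append([w / total for w in weights])
    return attn

def first_diagonal(attn):
    """Attention from each frame to its successor (the first temporal off-diagonal)."""
    return [attn[i][i + 1] for i in range(len(attn) - 1)]

def roughness(latents):
    """Step 1: squared variation along the diagonal (lower means smoother)."""
    d = first_diagonal(frame_attention(latents))
    return sum((b - a) ** 2 for a, b in zip(d, d[1:]))

def unstable_frames(latents, ratio=0.5):
    """Step 2: flag frames whose diagonal links are all below ratio * median."""
    d = first_diagonal(frame_attention(latents))
    threshold = ratio * sorted(d)[len(d) // 2]
    weak = [v < threshold for v in d]
    # Frame k touches d[k-1] and d[k] where those entries exist.
    return [k for k in range(len(latents)) if all(weak[max(k - 1, 0):k + 1])]

def update_latents(latents, frames, lr=1.0, steps=20, eps=1e-4):
    """Step 3: gradient steps on flagged frames only; model weights are never touched."""
    z = [list(row) for row in latents]
    for _ in range(steps):
        base = roughness(z)
        grad = {}
        for k in frames:
            grad[k] = []
            for c in range(len(z[k])):
                old = z[k][c]
                z[k][c] = old + eps
                up = roughness(z)
                z[k][c] = old - eps
                down = roughness(z)
                z[k][c] = old
                grad[k].append((up - down) / (2 * eps))  # central difference
        step = lr
        while step > 1e-6:  # backtracking: accept only steps that reduce roughness
            trial = [list(row) for row in z]
            for k in frames:
                trial[k] = [v - step * g for v, g in zip(z[k], grad[k])]
            if roughness(trial) < base:
                z = trial
                break
            step /= 2
    return z

# Eight frames moving smoothly along a line, with frame 4 knocked off course.
latents = [[0.3 * i, 0.0] for i in range(8)]
latents[4] = [1.2, 2.0]

flagged = unstable_frames(latents)
updated = update_latents(latents, flagged)
print("flagged frames:", flagged)
print("roughness before:", roughness(latents))
print("roughness after: ", roughness(updated))
```

Restricting the update to flagged frames mirrors the "lightweight" aspect described above: frames whose diagonal entries already look regular are left untouched, which is one plausible way to preserve per-frame quality. A real implementation would more likely read attention maps from the transformer and backpropagate through them instead of using finite differences.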
In practice
- Apply TeDiO to existing video diffusion models.
- Improve motion stability in generated videos.
- Preserve per-frame visual quality.
Topics
- TeDiO
- Video Diffusion
- Temporal Coherence
- Self-Attention Maps
- Training-Free Optimization
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.