TetherCache: Stabilizing Autoregressive Long-Form Video Generation with Gated Recall and Trusted Alignment
Summary
TetherCache is a training-free cache management strategy for stabilizing autoregressive long-form video generation. It targets two problems that cause visual artifacts and temporal drift in minute-level videos: limited KV-cache budgets and shifts in the context distribution. Developed by researchers from Tsinghua University and ETH Zürich, it organizes the KV cache into Sink, Memory, and Recent regions and relies on two mechanisms. GRAB (Gated Recall with Attention-Diversity Balancing) selects informative and diverse long-range memory frames. TAME (Trusted Alignment via Memory Editing) statistically aligns newly recalled memory tokens to a stable context distribution derived from trusted sink frames. Implemented on the Self-Forcing model, TetherCache improved quality consistently on VBench-Long in the 30s, 60s, and 240s generation settings. For 240s videos, it raised the overall and semantic scores and reduced quality drift from 7.84 to 1.33, with less than 6% latency overhead.
Key takeaway
Machine learning engineers developing or deploying autoregressive video diffusion models for minute-level content should consider TetherCache. Because it is training-free, it mitigates quality degradation and temporal drift in long videos without the cost of retraining the model. Its two mechanisms, GRAB and TAME, selectively recall diverse historical context and statistically align recalled memory tokens, which improves visual quality and semantic consistency over long horizons.
Key insights
Stabilizing long-form autoregressive video generation requires selective memory recall and statistical alignment of cached features.
Principles
- Cache management needs relevance and diversity.
- Early frames provide trusted distributional priors.
- Statistical alignment mitigates context shift.
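As an illustration of the first principle, the sketch below greedily picks memory frames by trading a relevance score against temporal spread. It is not the GRAB algorithm itself: the function name `select_memory_frames`, the linear scoring rule, and the `diversity_weight` parameter are assumptions made for this example, and the relevance values stand in for whatever attention-derived score a real model would provide.

```python
from typing import List, Sequence


def select_memory_frames(
    relevance: Sequence[float],
    budget: int,
    diversity_weight: float = 0.5,
) -> List[int]:
    """Greedily pick up to `budget` frame indices.

    Hypothetical scoring rule (an assumption, not taken from the paper):
        score = (1 - w) * relevance + w * normalized gap to nearest pick
    """
    n = len(relevance)
    if n == 0 or budget <= 0:
        return []

    span = max(n - 1, 1)  # normalizes temporal gaps into [0, 1]
    selected: List[int] = []
    candidates = set(range(n))

    while candidates and len(selected) < budget:
        best_idx, best_score = -1, float("-inf")
        for i in sorted(candidates):  # sorted for deterministic tie-breaking
            if selected:
                # Distance to the closest frame that is already selected.
                gap = min(abs(i - j) for j in selected) / span
            else:
                gap = 0.0  # the first pick is driven by relevance alone
            score = (1.0 - diversity_weight) * relevance[i] + diversity_weight * gap
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)

    return sorted(selected)


scores = [0.9, 0.85, 0.8, 0.1, 0.1, 0.1, 0.1, 0.7]
print(select_memory_frames(scores, budget=2, diversity_weight=0.5))  # [0, 7]
print(select_memory_frames(scores, budget=2, diversity_weight=0.0))  # [0, 1]
```

With the diversity term switched off, the two picks cluster at the start of the sequence. With it enabled, the second pick moves to a distant frame that is still reasonably relevant, which is the balance the principle describes.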
Method
TetherCache divides the KV cache into Sink, Memory, and Recent regions. GRAB selects memory frames by balancing attention relevance and temporal diversity. TAME aligns recalled memory token statistics to trusted sink frames.
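A minimal sketch of the three-region layout follows. The summary does not say how the regions are sized, so the fixed `sink_size` and `recent_size` parameters, the `CacheRegions` type, and the `partition_cache` helper are all hypothetical. Frames are represented by their indices, and the Memory region here is the pool of candidates from which a recall step would later choose.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CacheRegions:
    sink: List[int]    # earliest frames, kept as the trusted reference
    memory: List[int]  # middle frames, candidates for long-range recall
    recent: List[int]  # latest frames, kept for short-range continuity


def partition_cache(num_frames: int, sink_size: int, recent_size: int) -> CacheRegions:
    """Split frame indices 0..num_frames-1 into Sink, Memory, and Recent.

    The fixed-size split is an assumption made for illustration.
    """
    if min(num_frames, sink_size, recent_size) < 0:
        raise ValueError("sizes must be non-negative")

    frames = list(range(num_frames))
    sink = frames[:sink_size]
    # Recent never overlaps Sink, even when the video is still short.
    recent_start = max(sink_size, num_frames - recent_size)
    recent = frames[recent_start:]
    memory = frames[len(sink):recent_start]
    return CacheRegions(sink=sink, memory=memory, recent=recent)


regions = partition_cache(num_frames=10, sink_size=2, recent_size=3)
print(regions.sink)    # [0, 1]
print(regions.memory)  # [2, 3, 4, 5, 6]
print(regions.recent)  # [7, 8, 9]
```

Early in generation the Memory region is empty and only fills once the history grows beyond the Sink and Recent windows.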
In practice
- Organize KV cache into Sink, Memory, Recent.
- Use attention and temporal diversity for recall.
- Align recalled token statistics to trusted context.
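The alignment step in the last item can be sketched as per-channel moment matching: shift and scale the recalled tokens so that their mean and standard deviation match those of the sink tokens. This is one simple way to align statistics and is an assumption of this example. The summary does not specify which statistics TAME matches or at what granularity, and `align_to_sink` is a hypothetical name. Plain Python lists stand in for tensors so the sketch runs without dependencies.

```python
from statistics import fmean, pstdev
from typing import List, Tuple

Tokens = List[List[float]]  # shape: [num_tokens][num_channels]


def channel_stats(tokens: Tokens) -> Tuple[List[float], List[float]]:
    """Return the per-channel mean and population standard deviation."""
    channels = list(zip(*tokens))
    return [fmean(c) for c in channels], [pstdev(c) for c in channels]


def align_to_sink(memory: Tokens, sink: Tokens, eps: float = 1e-6) -> Tokens:
    """Match per-channel mean/std of `memory` to those of `sink`.

    Moment matching is an illustrative choice, not the confirmed TAME rule.
    """
    if not memory or not sink:
        raise ValueError("memory and sink must each contain at least one token")

    mem_mean, mem_std = channel_stats(memory)
    sink_mean, sink_std = channel_stats(sink)

    aligned: Tokens = []
    for token in memory:
        aligned.append([
            # Standardize with memory statistics, then rescale to sink statistics.
            (x - mm) / (ms + eps) * ss + sm
            for x, mm, ms, sm, ss in zip(token, mem_mean, mem_std, sink_mean, sink_std)
        ])
    return aligned


sink_tokens = [[0.0, 1.0], [2.0, 3.0]]
memory_tokens = [[10.0, 20.0], [30.0, 60.0]]  # drifted: larger mean and spread
aligned = align_to_sink(memory_tokens, sink_tokens)
print(channel_stats(aligned))  # close to the sink statistics: means [1, 2], stds [1, 1]
```

The `eps` term only guards against division by zero when a channel is constant, so the aligned statistics match the sink statistics up to a small numerical error.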
Topics
- Autoregressive Video Generation
- KV-Cache Management
- Video Diffusion Models
- Temporal Drift Mitigation
- Gated Recall
- Memory Editing
Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.