TetherCache: Stabilizing Autoregressive Long-Form Video Generation with Gated Recall and Trusted Alignment

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

TetherCache is a training-free cache management strategy for stabilizing autoregressive long-form video generation. It targets two problems that cause visual artifacts and temporal drift in minute-level videos: a limited KV-cache budget and shifts in the context distribution. Developed by researchers from Tsinghua University and ETH Zürich, it organizes the KV cache into Sink, Memory, and Recent regions and relies on two mechanisms. GRAB (Gated Recall with Attention-Diversity Balancing) selects informative and diverse long-range memory frames. TAME (Trusted Alignment via Memory Editing) statistically aligns newly recalled memory tokens to a stable context distribution derived from trusted sink frames. Implemented on the Self-Forcing model, TetherCache improved quality on VBench-Long consistently across the 30s, 60s, and 240s generation settings. For 240s videos, it raised the overall and semantic scores and reduced quality drift from 7.84 to 1.33, at less than 6% latency overhead.

Key takeaway

Machine learning engineers who develop or deploy autoregressive video diffusion models for minute-level content should consider TetherCache. Because it is training-free, it can mitigate quality degradation and temporal drift without the cost of retraining the model. It does so by selectively recalling diverse historical context (GRAB) and statistically aligning recalled memory tokens to trusted sink frames (TAME), which improves visual quality and semantic consistency over long horizons.

Key insights

Stabilizing long-form autoregressive video generation requires selective memory recall and statistical alignment of cached features.
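
To make "statistical alignment of cached features" concrete, here is a minimal sketch in plain Python. It is not the paper's TAME procedure: the summary says only that recalled memory tokens are aligned to a distribution derived from trusted sink frames. Per-channel mean and standard-deviation matching is an assumption made for illustration. The function names `channel_stats` and `align_to_reference` are hypothetical.

```python
from statistics import fmean, pstdev

def channel_stats(tokens):
    """Per-channel mean and population std for a list of token vectors."""
    columns = list(zip(*tokens))  # transpose: one tuple of values per channel
    return [fmean(c) for c in columns], [pstdev(c) for c in columns]

def align_to_reference(memory_tokens, sink_tokens, eps=1e-6):
    """Shift and rescale memory tokens so their per-channel mean and std
    match those of the sink tokens.

    ASSUMPTION: first- and second-moment matching stands in for the
    alignment step; the source does not specify the exact statistic.
    """
    mem_mean, mem_std = channel_stats(memory_tokens)
    ref_mean, ref_std = channel_stats(sink_tokens)
    aligned = []
    for tok in memory_tokens:
        aligned.append([
            (x - mm) / (ms + eps) * rs + rm  # normalize, then re-scale to sink stats
            for x, mm, ms, rs, rm in zip(tok, mem_mean, mem_std, ref_std, ref_mean)
        ])
    return aligned  # new list; the input tokens are left unmodified

# Toy example: two tokens with two channels each.
sink = [[0.0, 10.0], [2.0, 14.0]]
memory = [[5.0, 0.0], [9.0, 1.0]]
aligned = align_to_reference(memory, sink)
```

After the call, the per-channel statistics of `aligned` match those of `sink` up to the small `eps` term. The relative ordering of values within each channel is preserved.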

Principles

Method

TetherCache divides the KV cache into Sink, Memory, and Recent regions. GRAB selects memory frames by balancing attention relevance against temporal diversity. TAME aligns the statistics of recalled memory tokens to those of trusted sink frames.
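
The region layout and a relevance-versus-diversity selection can be sketched as follows. This is a hedged illustration, not the paper's GRAB algorithm. The greedy scoring rule, the `diversity_weight` parameter, and the function names `split_regions` and `select_memory` are all assumptions. The gating step is not modeled. The relevance scores would come from attention in a real system; here they are made-up numbers.

```python
def split_regions(num_frames, n_sink, n_recent):
    """Return (sink, candidates, recent) frame-index lists.

    Sink = earliest frames, Recent = latest frames, and everything in
    between is a candidate for the Memory region.
    """
    sink = list(range(min(n_sink, num_frames)))
    recent_start = max(len(sink), num_frames - n_recent)
    recent = list(range(recent_start, num_frames))
    candidates = list(range(len(sink), recent_start))
    return sink, candidates, recent

def select_memory(candidates, relevance, budget, diversity_weight=0.5):
    """Greedily pick up to `budget` frames.

    ASSUMPTION: score = relevance + diversity_weight * (normalized temporal
    distance to the nearest already-selected frame). The source states only
    that relevance and temporal diversity are balanced.
    """
    if not candidates or budget <= 0:
        return []
    span = max(candidates) - min(candidates) or 1  # avoid division by zero
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < budget:
        def score(i):
            if not selected:
                return relevance[i]  # first pick: relevance only
            gap = min(abs(i - j) for j in selected) / span
            return relevance[i] + diversity_weight * gap
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return sorted(selected)

# Toy history of 10 frames: 2 sink frames, 3 recent frames, 5 candidates.
sink, candidates, recent = split_regions(10, n_sink=2, n_recent=3)
# Hypothetical per-frame relevance scores, indexed by frame number.
relevance = [0.0, 0.0, 0.9, 0.85, 0.8, 0.1, 0.3, 0.0, 0.0, 0.0]
memory = select_memory(candidates, relevance, budget=2)
```

With these toy scores, the default weight selects frames 2 and 4. Setting `diversity_weight=0.0` selects the adjacent frames 2 and 3 instead, which shows how the diversity term spreads the recalled frames over time.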

Topics

Best for: Computer Vision Engineer, Research Scientist, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.