TivTok: Broadcasting Time-Invariant Tokens for Scalable Video Tokenization
Summary
TivTok (Time-Invariant Tokenizer) is a video tokenization method designed to make video generation more scalable by letting persistent content be reused across time. Existing tokenizers repeatedly re-encode static backgrounds and consistent object appearances in every frame. TivTok instead factorizes a video clip into Time-Invariant (TIV) tokens, which encode information shared across frames, and Time-Variant (TV) tokens, which capture frame-specific residuals. The factorization is achieved through Scope-Induced Factorization (SIF), which assigns a distinct attention scope to each token type. Invariant Broadcasting (IB) then reuses the TIV tokens for parallel reconstruction and long-video tokenization. Experiments show that TivTok achieves an rFVD of 12.65 on the 16x256x256 benchmark, improves compression efficiency by 2.91x for 128-frame videos, and uses only 1.1% of the tokens required by downsample-based tokenizers.
Key takeaway
For machine learning engineers developing video generation models, TivTok offers a way to manage computational cost and support longer video sequences. It uses only 1.1% of the tokens required by downsample-based tokenizers and improves compression efficiency by 2.91x for 128-frame videos, which leaves more headroom to scale. Consider integrating TivTok's reuse-aware tokenization to build more efficient and capable video generation systems.
Key insights
TivTok reuses persistent video content information across frames to reduce token count and computational cost for scalable video generation.
Principles
- Factorize video content into time-invariant and time-variant components.
- Assign distinct attention scopes for efficient token processing.
- Broadcast invariant tokens for parallel reconstruction.
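The broadcasting principle can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `broadcast_invariant` and `stored_token_count`, and the use of plain strings as stand-ins for tokens, are assumptions made for illustration only. The idea shown is that one shared set of TIV tokens is stored once and paired with each frame's TV tokens, so every frame's decoder input can be assembled independently.

```python
from typing import List, Sequence


def broadcast_invariant(
    tiv_tokens: Sequence[str],
    tv_tokens_by_frame: Sequence[Sequence[str]],
) -> List[List[str]]:
    """Pair the shared TIV tokens with each frame's TV tokens.

    Hypothetical helper: each returned entry is the token set one frame
    would be reconstructed from. Entries do not depend on one another,
    so frames could be decoded in parallel.
    """
    return [list(tiv_tokens) + list(tv) for tv in tv_tokens_by_frame]


def stored_token_count(
    tiv_tokens: Sequence[str],
    tv_tokens_by_frame: Sequence[Sequence[str]],
) -> int:
    """Tokens actually stored: TIV tokens once, plus every frame's TV tokens."""
    return len(tiv_tokens) + sum(len(tv) for tv in tv_tokens_by_frame)


# Toy clip with made-up token labels: 2 shared tokens, 3 frames, 1 TV token each.
tiv = ["bg0", "bg1"]
tv_by_frame = [["f0"], ["f1"], ["f2"]]

decoder_inputs = broadcast_invariant(tiv, tv_by_frame)
stored = stored_token_count(tiv, tv_by_frame)
# Token count if the shared content were re-encoded for every frame instead.
repeated = sum(len(tokens) for tokens in decoder_inputs)
```

In this toy setup the stored token count grows by one TV token per extra frame, while the shared tokens are paid for only once. The sizes are arbitrary and do not reflect the token budgets used in the paper.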
Method
TivTok uses Scope-Induced Factorization (SIF) to separate Time-Invariant (TIV) tokens from Time-Variant (TV) tokens. TIV tokens attend to the full clip, while each TV token attends only to its own frame and the TIV tokens. Invariant Broadcasting (IB) then reuses the TIV tokens across frames during reconstruction.
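The attention scopes described above can be expressed as a boolean mask. The following is a rough sketch under stated assumptions, not the paper's code: the `Token` class, the sequence layout (TIV tokens first, then each frame's patch and TV tokens), and the choice to give patch tokens the same frame-local scope as TV tokens are all hypothetical. Only the two rules from the summary are taken from the source: TIV tokens see the whole clip, and TV tokens see their own frame plus the TIV tokens.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Token:
    kind: str              # "tiv", "tv", or "patch"
    frame: Optional[int]   # None for TIV tokens, frame index otherwise


def build_sequence(num_frames: int, patches_per_frame: int,
                   tiv_count: int, tv_per_frame: int) -> List[Token]:
    """Assumed layout: all TIV tokens, then per frame its patches and TV tokens."""
    seq = [Token("tiv", None) for _ in range(tiv_count)]
    for t in range(num_frames):
        seq += [Token("patch", t) for _ in range(patches_per_frame)]
        seq += [Token("tv", t) for _ in range(tv_per_frame)]
    return seq


def can_attend(query: Token, key: Token) -> bool:
    """Return True if `query` is allowed to attend to `key`."""
    if query.kind == "tiv":
        return True            # clip-wide scope
    if key.kind == "tiv":
        return True            # frame-local tokens may read the TIV tokens
    # Frame-local scope. Applying it to patch queries is an assumption.
    return key.frame == query.frame


def scope_mask(seq: List[Token]) -> List[List[bool]]:
    """mask[i][j] is True when token i may attend to token j."""
    return [[can_attend(q, k) for k in seq] for q in seq]


# Toy clip: 2 frames, 2 patches per frame, 1 TIV token, 1 TV token per frame.
seq = build_sequence(num_frames=2, patches_per_frame=2, tiv_count=1, tv_per_frame=1)
mask = scope_mask(seq)
```

Here `mask[0]` (the TIV token) is all `True`, while the row for frame 0's TV token is `True` only for the TIV token and frame 0's own tokens. A mask of this shape could be passed to an attention layer that accepts boolean masks, with `True` meaning "may attend".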
In practice
- Scalable video generation.
- Long-video tokenization.
- Reduced computational cost.
Topics
- Video Tokenization
- TivTok
- Time-Invariant Tokens
- Video Generation
- Compression Efficiency
- Computer Vision
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.