TivTok: Broadcasting Time-Invariant Tokens for Scalable Video Tokenization

2026-06-16 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

TivTok (Time-Invariant Tokenizer) is a video tokenization method that improves the scalability of video generation by making persistent content reusable across time. Existing tokenizers represent static backgrounds and consistent object appearances again and again; TivTok instead factorizes a video clip into Time-Invariant (TIV) tokens, which encode information shared across the clip, and Time-Variant (TV) tokens, which capture frame-specific residuals. The factorization comes from Scope-Induced Factorization (SIF), which assigns the two token types distinct attention scopes. Invariant Broadcasting (IB) then reuses the TIV tokens for parallel reconstruction and long-video tokenization. In the reported experiments, TivTok reaches an rFVD of 12.65 on the 16x256x256 benchmark, improves compression efficiency by 2.91x on 128-frame videos, and uses only 1.1% of the tokens required by downsample-based tokenizers.

Key takeaway

For machine learning engineers developing video generation models, TivTok offers a way to manage computational cost and reach longer video sequences. It is reported to use only 1.1% of the tokens of downsample-based tokenizers and to improve compression efficiency by 2.91x on 128-frame videos, so reuse-aware tokenization is worth considering when building more efficient and capable video generation systems.

Key insights

TivTok reuses information about persistent video content across frames, which reduces token count and computational cost and makes video generation more scalable.

Principles

Factorize video content into time-invariant and time-variant components.
Assign distinct attention scopes for efficient token processing.
Broadcast invariant tokens for parallel reconstruction.
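
The broadcasting principle can be illustrated with a minimal NumPy sketch. It is not TivTok's implementation: the function name `broadcast_tiv`, the array shapes, and the choice to concatenate tokens per frame are all assumptions made for illustration. The idea shown is only that one shared set of TIV tokens is paired with each frame's TV tokens, so every frame has a self-contained input and could be decoded in parallel.

```python
import numpy as np

def broadcast_tiv(tiv_tokens, tv_tokens):
    """Pair one shared set of TIV tokens with every frame's TV tokens.

    Hypothetical shapes (assumptions, not from the paper):
      tiv_tokens: (n_tiv, d)    shared across the clip
      tv_tokens:  (T, n_tv, d)  one group per frame
    Returns an array of shape (T, n_tiv + n_tv, d).
    """
    num_frames = tv_tokens.shape[0]
    # Broadcast the shared tokens along a new frame axis without copying.
    tiv_per_frame = np.broadcast_to(tiv_tokens, (num_frames,) + tiv_tokens.shape)
    # Each frame now carries its own decoder input: shared + frame-specific.
    return np.concatenate([tiv_per_frame, tv_tokens], axis=1)

# Example: 2 shared tokens, 4 frames with 1 frame-specific token each.
tiv = np.arange(6.0).reshape(2, 3)
tv = np.zeros((4, 1, 3))
decoder_inputs = broadcast_tiv(tiv, tv)
print(decoder_inputs.shape)
```

Because the frame axis behaves like a batch axis here, a decoder applied to `decoder_inputs` would process all frames independently of one another.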

Method

TivTok uses Scope-Induced Factorization (SIF) to separate Time-Invariant (TIV) tokens from Time-Variant (TV) tokens. TIV tokens attend to the full clip, while each TV token attends only to its own frame and to the TIV tokens. Invariant Broadcasting (IB) then reuses the TIV tokens across frames during reconstruction.
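
The two attention scopes described above can be sketched as a boolean attention mask. This is a hedged illustration rather than the paper's code: the function `sif_attention_mask`, the token layout (patch keys first, then TIV keys), and the decision to let TIV queries also see TIV keys are assumptions.

```python
import numpy as np

def sif_attention_mask(num_frames, patches_per_frame, n_tiv, n_tv_per_frame):
    """Build a (queries x keys) mask; True means attention is allowed.

    Assumed layout (hypothetical):
      keys:    [frame 0 patches, ..., frame T-1 patches, TIV tokens]
      queries: [TIV tokens, frame 0 TV tokens, ..., frame T-1 TV tokens]
    """
    n_patches = num_frames * patches_per_frame
    n_queries = n_tiv + num_frames * n_tv_per_frame
    n_keys = n_patches + n_tiv
    mask = np.zeros((n_queries, n_keys), dtype=bool)

    # TIV scope: the full clip (every patch key, plus the TIV keys).
    mask[:n_tiv, :] = True

    # TV scope: only the token's own frame, plus the TIV keys.
    for t in range(num_frames):
        q0 = n_tiv + t * n_tv_per_frame
        q1 = q0 + n_tv_per_frame
        k0 = t * patches_per_frame
        k1 = k0 + patches_per_frame
        mask[q0:q1, k0:k1] = True
        mask[q0:q1, n_patches:] = True
    return mask

mask = sif_attention_mask(num_frames=3, patches_per_frame=4,
                          n_tiv=2, n_tv_per_frame=1)
print(mask.astype(int))
```

Under this layout the restricted scope is what pushes frame-specific detail into TV tokens: a TV query for one frame has no path to the patches of any other frame, so anything shared across frames can only reach it through the TIV tokens.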

In practice

Scalable video generation.
Long-video tokenization.
Reduced computational cost.
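
To see why reuse helps with long videos, the token count can be written as a simple function of clip length. The sketch below is illustrative only: the helper names and every number in the example are hypothetical and are not taken from the paper, and it assumes a single set of TIV tokens is shared by all frames.

```python
def factorized_token_count(num_frames, n_tiv, n_tv_per_frame):
    """Shared TIV tokens are paid for once; TV tokens are paid per frame."""
    return n_tiv + num_frames * n_tv_per_frame

def per_frame_token_count(num_frames, tokens_per_frame):
    """Baseline that spends the same number of tokens on every frame."""
    return num_frames * tokens_per_frame

# Hypothetical settings, chosen only to show the trend with clip length.
for frames in (16, 64, 128):
    shared = factorized_token_count(frames, n_tiv=32, n_tv_per_frame=4)
    dense = per_frame_token_count(frames, tokens_per_frame=64)
    print(frames, shared, dense)
```

In this toy model the fixed cost of the shared tokens is amortized over more frames as the clip grows, so the per-frame cost approaches the TV token count alone.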

Topics

Video Tokenization
TivTok
Time-Invariant Tokens
Video Generation
Compression Efficiency
Computer Vision

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.