Forget, Anticipate and Adapt: Test Time Training for Long Videos

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Computer Vision) · Depth: Expert, quick

Summary

The Frame Forgetting Network (FFN) is an approach to Test Time Training (TTT) for long videos that addresses the computational intractability of existing methods. TTT lets a model adapt during inference through self-supervised tasks, so it needs no test-time labels. Current TTT techniques struggle with hours-long videos for two reasons: compute grows linearly with the sliding window size, and temporally similar frames trigger redundant updates. FFN addresses both by operating on only three frames within the sliding window (the exiting, current, and next frames) while still retaining temporal context. It also defines a "surprise metric" that quantifies how much new information an incoming frame carries, which drives an adaptive windowing algorithm that changes the effective window size. To support this research, the EpicTours dataset was curated, featuring city tour videos up to 3 hours long, compared with the 5-minute videos of previous datasets. FFN is reported to be empirically effective on dense segmentation, video classification, and depth estimation for multi-hour videos.
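To make the linear-cost point concrete, here is a rough back-of-the-envelope sketch. The frame rate, the window size of 32, and the helper `frames_processed` are illustrative assumptions, not figures or names from the paper; the only quantities taken from the summary are the 3-hour video length and the three frames per step.

```python
def frames_processed(num_frames: int, frames_per_step: int) -> int:
    """Total frame evaluations if every time step touches `frames_per_step` frames."""
    return num_frames * frames_per_step


# Assumed values for illustration only (not taken from the paper).
FPS = 30      # assumed frame rate
HOURS = 3     # video length mentioned in the summary
WINDOW = 32   # hypothetical window size for a conventional sliding-window TTT baseline

num_frames = HOURS * 3600 * FPS

# A baseline that revisits the whole window at each step scales with WINDOW.
full_window_cost = frames_processed(num_frames, WINDOW)

# A three-frame scheme touches a constant number of frames per step.
three_frame_cost = frames_processed(num_frames, 3)

ratio = full_window_cost / three_frame_cost
print(f"{num_frames} frames, cost ratio (window / three-frame): {ratio:.2f}")
```

Under these assumptions the per-step cost ratio is simply `WINDOW / 3`, so the gap widens as the window grows while the three-frame cost stays fixed.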

Key takeaway

For computer vision engineers building models for long-duration video analysis, FFN addresses the compute cost that makes Test Time Training impractical at this scale. Consider its frame-forgetting and adaptive windowing mechanisms if you need efficient test-time adaptation that still retains temporal context. The approach makes TTT feasible on multi-hour videos, which widens the range of settings where TTT can be deployed.

Key insights

FFN enables efficient Test Time Training for long videos by selectively processing frames and adapting window size based on information novelty.

Principles

Method

FFN processes three frames (exiting, current, next) in a sliding window. It uses a "surprise metric" to adaptively adjust window size based on new information in incoming frames.
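The summary does not give the paper's formulas, so the following is only a minimal sketch of how the two ideas could fit together. Everything specific here is an assumption: the surprise score is taken to be one minus the cosine similarity of two frame feature vectors, the window is assumed to shrink when surprise is high and grow slowly otherwise, and the names `surprise`, `adapt_window`, and `three_frame_indices` are hypothetical.

```python
import math
from typing import Optional, Sequence, Tuple


def surprise(prev_feat: Sequence[float], next_feat: Sequence[float]) -> float:
    """Hypothetical surprise score: 1 - cosine similarity of two feature vectors."""
    dot = sum(a * b for a, b in zip(prev_feat, next_feat))
    norm = math.sqrt(sum(a * a for a in prev_feat)) * math.sqrt(
        sum(b * b for b in next_feat)
    )
    if norm == 0.0:
        return 0.0  # treat an all-zero feature vector as carrying no new information
    return 1.0 - dot / norm


def adapt_window(
    window: int,
    score: float,
    threshold: float = 0.5,
    min_window: int = 2,
    max_window: int = 64,
) -> int:
    """Assumed rule: halve the window on high surprise, otherwise grow it by one."""
    if score > threshold:
        return max(min_window, window // 2)
    return min(max_window, window + 1)


def three_frame_indices(
    t: int, window: int, num_frames: int
) -> Tuple[Optional[int], int, Optional[int]]:
    """Indices of the exiting, current, and next frames at step t (None if out of range)."""
    exiting = t - window if t - window >= 0 else None
    nxt = t + 1 if t + 1 < num_frames else None
    return exiting, t, nxt


# Toy features: the scene changes abruptly between frames 1 and 2.
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
window = 4
for t in range(len(features) - 1):
    score = surprise(features[t], features[t + 1])
    window = adapt_window(window, score)
    print(t, three_frame_indices(t, window, len(features)), round(score, 2), window)
```

The point of the sketch is the shape of the loop: each step looks at a constant number of frames, and the surprise score decides how far back the exiting frame sits. The actual metric and update rule in FFN may differ.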

Best for: Research Scientist, AI Scientist, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.