Towards Consistent Video Geometry Estimation
Summary
ViGeo is a novel feed-forward foundation model designed for recovering spatially dense and temporally consistent geometry from video sequences. Utilizing a plain transformer architecture, ViGeo supports streaming, full-sequence, and long-video inference within a unified framework. Its core innovation is dynamic chunking attention, which exposes the model to both bidirectional and causal temporal contexts during training and enables adaptive attention patterns at test time without retraining. The model also incorporates a completion-based data refinement framework, which trains a video depth completion teacher to generate dense, temporally coherent, and geometrically reliable training targets from sparse and noisy annotations. ViGeo predicts depth, point maps, and surface normals, and, trained exclusively on public datasets, achieves leading performance in online, offline, and long-video depth, surface normal, and video point map estimation.
Key takeaway
For computer vision engineers building video geometry estimation systems, ViGeo offers a single plain-transformer model whose dynamic chunking attention covers streaming, full-sequence, and long-video inference, with reported leading performance on depth, point map, and surface normal estimation. It is worth evaluating for projects that need temporally consistent, high-quality geometry, especially streaming or long-video applications, where one model could stand in for separate online and offline architectures.
Key insights
ViGeo combines dynamic chunking attention with completion-based data refinement to estimate consistent video geometry across streaming, full-sequence, and long-video inference.
Principles
- Dynamic chunking attention enables flexible temporal context.
- Data refinement improves supervision quality for video geometry.
- Unified transformer architecture supports diverse video inference.
Method
ViGeo employs dynamic chunking attention for adaptive temporal context and a completion-based data refinement framework. This framework trains a depth completion teacher to generate dense, coherent training targets from sparse video annotations.
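The summary does not spell out the exact masking rule, so the following is only a minimal sketch of one plausible reading of chunked temporal attention: frames are grouped into consecutive chunks, attention is bidirectional within a chunk and causal across chunks. The function names, the uniform chunk-size sampler, and the block-causal rule itself are assumptions for illustration, not ViGeo's published implementation.

```python
import random


def chunked_attention_mask(num_frames, chunk_size):
    """Return mask[q][k] == True when frame q may attend to frame k.

    Assumed rule (not taken from the paper): a frame sees every frame in its
    own chunk (bidirectional) and every frame in earlier chunks (causal).
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be >= 1")
    chunk_of = [i // chunk_size for i in range(num_frames)]
    return [
        [chunk_of[k] <= chunk_of[q] for k in range(num_frames)]
        for q in range(num_frames)
    ]


def sample_chunk_size(num_frames, rng):
    """Hypothetical training-time sampler: uniform chunk size in [1, num_frames]."""
    return rng.randint(1, num_frames)


# chunk_size = 1 gives a fully causal (streaming-style) mask.
causal = chunked_attention_mask(4, 1)
# chunk_size = num_frames gives a fully bidirectional (full-sequence) mask.
full = chunked_attention_mask(4, 4)
# Anything in between mixes the two temporal contexts.
mixed = chunked_attention_mask(4, 2)

rng = random.Random(0)
training_chunk = sample_chunk_size(4, rng)
```

Under this reading, varying the chunk size during training is what would expose the model to both causal and bidirectional contexts, and picking a chunk size at test time would switch the attention pattern without retraining.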
In practice
- Estimate depth, point maps, and surface normals.
- Perform streaming, full-sequence, or long-video inference.
- Adapt attention patterns without model retraining.
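To make the three output types concrete, here is a small sketch of how they relate geometrically under a standard pinhole camera model: depth back-projects to a point map, and a surface normal can be approximated from neighboring points. ViGeo predicts these quantities with its network; this code does not reproduce that, and the intrinsics (`fx`, `fy`, `cx`, `cy`), the helper names, and the camera-facing normal convention are assumptions chosen for illustration.

```python
import math


def depth_to_point_map(depth, fx, fy, cx, cy):
    """Back-project a depth image (list of rows) to camera-space 3D points."""
    return [
        [((u - cx) * d / fx, (v - cy) * d / fy, d) for u, d in enumerate(row)]
        for v, row in enumerate(depth)
    ]


def normal_at(points, v, u):
    """Approximate a unit normal at pixel (v, u) from forward differences.

    Requires pixels (v, u + 1) and (v + 1, u) to exist. The cross product
    order is chosen so the normal faces the camera (negative z).
    """
    p = points[v][u]
    du = [a - b for a, b in zip(points[v][u + 1], p)]  # step along image x
    dv = [a - b for a, b in zip(points[v + 1][u], p)]  # step along image y
    n = (
        dv[1] * du[2] - dv[2] * du[1],
        dv[2] * du[0] - dv[0] * du[2],
        dv[0] * du[1] - dv[1] * du[0],
    )
    length = math.sqrt(sum(c * c for c in n))
    if length == 0.0:
        return (0.0, 0.0, 0.0)  # degenerate neighborhood, no defined normal
    return tuple(c / length for c in n)


# A fronto-parallel plane two units from the camera.
depth = [[2.0, 2.0], [2.0, 2.0]]
points = depth_to_point_map(depth, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
normal = normal_at(points, 0, 0)
```

For the flat plane in the usage lines, the recovered normal points straight back at the camera, which is the expected result for a surface parallel to the image plane.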
Topics
- Video Geometry Estimation
- Transformer Models
- Dynamic Chunking Attention
- Depth Estimation
- Surface Normal Estimation
- Computer Vision
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.