Towards Consistent Video Geometry Estimation

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems) · Depth: Expert, quick

Summary

ViGeo is a feed-forward foundation model for recovering spatially dense, temporally consistent geometry from video. Built on a plain transformer, it supports streaming, full-sequence, and long-video inference within a single framework. Its core innovation is dynamic chunking attention, which exposes the model to both bidirectional and causal temporal contexts during training, so the attention pattern can be adapted at test time without retraining. The model is trained with a completion-based data refinement framework, in which a video depth completion teacher turns sparse, noisy annotations into dense, temporally coherent, and geometrically reliable training targets. ViGeo predicts depth, point maps, and surface normals. Trained exclusively on public datasets, it is reported to achieve leading performance in online, offline, and long-video depth estimation, surface normal estimation, and video point map estimation.

Key takeaway

For computer vision engineers building video geometry estimation systems, ViGeo is worth evaluating. A single plain transformer with dynamic chunking attention covers depth, point map, and surface normal estimation from video, with leading reported performance on each task. It is most relevant to projects that need consistent, high-quality geometry in streaming or long-video settings, where one model that handles every inference mode can simplify the overall architecture.

Key insights

ViGeo uses dynamic chunking attention and data refinement for consistent video geometry estimation across various inference modes.

Method

ViGeo combines two components. Dynamic chunking attention adapts the temporal context each frame can attend to, covering both bidirectional and causal settings. A completion-based data refinement framework trains a video depth completion teacher that generates dense, temporally coherent training targets from sparse, noisy video annotations.
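
This summary does not spell out the exact attention pattern, so the following is only an illustrative sketch of one plausible reading of chunk-based attention. It assumes that frames attend bidirectionally within their own chunk and causally to every earlier chunk. The function name `chunked_attention_mask` and its parameters are hypothetical and are not taken from the paper.

```python
def chunked_attention_mask(num_frames, chunk_size):
    """Build a frame-level attention mask for chunked temporal attention.

    mask[q][k] is True when query frame q may attend to key frame k.

    Assumed rule (an illustration, not the paper's stated definition):
    a frame sees every frame in its own chunk (bidirectional) and every
    frame in earlier chunks (causal), but no frame in a later chunk.
    """
    if num_frames <= 0 or chunk_size <= 0:
        raise ValueError("num_frames and chunk_size must be positive")

    mask = []
    for q in range(num_frames):
        q_chunk = q // chunk_size  # index of the chunk that holds frame q
        # Key frame k is visible if its chunk is not later than q's chunk.
        mask.append([(k // chunk_size) <= q_chunk for k in range(num_frames)])
    return mask


# Chunk size 1 reduces to a purely causal (streaming-style) mask.
streaming = chunked_attention_mask(num_frames=4, chunk_size=1)

# Chunk size equal to the sequence length gives full bidirectional attention.
offline = chunked_attention_mask(num_frames=4, chunk_size=4)

# Intermediate chunk sizes mix the two behaviours.
mixed = chunked_attention_mask(num_frames=4, chunk_size=2)

for row in mixed:
    print("".join("x" if allowed else "." for allowed in row))
```

Under this assumption, varying `chunk_size` per training batch would expose a model to both causal and bidirectional contexts, and picking a chunk size at test time would select streaming, full-sequence, or intermediate behaviour without changing any weights.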

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.