Towards Consistent Video Geometry Estimation

2026-05-28 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

ViGeo is a feed-forward foundation model that recovers spatially dense, temporally consistent geometry from video. Built on a plain transformer, it supports streaming, full-sequence, and long-video inference within a single framework. Its core innovation is dynamic chunking attention, which exposes the model to both bidirectional and causal temporal contexts during training, so attention patterns can be adapted at test time without retraining. The work also introduces a completion-based data refinement framework, which trains a video depth completion teacher to generate dense, temporally coherent, and geometrically reliable training targets from sparse and noisy annotations. ViGeo predicts depth, point maps, and surface normals. Trained exclusively on public datasets, it achieves leading performance in online, offline, and long-video depth, surface normal, and video point map estimation.

Key takeaway

For computer vision engineers building video geometry estimation systems, ViGeo covers streaming, full-sequence, and long-video inference with a single model. Its unified transformer architecture and dynamic chunking attention deliver leading reported performance on depth, point map, and surface normal estimation from video. Evaluate ViGeo for projects that need consistent, high-quality geometry recovery, especially streaming or long-video applications, where one model could simplify the architecture and improve accuracy.

Key insights

ViGeo uses dynamic chunking attention and data refinement for consistent video geometry estimation across streaming, full-sequence, and long-video inference modes.

Principles

Dynamic chunking attention enables flexible temporal context.
Data refinement improves supervision quality for video geometry.
Unified transformer architecture supports diverse video inference.

Method

ViGeo combines two components. Dynamic chunking attention provides adaptive temporal context. A completion-based data refinement framework trains a video depth completion teacher to generate dense, temporally coherent training targets from sparse and noisy video annotations.
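
One way to picture how a single attention mechanism could cover both causal and bidirectional temporal contexts is a chunk-wise mask: frames attend freely within their own chunk and causally to earlier chunks. The sketch below is an illustration under that assumption, not ViGeo's published implementation. The function name `chunked_attention_mask` and the fixed `chunk_size` parameter are hypothetical, and the summary does not say how chunk boundaries are actually chosen.

```python
def chunked_attention_mask(num_frames, chunk_size):
    """Build a frame-level attention mask (hypothetical helper).

    mask[i][j] is True when frame i may attend to frame j.
    Assumed rule: attention is bidirectional inside a chunk and
    causal across chunks, so frame i sees frame j whenever j's
    chunk is not later than i's chunk.
    """
    if num_frames < 1 or chunk_size < 1:
        raise ValueError("num_frames and chunk_size must be positive")
    # Index of the chunk each frame falls into.
    chunk_of = [t // chunk_size for t in range(num_frames)]
    return [
        [chunk_of[j] <= chunk_of[i] for j in range(num_frames)]
        for i in range(num_frames)
    ]


if __name__ == "__main__":
    # Four frames in chunks of two: each pair sees itself and the past.
    for row in chunked_attention_mask(4, 2):
        print("".join("x" if allowed else "." for allowed in row))
    # xx..
    # xx..
    # xxxx
    # xxxx
```

Under this assumed rule, `chunk_size=1` reduces to a purely causal mask (a streaming-style pattern), and `chunk_size=num_frames` gives full bidirectional attention (a full-sequence pattern). Intermediate values sit between the two, which is one plausible reading of how attention patterns could change at test time without retraining.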

In practice

Estimate depth, point maps, and surface normals.
Perform streaming or full-sequence video inference.
Adapt attention patterns without model retraining.
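
For readers less familiar with how depth and point maps relate, the sketch below unprojects a depth map into a camera-space point map with the standard pinhole camera model. It is a textbook relation shown for orientation only. The summary states that ViGeo predicts point maps as a model output and does not describe this computation, and the intrinsics `fx`, `fy`, `cx`, `cy` and the helper name are assumptions made for illustration.

```python
def depth_to_point_map(depth, fx, fy, cx, cy):
    """Unproject a depth map into per-pixel 3D points (hypothetical helper).

    depth is a list of rows; depth[v][u] is the z-depth at pixel (u, v).
    fx, fy are focal lengths in pixels; cx, cy is the principal point.
    Returns points[v][u] = (X, Y, Z) in the camera coordinate frame.
    """
    if fx == 0 or fy == 0:
        raise ValueError("focal lengths must be non-zero")
    points = []
    for v, row in enumerate(depth):
        points.append([
            # Pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
            ((u - cx) * z / fx, (v - cy) * z / fy, z)
            for u, z in enumerate(row)
        ])
    return points


if __name__ == "__main__":
    # A flat 2x2 depth map, two units from an assumed toy camera.
    toy_depth = [[2.0, 2.0], [2.0, 2.0]]
    point_map = depth_to_point_map(toy_depth, fx=2.0, fy=2.0, cx=0.0, cy=0.0)
    print(point_map[1][1])  # (1.0, 1.0, 2.0)
```

Surface normals can in turn be estimated from the local orientation of neighboring points in such a map, which is why the three outputs are often treated as views of the same underlying geometry.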

Topics

Video Geometry Estimation
Transformer Models
Dynamic Chunking Attention
Depth Estimation
Surface Normal Estimation
Computer Vision

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.