Honey, I Shrunk the Arc de Triomphe!

Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, medium

Summary

MetricScenes is a new dataset built to address "scale-collapse" in metric-scale monocular geometry estimation, a failure mode in which foundation models underestimate the size of distant landmarks and vast landscapes. The researchers hypothesize that the problem stems from limitations in existing training data, such as hardware-constrained LiDAR, short-range indoor scans, or synthetic data that lacks real-world complexity. MetricScenes is curated from diverse sources, including Internet photo collections and stereo imagery, with camera poses and initial depth maps estimated by off-the-shelf methods. Absolute scale is recovered from geo-tagged metadata and known stereo camera baselines, and a two-stage Poisson completion method then improves depth map quality. Fine-tuning the MoGe-2 model on MetricScenes significantly mitigates scale-collapse, achieving superior metric accuracy in unconstrained, open-domain scenes while preserving strong performance on standard benchmarks.
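To make the scale-recovery idea concrete: a multi-view reconstruction is defined only up to an unknown global scale, while geo-tags and stereo baselines carry metric information. The sketch below is an illustration under stated assumptions, not the MetricScenes pipeline, whose details this summary does not give. It assumes the geo-tags have already been converted to a local metric frame (for example east-north-up, in metres), and the function names `scale_from_geotags` and `stereo_depth` are hypothetical.

```python
import numpy as np

def scale_from_geotags(sfm_centers, gps_enu):
    """Estimate metres per reconstruction unit (hypothetical helper).

    sfm_centers: (N, 3) camera centres in the arbitrary reconstruction frame.
    gps_enu:     (N, 3) the same cameras in a local metric frame, in metres.
    Pairwise distances do not depend on rotation or translation, so their
    ratio isolates the scale; the median damps noisy geo-tags.
    """
    sfm_centers = np.asarray(sfm_centers, dtype=float)
    gps_enu = np.asarray(gps_enu, dtype=float)
    i, j = np.triu_indices(len(sfm_centers), k=1)  # all unordered camera pairs
    d_sfm = np.linalg.norm(sfm_centers[i] - sfm_centers[j], axis=1)
    d_gps = np.linalg.norm(gps_enu[i] - gps_enu[j], axis=1)
    valid = d_sfm > 1e-9  # skip coincident cameras
    return float(np.median(d_gps[valid] / d_sfm[valid]))

def stereo_depth(disparity_px, focal_px, baseline_m):
    """Metric depth for a rectified stereo pair: depth = focal * baseline / disparity."""
    disparity_px = np.asarray(disparity_px, dtype=float)
    depth = np.full(disparity_px.shape, np.inf)  # zero disparity means "at infinity"
    positive = disparity_px > 0
    depth[positive] = focal_px * baseline_m / disparity_px[positive]
    return depth
```

Multiplying an up-to-scale depth map by the value returned from `scale_from_geotags` would put it in metres. In the stereo case the known baseline fixes the scale directly, so no alignment step is needed.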

Key takeaway

For computer vision engineers building monocular depth estimation systems, this research suggests that scale-collapse in open-domain scenes can be significantly reduced. Consider curating and fine-tuning on more diverse, metrically grounded datasets such as MetricScenes, using geo-tagged metadata to recover absolute scale. The reported benefit is better accuracy on distant objects and vast landscapes without sacrificing performance on standard benchmarks.
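Before investing in fine-tuning, it can help to check whether a model actually shows the pattern described here. The following is a minimal diagnostic sketch, not a metric taken from the paper: it bins pixels by ground-truth distance and reports the median ratio of predicted to true depth in each bin. The function name `scale_ratio_by_range` and the bin edges are assumptions chosen for illustration.

```python
import numpy as np

def scale_ratio_by_range(pred_m, gt_m, edges_m=(0, 10, 100, 1000, np.inf)):
    """Median predicted/true depth ratio per ground-truth distance bin.

    A ratio near 1.0 means the metric scale is right in that bin; ratios
    that fall well below 1.0 as distance grows would be consistent with
    the scale-collapse behaviour described above.
    """
    pred = np.asarray(pred_m, dtype=float).ravel()
    gt = np.asarray(gt_m, dtype=float).ravel()
    ok = np.isfinite(pred) & np.isfinite(gt) & (gt > 0)  # drop invalid pixels
    pred, gt = pred[ok], gt[ok]
    report = {}
    for lo, hi in zip(edges_m[:-1], edges_m[1:]):
        in_bin = (gt >= lo) & (gt < hi)
        if in_bin.any():
            report[(lo, hi)] = float(np.median(pred[in_bin] / gt[in_bin]))
    return report
```

Running this on the same images before and after fine-tuning gives a quick view of whether far-range bins move toward 1.0 while near-range bins stay put.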

Key insights

A new dataset and method mitigate scale-collapse in monocular depth estimation for distant, unconstrained scenes.

Principles

Method

Curate diverse Internet photo collections and stereo imagery, estimate camera poses and initial depth maps with off-the-shelf methods, recover absolute scale from geo-tagged metadata and known stereo baselines, then refine the depth maps with two-stage Poisson completion.
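The last step of the method can be illustrated with a generic Poisson fill. This summary does not describe what the two stages are, so the sketch below shows a single pass only and should not be read as the authors' procedure. It keeps the known depth pixels fixed and solves a discrete Poisson equation in the holes with plain Jacobi iteration; when a guide map is supplied (for example a dense but less reliable depth estimate, an assumption of this sketch), the filled region copies the guide's Laplacian. The names `poisson_complete` and `_neighbour_sum` are hypothetical.

```python
import numpy as np

def _neighbour_sum(a):
    """Sum of the four axis-aligned neighbours, replicating image borders."""
    p = np.pad(a, 1, mode="edge")
    return p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]

def poisson_complete(depth, known, guide=None, iters=2000):
    """Fill unknown depth pixels by solving a discrete Poisson equation.

    depth: (H, W) depth map; values where `known` is False are ignored.
    known: (H, W) boolean mask of trusted pixels (held fixed).
    guide: optional (H, W) map whose Laplacian the filled region should
           match; with no guide the fill is harmonic (smoothest possible).
    """
    depth = np.asarray(depth, dtype=float)
    known = np.asarray(known, dtype=bool)
    if guide is None:
        lap = np.zeros_like(depth)
    else:
        guide = np.asarray(guide, dtype=float)
        lap = _neighbour_sum(guide) - 4.0 * guide  # discrete Laplacian of the guide
    # Start the holes at the mean of the known depths, then relax.
    out = np.where(known, depth, depth[known].mean())
    for _ in range(iters):
        update = (_neighbour_sum(out) - lap) / 4.0  # Jacobi step
        out = np.where(known, depth, update)        # re-impose known pixels
    return out
```

Jacobi iteration is used here only because it fits in a few lines; for full-resolution depth maps a sparse direct solver or a multigrid method would be a more practical choice.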

Best for: Research Scientist, AI Scientist, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.