Honey, I Shrunk the Arc de Triomphe!

2026-06-01 · Source: Takara TLDR - Daily AI Papers · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Expert, medium

Summary

A new dataset, MetricScenes, targets the "scale-collapse" phenomenon in metric-scale monocular geometry estimation, in which foundation models underestimate the size of distant landmarks and vast landscapes. The researchers hypothesize that the problem stems from limitations in training data, such as hardware-constrained LiDAR, short-range indoor scans, or synthetic data that lacks real-world complexity. MetricScenes is curated from diverse sources such as Internet photo collections and stereo imagery, with camera poses and initial depth maps estimated using off-the-shelf methods. Absolute scale is recovered from geo-tagged metadata and known stereo camera baselines, and depth map quality is further improved by a two-stage Poisson completion method. Fine-tuning the MoGe-2 model on MetricScenes significantly mitigates scale-collapse, achieving superior metric accuracy in unconstrained, open-domain scenes while preserving strong performance on standard benchmarks.

Key takeaway

For computer vision engineers developing monocular depth estimation systems, this research indicates that scale-collapse in open-domain scenes can be significantly reduced through better training data. Consider curating and fine-tuning on more diverse, metrically grounded datasets like MetricScenes, using geo-tagged metadata and known stereo baselines to recover absolute scale. This approach improves metric accuracy for distant objects and vast landscapes while preserving performance on standard benchmarks.

Key insights

A new dataset and method mitigate scale-collapse in monocular depth estimation for distant, unconstrained scenes.

Principles

Training data diversity is crucial for robust metric scale estimation.
Geo-tagged metadata can provide absolute scale anchors.
Combining diverse data sources improves model generalization.
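
To make the geo-tag principle concrete, here is a minimal sketch of how a single metric scale factor might be recovered for a reconstruction. It assumes the geo-tags have already been converted to a local metric frame (for example east-north-up, in metres) and that each one is paired with a camera centre from structure-from-motion. The function name `recover_metric_scale` and the median-of-ratios estimator are illustrative assumptions, not the procedure used to build MetricScenes.

```python
import numpy as np


def recover_metric_scale(sfm_centers, gps_positions_m):
    """Estimate one scale factor that maps reconstruction units to metres.

    sfm_centers:     (N, 3) camera centres in arbitrary reconstruction units.
    gps_positions_m: (N, 3) the same cameras in a local metric frame.

    Hypothetical helper: it compares pairwise camera distances, which are
    invariant to the unknown rotation and translation between the two frames,
    and takes the median ratio so that a few bad geo-tags do not dominate.
    """
    sfm = np.asarray(sfm_centers, dtype=float)
    gps = np.asarray(gps_positions_m, dtype=float)
    if sfm.shape != gps.shape or sfm.ndim != 2 or sfm.shape[0] < 2:
        raise ValueError("need two matching (N, 3) arrays with N >= 2")

    # All unordered camera pairs (i < j).
    i, j = np.triu_indices(sfm.shape[0], k=1)
    d_sfm = np.linalg.norm(sfm[i] - sfm[j], axis=1)
    d_gps = np.linalg.norm(gps[i] - gps[j], axis=1)

    # Skip pairs of (nearly) coincident cameras to avoid dividing by zero.
    valid = d_sfm > 1e-9
    if not np.any(valid):
        raise ValueError("all cameras coincide; scale is unobservable")

    return float(np.median(d_gps[valid] / d_sfm[valid]))
```

Multiplying the reconstruction's depth maps by the returned factor would express them in metres. Consumer GPS noise is often several metres, so an estimate of this kind is only meaningful when the cameras are spread over a distance much larger than that noise.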

Method

Curate diverse Internet photo collections and stereo imagery, estimate initial depth maps and camera poses with off-the-shelf methods, recover absolute scale from geo-tags and known stereo baselines, then refine the depth maps with two-stage Poisson completion.
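
For the stereo branch, a known baseline fixes absolute scale through the standard rectified-stereo relation depth = focal length × baseline / disparity. The sketch below illustrates that relation only; the function name, parameter names, and the minimum-disparity cutoff are assumptions, not details taken from the paper.

```python
import numpy as np


def disparity_to_metric_depth(disparity_px, focal_px, baseline_m, min_disparity=1e-3):
    """Convert a rectified-stereo disparity map to metric depth.

    disparity_px: disparity in pixels (array-like).
    focal_px:     focal length in pixels.
    baseline_m:   distance between the two camera centres in metres.

    Pixels at or below min_disparity (an assumed cutoff) are returned as NaN,
    because their depth is unbounded or unreliable.
    """
    disparity = np.asarray(disparity_px, dtype=float)
    depth = np.full(disparity.shape, np.nan)
    valid = disparity > min_disparity
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth
```

Because depth varies inversely with disparity, a fixed disparity error produces a depth error that grows roughly with the square of the distance, which is one reason distant structures are hard to measure with short baselines.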

In practice

Utilize geo-tagged metadata for absolute scale recovery.
Employ Poisson completion for depth map refinement.
Fine-tune existing models on diverse, metrically grounded datasets.
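
As a rough illustration of the depth-completion step, the sketch below fills missing depth pixels by solving the Laplace equation (a Poisson equation with a zero guidance field) using Jacobi iterations, keeping measured pixels fixed. It is a single-stage simplification under stated assumptions: the summary does not describe the paper's two stages, and the function name `laplace_fill` and the iteration count are hypothetical.

```python
import numpy as np


def laplace_fill(depth, known_mask, iterations=2000):
    """Fill unknown depth pixels with a smooth (harmonic) interpolation.

    depth:      (H, W) depth map; values where known_mask is False are ignored.
    known_mask: (H, W) boolean array, True where depth is trusted.

    Each iteration replaces every unknown pixel with the mean of its four
    neighbours, while known pixels are reset to their measured values.
    Edge padding makes the image border behave as a zero-gradient boundary.
    """
    depth = np.asarray(depth, dtype=float)
    known = np.asarray(known_mask, dtype=bool)
    if depth.shape != known.shape or depth.ndim != 2:
        raise ValueError("depth and known_mask must be matching 2-D arrays")
    if not known.any():
        raise ValueError("at least one known depth value is required")

    # Start unknown pixels at the mean of the known ones.
    filled = np.where(known, depth, depth[known].mean())
    for _ in range(iterations):
        padded = np.pad(filled, 1, mode="edge")
        neighbour_mean = (
            padded[:-2, 1:-1] + padded[2:, 1:-1]
            + padded[1:-1, :-2] + padded[1:-1, 2:]
        ) / 4.0
        filled = np.where(known, depth, neighbour_mean)
    return filled
```

A fuller Poisson formulation would also supply a target gradient field (for instance from a monocular depth prediction) as the right-hand side, so that the filled regions follow predicted surface detail rather than a purely smooth interpolation. Jacobi iteration converges slowly on large holes, so a sparse linear solver or a multigrid scheme would be the more practical choice at full image resolution.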

Topics

Monocular Depth Estimation
Metric Scale Geometry
Dataset Curation
Scale Collapse Mitigation
Geo-tagged Metadata
Poisson Completion

Code references

InternLM/ARC-VL

Best for: Research Scientist, AI Scientist, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Takara TLDR - Daily AI Papers.