GGPT: Geometry Grounded Point Transformer

· Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision, 3D Reconstruction · Depth: Advanced, extended

Summary

The Geometry-Grounded Point Transformer (GGPT) is a framework for sparse-view 3D reconstruction that combines reliable sparse geometric guidance with dense feed-forward predictions. It targets two weaknesses of existing feed-forward networks such as DUSt3R, MASt3R, and VGGT: geometric inconsistency and limited fine-grained accuracy. GGPT first runs an improved Structure-from-Motion (SfM) pipeline, built on dense feature matching and lightweight geometric optimization, to efficiently estimate accurate camera poses and partial 3D point clouds from sparse RGB input views. A geometry-guided 3D point transformer then refines the dense point maps under explicit partial-geometry supervision, using an optimized guidance encoding. Trained on ScanNet++ with VGGT predictions, GGPT generalizes across architectures and datasets and is reported to significantly outperform state-of-the-art feed-forward 3D reconstruction models in both in-domain and out-of-domain settings, including challenging human body and surgical scenes.

Key takeaway

If you build 3D reconstruction systems, GGPT offers a way to correct the geometric inconsistencies of feed-forward models. Consider pairing its improved SfM pipeline with the 3D point transformer refinement when accuracy and generalization matter, especially on out-of-domain data such as surgical scenes or human body reconstruction. The approach combines the completeness of dense predictions with the geometric precision of SfM.

Key insights

GGPT refines dense 3D reconstructions by fusing feed-forward predictions with accurate sparse geometric guidance from an improved SfM pipeline.

Method

GGPT works in two stages. First, an improved SfM pipeline produces sparse geometric guidance: camera poses and a partial 3D point cloud. Second, a 3D point transformer (Point Transformer V3, PTv3) refines the dense feed-forward point maps by predicting residual corrections in a global 3D coordinate space.
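The residual formulation can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the index-to-point guidance format, and the radius parameter are all assumptions made for illustration, and the learned transformer is replaced by a hypothetical nearest-neighbor rule that copies the offset observed at the closest guided point. Only the final step, refined point = dense point + predicted residual in a shared 3D frame, reflects the description above.

```python
from math import dist
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]


def guidance_offsets(dense: List[Point], guidance: Dict[int, Point]) -> Dict[int, Point]:
    """Offset from each guided dense point to its sparse SfM position.

    `guidance` maps an index into `dense` to the SfM-estimated 3D point
    (an assumed format, chosen for this illustration only).
    """
    return {
        i: tuple(g - d for g, d in zip(sfm_pt, dense[i]))
        for i, sfm_pt in guidance.items()
    }


def toy_residuals(dense: List[Point], guidance: Dict[int, Point], radius: float) -> List[Point]:
    """Hypothetical stand-in for the learned network (NOT PTv3).

    Each dense point copies the offset of the nearest guided point within
    `radius`; points with no guidance nearby get a zero residual.
    """
    offsets = guidance_offsets(dense, guidance)
    residuals: List[Point] = []
    for pt in dense:
        best_idx, best_d = None, radius
        for i in offsets:
            d = dist(pt, dense[i])
            if d <= best_d:
                best_idx, best_d = i, d
        residuals.append(offsets[best_idx] if best_idx is not None else (0.0, 0.0, 0.0))
    return residuals


def apply_residuals(dense: List[Point], residuals: List[Point]) -> List[Point]:
    """Refined point map: dense prediction plus per-point residual correction."""
    if len(dense) != len(residuals):
        raise ValueError("dense points and residuals must have the same length")
    return [tuple(p + r for p, r in zip(pt, res)) for pt, res in zip(dense, residuals)]


# Toy data: three dense points, one of which has sparse SfM guidance.
dense = [(0.0, 0.0, 1.0), (0.1, 0.0, 1.0), (5.0, 5.0, 5.0)]
guidance = {0: (0.0, 0.0, 1.2)}
refined = apply_residuals(dense, toy_residuals(dense, guidance, radius=0.5))
print(refined)
```

Predicting a residual instead of absolute coordinates means that a point far from any guidance stays at its feed-forward estimate, so the dense prediction's completeness is preserved while guided regions are pulled toward the sparse geometry.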

Best for: AI Scientist, AI Researcher, Computer Vision Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.