GGPT: Geometry-Grounded Point Transformer

2026-03-13 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision, 3D Reconstruction · Depth: Advanced, extended

Summary

The Geometry-Grounded Point Transformer (GGPT) is a framework for sparse-view 3D reconstruction that combines reliable but sparse geometric guidance with dense feed-forward predictions. It targets the geometric inconsistencies and limited fine-grained accuracy of existing feed-forward networks such as DUSt3R, MASt3R, and VGGT. GGPT first runs an improved Structure-from-Motion (SfM) pipeline, built on dense feature matching and lightweight geometric optimization, to estimate accurate camera poses and partial 3D point clouds from sparse RGB input views. A geometry-guided 3D point transformer then refines the dense point maps under explicit partial-geometry supervision, using an optimized guidance encoding. Trained on ScanNet++ with VGGT predictions, GGPT is reported to generalize across architectures and datasets and to outperform state-of-the-art feed-forward 3D reconstruction models in both in-domain and out-of-domain settings, including challenging human body and surgical scenes.

Key takeaway

For AI scientists building 3D reconstruction systems, GGPT offers a way to correct geometric inconsistencies in feed-forward models. Consider pairing its improved SfM pipeline with the 3D point transformer for better accuracy and generalization, especially in out-of-domain applications such as surgical scenes or human body reconstruction. The approach combines the completeness of dense predictions with the geometric precision of SfM.

Key insights

GGPT refines dense 3D reconstructions by fusing feed-forward predictions with accurate sparse geometric guidance from an improved SfM pipeline.

Principles

Geometric priors enhance dense feed-forward 3D reconstructions.
3D-space reasoning improves multi-view geometric consistency.
Patch-based processing balances efficiency and fine-grained fidelity.

Method

GGPT first runs an improved SfM pipeline to obtain sparse geometric guidance. A 3D point transformer (Point Transformer V3, PTv3) then refines the dense feed-forward point maps by predicting residual corrections in a global 3D coordinate space.
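
The residual formulation described above (refined point = dense prediction + predicted offset, all in one shared world frame) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `toy_residual` is a hypothetical nearest-neighbour stand-in for the learned point transformer, and the `radius` parameter is invented. It is not the GGPT network and does not reproduce its guidance encoding.

```python
import numpy as np

def toy_residual(dense, sparse, radius=0.1):
    """Hypothetical stand-in for the learned residual predictor.

    dense:  (N, 3) dense feed-forward points in a shared world frame.
    sparse: (M, 3) sparse SfM guidance points in the same frame.
    Returns (N, 3) offsets; points with no guidance within `radius`
    receive a zero offset.
    """
    # Pairwise distances between every dense and every sparse point: (N, M)
    dist = np.linalg.norm(dense[:, None, :] - sparse[None, :, :], axis=-1)
    nearest = dist.argmin(axis=1)
    nearest_dist = dist[np.arange(len(dense)), nearest]
    # Offset that would move each dense point onto its nearest guidance point
    offset = sparse[nearest] - dense
    # Keep the offset only where guidance is close enough to be trusted
    has_guidance = nearest_dist < radius
    return offset * has_guidance[:, None]

def refine(dense, sparse, radius=0.1):
    """Residual refinement: output = input + predicted correction."""
    return dense + toy_residual(dense, sparse, radius)

# Three dense points, one sparse guidance point close to the first.
dense = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 1.0], [5.0, 5.0, 5.0]])
sparse = np.array([[0.0, 0.0, 1.05]])
refined = refine(dense, sparse)
```

Predicting an offset instead of absolute coordinates means a predictor that outputs zeros leaves the feed-forward reconstruction unchanged, so the dense prediction acts as the default wherever guidance is missing.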

In practice

Integrate dense matchers (RoMa, UFM) for robust SfM.
Use sparse Bundle Adjustment for efficient camera pose estimation.
Apply Direct Linear Transform (DLT) triangulation for fast, accurate 3D point reconstruction.
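
As a rough illustration of the last item, the following is a minimal sketch of textbook linear (DLT) triangulation with NumPy. It assumes known 3x4 projection matrices and one matched pixel per view; the function names, intrinsics, and camera layout are made up for the example and are not taken from the GGPT code.

```python
import numpy as np

def triangulate_dlt(projections, pixels):
    """Triangulate one 3D point from N >= 2 views by linear DLT.

    projections: (N, 3, 4) camera matrices P = K [R | t].
    pixels:      (N, 2) observed pixel coordinates (u, v), one per view.
    Returns the 3D point as a length-3 array.
    """
    projections = np.asarray(projections, dtype=float)
    pixels = np.asarray(pixels, dtype=float)
    rows = []
    for P, (u, v) in zip(projections, pixels):
        # Each view contributes two linear constraints on the homogeneous point
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # The solution is the right singular vector of the smallest singular value
    _, _, vt = np.linalg.svd(A)
    X_h = vt[-1]
    return X_h[:3] / X_h[3]

def project(P, X):
    """Project a 3D point X with the 3x4 camera matrix P to pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Illustrative two-view setup: shared intrinsics, second camera shifted along x.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

X_true = np.array([0.2, -0.1, 4.0])
pixels = np.array([project(P1, X_true), project(P2, X_true)])
X_est = triangulate_dlt([P1, P2], pixels)
```

With noise-free matches the estimate agrees with the true point up to floating-point error. With noisy matches, the linear solution is commonly used as an initial estimate that a nonlinear step such as bundle adjustment then refines.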

Topics

3D Reconstruction
Point Transformers
Structure-from-Motion
Multi-view Geometry
Deep Learning for 3D Vision

Code references

yyfz/Pi3

Best for: AI Scientist, AI Researcher, Computer Vision Engineer, Research Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.