Déjà View: Looping Transformers for Multi-View 3D Reconstruction

· Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision & Pattern Recognition, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

DéjàView is a 3D reconstruction model that applies a single transformer block recurrently for K refinement steps, running against the trend of scaling up model capacity in computer vision. Its premise is that depth in feed-forward transformers often simulates iteration inefficiently; DéjàView makes that iteration explicit in the architecture. Trained once, it exposes K as an inference-time compute knob and matches or outperforms substantially larger feed-forward baselines across five reconstruction benchmarks spanning indoor, outdoor, object-centric, and driving scenes. It does so with a fraction of their parameters and comparable or lower compute, suggesting that explicit iteration is a stronger inductive bias for multi-view 3D reconstruction.
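To make the "fraction of their parameters" claim concrete, here is a back-of-the-envelope sketch of how weight counts scale for a stack of distinct blocks versus one block reused K times. The width and depth are hypothetical and are not figures from the paper. The count uses the usual approximation for a standard transformer block (four attention projections plus a 4x-expansion MLP, ignoring biases and norms) and leaves out any encoder or prediction heads a real model would have.

```python
def block_params(dim, mlp_ratio=4):
    """Approximate weight count of one transformer block (no biases or norms)."""
    attention = 4 * dim * dim          # Q, K, V, and output projections
    mlp = 2 * mlp_ratio * dim * dim    # up-projection and down-projection
    return attention + mlp


def stacked_params(dim, depth):
    """Feed-forward stack: every layer owns its own weights."""
    return depth * block_params(dim)


def looped_params(dim, steps):
    """Looped model: one block reused, so `steps` never enters the count."""
    return block_params(dim)


DIM, DEPTH = 1024, 24  # hypothetical width and depth, for illustration only

stacked = stacked_params(DIM, DEPTH)
looped = looped_params(DIM, steps=DEPTH)
print(f"stacked: {stacked:,} weights")
print(f"looped:  {looped:,} weights ({stacked // looped}x fewer)")
```

Under these assumptions the looped model stores one block's worth of weights no matter how many steps it runs. Compute still grows with K, since each step is a full pass through the block.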

Key takeaway

Machine learning engineers optimizing multi-view 3D reconstruction models should consider recurrent architectures like DéjàView over deep feed-forward designs. The approach can deliver comparable or better performance with far fewer parameters and adjustable inference compute. Explicit iteration is worth exploring in model design as a potentially stronger inductive bias and a route to better efficiency on complex 3D tasks.

Key insights

DéjàView applies a single transformer block recurrently for efficient multi-view 3D reconstruction, matching or outperforming larger feed-forward models.

Method

DéjàView applies a single transformer block recurrently to per-view features for K refinement steps, exposing K as an inference-time compute knob.
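The looping idea can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not the paper's implementation. The single-head pre-norm block, the joint attention over all tokens, and the names `init_block`, `block`, and `refine` are all assumptions made for the example. The token list stands in for per-view features concatenated across views.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def layer_norm(x, eps=1e-5):
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def add(x, y):
    """Element-wise sum of two equally shaped matrices (residual connection)."""
    return [[a + b for a, b in zip(r, s)] for r, s in zip(x, y)]


def init_block(dim, seed=0):
    """Create ONE set of weights; every refinement step reuses it."""
    rng = random.Random(seed)

    def mat(rows, cols):
        scale = 1.0 / math.sqrt(rows)
        return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

    return {"wq": mat(dim, dim), "wk": mat(dim, dim), "wv": mat(dim, dim),
            "wo": mat(dim, dim), "w1": mat(dim, 4 * dim), "w2": mat(4 * dim, dim)}


def block(tokens, p):
    """Assumed block layout: pre-norm single-head attention, then a ReLU MLP."""
    dim = len(tokens[0])
    h = layer_norm(tokens)
    q, k, v = matmul(h, p["wq"]), matmul(h, p["wk"]), matmul(h, p["wv"])
    scores = [[sum(a * b for a, b in zip(qi, kj)) / math.sqrt(dim) for kj in k]
              for qi in q]
    attn = [softmax(row) for row in scores]
    tokens = add(tokens, matmul(matmul(attn, v), p["wo"]))
    hidden = [[max(0.0, u) for u in row] for row in matmul(layer_norm(tokens), p["w1"])]
    return add(tokens, matmul(hidden, p["w2"]))


def refine(tokens, p, steps):
    """Apply the same block `steps` times; `steps` plays the role of K."""
    for _ in range(steps):
        tokens = block(tokens, p)
    return tokens


def count_params(p):
    return sum(len(m) * len(m[0]) for m in p.values())


params = init_block(dim=8)
rng = random.Random(1)
# Hypothetical input: 2 views x 3 tokens each, 8 channels per token.
tokens = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(6)]

coarse = refine(tokens, params, steps=2)  # cheaper, fewer refinement steps
fine = refine(tokens, params, steps=6)    # same weights, more compute
```

Because `refine` only changes how many times `block` runs, the step count can be chosen at inference time without touching the weights, which is what makes K act as a compute knob in this sketch.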

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.