Scaling View Synthesis Transformers

Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, extended

Summary

A systematic study of scaling laws for view synthesis transformers introduces the Scalable View Synthesis Model (SVSM), an encoder-decoder architecture for Novel View Synthesis (NVS) that improves both performance and compute efficiency. Contrary to prior findings that favored decoder-only models such as LVSM, SVSM scales as effectively as those models while requiring 2-3x less training compute and rendering significantly faster. The research identifies "effective batch size" (the number of scenes per batch multiplied by the number of target views per scene) as a critical factor for compute-optimal training. For multiview NVS (more than two context views, V_C > 2), relative camera pose embeddings, specifically PRoPE, are shown to be essential for SVSM to keep its favorable scaling. Even when restricted to a fixed-size latent representation, SVSM's unidirectional decoder remains more compute-efficient than LVSM's encoder-decoder variant, though both bottlenecked designs scale less effectively than their unbottlenecked counterparts.
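The "effective batch size" defined above is a simple product, and a short sketch makes the bookkeeping concrete. The function names and example numbers below are illustrative assumptions, not taken from the paper; the summary does not say how a given effective batch size should be split between scenes and target views.

```python
def effective_batch_size(scenes_per_batch: int, target_views_per_scene: int) -> int:
    """Effective batch size as defined in the summary: scenes x target views per scene."""
    if scenes_per_batch < 1 or target_views_per_scene < 1:
        raise ValueError("both factors must be positive integers")
    return scenes_per_batch * target_views_per_scene


def configs_with_effective_batch(total: int) -> list[tuple[int, int]]:
    """Enumerate every (scenes_per_batch, target_views_per_scene) pair whose product is `total`."""
    return [(s, total // s) for s in range(1, total + 1) if total % s == 0]


# Hypothetical numbers: 64 scenes with 4 target views each.
print(effective_batch_size(64, 4))        # 256
print(configs_with_effective_batch(8))    # [(1, 8), (2, 4), (4, 2), (8, 1)]
```

The enumeration only lists the combinations that reach the same product; which of them trains best is a question the summary leaves to the paper.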

Key takeaway

For Computer Vision Engineers developing Novel View Synthesis models, this research indicates that adopting an encoder-decoder architecture like SVSM, rather than a decoder-only model, can reduce training compute by 2-3x while improving rendering speed and maintaining state-of-the-art performance. Prioritize tuning "effective batch size" and integrate relative camera pose embeddings (e.g., PRoPE) to preserve scaling behavior, especially in multiview scenarios. These results challenge the previous assumption that bidirectional attention is critical for high-fidelity view synthesis.
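One rough way to see why encoding the context once can save compute is to count attention query-key pairs per layer. The sketch below is a back-of-the-envelope illustration under strong assumptions: a single layer, full self-attention over context plus target tokens repeated for every target view in the decoder-only case, and cross-attention only (no self-attention among target tokens) in the encoder-decoder case. It ignores MLP cost and is not the paper's compute accounting, so it does not reproduce the 2-3x figure. All function names and token counts are hypothetical.

```python
def decoder_only_pairs(ctx_tokens: int, tgt_tokens: int, n_targets: int) -> int:
    """Assumed decoder-only cost: full self-attention over [context + target] tokens, once per target view."""
    seq = ctx_tokens + tgt_tokens
    return n_targets * seq * seq


def encoder_decoder_pairs(ctx_tokens: int, tgt_tokens: int, n_targets: int) -> int:
    """Assumed encoder-decoder cost: context self-attention once, then cross-attention per target view."""
    encode_once = ctx_tokens * ctx_tokens
    decode_each = tgt_tokens * ctx_tokens
    return encode_once + n_targets * decode_each


# Hypothetical sizes: 8 context tokens, 4 tokens per target view, 2 target views.
print(decoder_only_pairs(8, 4, 2))      # 288
print(encoder_decoder_pairs(8, 4, 2))   # 128
```

Under these assumptions the gap widens as the number of target views grows, because the context self-attention term is paid once rather than once per target.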

Key insights

Encoder-decoder architectures can be compute-optimal for view synthesis transformers with proper design and training strategies.

Principles

Method

The Scalable View Synthesis Model (SVSM) uses a unidirectional encoder-decoder architecture: it processes the context images once to create a scene latent, then decodes target views in parallel via cross-attention.
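The encode-once, decode-in-parallel flow described above can be sketched with NumPy. This is a minimal illustration, not SVSM itself: the encoder is a stand-in (one linear map plus tanh), the decoder is a single-head cross-attention layer, and every name, shape, and weight here is an assumption made for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def encode_context(context_tokens, w_enc):
    """Stand-in encoder (assumption): maps all context tokens to a scene latent in one pass."""
    return np.tanh(context_tokens @ w_enc)            # (N_ctx, d)


def cross_attention(queries, latent, w_q, w_k, w_v):
    """Single-head cross-attention: target queries read from the scene latent."""
    q = queries @ w_q                                 # (V_T, N_q, d)
    k = latent @ w_k                                  # (N_ctx, d)
    v = latent @ w_v                                  # (N_ctx, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])           # (V_T, N_q, N_ctx)
    return softmax(scores) @ v                        # (V_T, N_q, d)


rng = np.random.default_rng(0)
d = 16
n_ctx_views, tokens_per_view = 2, 5        # hypothetical context setup
n_target_views, queries_per_view = 3, 8    # hypothetical target setup

context_tokens = rng.normal(size=(n_ctx_views * tokens_per_view, d))
target_queries = rng.normal(size=(n_target_views, queries_per_view, d))
w_enc, w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(4))

scene_latent = encode_context(context_tokens, w_enc)   # computed once per scene
decoded = cross_attention(target_queries, scene_latent, w_q, w_k, w_v)
print(decoded.shape)  # (3, 8, 16)
```

Because each target view only reads from the shared latent, decoding one view alone gives the same result as decoding it within the batch, which is what allows the target views to be processed in parallel.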

Best for: Computer Vision Engineer, Research Scientist, AI Researcher, AI Scientist, Deep Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.