Scaling View Synthesis Transformers
Summary
A systematic study of scaling laws for view synthesis transformers introduces the Scalable View Synthesis Model (SVSM), an encoder-decoder architecture that achieves superior performance and compute efficiency in Novel View Synthesis (NVS). Contrary to prior findings that favored decoder-only models such as LVSM, SVSM shows equivalent scaling behavior while requiring 2-3x less training compute and rendering significantly faster. The study identifies "effective batch size" (the product of scenes per batch, B, and target views per scene, V_T) as a critical factor for compute-optimal training. For multiview NVS (V_C > 2, i.e., more than two context views), relative camera pose embeddings, specifically PRoPE, are shown to be essential for SVSM to keep its favorable scaling. Even with fixed-size latent representations, SVSM's unidirectional decoder remains more compute-efficient than LVSM's encoder-decoder variant, though both bottlenecked designs scale less effectively than their unbottlenecked counterparts.
Key takeaway
For computer vision engineers developing Novel View Synthesis models, this research indicates that adopting an encoder-decoder architecture like SVSM, rather than a decoder-only model, can cut training compute by 2-3x while improving rendering speed and maintaining state-of-the-art performance. Prioritize tuning the "effective batch size" and integrate relative camera pose embeddings (e.g., PRoPE) to keep performance scaling, especially in multiview scenarios. The results challenge the previous assumption that bidirectional attention is critical for high-fidelity view synthesis.
Key insights
Encoder-decoder architectures can be compute-optimal for view synthesis transformers with proper design and training strategies.
Principles
- Effective batch size (B * V_T) governs NVS model performance.
- Unidirectional decoding improves rendering efficiency and training throughput.
- Relative camera pose embeddings are crucial for multiview scaling.
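To make the first principle concrete, here is a minimal sketch of the effective batch size arithmetic, B_eff = B * V_T. The function names are hypothetical and chosen only for illustration; they do not come from the paper or any library.

```python
def effective_batch_size(scenes_per_batch: int, target_views_per_scene: int) -> int:
    """B_eff = B * V_T: scenes per batch times target views per scene."""
    if scenes_per_batch < 1 or target_views_per_scene < 1:
        raise ValueError("both factors must be positive integers")
    return scenes_per_batch * target_views_per_scene


def configurations_for(b_eff: int) -> list[tuple[int, int]]:
    """All (B, V_T) pairs whose product equals the requested effective batch size."""
    return [(b, b_eff // b) for b in range(1, b_eff + 1) if b_eff % b == 0]


print(effective_batch_size(64, 4))  # 256
print(configurations_for(8))        # [(1, 8), (2, 4), (4, 2), (8, 1)]
```

Enumerating the factor pairs shows how many different (B, V_T) settings share one effective batch size, which is the quantity the study treats as the governing variable.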
Method
The Scalable View Synthesis Model (SVSM) uses a unidirectional encoder-decoder architecture: it processes the context images once to create a scene latent, then decodes target views in parallel via cross-attention.
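The benefit of encoding the context once can be illustrated with a back-of-envelope count of query-key pairs in a single attention layer. This is a rough sketch, not the paper's FLOP accounting. It assumes a decoder-only model runs a joint pass over context and target tokens for every target view, and that the encoder-decoder's decoder uses cross-attention to the context plus self-attention within each target view. The function names and token counts are hypothetical.

```python
def decoder_only_pairs(context_views: int, target_views: int, tokens_per_image: int) -> int:
    """Assumed decoder-only cost: one joint pass over context + target tokens per target view."""
    seq_len = (context_views + 1) * tokens_per_image
    return target_views * seq_len ** 2


def encoder_decoder_pairs(context_views: int, target_views: int, tokens_per_image: int) -> int:
    """Assumed encoder-decoder cost: encode context once, then decode each target view."""
    context_tokens = context_views * tokens_per_image
    encode_once = context_tokens ** 2                  # context self-attention, paid a single time
    cross_attn = tokens_per_image * context_tokens     # target queries attend to the scene latent
    target_self_attn = tokens_per_image ** 2           # assumption: targets also self-attend
    return encode_once + target_views * (cross_attn + target_self_attn)


# Hypothetical setting: 2 context views, 4 target views, 256 tokens per image.
joint = decoder_only_pairs(2, 4, 256)
split = encoder_decoder_pairs(2, 4, 256)
print(joint, split, joint / split)
```

Under these assumptions the encoder cost is amortized across target views, so the gap widens as more target views are rendered per scene. The numbers this toy model produces should not be read as a reproduction of the reported 2-3x compute saving.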
In practice
- Prioritize encoder-decoder designs for NVS compute efficiency.
- Optimize training by balancing scenes per batch and target views per scene, whose product is the effective batch size (B_eff).
- Implement PRoPE embeddings for robust multiview NVS scaling.
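The sketch below shows the quantity that relative camera pose embeddings are built on: the pose of one camera expressed in another camera's frame, which does not change when the whole scene is moved to a different world coordinate system. It is not an implementation of PRoPE. The helper names and example matrices are hypothetical, and poses are assumed to be 4x4 camera-to-world rigid transforms.

```python
Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Product of two 4x4 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def invert_rigid(t: Matrix) -> Matrix:
    """Invert a rigid transform [R | t] as [R^T | -R^T t]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    trans = [-sum(r_t[i][k] * t[k][3] for k in range(3)) for i in range(3)]
    return [r_t[i] + [trans[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def relative_pose(cam_to_world_i: Matrix, cam_to_world_j: Matrix) -> Matrix:
    """Pose of camera j expressed in the frame of camera i."""
    return matmul(invert_rigid(cam_to_world_i), cam_to_world_j)


# Camera i: no rotation, shifted along x. Camera j: 90 degrees about z, shifted along y.
cam_i = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
cam_j = [[0.0, -1.0, 0.0, 0.0], [1.0, 0.0, 0.0, 2.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
# An arbitrary change of world frame applied to both cameras.
world_shift = [[0.0, -1.0, 0.0, 5.0], [1.0, 0.0, 0.0, -3.0], [0.0, 0.0, 1.0, 1.0], [0.0, 0.0, 0.0, 1.0]]

before = relative_pose(cam_i, cam_j)
after = relative_pose(matmul(world_shift, cam_i), matmul(world_shift, cam_j))
print(before)
print(after)
```

The two printed matrices should agree up to floating-point error, since the world-frame change cancels in the product. That invariance is one plausible reason relative embeddings help when the number of context views grows, although the summary itself does not state the mechanism.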
Topics
- Novel View Synthesis
- View Synthesis Transformers
- Scaling Laws
- Encoder-Decoder Architectures
- Relative Camera Attention
Best for: Computer Vision Engineer, Research Scientist, AI Researcher, AI Scientist, Deep Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.