DVSM: Decoder-only View Synthesis Model Done Right
Summary
DVSM, a novel decoder-only view synthesis model, re-examines the common encoder-decoder architecture in Large View Synthesis Models (LVSMs). Through controlled experiments, DVSM demonstrates that a decoder-only design, which implicitly represents scenes as a KV-cache, outperforms encoder-decoder variants. This is achieved with fewer parameters and identical rendering complexity. Further analysis reveals that sharing weights between the color-input reconstruction network and the camera-only rendering network improves feature alignment, facilitating image synthesis. DVSM integrates foundation model priors and stage-wise patch sizing to enhance its efficiency-quality tradeoff, establishing a new state of the art for novel-view synthesis across multiple benchmarks, even surpassing per-scene-optimized 3DGS under dense input views.
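The summary does not say how DVSM's stage-wise patch sizing is scheduled, so the following is only a rough illustration of why patch size affects the efficiency-quality tradeoff in any patch-based transformer. The 256x256 resolution and the patch sizes 8 and 16 are assumptions chosen for the arithmetic, not values from the paper.

```python
def token_count(height, width, patch_size):
    """Number of non-overlapping square patches (tokens) in one image."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch size")
    return (height // patch_size) * (width // patch_size)


def attention_pairs(num_tokens):
    """Query-key pairs scored by full self-attention over num_tokens tokens."""
    return num_tokens * num_tokens


# Hypothetical 256x256 view: doubling the patch size quarters the token
# count and cuts the attention pairs per view by a factor of 16.
for patch_size in (8, 16):
    tokens = token_count(256, 256, patch_size)
    print(patch_size, tokens, attention_pairs(tokens))
```

Larger patches are cheaper but carry coarser spatial detail per token, which is the tradeoff that any patch-size schedule has to balance.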
Key takeaway
Computer vision engineers designing novel-view synthesis models should consider DVSM's decoder-only architecture. In the reported controlled experiments, it outperforms encoder-decoder variants with fewer parameters and identical rendering complexity. Also evaluate sharing weights between the reconstruction and rendering networks and integrating foundation model priors. With these additions, DVSM reports state-of-the-art results and surpasses per-scene-optimized 3DGS under dense input views.
Key insights
Decoder-only view synthesis models such as DVSM can outperform encoder-decoder architectures while using fewer parameters and producing better-aligned features.
Principles
- Decoder-only architectures can match the rendering complexity of encoder-decoder designs for view synthesis while using fewer parameters.
- Sharing weights between networks aligns features for improved image synthesis.
Method
DVSM employs a decoder-only architecture representing scenes as a KV-cache, incorporating foundation model priors and stage-wise patch sizing for efficiency and quality.
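The idea of a scene stored as a KV-cache and read by a weight-shared renderer can be sketched as below. This is a minimal single-head toy in pure Python, not DVSM's implementation: the class `SharedLayer`, the helpers `build_scene_cache` and `render_view`, the token width `DIM`, and the choice to let target tokens attend only to the cached keys and values are all assumptions made for illustration.

```python
import math
import random

DIM = 8  # hypothetical token width


def make_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 0.3) for _ in range(cols)] for _ in range(rows)]


def project(tokens, weight):
    # Row-vector convention: each token t is mapped to t @ weight.
    return [[sum(t[i] * weight[i][j] for i in range(len(t)))
             for j in range(len(weight[0]))] for t in tokens]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        w = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out


def add(xs, ys):
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


class SharedLayer:
    """One attention layer whose weights serve both passes (assumed design)."""

    def __init__(self, rng):
        self.wq = make_matrix(rng, DIM, DIM)
        self.wk = make_matrix(rng, DIM, DIM)
        self.wv = make_matrix(rng, DIM, DIM)

    def reconstruct(self, input_tokens):
        # Input-view tokens self-attend; their keys/values become the cache.
        keys = project(input_tokens, self.wk)
        values = project(input_tokens, self.wv)
        queries = project(input_tokens, self.wq)
        return add(input_tokens, attend(queries, keys, values)), (keys, values)

    def render(self, target_tokens, kv):
        # Camera-only target tokens query the cached scene with the same wq.
        keys, values = kv
        queries = project(target_tokens, self.wq)
        return add(target_tokens, attend(queries, keys, values))


def build_scene_cache(layers, input_tokens):
    cache, hidden = [], input_tokens
    for layer in layers:
        hidden, kv = layer.reconstruct(hidden)
        cache.append(kv)
    return cache


def render_view(layers, target_tokens, cache):
    hidden = target_tokens
    for layer, kv in zip(layers, cache):
        hidden = layer.render(hidden, kv)
    return hidden


rng = random.Random(0)
layers = [SharedLayer(rng) for _ in range(2)]
input_tokens = make_matrix(rng, 6, DIM)    # stand-in for input-view tokens
target_tokens = make_matrix(rng, 4, DIM)   # stand-in for target-camera tokens
cache = build_scene_cache(layers, input_tokens)   # computed once per scene
rendered = render_view(layers, target_tokens, cache)  # reusable per target view
```

In this toy, the cache is built once from the input views and then reused for every target view, so the per-view rendering cost depends only on the target tokens and the cache size.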
In practice
- Consider decoder-only designs for novel-view synthesis tasks.
- Explore weight sharing in reconstruction and rendering networks.
Topics
- Novel View Synthesis
- Decoder-only Models
- DVSM
- Encoder-Decoder Architectures
- Foundation Models
- 3D Gaussian Splatting
Best for: Research Scientist, AI Scientist, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.