DVSM: Decoder-only View Synthesis Model Done Right

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

DVSM, a novel decoder-only view synthesis model, re-examines the common encoder-decoder architecture in Large View Synthesis Models (LVSMs). Through controlled experiments, DVSM demonstrates that a decoder-only design, which implicitly represents scenes as a KV-cache, outperforms encoder-decoder variants. This is achieved with fewer parameters and identical rendering complexity. Further analysis reveals that sharing weights between the color-input reconstruction network and the camera-only rendering network improves feature alignment, facilitating image synthesis. DVSM integrates foundation model priors and stage-wise patch sizing to enhance its efficiency-quality tradeoff, establishing a new state of the art for novel-view synthesis across multiple benchmarks, even surpassing per-scene-optimized 3DGS under dense input views.

Key takeaway

Computer vision engineers designing novel-view synthesis models should consider DVSM's decoder-only architecture, which is reported to outperform encoder-decoder designs with fewer parameters and identical rendering complexity. It is also worth evaluating weight sharing between the reconstruction and rendering networks, along with foundation model priors, in your own models. With these components, DVSM reaches state-of-the-art results and even surpasses per-scene-optimized 3DGS under dense input views.

Key insights

Decoder-only view synthesis models such as DVSM can outperform encoder-decoder architectures while using fewer parameters, and sharing weights between the reconstruction and rendering networks improves feature alignment.

Method

DVSM employs a decoder-only architecture that implicitly represents each scene as a KV-cache. It also incorporates foundation model priors and stage-wise patch sizing to improve the tradeoff between efficiency and quality.
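
To make the KV-cache idea concrete, here is a minimal sketch in plain Python. It is not DVSM's implementation: the class name DecoderOnlySketch, the two-layer depth, the single attention head, and the random token vectors are all assumptions made for illustration, and residual connections, MLPs, and normalization are left out. The sketch also assumes that target-camera tokens attend only to the cached input tokens. What it does show is the general pattern: input-view tokens pass through the layers once to produce per-layer keys and values, and camera tokens for any target view then run through the same weights while reading from that cache.

```python
import math
import random


def linear(token, weight):
    """Project one token vector through a weight matrix stored as d_in rows of d_out."""
    return [sum(t * w for t, w in zip(token, column)) for column in zip(*weight)]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query token."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


class DecoderOnlySketch:
    """Hypothetical stack of attention layers shared by both passes."""

    def __init__(self, dim, depth=2, seed=0):
        rng = random.Random(seed)

        def init():
            return [[rng.gauss(0.0, dim ** -0.5) for _ in range(dim)] for _ in range(dim)]

        # One (query, key, value) weight triple per layer.
        self.layers = [(init(), init(), init()) for _ in range(depth)]

    def build_cache(self, input_tokens):
        """Reconstruction pass: input tokens self-attend; each layer's K/V is cached."""
        cache, hidden = [], input_tokens
        for w_query, w_key, w_value in self.layers:
            keys = [linear(t, w_key) for t in hidden]
            values = [linear(t, w_value) for t in hidden]
            cache.append((keys, values))
            hidden = [attend(linear(t, w_query), keys, values) for t in hidden]
        return cache

    def render(self, camera_tokens, cache):
        """Rendering pass: camera tokens reuse the same query weights and read the cache."""
        hidden = camera_tokens
        for (w_query, _, _), (keys, values) in zip(self.layers, cache):
            hidden = [attend(linear(t, w_query), keys, values) for t in hidden]
        return hidden


DIM = 4
rng = random.Random(1)


def random_tokens(count):
    """Stand-in for embedded image patches or camera-ray tokens (illustrative only)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(count)]


model = DecoderOnlySketch(DIM)
kv_cache = model.build_cache(random_tokens(6))  # computed once per scene
camera_a, camera_b = random_tokens(3), random_tokens(3)
view_a = model.render(camera_a, kv_cache)  # two target views reuse one cache
view_b = model.render(camera_b, kv_cache)
```

In this sketch, rendering a new view never reprocesses the input views, so the per-view cost depends only on the number of target tokens and the size of the cache. The same query weights are applied to input tokens during reconstruction and to camera tokens during rendering, which is one simple way to picture the weight sharing described in the summary.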

Best for: Research Scientist, AI Scientist, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.