
[00:00] Intro to LVSM: Novel transformer for view synthesis
[00:14] Problems with existing 3D synthesis methods
[00:59] LVSM architecture: encoder-decoder vs decoder-only
[01:41] Performance trade-offs between architectures
[02:13] Using Plücker rays for implicit 3D geometry
[02:49] Zero-shot capabilities with varying input views
[03:23] Training stability and technical solutions
[03:59] Training & evaluation datasets
[04:23] Insights from architecture ablation studies
[05:00] Achieving SOTA with limited GPU resources
[05:25] Future work and research directions
[06:05] Parallels with language models
[06:38] Limitations in aspect ratio handling
Authors: Haian Jin, Hanwen Jiang, Hao Tan, Kai Zhang, Sai Bi, Tianyuan Zhang, Fujun Luan, Noah Snavely, Zexiang Xu
Affiliations: Cornell University, The University of Texas at Austin, Adobe Research, Massachusetts Institute of Technology
Abstract: We propose the Large View Synthesis Model (LVSM), a novel transformer-based approach for scalable and generalizable novel view synthesis from sparse-view inputs. We introduce two architectures: (1) an encoder-decoder LVSM, which encodes input image tokens into a fixed number of 1D latent tokens, functioning as a fully learned scene representation, and decodes novel-view images from them; and (2) a decoder-only LVSM, which directly maps input images to novel-view outputs, completely eliminating intermediate scene representations. Both models bypass the 3D inductive biases used in previous methods -- from 3D representations (e.g., NeRF, 3DGS) to network designs (e.g., epipolar projections, plane sweeps) -- addressing novel view synthesis with a fully data-driven approach. While the encoder-decoder model offers faster inference due to its independent latent representation, the decoder-only LVSM achieves superior quality, scalability, and zero-shot generalization, outperforming previous state-of-the-art methods by 1.5 to 3.5 dB PSNR. Comprehensive evaluations across multiple datasets demonstrate that both LVSM variants achieve state-of-the-art novel view synthesis quality. Notably, our models surpass all previous methods even with reduced computational resources (1-2 GPUs). Please see our website for more details.
Link: https://arxiv.org/abs/2410.17242
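
To make the abstract's decoder-only idea concrete, here is a minimal PyTorch sketch, not the authors' released code: each input view's RGB pixels are concatenated with per-pixel Plücker ray embeddings (the pose conditioning discussed at [02:13]) and patchified into tokens, target-view ray tokens are appended, and a plain transformer with full self-attention regresses the target RGB patches. The `plucker_rays` helper, `DecoderOnlyLVSM` class, and all layer sizes are illustrative assumptions.

```python
# Illustrative sketch of a decoder-only LVSM-style model (assumed details, not the paper's code).
import torch
import torch.nn as nn

def plucker_rays(K, c2w, h, w):
    """Per-pixel Plücker ray embedding (direction d, moment o x d) -> (h, w, 6)."""
    device = K.device
    ys, xs = torch.meshgrid(
        torch.arange(h, device=device, dtype=torch.float32) + 0.5,
        torch.arange(w, device=device, dtype=torch.float32) + 0.5,
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)      # homogeneous pixel coords (h, w, 3)
    dirs = (pix @ torch.inverse(K).T) @ c2w[:3, :3].T             # unproject, rotate into world space
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)
    origin = c2w[:3, 3].expand_as(dirs)                           # camera center, broadcast per pixel
    moment = torch.cross(origin, dirs, dim=-1)                    # o x d
    return torch.cat([dirs, moment], dim=-1)                      # (h, w, 6)

class DecoderOnlyLVSM(nn.Module):
    """Full self-attention over input-view tokens (RGB + rays) and target-view tokens (rays only)."""
    def __init__(self, patch=8, dim=768, depth=12, heads=12):
        super().__init__()
        self.patch = patch
        self.in_proj = nn.Linear((3 + 6) * patch * patch, dim)    # RGB + Plücker rays per patch
        self.tgt_proj = nn.Linear(6 * patch * patch, dim)         # target view: rays only
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)         # non-causal attention over all tokens
        self.out_proj = nn.Linear(dim, 3 * patch * patch)         # back to RGB patches

    def patchify(self, x):
        # (B, H, W, C) -> (B, N, p*p*C) non-overlapping patches
        B, H, W, C = x.shape
        p = self.patch
        x = x.view(B, H // p, p, W // p, p, C).permute(0, 1, 3, 2, 4, 5)
        return x.reshape(B, (H // p) * (W // p), p * p * C)

    def forward(self, input_imgs, input_rays, target_rays):
        # input_imgs: (B, V, H, W, 3); input_rays: (B, V, H, W, 6); target_rays: (B, H, W, 6)
        B, V, H, W, _ = input_imgs.shape
        src = torch.cat([input_imgs, input_rays], dim=-1).flatten(0, 1)            # (B*V, H, W, 9)
        src_tok = self.in_proj(self.patchify(src)).view(B, -1, self.in_proj.out_features)
        tgt_tok = self.tgt_proj(self.patchify(target_rays))
        n_tgt = tgt_tok.shape[1]
        tokens = self.blocks(torch.cat([src_tok, tgt_tok], dim=1))                 # joint attention
        patches = torch.sigmoid(self.out_proj(tokens[:, -n_tgt:]))                 # predicted RGB patches
        p = self.patch
        out = patches.view(B, H // p, W // p, p, p, 3).permute(0, 1, 3, 2, 4, 5)
        return out.reshape(B, H, W, 3)
```

Under these assumptions the model never builds an explicit 3D representation (no NeRF, 3DGS, epipolar projection, or plane sweep); the only geometric input is the 6D Plücker parameterization of each pixel's ray, which is what lets the same network be queried for arbitrary target poses and varying numbers of input views.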