3D Reconstruction with Generalizable Neural Fields using Scene Priors

1University of California San Diego 2NVIDIA


High-fidelity 3D scene reconstruction has been substantially advanced by recent progress in neural fields. However, most existing methods require per-scene optimization, training one network from scratch each time. This approach is inefficient, does not scale, and yields poor results when only a limited number of views is available. While learning-based multi-view stereo methods alleviate this issue to some extent, their multi-view setting makes them less flexible to scale up and to apply broadly. Instead, we propose training generalizable Neural Fields incorporating scene Priors (NFPs). The NFP network maps any single-view RGB-D image into signed distance and radiance values. A complete scene can be reconstructed by merging individual frames in the volumetric space without a fusion module, which provides better flexibility. The scene priors can be trained on large-scale datasets, allowing fast adaptation to the reconstruction of a new scene with fewer views. NFP not only demonstrates state-of-the-art scene reconstruction performance and efficiency, but also supports single-image novel-view synthesis, which is under-explored in neural fields.

Given an RGB-D input, we first extract geometric and texture pixel features using two encoders. Then, we construct a continuous surface representation on top of the discrete surface features. Next, we introduce a two-stage paradigm to learn the generalizable geometric and texture priors, optimized via multiple objectives. Finally, the learned priors can be further optimized on a specific scene to obtain a high-fidelity reconstruction.
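The step from discrete per-pixel features to a continuous surface representation can be illustrated with a small sketch. It is not the authors' implementation: it assumes a pinhole camera with intrinsics `K`, lifts each pixel to a 3D surface point using its depth, and then queries a feature at an arbitrary 3D location by inverse-distance weighting over the `k` nearest surface points. The function names, the choice of `k`, and the weighting scheme are all hypothetical.

```python
import numpy as np

def backproject_depth(depth, K):
    """Lift a depth map (H, W) to camera-space points (H*W, 3) with pinhole intrinsics K."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates, each (H, W)
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

def interpolate_features(query, points, feats, k=4, eps=1e-8):
    """Assumed scheme: inverse-distance-weighted mean of the k nearest surface features."""
    dist = np.linalg.norm(points - query[None, :], axis=1)
    idx = np.argsort(dist)[:k]          # indices of the k closest surface points
    weights = 1.0 / (dist[idx] + eps)   # closer points contribute more
    weights = weights / weights.sum()
    return weights @ feats[idx]

# Toy usage: a 2x2 depth map at constant depth 2 and random 8-dim pixel features.
depth = np.full((2, 2), 2.0)
K = np.array([[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [0.0, 0.0, 1.0]])
points = backproject_depth(depth, K)
feats = np.random.default_rng(0).normal(size=(4, 8))
query_feat = interpolate_features(np.array([0.0, 0.0, 1.9]), points, feats)
```

In this sketch the interpolated feature would then be passed, together with the query's offset from the surface, to a decoder that predicts signed distance or radiance; that decoder is omitted here.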

Qualitative Results

Per-scene Reconstruction with Textures

Here we show per-scene reconstruction results on ScanNet. With the learned NFP, we can achieve high-fidelity reconstruction with photo-realistic textures within 20 minutes.

Scroll to zoom in the model viewer. Toggle the “Single Sided” option in the Model Inspector (press the I key) to enable back-face culling and see through walls. Select “Matcap” to inspect the geometry without textures.

Comparison with State-of-the-Art Methods

ManhattanSDF* MonoSDF* Ours (NFP)

Note that ManhattanSDF* and MonoSDF* are trained under the same setting as ours, with ground-truth depth.

Feed-forward Reconstruction

We additionally show feed-forward reconstruction results without any optimization, which further demonstrates the generalizability of our learned priors. Compared with existing works that require time-consuming per-scene optimization, our method achieves comparable results in around 10 seconds. Feed-forward reconstruction results are shown in the left column, and per-scene optimization results are shown in the right column for reference.

Feed-forward reconstruction
Per-scene optimization

Reconstruction on self-captured living room

The video is captured from the living room of Isaac Deutsch@NVIDIA. We would like to thank Isaac Deutsch for sharing this data.

Single-view Novel View Synthesis

Given a single RGB-D image, our approach can also generate nearby views via the learned neural priors. The following results are generated from the input images shown below.

Input Image
Novel View Synthesis


Given an RGB-D image, we propose to decompose the NFPs into a geometric neural prior and a texture neural prior, and to construct the signed distance fields and the radiance fields from a continuous surface representation.
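To make the link between a signed distance field and a radiance field concrete, the sketch below shows one common way to render with both: convert signed distances along a ray into volume density, then into compositing weights for the predicted colors. This page does not state which conversion NFP uses, so the Laplace-CDF density (as in VolSDF-style renderers) and the `beta` parameter here are assumptions for illustration only.

```python
import numpy as np

def sdf_to_density(sdf, beta=0.1):
    """Assumed conversion: density = (1/beta) * Laplace CDF evaluated at -sdf."""
    s = np.asarray(sdf, dtype=float)
    e = 0.5 * np.exp(-np.abs(s) / beta)
    cdf = np.where(s >= 0, e, 1.0 - e)  # small outside the surface, near 1 inside
    return cdf / beta

def render_weights(sdf, deltas, beta=0.1):
    """Standard volume-rendering weights for samples along one ray."""
    sigma = sdf_to_density(sdf, beta)
    alpha = 1.0 - np.exp(-sigma * deltas)                          # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))  # transmittance before each sample
    return trans * alpha

# Toy ray that crosses a surface halfway along its length.
ray_sdf = np.linspace(1.0, -1.0, 64)
deltas = np.full(64, 2.0 / 63.0)
weights = render_weights(ray_sdf, deltas)
colors = np.tile(np.array([0.8, 0.2, 0.2]), (64, 1))  # placeholder radiance values
pixel = weights @ colors                              # composited pixel color
```

The weights concentrate around the zero crossing of the signed distance, which is why supervising rendered color and depth can also shape the geometry.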

With the pretrained NFPs, we can reconstruct a scene in a single feed-forward step without any optimization. To obtain a high-fidelity reconstruction, we further optimize the priors along with the pretrained decoders.
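Because each frame is processed independently, a scene-level result only requires expressing the per-frame surface points in a shared world frame. The sketch below illustrates that merging step under the assumption that each frame comes with a 4x4 camera-to-world pose; the function names and the tuple layout are hypothetical, and no learned fusion is involved.

```python
import numpy as np

def to_world(points_cam, cam_to_world):
    """Transform camera-space points (N, 3) into world space with a 4x4 pose."""
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return points_cam @ R.T + t

def merge_frames(frames):
    """frames: list of (points_cam (N, 3), feats (N, C), cam_to_world (4, 4)) tuples."""
    pts = np.concatenate([to_world(p, T) for p, _, T in frames], axis=0)
    feats = np.concatenate([f for _, f, _ in frames], axis=0)
    return pts, feats

# Toy usage: two frames, the second camera shifted by 1 unit along x.
pose_a = np.eye(4)
pose_b = np.eye(4)
pose_b[0, 3] = 1.0
frame_pts = np.array([[0.0, 0.0, 2.0], [0.5, 0.0, 2.0]])
frame_feats = np.ones((2, 8))
scene_pts, scene_feats = merge_frames([
    (frame_pts, frame_feats, pose_a),
    (frame_pts, frame_feats, pose_b),
])
```

The merged points and features can then be queried exactly as for a single frame, which is what makes the number of input views flexible.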