3D Gaussian Splatting (3DGS) has emerged as an efficient and interpretable representation for 3D reconstruction and rendering, and more recently, for semantic understanding. However, current methods typically rely on per-scene optimization, leading to long processing times and limited generalizability. In this paper, we introduce a feedforward network that processes 3DGS representations to produce semantic features, enabling generalizable semantic understanding across various scenes without per-scene optimization. Our Transformer-based 3D feature extraction network attaches semantic features, such as those from dense CLIP, DINOv2, and SAM, to the 3DGS representations. We distill feature knowledge from large-scale 2D foundation models and align the attached 3DGS features with their 2D counterparts, bridging the gap between 2D and 3D representations. Our approach offers an efficient and scalable solution for semantic understanding in 3DGS-based reconstruction. Furthermore, by connecting the 3DGS representation to pre-trained 2D LMMs, our model exhibits strong 3D reasoning capabilities.
We demonstrate 3D object recognition and open-vocabulary 3D instance segmentation results on the ScanNet dataset.
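To make the distillation step concrete, below is a minimal sketch of aligning features rendered from a 3DGS representation with a 2D foundation-model feature map via a cosine-similarity loss. The function names, tensor shapes, and the specific loss are illustrative assumptions, not the paper's actual implementation; the blending weights and teacher features are stand-ins for the rasterizer output and a dense CLIP/DINOv2 feature map.

```python
import torch
import torch.nn.functional as F

def render_gaussian_features(gaussian_feats, blend_weights):
    """Splat per-Gaussian semantic features into a 2D feature map.

    gaussian_feats: (N, C) feature per Gaussian (hypothetical network output).
    blend_weights:  (H, W, N) alpha-blending weights, assumed precomputed by
                    the 3DGS rasterizer for the training view.
    Returns an (H, W, C) rendered feature map.
    """
    return torch.einsum("hwn,nc->hwc", blend_weights, gaussian_feats)

def distillation_loss(rendered_feats, teacher_feats):
    """Cosine-similarity alignment between rendered 3DGS features and a 2D
    foundation-model feature map (e.g. dense CLIP or DINOv2).

    Both tensors: (H, W, C). This is one common choice for feature
    distillation; the paper's exact objective may differ.
    """
    rendered = F.normalize(rendered_feats, dim=-1)
    teacher = F.normalize(teacher_feats, dim=-1)
    return (1.0 - (rendered * teacher).sum(dim=-1)).mean()

# Toy example with random tensors standing in for real data.
N, C, H, W = 1024, 64, 32, 32
gaussian_feats = torch.randn(N, C, requires_grad=True)
blend_weights = torch.rand(H, W, N).softmax(dim=-1)   # placeholder weights
teacher_feats = torch.randn(H, W, C)                  # placeholder teacher map

loss = distillation_loss(
    render_gaussian_features(gaussian_feats, blend_weights), teacher_feats
)
loss.backward()
print(loss.item())
```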