3D Gaussian Splatting (3DGS) has emerged as an efficient and interpretable representation for 3D reconstruction and rendering, and more recently, for semantic understanding. However, current methods typically rely on per-scene optimization, leading to long processing times and limited generalizability. In this paper, we introduce a feedforward network that processes 3DGS representations to produce semantic features, enabling generalizable semantic understanding across various scenes without per-scene optimization. Our Transformer-based 3D feature extraction network attaches semantic features, such as those from dense CLIP, DINOv2, and SAM, to the 3DGS representations. We distill feature knowledge from large-scale 2D foundation models and align the attached 3DGS features with their 2D counterparts, bridging the gap between 2D and 3D representations. Our approach offers an efficient and scalable solution for semantic understanding in 3DGS-based reconstruction. Furthermore, by connecting the 3DGS representation to pre-trained 2D LMMs, our model exhibits strong 3D reasoning capabilities.
We demonstrate 3D object recognition and open-vocabulary 3D instance segmentation results on the ScanNet dataset.
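To make the distillation step concrete, below is a minimal sketch of aligning features rendered from a 3DGS representation with a 2D foundation-model feature map via a cosine-similarity loss. The function names, tensor shapes, and the specific loss are illustrative assumptions, not the paper's actual implementation; the blending weights and teacher features are stand-ins for the rasterizer output and a dense CLIP/DINOv2 feature map.

```python
import torch
import torch.nn.functional as F

def render_gaussian_features(gaussian_feats, blend_weights):
    """Splat per-Gaussian semantic features into a 2D feature map.

    gaussian_feats: (N, C) feature per Gaussian (hypothetical network output).
    blend_weights:  (H, W, N) alpha-blending weights, assumed precomputed by
                    the 3DGS rasterizer for the training view.
    Returns an (H, W, C) rendered feature map.
    """
    return torch.einsum("hwn,nc->hwc", blend_weights, gaussian_feats)

def distillation_loss(rendered_feats, teacher_feats):
    """Cosine-similarity alignment between rendered 3DGS features and a 2D
    foundation-model feature map (e.g. dense CLIP or DINOv2).

    Both tensors: (H, W, C). This is one common choice for feature
    distillation; the paper's exact objective may differ.
    """
    rendered = F.normalize(rendered_feats, dim=-1)
    teacher = F.normalize(teacher_feats, dim=-1)
    return (1.0 - (rendered * teacher).sum(dim=-1)).mean()

# Toy example with random tensors standing in for real data.
N, C, H, W = 1024, 64, 32, 32
gaussian_feats = torch.randn(N, C, requires_grad=True)
blend_weights = torch.rand(H, W, N).softmax(dim=-1)   # placeholder weights
teacher_feats = torch.randn(H, W, C)                  # placeholder teacher map

loss = distillation_loss(
    render_gaussian_features(gaussian_feats, blend_weights), teacher_feats
)
loss.backward()
print(loss.item())
```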