SOLE 🐾

Segment Any 3D Object with Language

arXiv 2024

1Korea University 2National University of Singapore
*Equal contribution

Where can I wash my hands?

Device to play games.

Place I can pee.


SOLE is highly generalizable and can segment corresponding instances with various language instructions.

Abstract

In this paper, we investigate Open-Vocabulary 3D Instance Segmentation (OV-3DIS) with free-form language instructions. Earlier works that rely only on annotated base categories for training suffer from limited generalization to unseen novel categories. Recent works mitigate this poor generalizability by generating class-agnostic masks or by projecting generalized masks from 2D to 3D, but they disregard semantic or geometric information, leading to sub-optimal performance. Instead, generating generalizable yet semantic-related masks directly from 3D point clouds would result in superior outcomes. To this end, we introduce Segment any 3D Object with LanguagE (SOLE), a semantic- and geometric-aware visual-language learning framework with strong generalizability that generates semantic-related masks directly from 3D point clouds. Specifically, we propose a multimodal fusion network to incorporate multimodal semantics in both the backbone and the decoder. In addition, to align the 3D segmentation model with various language instructions and to enhance mask quality, we introduce three types of multimodal associations as supervision. Our SOLE outperforms previous methods by a large margin on the ScanNetv2, ScanNet200, and Replica benchmarks, and the results are even close to those of the fully-supervised counterpart despite the absence of class annotations during training. Furthermore, extensive qualitative results demonstrate the versatility of SOLE with respect to language instructions.

Overall Framework


Overall framework of SOLE. SOLE is built on a transformer-based instance segmentation model with multimodal adaptations. In the model architecture, backbone features are integrated with per-point CLIP features and subsequently fed into the cross-modality decoder (CMD). CMD aggregates the point-wise features and textual features into the instance queries, which finally segment the instances and are supervised by the multimodal associations. During inference, the predicted mask features are combined with the per-point CLIP features, enhancing open-vocabulary performance.
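To make the fusion and inference-time combination above concrete, here is a minimal, illustrative PyTorch-style sketch. It is not the official implementation: the fuse and decoder modules, tensor shapes, and all variable names are assumptions chosen for clarity.

import torch
import torch.nn.functional as F

def segment_with_language(backbone_feats,    # (N, D) 3D backbone point features
                          clip_point_feats,  # (N, C) per-point CLIP features
                          text_emb,          # (Q, C) CLIP text embedding(s) of the query
                          fuse, decoder):    # assumed fusion MLP and cross-modality decoder
    # 1) Fuse geometric backbone features with per-point CLIP semantics.
    fused = fuse(torch.cat([backbone_feats, clip_point_feats], dim=-1))   # (N, C)

    # 2) The cross-modality decoder aggregates point-wise and textual features
    #    into instance queries, predicting masks and per-instance mask features.
    masks, mask_feats = decoder(fused, text_emb)                          # (K, N), (K, C)

    # 3) Combine predicted mask features with mask-pooled per-point CLIP
    #    features to strengthen open-vocabulary recognition at inference.
    soft = masks.sigmoid()                                                # (K, N)
    pooled_clip = soft @ clip_point_feats / soft.sum(-1, keepdim=True)    # (K, C)
    ensemble = F.normalize(mask_feats + pooled_clip, dim=-1)              # (K, C)

    # 4) Score each instance against the language query by cosine similarity.
    scores = ensemble @ F.normalize(text_emb, dim=-1).T                   # (K, Q)
    return masks, scores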


Three types of multimodal associations. For each ground-truth instance mask, we first pool the per-point CLIP features to obtain the Mask-Visual Association $\mathbf{f}^{\mathrm{MVA}}$. Subsequently, $\mathbf{f}^{\mathrm{MVA}}$ is fed into a CLIP-space captioning model to generate a caption and the corresponding textual feature $\mathbf{f}^{\mathrm{MCA}}$ for each mask, termed the Mask-Caption Association. Finally, noun phrases are extracted from the mask caption and their embeddings are aggregated via multimodal attention to obtain the Mask-Entity Association $\mathbf{f}^{\mathrm{MEA}}$. The three multimodal associations supervise SOLE to acquire the ability to segment 3D objects with free-form language instructions.
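As a rough illustration of how the three associations could be computed for a single ground-truth mask, the sketch below pools per-point CLIP features (MVA), captions the pooled feature in CLIP space (MCA), and aggregates noun-phrase embeddings with similarity-based attention (MEA). The captioner, text encoder, noun-phrase extractor, and the exact attention scheme are assumptions, not the paper's released code.

import torch
import torch.nn.functional as F

def build_associations(clip_point_feats,      # (N, C) per-point CLIP features
                       mask,                  # (N,)   boolean ground-truth instance mask
                       caption_model,         # CLIP-space captioner (assumed)
                       clip_text_encoder,     # CLIP text encoder (assumed)
                       extract_noun_phrases): # noun-phrase extractor (assumed)
    # Mask-Visual Association: average-pool CLIP features inside the mask.
    f_mva = F.normalize(clip_point_feats[mask].mean(dim=0), dim=-1)       # (C,)

    # Mask-Caption Association: caption the pooled visual feature in CLIP space
    # and encode the caption back into a textual feature.
    caption = caption_model(f_mva)                                        # str
    f_mca = F.normalize(clip_text_encoder([caption])[0], dim=-1)          # (C,)

    # Mask-Entity Association: embed the caption's noun phrases and aggregate
    # them with attention weights given by similarity to the visual feature.
    phrases = extract_noun_phrases(caption)                               # list[str]
    phrase_emb = F.normalize(clip_text_encoder(phrases), dim=-1)          # (P, C)
    attn = torch.softmax(phrase_emb @ f_mva, dim=0)                       # (P,)
    f_mea = F.normalize(attn @ phrase_emb, dim=-1)                        # (C,)

    return f_mva, f_mca, f_mea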

Quantitative Results


Comparison in the closed-set 3D instance segmentation setting on ScanNetv2. SOLE is compared with class-split methods, mask-training methods, and the fully-supervised counterpart (upper bound). SOLE outperforms all the OV-3DIS methods and achieves competitive results with the fully-supervised model.


Comparison in the closed-set 3D instance segmentation setting on ScanNet200. SOLE is compared with OpenMask3D on the overall segmentation performance and on each subset. SOLE significantly outperforms OpenMask3D on five out of the six evaluation metrics.


Comparison in the hierarchical open-set 3D instance segmentation setting on ScanNetv2→ScanNet200. SOLE is compared with OpenMask3D on both base and novel classes and achieves the best results.


Comparison in the open-set 3D instance segmentation setting on ScanNet200→Replica. SOLE outperforms OpenMask3D on all evaluation metrics.

Qualitative Results


Our SOLE demonstrates open-vocabulary capability by effectively responding to free-form language queries, including visual questions, attribute descriptions, and functional descriptions.

I want to watch a movie.

I wanna see outside.

Chairs near the window.

Brown furniture.

Throwing away the garbage.

I'm hungry.

BibTeX

@article{lee2024segment,
  title   = {Segment Any 3D Object with Language},
  author  = {Lee, Seungjun and Zhao, Yuyang and Lee, Gim Hee},
  journal = {arXiv preprint arXiv:2404.02157},
  year    = {2024},
}