Uni3DL: A unified model for 3D vision-language understanding

Li, Xiang; Ding, Jian; Chen, Zhaoyang; Elhoseiny, Mohamed

Li, X. ORCID: https://orcid.org/0000-0002-9946-7000, Ding, J., Chen, Z. and Elhoseiny, M. (2024) Uni3DL: A unified model for 3D vision-language understanding. In: ECCV 2024, 29 Sep - 4 Oct 2024, Milan, Italy, pp. 74-92, 10.1007/978-3-031-73337-6_5.

Text - Accepted Version (9MB). Restricted to Repository staff only until 31 October 2025.

It is advisable to refer to the publisher's version if you intend to cite from this work.

To link to this item DOI: 10.1007/978-3-031-73337-6_5

Abstract/Summary

We present Uni3DL, a unified model for 3D Vision-Language understanding. Distinct from existing unified 3D vision-language models that mostly rely on projected multi-view images and support limited tasks, Uni3DL operates directly on point clouds and significantly broadens the spectrum of tasks in the 3D domain, encompassing both vision and vision-language tasks. At the core of Uni3DL, a query transformer is designed to learn task-agnostic semantic and mask outputs by attending to 3D visual features, and a task router is employed to selectively produce task-specific outputs required for diverse tasks. With a unified architecture, our Uni3DL model enjoys seamless task decomposition and substantial parameter sharing across tasks. Uni3DL has been rigorously evaluated across diverse 3D vision-language understanding tasks, including semantic segmentation, object detection, instance segmentation, visual grounding, 3D captioning, and text-3D cross-modal retrieval. It demonstrates performance on par with or surpassing state-of-the-art (SOTA) task-specific models. We hope our benchmark and Uni3DL model will serve as a solid step to ease future research in unified models in the realm of 3D vision-language understanding. Project page: https://uni3dl.github.io/.
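The query-then-route idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the single-head attention without learned projections, the output names `semantic` and `mask`, and the `ROUTES` table are all assumptions made for illustration. It only shows how a set of queries might attend to per-point features once, and how a router might then pick the subset of shared outputs a given task needs.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, point_feats):
    """Each query becomes an attention-weighted average of the point features."""
    dim = len(point_feats[0])
    scale = math.sqrt(dim)
    attended = []
    for q in queries:
        weights = softmax([dot(q, p) / scale for p in point_feats])
        attended.append([
            sum(w * p[d] for w, p in zip(weights, point_feats))
            for d in range(dim)
        ])
    return attended

def shared_outputs(queries, point_feats):
    """Task-agnostic outputs: per-query embeddings and per-query mask logits."""
    attended = cross_attend(queries, point_feats)
    return {
        "semantic": attended,
        "mask": [[dot(q, p) for p in point_feats] for q in attended],
    }

# Hypothetical routing table: which shared outputs each task consumes.
ROUTES = {
    "instance_segmentation": ("semantic", "mask"),
    "visual_grounding": ("semantic", "mask"),
    "text_3d_retrieval": ("semantic",),
}

def route(task, outputs):
    """Return only the outputs the given task needs."""
    return {name: outputs[name] for name in ROUTES[task]}

points = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy per-point features
queries = [[1.0, 0.0], [0.0, 1.0]]              # toy query vectors
outputs = shared_outputs(queries, points)
print(sorted(route("text_3d_retrieval", outputs)))  # prints ['semantic']
```

In this sketch the attention step runs once and every task reads from the same `outputs` dictionary, which is one simple way to picture the parameter sharing across tasks that the abstract describes.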

Item Type:	Conference or Workshop Item (Paper)
Refereed:	Yes
Divisions:	No Reading authors. Back catalogue items; Science > School of Mathematical, Physical and Computational Sciences > Department of Computer Science
ID Code:	119818
Publisher:	Springer Nature Switzerland

University of Reading

CentAUR: Central Archive at the University of Reading
