Frontier research and latest trends in 3D visual-language reasoning techniques
Comprehensive survey on 3D visual-language understanding techniques
2024, Vol. 29, No. 6, Pages: 1747-1764
Print publication date: 2024-06-16
DOI: 10.11834/jig.240029
Lei Yinjie, Xu Kai, Guo Yulan, Yang Xin, Wu Yuwei, Hu Wei, Yang Jiaqi, Wang Hanyun. 2024. Comprehensive survey on 3D visual-language understanding techniques. Journal of Image and Graphics, 29(06):1747-1764
The core idea of 3D visual reasoning is to understand the relationships among visual entities in point cloud scenes. Non-professional users find it difficult to convey their intentions to a computer, which limits the popularization and promotion of this technology. To address this, researchers use natural language as the semantic background and query condition that reflects user intentions, and then interact this language with point cloud information to accomplish the corresponding tasks. This paradigm, known as 3D visual-language reasoning, is widely applied in fields such as autonomous driving, robot navigation, and human-computer interaction, and has become a prominent research direction in computer vision. Over the past few years, 3D visual-language reasoning technology has developed rapidly and flourished, yet a comprehensive summary of the latest research progress is still lacking. This paper focuses on the two most representative lines of work, anchor box prediction and content generation, and systematically summarizes the latest research advances in the field. First, the paper summarizes the problem definition and existing challenges of 3D visual-language reasoning and outlines some common backbone networks. Second, the paper further subdivides the two types of 3D visual-language reasoning techniques according to the downstream scenarios they target and discusses the advantages and disadvantages of each method in depth. Next, the paper compares and analyzes the performance of various methods on different benchmark datasets. Finally, the paper looks ahead to the future development of 3D visual-language reasoning technology, with the aim of promoting in-depth research and wide application in this field.
The core of 3D visual reasoning is to understand the relationships among different visual entities in point cloud scenes. Traditional 3D visual reasoning typically requires users to possess professional expertise, and nonprofessional users have difficulty conveying their intentions to computers, which hinders the popularization and advancement of this technology. Users now expect a more convenient way to convey their intentions to the computer, exchange information, and obtain personalized results. To address this issue, researchers use natural language as a semantic background or query condition that reflects user intentions and accomplish various tasks by interacting this natural language with 3D point clouds. Through multimodal interaction, often built on Transformer or graph neural network architectures, current approaches can not only locate the entities mentioned by users (e.g., visual grounding and open-vocabulary recognition) but also generate user-required content (e.g., dense captioning, visual question answering, and scene generation). Specifically, 3D visual grounding aims to locate the desired objects or regions in a 3D point cloud scene based on an object-related linguistic query. Open-vocabulary 3D recognition aims to identify and localize, at inference time, 3D objects of novel classes defined by an unbounded (open) vocabulary, generalizing beyond the limited set of base classes labeled during training. 3D dense captioning aims to identify all possible instances within a 3D point cloud scene and generate a corresponding natural language description for each instance. The goal of 3D visual question answering is to comprehend an entire 3D scene and provide an appropriate answer to a given question. Text-guided scene generation synthesizes a realistic 3D scene composed of a complex background and multiple objects from natural language descriptions. In recent years, this paradigm, known as 3D visual-language understanding, has gained significant traction in fields such as autonomous driving, robot navigation, and human-computer interaction, and it has become a highly anticipated research direction within the computer vision domain. Over the past three years, 3D visual-language understanding technology has developed rapidly and flourished, yet a comprehensive summary of the latest research progress is still lacking. Systematically summarizing recent studies, comprehensively evaluating the performance of different approaches, and prospectively pointing out future research directions are therefore necessary, and this survey aims to fill that gap. To this end, this study focuses on the two most representative families of 3D visual-language understanding techniques, namely anchor box prediction and content generation, and systematically summarizes their latest research advances. First, the study provides an overview of the problem definition and existing challenges in 3D visual-language understanding and outlines the common backbones used in this area. The challenges include 3D-language alignment and complex scene understanding, while the common backbones involve a priori rules, multilayer perceptrons, graph neural networks, and Transformer architectures.
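To make the anchor box prediction pipeline described above more concrete, the following minimal PyTorch sketch shows how candidate object features from a point cloud detector can be fused with query word features through Transformer cross-attention and scored against the language query. The ToyGroundingHead module, its dimensions, and the random inputs are illustrative assumptions by the editor and do not correspond to any specific method covered by this survey.

```python
# A minimal sketch (not any surveyed method) of the common "anchor box prediction"
# pipeline: candidate object features extracted from a point cloud are fused with
# language features via Transformer cross-attention, and each candidate is scored
# against the query. All module names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class ToyGroundingHead(nn.Module):
    def __init__(self, d_model: int = 256, num_heads: int = 8, num_layers: int = 2):
        super().__init__()
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=d_model, nhead=num_heads, batch_first=True
        )
        # Object features attend to word features (cross-attention).
        self.fusion = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)
        # One matching score per candidate object.
        self.score_head = nn.Linear(d_model, 1)

    def forward(self, obj_feats: torch.Tensor, word_feats: torch.Tensor) -> torch.Tensor:
        # obj_feats:  (B, num_objects, d_model), e.g., from a PointNet++/VoteNet-style detector
        # word_feats: (B, num_words, d_model), e.g., from a BERT-style text encoder
        fused = self.fusion(tgt=obj_feats, memory=word_feats)
        return self.score_head(fused).squeeze(-1)  # (B, num_objects) matching logits


if __name__ == "__main__":
    head = ToyGroundingHead()
    obj_feats = torch.randn(2, 32, 256)   # 32 candidate boxes per scene
    word_feats = torch.randn(2, 20, 256)  # 20 query tokens
    logits = head(obj_feats, word_feats)
    target_box = logits.argmax(dim=-1)    # index of the box best matching the query
    print(logits.shape, target_box)
```

In practice, the surveyed grounding methods differ mainly in how the candidate objects are proposed and how the cross-modal fusion is structured, but they share this match-then-select pattern.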
Subsequently, the study categorizes methods according to the downstream scenarios they target, covering the two types of 3D visual-language understanding techniques, bounding box prediction and content generation, and thoroughly explores the advantages and disadvantages of each method. Furthermore, the study compares and analyzes the performance of various methods on different benchmark datasets. Finally, the study looks ahead to the future prospects of 3D visual-language understanding technology, with the aim of promoting in-depth research and widespread application in this field. The major contributions of this study can be summarized as follows: 1) Systematic survey of 3D visual-language understanding. To the best of our knowledge, this survey is the first to thoroughly discuss the recent advances in 3D visual-language understanding. We categorize algorithms into different taxonomies from the perspective of downstream scenarios to give readers a clear picture of the field. 2) Comprehensive performance evaluation and analysis. We compare existing 3D visual-language understanding approaches on several publicly available datasets. Our in-depth analysis can help researchers select baselines suitable for their specific applications and offers valuable insights into how existing methods can be improved. 3) Insightful discussion of future prospects. Based on the systematic survey and comprehensive performance comparison, some promising future research directions are discussed, including large-scale 3D foundation models, the computational efficiency of 3D modeling, and the incorporation of additional modalities.
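For the content generation branch, a similarly simplified sketch is given below: the fused feature of one detected object conditions a lightweight recurrent language decoder that greedily emits a description token by token, as in 3D dense captioning. The ToyCaptionDecoder module, the vocabulary size, and the special token ids are hypothetical placeholders chosen by the editor, not the design of any surveyed method.

```python
# A minimal sketch of the "content generation" branch: a located object's fused
# feature conditions a lightweight language decoder that emits a description
# token by token (greedy decoding). Vocabulary size, feature dimension, and the
# GRU decoder are illustrative assumptions only.
import torch
import torch.nn as nn


class ToyCaptionDecoder(nn.Module):
    def __init__(self, vocab_size: int = 1000, d_model: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.rnn = nn.GRUCell(d_model, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    @torch.no_grad()
    def generate(self, obj_feat: torch.Tensor, bos_id: int = 1, eos_id: int = 2,
                 max_len: int = 20) -> list[int]:
        # obj_feat: (d_model,) fused visual-language feature of one detected object
        h = obj_feat.unsqueeze(0)             # object feature initializes the decoder state
        token = torch.tensor([bos_id])
        caption = []
        for _ in range(max_len):
            h = self.rnn(self.embed(token), h)
            token = self.out(h).argmax(dim=-1)    # greedy choice of the next word
            if token.item() == eos_id:
                break
            caption.append(token.item())
        return caption


if __name__ == "__main__":
    decoder = ToyCaptionDecoder()
    caption_ids = decoder.generate(torch.randn(256))
    print(caption_ids)  # word indices; a real system maps these back to text
```

Real systems typically use stronger Transformer-based decoders and beam search, but the condition-then-decode structure illustrated here is the same.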
deep learning; computer vision; 3D visual-language understanding; cross-modal learning; visual grounding; dense captioning; visual question answering; scene generation