Awesome-3D-Visual-Grounding

A continual collection of papers related to Text-guided 3D Visual Grounding (T-3DVG).

Text-guided 3D visual grounding (T-3DVG) aims to locate the specific object in a complicated 3D scene that semantically corresponds to a language query, and it has drawn increasing attention in the 3D research community over the past few years. T-3DVG presents great potential as well as great challenges, owing to its close proximity to real-world applications and the complexity of collecting and processing 3D point cloud data. A minimal sketch of the task interface is given below.
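
The sketch below makes the task interface concrete: a point-cloud scene paired with a free-form language query as input, and a 3D bounding box as output. It is purely illustrative; the Scene container, the ground function, and the placeholder prediction are assumptions made for exposition and do not correspond to any method listed in this collection.

# A minimal, illustrative sketch of the T-3DVG task interface (hypothetical).
from dataclasses import dataclass

import numpy as np

@dataclass
class Scene:
    points: np.ndarray  # (N, 6) point cloud: xyz coordinates + rgb colors
    query: str          # free-form language description of the target object

def ground(scene: Scene) -> np.ndarray:
    """Return a 3D box (cx, cy, cz, dx, dy, dz) for the object the query refers to.

    A real T-3DVG model encodes the point cloud and the query, fuses the two
    modalities, and then selects (two-stage) or directly predicts (one-stage)
    the box; this placeholder simply returns a box enclosing the whole scene.
    """
    xyz = scene.points[:, :3]
    center = xyz.mean(axis=0)
    size = xyz.max(axis=0) - xyz.min(axis=0)
    return np.concatenate([center, size])

if __name__ == "__main__":
    scene = Scene(points=np.random.rand(1024, 6),
                  query="the brown chair next to the window")
    print(ground(scene))  # 6-dim output: box center (x, y, z) + box size (dx, dy, dz)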

For the T-3DVG community, we have summarized existing T-3DVG methods in our survey paper 👍.

A Survey on Text-guided 3D Visual Grounding: Elements, Recent Advances, and Future Directions.

If you find that an important work is missing, it would be super helpful to let me know ([email protected]). Thanks!

If you find our survey useful for your research, please consider citing:

@article{liu2024survey,
  title={A Survey on Text-guided 3D Visual Grounding: Elements, Recent Advances, and Future Directions},
  author={Liu, Daizong and Liu, Yang and Huang, Wencan and Hu, Wei},
  journal={arXiv preprint arXiv:2406.05785},
  year={2024}
}

Table of Contents

  • Fully-Supervised-Two-Stage
  • Fully-Supervised-One-Stage
  • Weakly-supervised
  • Other-Modality
  • LLMs-based
  • Outdoor-Scenes

Fully-Supervised-Two-Stage

Fully-Supervised-One-Stage

  • 3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds | Github
  • 3D-SPS: Single-Stage 3D Visual Grounding via Referred Point Progressive Selection | Github
  • Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds | Github
    • Ayush Jain, Nikolaos Gkanatsios, Ishita Mediratta, Katerina Fragkiadaki
    • Carnegie Mellon University, Meta AI
    • [ECCV2022] https://arxiv.org/abs/2112.08879
    • One-stage approach, unified detection-interaction
  • EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding | Github
    • Yanmin Wu, Xinhua Cheng, Renrui Zhang, Zesen Cheng, Jian Zhang
    • Peking University, The Chinese University of Hong Kong, Peng Cheng Laboratory, Shanghai AI Laboratory
    • [CVPR2023] https://arxiv.org/abs/2209.14941
    • One-stage approach, unified detection-interaction, text-decoupling, dense
  • Dense Object Grounding in 3D Scenes |
    • Wencan Huang, Daizong Liu, Wei Hu
    • Peking University
    • [ACMMM2023] https://arxiv.org/abs/2309.02224
    • One-stage approach, unified detection-interaction, transformer
  • 3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding |
    • Zehan Wang, Haifeng Huang, Yang Zhao, Linjun Li, Xize Cheng, Yichen Zhu, Aoxiong Yin, Zhou Zhao
    • Zhejiang University, ByteDance
    • [EMNLP2023] https://aclanthology.org/2023.emnlp-main.656/
    • One-stage approach, unified detection-interaction, relative position
  • LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | Github
    • Zhenfei Yin, Jiong Wang, Jianjian Cao, Zhelun Shi, Dingning Liu, Mukai Li, Lu Sheng, Lei Bai, Xiaoshui Huang, Zhiyong Wang, Jing Shao, Wanli Ouyang
    • Shanghai AI Lab, Beihang University, The Chinese University of Hong Kong (Shenzhen), Fudan University, Dalian University of Technology, The University of Sydney
    • [NeurIPS2023] https://arxiv.org/abs/2306.06687
    • A dataset, One-stage approach, regression-based, multi-task
  • PATRON: Perspective-Aware Multitask Model for Referring Expression Grounding Using Embodied Multimodal Cues |
  • Toward Fine-Grained 3D Visual Grounding through Referring Textual Phrases | Github
    • Zhihao Yuan, Xu Yan, Zhuo Li, Xuhao Li, Yao Guo, Shuguang Cui, Zhen Li
    • CUHK-Shenzhen, Shanghai Jiao Tong University
    • [Arxiv2023] https://arxiv.org/abs/2207.01821
    • A dataset, One-stage approach, unified detection-interaction
  • A Unified Framework for 3D Point Cloud Visual Grounding | Github
    • Haojia Lin, Yongdong Luo, Xiawu Zheng, Lijiang Li, Fei Chao, Taisong Jin, Donghao Luo, Yan Wang, Liujuan Cao, Rongrong Ji
    • Xiamen University, Peng Cheng Laboratory
    • [Arxiv2023] https://arxiv.org/abs/2308.11887
    • One-stage approach, unified detection-interaction, superpoint
  • Uni3DL: Unified Model for 3D and Language Understanding |
    • Xiang Li, Jian Ding, Zhaoyang Chen, Mohamed Elhoseiny
    • King Abdullah University of Science and Technology, Ecole Polytechnique
    • [Arxiv2023] https://arxiv.org/abs/2312.03026
    • One-stage approach, regression-based, multi-task
  • 3D-STMN: Dependency-Driven Superpoint-Text Matching Network for End-to-End 3D Referring Expression Segmentation | Github
    • Changli Wu, Yiwei Ma, Qi Chen, Haowei Wang, Gen Luo, Jiayi Ji, Xiaoshuai Sun
    • Xiamen University
    • [AAAI2024] https://arxiv.org/abs/2308.16632
    • One-stage approach, unified detection-interaction, superpoint
  • Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding |
    • Taolin Zhang, Sunan He, Tao Dai, Zhi Wang, Bin Chen, Shu-Tao Xia
    • Tsinghua University, Hong Kong University of Science and Technology, Shenzhen University, Harbin Institute of Technology (Shenzhen), Peng Cheng Laboratory
    • [AAAI2024] https://arxiv.org/abs/2305.10714
    • One-stage approach, regression-based, pre-training
  • Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding | Github
    • Zhihao Yuan, Jinke Ren, Chun-Mei Feng, Hengshuang Zhao, Shuguang Cui, Zhen Li
    • The Chinese University of Hong Kong (Shenzhen), A*STAR, The University of Hong Kong
    • [CVPR2024] https://arxiv.org/abs/2311.15383
    • One-stage approach, zero-shot, data construction
  • G3-LQ: Marrying Hyperbolic Alignment with Explicit Semantic-Geometric Modeling for 3D Visual Grounding |
  • PointCloud-Text Matching: Benchmark Datasets and a Baseline |
    • Yanglin Feng, Yang Qin, Dezhong Peng, Hongyuan Zhu, Xi Peng, Peng Hu
    • Sichuan University, A*STAR
    • [Arxiv2024] https://arxiv.org/abs/2403.19386
    • A dataset, One-stage approach, regression-based, pre-training

Weakly-supervised

Other-Modality

  • Refer-it-in-RGBD: A Bottom-up Approach for 3D Visual Grounding in RGBD Images | Github
    • Haolin Liu, Anran Lin, Xiaoguang Han, Lei Yang, Yizhou Yu, Shuguang Cui
    • CUHK-Shenzhen, Deepwise AI Lab, The University of Hong Kong
    • [CVPR2021] https://arxiv.org/pdf/2103.07894
    • No point cloud input, RGB-D image
  • PATRON: Perspective-Aware Multitask Model for Referring Expression Grounding Using Embodied Multimodal Cues |
  • Mono3DVG: 3D Visual Grounding in Monocular Images | Github
    • Yang Zhan, Yuan Yuan, Zhitong Xiong
    • Northwestern Polytechnical University, Technical University of Munich
    • [AAAI2024] https://arxiv.org/pdf/2312.08022
    • No point cloud input, monocular image
  • EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI | Github
    • Tai Wang, Xiaohan Mao, Chenming Zhu, Runsen Xu, Ruiyuan Lyu, Peisen Li, Xiao Chen, Wenwei Zhang, Kai Chen, Tianfan Xue, Xihui Liu, Cewu Lu, Dahua Lin, Jiangmiao Pang
    • Shanghai AI Laboratory, Shanghai Jiao Tong University, The University of Hong Kong, The Chinese University of Hong Kong, Tsinghua University
    • [CVPR2024] https://arxiv.org/abs/2312.16170
    • A dataset, No point cloud input, RGB-D image
  • WildRefer: 3D Object Localization in Large-scale Dynamic Scenes with Multi-modal Visual Data and Natural Language |
    • Zhenxiang Lin, Xidong Peng, Peishan Cong, Yuenan Hou, Xinge Zhu, Sibei Yang, Yuexin Ma
    • ShanghaiTech University, Shanghai AI Laboratory, The Chinese University of Hong Kong
    • [Arxiv2023] https://arxiv.org/abs/2304.05645
    • No point cloud input, wild point cloud, additional multi-modal input

LLMs-based

  • ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with GPT and Prototype Guidance | Github
    • Zoey Guo, Yiwen Tang, Ray Zhang, Dong Wang, Zhigang Wang, Bin Zhao, Xuelong Li
    • Shanghai Artificial Intelligence Laboratory, The Chinese University of Hong Kong, Northwestern Polytechnical University
    • [ICCV2023] https://arxiv.org/pdf/2303.16894
    • LLMs-based, enriching text description
  • LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark | Github
    • Zhenfei Yin, Jiong Wang, Jianjian Cao, Zhelun Shi, Dingning Liu, Mukai Li, Lu Sheng, Lei Bai, Xiaoshui Huang, Zhiyong Wang, Jing Shao, Wanli Ouyang
    • Shanghai AI Lab, Beihang University, The Chinese University of Hong Kong (Shenzhen), Fudan University, Dalian University of Technology, The University of Sydney
    • [NeurIPS2023] https://arxiv.org/abs/2306.06687
    • LLMs-based, LLM architecture
  • Transcribe3D: Grounding LLMs Using Transcribed Information for 3D Referential Reasoning with Self-Corrected Finetuning |
    • Jiading Fang, Xiangshan Tan, Shengjie Lin, Hongyuan Mei, Matthew R. Walter
    • Toyota Technological Institute at Chicago
    • [CoRL2023] https://openreview.net/forum?id=7j3sdUZMTF
    • LLMs-based, enriching text description
  • LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent |
    • Jianing Yang, Xuweiyi Chen, Shengyi Qian, Nikhil Madaan, Madhavan Iyengar, David F. Fouhey, Joyce Chai
    • University of Michigan, New York University
    • [Arxiv2023] https://arxiv.org/abs/2309.12311
    • LLMs-based, enriching text description
  • Mono3DVG: 3D Visual Grounding in Monocular Images | Github
    • Yang Zhan, Yuan Yuan, Zhitong Xiong
    • Northwestern Polytechnical University, Technical University of Munich
    • [AAAI2024] https://arxiv.org/pdf/2312.08022
    • LLMs-based, enriching text description
  • COT3DREF: Chain-of-Thoughts Data-Efficient 3D Visual Grounding | Github
    • Eslam Mohamed Bakr, Mohamed Ayman, Mahmoud Ahmed, Habib Slim, Mohamed Elhoseiny
    • King Abdullah University of Science and Technology
    • [ICLR2024] https://arxiv.org/abs/2310.06214
    • LLMs-based, Chain-of-Thoughts, reasoning
  • Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding | Github
    • Zhihao Yuan, Jinke Ren, Chun-Mei Feng, Hengshuang Zhao, Shuguang Cui, Zhen Li
    • The Chinese University of Hong Kong (Shenzhen), A*STAR, The University of Hong Kong
    • [CVPR2024] https://arxiv.org/abs/2311.15383
    • LLMs-based, construct text description
  • Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners | Github
  • 3DMIT: 3D Multi-modal Instruction Tuning for Scene Understanding | Github
    • Zeju Li, Chao Zhang, Xiaoyan Wang, Ruilong Ren, Yifan Xu, Ruifei Ma, Xiangde Liu
    • Beijing University of Posts and Telecommunications, Beijing Digital Native Digital City Research Center, Peking University, Beihang University, Beijing University of Science and Technology
    • [Arxiv2024] https://arxiv.org/abs/2401.03201
    • LLMs-based, LLM architecture
  • DOrA: 3D Visual Grounding with Order-Aware Referring |
    • Tung-Yu Wu, Sheng-Yu Huang, Yu-Chiang Frank Wang
    • National Taiwan University, NVIDIA
    • [Arxiv2024] https://arxiv.org/abs/2403.16539
    • LLMs-based, Chain-of-Thoughts
  • SCENEVERSE: Scaling 3D Vision-Language Learning for Grounded Scene Understanding | Github
    • Baoxiong Jia, Yixin Chen, Huanyue Yu, Yan Wang, Xuesong Niu, Tengyu Liu, Qing Li, Siyuan Huang
    • Beijing Institute for General Artificial Intelligence
    • [Arxiv2024] https://arxiv.org/abs/2401.09340
    • A dataset, LLMs-based, LLM architecture

Outdoor-Scenes

  • Talk2Radar: Bridging Natural Language with 4D mmWave Radar for 3D Referring Expression Comprehension | Github
    • Runwei Guan, Ruixiao Zhang, Ningwei Ouyang, Jianan Liu, Ka Lok Man, Xiaohao Cai, Ming Xu, Jeremy Smith, Eng Gee Lim, Yutao Yue, Hui Xiong
    • JITRI, University of Liverpool, University of Southampton, Vitalent Consulting, Xi’an Jiaotong-Liverpool University, HKUST (GZ)
    • [Arxiv2024] https://arxiv.org/abs/2405.12821
    • Outdoor scene, autonomous driving
  • Talk to Parallel LiDARs: A Human-LiDAR Interaction Method Based on 3D Visual Grounding |
    • Yuhang Liu, Boyi Sun, Guixu Zheng, Yishuo Wang, Jing Wang, Fei-Yue Wang
    • Chinese Academy of Sciences, South China Agricultural University, Beijing Institute of Technology
    • [Arxiv2024] https://arxiv.org/abs/2405.15274
    • Outdoor scene, autonomous driving
