
Awesome-LLMs-for-Video-Understanding

Yunlong Tang1,*, Jing Bi1,*, Siting Xu2,*, Luchuan Song1, Susan Liang1, Teng Wang2,3, Daoan Zhang1, Jie An1, Jingyang Lin1, Rongyi Zhu1, Ali Vosoughi1, Chao Huang1, Zeliang Zhang1, Feng Zheng2, Jianguo Zhang2, Ping Luo3, Jiebo Luo1, Chenliang Xu1,†. (*Core Contributors, †Corresponding Author)

1University of Rochester, 2Southern University of Science and Technology, 3The University of Hong Kong


Table of Contents

- 😎 Vid-LLMs: Models
  - 🤖 LLM-based Video Agents
  - 👾 Vid-LLM Pretraining
  - 👀 Vid-LLM Instruction Tuning
  - 🦾 Hybrid Methods
- Tasks, Datasets, and Benchmarks
  - Recognition and Anticipation
  - Captioning and Description
  - Grounding and Retrieval
  - Question Answering
  - Video Instruction Tuning
  - Video-based Large Language Models Benchmark
- Contributing
- 📑 Citation

😎 Vid-LLMs: Models


🤖 LLM-based Video Agents

| Title | Model | Date | Code | Venue |
| ----- | ----- | ---- | ---- | ----- |
| Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language | Socratic Models | 04/2022 | project page | arXiv |
| Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions | Video ChatCaptioner | 04/2023 | code | arXiv |
| VLog: Video as a Long Document | VLog | 04/2023 | code | - |
| ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System | ChatVideo | 04/2023 | project page | arXiv |
| MM-VID: Advancing Video Understanding with GPT-4V(ision) | MM-VID | 10/2023 | - | arXiv |
| MISAR: A Multimodal Instructional System with Augmented Reality | MISAR | 10/2023 | project page | ICCV |
| Grounding-Prompter: Prompting LLM with Multimodal Information for Temporal Sentence Grounding in Long Videos | Grounding-Prompter | 12/2023 | - | arXiv |
| NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation | NaVid | 02/2024 | - | arXiv |
| VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding | VideoAgent | 03/2024 | project page | arXiv |

👾 Vid-LLM Pretraining

| Title | Model | Date | Code | Venue |
| ----- | ----- | ---- | ---- | ----- |
| Learning Video Representations from Large Language Models | LaViLa | 12/2022 | code | CVPR |
| Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning | Vid2Seq | 02/2023 | code | CVPR |
| VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | VAST | 05/2023 | code | NeurIPS |
| Merlin: Empowering Multimodal LLMs with Foresight Minds | Merlin | 12/2023 | - | arXiv |

👀 Vid-LLM Instruction Tuning

Fine-tuning with Connective Adapters

| Title | Model | Date | Code | Venue |
| ----- | ----- | ---- | ---- | ----- |
| Video-LLaMA: An Instruction-Finetuned Visual Language Model for Video Understanding | Video-LLaMA | 06/2023 | code | arXiv |
| VALLEY: Video Assistant with Large Language model Enhanced abilitY | VALLEY | 06/2023 | code | - |
| Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | Video-ChatGPT | 06/2023 | code | arXiv |
| Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration | Macaw-LLM | 06/2023 | code | arXiv |
| LLMVA-GEBC: Large Language Model with Video Adapter for Generic Event Boundary Captioning | LLMVA-GEBC | 06/2023 | code | CVPR |
| Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | mPLUG-video | 06/2023 | code | arXiv |
| MovieChat: From Dense Token to Sparse Memory for Long Video Understanding | MovieChat | 07/2023 | code | arXiv |
| Large Language Models are Temporal and Causal Reasoners for Video Question Answering | LLaMA-VQA | 10/2023 | code | EMNLP |
| Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | Video-LLaVA | 11/2023 | code | arXiv |
| Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding | Chat-UniVi | 11/2023 | code | arXiv |
| LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | LLaMA-VID | 11/2023 | code | arXiv |
| VISTA-LLAMA: Reliable Video Narrator via Equal Distance to Visual Tokens | VISTA-LLAMA | 12/2023 | - | arXiv |
| Audio-Visual LLM for Video Understanding | - | 12/2023 | - | arXiv |
| AutoAD: Movie Description in Context | AutoAD | 06/2023 | code | CVPR |
| AutoAD II: The Sequel - Who, When, and What in Movie Audio Description | AutoAD II | 10/2023 | - | ICCV |
| Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models | FAVOR | 10/2023 | code | arXiv |

Fine-tuning with Insertive Adapters

| Title | Model | Date | Code | Venue |
| ----- | ----- | ---- | ---- | ----- |
| Otter: A Multi-Modal Model with In-Context Instruction Tuning | Otter | 06/2023 | code | arXiv |
| VideoLLM: Modeling Video Sequence with Large Language Models | VideoLLM | 05/2023 | code | arXiv |

Fine-tuning with Hybrid Adapters

| Title | Model | Date | Code | Venue |
| ----- | ----- | ---- | ---- | ----- |
| VTimeLLM: Empower LLM to Grasp Video Moments | VTimeLLM | 11/2023 | code | arXiv |
| GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation | GPT4Video | 11/2023 | - | arXiv |

🦾 Hybrid Methods

| Title | Model | Date | Code | Venue |
| ----- | ----- | ---- | ---- | ----- |
| VideoChat: Chat-Centric Video Understanding | VideoChat | 05/2023 | code, demo | arXiv |
| PG-Video-LLaVA: Pixel Grounding Large Video-Language Models | PG-Video-LLaVA | 11/2023 | code | arXiv |
| TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding | TimeChat | 12/2023 | code | arXiv |
| Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding | Video-GroundingDINO | 12/2023 | code | arXiv |
| A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot | Video4096 | 05/2023 | - | EMNLP |

Tasks, Datasets, and Benchmarks

Recognition and Anticipation

| Name | Paper | Date | Link | Venue |
| ---- | ----- | ---- | ---- | ----- |
| Charades | Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding | 2016 | Link | ECCV |
| YouTube8M | YouTube-8M: A Large-Scale Video Classification Benchmark | 2016 | Link | - |
| ActivityNet | ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding | 2015 | Link | CVPR |
| Kinetics-GEBC | GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval | 2022 | Link | ECCV |
| Kinetics-400 | The Kinetics Human Action Video Dataset | 2017 | Link | - |
| VidChapters-7M | VidChapters-7M: Video Chapters at Scale | 2023 | Link | NeurIPS |

Captioning and Description

| Name | Paper | Date | Link | Venue |
| ---- | ----- | ---- | ---- | ----- |
| Microsoft Research Video Description Corpus (MSVD) | Collecting Highly Parallel Data for Paraphrase Evaluation | 2011 | Link | ACL |
| Microsoft Research Video-to-Text (MSR-VTT) | MSR-VTT: A Large Video Description Dataset for Bridging Video and Language | 2016 | Link | CVPR |
| Tumblr GIF (TGIF) | TGIF: A New Dataset and Benchmark on Animated GIF Description | 2016 | Link | CVPR |
| Charades | Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding | 2016 | Link | ECCV |
| Charades-Ego | Actor and Observer: Joint Modeling of First and Third-Person Videos | 2018 | Link | CVPR |
| ActivityNet Captions | Dense-Captioning Events in Videos | 2017 | Link | ICCV |
| HowTo100M | HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips | 2019 | Link | ICCV |
| Movie Audio Descriptions (MAD) | MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions | 2021 | Link | CVPR |
| YouCook2 | Towards Automatic Learning of Procedures from Web Instructional Videos | 2017 | Link | AAAI |
| MovieNet | MovieNet: A Holistic Dataset for Movie Understanding | 2020 | Link | ECCV |
| Youku-mPLUG | Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | 2023 | Link | arXiv |
| Video Timeline Tags (ViTT) | Multimodal Pretraining for Dense Video Captioning | 2020 | Link | AACL-IJCNLP |
| TVSum | TVSum: Summarizing Web Videos Using Titles | 2015 | Link | CVPR |
| SumMe | Creating Summaries from User Videos | 2014 | Link | ECCV |
| VideoXum | VideoXum: Cross-modal Visual and Textural Summarization of Videos | 2023 | Link | IEEE Trans. Multimedia |

Grounding and Retrieval

| Name | Paper | Date | Link | Venue |
| ---- | ----- | ---- | ---- | ----- |
| Epic-Kitchens-100 | Rescaling Egocentric Vision | 2021 | Link | IJCV |
| VCR (Visual Commonsense Reasoning) | From Recognition to Cognition: Visual Commonsense Reasoning | 2019 | Link | CVPR |
| Ego4D-MQ and Ego4D-NLQ | Ego4D: Around the World in 3,000 Hours of Egocentric Video | 2021 | Link | CVPR |
| Vid-STG | Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences | 2020 | Link | CVPR |
| Charades-STA | TALL: Temporal Activity Localization via Language Query | 2017 | Link | ICCV |
| DiDeMo | Localizing Moments in Video with Natural Language | 2017 | Link | ICCV |

Question Answering

| Name | Paper | Date | Link | Venue |
| ---- | ----- | ---- | ---- | ----- |
| MSVD-QA | Video Question Answering via Gradually Refined Attention over Appearance and Motion | 2017 | Link | ACM Multimedia |
| MSRVTT-QA | Video Question Answering via Gradually Refined Attention over Appearance and Motion | 2017 | Link | ACM Multimedia |
| TGIF-QA | TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering | 2017 | Link | CVPR |
| ActivityNet-QA | ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering | 2019 | Link | AAAI |
| Pororo-QA | DeepStory: Video Story QA by Deep Embedded Memory Networks | 2017 | Link | IJCAI |
| TVQA | TVQA: Localized, Compositional Video Question Answering | 2018 | Link | EMNLP |

Video Instruction Tuning

Pretraining Dataset

| Name | Paper | Date | Link | Venue |
| ---- | ----- | ---- | ---- | ----- |
| VidChapters-7M | VidChapters-7M: Video Chapters at Scale | 2023 | Link | NeurIPS |
| VALOR-1M | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | 2023 | Link | arXiv |
| Youku-mPLUG | Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | 2023 | Link | arXiv |
| InternVid | InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation | 2023 | Link | arXiv |
| VAST-27M | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | 2023 | Link | NeurIPS |

Fine-tuning Dataset

| Name | Paper | Date | Link | Venue |
| ---- | ----- | ---- | ---- | ----- |
| MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning | 2023 | Link | arXiv |
| VideoInstruct100K | Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models | 2023 | Link | arXiv |

Video-based Large Language Models Benchmark

| Title | Date | Code | Venue |
| ----- | ---- | ---- | ----- |
| Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models | 11/2023 | code | - |
| Perception Test: A Diagnostic Benchmark for Multimodal Video Models | 05/2023 | code | NeurIPS 2023, ICCV 2023 Workshop |
| Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks | 07/2023 | code | - |
| FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation | 11/2023 | code | NeurIPS 2023 |
| MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding | 12/2023 | code | - |
| MVBench: A Comprehensive Multi-modal Video Understanding Benchmark | 12/2023 | code | - |
| TempCompass: Do Video LLMs Really Understand Videos? | 03/2024 | code | - |

Contributing

We welcome everyone to contribute to this repository and help improve it. You can submit pull requests to add new papers, projects, and helpful materials, or to correct any errors that you may find. Please make sure that your pull requests follow the "Title|Model|Date|Code|Venue" format. Thank you for your valuable contributions!
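For reference, a new entry added to one of the model tables could look like the row below. The paper title, model name, and repository link are purely hypothetical and only illustrate the "Title|Model|Date|Code|Venue" column layout:

```markdown
| Title | Model | Date | Code | Venue |
| ----- | ----- | ---- | ---- | ----- |
| Example-VidLLM: A Hypothetical Video-Language Model | Example-VidLLM | 01/2024 | [code](https://github.com/example/example-vidllm) | arXiv |
```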

📑 Citation

If you find our survey useful for your research, please cite the following paper:

@article{vidllmsurvey,
      title={Video Understanding with Large Language Models: A Survey}, 
      author={Tang, Yunlong and Bi, Jing and Xu, Siting and Song, Luchuan and Liang, Susan and Wang, Teng and Zhang, Daoan and An, Jie and Lin, Jingyang and Zhu, Rongyi and Vosoughi, Ali and Huang, Chao and Zhang, Zeliang and Zheng, Feng and Zhang, Jianguo and Luo, Ping and Luo, Jiebo and Xu, Chenliang},
      journal={arXiv preprint arXiv:2312.17432},
      year={2023},
}

🌟 Star History

Star History Chart

♥️ Contributors

ali-vosoughi, dwanzhang-ai, infaaa, jeasinema, liangsusan-git, llyx97, renshuhuai-andy, rongyizhu, sai-01, wangkunyu241, wikichao, yamanksingla, yunlong10, zhangaipi
