VideoGPT+ 🎥 💬

VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding

Mohamed bin Zayed University of Artificial Intelligence


📢 Latest Updates

  • Jun-13-24: VideoGPT+ paper, code, model, dataset and benchmark are released. 🔥🔥

VideoGPT+ Overview 💡

VideoGPT+ integrates image and video encoders to leverage detailed spatial understanding and global temporal context, respectively. It processes videos in segments using adaptive pooling on features from both encoders, enhancing performance across various video benchmarks.
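
The segment-wise processing described above can be illustrated with a minimal, dependency-free sketch. It is not the project's implementation: the function names (`split_into_segments`, `adaptive_avg_pool_1d`, `encode_video`), the toy encoders, the 1-D pooling over a flat token list, and the order in which the two token sets are concatenated are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def split_into_segments(frames: Sequence, num_segments: int) -> List[Sequence]:
    """Split frames into contiguous segments of near-equal length."""
    n = len(frames)
    if not 1 <= num_segments <= n:
        raise ValueError("num_segments must be between 1 and len(frames)")
    bounds = [i * n // num_segments for i in range(num_segments + 1)]
    return [frames[bounds[i]:bounds[i + 1]] for i in range(num_segments)]


def adaptive_avg_pool_1d(tokens: Sequence[Vector], out_len: int) -> List[Vector]:
    """Average-pool a token sequence to exactly out_len tokens.

    Output bin i averages input indices
    [floor(i * L / out_len), ceil((i + 1) * L / out_len)), where L = len(tokens).
    """
    length = len(tokens)
    if length == 0 or out_len < 1:
        raise ValueError("need at least one token and out_len >= 1")
    dim = len(tokens[0])
    pooled = []
    for i in range(out_len):
        start = i * length // out_len
        end = ((i + 1) * length + out_len - 1) // out_len  # integer ceil
        window = tokens[start:end]
        pooled.append([sum(t[d] for t in window) / len(window) for d in range(dim)])
    return pooled


def encode_video(
    frames: Sequence,
    num_segments: int,
    image_encoder: Callable[[object], List[Vector]],
    video_encoder: Callable[[Sequence], List[Vector]],
    image_tokens_per_segment: int,
    video_tokens_per_segment: int,
) -> List[Vector]:
    """Encode each segment with both encoders and pool to a fixed token budget."""
    output: List[Vector] = []
    for segment in split_into_segments(frames, num_segments):
        # Image encoder runs per frame (spatial detail).
        image_tokens = [tok for frame in segment for tok in image_encoder(frame)]
        # Video encoder sees the whole segment (temporal context).
        video_tokens = video_encoder(segment)
        # Assumed ordering: pooled image tokens first, then pooled video tokens.
        output.extend(adaptive_avg_pool_1d(image_tokens, image_tokens_per_segment))
        output.extend(adaptive_avg_pool_1d(video_tokens, video_tokens_per_segment))
    return output


# Toy stand-ins: each "frame" is a number and each token is a 1-D vector.
frames = list(range(8))
tokens = encode_video(
    frames,
    num_segments=2,
    image_encoder=lambda frame: [[float(frame)], [float(frame)]],
    video_encoder=lambda segment: [[float(f)] for f in segment],
    image_tokens_per_segment=2,
    video_tokens_per_segment=1,
)
print(tokens)  # [[0.5], [2.5], [1.5], [4.5], [6.5], [5.5]]
```

Pooling to a fixed number of tokens per segment keeps the sequence passed to the language model bounded regardless of how many frames each segment contains, which is the practical reason for pooling in a sketch like this.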

VideoGPT+ Architectural Overview


Contributions 🏆

  • VideoGPT+ Model: We present VideoGPT+, the first video-conversation model that benefits from a dual-encoding scheme based on both image and video features. These complementary sets of features offer rich spatiotemporal details for improved video understanding.
  • VCG+ 112K Dataset: Addressing the limitations of the existing VideoInstruct100K dataset, we develop VCG+ 112K with a novel semi-automatic annotation pipeline, offering dense video captions along with spatial understanding and reasoning-based QA pairs, further improving the model performance.
  • VCGBench-Diverse Benchmark: Recognizing the lack of diverse benchmarks for video-conversation tasks, we propose VCGBench-Diverse, which provides 4,354 human annotated QA pairs across 18 video categories to extensively evaluate the performance of a video-conversation model.


Video Annotation Pipeline (VCG+ 112K) 📂

Video-ChatGPT introduces the VideoInstruct100K dataset, which employs a semi-automatic annotation pipeline to generate 75K instruction-tuning QA pairs. To address the limitations of this annotation process, we present the VCG+ 112K dataset, developed through an improved annotation pipeline. Our approach raises the accuracy and quality of the instruction-tuning pairs by improving keyframe extraction, leveraging SoTA large multimodal models (LMMs) for detailed descriptions, and refining the instruction generation strategy.


VCGBench-Diverse 🔍

Recognizing the limited diversity in existing video conversation benchmarks, we introduce VCGBench-Diverse to comprehensively evaluate the generalization ability of video LMMs. While VCGBench provides an extensive evaluation protocol, it is limited to videos from the ActivityNet200 dataset. Our benchmark comprises 877 videos, 18 broad video categories and 4,354 QA pairs, ensuring a robust evaluation framework.


Installation 🔧

We recommend setting up a conda environment for the project:

conda create --name=videogpt_plus python=3.11
conda activate videogpt_plus

git clone https://github.com/mbzuai-oryx/VideoGPT-plus
cd VideoGPT-plus

pip install torch==2.1.2 torchvision==0.16.2 --index-url https://download.pytorch.org/whl/cu118
pip install transformers==4.41.0

pip install -r requirements.txt

export PYTHONPATH="./:$PYTHONPATH"

Additionally, install FlashAttention for training:

pip install ninja

git clone https://github.com/HazyResearch/flash-attention.git
cd flash-attention
python setup.py install
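
After installation, it can help to confirm that the pinned versions are the ones actually active in the environment. The following is a minimal sketch using only the standard library; the helper names (`installed_version`, `find_mismatches`) are hypothetical and not part of the repository, and the pins are copied from the commands above. Wheels from the CUDA index typically report a local version suffix such as `+cu118`, which the sketch ignores when comparing.

```python
from importlib import metadata
from typing import Callable, Dict, List, Optional

# Pins copied from the installation commands above.
PINNED: Dict[str, str] = {
    "torch": "2.1.2",
    "torchvision": "0.16.2",
    "transformers": "4.41.0",
}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(
    pins: Dict[str, str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> List[str]:
    """Describe every package that is missing or differs from its pin."""
    problems = []
    for package, wanted in pins.items():
        found = lookup(package)
        if found is None:
            problems.append(f"{package}: not installed (expected {wanted})")
        elif found.split("+")[0] != wanted:  # drop a local suffix like "+cu118"
            problems.append(f"{package}: found {found}, expected {wanted}")
    return problems


if __name__ == "__main__":
    for line in find_mismatches(PINNED) or ["all pinned packages match"]:
        print(line)
```

The `lookup` parameter exists only so the check can be exercised without the packages installed, for example by passing a dictionary's `get` method.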

Quantitative Evaluation 📊

We provide instructions to reproduce VideoGPT+ results on VCGBench, VCGBench-Diverse and MVBench. Please follow the instructions at eval/README.md.

VCGBench Evaluation: Video-based Generative Performance Benchmarking 📈

VCGBench_quantitative


VCGBench-Diverse Evaluation 📊

VCGDiverse_quantitative


Zero-Shot Question-Answer Evaluation ❓

zero_shot_quantitative


MVBench Evaluation 🎥

MVBench_quantitative


Training 🚋

We provide scripts for pretraining and finetuning of VideoGPT+. Please follow the instructions at scripts/README.md.


Qualitative Analysis 🔍

A comprehensive evaluation of VideoGPT+ performance across multiple tasks and domains.

demo_vcg+_main


demo_vcg+_full_part1

demo_vcg+_full_part2


Acknowledgements 🙏

  • Video-ChatGPT: A pioneering attempt in video-based conversation models.
  • LLaVA: Our code base is built upon LLaVA and Video-ChatGPT.
  • Chat-UniVi: A recent work in image and video-based conversation models. We borrowed some implementation details from their public codebase.

Citations 📜:

If you're using VideoGPT+ in your research or applications, please cite using this BibTeX:

@article{Maaz2024VideoGPT+,
    title={VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding},
    author={Maaz, Muhammad and Rasheed, Hanoona and Khan, Salman and Khan, Fahad Shahbaz},
    journal={arxiv},
    year={2024},
    url={https://arxiv.org/abs/2406.09418}
}

@inproceedings{Maaz2023VideoChatGPT,
    title={Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models},
    author={Maaz, Muhammad and Rasheed, Hanoona and Khan, Salman and Khan, Fahad Shahbaz},
    booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)},
    year={2024}
}

License 📜

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Looking forward to your feedback, contributions, and stars! 🌟 Please raise any issues or questions here.

