
ig-vlm's Introduction

IG-VLM: An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM

Stimulated by the sophisticated reasoning capabilities of recent Large Language Models (LLMs), a variety of strategies for bridging the video modality have been devised. A prominent strategy involves Video Language Models (VideoLMs), which train a learnable interface with video data to connect advanced vision encoders with LLMs. Recently, an alternative strategy has surfaced, employing readily available foundation models, such as VideoLMs and LLMs, across multiple stages for modality bridging. In this study, we introduce a simple yet novel strategy where only a single Vision Language Model (VLM) is utilized. Our starting point is the plain insight that a video comprises a series of images, or frames, interwoven with temporal information. The essence of video comprehension lies in adeptly managing the temporal aspects along with the spatial details of each frame. Initially, we transform a video into a single composite image by arranging multiple frames in a grid layout. The resulting single image is termed an image grid. This format, while maintaining the appearance of a solitary image, effectively retains temporal information within the grid structure. Therefore, the image grid approach enables direct application of a single high-performance VLM without necessitating any video-data training. Our extensive experimental analysis across ten zero-shot video question answering benchmarks, including five open-ended and five multiple-choice benchmarks, reveals that the proposed Image Grid Vision Language Model (IG-VLM) surpasses the existing methods in nine out of ten benchmarks.
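The grid construction itself is straightforward. Below is a minimal, hypothetical sketch of turning a video into an image grid with OpenCV and Pillow; the 2x3 layout and uniform frame sampling are illustrative assumptions, not necessarily the exact configuration used in the paper or the repository code.

# Minimal sketch of the image-grid idea (illustrative; layout and sampling are assumptions).
import cv2
from PIL import Image

def video_to_image_grid(video_path, rows=2, cols=3):
    """Sample rows*cols frames uniformly and tile them into one composite image."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    num_frames = rows * cols
    indices = [int(i * (total - 1) / (num_frames - 1)) for i in range(num_frames)]

    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()

    # Tile the sampled frames row by row into a single grid image.
    w, h = frames[0].size
    grid = Image.new("RGB", (cols * w, rows * h))
    for i, frame in enumerate(frames):
        grid.paste(frame, ((i % cols) * w, (i // cols) * h))
    return grid

The resulting grid image, together with the question prompt, can then be passed to any single-image VLM.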

Requirements and Installation

  • PyTorch
  • transformers
  • Install required packages: pip install -r requirements.txt

Inference and Evaluation

We provide code that enables the reproduction of our experiments with LLaVA v1.6 7b/13b/34b and GPT-4V using the IG-VLM approach. For each VLM, we offer files that facilitate experimentation across various benchmarks: Open-ended Video Question Answering (VQA) on MSVD-QA, MSRVTT-QA, ActivityNet-QA, and TGIF-QA; Text Generation Performance VQA for CI, DO, CU, TU, and CO; and Multiple-choice VQA on NExT-QA, STAR, TVQA, IntentQA, and EgoSchema.

  • To conduct these benchmark experiments, please download the datasets and prepare a QA pair sheet.
  • The QA pair sheet should follow the format outlined below and must be converted into a CSV file for use; a minimal construction sketch is shown after the example table below.
# For an open-ended QA sheet, include video_name, question, answer, question_id, and question_type (optional).
# For a multiple-choice QA sheet, include video_name, question, options (a0, a1, a2, ...), answer, question_id, and question_type (optional).
# question_id must be unique.

# Example of a multiple-choice QA sheet
| video_name | question_id |                        question                       |       a0      |      a1     |    a2    |        a3      |        a4       |   answer   | question_type(optional) | 
|------------|-------------|-------------------------------------------------------|---------------|-------------|----------|----------------|-----------------|------------|-------------------------|
| 5333075105 | unique1234  | what did the man do after he reached the cameraman?   | play with toy |inspect wings|   stop   |move to the side|pick up something|    stop    |            TN           |
...
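For illustration, the following hypothetical snippet writes a multiple-choice QA sheet in this format with pandas; the column names follow the description above, the row values mirror the example table, and the output path is a placeholder.

import pandas as pd

# Hypothetical example of building a multiple-choice QA sheet as a CSV file.
rows = [{
    "video_name": "5333075105",
    "question_id": "unique1234",          # must be unique per row
    "question": "what did the man do after he reached the cameraman?",
    "a0": "play with toy",
    "a1": "inspect wings",
    "a2": "stop",
    "a3": "move to the side",
    "a4": "pick up something",
    "answer": "stop",
    "question_type": "TN",                # optional column
}]
pd.DataFrame(rows).to_csv("./data/multiple_choice_qa/my_benchmark.csv", index=False)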
  • For experimenting with LLaVA v1.6 combined with IG-VLM, the following command can be used. Please install the LLaVA code in the execution path, and make sure to reinstall it for each reproduction. The llm_size parameter selects among the 7b, 13b, and 34b model configurations:
# Open-ended video question answering
python eval_llava_openended.py --path_qa_pair_csv ./data/open_ended_qa/ActivityNet_QA.csv --path_video /data/activitynet/videos/%s.mp4 --path_result ./result_activitynet/ --api_key {api_key} --llm_size 7b
# Text generation performance
python eval_llava_textgeneration_openended.py --path_qa_pair_csv ./data/text_generation_benchmark/Generic_QA.csv --path_video /data/activitynet/videos/%s.mp4 --path_result ./result_textgeneration/ --api_key {api_key} --llm_size 13b
# Multiple-choice VQA
python eval_llava_multiplechoice.py --path_qa_pair_csv ./data/multiple_choice_qa/TVQA.csv --path_video /data/TVQA/videos/%s.mp4 --path_result ./result_tvqa/ --llm_size 34b
  • When conducting experiments with GPT-4V combined with IG-VLM, the process can be initiated using the following command. Please be aware that utilizing the GPT-4 vision API may incur significant costs.
# Open-ended video question answering
python eval_gpt4v_openended.py --path_qa_pair_csv ./data/open_ended_qa/MSVD_QA.csv --path_video /data/msvd/videos/%s.avi --path_result ./result_activitynet_gpt4/ --api_key {api_key}
# Text generation performance
python eval_gpt4v_textgeneration_openended.py --path_qa_pair_csv ./data/text_generation_benchmark/Generic_QA.csv --path_video /data/activitynet/videos/%s.mp4 --path_result ./result_textgeneration_gpt4/ --api_key {api_key}
# Multiple-choice VQA
python eval_gpt4v_multiplechoice.py --path_qa_pair_csv ./data/multiple_choice_qa/EgoSchema.csv --path_video /data/EgoSchema/videos/%s.mp4 --path_result ./result_egoschema_gpt4/ --api_key {api_key}
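Note that --path_video is given as a template containing %s; presumably each row's video_name fills the placeholder when a video is loaded. This is an assumption based on the command-line examples above, not a quote of the repository code:

# Sketch of how the %s video path template appears to be resolved (assumption).
path_video = "/data/activitynet/videos/%s.mp4"
video_name = "v_example123"        # hypothetical value from the video_name column
print(path_video % video_name)     # -> /data/activitynet/videos/v_example123.mp4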

ig-vlm's People

Contributors

imagegridworth, doublekwsj

ig-vlm's Issues

Unable to Reproduce the reported numbers

Hi authors,

Thanks for the great work!
However, I cannot reproduce the numbers reported in the paper using your code. I use the LLaVA-1.6-vicuna-7B model.

Open-ended QA

|            | MSVD-QA | MSRVTT-QA | ActivityNet-QA | TGIF-QA |
|------------|---------|-----------|----------------|---------|
| Reported   | 78.8    | 63.7      | 54.3           | 73.0    |
| Reproduced | 74.0    | 60.1      | 48.5           | 68.5    |

Multiple-choice QA

|            | NExT-QA | IntentQA | EgoSchema |
|------------|---------|----------|-----------|
| Reported   | 63.1    | 60.3     | 35.8      |
| Reproduced | 49.2    | 45.4     | 24.2      |

The multiple-choice QA is not evaluated using ChatGPT.

Could you re-run your experiments and see whether the reported numbers are reproducible? Thanks!

Where to download the STAR videos?

It seems that using the Charades_v1 videos does not work; the filenames do not match. Specifically, your filename is

File not found: ./STAR/Charades_v1/Z97SD_18.1_24.1.mp4

while Charades_v1/Z97SD.mp4 does exist.

Should I update the filenames?

Also, for TVQA, should I just ask the authors for the videos? Could you give me a copy?

Thank you,

Mu

The inference time

Really appreciate the authors open-sourcing this great project, but I have a small question about the time consumed on the IntentQA and NExT-QA datasets (which are what I'm working on).
My local inference times on these datasets are about 3 hours and 6 hours respectively, which seems a little long. So I'm just wondering how long the evaluation took for you, and whether there are any tricks to speed up the whole process.
I’m looking forward to your reply 😃

Evaluation results and the OpenAI model version used for evaluation

Hello, thanks for the interesting work. I have two questions about the reproduction.

a. Could you release the results of the evaluated benchmarks?

b. What is the OpenAI model you used for the evaluations? It seems gpt-3.5-turbo was recently changed from gpt-3.5-turbo-0613 to gpt-3.5-turbo-0125. I just want to know which model you used.

Training code

Thanks for the great contribution. Will you release the training code?

LoRA code

A very wonderful piece of work; this approach is really novel.
Could you provide code for further fine-tuning with LoRA on one's own dataset?

Thanks, looking forward to your reply

Results on MVBench

Thanks for your awesome work! It's really exciting to see that an image-only VLM can surpass some video LLMs. Could you please evaluate LLaVA-1.6 on MVBench with your method? I'm quite interested in the results. Thanks in advance :)
