
mbzuai-oryx / videogpt-plus

174 stars · 174 forks · 10 open issues · 16.87 MB

Official repository of the paper "VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding".

License: Creative Commons Attribution 4.0 International

Python 97.43% Shell 2.57%
chatbot clip dual-encoder gpt4 gpt4o image-encoder llama3 llava multimodal phi-3-mini vicuna video-chatbot video-conversation video-encoder vision-language vision-language-pretraining

videogpt-plus's People

Contributors

hanoonar, mmaaz60


videogpt-plus's Issues

You mean Phi-3 surpassed Mistral-7B?

This is really unexpected. How can a Phi-3-based model surpass Mistral-7B, especially when VideoChat2 uses a giant vision encoder?
Which component does the real work here?

Phi3Model ImportError

When running the script, I hit this problem: ImportError: cannot import name 'Phi3Model' from 'transformers'
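
For context: Phi3Model only exists in recent transformers releases (my understanding is that support landed around v4.40.0; treat that threshold as an assumption). A quick version check in Python:

# Check the installed transformers version; Phi3Model needs a recent release
# (assumption: >= 4.40.0). If it is too old, upgrade with:
#   pip install --upgrade "transformers>=4.40.0"
import transformers
from packaging import version

print(transformers.__version__)
if version.parse(transformers.__version__) >= version.parse("4.40.0"):
    from transformers import Phi3Model  # should now import cleanly
    print("Phi3Model import OK")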

Support for Multi-turn Conversations with Fixed Video Input?

Hello,
I have a question regarding the conversation capabilities of this project:

  1. Does the system support multi-turn conversations?
  2. Is it possible to have a natural, ongoing dialogue while keeping the initial video input fixed?

Thank you for your time and assistance.
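
Not an authoritative answer, but architecturally nothing prevents this: the video can be encoded once and each new question appended to the running conversation. A hypothetical sketch (ask, encode_video_once, and generate_fn are illustrative stand-ins, not this repo's actual API):

# Hypothetical multi-turn loop over a fixed video; none of these helpers are
# the project's real interface.
def ask(video_features, history, question, generate_fn):
    prompt = "".join(f"USER: {q}\nASSISTANT: {a}\n" for q, a in history)
    prompt += f"USER: {question}\nASSISTANT:"
    return generate_fn(video_features, prompt)

# features = encode_video_once("demo.mp4")  # encode the video a single time
# history = []
# for q in ["What happens first?", "And what happens after that?"]:
#     a = ask(features, history, q, generate_fn=model_generate)
#     history.append((q, a))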

Zero-shot QA evaluation

How do we perform zero-shot QA evaluation on datasets like MSVD-QA, MSRVTT-QA, TGIF-QA, and ActivityNet-QA? Can we simply follow the Video-ChatGPT pipeline?
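
I don't know the official answer, but the Video-ChatGPT pipeline scores a per-question prediction file with a GPT-assisted script, so dumping VideoGPT+ outputs in that style would be the first step. A sketch (the exact field names are an assumption modeled on the Video-ChatGPT format):

import json

# Field names are an assumption; align them with the evaluation script you use.
preds = [
    {"id": "v_0001", "question": "What is the man doing?",
     "answer": "cooking", "pred": "The man is cooking in a kitchen."},
]
with open("msvd_qa_predictions.json", "w") as f:
    json.dump(preds, f, indent=2)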

Inquiry about Costs Associated with Video LLM Benchmarks

Hello everyone,

I have been working on replicating benchmarks for video Large Language Models (LLMs), and I've noticed that most of these benchmarks rely on GPT-assisted evaluation. Given the complexity and potential costs involved, I'm interested in gathering feedback on the financial side of running such evaluations.

Could anyone share their experiences regarding the typical costs involved in running these benchmarks? Any insights into budgeting for such projects would be highly beneficial to the community.

Thank you!
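
For a rough sense of scale, the dominant cost is the GPT API calls: roughly tokens per scored pair x number of QA pairs x price per token. A back-of-the-envelope sketch where every number is an assumption to replace with your own:

# All numbers are assumptions; substitute your benchmark size and current
# API pricing before trusting the output.
num_pairs = 10_000           # QA pairs to score (assumption)
tokens_per_pair = 700        # prompt + completion per scoring call (assumption)
usd_per_1k_tokens = 0.0015   # GPT-3.5-class pricing (assumption; check current rates)
print(f"~${num_pairs * tokens_per_pair / 1000 * usd_per_1k_tokens:,.2f}")  # ~$10.50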

Performance on MVBench

Thanks for open-sourcing this exciting work!

I reproduced MVBench with a single GPU, but across three experiments I could not match the results reported in the paper (my best results are below). Were any other weights used in those tests? I also noticed that setting batch_size_per_gpu=2 drastically degrades performance, even though there is no OOM.

All Acc: [67.5, 57.99999999999999, 80.3030303030303, 48.5, 56.49999999999999, 86.5, 73.86934673366834, 37.0, 30.0, 30.5, 85.0, 38.0, 65.0, 82.5, 42.5, 51.0, 48.0, 31.0, 40.0, 56.00000000000001]% Total Acc: 55.37%
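
For what it's worth, the reported total is essentially the unweighted mean of the 20 per-task accuracies; the tiny gap (55.38 vs 55.37) suggests the official total weights tasks by sample count, which is not equal across tasks (note the 80.3030...% entry):

# Sanity-check the Total Acc against the per-task numbers above.
accs = [67.5, 58.0, 80.3030303030303, 48.5, 56.5, 86.5, 73.86934673366834,
        37.0, 30.0, 30.5, 85.0, 38.0, 65.0, 82.5, 42.5, 51.0, 48.0, 31.0,
        40.0, 56.0]
print(f"{sum(accs) / len(accs):.2f}%")  # -> 55.38%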

Question about Training Time

Hello,

Thank you for sharing your excellent research and code. I am currently running the image-encoder pretraining on 8 A100 GPUs, and the estimated remaining time (ETA) is about 6 hours. Is this normal? Could you share the pretraining and fine-tuning times, along with the number of GPUs used for each setting? It would be very helpful.

Thank you!

About downloading the dataset

Hi @salman-h-khan,
I ran the script to download the dataset and encountered the problem below:

Cloning into 'VideoGPT-plus_Training_Dataset'...
remote: Enumerating objects: 46, done.
remote: Counting objects: 100% (43/43), done.
remote: Compressing objects: 100% (43/43), done.
remote: Total 46 (delta 6), reused 0 (delta 0), pack-reused 3 (from 1)
Unpacking objects: 100% (46/46), 1.87 MiB | 2.48 MiB/s, done.

Filtering content: 100% (24/24), 28.09 GiB | 867.00 KiB/s, done.
Encountered 13 file(s) that may not have been copied correctly on Windows:
        instruction_tuning/k710.tgz
        pretraining/CC3M-595K.tgz
        instruction_tuning/clevrer.tgz
        pretraining/COCO.tgz
        instruction_tuning/videochat_it.tgz
        instruction_tuning/webvid/webvid_1.tgz
        instruction_tuning/webvid/webvid_3.tgz
        instruction_tuning/webvid/webvid_2.tgz
        instruction_tuning/webvid/webvid_5.tgz
        instruction_tuning/webvid/webvid_4.tgz
        instruction_tuning/activitynet_videos.tgz
        instruction_tuning/webvid/webvid_6.tgz
        instruction_tuning/NExTQA.tgz

See: `git lfs help smudge` for more details.

The download directory sizes are:

314G    ./instruction_tuning
26G     ./pretraining
341G    ./.git
405M    ./annotations
681G    .

Is this size expected?
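
One sanity check: the 341G .git directory is mostly the LFS object store, which duplicates the checked-out archives, so the ~681G total being roughly double the real payload looks plausible. To verify whether the 13 flagged archives survived the Windows copy, try reading them end to end; a corrupt .tgz raises an error (paths taken from the warning above):

import tarfile

# Reading the member index forces a full decompression pass, so corruption
# in a flagged archive shows up as an exception.
for path in ["instruction_tuning/k710.tgz", "pretraining/CC3M-595K.tgz"]:
    try:
        with tarfile.open(path, "r:gz") as tar:
            tar.getmembers()
        print(path, "OK")
    except (tarfile.ReadError, EOFError, OSError) as e:
        print(path, "CORRUPT:", e)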

About the pre-training stage

Thank you for sharing your work!

During the pre-training stage, the paper mentions that you used image data to train the video branch. How did you use image data to train the video part? Did you treat each image as a single frame of a video?
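
I can't speak for the authors, but the common trick, and my reading of the paper, is to treat each image as a short clip: either a single frame, or the image repeated to the frame count the video encoder expects. A minimal tensor-shape sketch (the frames-first layout is an assumption):

import torch

image = torch.randn(3, 224, 224)                  # C x H x W
clip_1f = image.unsqueeze(0)                      # 1 x C x H x W: a 1-frame "video"
clip_4f = image.unsqueeze(0).repeat(4, 1, 1, 1)   # image repeated to 4 frames
print(clip_1f.shape, clip_4f.shape)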

Simple Demo

Hey! Thanks for your great work. Do you have any plan to provide a simple demo, i.e., one that takes a video and a question as input, rather than running a full benchmark?

“python setup.py install” for flash-attention reports errors

Hello there,

Thank you for your remarkable work; I am really interested in looking into it. The whole installation process works smoothly until the very last command: "python setup.py install" for flash-attention fails with "RuntimeError: Error compiling objects for extension", along with a huge number of error messages, so I have no clue how to debug it.

Has anyone met a similar issue? Any feedback is greatly appreciated!
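
Two things that usually help with flash-attention builds (general advice, not project-specific): install via pip with `pip install flash-attn --no-build-isolation` instead of `python setup.py install`, and cap the parallel compile jobs with `MAX_JOBS=4` if the build machine runs out of memory. Also check that your CUDA toolkit (nvcc) version matches the one your PyTorch build was compiled against; a mismatch commonly produces exactly this kind of mass compile failure.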

eval/vcgbench/inference/run_ddp_inference.sh

[h264 @ 0x16543c00] Missing reference picture, default is 65562
[h264 @ 0x16543c00] mmco: unref short failure
[h264 @ 0x16543c00] mmco: unref short failure
[h264 @ 0x16543c00] Missing reference picture, default is 65562
[h264 @ 0x16543c00] mmco: unref short failure
[h264 @ 0x16543c00] mmco: unref short failure
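
(These `[h264 @ ...]` messages are warnings from FFmpeg's H.264 decoder about missing or corrupt reference frames in particular videos; decoding generally continues past them, so they can usually be ignored unless the returned frames are empty.)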

The .webm files from SSv2 cannot be loaded

raise DECORDError(err_str)
decord.ffi.base.DECORDError: [05:19:05] /github/workspace/src/video/ffmpeg/threaded_decoder.cc:292: [05:19:05] /github/workspace/src/video/ffmpeg/threaded_decoder.cc:218: Check failed: avcodec_send_packet(dec_ctx.get(), pkt.get()) >= 0 (-11 vs. 0) Thread worker: Error sending packet.
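
A workaround sketch, assuming opencv-python is installed: try decord first, and fall back to OpenCV for clips whose streams decord's FFmpeg build rejects:

import cv2
from decord import VideoReader

def load_frames(path, num_frames=8):
    """Uniformly sample frames; fall back to OpenCV when decord fails."""
    try:
        vr = VideoReader(path)
        idx = [round(i * (len(vr) - 1) / (num_frames - 1)) for i in range(num_frames)]
        return [vr[i].asnumpy() for i in idx]
    except Exception:  # decord raises DECORDError on broken streams
        cap = cv2.VideoCapture(path)
        frames = []
        ok, frame = cap.read()
        while ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            ok, frame = cap.read()
        cap.release()
        step = max(1, len(frames) // num_frames)
        return frames[::step][:num_frames]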

In what order should I reproduce the paper?

Step 1: pretrain_projector_image_encoder.sh
Step 2: pretrain_projector_video_encoder.sh
Step 3: finetune_dual_encoder.sh
Step 4: eval/vcgbench/inference/run_ddp_inference.sh
Step 5: eval/vcgbench/gpt_evaluation/vcgbench_evaluate.sh

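For reference, here is the fine-tuning script (step 3) I am using:
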
#!/bin/sh


export DATASET_DIR=/mnt2/ninghuayang/data/videogpt_plus_dataset

BASE_LLM_PATH=microsoft/Phi-3-mini-4k-instruct
VISION_TOWER=OpenGVLab/InternVideo2-Stage2_1B-224p-f4
IMAGE_VISION_TOWER=openai/clip-vit-large-patch14-336
PROJECTOR_TYPE=mlp2x_gelu
# Either use the released pretrained projectors from Hugging Face:
#PRETRAIN_VIDEO_MLP_PATH=MBZUAI/VideoGPT-plus_Phi3-mini-4k_Pretrain/mlp2x_gelu_internvideo2/mm_projector.bin
#PRETRAIN_IMAGE_MLP_PATH=MBZUAI/VideoGPT-plus_Phi3-mini-4k_Pretrain/mlp2x_gelu_clip_l14_336px/mm_projector.bin
# ...or the projectors produced locally by steps 1 and 2:
PRETRAIN_VIDEO_MLP_PATH=results/mlp2x_gelu_internvideo2/mm_projector.bin
PRETRAIN_IMAGE_MLP_PATH=results/mlp2x_gelu_clip_l14_336px/mm_projector.bin
OUTPUT_DIR_PATH=results/videogpt_plus_finetune

deepspeed videogpt_plus/train/train.py \
--lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
--deepspeed scripts/zero3.json \
--model_name_or_path "$BASE_LLM_PATH" \
--version phi3_instruct \
--dataset_use FINETUNING \
--vision_tower "$VISION_TOWER" \
--image_vision_tower "$IMAGE_VISION_TOWER" \
--mm_projector_type "$PROJECTOR_TYPE" \
--image_mm_projector_type "$PROJECTOR_TYPE" \
--pretrain_mm_mlp_adapter "$PRETRAIN_VIDEO_MLP_PATH" \
--pretrain_image_mm_mlp_adapter "$PRETRAIN_IMAGE_MLP_PATH" \
--mm_vision_select_layer -2 \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--image_aspect_ratio pad \
--group_by_modality_length True \
--bf16 True \
--output_dir $OUTPUT_DIR_PATH \
--num_train_epochs 1 \
--per_device_train_batch_size 24 \
--per_device_eval_batch_size 16 \
--gradient_accumulation_steps 2 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 50000 \
--save_total_limit 1 \
--learning_rate 2e-4 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 4096 \
--gradient_checkpointing True \
--dataloader_num_workers 16 \
--lazy_preprocess True \
--report_to none
