mbzuai-oryx / videogpt-plus
Official repository of the paper "VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding"
License: Creative Commons Attribution 4.0 International
Step 1: pretrain_projector_image_encoder.sh
Step 2: pretrain_projector_video_encoder.sh
Step 3: finetune_dual_encoder.sh
Step 4: eval/vcgbench/inference/run_ddp_inference.sh
Step 5: eval/vcgbench/gpt_evaluation/vcgbench_evaluate.sh
#!/bin/sh
# Step 3: fine-tune the dual-encoder model (finetune_dual_encoder.sh)
export DATASET_DIR=/mnt2/ninghuayang/data/videogpt_plus_dataset

BASE_LLM_PATH=microsoft/Phi-3-mini-4k-instruct
VISION_TOWER=OpenGVLab/InternVideo2-Stage2_1B-224p-f4   # video encoder
IMAGE_VISION_TOWER=openai/clip-vit-large-patch14-336    # image encoder
PROJECTOR_TYPE=mlp2x_gelu

# Either use the released pretrained projectors from the Hugging Face Hub ...
#PRETRAIN_VIDEO_MLP_PATH=MBZUAI/VideoGPT-plus_Phi3-mini-4k_Pretrain/mlp2x_gelu_internvideo2/mm_projector.bin
#PRETRAIN_IMAGE_MLP_PATH=MBZUAI/VideoGPT-plus_Phi3-mini-4k_Pretrain/mlp2x_gelu_clip_l14_336px/mm_projector.bin
# ... or the ones produced locally by steps 1 and 2:
PRETRAIN_VIDEO_MLP_PATH=results/mlp2x_gelu_internvideo2/mm_projector.bin
PRETRAIN_IMAGE_MLP_PATH=results/mlp2x_gelu_clip_l14_336px/mm_projector.bin

OUTPUT_DIR_PATH=results/videogpt_plus_finetune
deepspeed videogpt_plus/train/train.py \
--lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
--deepspeed scripts/zero3.json \
--model_name_or_path "$BASE_LLM_PATH" \
--version phi3_instruct \
--dataset_use FINETUNING \
--vision_tower "$VISION_TOWER" \
--image_vision_tower "$IMAGE_VISION_TOWER" \
--mm_projector_type "$PROJECTOR_TYPE" \
--image_mm_projector_type "$PROJECTOR_TYPE" \
--pretrain_mm_mlp_adapter "$PRETRAIN_VIDEO_MLP_PATH" \
--pretrain_image_mm_mlp_adapter "$PRETRAIN_IMAGE_MLP_PATH" \
--mm_vision_select_layer -2 \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--image_aspect_ratio pad \
--group_by_modality_length True \
--bf16 True \
    --output_dir "$OUTPUT_DIR_PATH" \
--num_train_epochs 1 \
--per_device_train_batch_size 24 \
--per_device_eval_batch_size 16 \
--gradient_accumulation_steps 2 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 50000 \
--save_total_limit 1 \
--learning_rate 2e-4 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 4096 \
--gradient_checkpointing True \
--dataloader_num_workers 16 \
--lazy_preprocess True \
--report_to none
Hello there,
Thank you for your remarkable work; I am really interested in looking into it. The whole installation process works smoothly until the very last command: `python setup.py install` for flash-attention fails with "RuntimeError: Error compiling objects for extension", along with a huge number of error messages, so I have no clue how to debug it.
Has anyone met a similar issue? Any feedback is greatly appreciated!
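For anyone hitting the same build failure: a few generic workarounds from flash-attention's own install guidance (not specific to this repo) are to confirm that the CUDA toolkit matches PyTorch's CUDA build, to prefer the prebuilt wheels via pip instead of `setup.py install`, and to cap the number of parallel compile jobs, since the build is memory-hungry:

```shell
# Sanity checks: the nvcc toolkit version should match torch.version.cuda;
# a mismatch is a common cause of "Error compiling objects for extension".
python -c "import torch; print(torch.__version__, torch.version.cuda)"
nvcc --version

# Prefer the prebuilt wheels over building from source:
pip install flash-attn --no-build-isolation

# If it still builds from source and fails or OOMs, limit parallel jobs:
MAX_JOBS=4 pip install flash-attn --no-build-isolation
```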
Are you planning to release the inference code for
Thanks to open source for this exciting work!
I reproduced the performance on MVBench with a single GPU, but three experiments did not reach the numbers reported in the paper (my best results are below); were any other weights used in those tests? I also noticed that setting batch_size_per_gpu=2 drastically degrades performance, even though there is no OOM.
All Acc: [67.5, 58.0, 80.3, 48.5, 56.5, 86.5, 73.9, 37.0, 30.0, 30.5, 85.0, 38.0, 65.0, 82.5, 42.5, 51.0, 48.0, 31.0, 40.0, 56.0]% Total Acc: 55.37%
Would full-parameter fine-tuning be better?
[h264 @ 0x16543c00] Missing reference picture, default is 65562
[h264 @ 0x16543c00] mmco: unref short failure
[h264 @ 0x16543c00] mmco: unref short failure
[h264 @ 0x16543c00] Missing reference picture, default is 65562
[h264 @ 0x16543c00] mmco: unref short failure
[h264 @ 0x16543c00] mmco: unref short failure
Thank you so much for sharing this amazing work! I’m wondering where I can find the dense captions for the 112k videos mentioned in the paper.
Hi,
I am getting this warning while training the model: "You are using a model of type phi3 to instantiate a model of type VideoGPT+. This is not supported for all configurations of models and can yield errors."
Can anyone tell me why this occurs?
Where can I download the original VCGplus 110k videos?
hi @salman-h-khan
I ran the script to download the dataset and encountered the problem below:
Cloning into 'VideoGPT-plus_Training_Dataset'...
remote: Enumerating objects: 46, done.
remote: Counting objects: 100% (43/43), done.
remote: Compressing objects: 100% (43/43), done.
remote: Total 46 (delta 6), reused 0 (delta 0), pack-reused 3 (from 1)
Unpacking objects: 100% (46/46), 1.87 MiB | 2.48 MiB/s, done.
Filtering content: 100% (24/24), 28.09 GiB | 867.00 KiB/s, done.
Encountered 13 file(s) that may not have been copied correctly on Windows:
instruction_tuning/k710.tgz
pretraining/CC3M-595K.tgz
instruction_tuning/clevrer.tgz
pretraining/COCO.tgz
instruction_tuning/videochat_it.tgz
instruction_tuning/webvid/webvid_1.tgz
instruction_tuning/webvid/webvid_3.tgz
instruction_tuning/webvid/webvid_2.tgz
instruction_tuning/webvid/webvid_5.tgz
instruction_tuning/webvid/webvid_4.tgz
instruction_tuning/activitynet_videos.tgz
instruction_tuning/webvid/webvid_6.tgz
instruction_tuning/NExTQA.tgz
See: `git lfs help smudge` for more details.
The download directory sizes:
314G ./instruction_tuning
26G ./pretraining
341G ./.git
405M ./annotations
681G .
Is this okay?
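One quick way to answer that yourself (a generic sketch, not from this repo) is to test the gzip stream of every downloaded archive; any that fail the check need to be re-fetched with `git lfs pull`:

```shell
# Verify that each downloaded .tgz archive is intact (gzip stream test).
# The glob paths match the dataset layout shown above; run from the clone root.
check_tgz() {
  for f in "$@"; do
    [ -e "$f" ] || continue        # skip unmatched globs
    if gzip -t "$f" 2>/dev/null; then
      echo "OK  $f"
    else
      echo "BAD $f"                # re-download this one with: git lfs pull
    fi
  done
}

check_tgz instruction_tuning/*.tgz instruction_tuning/webvid/*.tgz pretraining/*.tgz
```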
Thank you for sharing your work!
During the pre-training stage, the paper mentions that you used image data to train the video branch. How did you use image data to train the video part? Did you treat each image as a single frame of a video?
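For context, the usual "image as a one-frame video" trick can be sketched like this (an illustration of the idea, not confirmed to be what this codebase does; assumes numpy arrays in (H, W, C) layout):

```python
import numpy as np

def image_as_video(img: np.ndarray, num_frames: int = 1) -> np.ndarray:
    """Turn a single image (H, W, C) into a video clip (T, H, W, C)
    by adding a time axis and repeating the frame num_frames times."""
    return np.repeat(img[None, ...], num_frames, axis=0)
```

The video encoder then sees a degenerate clip whose frames are all identical, so no architectural change is needed to train on image data.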
When I run the script, I hit this problem: ImportError: cannot import name 'Phi3Model' from 'transformers'
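This import error usually just means the installed transformers is too old: Phi3Model first shipped in transformers v4.40.0, so upgrading (e.g. `pip install -U "transformers>=4.40"`) typically resolves it. A minimal version check (the helper name is my own):

```python
def has_phi3(transformers_version: str) -> bool:
    """Phi3Model was added in transformers v4.40.0, so any older
    install raises the ImportError above."""
    major, minor = (int(x) for x in transformers_version.split(".")[:2])
    return (major, minor) >= (4, 40)
```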
I find this really surprising: how can a Phi-3 model surpass Mistral-7B, given that VideoChat2 uses a giant vision encoder?
Which part actually does the work?
Hey! Thanks for your great work. Do you have any plan to provide a simple demo, i.e., input a video and a question, rather than a full benchmark?
Hello,
I have a question regarding the conversation capabilities of this project:
Thank you for your time and assistance.
Hi team,
Nice work! Can I request the intermediate descriptions for vcg-plus_112k generated by this file?
Thanks in advance!
Hello everyone,
I have been working on replicating benchmarks for video Large Language Models (LLMs), and I've noticed that most of these benchmarks rely on GPT-assisted evaluation. Given the complexity and potential costs of these benchmarks, I'm interested in gathering some feedback on the financial side of running such evaluations.
Could anyone share their experiences regarding the typical costs involved in running these benchmarks? Any insights into budgeting for such projects would be highly beneficial to the community.
Thank you!
Do you have a plan to release the original "Detailed Video Descriptions"?
How to perform zero-shot QA evaluation on datasets like MSVD-QA, MSRVTT-QA, TGIF-QA, ActivityNet-QA? Could we just follow the pipeline of Video-ChatGPT?
Hello,
Thank you for sharing your excellent research and code. I am currently pretraining an image encoder using 8 A100 GPUs. The estimated time of arrival (ETA) is about 6 hours. Is this normal? Could you share the pretraining and fine-tuning times along with the number of GPUs used for each setting? It would be very helpful.
Thank you!
raise DECORDError(err_str)
decord.ffi.base.DECORDError: [05:19:05] /github/workspace/src/video/ffmpeg/threaded_decoder.cc:292: [05:19:05] /github/workspace/src/video/ffmpeg/threaded_decoder.cc:218: Check failed: avcodec_send_packet(dec_ctx.get(), pkt.get()) >= 0 (-11 vs. 0) Thread worker: Error sending packet.