mbzuai-oryx / videogpt-plus
Official repository of the paper "VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding"
License: Creative Commons Attribution 4.0 International
Step 1: pretrain_projector_image_encoder.sh
Step 2: pretrain_projector_video_encoder.sh
Step 3: finetune_dual_encoder.sh
Step 4: eval/vcgbench/inference/run_ddp_inference.sh
Step 5: eval/vcgbench/gpt_evaluation/vcgbench_evaluate.sh
#!/bin/sh
# Step 3: fine-tune the dual-encoder model (finetune_dual_encoder.sh)
export DATASET_DIR=/mnt2/ninghuayang/data/videogpt_plus_dataset

BASE_LLM_PATH=microsoft/Phi-3-mini-4k-instruct
VISION_TOWER=OpenGVLab/InternVideo2-Stage2_1B-224p-f4   # video encoder
IMAGE_VISION_TOWER=openai/clip-vit-large-patch14-336    # image encoder
PROJECTOR_TYPE=mlp2x_gelu

# Either use the released pretrained projectors from the Hugging Face Hub ...
#PRETRAIN_VIDEO_MLP_PATH=MBZUAI/VideoGPT-plus_Phi3-mini-4k_Pretrain/mlp2x_gelu_internvideo2/mm_projector.bin
#PRETRAIN_IMAGE_MLP_PATH=MBZUAI/VideoGPT-plus_Phi3-mini-4k_Pretrain/mlp2x_gelu_clip_l14_336px/mm_projector.bin
# ... or the ones produced locally by steps 1 and 2:
PRETRAIN_VIDEO_MLP_PATH=results/mlp2x_gelu_internvideo2/mm_projector.bin
PRETRAIN_IMAGE_MLP_PATH=results/mlp2x_gelu_clip_l14_336px/mm_projector.bin

OUTPUT_DIR_PATH=results/videogpt_plus_finetune
deepspeed videogpt_plus/train/train.py \
--lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
--deepspeed scripts/zero3.json \
--model_name_or_path "$BASE_LLM_PATH" \
--version phi3_instruct \
--dataset_use FINETUNING \
--vision_tower "$VISION_TOWER" \
--image_vision_tower "$IMAGE_VISION_TOWER" \
--mm_projector_type "$PROJECTOR_TYPE" \
--image_mm_projector_type "$PROJECTOR_TYPE" \
--pretrain_mm_mlp_adapter "$PRETRAIN_VIDEO_MLP_PATH" \
--pretrain_image_mm_mlp_adapter "$PRETRAIN_IMAGE_MLP_PATH" \
--mm_vision_select_layer -2 \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--image_aspect_ratio pad \
--group_by_modality_length True \
--bf16 True \
    --output_dir "$OUTPUT_DIR_PATH" \
--num_train_epochs 1 \
--per_device_train_batch_size 24 \
--per_device_eval_batch_size 16 \
--gradient_accumulation_steps 2 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 50000 \
--save_total_limit 1 \
--learning_rate 2e-4 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 4096 \
--gradient_checkpointing True \
--dataloader_num_workers 16 \
--lazy_preprocess True \
--report_to none
Hello there,
Thank you for your remarkable work; I am really interested in looking into it. The whole installation process works smoothly until the very last command: `python setup.py install` for flash-attention fails with "RuntimeError: Error compiling objects for extension", along with a huge number of error messages, so I have no clue how to debug it.
Has anyone met a similar issue? Any feedback is greatly appreciated!
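For anyone hitting the same build failure: a few generic workarounds from flash-attention's own install guidance (not specific to this repo) are to confirm that the CUDA toolkit matches PyTorch's CUDA build, to prefer the prebuilt wheels via pip instead of `setup.py install`, and to cap the number of parallel compile jobs, since the build is memory-hungry:

```shell
# Sanity checks: the nvcc toolkit version should match torch.version.cuda;
# a mismatch is a common cause of "Error compiling objects for extension".
python -c "import torch; print(torch.__version__, torch.version.cuda)"
nvcc --version

# Prefer the prebuilt wheels over building from source:
pip install flash-attn --no-build-isolation

# If it still builds from source and fails or OOMs, limit parallel jobs:
MAX_JOBS=4 pip install flash-attn --no-build-isolation
```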
Are you planning to release the inference code for
Thanks to open source for this exciting work!
I reproduced the performance on MVBench with a single GPU, but three experiments did not reach the numbers reported in the paper (my best results are below); were any other weights used in those tests? I also noticed that setting batch_size_per_gpu=2 drastically degrades performance, even though there is no OOM.
All Acc: [67.5, 58.0, 80.3, 48.5, 56.5, 86.5, 73.9, 37.0, 30.0, 30.5, 85.0, 38.0, 65.0, 82.5, 42.5, 51.0, 48.0, 31.0, 40.0, 56.0]% Total Acc: 55.37%
Would full-parameter fine-tuning be better?
[h264 @ 0x16543c00] Missing reference picture, default is 65562
[h264 @ 0x16543c00] mmco: unref short failure
[h264 @ 0x16543c00] mmco: unref short failure
[h264 @ 0x16543c00] Missing reference picture, default is 65562
[h264 @ 0x16543c00] mmco: unref short failure
[h264 @ 0x16543c00] mmco: unref short failure
Thank you so much for sharing this amazing work! I’m wondering where I can find the dense captions for the 112k videos mentioned in the paper.
Hi,
I am getting this warning while training the model: "You are using a model of type phi3 to instantiate a model of type VideoGPT+. This is not supported for all configurations of models and can yield errors."
Can anyone tell me why this occurs?
Where can I download the original VCGplus 110k videos?
hi @salman-h-khan
I ran the script to download the dataset and encountered the problem below:
Cloning into 'VideoGPT-plus_Training_Dataset'...
remote: Enumerating objects: 46, done.
remote: Counting objects: 100% (43/43), done.
remote: Compressing objects: 100% (43/43), done.
remote: Total 46 (delta 6), reused 0 (delta 0), pack-reused 3 (from 1)
Unpacking objects: 100% (46/46), 1.87 MiB | 2.48 MiB/s, done.
Filtering content: 100% (24/24), 28.09 GiB | 867.00 KiB/s, done.
Encountered 13 file(s) that may not have been copied correctly on Windows:
instruction_tuning/k710.tgz
pretraining/CC3M-595K.tgz
instruction_tuning/clevrer.tgz
pretraining/COCO.tgz
instruction_tuning/videochat_it.tgz
instruction_tuning/webvid/webvid_1.tgz
instruction_tuning/webvid/webvid_3.tgz
instruction_tuning/webvid/webvid_2.tgz
instruction_tuning/webvid/webvid_5.tgz
instruction_tuning/webvid/webvid_4.tgz
instruction_tuning/activitynet_videos.tgz
instruction_tuning/webvid/webvid_6.tgz
instruction_tuning/NExTQA.tgz
See: `git lfs help smudge` for more details.
The download directory sizes:
314G ./instruction_tuning
26G ./pretraining
341G ./.git
405M ./annotations
681G .
Is this okay?
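One quick way to answer that yourself (a generic sketch, not from this repo) is to test the gzip stream of every downloaded archive; any that fail the check need to be re-fetched with `git lfs pull`:

```shell
# Verify that each downloaded .tgz archive is intact (gzip stream test).
# The glob paths match the dataset layout shown above; run from the clone root.
check_tgz() {
  for f in "$@"; do
    [ -e "$f" ] || continue        # skip unmatched globs
    if gzip -t "$f" 2>/dev/null; then
      echo "OK  $f"
    else
      echo "BAD $f"                # re-download this one with: git lfs pull
    fi
  done
}

check_tgz instruction_tuning/*.tgz instruction_tuning/webvid/*.tgz pretraining/*.tgz
```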
Thank you for sharing your work!
During the pre-training stage, the paper mentions that you used image data to train the video branch. How did you use image data to train the video part? Did you treat each image as a single frame of a video?
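For context, the usual "image as a one-frame video" trick can be sketched like this (an illustration of the idea, not confirmed to be what this codebase does; assumes numpy arrays in (H, W, C) layout):

```python
import numpy as np

def image_as_video(img: np.ndarray, num_frames: int = 1) -> np.ndarray:
    """Turn a single image (H, W, C) into a video clip (T, H, W, C)
    by adding a time axis and repeating the frame num_frames times."""
    return np.repeat(img[None, ...], num_frames, axis=0)
```

The video encoder then sees a degenerate clip whose frames are all identical, so no architectural change is needed to train on image data.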
When I run the script, I hit this problem: ImportError: cannot import name 'Phi3Model' from 'transformers'
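This import error usually just means the installed transformers is too old: Phi3Model first shipped in transformers v4.40.0, so upgrading (e.g. `pip install -U "transformers>=4.40"`) typically resolves it. A minimal version check (the helper name is my own):

```python
def has_phi3(transformers_version: str) -> bool:
    """Phi3Model was added in transformers v4.40.0, so any older
    install raises the ImportError above."""
    major, minor = (int(x) for x in transformers_version.split(".")[:2])
    return (major, minor) >= (4, 40)
```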
I find this really surprising: how can a Phi-3 model surpass Mistral-7B, given that VideoChat2 uses a giant vision encoder?
Which part actually does the work?
Hey! Thanks for your great work. Do you have any plan to provide a simple demo, i.e., input a video and a question, rather than a full benchmark?
Hello,
I have a question regarding the conversation capabilities of this project:
Thank you for your time and assistance.
Hi team,
Nice work! Can I request the intermediate descriptions for vcg-plus_112k generated by this file?
Thanks in advance!
Hello everyone,
I have been working on replicating benchmarks for video Large Language Models (LLMs), and I've noticed that most of these benchmarks rely on GPT-assisted evaluation. Given the complexity and potential costs of these benchmarks, I'm interested in gathering some feedback on the financial side of running such evaluations.
Could anyone share their experiences regarding the typical costs involved in running these benchmarks? Any insights into budgeting for such projects would be highly beneficial to the community.
Thank you!
Do you have a plan to release the original "Detailed Video Descriptions"?
How to perform zero-shot QA evaluation on datasets like MSVD-QA, MSRVTT-QA, TGIF-QA, ActivityNet-QA? Could we just follow the pipeline of Video-ChatGPT?
Hello,
Thank you for sharing your excellent research and code. I am currently pretraining an image encoder using 8 A100 GPUs. The estimated time of arrival (ETA) is about 6 hours. Is this normal? Could you share the pretraining and fine-tuning times along with the number of GPUs used for each setting? It would be very helpful.
Thank you!
raise DECORDError(err_str)
decord.ffi.base.DECORDError: [05:19:05] /github/workspace/src/video/ffmpeg/threaded_decoder.cc:292: [05:19:05] /github/workspace/src/video/ffmpeg/threaded_decoder.cc:218: Check failed: avcodec_send_packet(dec_ctx.get(), pkt.get()) >= 0 (-11 vs. 0) Thread worker: Error sending packet.