
mbzuai-oryx / videogpt-plus

174 stars · 174 forks · 10 open issues · 16.87 MB

Official repository of the paper "VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding".

License: Creative Commons Attribution 4.0 International

Python 97.43% Shell 2.57%
chatbot clip dual-encoder gpt4 gpt4o image-encoder llama3 llava multimodal phi-3-mini vicuna video-chatbot video-conversation video-encoder vision-language vision-language-pretraining

videogpt-plus's People

Contributors

hanoonar, mmaaz60


videogpt-plus's Issues

You mean Phi-3 surpassed Mistral-7B?

This is really unexpected. How can a Phi-3-based model surpass Mistral-7B, especially when VideoChat2 uses a giant vision encoder?
Which component does the real work here?

Phi3Model ImportError

When running the script, I hit this problem: ImportError: cannot import name 'Phi3Model' from 'transformers'
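
For context: Phi3Model only exists in recent transformers releases (my understanding is that support landed around v4.40.0; treat that threshold as an assumption). A quick version check in Python:

# Check the installed transformers version; Phi3Model needs a recent release
# (assumption: >= 4.40.0). If it is too old, upgrade with:
#   pip install --upgrade "transformers>=4.40.0"
import transformers
from packaging import version

print(transformers.__version__)
if version.parse(transformers.__version__) >= version.parse("4.40.0"):
    from transformers import Phi3Model  # should now import cleanly
    print("Phi3Model import OK")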

Support for Multi-turn Conversations with Fixed Video Input?

Hello,
I have a question regarding the conversation capabilities of this project:

  1. Does the system support multi-turn conversations?
  2. Is it possible to have a natural, ongoing dialogue while keeping the initial video input fixed?

Thank you for your time and assistance.
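
Not an authoritative answer, but architecturally nothing prevents this: the video can be encoded once and each new question appended to the running conversation. A hypothetical sketch (ask, encode_video_once, and generate_fn are illustrative stand-ins, not this repo's actual API):

# Hypothetical multi-turn loop over a fixed video; none of these helpers are
# the project's real interface.
def ask(video_features, history, question, generate_fn):
    prompt = "".join(f"USER: {q}\nASSISTANT: {a}\n" for q, a in history)
    prompt += f"USER: {question}\nASSISTANT:"
    return generate_fn(video_features, prompt)

# features = encode_video_once("demo.mp4")  # encode the video a single time
# history = []
# for q in ["What happens first?", "And what happens after that?"]:
#     a = ask(features, history, q, generate_fn=model_generate)
#     history.append((q, a))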

Zero-shot QA evaluation

How do we perform zero-shot QA evaluation on datasets like MSVD-QA, MSRVTT-QA, TGIF-QA, and ActivityNet-QA? Can we simply follow the Video-ChatGPT pipeline?
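
I don't know the official answer, but the Video-ChatGPT pipeline scores a per-question prediction file with a GPT-assisted script, so dumping VideoGPT+ outputs in that style would be the first step. A sketch (the exact field names are an assumption modeled on the Video-ChatGPT format):

import json

# Field names are an assumption; align them with the evaluation script you use.
preds = [
    {"id": "v_0001", "question": "What is the man doing?",
     "answer": "cooking", "pred": "The man is cooking in a kitchen."},
]
with open("msvd_qa_predictions.json", "w") as f:
    json.dump(preds, f, indent=2)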

Inquiry about Costs Associated with Video LLM Benchmarks

Hello everyone,

I have been working on replicating benchmarks for video Large Language Models (LLMs), and I've noticed that most of these benchmarks rely on GPT-assisted evaluation. Given the complexity and potential costs involved, I'm interested in gathering feedback on the financial side of running such evaluations.

Could anyone share their experiences regarding the typical costs involved in running these benchmarks? Any insights into budgeting for such projects would be highly beneficial to the community.

Thank you!
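
For a rough sense of scale, the dominant cost is the GPT API calls: roughly tokens per scored pair x number of QA pairs x price per token. A back-of-the-envelope sketch where every number is an assumption to replace with your own:

# All numbers are assumptions; substitute your benchmark size and current
# API pricing before trusting the output.
num_pairs = 10_000           # QA pairs to score (assumption)
tokens_per_pair = 700        # prompt + completion per scoring call (assumption)
usd_per_1k_tokens = 0.0015   # GPT-3.5-class pricing (assumption; check current rates)
print(f"~${num_pairs * tokens_per_pair / 1000 * usd_per_1k_tokens:,.2f}")  # ~$10.50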

Performance on MVBench

Thanks for open-sourcing this exciting work!

I reproduced MVBench with a single GPU, but across three experiments I could not match the results reported in the paper (my best results are below). Were any other weights used in those tests? I also noticed that setting batch_size_per_gpu=2 drastically degrades performance, even though there is no OOM.

All Acc: [67.5, 57.99999999999999, 80.3030303030303, 48.5, 56.49999999999999, 86.5, 73.86934673366834, 37.0, 30.0, 30.5, 85.0, 38.0, 65.0, 82.5, 42.5, 51.0, 48.0, 31.0, 40.0, 56.00000000000001]% Total Acc: 55.37%
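
For what it's worth, the reported total is essentially the unweighted mean of the 20 per-task accuracies; the tiny gap (55.38 vs 55.37) suggests the official total weights tasks by sample count, which is not equal across tasks (note the 80.3030...% entry):

# Sanity-check the Total Acc against the per-task numbers above.
accs = [67.5, 58.0, 80.3030303030303, 48.5, 56.5, 86.5, 73.86934673366834,
        37.0, 30.0, 30.5, 85.0, 38.0, 65.0, 82.5, 42.5, 51.0, 48.0, 31.0,
        40.0, 56.0]
print(f"{sum(accs) / len(accs):.2f}%")  # -> 55.38%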

Question about Training Time

Hello,

Thank you for sharing your excellent research and code. I am currently running the image-encoder pretraining on 8 A100 GPUs, and the estimated remaining time (ETA) is about 6 hours. Is this normal? Could you share the pretraining and fine-tuning times, along with the number of GPUs used for each setting? It would be very helpful.

Thank you!

About downloading the dataset

Hi @salman-h-khan,
I ran the script to download the dataset and encountered the problem below:

Cloning into 'VideoGPT-plus_Training_Dataset'...
remote: Enumerating objects: 46, done.
remote: Counting objects: 100% (43/43), done.
remote: Compressing objects: 100% (43/43), done.
remote: Total 46 (delta 6), reused 0 (delta 0), pack-reused 3 (from 1)
Unpacking objects: 100% (46/46), 1.87 MiB | 2.48 MiB/s, done.

Filtering content: 100% (24/24), 28.09 GiB | 867.00 KiB/s, done.
Encountered 13 file(s) that may not have been copied correctly on Windows:
        instruction_tuning/k710.tgz
        pretraining/CC3M-595K.tgz
        instruction_tuning/clevrer.tgz
        pretraining/COCO.tgz
        instruction_tuning/videochat_it.tgz
        instruction_tuning/webvid/webvid_1.tgz
        instruction_tuning/webvid/webvid_3.tgz
        instruction_tuning/webvid/webvid_2.tgz
        instruction_tuning/webvid/webvid_5.tgz
        instruction_tuning/webvid/webvid_4.tgz
        instruction_tuning/activitynet_videos.tgz
        instruction_tuning/webvid/webvid_6.tgz
        instruction_tuning/NExTQA.tgz

See: `git lfs help smudge` for more details.

The download directory sizes are:

314G    ./instruction_tuning
26G     ./pretraining
341G    ./.git
405M    ./annotations
681G    .

Is this size expected?
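
One sanity check: the 341G .git directory is mostly the LFS object store, which duplicates the checked-out archives, so the ~681G total being roughly double the real payload looks plausible. To verify whether the 13 flagged archives survived the Windows copy, try reading them end to end; a corrupt .tgz raises an error (paths taken from the warning above):

import tarfile

# Reading the member index forces a full decompression pass, so corruption
# in a flagged archive shows up as an exception.
for path in ["instruction_tuning/k710.tgz", "pretraining/CC3M-595K.tgz"]:
    try:
        with tarfile.open(path, "r:gz") as tar:
            tar.getmembers()
        print(path, "OK")
    except (tarfile.ReadError, EOFError, OSError) as e:
        print(path, "CORRUPT:", e)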

About the pre-training stage

Thank you for sharing your work!

During the pre-training stage, the paper mentions that you used image data to train the video branch. How did you use image data to train the video part? Did you treat each image as a single frame of a video?
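
I can't speak for the authors, but the common trick, and my reading of the paper, is to treat each image as a short clip: either a single frame, or the image repeated to the frame count the video encoder expects. A minimal tensor-shape sketch (the frames-first layout is an assumption):

import torch

image = torch.randn(3, 224, 224)                  # C x H x W
clip_1f = image.unsqueeze(0)                      # 1 x C x H x W: a 1-frame "video"
clip_4f = image.unsqueeze(0).repeat(4, 1, 1, 1)   # image repeated to 4 frames
print(clip_1f.shape, clip_4f.shape)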

Simple Demo

Hey! Thanks for your great work. Do you have any plan to provide a simple demo, i.e., one that takes a video and a question as input, rather than running a full benchmark?

“python setup.py install” for flash-attention reports errors

Hello there,

Thank you for your remarkable work; I am really interested in looking into it. The whole installation process works smoothly until the very last command: "python setup.py install" for flash-attention fails with "RuntimeError: Error compiling objects for extension", along with a huge number of error messages, so I have no clue how to debug it.

Has anyone met a similar issue? Any feedback is greatly appreciated!
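
Two things that usually help with flash-attention builds (general advice, not project-specific): install via pip with `pip install flash-attn --no-build-isolation` instead of `python setup.py install`, and cap the parallel compile jobs with `MAX_JOBS=4` if the build machine runs out of memory. Also check that your CUDA toolkit (nvcc) version matches the one your PyTorch build was compiled against; a mismatch commonly produces exactly this kind of mass compile failure.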

eval/vcgbench/inference/run_ddp_inference.sh

[h264 @ 0x16543c00] Missing reference picture, default is 65562
[h264 @ 0x16543c00] mmco: unref short failure
[h264 @ 0x16543c00] mmco: unref short failure
[h264 @ 0x16543c00] Missing reference picture, default is 65562
[h264 @ 0x16543c00] mmco: unref short failure
[h264 @ 0x16543c00] mmco: unref short failure
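
(These `[h264 @ ...]` messages are warnings from FFmpeg's H.264 decoder about missing or corrupt reference frames in particular videos; decoding generally continues past them, so they can usually be ignored unless the returned frames are empty.)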

The .webm files from SSv2 cannot be loaded

raise DECORDError(err_str)
decord.ffi.base.DECORDError: [05:19:05] /github/workspace/src/video/ffmpeg/threaded_decoder.cc:292: [05:19:05] /github/workspace/src/video/ffmpeg/threaded_decoder.cc:218: Check failed: avcodec_send_packet(dec_ctx.get(), pkt.get()) >= 0 (-11 vs. 0) Thread worker: Error sending packet.
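
A workaround sketch, assuming opencv-python is installed: try decord first, and fall back to OpenCV for clips whose streams decord's FFmpeg build rejects:

import cv2
from decord import VideoReader

def load_frames(path, num_frames=8):
    """Uniformly sample frames; fall back to OpenCV when decord fails."""
    try:
        vr = VideoReader(path)
        idx = [round(i * (len(vr) - 1) / (num_frames - 1)) for i in range(num_frames)]
        return [vr[i].asnumpy() for i in idx]
    except Exception:  # decord raises DECORDError on broken streams
        cap = cv2.VideoCapture(path)
        frames = []
        ok, frame = cap.read()
        while ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            ok, frame = cap.read()
        cap.release()
        step = max(1, len(frames) // num_frames)
        return frames[::step][:num_frames]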

In what order should I reproduce the paper?

Step 1: pretrain_projector_image_encoder.sh
Step 2: pretrain_projector_video_encoder.sh
Step 3: finetune_dual_encoder.sh
Step 4: eval/vcgbench/inference/run_ddp_inference.sh
Step 5: eval/vcgbench/gpt_evaluation/vcgbench_evaluate.sh

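For reference, here is the fine-tuning script (step 3) I am using:
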
#!/bin/sh


export DATASET_DIR=/mnt2/ninghuayang/data/videogpt_plus_dataset

BASE_LLM_PATH=microsoft/Phi-3-mini-4k-instruct
VISION_TOWER=OpenGVLab/InternVideo2-Stage2_1B-224p-f4
IMAGE_VISION_TOWER=openai/clip-vit-large-patch14-336
PROJECTOR_TYPE=mlp2x_gelu
# Either use the released pretrained projectors from Hugging Face:
#PRETRAIN_VIDEO_MLP_PATH=MBZUAI/VideoGPT-plus_Phi3-mini-4k_Pretrain/mlp2x_gelu_internvideo2/mm_projector.bin
#PRETRAIN_IMAGE_MLP_PATH=MBZUAI/VideoGPT-plus_Phi3-mini-4k_Pretrain/mlp2x_gelu_clip_l14_336px/mm_projector.bin
# ...or the projectors produced locally by steps 1 and 2:
PRETRAIN_VIDEO_MLP_PATH=results/mlp2x_gelu_internvideo2/mm_projector.bin
PRETRAIN_IMAGE_MLP_PATH=results/mlp2x_gelu_clip_l14_336px/mm_projector.bin
OUTPUT_DIR_PATH=results/videogpt_plus_finetune

deepspeed videogpt_plus/train/train.py \
--lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
--deepspeed scripts/zero3.json \
--model_name_or_path "$BASE_LLM_PATH" \
--version phi3_instruct \
--dataset_use FINETUNING \
--vision_tower "$VISION_TOWER" \
--image_vision_tower "$IMAGE_VISION_TOWER" \
--mm_projector_type "$PROJECTOR_TYPE" \
--image_mm_projector_type "$PROJECTOR_TYPE" \
--pretrain_mm_mlp_adapter "$PRETRAIN_VIDEO_MLP_PATH" \
--pretrain_image_mm_mlp_adapter "$PRETRAIN_IMAGE_MLP_PATH" \
--mm_vision_select_layer -2 \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--image_aspect_ratio pad \
--group_by_modality_length True \
--bf16 True \
--output_dir $OUTPUT_DIR_PATH \
--num_train_epochs 1 \
--per_device_train_batch_size 24 \
--per_device_eval_batch_size 16 \
--gradient_accumulation_steps 2 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 50000 \
--save_total_limit 1 \
--learning_rate 2e-4 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 4096 \
--gradient_checkpointing True \
--dataloader_num_workers 16 \
--lazy_preprocess True \
--report_to none
