
sevila's Introduction

[NeurIPS 2023] Self-Chained Image-Language Model for Video Localization and Question Answering

[teaser images]

Code structure

# data & data preprocessing
./sevila_data

# pretrained checkpoints
./sevila_checkpoints

# SeViLA code
./lavis/

# running scripts for SeViLA localizer/answerer training/inference
./run_scripts

Setup

Install Dependencies

  1. (Optional) Create a conda environment
conda create -n sevila python=3.8
conda activate sevila
  2. Build from source
pip install -e .
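
A quick sanity check that the editable install worked (nothing SeViLA-specific; it just confirms the lavis package resolves to this repo):

python -c "import lavis; print(lavis.__file__)"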

Download Pretrained Models

We pre-train the SeViLA localizer on QVHighlights and host the checkpoints on Hugging Face. Download the checkpoints and put them under ./sevila_checkpoints. The checkpoints (814.55M) contain the pre-trained localizer and the zero-shot answerer.

If you want to pre-train your own localizer, you can download qformer_loc.pth, which is a copy of the original BLIP-2 Q-Former used to initialize the localizer (with changed model keys).
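
For reference, one way to fetch the pre-trained checkpoint from the command line; the URL below is the one referenced in the default model config, so verify it on the Hugging Face page, and fetch qformer_loc.pth the same way if you need it:

mkdir -p sevila_checkpoints
wget -P sevila_checkpoints https://huggingface.co/Shoubin/SeViLA/resolve/main/sevila_pretrained.pth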

Run Gradio Demo Locally

We also provide a UI, built with Gradio, for testing SeViLA locally. Running the demo locally requires about 12 GB of memory.

  • Install Gradio:
pip install gradio==3.30.0
  • Run the following command in a terminal to launch the demo:
python app.py

Dataset Preparation

We test our model on:

  • NExT-QA, STAR, How2QA, TVQA, and VLEP for video question answering and event prediction
  • QVHighlights for video moment localization (localizer pre-training)

Please download the original QA data and preprocess it with our scripts.
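
A hypothetical layout after preprocessing, mirroring the Code structure section above (directory and file names are illustrative; the exact output depends on the preprocessing scripts and the dataset):

# example dataset layout (illustrative)
./sevila_data/nextqa/train.json
./sevila_data/nextqa/val.json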

Training and Inference

We provide SeViLA training and inference script examples below.

Please refer to the dataset page to customize your data paths; a sketch of overriding them on the command line follows the scripts.

1) Localizer Pre-training

sh run_scripts/sevila/pre-train/pretrain_qvh.sh

2) Answerer Fine-tuning

sh run_scripts/sevila/finetune/nextqa_ft.sh

3) Localizer Self-refinement

sh run_scripts/sevila/refinement/nextqa_sr.sh

4) Inference

sh run_scripts/sevila/inference/nextqa_infer.sh
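
If you need to point a run at your own data, the run scripts pass LAVIS-style --options overrides to evaluate.py. A minimal sketch for NExT-QA inference, modeled on nextqa_infer.sh; the paths are placeholders to replace, and the option keys follow the NExT-QA eval config, so they may differ for other datasets:

result_dir="/path/to/output/"
train_path="/path/to/nextqa/train.json"
val_path="/path/to/nextqa/val.json"
video_path="/path/to/nextqa/videos_or_features"
ckpt="sevila_checkpoints/sevila_pretrained.pth"

CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.run --nproc_per_node=1 evaluate.py \
    --config lavis/projects/sevila/eval/nextqa_eval.yaml \
    --options run.output_dir=${result_dir}nextqa_infer \
    datasets.nextqa.build_info.annotations.train.storage=${train_path} \
    datasets.nextqa.build_info.annotations.val.storage=${val_path} \
    datasets.nextqa.build_info.annotations.test.storage=${val_path} \
    datasets.nextqa.build_info.videos.storage=${video_path} \
    model.finetuned=${ckpt} \
    run.task='videoqa'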

Acknowledgments

We thank the developers of LAVIS, BLIP-2, CLIP, and All-in-One for their public code releases.

Reference

Please cite our paper if you use our models in your work:

@inproceedings{yu2023self,
  title   = {Self-Chained Image-Language Model for Video Localization and Question Answering},
  author  = {Yu, Shoubin and Cho, Jaemin and Yadav, Prateek and Bansal, Mohit},
  booktitle = {NeurIPS},
  year    = {2023}
}

sevila's People

Contributors

eltociear, yui010206


sevila's Issues

Problems encountered when running the inference script

Hi, when I run the inference script sh run_scripts/sevila/inference/nextqa_infer.sh, I keep encountering the following problem. Could you help me solve it? (This is the first time I have worked with multimodal large models, so I don't fully understand them.)

[error screenshots]

The script I used is:

# parameters/data path
result_dir="/home/txz/Result/"
train_path="/home/txz/SeViLA/sevila_datasets/nextqa/train.json"
val_path="/home/txz/SeViLA/sevila_datasets/nextqa/val.json"
video_path="/home/txz/SeViLA/video_feats/nextqa/app_mot_train.h5"

exp_name='nextqa_infer'
ckpt='/home/txz/SeViLA/sevila_checkpoints/sevila_pretrained.pth'

CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.run --nproc_per_node=1 evaluate.py \
    --config lavis/projects/sevila/eval/nextqa_eval.yaml \
    --options run.output_dir=${result_dir}${exp_name} \
    datasets.nextqa.build_info.annotations.train.storage=${train_path} \
    datasets.nextqa.build_info.annotations.val.storage=${val_path} \
    datasets.nextqa.build_info.annotations.test.storage=${val_path} \
    datasets.nextqa.build_info.videos.storage=${video_path} \
    model.frame_num=4 \
    datasets.nextqa.vis_processor.eval.n_frms=32 \
    run.batch_size_eval=8 \
    model.task='qvh_freeze_loc_freeze_qa_vid' \
    model.finetuned=${ckpt} \
    run.task='videoqa'

In addition, I am using two RTX 2080 Ti GPUs. I would also like to confirm what I should download for the video data: the original videos, the video frames, the video features provided by the dataset authors, or some other features?

BLIP2 run scripts

Hi,

Thanks for your excellent work. Would you please release the BLIP-2 run scripts for the NExT-QA dataset?

Best

How to define answer_id?

Hi,

Thanks for releasing the code.
I found that L101 of sevila.py defines answer_id = [71, 272, 205, 309, 262], which correspond to A, B, C, D, E.
Can you let me know how these answer_id values are obtained?
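
One way to inspect such IDs, assuming they are the token IDs of the answer letters under the Flan-T5 tokenizer that BLIP-2 builds on (the tokenizer name below is an assumption; use whichever T5 variant your checkpoint loads):

python -c "
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained('google/flan-t5-xl')
for letter in ['A', 'B', 'C', 'D', 'E']:
    print(letter, tok(letter, add_special_tokens=False).input_ids)
"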

Fine-tuning results are poor when using my own pre-trained checkpoints on QVHighlights

Hi, I used your pre-trained SeViLA localizer checkpoints on QVHighlights to fine-tune on NExT-QA and got NExT-QA results similar to your paper (73.2 vs. 73.8). However, when I used your script to first pre-train a SeViLA localizer myself and then fine-tune on NExT-QA with my own pre-trained localizer checkpoints, I got only an accuracy of 45 in the first epoch (71 when using the checkpoints you gave). I also found that your checkpoint is 815M, while my localizer pre-trained on QVHighlights is 1.4G. Is there any post-processing applied to the pre-trained SeViLA localizer checkpoints?

The result of inference cannot be found in the paper

Hi, I reran the inference code sh run_scripts/sevila/inference/nextqa_infer.sh and got {'agg_metrics': 0.649119295436349, 'total': 4996, 'DC': 60.32608695652174, 'CH': 64.62809917355372, 'CW': 63.91959798994975, 'TN': 57.279236276849645, 'TC': 65.26458616010855, 'DL': 89.84962406015038, 'DO': 73.74631268436578, 'TP': 51.35135135135135}. However, I cannot find an accuracy of 64.9 in your paper. What setting does nextqa_infer.sh use, and why does 64.9 not appear in the paper?

I used the NExT-QA videos and annotations from the original authors' GitHub, the preprocessing code you provided, and the checkpoint you provided.

AttributeError:'NoneType' object has no attribute 'storage'

[error screenshots]

Hello, I have changed the configuration file located in /lavis/configs/datasets/nextqa/defaults_qa.yaml according to the README, but this error occurred. After debugging, I found that the relevant key is not only build_info; there is also features (data_type: features). Can you tell me the solution?

The results on STAR are lower than in the paper

Hi, I used your pre-trained SeViLA localizer checkpoints on QVHighlights for answerer fine-tuning and localizer self-refinement on STAR. I made sure the same batch_size is used for training with 2 GPUs (A100-SXM 80GB). I got results of 60.10 and 61.69 at each step, which are lower than the 62.7 and 64.9 reported in the paper. Is there any different processing used in training?

Problem during inference with VLEP

I processed the data with the indicated Jupyter notebook, and I have something like this:
{"video": "friends_s03e09_seg02_clip_07_ep", "num_option": 2, "qid": "VLEP_20142", "a0": "Ross will stop, turn and point at Monica.", "a1": "Ross will stop and ask Monica why she is pointing at him.", "answer": 0, "start": 38.81, "end": 40.37}

but during inference I get an error:

File "/SeViLA/lavis/datasets/datasets/mc_video_vqa_datasets.py", line 54, in __getitem__
q = ann['question']
KeyError: 'question'

If you go to that file:

[screenshot of mc_video_vqa_datasets.py]

You can see that line 54 asks for ann['question'], which VLEP does not have. I do not know if I forgot to set some path or variable.

Thanks in advance.

HF Spaces demo not working?

Hi, I've been trying to use the demo for inference on some examples. The demo loads properly and I'm able to add videos, but the inference always returns an error. I even tried each of the default examples provided on the website, but the same issue persists. I've attached a sample screenshot of the problem:

[error screenshot]

Could you please help me out on this? Thanks in advance!

Fine-tuned model checkpoints

Great work and code release! I was wondering if you could release the model checkpoints with which the fine-tuning results in Table 1 were achieved? This would help the reproducibility of the work, thanks!

How to fine-tune on my own data?

Hello, I am very interested in your work.
I want to know how to fine-tune the model on a new dataset. Is it Localizer Self-refinement plus Answerer Fine-tuning?

About Localizer

Hello, thank you for your great work!
Does the Localizer+ in your paper refer to the one that has been pre-trained on QVHighlights but not fine-tuned on the QA dataset, i.e., the sevila_pretrained.pth you provided?

How to load the parameters of ViT and T5?

Hello.
I see that the config uses the following entries to load the model, but there are no parameters for ViT or T5 in it. How can I load them?

load_finetuned: True
finetuned: 'https://huggingface.co/Shoubin/SeViLA/resolve/main/sevila_pretrained.pth'

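If it helps, a quick way to see which sub-modules the downloaded checkpoint actually contains (the 'model' nesting and the path are assumptions based on typical LAVIS-style checkpoints; adjust to your setup):

python -c "
import torch
ckpt = torch.load('sevila_checkpoints/sevila_pretrained.pth', map_location='cpu')
state = ckpt.get('model', ckpt)  # LAVIS-style checkpoints usually nest weights under 'model'
print(sorted({k.split('.')[0] for k in state}))
"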

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.