txh-mercury / valor

Codes and Models for VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset

Home Page: https://arxiv.org/abs/2304.08345

License: MIT License

Jupyter Notebook 17.19% Shell 0.58% Python 43.61% C++ 21.73% Cuda 16.63% C 0.13% Makefile 0.03% CSS 0.06% HTML 0.03% Dockerfile 0.02%

vision-language-pretraining audio-language-pretraining audiovisual-language-pretraining multimodal-representation-learning

valor's Issues

TypeError: __init__() missing 2 required positional arguments: 'stdout' and 'stderr'

Thank you very much for your nice work! However, I encountered the following error when executing utils/extract_frame_and_wav_multiprocess.py for processing MSRVTT. Additionally, the progress bar is not being displayed, but the generated video frames (.jpg) and audio files (.wav) do appear in the testt folder. This program has been running for 15 hours with only 2374 audio files and frame files generated.
The error is as follows:

(valor) xxx:/VALOR/utils$ python extract_frame_and_wav_multiprocess.py
0%|                                       | 0/10005 [00:00<?, ?it/s]
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/anaconda3/envs/valor/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/anaconda3/envs/valor/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/anaconda3/envs/valor/lib/python3.9/multiprocessing/pool.py", line 576, in _handle_results
    task = get()
  File "/anaconda3/envs/valor/lib/python3.9/multiprocessing/connection.py", line 256, in recv
    return _ForkingPickler.loads(buf.getbuffer())
TypeError: __init__() missing 2 required positional arguments: 'stdout' and 'stderr'

It's strange that similar errors did not occur when processing the DiDeMo dataset, but they are encountered when handling the MSRVTT dataset. (Is this related to the fact that the DiDeMo dataset doesn't have audio?)
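
A possible explanation, offered as an assumption rather than a confirmed diagnosis: the traceback ends in `_ForkingPickler.loads`, which is where the pool rebuilds an exception that was raised inside a worker. An exception class whose `__init__` requires extra positional arguments (here apparently `stdout` and `stderr`) but passes only a message to `Exception.__init__` cannot be rebuilt from its pickled form. When that happens the pool's result-handler thread dies, which could also explain the stalled progress bar. The sketch below reproduces the mechanism with a hypothetical `ToolError` class and shows one workaround: catch the exception inside the worker and return plain text instead. `ToolError`, `rebuild_like_pickle`, `safe_call`, and `extract` are illustrative names, not part of the repository.

```python
class ToolError(Exception):
    """Hypothetical stand-in for an exception that needs extra constructor arguments."""

    def __init__(self, cmd, stdout, stderr):
        super().__init__(f"{cmd} error")  # only the message ends up in self.args
        self.stdout = stdout
        self.stderr = stderr


def rebuild_like_pickle(exc):
    """Re-create an exception the way unpickling does: cls(*args)."""
    cls, args = exc.__reduce__()[:2]
    return cls(*args)


def safe_call(fn, item):
    """Run fn(item) in the worker; return (item, None) or (item, error text)."""
    try:
        fn(item)
        return item, None
    except Exception as exc:  # convert to a string so the result always pickles
        return item, f"{type(exc).__name__}: {exc}"


def extract(video_id):
    """Hypothetical worker that fails for one clip."""
    if video_id == "video_without_audio":
        raise ToolError("ffmpeg", b"", b"no audio stream")


if __name__ == "__main__":
    try:
        rebuild_like_pickle(ToolError("ffmpeg", b"", b""))
    except TypeError as err:
        print("rebuild failed:", err)
    print(safe_call(extract, "video_without_audio"))
```

With a `multiprocessing.Pool`, mapping `functools.partial(safe_call, extract)` over the video ids would keep the pool alive and let the failing clips be collected afterwards.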

Providing all versions of pretrained weights

Hi, could you also provide all versions of the pretrained weights for BERT, CLIP, and VideoSwin? And could you explain how the different versions of these backbones correspond to the versions of the VALOR weights? Thanks!

Inference Code

Hi. Do you have any plans to release the inference code?

Questions about how to calculate metrics

Hello, I'm new to this field and a bit confused about how the metrics are calculated on the MSRVTT test set, where each video has 20 corresponding descriptive captions. How is the correlation matrix between captions and videos computed, given that the test set has only 2990 videos while the number of captions is 2990x20=59800? I have read your code, but I haven't yet understood the core point. I hope you can explain this to me.
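
One common way to score text-to-video retrieval when each video has several captions is to treat every caption as an independent query. The sketch below assumes that convention; the repository's own evaluation code may differ. The similarity matrix then has one row per caption and one column per video (59800 x 2990 for the numbers above), and a hypothetical `caption_to_video` list records which column is correct for each row.

```python
def text_to_video_recall(sim, caption_to_video, ks=(1, 5, 10)):
    """Recall@K for caption queries.

    sim[i][j] is the similarity of caption i to video j;
    caption_to_video[i] is the index of the ground-truth video for caption i.
    """
    ranks = []
    for cap_idx, scores in enumerate(sim):
        gt_score = scores[caption_to_video[cap_idx]]
        # 0-based rank: number of videos scored strictly higher than the ground truth
        ranks.append(sum(1 for s in scores if s > gt_score))
    total = len(ranks)
    return {f"R@{k}": 100.0 * sum(r < k for r in ranks) / total for k in ks}


if __name__ == "__main__":
    # Toy case: 2 videos, 2 captions each, so 4 rows and 2 columns.
    sim = [
        [0.9, 0.1],
        [0.2, 0.8],
        [0.3, 0.7],
        [0.6, 0.4],
    ]
    caption_to_video = [0, 0, 1, 1]
    print(text_to_video_recall(sim, caption_to_video, ks=(1, 2)))
```

In the toy case two of the four captions rank their own video first, so R@1 is 50.0 and R@2 is 100.0.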

RuntimeError: CUDA error: no kernel image is available for execution on the device

When running the code, the following error appears:

Traceback (most recent call last):
  File "./train.py", line 95, in <module>
  File "./train.py", line 78, in main
    zero_shot_evaluation(model, val_loaders, opts)
  File "/media/yxl/a/2025191008/VALOR-master/train_utils.py", line 247, in zero_shot_evaluation
    eval_log = validate(model, test_loader, opts, global_step=0, total_step=opts.num_train_steps)
  File "/media/yxl/a/2025191008/VALOR-master/test.py", line 26, in validate
    val_log = validate_single(model, loader, task.split('--')[0], opts, global_step, total_step,task.split('--')[1])
  File "/home/yxl/anaconda3/envs/valor_env/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/media/yxl/a/2025191008/VALOR-master/test.py", line 40, in validate_single
    return validate_cap(model, val_loader, task, opts, global_step, dset_name)
  File "/home/yxl/anaconda3/envs/valor_env/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/media/yxl/a/2025191008/VALOR-master/test.py", line 161, in validate_cap
    evaluation_dict = model(batch, task_str, compute_loss=False)
  File "/home/yxl/anaconda3/envs/valor_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yxl/anaconda3/envs/valor_env/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 799, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/yxl/anaconda3/envs/valor_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yxl/anaconda3/envs/valor_env/lib/python3.8/site-packages/apex/amp/_initialize.py", line 196, in new_fwd
    output = old_fwd(*applier(args, input_caster),
  File "/media/yxl/a/2025191008/VALOR-master/model/pretrain.py", line 135, in forward
    return self.forward_cap(batch, task, compute_loss=compute_loss)
  File "/media/yxl/a/2025191008/VALOR-master/model/pretrain.py", line 726, in forward_cap
    return self.generate_cap(batch, task)
  File "/media/yxl/a/2025191008/VALOR-master/model/pretrain.py", line 930, in generate_cap
    video_input = self.get_multimodal_forward_input_video(video_output) 
  File "/media/yxl/a/2025191008/VALOR-master/model/modeling.py", line 490, in get_multimodal_forward_input_video
    video_output =  video_output + self.video_frame_embedding[:,:video_output.shape[1],:].unsqueeze(-2)
RuntimeError: CUDA error: no kernel image is available for execution on the device
  0%|                                                                                                                                                      | 0/1495 [00:00<?, ?it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 15399) of binary: /home/yxl/anaconda3/envs/valor_env/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed

I've tried every method I could find on the Internet but still haven't solved the problem.
My environment:
sys.platform linux
Python 3.8.18 (default, Sep 11 2023, 13:40:15) [GCC 11.2.0]
numpy 1.24.4
detectron2 failed to import
detectron2._C not built correctly: No module named 'detectron2'
Compiler ($CXX) c++ (GCC) 7.3.0
CUDA compiler Build cuda_11.1.TC455_06.29190527_0
DETECTRON2_ENV_MODULE
PyTorch 1.9.0+cu111 @/home/yxl/anaconda3/envs/valor_env/lib/python3.8/site-packages/torch
PyTorch debug build False
GPU available True
GPU 0 GeForce RTX 2080 Ti (arch=7.5)
CUDA_HOME /usr/local/cuda
TORCH_CUDA_ARCH_LIST 7.5
Pillow 10.1.0
torchvision 0.10.0+cu111 @/home/yxl/anaconda3/envs/valor_env/lib/python3.8/site-packages/torchvision
torchvision arch flags 3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6
cv2 Not found

PyTorch built with:

GCC 7.3
C++ Version: 201402
Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
Intel(R) MKL-DNN v2.1.2 (Git Hash 98be7e8afa711dc9b66c8ff3504129cb82013cdb)
OpenMP 201511 (a.k.a. OpenMP 4.5)
NNPACK is enabled
CPU capability usage: AVX2
CUDA Runtime 11.1
NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
CuDNN 8.0.5
Magma 2.5.2
Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,_
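
A note on reading this dump, with the caveat that it is a guess rather than a diagnosis: PyTorch's NVCC flags include `sm_75` and the GPU reports arch 7.5, so PyTorch's own kernels should be loadable on this card. A separately compiled extension built for a different architecture (apex appears in the traceback) is one possible culprit. CUDA errors are also reported asynchronously, so the line shown in the traceback is not necessarily the kernel that failed; running with the environment variable `CUDA_LAUNCH_BLOCKING=1` makes the reported location reliable. The sketch below only checks PyTorch's side; `binary_arch_present` is a hypothetical helper, and it ignores PTX forward compatibility.

```python
def binary_arch_present(capability, arch_list):
    """Return True if arch_list has a compiled binary (sm_XY) for (major, minor)."""
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        archs = torch.cuda.get_arch_list()
        print("device capability:", cap)
        print("PyTorch arch list:", archs)
        print("binary kernels present:", binary_arch_present(cap, archs))
    else:
        print("PyTorch with CUDA is not available in this environment")
```

If this check passes, rebuilding any compiled extensions with `TORCH_CUDA_ARCH_LIST` set to the GPU's architecture would be the next thing to try.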

Guessing what to do to start running this on videos

Hey guys! I got to installing it, apex finally installed, and all is here. Models downloaded.
How do I run the inference on videos? I already extracted the video and audio and now what should I run, train or test?

Errors in loading Bert and attention score calculation

Hi, thanks for your interesting work!

I'm trying to reproduce your results. Running 'sh preinstall.sh', I get some errors with the BERT model.
I downloaded the pretrained weights from the link in your "Download Checkpoints" section. I use the "pretrained_weights/bert_base_uncased_config.json" and "pretrained_weights/bert-base-uncased.bin" files.
However:

I have some unexpected_keys in the multimodal encoder and a lot of missing keys while loading the state_dict. Is this okay?
I got the following error in the cross-attention calculation:

File "/VALOR/model/bert.py", line 334, in forward
    attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
RuntimeError: The size of tensor a (5000) must match the size of tensor b (250) at non-singleton dimension 0

Size of query_layer tensor is [5000, 12, 42, 64], size of key_layer tensor is [250, 12, 49, 64]
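
The two batch sizes differ by a factor of exactly 20 (5000 = 250 x 20). `torch.matmul` broadcasts the leading batch dimensions, which requires each pair of sizes to be equal or one of them to be 1, so 5000 against 250 fails. Why the query batch is 20 times larger is not stated in the issue. The sketch below only reproduces the shape rule in plain Python, using hypothetical helper names, for inputs with the same number of dimensions.

```python
def broadcast_batch_dims(a_batch, b_batch):
    """Broadcast two equal-length tuples of batch dimensions, or raise ValueError."""
    out = []
    for dim, (a, b) in enumerate(zip(a_batch, b_batch)):
        if a == b or b == 1:
            out.append(a)
        elif a == 1:
            out.append(b)
        else:
            raise ValueError(f"size {a} must match size {b} at dimension {dim}")
    return tuple(out)


def matmul_shape(a_shape, b_shape):
    """Result shape of a batched matmul for same-rank inputs (rank >= 2)."""
    if a_shape[-1] != b_shape[-2]:
        raise ValueError("inner dimensions do not match")
    batch = broadcast_batch_dims(a_shape[:-2], b_shape[:-2])
    return batch + (a_shape[-2], b_shape[-1])


if __name__ == "__main__":
    query = (5000, 12, 42, 64)
    key_t = (250, 12, 64, 49)  # key_layer after transpose(-1, -2)
    try:
        matmul_shape(query, key_t)
    except ValueError as err:
        print("mismatch:", err)
    # If every key row were repeated 20 times, the shapes would line up.
    print(matmul_shape(query, (5000, 12, 64, 49)))
```

If each key is meant to be shared by 20 consecutive queries (an assumption), something like `key_layer.repeat_interleave(20, dim=0)` would align the batches, but the real fix depends on why the two inputs were batched differently.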

Comparison between SoTA methods

Hi, I have read your paper; nice work on various video downstream tasks. However, some major or competitive methods are not compared for VideoQA (such as MulTI, mPLUG-2, and UMT-L) and VideoCaption (such as HiTeA and mPLUG-2). These are also SoTA methods and worth comparing against.

I hope you can consider the above suggestions, thanks.

Pre-training Data Release

Hi, nice work. When will the pre-training dataset be released?

Different Results on msrvtt-1kA

Thanks for sharing your work!!!

I tested the test code provided in the README.md on msrvtt-1kA and obtained the following results:

07/18/2023 17:38:05 - INFO - main -   ====-zero-shot evaluation--ret%tva%tv--msrvtt_ret_t_v========

07/18/2023 17:38:05 - INFO - main -   {'video_recall': '39.9/69.2/78.8', 'video_ravg': 62.6, 'video_medianR': 2.0, 'video_meanR': 17.953125}
07/18/2023 17:38:05 - INFO - main -   ====-zero-shot evaluation--ret%tva%tv--msrvtt_ret_t_va========

07/18/2023 17:38:05 - INFO - main -   {'video_recall': '43.0/72.1/82.1', 'video_ravg': 65.7, 'video_medianR': 2.0, 'video_meanR': 15.1953125}

Why is the result much lower than the official announcement?

Information on where to find the frames_fps4 and audio_22050hz sections

Hello, I read your paper, great work! While trying to run your code, I found that only the raw_videos section of the MSRVTT dataset is publicly available online. Could you please provide information on where to find the frames_fps4 and audio_22050hz sections? Thank you very much!

Code to perform QA task

Does anybody have inference code or a notebook to run VALOR for the QA task? Or at least a notebook for information retrieval?

link to the pretrained_weights is not available

Hi!
Thanks for your great job!

The link to the "pretrained_weights" (first in the section "Download Checkpoints") is not available. It gives me "You need access"

AssertionError when calculating BLEU score

Thanks for the code and documentation. I am running the captioning finetuning experiment on MSRVTT. During the evaluation stage, the code stops with an AssertionError here. It seems the hypo variable contains the same sentence repeated multiple times. Can you please tell me whether I missed a step or, if not, why this error occurs and how to solve it?

Here is the generated hypo variable and the ref variable output for video9894:

['in the room a man in red was talking to the camera', 'in the room a man in red was talking to the camera', 'in the room a man in red was talking to the camera', 'in the room a man in red was talking to the camera', 'in the room a man in red was talking to the camera', 'in the room a man in red was talking to the camera', 'in the room a man in red was talking to the camera', 'in the room a man in red was talking to the camera', 'in the room a man in red was talking to the camera']
['a boy is talking to his roomnates who are in different room', 'a man asks another man to help him with chores', 'a man avoids helping his roommates', 'a man doesn t help his friends with anything', 'a man drinking some beer', 'a man walking in an apartment', 'a person communicating with other person', 'he was in the kitchen', 'man asking another man to do the dishes', 'man refusing to help his roommate', 'roommate continues to say no each time he is asked to help with something', 'the boys meet courier boy', 'the man asks for help', 'the youtube nigahiga doesn t want to help anyone', 'two friends are having fun', 'two men are talking in a kitchen', 'two young men talking to each other about doing dishes', 'a man walking in an apartment', 'a man drinking some beer', 'roommate continues to say no each time he is asked to help with something']
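
The printed hypo holds nine identical sentences, while caption scorers in the COCO evaluation style generally expect exactly one hypothesis per video id and assert on that. Assuming that is the assertion being hit (the issue's link to the exact line is not preserved), one workaround is to collapse each id's hypotheses to a single sentence before scoring. `collapse_hypotheses` below is a hypothetical helper, and it does not address why the duplicates are produced in the first place.

```python
def collapse_hypotheses(results):
    """Keep one hypothesis per id; refuse to guess when an id has conflicting sentences."""
    collapsed = {}
    for video_id, sentences in results.items():
        unique = list(dict.fromkeys(sentences))  # de-duplicate, preserving order
        if len(unique) != 1:
            raise ValueError(
                f"{video_id}: expected one distinct hypothesis, got {len(unique)}"
            )
        collapsed[video_id] = unique
    return collapsed


if __name__ == "__main__":
    sentence = "in the room a man in red was talking to the camera"
    print(collapse_hypotheses({"video9894": [sentence] * 9}))
```

Raising on conflicting sentences, rather than silently picking the first, keeps a genuine bookkeeping bug from being hidden by the workaround.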

Thanks!

About prerequisite

Thanks for your very good work. Excellent performance on my audio-text multimodal dataset.
I'd like to ask if there are any plans to streamline or optimize the NVIDIA/Apex-related content in the future. I got tons of weird bugs when setting up my environment, and I believe this will also cause problems for others who want to try your repo.

"Output file #0 does not contain any stream"

Thank you for your excellent contribution. I found a problem when running the code: when I processed the MSRVTT dataset with extract_frame_and_wav_multiprocess.py, the error "Output file #0 does not contain any stream" appeared for some videos. In the end, all the frames were successfully extracted for the MSRVTT dataset, but only 8811 audio files were extracted. Is this normal, perhaps because some videos in the MSRVTT dataset have no audio? Looking forward to your response, and thank you again for your contribution to the field of video retrieval!
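
One way to test the no-audio hypothesis is to ask ffprobe whether each file contains an audio stream before extracting. The sketch below assumes `ffprobe` is installed and on PATH; `parse_audio_streams` and `has_audio` are hypothetical helpers and are not part of the repository's script.

```python
import subprocess


def parse_audio_streams(ffprobe_output):
    """Return the non-empty lines ffprobe printed, one per audio stream."""
    return [line.strip() for line in ffprobe_output.splitlines() if line.strip()]


def has_audio(video_path):
    """Ask ffprobe whether the file has at least one audio stream."""
    cmd = [
        "ffprobe", "-v", "error",
        "-select_streams", "a",           # audio streams only
        "-show_entries", "stream=index",  # print each matching stream's index
        "-of", "csv=p=0",
        video_path,
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return len(parse_audio_streams(out)) > 0
```

Counting the files for which `has_audio` returns False and comparing that with the number of missing .wav files would show whether the gap is fully explained by silent videos.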

Strange error, but it works normally

Thank you very much for your excellent contribution. However, I found a problem while running the code: during training or inference, the program repeatedly prints the following warnings, although it keeps running:

torch.cat(): Sizes of tensors must match except in dimension 0. Got 240 and 224 in dimension 2 (The offending index is 3)
05/17/2023 20:37:16 - INFO - main - current idx video7812 from ret returns wrong image/video, use 4266 instead.
torch.cat(): Sizes of tensors must match except in dimension 0. Got 240 and 224 in dimension 2 (The offending index is 3)
05/17/2023 20:37:17 - INFO - main - current idx video3481 from ret returns wrong image/video, use 225 instead.

Some videos in VALOR-32K are unavailable on YouTube

Thank you so much for your excellent work!

I found that many videos in the proposed VALOR-32K are unavailable on YouTube: 146 videos in the test set and 151 in the val set. I am still downloading the videos in the train set.
Could you provide the original raw videos for the unavailable ones? This is a large number of videos, which can definitely affect the model's performance.

Plan to release finetuned models?

Hi authors,

Amazing paper and thanks for providing this nice code base. I have a question regarding the finetuned model, specifically for video-text retrieval task. Do you have plans to release those models? I do understand that we can use the pretrained VALOR as provided in the main page README (shown below)

Download Checkpoints

pretrained_weights (BERT,CLIP,VideoSwin). Put pretrained_weights dir under main path. (VALOR/pretrained_weights)
VALOR-base. Put VALOR-base under the output dir. (VALOR/output/VALOR-base)
VALOR-large. Put VALOR-large under the output dir. (VALOR/output/VALOR-large)

to finetune the pretrained models for down-stream tasks. But in the paper, the implementation details suggest using 8 A100 GPUs which I don't have. So I probably cannot reproduce the good results reported in the paper. Therefore, I am wondering if you plan to release the finetuned models for video-text retrieval task?

Thanks!
Shane

Inference code

Hi, would it be possible to release a demo or code through which I could perform inference on a video? I would like to get embeddings (MGA) for video clips, text, and background audio in the same latent space.

A question about the optimizer:

There are two functions in VALOR/optim/misc.py, one is build_optimizer and the other is build_optimizer_for_VQA. Is the second one specifically for the VQA task, while the first one is for other tasks? Which function did you use to obtain the results listed in the paper?