
txh-mercury / valor

238 stars, 9 watchers, 14 forks, 77.42 MB

Codes and Models for VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset

Home Page: https://arxiv.org/abs/2304.08345

License: MIT License

Jupyter Notebook 17.19% Shell 0.58% Python 43.61% C++ 21.73% Cuda 16.63% C 0.13% Makefile 0.03% CSS 0.06% HTML 0.03% Dockerfile 0.02%
vision-language-pretraining audio-language-pretraining audiovisual-language-pretraining multimodal-representation-learning

valor's People

Contributors

johncaged, lihanddd, txh-mercury



valor's Issues

TypeError: __init__() missing 2 required positional arguments: 'stdout' and 'stderr'

Thank you very much for your nice work! However, I encountered the following error when running utils/extract_frame_and_wav_multiprocess.py to process MSRVTT. Additionally, the progress bar is not displayed, although the generated video frames (.jpg) and audio files (.wav) do appear in the testt folder. The script has been running for 15 hours and has only produced 2374 audio and frame files.
The error is as follows:

(valor) xxx:/VALOR/utils$ python extract_frame_and_wav_multiprocess.py
0%|                                       | 0/10005 [00:00<?, ?it/s]
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/anaconda3/envs/valor/lib/python3.9/threading.py", line 973, in _bootstrap_inner
    self.run()
  File "/anaconda3/envs/valor/lib/python3.9/threading.py", line 910, in run
    self._target(*self._args, **self._kwargs)
  File "/anaconda3/envs/valor/lib/python3.9/multiprocessing/pool.py", line 576, in _handle_results
    task = get()
  File "/anaconda3/envs/valor/lib/python3.9/multiprocessing/connection.py", line 256, in recv
    return _ForkingPickler.loads(buf.getbuffer())
TypeError: __init__() missing 2 required positional arguments: 'stdout' and 'stderr'

It's strange that similar errors did not occur when processing the DiDeMo dataset, but they are encountered when handling the MSRVTT dataset. (Is this related to the fact that the DiDeMo dataset doesn't have audio?)
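
In case it helps anyone hitting the same trace: this failure pattern typically means a worker process raised an exception whose class requires extra constructor arguments (ffmpeg-python's ffmpeg.Error, which takes cmd, stdout and stderr, is a common culprit), so the parent cannot unpickle it, the pool's result-handler thread dies, and the progress bar stops updating. A minimal workaround sketch, using a hypothetical per-video worker rather than the script's actual code, is to catch everything inside the worker and return only plain, picklable values:

import subprocess
from multiprocessing import Pool

def process_one(video_path):
    # Hypothetical worker: extract audio with ffmpeg, but never let a
    # non-picklable exception propagate back to the parent process.
    try:
        subprocess.run(
            ["ffmpeg", "-y", "-i", video_path, "-vn", video_path + ".wav"],
            check=True, capture_output=True,
        )
        return (video_path, "ok", "")
    except subprocess.CalledProcessError as e:
        # Plain strings always pickle cleanly.
        return (video_path, "failed", e.stderr.decode(errors="ignore"))
    except Exception as e:
        return (video_path, "failed", repr(e))

if __name__ == "__main__":
    videos = ["video0.mp4", "video1.mp4"]  # placeholder list
    with Pool(8) as pool:
        for path, status, msg in pool.imap_unordered(process_one, videos):
            if status != "ok":
                print(path, msg)

Failed clips are then reported instead of silently killing the result thread, which would also explain why only part of the dataset gets processed.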

Providing all versions of pretrained weights

Hi, could you also provide all versions of the pretrained weights for BERT, CLIP, and VideoSwin? And could you explain how the different versions of these backbones correspond to the versions of the VALOR weights? Thanks!

Inference Code

Hi. Do you have any plans to release the inference code?

Questions about how to calculate metrics

Hello, I'm new to this field and a bit confused about how the retrieval metric on MSRVTT is calculated when each video has 20 corresponding captions. How is the similarity matrix between captions and videos computed, given that the test set contains only 2990 videos but 2990 x 20 = 59800 captions? I have read your code but still haven't grasped the core point. I hope you can explain this to me.
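
In case the convention itself is the confusing part: the usual approach for multi-caption retrieval is to treat every caption as an independent query, so the similarity matrix has one row per caption (59800 on the full test split with 20 captions per video, or typically 1000 on the 1k-A split with one caption each) and one column per video, and recall@K is the fraction of caption queries whose ground-truth video appears in the top K. A minimal sketch of that computation, not taken from this repo:

import torch

def recall_at_k(sim, gt_video_idx, ks=(1, 5, 10)):
    # sim: [num_captions, num_videos] caption-to-video similarity matrix.
    # gt_video_idx: [num_captions] index of the ground-truth video per caption.
    ranks = sim.argsort(dim=1, descending=True)
    # Position (0 = best) of the ground-truth video in each caption's ranking.
    gt_rank = (ranks == gt_video_idx.unsqueeze(1)).nonzero()[:, 1]
    return {f"R@{k}": (gt_rank < k).float().mean().item() * 100 for k in ks}

# Toy example: 6 captions (3 per video) over 2 videos.
sim = torch.randn(6, 2)
gt = torch.tensor([0, 0, 0, 1, 1, 1])
print(recall_at_k(sim, gt))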

RuntimeError: CUDA error: no kernel image is available for execution on the device

When running the code, the following error appears:

Traceback (most recent call last):
  File "./train.py", line 95, in <module>
  File "./train.y", line 78, in main
    zero_shot_evaluation(model, val_loaders, opts)
  File "/media/yxl/a/2025191008/VALOR-master/train_utils.py", line 247, in zero_shot_evaluation
    eval_log = validate(model, test_loader, opts, global_step=0, total_step=opts.num_train_steps)
  File "/media/yxl/a/2025191008/VALOR-master/test.py", line 26, in validate
    val_log = validate_single(model, loader, task.split('--')[0], opts, global_step, total_step,task.split('--')[1])
  File "/home/yxl/anaconda3/envs/valor_env/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/media/yxl/a/2025191008/VALOR-master/test.py", line 40, in validate_single
    return validate_cap(model, val_loader, task, opts, global_step, dset_name)
  File "/home/yxl/anaconda3/envs/valor_env/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/media/yxl/a/2025191008/VALOR-master/test.py", line 161, in validate_cap
    evaluation_dict = model(batch, task_str, compute_loss=False)
  File "/home/yxl/anaconda3/envs/valor_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yxl/anaconda3/envs/valor_env/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 799, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/yxl/anaconda3/envs/valor_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/yxl/anaconda3/envs/valor_env/lib/python3.8/site-packages/apex/amp/_initialize.py", line 196, in new_fwd
    output = old_fwd(*applier(args, input_caster),
  File "/media/yxl/a/2025191008/VALOR-master/model/pretrain.py", line 135, in forward
    return self.forward_cap(batch, task, compute_loss=compute_loss)
  File "/media/yxl/a/2025191008/VALOR-master/model/pretrain.py", line 726, in forward_cap
    return self.generate_cap(batch, task)
  File "/media/yxl/a/2025191008/VALOR-master/model/pretrain.py", line 930, in generate_cap
    video_input = self.get_multimodal_forward_input_video(video_output) 
  File "/media/yxl/a/2025191008/VALOR-master/model/modeling.py", line 490, in get_multimodal_forward_input_video
    video_output =  video_output + self.video_frame_embedding[:,:video_output.shape[1],:].unsqueeze(-2)
RuntimeError: CUDA error: no kernel image is available for execution on the device
  0%|                                                                                                                                                      | 0/1495 [00:00<?, ?it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 15399) of binary: /home/yxl/anaconda3/envs/valor_env/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed

I've tried every method I could find online, but I still can't solve the problem.
My environment:
sys.platform linux
Python 3.8.18 (default, Sep 11 2023, 13:40:15) [GCC 11.2.0]
numpy 1.24.4
detectron2 failed to import
detectron2._C not built correctly: No module named 'detectron2'
Compiler ($CXX) c++ (GCC) 7.3.0
CUDA compiler Build cuda_11.1.TC455_06.29190527_0
DETECTRON2_ENV_MODULE
PyTorch 1.9.0+cu111 @/home/yxl/anaconda3/envs/valor_env/lib/python3.8/site-packages/torch
PyTorch debug build False
GPU available True
GPU 0 GeForce RTX 2080 Ti (arch=7.5)
CUDA_HOME /usr/local/cuda
TORCH_CUDA_ARCH_LIST 7.5
Pillow 10.1.0
torchvision 0.10.0+cu111 @/home/yxl/anaconda3/envs/valor_env/lib/python3.8/site-packages/torchvision
torchvision arch flags 3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6
cv2 Not found


PyTorch built with:

  • GCC 7.3
  • C++ Version: 201402
  • Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v2.1.2 (Git Hash 98be7e8afa711dc9b66c8ff3504129cb82013cdb)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 11.1
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  • CuDNN 8.0.5
  • Magma 2.5.2
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,_
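
Not an official answer, but "no kernel image is available" usually means the kernels being run were not compiled for the GPU's compute capability, either in the PyTorch build itself or in a compiled extension such as Apex, even when the environment report looks correct. A quick sanity check, as a sketch:

import torch

print(torch.__version__, torch.version.cuda)
print(torch.cuda.get_device_name(0))
print(torch.cuda.get_device_capability(0))  # (7, 5) for an RTX 2080 Ti
print(torch.cuda.get_arch_list())           # arches the installed PyTorch kernels cover

# If PyTorch itself lists sm_75 but the error persists, a compiled extension
# (e.g. Apex) is the likely culprit; rebuilding it on this machine with
# TORCH_CUDA_ARCH_LIST="7.5" set in the environment usually resolves it.

If torch.cuda.get_arch_list() does not include sm_75, reinstalling a PyTorch wheel that matches the local CUDA setup is the first thing to try.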

Errors in loading Bert and attention score calculation

Hi, thanks for your interesting work!

I'm trying to reproduce your results. Running 'sh preinstall.sh', I get some errors with the BERT model.
I downloaded pretrained weights from your link from "Download Checkpoints" section. I use "pretrained_weights/bert_base_uncased_config.json" and "pretrained_weights/bert-base-uncased.bin" files.
However

  1. I get some unexpected keys in the multimodal encoder and a lot of missing keys while loading the state_dict. Is that okay?
  2. I get the following error in the cross-attention calculation:
File "/VALOR/model/bert.py", line 334, in forward
    attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
RuntimeError: The size of tensor a (5000) must match the size of tensor b (250) at non-singleton dimension 0

Size of query_layer tensor is [5000, 12, 42, 64], size of key_layer tensor is [250, 12, 49, 64]

Comparison between SoTA methods

Hi, I have read your paper; nice work on various video downstream tasks. However, some major competitive methods are not compared against for VideoQA (such as MulTI, mPLUG-2, and UMT-L) and video captioning (such as HiTeA and mPLUG-2). These methods are also SoTA and worth comparing against.

Hope you can consider the above suggestions, thanks.

Different Results on msrvtt-1kA

Thanks for sharing your work!!!

I tested the test code provided in the README.md on msrvtt-1kA and obtained the following results:

07/18/2023 17:38:05 - INFO - main -   ====-zero-shot evaluation--ret%tva%tv--msrvtt_ret_t_v========

07/18/2023 17:38:05 - INFO - main -   {'video_recall': '39.9/69.2/78.8', 'video_ravg': 62.6, 'video_medianR': 2.0, 'video_meanR': 17.953125}
07/18/2023 17:38:05 - INFO - main -   ====-zero-shot evaluation--ret%tva%tv--msrvtt_ret_t_va========

07/18/2023 17:38:05 - INFO - main -   {'video_recall': '43.0/72.1/82.1', 'video_ravg': 65.7, 'video_medianR': 2.0, 'video_meanR': 15.1953125}

Why are these results much lower than the officially reported ones?

Code to perform QA task

Does anybody have inference code or a notebook to run VALOR for the QA task? Or any notebook at least for information retrieval?

AssertionError when calculating BLEU score

Thanks for the code and documentation. I am running the captioning finetuning experiment on MSRVTT. During the evaluation stage, the code stops with an AssertionError here. It seems like the hypo variable contains the same sentence repeated multiple times. Can you please tell me whether I missed a step or, if not, why this error occurs and how to solve it?

Here is the generated hypo variable and the ref variable output for video9894:

['in the room a man in red was talking to the camera', 'in the room a man in red was talking to the camera', 'in the room a man in red was talking to the camera', 'in the room a man in red was talking to the camera', 'in the room a man in red was talking to the camera', 'in the room a man in red was talking to the camera', 'in the room a man in red was talking to the camera', 'in the room a man in red was talking to the camera', 'in the room a man in red was talking to the camera']
['a boy is talking to his roomnates who are in different room', 'a man asks another man to help him with chores', 'a man avoids helping his roommates', 'a man doesn t help his friends with anything', 'a man drinking some beer', 'a man walking in an apartment', 'a person communicating with other person', 'he was in the kitchen', 'man asking another man to do the dishes', 'man refusing to help his roommate', 'roommate continues to say no each time he is asked to help with something', 'the boys meet courier boy', 'the man asks for help', 'the youtube nigahiga doesn t want to help anyone', 'two friends are having fun', 'two men are talking in a kitchen', 'two young men talking to each other about doing dishes', 'a man walking in an apartment', 'a man drinking some beer', 'roommate continues to say no each time he is asked to help with something']

Thanks!
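
For anyone debugging the same assertion: the COCO caption scorers (pycocoevalcap) expect exactly one hypothesis sentence per video id, while the reference list may hold any number of sentences, so a hypo entry containing the same sentence nine times would trip that check. A minimal sketch of the expected shape, assuming the standard pycocoevalcap interface:

from pycocoevalcap.bleu.bleu import Bleu

# Exactly one generated caption per video id...
hypo = {"video9894": ["in the room a man in red was talking to the camera"]}
# ...but any number of reference captions per video id.
refs = {"video9894": [
    "a boy is talking to his roomnates who are in different room",
    "two men are talking in a kitchen",
]}

score, _ = Bleu(4).compute_score(refs, hypo)
print(score)  # BLEU-1 through BLEU-4

So the question is really why the evaluation loop collects nine copies of the generated caption for one video before scoring.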

About prerequisite

Thanks for your very good work. It performs excellently on my audio-text multimodal dataset.
I'd like to ask whether there are any plans to streamline or optimize the NVIDIA Apex related setup in the future. I ran into lots of weird bugs when setting up my environment, and I believe this will also cause problems for others who want to try your repo.
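
Not a fix for the Apex build itself, but possibly useful: recent PyTorch versions ship native mixed precision (torch.cuda.amp) that covers the most common use of Apex AMP without any compiled extension. Whether it is a drop-in replacement for this codebase I cannot say, but the basic pattern, as a sketch, looks like this:

import torch

model = torch.nn.Linear(512, 512).cuda()      # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

for step in range(10):                         # stand-in training loop
    x = torch.randn(8, 512, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():            # forward pass in mixed precision
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()              # scale loss to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()

Switching would still require touching the places where the training code calls into apex.amp, so treat this only as a possible direction.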

"Output file #0 does not contain any stream"

Thank you for your excellent contribution, but I ran into a problem while running the code. When I processed the MSRVTT dataset with extract_frame_and_wav_multiprocess.py, some videos produced the error "Output file #0 does not contain any stream". In the end, frames were extracted successfully for the whole MSRVTT dataset, but only 8811 audio files were produced. Is this normal, perhaps because some videos in the MSRVTT dataset have no audio? Looking forward to your response, and thank you again for your contribution to the field of video retrieval!
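
For reference, that ffmpeg message is exactly what you get when the input has no audio stream and the command maps only audio into the output, so a shortfall of audio files is expected if some MSRVTT clips are silent, rather than a sign of a processing bug. One way to check which clips actually contain audio, as a sketch using ffprobe (paths are placeholders):

import subprocess
from pathlib import Path

def has_audio(video_path):
    # ffprobe prints one line per audio stream; empty output means no audio.
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "a",
         "-show_entries", "stream=codec_name", "-of", "csv=p=0", str(video_path)],
        capture_output=True, text=True,
    )
    return bool(out.stdout.strip())

silent = [p for p in Path("msrvtt_videos").glob("*.mp4") if not has_audio(p)]
print(len(silent), "clips have no audio stream")

If the count of silent clips roughly matches the number of missing .wav files, the extraction worked as intended.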

Strange error, but it works normally

Thank you very much for your excellent contribution, but I found a problem while running the code. During training or inference, the program repeatedly prints the following warnings, yet it keeps running:

torch.cat(): Sizes of tensors must match except in dimension 0. Got 240 and 224 in dimension 2 (The offending index is 3)
05/17/2023 20:37:16 - INFO - main - current idx video7812 from ret returns wrong image/video, use 4266 instead.
torch.cat(): Sizes of tensors must match except in dimension 0. Got 240 and 224 in dimension 2 (The offending index is 3)
05/17/2023 20:37:17 - INFO - main - current idx video3481 from ret returns wrong image/video, use 225 instead.
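
If it helps to have context: the warning indicates that not all extracted frames share one resolution (240 vs 224 on one side), so stacking the frames for that sample fails and the loader falls back to a replacement index. Resizing or center-cropping every frame to a fixed size before stacking avoids those retries; a minimal sketch of such a transform, not the repo's actual preprocessing:

import torch
from PIL import Image
from torchvision import transforms

# Force every frame to the same spatial size before stacking into a clip tensor.
frame_tf = transforms.Compose([
    transforms.Resize(224),       # shorter side -> 224
    transforms.CenterCrop(224),   # then crop to exactly 224x224
    transforms.ToTensor(),
])

def load_clip(frame_paths):
    # frame_paths: list of .jpg paths for one video; returns [T, 3, 224, 224].
    return torch.stack([frame_tf(Image.open(p).convert("RGB")) for p in frame_paths])

Since the code already substitutes another sample, training still runs, but the affected videos are effectively dropped.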

Some videos in VALOR-32K are unavailable on YouTube

Thank you so much for your excellent work!

I found that many videos in the proposed VALOR-32K are unavailable on YouTube: 146 videos in the test set and 151 in the val set, respectively, and I am still downloading the train set.
Could you provide the original raw videos for the unavailable ones? It is a large number of videos, which can definitely affect the model's performance.

Plan to release finetuned models?

Hi authors,

Amazing paper, and thanks for providing this nice code base. I have a question regarding the finetuned models, specifically for the video-text retrieval task. Do you have plans to release those models? I understand that we can use the pretrained VALOR checkpoints as provided in the main README (shown below)

Download Checkpoints

  • pretrained_weights (BERT,CLIP,VideoSwin). Put pretrained_weights dir under main path. (VALOR/pretrained_weights)
  • VALOR-base. Put VALOR-base under the output dir. (VALOR/output/VALOR-base)
  • VALOR-large. Put VALOR-large under the output dir. (VALOR/output/VALOR-large)

to finetune the pretrained models for downstream tasks. However, the implementation details in the paper suggest using 8 A100 GPUs, which I don't have, so I probably cannot reproduce the good results reported in the paper. Therefore, I am wondering whether you plan to release the finetuned models for the video-text retrieval task?

Thanks!
Shane

Inference code

Hi, would it be possible to release a demo or code through which I could perform inference on a video? I would like to get embeddings (MGA) for video clips, text, and background audio in the same latent space.

A question about the optimizer

There are two functions in VALOR/optim/misc.py, one is build_optimizer and the other is build_optimizer_for_VQA. Is the second one specifically for the VQA task, while the first one is for other tasks? Which function did you use to obtain the results listed in the paper?
