txh-mercury / valor
Codes and Models for VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
Home Page: https://arxiv.org/abs/2304.08345
License: MIT License
Thank you very much for your nice work! However, I encountered the following error when executing utils/extract_frame_and_wav_multiprocess.py to process MSRVTT. Additionally, the progress bar is not displayed, although the generated video frames (.jpg) and audio files (.wav) do appear in the testt folder. The program has been running for 15 hours and has generated only 2374 audio files and frame files.
The error is as follows:
(valor) xxx:/VALOR/utils$ python extract_frame_and_wav_multiprocess.py
0%| | 0/10005 [00:00<?, ?it/s]
Exception in thread Thread-3:
Traceback (most recent call last):
File "/anaconda3/envs/valor/lib/python3.9/threading.py", line 973, in _bootstrap_inner
self.run()
File "/anaconda3/envs/valor/lib/python3.9/threading.py", line 910, in run
self._target(*self._args, **self._kwargs)
File "/anaconda3/envs/valor/lib/python3.9/multiprocessing/pool.py", line 576, in _handle_results
task = get()
File "/anaconda3/envs/valor/lib/python3.9/multiprocessing/connection.py", line 256, in recv
return _ForkingPickler.loads(buf.getbuffer())
TypeError: __init__() missing 2 required positional arguments: 'stdout' and 'stderr'
It's strange that similar errors did not occur when processing the DiDeMo dataset, but they do occur when handling the MSRVTT dataset. (Is this related to the fact that the DiDeMo dataset doesn't have audio?)
Hi, could you also provide all versions of the pretrained weights for BERT, CLIP, and VideoSwin? And could you explain how the different versions of these backbones correspond to the versions of the VALOR weights? Thanks!
Hi. Do you have any plans to release the inference code?
Hello, I'm new to this field and a bit confused about how to calculate the metrics on the MSRVTT test set, where each video has 20 corresponding descriptive captions. How is the correlation matrix between captions and videos computed, given that the test set has only 2990 videos while the number of captions is 2990x20=59800? I have read your code but have not yet understood the core point here. I hope you can explain this to me.
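One common way to evaluate text-to-video retrieval when several captions share a video is to score every caption against every video and rank each caption's ground-truth video within its own row. The sketch below illustrates that idea on toy numbers; it is not taken from the VALOR code, and `caption_ranks`, `recall_at_k`, and the toy scores are illustrative assumptions.

```python
def caption_ranks(sim, caption_to_video):
    """Rank of the ground-truth video for each caption.

    sim[i][j] is the similarity between caption i and video j, so the matrix
    has one row per caption and one column per video.
    caption_to_video[i] is the column index of the video caption i describes.
    """
    ranks = []
    for row, gt in zip(sim, caption_to_video):
        gt_score = row[gt]
        # Rank 1 means no other video scored strictly higher than the ground truth.
        ranks.append(1 + sum(1 for score in row if score > gt_score))
    return ranks


def recall_at_k(ranks, k):
    """Fraction of captions whose ground-truth video is ranked within the top k."""
    return sum(1 for rank in ranks if rank <= k) / len(ranks)


# Toy example: 4 captions, 3 videos; the last two captions describe video 2.
sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.8, 0.1],
    [0.2, 0.7, 0.6],
    [0.1, 0.2, 0.3],
]
caption_to_video = [0, 1, 2, 2]
ranks = caption_ranks(sim, caption_to_video)
print(ranks, recall_at_k(ranks, 1), recall_at_k(ranks, 2))
```

With 2990 videos and 20 captions each, the same layout would give a 59800 x 2990 matrix, and the recall values are averaged over the 59800 caption rows.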
When running the code, the following error appears:
Traceback (most recent call last):
File "./train.py", line 95, in <module>
File "./train.py", line 78, in main
zero_shot_evaluation(model, val_loaders, opts)
File "/media/yxl/a/2025191008/VALOR-master/train_utils.py", line 247, in zero_shot_evaluation
eval_log = validate(model, test_loader, opts, global_step=0, total_step=opts.num_train_steps)
File "/media/yxl/a/2025191008/VALOR-master/test.py", line 26, in validate
val_log = validate_single(model, loader, task.split('--')[0], opts, global_step, total_step,task.split('--')[1])
File "/home/yxl/anaconda3/envs/valor_env/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/media/yxl/a/2025191008/VALOR-master/test.py", line 40, in validate_single
return validate_cap(model, val_loader, task, opts, global_step, dset_name)
File "/home/yxl/anaconda3/envs/valor_env/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/media/yxl/a/2025191008/VALOR-master/test.py", line 161, in validate_cap
evaluation_dict = model(batch, task_str, compute_loss=False)
File "/home/yxl/anaconda3/envs/valor_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/yxl/anaconda3/envs/valor_env/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 799, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/yxl/anaconda3/envs/valor_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/yxl/anaconda3/envs/valor_env/lib/python3.8/site-packages/apex/amp/_initialize.py", line 196, in new_fwd
output = old_fwd(*applier(args, input_caster),
File "/media/yxl/a/2025191008/VALOR-master/model/pretrain.py", line 135, in forward
return self.forward_cap(batch, task, compute_loss=compute_loss)
File "/media/yxl/a/2025191008/VALOR-master/model/pretrain.py", line 726, in forward_cap
return self.generate_cap(batch, task)
File "/media/yxl/a/2025191008/VALOR-master/model/pretrain.py", line 930, in generate_cap
video_input = self.get_multimodal_forward_input_video(video_output)
File "/media/yxl/a/2025191008/VALOR-master/model/modeling.py", line 490, in get_multimodal_forward_input_video
video_output = video_output + self.video_frame_embedding[:,:video_output.shape[1],:].unsqueeze(-2)
RuntimeError: CUDA error: no kernel image is available for execution on the device
0%| | 0/1495 [00:00<?, ?it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 15399) of binary: /home/yxl/anaconda3/envs/valor_env/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
I've tried every method I could find on the Internet but still haven't solved the problem.
My environment:
_sys.platform linux
Python 3.8.18 (default, Sep 11 2023, 13:40:15) [GCC 11.2.0]
numpy 1.24.4
detectron2 failed to import
detectron2._C not built correctly: No module named 'detectron2'
Compiler ($CXX) c++ (GCC) 7.3.0
CUDA compiler Build cuda_11.1.TC455_06.29190527_0
DETECTRON2_ENV_MODULE
PyTorch 1.9.0+cu111 @/home/yxl/anaconda3/envs/valor_env/lib/python3.8/site-packages/torch
PyTorch debug build False
GPU available True
GPU 0 GeForce RTX 2080 Ti (arch=7.5)
CUDA_HOME /usr/local/cuda
TORCH_CUDA_ARCH_LIST 7.5
Pillow 10.1.0
torchvision 0.10.0+cu111 @/home/yxl/anaconda3/envs/valor_env/lib/python3.8/site-packages/torchvision
torchvision arch flags 3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6
cv2 Not found
PyTorch built with:
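"No kernel image is available" generally means that some compiled CUDA code was not built for the GPU's compute capability (7.5 for the RTX 2080 Ti listed above). A small diagnostic sketch follows, assuming the standard CUDA compatibility rules: an sm_XY binary runs on devices with the same major version and an equal or higher minor version, and compute_XY PTX can be compiled at run time for equal or newer devices. `arch_is_supported` is a hypothetical helper, and it only inspects the PyTorch build itself, not separately compiled extensions such as apex.

```python
def arch_is_supported(capability, arch_list):
    """Check a (major, minor) compute capability against a list of build targets.

    arch_list uses the 'sm_75' / 'compute_75' naming that
    torch.cuda.get_arch_list() returns. Entries with non-numeric suffixes
    are ignored in this simplified check.
    """
    major, minor = capability
    device = major * 10 + minor
    for arch in arch_list:
        kind, _, number = arch.partition("_")
        if not number.isdigit():
            continue
        value = int(number)
        # Binary code: same major version, minor version not above the device's.
        if kind == "sm" and value // 10 == major and value % 10 <= minor:
            return True
        # PTX: can be JIT-compiled for any device of equal or newer capability.
        if kind == "compute" and value <= device:
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        capability = torch.cuda.get_device_capability(0)
        arch_list = torch.cuda.get_arch_list()
        print(capability, arch_list, arch_is_supported(capability, arch_list))
```

If this check passes for PyTorch, a remaining candidate is an extension built with a different architecture list; rebuilding it with TORCH_CUDA_ARCH_LIST set to the device's capability is a common thing to try, though the log does not confirm that this is the cause here.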
Hey guys! I got to installing it, apex finally installed, and all is here. Models downloaded.
How do I run the inference on videos? I already extracted the video and audio and now what should I run, train or test?
Hi, thanks for your interesting work!
I'm trying to reproduce your results. Running 'sh preinstall.sh' I get some errors with Bert model.
I downloaded pretrained weights from your link from "Download Checkpoints" section. I use "pretrained_weights/bert_base_uncased_config.json" and "pretrained_weights/bert-base-uncased.bin" files.
However, I get the following error:
File "/VALOR/model/bert.py", line 334, in forward
attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
RuntimeError: The size of tensor a (5000) must match the size of tensor b (250) at non-singleton dimension 0
Size of query_layer tensor is [5000, 12, 42, 64], size of key_layer tensor is [250, 12, 49, 64]
Hi, I have read your paper; nice work on various video downstream tasks. However, some of the major or competitive methods are not compared for VideoQA (such as MulTI, mPLUG-2, and UMT-L) and VideoCaption (such as HiTeA and mPLUG-2). These are also SoTA methods and worth comparing against.
I hope you can consider the above suggestions, thanks.
Hi, nice work. When will the pre-training dataset be released?
Thanks for sharing your work!!!
I tested the test code provided in the README.md on msrvtt-1kA and obtained the following results:
07/18/2023 17:38:05 - INFO - main - ====-zero-shot evaluation--ret%tva%tv--msrvtt_ret_t_v========
07/18/2023 17:38:05 - INFO - main - {'video_recall': '39.9/69.2/78.8', 'video_ravg': 62.6, 'video_medianR': 2.0, 'video_meanR': 17.953125}
07/18/2023 17:38:05 - INFO - main - ====-zero-shot evaluation--ret%tva%tv--msrvtt_ret_t_va========
07/18/2023 17:38:05 - INFO - main - {'video_recall': '43.0/72.1/82.1', 'video_ravg': 65.7, 'video_medianR': 2.0, 'video_meanR': 15.1953125}
Why is the result much lower than the official announcement?
Hello, I read your paper, great work! When I tried running your code, I found that only the raw_videos section of the MSRVTT dataset is publicly available online. Could you please tell me where to find the frames_fps4 and audio_22050hz sections? Thank you very much!
Does anybody have inference code or a notebook to run VALOR for the QA task? Or at least a notebook for information retrieval?
Hi!
Thanks for your great job!
The link to the "pretrained_weights" (first in the section "Download Checkpoints") is not available. It gives me "You need access"
Thanks for the code and documentation. I am running the captioning finetuning experiment on MSRVTT. During the evaluation stage, the code stops with an AssertionError here. It seems that the hypo variable contains the same sentence repeated multiple times. Can you please tell me whether I missed a step or, if not, why this error occurs and how to solve it?
Here are the generated hypo variable and the ref variable for video9894:
['in the room a man in red was talking to the camera', 'in the room a man in red was talking to the camera', 'in the room a man in red was talking to the camera', 'in the room a man in red was talking to the camera', 'in the room a man in red was talking to the camera', 'in the room a man in red was talking to the camera', 'in the room a man in red was talking to the camera', 'in the room a man in red was talking to the camera', 'in the room a man in red was talking to the camera']
['a boy is talking to his roomnates who are in different room', 'a man asks another man to help him with chores', 'a man avoids helping his roommates', 'a man doesn t help his friends with anything', 'a man drinking some beer', 'a man walking in an apartment', 'a person communicating with other person', 'he was in the kitchen', 'man asking another man to do the dishes', 'man refusing to help his roommate', 'roommate continues to say no each time he is asked to help with something', 'the boys meet courier boy', 'the man asks for help', 'the youtube nigahiga doesn t want to help anyone', 'two friends are having fun', 'two men are talking in a kitchen', 'two young men talking to each other about doing dishes', 'a man walking in an apartment', 'a man drinking some beer', 'roommate continues to say no each time he is asked to help with something']
Thanks!
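COCO-style caption scorers typically assert that each video has exactly one hypothesis, so repeated predictions for the same video id can trigger an AssertionError of this kind. Assuming that is the check being hit here (the report does not show the assertion itself), one workaround is to keep a single prediction per video before scoring. `one_hypothesis_per_video` is a hypothetical helper, not part of the repository.

```python
def one_hypothesis_per_video(predictions):
    """Collapse repeated predictions so each video id maps to a one-element list.

    predictions is an iterable of (video_id, caption) pairs, possibly containing
    the same video id several times. The first caption seen for each id is kept.
    """
    hypotheses = {}
    for video_id, caption in predictions:
        hypotheses.setdefault(video_id, [caption])
    return hypotheses


# Toy input mimicking the repetition reported above.
predictions = [
    ("video9894", "in the room a man in red was talking to the camera"),
    ("video9894", "in the room a man in red was talking to the camera"),
    ("video7010", "a person is cooking"),
]
hypotheses = one_hypothesis_per_video(predictions)
```

This only hides the symptom. If the repetition comes from the test annotation file listing the same video several times, deduplicating the annotation entries at load time would avoid generating the extra captions in the first place.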
Thanks for your very good work. Excellent performance on my audio-text multimodal dataset.
I'd like to ask whether there are any plans to streamline or optimize the NVIDIA/Apex-related content in the future. I got tons of weird bugs when setting up my environment, and I believe this will also cause problems for others who want to try your repo.
Thank you for your excellent contribution. I found a problem when running the code: when I processed the MSRVTT dataset with extract_frame_and_wav_multiprocess.py, the error "Output file #0 does not contain any stream" appeared for some videos. In the end, all the frames were successfully extracted for the MSRVTT dataset, but only 8811 audio files were extracted. Is this normal, perhaps because some videos in the MSRVTT dataset have no audio? Looking forward to your response, and thank you again for your contribution to the field of video retrieval!
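One way to check whether the missing .wav files correspond to videos without an audio track is to ask ffprobe for each file's audio streams. The sketch below assumes ffprobe is installed and on the PATH; `parse_has_audio` and `has_audio_stream` are hypothetical helper names.

```python
import subprocess


def parse_has_audio(ffprobe_output):
    """Return True if ffprobe listed at least one stream of type 'audio'."""
    return any(
        line.strip().startswith("audio") for line in ffprobe_output.splitlines()
    )


def has_audio_stream(path):
    """Run ffprobe on a video file and report whether it has an audio stream."""
    result = subprocess.run(
        [
            "ffprobe", "-v", "error",
            "-select_streams", "a",              # only consider audio streams
            "-show_entries", "stream=codec_type",
            "-of", "csv=p=0",                    # one bare value per line
            path,
        ],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_has_audio(result.stdout)
```

Running `has_audio_stream` over the raw videos and counting the `True` results would show whether the number of videos with audio matches the number of extracted audio files.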
Thank you very much for your excellent contribution. I found a problem while running the code: during training or inference, the program repeatedly prints the following warnings, although it keeps running:
torch.cat(): Sizes of tensors must match except in dimension 0. Got 240 and 224 in dimension 2 (The offending index is 3)
05/17/2023 20:37:16 - INFO - main - current idx video7812 from ret returns wrong image/video, use 4266 instead.
torch.cat(): Sizes of tensors must match except in dimension 0. Got 240 and 224 in dimension 2 (The offending index is 3)
05/17/2023 20:37:17 - INFO - main - current idx video3481 from ret returns wrong image/video, use 225 instead.
Thank you so much for your excellent work!
I found many videos in the proposed VALOR-32K are unavailable on YouTube. 146 and 151 videos in the test and val sets are unavailable, respectively. And I am still downloading videos in the train set.
Can you provide the original raw videos for these unavailable ones for me? Because it is a large number of videos that can definitely affect the model's performance.
Hi authors,
Amazing paper, and thanks for providing this nice code base. I have a question regarding the finetuned models, specifically for the video-text retrieval task. Do you have plans to release those models? I understand that we can use the pretrained VALOR provided in the main page README to finetune for downstream tasks. However, the implementation details in the paper suggest using 8 A100 GPUs, which I don't have, so I probably cannot reproduce the results reported in the paper. Therefore, I am wondering whether you plan to release the finetuned models for the video-text retrieval task.
Thanks!
Shane
Hi, would it be possible to release a demo or code with which I could perform inference on a video? I would like to get embeddings (MGA) for video clips, text, and background audio in the same latent space.
There are two functions in VALOR/optim/misc.py, one is build_optimizer and the other is build_optimizer_for_VQA. Is the second one specifically for the VQA task, while the first one is for other tasks? Which function did you use to obtain the results listed in the paper?