
UniVL

The implementation of the paper UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation.

UniVL is a video-language pre-training model. It is designed with four modules and five objectives for both video-language understanding and generation tasks. It is also a flexible model for most multimodal downstream tasks, considering both efficiency and effectiveness.


Preliminary

Execute the script below in the main folder first. This avoids download conflicts when doing distributed pretraining.

mkdir modules/bert-base-uncased
cd modules/bert-base-uncased/
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-vocab.txt
mv bert-base-uncased-vocab.txt vocab.txt
wget https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz
tar -xvf bert-base-uncased.tar.gz
rm bert-base-uncased.tar.gz
cd ../../
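
Optionally, you can verify the download before launching distributed pretraining. The quick check below is only a sketch: it assumes the tarball unpacks bert_config.json and pytorch_model.bin next to vocab.txt, so adjust the file names if your archive differs.

import os

# Hypothetical sanity check: the file names are assumptions based on the usual
# contents of the bert-base-uncased tarball; adjust them if yours differ.
expected_files = ["vocab.txt", "bert_config.json", "pytorch_model.bin"]
base_dir = "modules/bert-base-uncased"
for name in expected_files:
    path = os.path.join(base_dir, name)
    print("ok     " if os.path.isfile(path) else "MISSING", path)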

Requirements

  • python==3.6.9
  • torch==1.7.1+cu92
  • tqdm
  • boto3
  • requests
  • pandas
  • nlg-eval (install Java 1.8.0 or higher first)

conda create -n py_univl python=3.6.9 tqdm boto3 requests pandas
conda activate py_univl
pip install torch==1.7.1+cu92
pip install git+https://github.com/Maluuba/nlg-eval.git@master
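
Before fine-tuning, it can help to confirm the environment works end to end. The snippet below is a rough check, not part of the repo: it assumes nlg-eval's data files are already set up (running nlg-eval --setup may be required) and that Java 1.8+ is available for METEOR.

import torch

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())

from nlgeval import NLGEval

# Skip the heavy embedding-based metrics; keep BLEU/METEOR/ROUGE-L/CIDEr.
evaluator = NLGEval(no_skipthoughts=True, no_glove=True)
print(evaluator.compute_individual_metrics(ref=["a man is cooking"], hyp="a person cooks"))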

Pretrained Weight

mkdir -p ./weight
wget -P ./weight https://github.com/microsoft/UniVL/releases/download/v0/univl.pretrained.bin

Prepare for Evaluation

Get the data for the retrieval and caption tasks (with video-only input) on YouCookII and MSRVTT.

YouCookII

mkdir -p data
cd data
wget https://github.com/microsoft/UniVL/releases/download/v0/youcookii.zip
unzip youcookii.zip
cd ..

Note: you can find youcookii_data.no_transcript.pickle in the zip file, which is a version without the transcript. The transcript version will not be made publicly available due to possible legal issues. Thus, you need to replace youcookii_data.pickle with youcookii_data.no_transcript.pickle for the YouCook retrieval task and the caption task with video-only input. The S3D features can be found in youcookii_videos_features.pickle; each feature is extracted as one 1024-dimensional vector per second. More details can be found in the dataloaders and our paper.
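
To sanity-check the download, the sketch below loads both pickles and prints the per-second S3D feature shape. It assumes both files are dictionaries keyed by video id; confirm the exact format against the dataloaders.

import pickle

import numpy as np

# Assumed layout: {video_id: annotation} and {video_id: ndarray of shape
# (duration_in_seconds, 1024)}; check the dataloaders if your files differ.
features = pickle.load(open("data/youcookii/youcookii_videos_features.pickle", "rb"))
data = pickle.load(open("data/youcookii/youcookii_data.no_transcript.pickle", "rb"))
print("videos with features:", len(features), "| annotated videos:", len(data))

video_id = next(iter(features))
print(video_id, np.asarray(features[video_id]).shape)  # e.g. (num_seconds, 1024)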

MSRVTT

mkdir -p data
cd data
wget https://github.com/microsoft/UniVL/releases/download/v0/msrvtt.zip
unzip msrvtt.zip
cd ..

Finetune on YouCookII and MSRVTT

Retrieval

  1. Run the retrieval task on YouCookII
DATATYPE="youcook"
TRAIN_CSV="data/youcookii/youcookii_train.csv"
VAL_CSV="data/youcookii/youcookii_val.csv"
DATA_PATH="data/youcookii/youcookii_data.pickle"
FEATURES_PATH="data/youcookii/youcookii_videos_features.pickle"
INIT_MODEL="weight/univl.pretrained.bin"
OUTPUT_ROOT="ckpts"

python -m torch.distributed.launch --nproc_per_node=4 \
main_task_retrieval.py \
--do_train --num_thread_reader=16 \
--epochs=5 --batch_size=32 \
--n_display=100 \
--train_csv ${TRAIN_CSV} \
--val_csv ${VAL_CSV} \
--data_path ${DATA_PATH} \
--features_path ${FEATURES_PATH} \
--output_dir ${OUTPUT_ROOT}/ckpt_youcook_retrieval --bert_model bert-base-uncased \
--do_lower_case --lr 3e-5 --max_words 48 --max_frames 48 \
--batch_size_val 64 --visual_num_hidden_layers 6 \
--datatype ${DATATYPE} --init_model ${INIT_MODEL}

The results (FT-Joint) are close to R@1: 0.2269 - R@5: 0.5245 - R@10: 0.6586 - Median R: 5.0

Add --train_sim_after_cross to train with the alignment approach (FT-Align).

The results (FT-Align) are close to R@1: 0.2890 - R@5: 0.5760 - R@10: 0.7000 - Median R: 4.0

  2. Run the retrieval task on MSRVTT
DATATYPE="msrvtt"
TRAIN_CSV="data/msrvtt/MSRVTT_train.9k.csv"
VAL_CSV="data/msrvtt/MSRVTT_JSFUSION_test.csv"
DATA_PATH="data/msrvtt/MSRVTT_data.json"
FEATURES_PATH="data/msrvtt/msrvtt_videos_features.pickle"
INIT_MODEL="weight/univl.pretrained.bin"
OUTPUT_ROOT="ckpts"

python -m torch.distributed.launch --nproc_per_node=4 \
main_task_retrieval.py \
--do_train --num_thread_reader=16 \
--epochs=5 --batch_size=128 \
--n_display=100 \
--train_csv ${TRAIN_CSV} \
--val_csv ${VAL_CSV} \
--data_path ${DATA_PATH} \
--features_path ${FEATURES_PATH} \
--output_dir ${OUTPUT_ROOT}/ckpt_msrvtt_retrieval --bert_model bert-base-uncased \
--do_lower_case --lr 5e-5 --max_words 48 --max_frames 48 \
--batch_size_val 64 --visual_num_hidden_layers 6 \
--datatype ${DATATYPE} --expand_msrvtt_sentences --init_model ${INIT_MODEL}

The results (FT-Joint) are close to R@1: 0.2720 - R@5: 0.5570 - R@10: 0.6870 - Median R: 4.0

Add --train_sim_after_cross to train with the alignment approach (FT-Align).

Caption

Run the caption task on YouCookII

TRAIN_CSV="data/youcookii/youcookii_train.csv"
VAL_CSV="data/youcookii/youcookii_val.csv"
DATA_PATH="data/youcookii/youcookii_data.pickle"
FEATURES_PATH="data/youcookii/youcookii_videos_features.pickle"
INIT_MODEL="weight/univl.pretrained.bin"
OUTPUT_ROOT="ckpts"

python -m torch.distributed.launch --nproc_per_node=4 \
main_task_caption.py \
--do_train --num_thread_reader=4 \
--epochs=5 --batch_size=16 \
--n_display=100 \
--train_csv ${TRAIN_CSV} \
--val_csv ${VAL_CSV} \
--data_path ${DATA_PATH} \
--features_path ${FEATURES_PATH} \
--output_dir ${OUTPUT_ROOT}/ckpt_youcook_caption --bert_model bert-base-uncased \
--do_lower_case --lr 3e-5 --max_words 128 --max_frames 96 \
--batch_size_val 64 --visual_num_hidden_layers 6 \
--decoder_num_hidden_layers 3 --stage_two \
--init_model ${INIT_MODEL}

The results are close to

BLEU_1: 0.4746, BLEU_2: 0.3355, BLEU_3: 0.2423, BLEU_4: 0.1779
METEOR: 0.2261, ROUGE_L: 0.4697, CIDEr: 1.8631

If using video only as input (youcookii_data.no_transcript.pickle), the results are close to

BLEU_1: 0.3921, BLEU_2: 0.2522, BLEU_3: 0.1655, BLEU_4: 0.1117
METEOR: 0.1769, ROUGE_L: 0.4049, CIDEr: 1.2725

Run the caption task on MSRVTT

DATATYPE="msrvtt"
TRAIN_CSV="data/msrvtt/MSRVTT_train.9k.csv"
VAL_CSV="data/msrvtt/MSRVTT_JSFUSION_test.csv"
DATA_PATH="data/msrvtt/MSRVTT_data.json"
FEATURES_PATH="data/msrvtt/msrvtt_videos_features.pickle"
INIT_MODEL="weight/univl.pretrained.bin"
OUTPUT_ROOT="ckpts"

python -m torch.distributed.launch --nproc_per_node=4 \
main_task_caption.py \
--do_train --num_thread_reader=4 \
--epochs=5 --batch_size=128 \
--n_display=100 \
--train_csv ${TRAIN_CSV} \
--val_csv ${VAL_CSV} \
--data_path ${DATA_PATH} \
--features_path ${FEATURES_PATH} \
--output_dir ${OUTPUT_ROOT}/ckpt_msrvtt_caption --bert_model bert-base-uncased \
--do_lower_case --lr 3e-5 --max_words 48 --max_frames 48 \
--batch_size_val 32 --visual_num_hidden_layers 6 \
--decoder_num_hidden_layers 3 --datatype ${DATATYPE} --stage_two \
--init_model ${INIT_MODEL}

The results are close to

BLEU_1: 0.8051, BLEU_2: 0.6672, BLEU_3: 0.5342, BLEU_4: 0.4179
METEOR: 0.2894, ROUGE_L: 0.6078, CIDEr: 0.5004

Pretrain on HowTo100M

Format of the CSV

video_id,feature_file
Z8xhli297v8,Z8xhli297v8.npy
...
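
A minimal sketch for generating such a CSV from a folder of per-video .npy feature files; the paths below are placeholders, so adapt them to wherever your features and caption pickle live.

import csv
import os

features_dir = "data/features"   # assumed folder of <video_id>.npy files
csv_path = "data/HowTo100M.csv"  # assumed output path used by --train_csv

with open(csv_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["video_id", "feature_file"])
    for name in sorted(os.listdir(features_dir)):
        if name.endswith(".npy"):
            writer.writerow([os.path.splitext(name)[0], name])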

Stage I

ROOT_PATH=.
DATA_PATH=${ROOT_PATH}/data
SAVE_PATH=${ROOT_PATH}/models
MODEL_PATH=${ROOT_PATH}/UniVL
python -m torch.distributed.launch --nproc_per_node=8 \
${MODEL_PATH}/main_pretrain.py \
 --do_pretrain --num_thread_reader=0 --epochs=50 \
--batch_size=1920 --n_pair=3 --n_display=100 \
--bert_model bert-base-uncased --do_lower_case --lr 1e-4 \
--max_words 48 --max_frames 64 --batch_size_val 344 \
--output_dir ${SAVE_PATH}/pre_trained/L48_V6_D3_Phase1 \
--features_path ${DATA_PATH}/features \
--train_csv ${DATA_PATH}/HowTo100M.csv \
--data_path ${DATA_PATH}/caption.pickle \
--visual_num_hidden_layers 6 --gradient_accumulation_steps 16 \
--sampled_use_mil --load_checkpoint

Stage II

ROOT_PATH=.
DATA_PATH=${ROOT_PATH}/data
SAVE_PATH=${ROOT_PATH}/models
MODEL_PATH=${ROOT_PATH}/UniVL
INIT_MODEL=<from first stage>
python -m torch.distributed.launch --nproc_per_node=8 \
${MODEL_PATH}/main_pretrain.py \
--do_pretrain --num_thread_reader=0 --epochs=50 \
--batch_size=960 --n_pair=3 --n_display=100 \
--bert_model bert-base-uncased --do_lower_case --lr 1e-4 \
--max_words 48 --max_frames 64 --batch_size_val 344 \
--output_dir ${SAVE_PATH}/pre_trained/L48_V6_D3_Phase2 \
--features_path ${DATA_PATH}/features \
--train_csv ${DATA_PATH}/HowTo100M.csv \
--data_path ${DATA_PATH}/caption.pickle \
--visual_num_hidden_layers 6 --decoder_num_hidden_layers 3 \
--gradient_accumulation_steps 60 \
--stage_two --sampled_use_mil \
--pretrain_enhance_vmodal \
--load_checkpoint --init_model ${INIT_MODEL}

Citation

If you find UniVL useful in your work, you can cite the following paper:

@Article{Luo2020UniVL,
  author  = {Huaishao Luo and Lei Ji and Botian Shi and Haoyang Huang and Nan Duan and Tianrui Li and Jason Li and Taroon Bharti and Ming Zhou},
  title   = {UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation},
  journal = {arXiv preprint arXiv:2002.06353},
  year    = {2020},
}

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree.

Microsoft Open Source Code of Conduct

Acknowledgments

Our code is based on pytorch-transformers v0.4.0 and howto100m. We thank the authors for their wonderful open-source efforts.

Issues

Captioning task clarification: video vs. video+text for captioning task

Just to clarify, is the MSR-VTT captioning pipeline the video-only task, or the video+transcript setting (as described in Sec. 4.3.2. of your paper)? Upon inspection, it seems like at test time, by default, the model simply takes in [CLS] [SEP] as the textual input -- so I'm led to believe that this is the video-only setting?

I am most interested in the video-only setting, so I just wanted to confirm. I'm currently trying to replicate your UniVL MSR-VTT captioning results to study some of the model capabilities and was able to get very similar metrics to those reported in the README.

Thank you so much for making your work easy to reproduce!

Expected data format?

Hello, I was trying to run the README example(s) on the youcook2 dataset. I've downloaded all the files from the youcook webpage and ran the download scripts. Reading through dataloader_youcook_caption.py seems to indicate you expect the data to be in some custom/different format. Is that correct? For example, I don't see any .pickle files in the original dataset, and none of the csv files have a column feature_file. Can you clarify the steps required to run the README?

Why is the fine-tuning performance much lower than the benchmark in the paper?

Hi, @ArrowLuo, I am fine-tuning the model on the captioning downstream task, but I find that its evaluation performance is much poorer than the benchmark in the paper. I set the number of epochs to 10 and the batch size to 16 (same for validation), and my best validation scores are: BLEU_1: 0.3759, BLEU_2: 0.2398, BLEU_3: 0.1576, BLEU_4: 0.1069, METEOR: 0.1682, ROUGE_L: 0.3916, CIDEr: 1.2186.
Could this be because I removed distributed training from the code? I always ran into distributed-computation issues on Colab, and I used batch size 16 because a larger batch size caused out-of-memory issues.

where to get the transcript to generate youcookii_data.pickle

in dataloaders/README.md

This file is generated from `youcookii_annotations_trainval.json`, which can be downloaded from [official webpage](http://youcook2.eecs.umich.edu/download).

But I downloaded youcookii_annotations_trainval.tar.gz from the official webpage and extracted youcookii_annotations_trainval.json, and found that it contains no transcript content.

caption my own video with provided pretrained model

Hi, thanks for the wonderful work.
I want to caption my own videos given only the video frames (without transcripts). Can I use the pretrained weight (univl.pretrained.bin) provided in the repository directly for this task? I evaluated the pretrained weight univl.pretrained.bin directly on MSRVTT with the following command,

DATATYPE="msrvtt"
TRAIN_CSV="data/msrvtt/MSRVTT_train.9k.csv"
VAL_CSV="data/msrvtt/MSRVTT_JSFUSION_test.csv"
DATA_PATH="data/msrvtt/MSRVTT_data.json"
FEATURES_PATH="data/msrvtt/msrvtt_videos_features.pickle"
INIT_MODEL="weight/univl.pretrained.bin"
OUTPUT_ROOT="ckpts"

python -m torch.distributed.launch --nproc_per_node=1 \
main_task_caption.py \
--do_eval --num_thread_reader=4 \
--val_csv ${VAL_CSV} \
--data_path ${DATA_PATH} \
--features_path ${FEATURES_PATH} \
--output_dir ${OUTPUT_ROOT}/ckpt_msrvtt_caption --bert_model bert-base-uncased \
--do_lower_case \
--batch_size_val 32 --visual_num_hidden_layers 6 \
--decoder_num_hidden_layers 3 --datatype ${DATATYPE} --stage_two \
--init_model ${INIT_MODEL}

but got a very low metric value:

BLEU_1: 0.1410, BLEU_2: 0.0450, BLEU_3: 0.0142, BLEU_4: 0.0052
 METEOR: 0.0684, ROUGE_L: 0.1229, CIDEr: 0.0045

I'm new to this field, so I would appreciate any suggestions, instructions, or code on using the provided pretrained model for video captioning in real-world cases. (Perhaps the main points lie in the pretrained model, feature extraction, and result visualization?)

Unable to run video captioning code

I followed the steps in downloading all the necessary dependencies and data to run the code. When running the code, this error is thrown:

in main raise subprocess.CalledProcessError(returncode=process.returncode, subprocess.CalledProcessError: Command '['<path to python executable>', '-u', 'main_task_caption.py', '--local_rank=3', '--do_train', '--num_thread_reader=4', '--epochs=5', '--batch_size=128', '--n_display=100', '--train_csv', 'data/msrvtt/MSRVTT_train.9k.csv', '--val_csv', 'data/msrvtt/MSRVTT_JSFUSION_test.csv', '--data_path', 'data/msrvtt/MSRVTT_data.json', '--features_path', 'data/msrvtt/msrvtt_videos_features.pickle', '--output_dir', 'ckpts/ckpt_msrvtt_caption', '--bert_model', 'bert-base-uncased', '--do_lower_case', '--lr', '3e-5', '--max_words', '48', '--max_frames', '48', '--batch_size_val', '32', '--visual_num_hidden_layers', '6', '--decoder_num_hidden_layers', '3', '--datatype', 'msrvtt', '--stage_two', '--init_model', 'weight/univl.pretrained.bin']' returned non-zero exit status 1.

There is only 1 gpu on my laptop so I am not sure if this is causing the issue. I just wanted to try out the video captioning capability of this model. Thank you!

How should I set the values in youcookii_videos_features.pickle when fine-tuning with only the transcript as input?

Hi, @ArrowLuo. I want to fine-tune the model with only the transcript as input, so I generated another youcookii_videos_features pickle in which every element of each video's array (shape number_of_frames x 1024) is set to NaN. I found that the training loss then becomes NaN.
So I set the elements to zero instead; do you agree with this modification?
May I ask how you handle the case where only text-modal information is used as input?

feature & data shape

Hi, I am leaving another question here.

I am looking at the data format so I can set up my custom dataset in the same format as YouCookII.
When I loaded the two pickle files, 'youcookii_data.no_transcript.pickle' and 'youcookii_videos_features.pickle',
I saw that their lengths are different.
I expected the same length for both the data and feature pickles; can you tell me why the lengths differ?

data = pickle.load(open('youcookii_data.no_transcript.pickle','rb'))
len(data)
# 1790
features = pickle.load(open('youcookii_videos_features.pickle','rb'))
len(features)
# 1905

Thank you so much,

Where do the visual tokens come from?

As described in the paper, S3D is used to extract video features, which are then fed into the Transformer.
But the source code in module_visual uses an Embedding.
My question is: did you do some processing to convert the video features into video tokens?

RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

Hi, @ArrowLuo, I was running training in the fine-tuning stage for the video captioning task, but got the error 'RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.'


RuntimeError Traceback (most recent call last)
in ()
31 coef_lr = 1.0
32 optimizer, scheduler, model = prep_optimizer(args, model, num_train_optimization_steps, device, n_gpu,
---> 33 args.local_rank, coef_lr=coef_lr)
34
35 if args.local_rank == 0:

2 frames
in prep_optimizer(args, model, num_train_optimization_steps, device, n_gpu, local_rank, coef_lr)
28
29 model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank],
---> 30 output_device=local_rank, find_unused_parameters=True)
31
32 #model = torch.nn.DataParallel(model).cuda()

/usr/local/lib/python3.7/dist-packages/torch/nn/parallel/distributed.py in init(self, module, device_ids, output_device, dim, broadcast_buffers, process_group, bucket_cap_mb, find_unused_parameters, check_reduction, gradient_as_bucket_view)
399
400 if process_group is None:
--> 401 self.process_group = _get_default_group()
402 else:
403 self.process_group = process_group

/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py in _get_default_group()
345 """
346 if not is_initialized():
--> 347 raise RuntimeError("Default process group has not been initialized, "
348 "please make sure to call init_process_group.")
349 return GroupMember.WORLD

RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

I set worldsize = 1. I attach my log.txt here.
log (1).txt

caption using features extracted from my raw video

Hi~ sorry for bothering you again.
I have successfully fine-tuned the model on the caption task with the MSRVTT dataset. Following the README in the dataloaders dir, I also successfully extracted the S3D features from my own raw video and got a pickle file. But there are some extra files needed to run the script, as listed in the parameters:

DATATYPE="msrvtt"
VAL_CSV="data/msrvtt/MSRVTT_JSFUSION_test.csv"
DATA_PATH="data/msrvtt/MSRVTT_data.json"
FEATURES_PATH="data/msrvtt/msrvtt_videos_features.pickle"
INIT_MODEL="ckpts/ckpt_msrvtt_caption@server-westlakeT0528/pytorch_model.bin.4"
OUTPUT_ROOT="results"

Could you please explain how I can get the corresponding VAL_CSV and DATA_PATH files (or how to assign these parameters) to run the captioning evaluation on my extracted features? Thanks a lot!

Joint loss in pretraining

Hi,
We found that the video-text joint loss in pretraining is calculated from the masked video and text. Why not use the original video and text, as in retrieval fine-tuning?

sim_matrix_text_visual = self.get_similarity_logits(sequence_output_alm, visual_output_alm,

Error message (torch.distributed.elastic.multiprocessing.errors.ChildFailedError:)

I am trying to test on my own data, but I got this error message. Can you help me fix it? Thanks.
Traceback (most recent call last):
File "/home/tingchih/anaconda3/envs/py_univl/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/home/tingchih/anaconda3/envs/py_univl/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/tingchih/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 193, in
main()
File "/home/tingchih/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/tingchih/.local/lib/python3.6/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/tingchih/.local/lib/python3.6/site-packages/torch/distributed/run.py", line 713, in run
)(*cmd_args)
File "/home/tingchih/.local/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 131, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/tingchih/.local/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

main_task_caption.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2022-10-05_12:41:40
host : nlplab1
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 36124)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

How to fine-tune with additional layers before UniVL?

Hi! Thanks for your awesome work and I am trying to use your pretrained weights to train on another dataset.
However, my inputs consist of two different parts, and I need to apply an attention operation before putting them into the pretrained UniVL for fine-tuning.
Could you please give me some suggestions on how to fine-tune the model with additional layers before UniVL? It is like: inputs -> additional attention module (randomly initialized) -> UniVL. I am confused about the training strategy since I have not done pre-training work before.
It will be my great pleasure if you could reply to me :)
Best wishes.

Estimate of zero-shot performance

Hi! Thanks for the open-sourced code!

I wonder if you have conducted zero-shot experiments on MSRVTT or other downstream datasets. I get the following performance on standard text-to-video retrieval:

R1              7.0
R5             16.6
R10            23.4
MR             68.5

I am trying to make sure my pipeline is correct (with the UniVL model and my own trainer pipeline). Do you have zero-shot numbers on MSRVTT for comparison?

Weights from the pretrained model not used in UniVL during evaluation; visual_pytorch_model.bin, cross_pytorch_model.bin, and decoder_pytorch_model.bin are missing from visual-base, cross-base, and decoder-base

When I do evaluation based on the pre-trained weights, I run into the issues below:

  • INFO - Weight doesn't exsits. /content/visual-base/visual_pytorch_model.bin
    ......
  • INFO - Weight doesn't exsits. /content/cross-base/cross_pytorch_model.bin
    ......
  • INFO - Weight doesn't exsits. /content/decoder-base/decoder_pytorch_model.bin
  • WARNING - Stage-One:True, Stage-Two:False
  • WARNING - Set bert_config.num_hidden_layers: 12.
  • WARNING - Set visual_config.num_hidden_layers: 6.
  • INFO - --------------------
  • INFO - Weights from pretrained model not used in UniVL:

eval_epoch() does not actually execute, and the program finishes.

visual_pytorch_model.bin, cross_pytorch_model.bin, and decoder_pytorch_model.bin are missing from the visual-base, cross-base, and decoder-base folders on the GitHub page.

Hyper-parameter in pretraining

Hi,
I found that the learning rate for pretraining Stage I reported in the paper is 1e-3 with a batch size of 600, while the scripts in this repo suggest 1e-4 and 1920; usually the learning rate should increase with the batch size. In Stage II, the batch size in the paper and in the scripts is also very different (48 vs. 960). Considering that hyper-parameter search takes a lot of time in pretraining, I'm not sure which parameters should be used. Is there any misunderstanding?

video only test for youcook

Hi again,

Thanks for the support.
I am now trying to use only a video file as a test input for the YouCook model,
and I am having trouble modifying the __getitem__() function in the dataloader.

Most issues are in self._get_text().
It seems I have to change the lines that use the data dictionary to a default text input such as [CLS] or [SEP], but I have no idea how.

Could you give me some guidance on modifying it for video-only input?
Thanks,

About msrvtt retrieval results

I found that the MSRVTT text-to-video retrieval performance under the FT-Joint setting reported in the README is R@1: 0.2720 - R@5: 0.5570 - R@10: 0.6870 - Median R: 4.0, but the result in the paper is R@1: 0.206 - R@5: 0.491 - R@10: 0.629 - Median R: 6.0. What is the difference between them?
Additionally, what should the performance of the FT-Align setting be? It seems to have been left out of the README. I tried fine-tuning with the scripts released in the repo but got a worse score than FT-Joint on MSRVTT.

CLIP

Have you tried using CLIP to generate video captions? I think it would be useful.

Run Without Distributed

Hello, I am trying to run your code, but I keep running into issues with distributed training. Is it possible to run without it?

How to input only the text feature or the video feature

I want to input only the text feature or only the video feature into UniVL. The paper says that one transformer combines the text representation T and the video representation V. Could you tell me how to change the model so that only T or V is input into UniVL? Thanks.

Issues about Freezing some additional layers instead of meanP in CLIP4Clip

Dear author, I am deeply fascinated by your multimodal studies these days.

As an extension of the questions in ArrowLuo/CLIP4Clip#42, I have some questions of my own, and I thank you in advance for your kind advice.

I want to apply the cross module for fine-grained, cross-modal representations, as you did in UniVL, but a few questions came to mind first.

As you mentioned in the link above, the transformer in the cross module starts from randomly initialized weights, so it cannot outperform simply setting the similarity policy to meanP.

Here are my questions:

  1. Even though the cross module has limitations compared to meanP, as you mentioned,
    is there any special reason you selected the cross module, rather than meanP, in UniVL?
  2. Is masked modeling, as you did in UniVL, possible in CLIP4Clip?
  3. Unlike CLIP4Clip, UniVL has
    cross_output, pooled_output, concat_mask = self._get_cross_output(sequence_output_alm, visual_output_alm, attention_mask, video_mask)

    Can you explain the exact meaning of 'cross_output' and 'concat_mask', and the related objects sequence_cross_output and visual_cross_output?
    I guess that sequence_cross_output and visual_cross_output are more multimodally engaged than the offline representations (sequence_output, visual_output), but I would like to confirm.

I am really enthusiastic about your studies, and thank you for your contributions to the multimodal field.

Sincerely,

What's the role of the parameter coef_lr?

Hi Arrow,
I observed that after pre-training Stage I, the BERT parameters changed very little from their initial values. Is this because the coef_lr parameter is at work? It is set to 0.1 in the first stage and to 1 in the second stage. I guess it is meant to prevent BERT from being damaged at the beginning of training.

UniVL/main_pretrain.py

Lines 383 to 385 in 0a7c07f

coef_lr = args.coef_lr
if args.init_model:
coef_lr = 1.0

By the way, you named no_decay_xxx but gave it a weight-decay coefficient, and named decay_xxx but gave it no weight-decay coefficient. Are these typos?

UniVL/main_pretrain.py

Lines 191 to 194 in 0a7c07f

{'params': [p for n, p in no_decay_bert_param_tp], 'weight_decay': 0.01, 'lr': args.lr * coef_lr},
{'params': [p for n, p in no_decay_nobert_param_tp], 'weight_decay': 0.01},
{'params': [p for n, p in decay_bert_param_tp], 'weight_decay': 0.0, 'lr': args.lr * coef_lr},
{'params': [p for n, p in decay_nobert_param_tp], 'weight_decay': 0.0}

CrossTask and COIN dataset code

Hi, thank you for sharing your excellent work and code.
Could you provide the S3D features and code for the CrossTask and COIN datasets?

How can I create my video feature pickle

In the caption task, I see you have youcookii_videos_features.pickle to store the video features. Now I want to test this model on my own video dataset. How can I build this file? I followed this GitHub repo (https://github.com/ArrowLuo/VideoFeatureExtractor) to extract features and build the pickle, but I get the error message below. It looks like a tensor size problem. Could you help me fix it?

Traceback (most recent call last): File "main_task_caption.py", line 689, in <module> main() File "main_task_caption.py", line 667, in main scheduler, global_step, nlgEvalObj=nlgEvalObj, local_rank=args.local_rank) File "main_task_caption.py", line 361, in train_epoch output_caption_ids=pairs_output_caption_ids) File "/home/tingchih/anaconda3/envs/py_univl/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl return forward_call(*input, **kwargs) File "/home/tingchih/anaconda3/envs/py_univl/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 886, in forward output = self.module(*inputs[0], **kwargs[0]) File "/home/tingchih/anaconda3/envs/py_univl/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl return forward_call(*input, **kwargs) File "/home/tingchih/github_clone/UniVL/modules/modeling.py", line 196, in forward video = self.normalize_video(video) File "/home/tingchih/anaconda3/envs/py_univl/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl return forward_call(*input, **kwargs) File "/home/tingchih/github_clone/UniVL/modules/modeling.py", line 91, in forward video = self.visual_norm2d(video) File "/home/tingchih/anaconda3/envs/py_univl/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl return forward_call(*input, **kwargs) File "/home/tingchih/github_clone/UniVL/modules/until_module.py", line 53, in forward return self.weight * x + self.bias RuntimeError: The size of tensor a (1024) must match the size of tensor b (2048) at non-singleton dimension 2

thanks,

end-to-end video file captioning process

Hi, thanks for the great resources.
I successfully fine-tuned the video captioning model on YouCook2, and now I am trying to feed a video file as input and get its caption.
But looking at the code, it seems to always require a pickle file or an .npy feature file.
Is there a way to get the caption as output from a custom video file as input?
Thanks,

Questions on retrieval result and "Info: Weight doesn't exsits"

Hi, @ArrowLuo
Thanks for your great project! I would like to ask some questions about the retrieval results and the "INFO: Weight doesn't exsits" message.

  1. I have finished fine-tuning on MSR-VTT retrieval & captioning with the pre-trained model.
    Why is the retrieval result with FT-Align a bit lower than FT-Joint in the README?
    By the way, I directly used the default settings from your commands in the README.
retrieval, FT-Joint, 8 A100 GPUs
R@1: 0.2560 - R@5: 0.5510 - R@10: 0.6860 - Median R: 4.0

retrieval, FT-Align, 8 A100 GPUs
R@1: 0.2620 - R@5: 0.5500 - R@10: 0.6920 - Median R: 4.0

The results (FT-Joint) are close to R@1: 0.2720 - R@5: 0.5570 - R@10: 0.6870 - Median R: 4.0
  2. I have also finished extracting video features, pre-training on HowTo100M, and fine-tuning on MSR-VTT retrieval & captioning.
    I am curious about the message "INFO: Weight doesn't exsits" that appears both in pre-training and fine-tuning.
    It seems it is meant to remind me to load the pre-trained video encoder, cross encoder, and decoder respectively.
    Have you conducted any experiments with these pre-trained modules?
INFO: Weight doesn't exsits. /nvme/UniVL/modules/visual-base/visual_pytorch_model.bin
INFO: Weight doesn't exsits. /nvme/UniVL/modules/cross-base/cross_pytorch_model.bin
INFO: Weight doesn't exsits. /nvme/UniVL/modules/decoder-base/decoder_pytorch_model.bin

The program hangs when it runs into the parallel_apply() function in util.py

I ran main_task_retrieval.py as the README says, but when an epoch finished and eval_epoch() in main_task_retrieval.py was running, the program invoked parallel_apply() in eval_epoch() and hung at the line 'modules = nn.parallel.replicate(model, device_ids)' in the parallel_apply() function in util.py.

At that moment, if NCCL_DEBUG is turned on with 'export NCCL_DEBUG=INFO', the messages below are shown:

210109:155288 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
210109:155287 [0] NCCL INFO Channel 00/02 : 0 1
210109:155288 [1] NCCL INFO Trees [0] -1/-1/-1->1->0|0->1->-1/-1/-1 [1] -1/-1/-1->1->0|0->1->-1/-1/-1
210109:155287 [0] NCCL INFO Channel 01/02 : 0 1
210109:155288 [1] NCCL INFO Setting affinity for GPU 5 to ffff,f00000ff,fff00000
210109:155287 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
210109:155287 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 1/-1/-1->0->-1|-1->0->1/-1/-1
210109:155287 [0] NCCL INFO Setting affinity for GPU 3 to 0fffff00,000fffff
210109:155287 [0] NCCL INFO Channel 00 : 0[66000] -> 1[b9000] via direct shared memory
210109:155288 [1] NCCL INFO Channel 00 : 1[b9000] -> 0[66000] via direct shared memory
210109:155287 [0] NCCL INFO Channel 01 : 0[66000] -> 1[b9000] via direct shared memory
210109:155288 [1] NCCL INFO Channel 01 : 1[b9000] -> 0[66000] via direct shared memory
210109:155287 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
210109:155287 [0] NCCL INFO comm 0x7f7e14003240 rank 0 nranks 2 cudaDev 0 busId 66000 - Init COMPLETE
210109:155288 [1] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
210109:155288 [1] NCCL INFO comm 0x7f7eb8003010 rank 1 nranks 2 cudaDev 1 busId b9000 - Init COMPLETE

About auto mixed precision training

Hi,
There are mixed-precision-related arguments (like --fp16, --fp16_opt_level) in main_pretrain.py, but it seems they are not used. I have tried apex.amp for mixed precision and found it works well in pretraining Stage I (it nearly doubled the speed). But in Stage II, the gradients always became NaN. Have you ever had a similar problem? How could this occur? torch.cuda.amp has this issue too.

Are the provided weights based on pre-training on the HowTo100M dataset?

Hi, @ArrowLuo, many thanks for your previous replies; they were very helpful.
May I ask whether the provided weights come from pre-training on the HowTo100M dataset? When I do the video captioning downstream task, do I need to fine-tune the model weights by further training on YouCookII (as in main_caption_youcook, using the train function) to get better evaluation results? When I evaluated with the provided weights directly, the captioning scores were very low, rather than the high scores in the paper.
Secondly, since youcookii_videos_features.pickle comprises S3D features for 1905 videos (nearly all of them), is it better to do a train-test split and fine-tune only on the training portion while validating and testing on the rest?
