
HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training

This is the official repository of HERO (EMNLP 2020). This repository currently supports finetuning HERO on TVR, TVQA, TVC, VIOLIN, DiDeMo, and MSR-VTT Retrieval. The best pre-trained checkpoint (on both HowTo100M and the TV dataset) is released. Code for HERO pre-training on the TV dataset is also available.

Overview of HERO

Some code in this repo is copied/modified from open-source implementations made available by PyTorch, HuggingFace, OpenNMT, Nvidia, TVRetrieval, TVCaption, and UNITER. The visual frame features are extracted using SlowFast and ResNet-152. Feature extraction code is available at HERO_Video_Feature_Extractor.

Requirements

We provide a Docker image for easier reproduction. Please install the following first: an NVIDIA GPU driver, Docker, and the nvidia-container-toolkit.

Our scripts require the user to have docker group membership so that docker commands can be run without sudo. We only support Linux with NVIDIA GPUs; we have tested on Ubuntu 18.04 with V100 cards. We use mixed-precision training, hence GPUs with Tensor Cores are recommended.

Quick Start

NOTE: Please run bash scripts/download_pretrained.sh $PATH_TO_STORAGE to get our latest pretrained checkpoints.

We use TVR as an end-to-end example for using this code base.

  1. Download processed data and pretrained models with the following command.

    bash scripts/download_tvr.sh $PATH_TO_STORAGE

    After downloading you should see the following folder structure:

    ├── finetune
    │   ├── tvr_default
    ├── video_db
    │   ├── tv
    ├── pretrained
    │   └── hero-tv-ht100.pt
    └── txt_db
        ├── tv_subtitles.db
        ├── tvr_train.db
        ├── tvr_val.db
        └── tvr_test_public.db
    
  2. Launch the Docker container for running the experiments.

    # docker image should be automatically pulled
    source launch_container.sh $PATH_TO_STORAGE/txt_db $PATH_TO_STORAGE/video_db \
        $PATH_TO_STORAGE/finetune $PATH_TO_STORAGE/pretrained

    The launch script respects the $CUDA_VISIBLE_DEVICES environment variable. Note that the source code is mounted into the container under /src rather than built into the image, so user modifications are reflected without re-building the image. (Data folders are mounted into the container separately, for flexibility in folder structure.)

  3. Run finetuning for the TVR task.

    # inside the container
    horovodrun -np 8 python train_vcmr.py --config config/train-tvr-8gpu.json
    
    # for single gpu
    python train_vcmr.py --config $YOUR_CONFIG_JSON
  4. Run inference for the TVR task.

    # inference, inside the container
    horovodrun -np 8 python eval_vcmr.py --query_txt_db /txt/tvr_val.db/ --split val \
        --vfeat_db /video/tv/ --sub_txt_db /txt/tv_subtitles.db/ \
        --output_dir /storage/tvr_default/ --checkpoint 4800 --fp16 --pin_mem
    

    The result file will be written to /storage/tvr_default/results_val/results_4800_all.json. Change to --query_txt_db /txt/tvr_test_public.db/ --split test_public for inference on the test_public split. Please format the result file as requested by the evaluation server for submission; our code does not include formatting.

    The above command runs inference on the model we trained. Feel free to replace --output_dir and --checkpoint with your own model trained in step 3. Single GPU inference is also supported.
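    The raw result JSON's internal schema is not documented here; below is a minimal inspection sketch (not part of this repo) for looking at the output before reformatting it for the evaluation server:

    # Minimal sketch: load the raw inference output and peek at its structure
    # before writing the submission file. The path matches the example above;
    # the field layout is an assumption to be checked, hence the generic probing.
    import json

    with open("/storage/tvr_default/results_val/results_4800_all.json") as f:
        results = json.load(f)

    print(type(results))
    if isinstance(results, dict):
        print(list(results.keys())[:5])   # top-level fields to reshape
    elif isinstance(results, list):
        print(results[0])                 # first prediction record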

  5. Misc. In case you would like to reproduce the whole preprocessing pipeline:

  • Text annotation and subtitle preprocessing

    # outside of the container
    bash scripts/create_txtdb.sh $PATH_TO_STORAGE/txt_db $PATH_TO_STORAGE/ann
  • Video feature extraction

    We provide feature extraction code at HERO_Video_Feature_Extractor. Please follow the link for instructions on extracting both 2D ResNet features and 3D SlowFast features. These features are saved as separate .npz files per video.
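    As a quick sanity check on the extracted features, a minimal sketch (the filename is a placeholder; consult the extractor's output for the actual array names):

    # Minimal sketch: inspect one per-video feature file from the extractor.
    # "some_video.npz" is a hypothetical filename.
    import numpy as np

    feat = np.load("some_video.npz")
    print(feat.files)                    # names of the arrays stored inside
    for name in feat.files:
        print(name, feat[name].shape)    # e.g., (num_clips, feature_dim)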

  • Video feature preprocessing and saving to LMDB

    # inside of the container
    
    # Gather slowfast/resnet feature paths
    python scripts/collect_video_feature_paths.py  --feature_dir $PATH_TO_STORAGE/feature_output_dir\
        --output $PATH_TO_STORAGE/video_db --dataset $DATASET_NAME
    
    # Convert to lmdb
    python scripts/convert_videodb.py --vfeat_info_file $PATH_TO_STORAGE/video_db/$DATASET_NAME/video_feat_info.pkl \
        --output $PATH_TO_STORAGE/video_db --dataset $DATASET_NAME --frame_length 1.5
    • --frame_length: one feature per frame_length seconds; we use 1.5 or 2 in our implementation (e.g., a 90-second video at frame_length 1.5 yields 60 features). Set it to be consistent with the value used during feature extraction.
    • --compress: enable compression of the LMDB
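    To verify the conversion, a minimal sketch for peeking into the generated LMDB (the directory path is illustrative, and the value encoding is whatever convert_videodb.py wrote, so only raw sizes are printed):

    # Minimal sketch: open the converted LMDB read-only and list one entry.
    import lmdb

    env = lmdb.open("video_db/tv", readonly=True, create=False, lock=False)
    with env.begin() as txn:
        for key, value in txn.cursor():
            print(key.decode(), "->", len(value), "bytes")
            break
    env.close()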

Downstream Tasks Finetuning

TVQA

NOTE: training and inference should be run inside the Docker container

  1. download data
    # outside of the container
    bash scripts/download_tvqa.sh $PATH_TO_STORAGE
  2. train
    # inside the container
    horovodrun -np 8 python train_videoQA.py --config config/train-tvqa-8gpu.json \
        --output_dir $TVQA_EXP
  3. inference
    # inside the container
    horovodrun -np 8 python eval_videoQA.py --query_txt_db /txt/tvqa_test_public.db/ --split test_public \
        --vfeat_db /video/tv/ --sub_txt_db /txt/tv_subtitles.db/ \
        --output_dir $TVQA_EXP --checkpoint $ckpt --pin_mem --fp16
    The result file will be written to $TVQA_EXP/results_test_public/results_$ckpt_all.json, which can be submitted to the evaluation server. Please format the result file as requested by the evaluation server for submission; our code does not include formatting.

TVC

  1. download data
    # outside of the container
    bash scripts/download_tvc.sh $PATH_TO_STORAGE
  2. train
    # inside the container
    horovodrun -np 8 python train_tvc.py --config config/train-tvc-8gpu.json \
        --output_dir $TVC_EXP
  3. inference
    # inside the container
    python inf_tvc.py --model_dir $TVC_EXP --ckpt_step 7000 \
        --target_clip /txt/tvc_val_release.jsonl --output tvc_val_output.jsonl
    • tvc_val_output.jsonl will be in the official TVC submission format.
    • change to --target_clip /txt/tvc_test_public_release.jsonl for test results.

NOTE: see scripts/prepro_tvc.sh for LMDB preprocessing.
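A minimal sketch for checking the generated submission file (the per-line field names are not specified here, so it just prints one record):

# Minimal sketch: read the JSON-lines output and show one prediction record.
import json

with open("tvc_val_output.jsonl") as f:
    records = [json.loads(line) for line in f]

print(len(records), "captions generated")
print(records[0])   # inspect the per-clip fields before submitting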

VIOLIN

  1. download data
    # outside of the container
    bash scripts/download_violin.sh $PATH_TO_STORAGE
  2. train
    # inside the container
    horovodrun -np 8 python train_violin.py --config config/train-violin-8gpu.json \
        --output_dir $VIOLIN_EXP

DiDeMo

  1. download data
    # outside of the container
    bash scripts/download_didemo.sh $PATH_TO_STORAGE
  2. train
    # inside the container
    horovodrun -np 4 python train_vcmr.py --config config/train-didemo_video_only-4gpu.json \
        --output_dir $DIDEMO_EXP
    Switch to config/train-didemo_video_sub-8gpu.json for ASR augmented DiDeMo results. You can also download the original ASR here.

MSR-VTT Retrieval

  1. download data
    # outside of the container
    bash scripts/download_msrvtt.sh $PATH_TO_STORAGE
  2. train
    # inside the container
    horovodrun -np 4 python train_vr.py --config config/train-msrvtt_video_only-4gpu.json \
        --output_dir $MSRVTT_EXP
    Switch to config/train-msrvtt_video_sub-4gpu.json for ASR augmented MSR-VTT results. You can also download the original ASR here.

How2R and How2QA

For raw annotations, please refer to How2R and How2QA. Features and code will be available soon.

Pre-training

  1. download data
    # outside of the container
    bash scripts/download_tv_pretrain.sh $PATH_TO_STORAGE
  2. pre-train
    # inside of the container
    horovodrun -np 16 python pretrain.py --config config/pretrain-tv-16gpu.json \
        --output_dir $PRETRAIN_EXP
    Unfortunately, we cannot host the HowTo100M features due to their large size. Users can either process them on their own or send an inquiry to my email address (which you can find in our paper).

Citation

If you find this code useful for your research, please consider citing:

@inproceedings{li2020hero,
  title={HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training},
  author={Li, Linjie and Chen, Yen-Chun and Cheng, Yu and Gan, Zhe and Yu, Licheng and Liu, Jingjing},
  booktitle={EMNLP},
  year={2020}
}

License

MIT


Issues

The number of queries for the didemo dataset

Thanks for sharing the code. It helps a lot!
I downloaded the DiDeMo dataset and found that the numbers of queries in query_data.jsonl and id2len.json are not equal. For the test set, the former is 4021 and the latter is 3982. Looking at the code implementation, it uses the latter as the final query count. I'm a little confused about why the numbers of queries in the two files differ. Hope to get help.
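For context, a minimal sketch of the count check described above (file names as given in this issue):

# Minimal sketch: reproduce the reported mismatch between the two files.
import json

with open("query_data.jsonl") as f:
    n_jsonl = sum(1 for _ in f)        # reported: 4021 on the test set

with open("id2len.json") as f:
    id2len = json.load(f)              # reported: 3982 on the test set

print(n_jsonl, len(id2len))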

One issue about pretrain-tv-init.bin file

I'm training the HERO model based on the provided TV dataset. The model needs to pre-load "pretrain-tv-init.bin" to initialize the network. However, I do not know how to obtain this file on my own, so I modified the code to ignore it. But after the pre-training process finished, the performance was really poor. Is "pretrain-tv-init.bin" essential? If so, how can we obtain this file from scratch?

Cannot download the video_db file

Hi, Linjie Li:

Thank you for open-sourcing this code.

I am downloading your extracted features using bash scripts/download_tvr.sh $PATH_TO_STORAGE.

However, hero/video_db/tv.tar seems to have a problem downloading. I have been downloading for nearly a day and the download always fails.

[error screenshots omitted]

Would you mind checking that file, or providing another option such as Google Drive or Baidu Netdisk?

Datapath

Hi,
Congrats on the amazing work.

I downloaded the data using bash scripts/download_tvr.sh $PATH_TO_STORAGE (with my own $PATH_TO_STORAGE). How can I specify this data path in the fine-tuning scripts, for both the single-GPU and the 8-GPU runs?

Thank you.

The video features used in TVC task

Hi Linjie, how can I reproduce the results for the TVC task in your paper? Could you kindly provide the features you used for this task?
Thanks.

got error while running pretrain.py

Hi,

I'm trying to reproduce pretraining with config pretrain-tv-ht-16gpu.json

I got error messages as follows:

[1,4]<stderr>:Traceback (most recent call last):
[1,4]<stderr>:  File "pretrain.py", line 619, in <module>
[1,4]<stderr>:    main(args)
[1,4]<stderr>:  File "pretrain.py", line 175, in main
[1,4]<stderr>:    train_loaders, val_loaders = build_target_loaders(target, t_r, opts)
[1,4]<stderr>:  File "pretrain.py", line 59, in build_target_loaders
[1,4]<stderr>:    target['vfeat_interval'], opts)
[1,4]<stderr>:  File "/src/load_data.py", line 37, in load_video_sub_dataset
[1,4]<stderr>:    if "msrvtt" in opts.tasks:
[1,4]<stderr>:AttributeError: 'Namespace' object has no attribute 'tasks'

So I printed the opts object:

[1,4]<stdout>:Namespace(betas=[0.9, 0.98], checkpoint='/pretrain/pretrain-tv-init.bin', compressed_db=False, drop_svmr_prob=0.8, dropout=0.1, fp16=True, grad_norm=1.0, gradient_accumulation_steps=2, hard_neg_weights=[10], hard_negtiave_start_step=[20000], hard_pool_size=[20], img_db='/video', learning_rate=3e-05, load_partial_pretrained=True, lr_mul=1.0, lw_neg_ctx=8.0, lw_neg_q=8.0, lw_st_ed=0.01, margin=0.1, mask_prob=0.15, max_clip_len=100, max_txt_len=60, model_config='config/hero_pretrain.json', n_gpu=6, n_workers=1, num_train_steps=1650000, optim='adamw', output_dir='pt-temp', pin_mem=True, ranking_loss_type='hinge', save_steps=500, seed=77, skip_layer_loading=True, sub_ctx_len=0, targets=[{'name': 'tv', 'sub_txt_db': 'tv_subtitles.db', 'vfeat_db': 'tv', 'vfeat_interval': 1.5, 'splits': [{'name': 'all', 'tasks': ['mlm', 'mfm-nce', 'fom', 'vsm'], 'train_idx': 'pretrain_splits/tv_train.json', 'val_idx': 'pretrain_splits/tv_val.json', 'ratio': [2, 2, 1, 2]}]}, {'name': 'ht100_full_filtered', 'sub_txt_db': 'howto100m_pretrain_all_60s_clip_sub.db', 'vfeat_db': 'howto100m_pretrain_all_60s_clips', 'vfeat_shards': ['howto100m_pretrain_all_clips_8', 'howto100m_pretrain_all_clips_0', 'howto100m_pretrain_all_clips_1', 'howto100m_pretrain_all_clips_2', 'howto100m_pretrain_all_clips_3', 'howto100m_pretrain_all_clips_4', 'howto100m_pretrain_all_clips_5', 'howto100m_pretrain_all_clips_6', 'howto100m_pretrain_all_clips_7', 'howto100m_pretrain_all_clips_9'], 'vfeat_interval': 2.0, 'splits': [{'name': 'all', 'tasks': ['mfm-nce', 'fom'], 'train_idx': ['howto100_full_pretrain_split/ht100_full_filtered_train_8.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_0.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_1.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_2.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_3.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_4.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_5.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_6.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_7.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_9.json'], 'val_idx': 'howto100_full_pretrain_split/ht100_full_filtered_val.json', 'ratio': [2, 1]}, {'name': 'has-sub', 'tasks': ['mlm', 'vsm'], 'train_idx': ['howto100_full_pretrain_split/ht100_full_filtered_train_8.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_0.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_1.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_2.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_3.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_4.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_5.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_6.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_7.json', 'howto100_full_pretrain_split/ht100_full_filtered_train_9.json'], 'val_idx': 'howto100_full_pretrain_split/ht100_full_filtered_val.json', 'ratio': [2, 2]}]}], targets_ratio=[1, 9], train_batch_size=32, train_span_start_step=0, txt_db='/txt', use_all_neg=True, val_batch_size=32, valid_steps=5000, vfeat_interval=1.5, vfeat_version='resnet_slowfast', warmup_steps=10000, weight_decay=0.01)

I think sub_txt_db should be a SubTokLmdb, but it's not? I'm not sure. How should I fix this issue?

HERO/load_data.py

Lines 36 to 40 in 00d8fbf

if not isinstance(sub_txt_db, SubTokLmdb):
    if "msrvtt" in opts.task:
        sub_txt_db = VrSubTokLmdb(sub_txt_db, opts.max_clip_len)
    else:
        sub_txt_db = SubTokLmdb(sub_txt_db, opts.max_clip_len)

I can bypass this error message if I skip L37-L39 and only run L40.
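For reference, a hedged sketch of a defensive patch for the snippet above; it assumes the root cause is only the missing task/tasks attribute on the pretraining opts:

# Hedged patch sketch (replacing the isinstance block above): pretraining
# configs may define neither `task` nor `tasks` at the top level, so fall
# back to an empty string instead of raising AttributeError.
task = getattr(opts, "task", "") or getattr(opts, "tasks", "")
if not isinstance(sub_txt_db, SubTokLmdb):
    if "msrvtt" in task:
        sub_txt_db = VrSubTokLmdb(sub_txt_db, opts.max_clip_len)
    else:
        sub_txt_db = SubTokLmdb(sub_txt_db, opts.max_clip_len)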

Here is another issue:

f"{target['vfeat_db']}/{shard}", sub_txt_db,

should be modified to f"{opts.img_db}/{target['vfeat_db']}/{shard}", sub_txt_db, ?

The baseline code

Hi,

Could you please also share the baseline codes of F-TRM and H-TRM? That would be helpful. Thanks!

running on cpu

Is it possible to run the pretrained model on a CPU? I see the requirements call for NVIDIA GPUs, but I have a MacBook and it does not have a GPU.

question about vocab size

In the pre-train configuration file "hero_pretrain.json", the vocab size of f_config is 50,265 (likely from the RoBERTa model).

However, the pre-trained model "hero-tv-ht100.pt" has an f_config vocab size of 50,272 (I checked the dimension of model.v_encoder.f_encoder.lm_head.decoder).

Which configuration file was used when the "hero-tv-ht100.pt" model was trained?
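A minimal sketch of the dimension check mentioned above (the key name follows this report; torch.load may return a wrapper dict depending on how the checkpoint was saved):

# Minimal sketch: check the decoder vocab dimension inside the checkpoint.
import torch

sd = torch.load("hero-tv-ht100.pt", map_location="cpu")
key = "v_encoder.f_encoder.lm_head.decoder.weight"
if key in sd:
    print(sd[key].shape)   # first dimension reported as 50,272
else:
    print(sorted(k for k in sd if "lm_head" in k))   # locate the real key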

I found a bug in pretrain.py

Hi @linjieli222

from .load_data import load_video_sub_dataset

This line raises the following error:
ModuleNotFoundError: No module named '__main__.load_data'; '__main__' is not a package

This line should be modified to from load_data import load_video_sub_dataset
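The change, for clarity:

# Before: fails because pretrain.py runs as __main__, which is not a package.
# from .load_data import load_video_sub_dataset

# After: a plain absolute import from the repo root.
from load_data import load_video_sub_dataset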

How many train steps are needed to reach the paper's performance when finetuning on the TVR dataset?

Hi, I'm trying to finetune on the TVR dataset with the HERO pretrained model.
But with 5000 or 10000 train steps, I failed to reach the performance reported in the paper.

  1. How many train steps are needed to finetune on the TVR dataset?
  2. Is the number of GPUs critical to performance? I'm running this finetuning with 4 GPUs.

Also, the paper doesn't say anything about hard negative sampling, but it seems to be important.
3. Have you done an ablation study on hard negatives? Could you share your experience?

Preprocessing code for VIOLIN

Hi,
In the repo, create_txtdb.sh is given to create the text DB for TVR. Can you please provide the script you used to create the text DB for VIOLIN?
Thanks

An error during finetuning for the TVR task

@linjieli222
Hi, I just encountered an error in Quick Start Step 3 using 1 GPU:

# inside the container
CUDA_VISIBLE_DEVICES=0
horovodrun -np 1 python train_vcmr.py --config config/train-tvr-8gpu.json
...
...
[1,0]<stderr>:12/13/2021 09:08:05 - INFO - model.model -        Decoder Transformer config: None
[1,0]<stderr>:12/13/2021 09:08:08 - INFO - model.modeling_utils -   Weights of HeroForVcmr not initialized from pretrained model: ['v_encoder.fom_output.linear_1.weight', 'v_encoder.fom_output.linear_1.bias', 'v_encoder.fom_output.LayerNorm.weight', 'v_encoder.fom_output.LayerNorm.bias', 'v_encoder.fom_output.linear_2.weight', 'v_encoder.fom_output.linear_2.bias']
[1,0]<stderr>:12/13/2021 09:08:08 - INFO - model.modeling_utils -   Weights from pretrained model not used in HeroForVcmr: ['vocab_padded', 'v_encoder.fr_output.linear_1.weight', 'v_encoder.fr_output.linear_1.bias', 'v_encoder.fr_output.LayerNorm.weight', 'v_encoder.fr_output.LayerNorm.bias', 'v_encoder.fr_output.linear_2.weight', 'v_encoder.fr_output.linear_2.bias', 'v_encoder.itm_clip_transform.linear_1.weight', 'v_encoder.itm_clip_transform.linear_1.bias', 'v_encoder.itm_clip_transform.LayerNorm.weight', 'v_encoder.itm_clip_transform.LayerNorm.bias', 'v_encoder.itm_clip_transform.linear_2.weight', 'v_encoder.itm_clip_transform.linear_2.bias', 'v_encoder.itm_sub_transform.linear_1.weight', 'v_encoder.itm_sub_transform.linear_1.bias', 'v_encoder.itm_sub_transform.LayerNorm.weight', 'v_encoder.itm_sub_transform.LayerNorm.bias', 'v_encoder.itm_sub_transform.linear_2.weight', 'v_encoder.itm_sub_transform.linear_2.bias']
[1,0]<stdout>:Selected optimization level O2:  FP16 training with FP32 batchnorm and FP32 master weights.
[1,0]<stdout>:
[1,0]<stdout>:Defaults for this optimization level are:
[1,0]<stdout>:enabled                : True
[1,0]<stdout>:opt_level              : O2
[1,0]<stdout>:cast_model_type        : torch.float16
[1,0]<stdout>:patch_torch_functions  : False
[1,0]<stdout>:keep_batchnorm_fp32    : True
[1,0]<stdout>:master_weights         : True
[1,0]<stdout>:loss_scale             : dynamic
[1,0]<stdout>:Processing user overrides (additional kwargs that are not None)...
[1,0]<stdout>:After processing overrides, optimization options are:
[1,0]<stdout>:enabled                : True
[1,0]<stdout>:opt_level              : O2
[1,0]<stdout>:cast_model_type        : torch.float16
[1,0]<stdout>:patch_torch_functions  : False
[1,0]<stdout>:keep_batchnorm_fp32    : True
[1,0]<stdout>:master_weights         : True
[1,0]<stdout>:loss_scale             : dynamic
[1,0]<stderr>:Traceback (most recent call last):
[1,0]<stderr>:  File "train_vcmr.py", line 399, in <module>
[1,0]<stderr>:    main(args)
[1,0]<stderr>:  File "train_vcmr.py", line 161, in main
[1,0]<stderr>:    restorer = TrainingRestorer(opts, model, optimizer)
[1,0]<stderr>:  File "/src/utils/save.py", line 141, in __init__
[1,0]<stderr>:    assert vars(opts) == restore_opts
[1,0]<stderr>:AssertionError
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[30056,1],0]
  Exit code:    1
--------------------------------------------------------------------------

It seems to be caused by a mismatch between vars(opts) and restore_opts:

vars(opts)= {'model_config': 'config/hero_finetune.json', 'checkpoint': '/pretrain/hero-tv-ht100.pt', 'train_batch_size': 32, 'val_batch_size': 20, 'gradient_accumulation_steps': 2, 'learning_rate': 0.0001, 'valid_steps': 200, 'save_steps': 200, 'optim': 'adamw', 'betas': [
        0.9,
        0.98
    ], 'dropout': 0.1, 'weight_decay': 0.01, 'grad_norm': 1.0, 'warmup_steps': 500, 'lr_mul': 1.0, 'num_train_steps': 5000, 'output_dir': '/storage/tvr_default', 'sub_ctx_len': 0, 'max_clip_len': 100, 'max_txt_len': 60, 'vfeat_version': 'resnet_slowfast', 'vfeat_interval': 1.5, 'compressed_db': False, 'seed': 77, 'n_workers': 4, 'pin_mem': True, 'fp16': True, 'task': 'tvr', 'vcmr_eval_video_batch_size': 50, 'vcmr_eval_q_batch_size': 80, 'drop_svmr_prob': 0.8, 'lw_neg_q': 8.0, 'lw_neg_ctx': 8.0, 'lw_st_ed': 0.01, 'ranking_loss_type': 'hinge', 'margin': 0.1, 'hard_pool_size': [
        20
    ], 'hard_neg_weights': [
        10
    ], 'hard_negtiave_start_step': [
        2000
    ], 'train_span_start_step': 0, 'use_all_neg': True, 'eval_with_query_type': True, 'max_before_nms': 200, 'max_after_nms': 100, 'distributed_eval': True, 'nms_thd': 0.5, 'q2c_alpha': 20, 'max_vcmr_video': 100, 'full_eval_tasks': ['VCMR', 'SVMR', 'VR'
    ], 'min_pred_l': 2, 'max_pred_l': 16, 'sub_txt_db': '/txt/tv_subtitles.db', 'vfeat_db': '/video/tv', 'train_query_txt_db': '/txt/tvr_train.db', 'val_query_txt_db': '/txt/tvr_val.db', 'test_query_txt_db': None, 'vcmr_eval_batch_size': 80, 'rank': 0, 'n_gpu': 1
}

restore_opts= {'model_config': 'config/hero.json', 'checkpoint': '/pretrain/hero-tv-ht100.pt', 'train_batch_size': 32, 'val_batch_size': 20, 'gradient_accumulation_steps': 2, 'learning_rate': 0.0001, 'valid_steps': 200, 'save_steps': 200, 'optim': 'adamw', 'betas': [
        0.9,
        0.98
    ], 'dropout': 0.1, 'weight_decay': 0.01, 'grad_norm': 1.0, 'warmup_steps': 500, 'lr_mul': 1.0, 'num_train_steps': 5000, 'output_dir': '/storage/linjie_saved_results/release_debug/tvr_default', 'sub_ctx_len': 0, 'max_clip_len': 100, 'max_txt_len': 60, 'vfeat_version': 'resnet_slowfast', 'vfeat_interval': 1.5, 'compressed_db': False, 'seed': 77, 'n_workers': 4, 'pin_mem': True, 'fp16': True, 'task': 'tvr', 'vcmr_eval_video_batch_size': 50, 'vcmr_eval_q_batch_size': 80, 'drop_svmr_prob': 0.8, 'lw_neg_q': 8.0, 'lw_neg_ctx': 8.0, 'lw_st_ed': 0.01, 'ranking_loss_type': 'hinge', 'margin': 0.1, 'hard_pool_size': [
        20
    ], 'hard_neg_weights': [
        10
    ], 'hard_negtiave_start_step': [
        2000
    ], 'train_span_start_step': 0, 'use_all_neg': True, 'eval_with_query_type': True, 'max_before_nms': 200, 'max_after_nms': 100, 'distributed_eval': True, 'nms_thd': 0.5, 'q2c_alpha': 20, 'max_vcmr_video': 100, 'full_eval_tasks': ['VCMR', 'SVMR', 'VR'
    ], 'min_pred_l': 2, 'max_pred_l': 16, 'tasks': 'tvr', 'sub_txt_db': '/txt/tv_subtitles.db', 'vfeat_db': '/video/tv', 'train_query_txt_db': '/txt/tvr_train.db', 'val_query_txt_db': '/txt/tvr_val.db', 'drop_sub_prob': 0, 'vcmr_eval_batch_size': 80, 'rank': 0, 'n_gpu': 8
}

And then I changed the contents of these two files, $PATH_TO_STORAGE/finetune/tvr_default/log/hps.json and config/train-tvr-8gpu.json, to fix this error:

# store_temp/finetune/tvr_default/log/hps.json
{
    "model_config": "config/hero_finetune.json",
    "checkpoint": "/pretrain/hero-tv-ht100.pt",
    "train_batch_size": 32,
    "val_batch_size": 20,
    "gradient_accumulation_steps": 2,
    "learning_rate": 0.0001,
    "valid_steps": 200,
    "save_steps": 200,
    "optim": "adamw",
    "betas": [
        0.9,
        0.98
    ],
    "dropout": 0.1,
    "weight_decay": 0.01,
    "grad_norm": 1.0,
    "warmup_steps": 500,
    "lr_mul": 1.0,
    "num_train_steps": 5000,
    "output_dir": "/storage/tvr_default",
    "sub_ctx_len": 0,
    "max_clip_len": 100,
    "max_txt_len": 60,
    "vfeat_version": "resnet_slowfast",
    "vfeat_interval": 1.5,
    "compressed_db": false,
    "seed": 77,
    "n_workers": 4,
    "pin_mem": true,
    "fp16": true,
    "task": "tvr",
    "vcmr_eval_video_batch_size": 50,
    "vcmr_eval_q_batch_size": 80,
    "drop_svmr_prob": 0.8,
    "lw_neg_q": 8.0,
    "lw_neg_ctx": 8.0,
    "lw_st_ed": 0.01,
    "ranking_loss_type": "hinge",
    "margin": 0.1,
    "hard_pool_size": [
        20
    ],
    "hard_neg_weights": [
        10
    ],
    "hard_negtiave_start_step": [
        2000
    ],
    "train_span_start_step": 0,
    "use_all_neg": true,
    "eval_with_query_type": true,
    "max_before_nms": 200,
    "max_after_nms": 100,
    "distributed_eval": true,
    "nms_thd": 0.5,
    "q2c_alpha": 20,
    "max_vcmr_video": 100,
    "full_eval_tasks": [
        "VCMR",
        "SVMR",
        "VR"
    ],
    "min_pred_l": 2,
    "max_pred_l": 16,
    "sub_txt_db": "/txt/tv_subtitles.db",
    "vfeat_db": "/video/tv",
    "train_query_txt_db": "/txt/tvr_train.db",
    "val_query_txt_db": "/txt/tvr_val.db",
    "test_query_txt_db": null,
    "vcmr_eval_batch_size": 80,
    "rank": 0,
    "tasks": "tvr",
    "drop_sub_prob": 0,
    "n_gpu": 1
}
# config/train-tvr-8gpu.json
{
    "task": "tvr",
    "sub_txt_db": "/txt/tv_subtitles.db",
    "vfeat_db": "/video/tv",
    "train_query_txt_db": "/txt/tvr_train.db",
    "val_query_txt_db": "/txt/tvr_val.db",
    "test_query_txt_db": null,
    "compressed_db": false,
    "model_config": "config/hero_finetune.json",
    "checkpoint": "/pretrain/hero-tv-ht100.pt",
    "output_dir": "/storage/tvr_default",
    "eval_with_query_type": true,
    "max_before_nms": 200,
    "max_after_nms": 100,
    "distributed_eval": true,
    "nms_thd": 0.5,
    "q2c_alpha": 20,
    "max_vcmr_video": 100,
    "full_eval_tasks": [
        "VCMR",
        "SVMR",
        "VR"
    ],
    "max_clip_len": 100,
    "max_txt_len": 60,
    "vfeat_version": "resnet_slowfast",
    "vfeat_interval": 1.5,
    "min_pred_l": 2,
    "max_pred_l": 16,
    "drop_svmr_prob": 0.8,
    "train_batch_size": 32,
    "val_batch_size": 20,
    "vcmr_eval_video_batch_size": 50,
    "vcmr_eval_batch_size": 80,
    "gradient_accumulation_steps":2,
    "learning_rate": 1e-04,
    "valid_steps": 200,
    "save_steps": 200,
    "num_train_steps": 5000,
    "optim": "adamw",
    "betas": [
        0.9,
        0.98
    ],
    "dropout": 0.1,
    "weight_decay": 0.01,
    "grad_norm": 1.0,
    "warmup_steps": 500,
    "lw_neg_q": 8.0,
    "lw_neg_ctx": 8.0,
    "lw_st_ed": 0.01,
    "ranking_loss_type": "hinge",
    "margin": 0.1,
    "hard_pool_size": [
        20
    ],
    "hard_neg_weights": [
        10
    ],
    "hard_negtiave_start_step": [
        2000
    ],
    "train_span_start_step": 0,
    "sub_ctx_len": 0,
    "use_all_neg": true,
    "seed": 77,
    "fp16": true,
    "n_workers": 4,
    "pin_mem": true,
    "rank": 0,
    "tasks": "tvr",
    "drop_sub_prob": 0
}

Is this the correct way to fix this error?

Finetuning for the TVR task

I only have two GPUs in my machine. When I use the command horovodrun -np 2 python train_vcmr.py --config config/train-tvr-8gpu.json, the code hits a deadlock. I attach the output here. How can I solve this issue?

[1,0]:02/01/2021 22:11:30 - INFO - main - Loading tvr train dataset /video/tv
[1,0]:02/01/2021 22:11:33 - INFO - main - 87153 samples loaded
[1,1]:
[1,0]:02/01/2021 22:11:33 - INFO - main - Loading tvr validation dataset/video/tv
[1,0]:02/01/2021 22:11:33 - INFO - main - 10895 samples loaded
[1,0]:02/01/2021 22:11:33 - INFO - main - Loading Inference Dataset /txt/tvr_val.db (val)
[1,0]:02/01/2021 22:11:34 - INFO - model.model - Model config:
[1,0]:02/01/2021 22:11:34 - INFO - model.model - Cross-Modal Transformer config: {
[1,0]: "attention_probs_dropout_prob": 0.1,
[1,0]: "hidden_act": "gelu",
[1,0]: "hidden_dropout_prob": 0.1,
[1,0]: "hidden_size": 768,
[1,0]: "initializer_range": 0.02,
[1,0]: "intermediate_size": 3072,
[1,0]: "layer_norm_eps": 1e-12,
[1,0]: "max_position_embeddings": 514,
[1,0]: "num_attention_heads": 12,
[1,0]: "num_hidden_layers": 6,
[1,0]: "output_attentions": false,
[1,0]: "output_hidden_states": false,
[1,0]: "type_vocab_size": 2,
[1,0]: "vocab_size": 50272
[1,0]:}
[1,0]:
[1,0]:02/01/2021 22:11:34 - INFO - model.model - Temporal Transformer config: {
[1,0]: "attention_probs_dropout_prob": 0.1,
[1,0]: "hidden_act": "gelu",
[1,0]: "hidden_dropout_prob": 0.1,
[1,0]: "hidden_size": 768,
[1,0]: "initializer_range": 0.02,
[1,0]: "intermediate_size": 3072,
[1,0]: "layer_norm_eps": 1e-12,
[1,0]: "max_position_embeddings": 514,
[1,0]: "num_attention_heads": 12,
[1,0]: "num_hidden_layers": 3,
[1,0]: "output_attentions": false,
[1,0]: "output_hidden_states": false,
[1,0]: "type_vocab_size": 2,
[1,0]: "vocab_size": -1
[1,0]:}
[1,0]:
[1,0]:02/01/2021 22:11:34 - INFO - model.model - QueryEncoder config: {
[1,0]: "attention_probs_dropout_prob": 0.1,
[1,0]: "hidden_act": "gelu",
[1,0]: "hidden_dropout_prob": 0.1,
[1,0]: "hidden_size": 768,
[1,0]: "initializer_range": 0.02,
[1,0]: "intermediate_size": 3072,
[1,0]: "layer_norm_eps": 1e-12,
[1,0]: "max_position_embeddings": 514,
[1,0]: "num_attention_heads": 12,
[1,0]: "num_hidden_layers": 0,
[1,0]: "output_attentions": false,
[1,0]: "output_hidden_states": false,
[1,0]: "type_vocab_size": 1,
[1,0]: "vocab_size": 50272
[1,0]:}
[1,0]:
[1,0]:02/01/2021 22:11:34 - INFO - model.model - Decoder Transformer config: None
[1,1]:02/01/2021 22:11:34 - INFO - model.model - Model config:
[1,1]:02/01/2021 22:11:34 - INFO - model.model - Cross-Modal Transformer config: {
[1,1]: "attention_probs_dropout_prob": 0.1,
[1,1]: "hidden_act": "gelu",
[1,1]: "hidden_dropout_prob": 0.1,
[1,1]: "hidden_size": 768,
[1,1]: "initializer_range": 0.02,
[1,1]: "intermediate_size": 3072,
[1,1]: "layer_norm_eps": 1e-12,
[1,1]: "max_position_embeddings": 514,
[1,1]: "num_attention_heads": 12,
[1,1]: "num_hidden_layers": 6,
[1,1]: "output_attentions": false,
[1,1]: "output_hidden_states": false,
[1,1]: "type_vocab_size": 2,
[1,1]: "vocab_size": 50272
[1,1]:}
[1,1]:
[1,1]:02/01/2021 22:11:34 - INFO - model.model - Temporal Transformer config: {
[1,1]: "attention_probs_dropout_prob": 0.1,
[1,1]: "hidden_act": "gelu",
[1,1]: "hidden_dropout_prob": 0.1,
[1,1]: "hidden_size": 768,
[1,1]: "initializer_range": 0.02,
[1,1]: "intermediate_size": 3072,
[1,1]: "layer_norm_eps": 1e-12,
[1,1]: "max_position_embeddings": 514,
[1,1]: "num_attention_heads": 12,
[1,1]: "num_hidden_layers": 3,
[1,1]: "output_attentions": false,
[1,1]: "output_hidden_states": false,
[1,1]: "type_vocab_size": 2,
[1,1]: "vocab_size": -1
[1,1]:}
[1,1]:
[1,1]:02/01/2021 22:11:34 - INFO - model.model - QueryEncoder config: {
[1,1]: "attention_probs_dropout_prob": 0.1,
[1,1]: "hidden_act": "gelu",
[1,1]: "hidden_dropout_prob": 0.1,
[1,1]: "hidden_size": 768,
[1,1]: "initializer_range": 0.02,
[1,1]: "intermediate_size": 3072,
[1,1]: "layer_norm_eps": 1e-12,
[1,1]: "max_position_embeddings": 514,
[1,1]: "num_attention_heads": 12,
[1,1]: "num_hidden_layers": 0,
[1,1]: "output_attentions": false,
[1,1]: "output_hidden_states": false,
[1,1]: "type_vocab_size": 1,
[1,1]: "vocab_size": 50272
[1,1]:}
[1,1]:
[1,1]:02/01/2021 22:11:34 - INFO - model.model - Decoder Transformer config: None
[1,1]:02/01/2021 22:11:45 - INFO - model.modeling_utils - Weights from pretrained model not used in HeroForVcmr: ['vocab_padded']
[1,0]:02/01/2021 22:11:45 - INFO - model.modeling_utils - Weights from pretrained model not used in HeroForVcmr: ['vocab_padded']
[1,1]:Selected optimization level O2: FP16 training with FP32 batchnorm and FP32 master weights.
[1,1]:
[1,1]:Defaults for this optimization level are:
[1,1]:enabled : True
[1,1]:opt_level : O2
[1,1]:cast_model_type : torch.float16
[1,1]:patch_torch_functions : False
[1,1]:keep_batchnorm_fp32 : True
[1,1]:master_weights : True
[1,1]:loss_scale : dynamic
[1,1]:Processing user overrides (additional kwargs that are not None)...
[1,1]:After processing overrides, optimization options are:
[1,1]:enabled : True
[1,1]:opt_level : O2
[1,1]:cast_model_type : torch.float16
[1,1]:patch_torch_functions : False
[1,1]:keep_batchnorm_fp32 : True
[1,1]:master_weights : True
[1,1]:loss_scale : dynamic
[1,0]:Selected optimization level O2: FP16 training with FP32 batchnorm and FP32 master weights.
[1,0]:
[1,0]:Defaults for this optimization level are:
[1,0]:enabled : True
[1,0]:opt_level : O2
[1,0]:cast_model_type : torch.float16
[1,0]:patch_torch_functions : False
[1,0]:keep_batchnorm_fp32 : True
[1,0]:master_weights : True
[1,0]:loss_scale : dynamic
[1,0]:Processing user overrides (additional kwargs that are not None)...
[1,0]:After processing overrides, optimization options are:
[1,0]:enabled : True
[1,0]:opt_level : O2
[1,0]:cast_model_type : torch.float16
[1,0]:patch_torch_functions : False
[1,0]:keep_batchnorm_fp32 : True
[1,0]:master_weights : True
[1,0]:loss_scale : dynamic
[1,1]:restorer is finished
[1,0]:restorer is finished
[1,0]:02/01/2021 22:11:46 - INFO - main - Waiting on git info....
[1,0]:fatal: not a git repository (or any parent up to mount point /)
[1,0]:Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
[1,0]:02/01/2021 22:11:46 - INFO - main - Git branch:
[1,0]:fatal: not a git repository (or any parent up to mount point /)
[1,0]:Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
[1,0]:02/01/2021 22:11:46 - INFO - main - Git SHA:
[1,0]:fatal: not a git repository (or any parent up to mount point /)
[1,0]:Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
[1,0]:02/01/2021 22:11:47 - ERROR - main - Command '['git', 'status', '--short']' returned non-zero exit status 128.
[1,0]:Traceback (most recent call last):
[1,0]: File "/src/utils/save.py", line 51, in save_training_meta
[1,0]: cwd=git_dir, universal_newlines=True).strip()
[1,0]: File "/opt/conda/lib/python3.6/subprocess.py", line 356, in check_output
[1,0]: **kwargs).stdout
[1,0]: File "/opt/conda/lib/python3.6/subprocess.py", line 438, in run
[1,0]: output=stdout, stderr=stderr)
[1,0]:subprocess.CalledProcessError: Command '['git', 'status', '--short']' returned non-zero exit status 128.
[1,0]:02/01/2021 22:11:47 - WARNING - main - Git info not found. Saving code into zip instead...
[1,0]:02/01/2021 22:11:47 - INFO - main - Saving code from /src to /storage/tvr_default/code.zip...
[1,0]:[2021-02-01 22:13:23.215038: W horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
[1,0]:Stalled ranks:
[1,0]:0: [allgather.noname.40]
[1,0]:[2021-02-01 22:14:23.216625: W horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
[1,0]:Stalled ranks:
[1,0]:0: [allgather.noname.40][1,0]:
[1,0]:[2021-02-01 22:15:23.218786: W horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
[1,0]:Stalled ranks:
[1,0]:0: [allgather.noname.40]
[1,0]:[2021-02-01 22:16:23.223750: W horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
[1,0]:Stalled ranks:
[1,0]:0: [allgather.noname.40]

Weights of HeroForVideoQA not initialized from pretrained model

Hello, I'm running train_videoQA.py and the program reports an error showing that model parameters are missing. What's the matter?

04/28/2021 02:00:52 - INFO - model.modeling_utils - Weights of HeroForVideoQA not initialized from pretrained model: ['qa_pool.weight', 'qa_pred_head.linear_1.weight', 'qa_pred_head.linear_1.bias', 'qa_pred_head.LayerNorm.weight', 'qa_pred_head.LayerNorm.bias', 'qa_pred_head.linear_2.weight', 'qa_pred_head.linear_2.bias', 'st_ed_pool.weight', 'st_ed_pred_head.linear_1.weight', 'st_ed_pred_head.linear_1.bias', 'st_ed_pred_head.LayerNorm.weight', 'st_ed_pred_head.LayerNorm.bias', 'st_ed_pred_head.linear_2.weight', 'st_ed_pred_head.linear_2.bias']
04/28/2021 02:00:52 - INFO - model.modeling_utils - Weights from pretrained model not used in HeroForVideoQA: ['q_feat_attn.query_input_proj.LayerNorm.weight', 'q_feat_attn.query_input_proj.LayerNorm.bias', 'q_feat_attn.query_input_proj.net.1.weight', 'q_feat_attn.query_input_proj.net.1.bias', 'q_feat_attn.query_pos_embed.position_embeddings.weight', 'q_feat_attn.query_pos_embed.LayerNorm.weight', 'q_feat_attn.query_pos_embed.LayerNorm.bias', 'q_feat_attn.query_self_attention.self.query.weight', 'q_feat_attn.query_self_attention.self.query.bias', 'q_feat_attn.query_self_attention.self.key.weight', 'q_feat_attn.query_self_attention.self.key.bias', 'q_feat_attn.query_self_attention.self.value.weight', 'q_feat_attn.query_self_attention.self.value.bias', 'q_feat_attn.query_self_attention.output.dense.weight', 'q_feat_attn.query_self_attention.output.dense.bias', 'q_feat_attn.query_self_attention.output.LayerNorm.weight', 'q_feat_attn.query_self_attention.output.LayerNorm.bias', 'q_feat_attn.modular_vector_mapping.weight', 'video_query_linear.weight', 'video_query_linear.bias', 'video_st_predictor.weight', 'video_ed_predictor.weight', 'vocab_padded']
Selected optimization level O2: FP16 training with FP32 batchnorm and FP32 master weights.

Feature extractor for custom dataset

Hello,

Congratulations on your EMNLP paper. Very exciting and impressive results. I was wondering whether you could share the data preprocessing scripts you used to extract the video features in your paper.

Thanks,
Alessandro

code for how2r and how2qa

Hi,
Thanks for sharing the code. How can I get the scripts, configs, and processed data for How2R and How2QA?
I'm looking forward to your reply~

model weights without pretraining

Hi @linjieli222,

Nice work! I have a question about the model weights without pretraining. The paper says model parameters (w/o pretraining) are initialized with RoBERTa weights. As RoBERTa has 12 layers and HERO has 6/3 layers, I wonder which layers' weights are loaded into the Cross-modal Transformer and the Temporal Transformer?

Thanks.

OOM in pretraining

I tried to pretrain the HERO model from scratch on the HowTo100M and TV datasets. The code worked well at the beginning but crashed after thousands of iterations. I found that memory usage kept growing during training until it finally ran out of memory. Have you encountered this problem?

Unable to download pretrained models & features

It seems that the links provided in download_tvr.sh and download_didemo.sh are not working.
I got the following error when executing those scripts:

Resolving convaisharables.blob.core.windows.net (convaisharables.blob.core.windows.net)... 20.60.20.68
Connecting to convaisharables.blob.core.windows.net (convaisharables.blob.core.windows.net)|20.60.20.68|:443... connected.
HTTP request sent, awaiting response... 409 Public access is not permitted on this storage account.
2022-11-09 20:49:12 ERROR 409: Public access is not permitted on this storage account..

Could you provide another link, or resolve the issue above?

Thank you.

Error while pretraining with TV data

I think I found a bug when running the pretrain.py script. I downloaded the pretrain-tv dataset with bash scripts/download_tv_pretrain.sh /path/to/storage and ran the pretrain.py script in Docker. The only changes in the config file are smaller batch sizes. After 500 steps, the program runs validation and hits "len() of a 0-d tensor" in validate_vsm.

The error occurs because one of the input features has the shape torch.Size([1, 60, 768]), which causes loss_neg_ctx to be a scalar, and len() fails on a 0-d tensor. The simple fix that works for me is an inline if that returns 1 when the tensor is 0-dimensional (sketched below). An alternative might be to force the scalar into a 1-d vector with one value, but I have not tested this solution.

Happy to open a PR if needed.
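A sketch of the inline fix described above, assuming loss_neg_ctx is a torch.Tensor that collapses to 0-d for single-sample batches:

# Sketch of the described fix: guard len() against a 0-d tensor.
n_neg = len(loss_neg_ctx) if loss_neg_ctx.dim() > 0 else 1

# Untested alternative mentioned above: force the scalar into a
# one-element 1-d tensor instead.
# loss_neg_ctx = loss_neg_ctx.reshape(-1)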

Single-channel results on multi-channel datasets

Hi,
Nice work! I understand HERO is designed primarily for multi-channel tasks, but I am curious how HERO performs on How2QA and TVQA when not using the subtitles. Do you have these numbers? It would help to understand the importance of this modality in different domains (YouTube ASR vs. TV subtitles).
Best,
Antoine Yang

Pre-training based on HowTo100M dataset

As done in your paper, the videos of HowTo100M are segmented into 60s clips. I also processed the caption.json of this dataset to match the segmented clips. When I pre-trained the model, a "CUDA out of memory" error occurred. I guess there are too many subtitles in HowTo100M. How can I solve this problem?

"lmdb.LockError: mdb_txn_begin" when using network file system.

We are trying to reproduce the results with the same settings, including the number of GPUs.
However, we are struggling with the horovod setup.
While loading the dataset from LMDB, it shows the following error again and again:
"lmdb.LockError: mdb_txn_begin"

So we searched, and on StackOverflow we found the following answer, which points out that LMDB does not fit well with a network file system (NFS):

https://stackoverflow.com/questions/61365680/lmdb-error-lmdb-lockerror-mdb-txn-begin-resource-temporarily-unavaliable

So, my question is: how did the authors train the model using 16 GPUs (or more) with NFS?
If NFS was not used, we are also curious how the authors trained with horovod on a non-network file system.

Or is there any alternative way to solve this problem?
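One common workaround for LMDB lock errors on NFS (an assumption on my part, not something the authors prescribe) is to open the read-only databases without OS file locking:

# Workaround sketch: skip flock() on read-only LMDB environments, which
# NFS handles poorly. This is safe only because the DBs are never written
# to during training.
import lmdb

env = lmdb.open("/txt/tv_subtitles.db", readonly=True,
                create=False, lock=False)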

Reproducing results on TVC w/o pre-training: getting scores higher than reported in Table 4 of the paper

Firstly, I would like to thank you for providing the source code along with documentation and weights - it was really helpful.

I tried reproducing the results reported in the paper for the TVC dataset (HERO w/o pre-training), using the weights from the checkpoint "pretrain-tv-init.bin" for RoBERTa weight initialization of the 6-layer Cross-modal Transformer. The scores are shown below:

Reported - 43.62 (Cider) | 10.75 (Bleu@4)
Reproduced - 47.52 (Cider) | 11.26 (Bleu@4)

I am getting a ~4% better CIDEr score and a ~1% better BLEU score with everything else the same. Is there any reason why I get better scores than those reported in the paper, @linjieli222? The difference is too big to ignore; any insights would be helpful.

Thanks.

Frame-level feature

Hi,

Thanks for making your great work open source. I am trying to do the feature extraction myself and am wondering how the frame-level features are encoded with SlowFast. As pre-trained SlowFast receives a fixed number of frames as input for action recognition, did you sample multiple clips from a video at different locations, or perform other operations such as pooling or concatenation?

I look forward to your reply.

How long does it take for pre-training on TV with MLM+MNCE from scratch?

@linjieli222
Hi, thanks for your great project!
As mentioned in your paper, the best pre-trained HERO needs to be trained on 16 V100 GPUs for about 3 weeks.
Due to GPU and memory limitations, I would like to conduct pre-training on TV with MLM+MNCE first (that is, row L2 in Table 1 of your paper).

I would like to ask three questions:

  1. How long does it take for pre-training on TV with MLM+MNCE from scratch? (L2 in Table 1 in your paper)

  2. Could you please show me the commands to conduct pre-training on TV with MLM+MNCE and fine-tuning on TVR from scratch? I am a novice in pre-training projects. :)

    I think I need to conduct this experiment by 7 steps:

    1/ download TV dataset
    2/ Text & Video feature extraction from TV dataset
      or directly use the Text & Video features provided by you
    3/ pre-training on TV with MLM+MNCE
    
    4/ download TVR dataset
    5/ Text & Video feature extraction from TVR dataset
      or directly use the Text & Video features provided by you
    6/ fine-tuning & inference on TVR
    7/ submit results to TVR codalab
    
  3. I find that the download via bash scripts/download_tvr.sh $PATH_TO_STORAGE is too slow, less than 1 MB/s. Do you have another download server?
    [Done. No need to reply to this question.]

Checkpoint for TVQA task

Hi,
Thanks for sharing the code. Where can I download the model after it has been finetuned for the QA task (i.e., tvqa_default according to the train-tvqa-8gpu.json config)? I'd like to use the finetuned checkpoint for inference on a custom QA dataset directly. Thanks.

RuntimeError: Mismatched data types: One rank had type int64, but another rank had type uint8.

Hi, I got a problem when running the command horovodrun -np 2 python pretrain.py --config config/pretrain-tv-16gpu.json --output_dir ./pre_train_ckpt/ckpt/. Could you please help me?

0%|          | 500/100000 [06:57<22:23:43,  1.23it/s][1,0]<stderr>:02/24/2022 15:17:53 - INFO - __main__ -   -------------------------------------------
[1,0]<stderr>:02/24/2022 15:17:53 - INFO - __main__ -   Step 500:
[1,0]<stderr>:02/24/2022 15:17:53 - INFO - __main__ -   mlm_tv_all: 3384 examples trained at 8 ex/s
[1,0]<stderr>:02/24/2022 15:17:53 - INFO - __main__ -   mfm-nce_tv_all: 3192 examples trained at 7 ex/s
[1,0]<stderr>:02/24/2022 15:17:53 - INFO - __main__ -   fom_tv_all: 1968 examples trained at 4 ex/s
[1,0]<stderr>:02/24/2022 15:17:53 - INFO - __main__ -   vsm_tv_all: 3456 examples trained at 8 ex/s
[1,0]<stderr>:02/24/2022 15:17:53 - INFO - __main__ -   ===========================================
[1,0]<stderr>:02/24/2022 15:17:53 - INFO - __main__ -   Step 500: start running validation
[1,0]<stderr>:02/24/2022 15:17:53 - INFO - __main__ -   validate on mlm_tv_all task
[1,0]<stderr>:02/24/2022 15:17:53 - INFO - __main__ -   start running MLM validation...
[1,0]<stderr>:02/24/2022 15:17:55 - INFO - __main__ -   validation finished in 2 seconds, acc: 2.41
[1,0]<stderr>:02/24/2022 15:17:55 - INFO - __main__ -   validate on mfm-nce_tv_all task
[1,0]<stderr>:02/24/2022 15:17:55 - INFO - __main__ -   start running MFM-NCE validation...
[1,0]<stderr>:02/24/2022 15:17:58 - INFO - __main__ -   validation finished in 2 seconds, loss: 15.16, acc: 1.99 (average 350 negatives)
[1,0]<stderr>:02/24/2022 15:17:58 - INFO - __main__ -   validate on fom_tv_all task
[1,0]<stderr>:02/24/2022 15:17:58 - INFO - __main__ -   start running FOM validation...
[1,0]<stderr>:02/24/2022 15:18:01 - INFO - __main__ -   validation finished in 2 seconds, score: 1.92
[1,0]<stderr>:02/24/2022 15:18:01 - INFO - __main__ -   validate on vsm_tv_all task
[1,0]<stderr>:02/24/2022 15:18:01 - INFO - __main__ -   start running VSM validation...
[1,1]<stderr>:Traceback (most recent call last):
[1,1]<stderr>:  File "pretrain.py", line 621, in <module>
[1,1]<stderr>:    main(args)
[1,1]<stderr>:  File "pretrain.py", line 372, in main
[1,1]<stderr>:    validate(model, val_dataloaders, opts)
[1,1]<stderr>:  File "pretrain.py", line 403, in validate
[1,1]<stderr>:    val_log = validate_vsm(model, loader, opts)
[1,1]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 49, in decorate_no_grad
[1,1]<stderr>:    return func(*args, **kwargs)
[1,1]<stderr>:  File "pretrain.py", line 436, in validate_vsm
[1,1]<stderr>:    val_loss_st_ed = sum(all_gather_list(val_loss_st_ed))
[1,1]<stderr>:  File "/mnt/workspace/wangqifan.wqf/antmmf/hero-ex/utils/distributed.py", line 190, in all_gather_list
[1,1]<stderr>:    out_buffer = hvd.allgather(in_buffer[:enc_byte+enc_size])
[1,1]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/horovod/torch/mpi_ops.py", line 287, in allgather
[1,1]<stderr>:    return HorovodAllgather.apply(tensor, name)
[1,1]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/horovod/torch/mpi_ops.py", line 250, in forward
[1,1]<stderr>:    return synchronize(handle)
[1,1]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/horovod/torch/mpi_ops.py", line 443, in synchronize
[1,1]<stderr>:    mpi_lib.horovod_torch_wait_and_clear(handle)
[1,1]<stderr>:RuntimeError: Mismatched data types: One rank had type int64, but another rank had type uint8.
[1,0]<stderr>:Traceback (most recent call last):
[1,0]<stderr>:  File "pretrain.py", line 621, in <module>
[1,0]<stderr>:    main(args)
[1,0]<stderr>:  File "pretrain.py", line 372, in main
[1,0]<stderr>:    validate(model, val_dataloaders, opts)
[1,0]<stderr>:  File "pretrain.py", line 403, in validate
[1,0]<stderr>:    val_log = validate_vsm(model, loader, opts)
[1,0]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 49, in decorate_no_grad
[1,0]<stderr>:    return func(*args, **kwargs)
[1,0]<stderr>:  File "pretrain.py", line 427, in validate_vsm
[1,0]<stderr>:    model(batch, 'vsm', compute_loss=True)
[1,0]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 545, in __call__
[1,0]<stderr>:    result = self.forward(*input, **kwargs)
[1,0]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/apex/amp/_initialize.py", line 194, in new_fwd
[1,0]<stderr>:    **applier(kwargs, input_caster))
[1,0]<stderr>:  File "/mnt/workspace/wangqifan.wqf/antmmf/hero-ex/model/pretrain.py", line 84, in forward
[1,0]<stderr>:    batch['c_attn_masks'])
[1,0]<stderr>:  File "/mnt/workspace/wangqifan.wqf/antmmf/hero-ex/model/pretrain.py", line 400, in get_video_level_scores
[1,0]<stderr>:    modularized_query = vsm_allgather(modularized_query).contiguous()
[1,0]<stderr>:  File "/mnt/workspace/wangqifan.wqf/antmmf/hero-ex/model/pretrain.py", line 452, in vsm_allgather
[1,0]<stderr>:    return VsmAllgather.apply(tensor, None)
[1,0]<stderr>:  File "/mnt/workspace/wangqifan.wqf/antmmf/hero-ex/model/pretrain.py", line 437, in forward
[1,0]<stderr>:    torch.tensor([ctx.dim], device=tensor.device)
[1,0]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/horovod/torch/mpi_ops.py", line 287, in allgather
[1,0]<stderr>:    return HorovodAllgather.apply(tensor, name)
[1,0]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/horovod/torch/mpi_ops.py", line 250, in forward
[1,0]<stderr>:    return synchronize(handle)
[1,0]<stderr>:  File "/opt/conda/lib/python3.6/site-packages/horovod/torch/mpi_ops.py", line 443, in synchronize
[1,0]<stderr>:    mpi_lib.horovod_torch_wait_and_clear(handle)
[1,0]<stderr>:RuntimeError: Mismatched data types: One rank had type int64, but another rank had type uint8.
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[28050,1],1]
  Exit code:    1
-------------------------------------------------------------------------- 

Unable to download pretrained models & features

It seems that the links provided in download_pretrained.sh and download_vcr.sh are not working.
I got the following error when executing those scripts:

--2022-12-09 09:58:39--  https://convaisharables.blob.core.windows.net/hero/pretrained/hero-tv-ht100.pt
Resolving convaisharables.blob.core.windows.net (convaisharables.blob.core.windows.net)... 218.4.189.4
Connecting to convaisharables.blob.core.windows.net (convaisharables.blob.core.windows.net)|218.4.189.4|:443... connected.
ERROR: no certificate subject alternative name matches
        requested host name ‘convaisharables.blob.core.windows.net’.
To connect to convaisharables.blob.core.windows.net insecurely, use `--no-check-certificate'.

When using --no-check-certificate:

--2022-12-09 09:58:49--  https://convaisharables.blob.core.windows.net/hero/pretrained/hero-tv-ht100.pt
Resolving convaisharables.blob.core.windows.net (convaisharables.blob.core.windows.net)... 218.4.189.4
Connecting to convaisharables.blob.core.windows.net (convaisharables.blob.core.windows.net)|218.4.189.4|:443... connected.
WARNING: no certificate subject alternative name matches
        requested host name ‘convaisharables.blob.core.windows.net’.
HTTP request sent, awaiting response... 400 
2022-12-09 09:58:49 ERROR 400: (no description).

When I visit https://convaisharables.blob.core.windows.net/hero directly through my browser, the page shows ResourceNotFound.

Could you provide an available link or resolve the issue above?

Thank you😊

Reproducing VCMR results for didemo video only

Hi @linjieli222,
I have reproduced didemo_video_only, and the results I got are much better than those in the paper.
The reproduced results are as follows:

03/02/2022 21:22:48 - INFO - main - validation finished in 3 seconds
03/02/2022 21:22:51 - INFO - main - start running full VCMR evaluation on didemo_video_only test split...
03/02/2022 21:22:54 - INFO - main - metrics_no_nms_VCMR
{ '0.5-r1': 3.590012556504269,
  '0.5-r10': 16.97737066800603,
  '0.5-r100': 44.225404319437466,
  '0.5-r5': 11.299871923656454,
  '0.7-r1': 2.9875866398794573,
  '0.7-r10': 14.11497237569061,
  '0.7-r100': 37.99495228528378,
  '0.7-r5': 9.289997488699145}

And the results in the paper:

[screenshot of the paper's results table omitted]

I can't find the reason for this discrepancy.

some questions about model details

Hi, thank you for the released code; HERO is a very interesting and impressive piece of work.
There are some questions about the model, and we hope for your response.

  1. The Cross-modal Transformer is initialized with 6 layers of the pretrained RoBERTa; why not use all 12 layers? As far as I know, most current models are based on the pretrained bert-base or roberta-base model.
  2. Why did you use the order [img, txt] instead of [txt, img]?
  3. Why did you pretrain the MLM task on the Cross-modal Transformer and MFM on the Temporal Transformer?
  4. Since aligned data is hard to collect, how much would performance drop if the local alignment were removed?

Thank you very much, and looking forward to your reply!

I find a bug in svmr evaluation

def post_processing_svmr_nms(
        svmr_res, nms_thd=0.6, max_before_nms=1000, max_after_nms=100):
    """
    svmr_res: list(dict), each dict is
        {"desc": str,
         "desc_id": int,
         "predictions": list(sublist)  # each sublist is
            [video_idx (int), st (float), ed(float), score (float)],
            video_idx is the same.
         }
    """
    processed_svmr_res = []
    for e in svmr_res:
        # the predictions are sorted inside the nms func.
        e["predictions"] = temporal_non_maximum_suppression(
            e["predictions"][:max_before_nms],
            nms_threshold=nms_thd)[:max_after_nms]
        processed_svmr_res.append(e)
    return processed_svmr_res
def temporal_non_maximum_suppression(predictions, nms_threshold,
                                     max_after_nms=100):
    """
    Args:
        predictions:
            list(sublist), each sublist is
            [st (float), ed(float), score (float)],
            note larger scores are better and are preserved.
            For metrics that are better when smaller,
            please convert to its negative,
            e.g., convert distance to negative distance.
        nms_threshold: float in [0, 1]
        max_after_nms:
    Returns:
        predictions_after_nms:
        list(sublist),
        each sublist is [st (float), ed(float), score (float)]
    References:
        https://github.com/wzmsltw/BSN-boundary-sensitive-network/blob/7b101fc5978802aa3c95ba5779eb54151c6173c6/Post_processing.py#L42
    """
    if len(predictions) == 1:  # only has one prediction, no need for nms
        return predictions

    predictions = sorted(predictions, key=lambda x: x[2],
                         reverse=True)  # descending order

    tstart = [e[0] for e in predictions]
    tend = [e[1] for e in predictions]
    tscore = [e[2] for e in predictions]
    rstart = []
    rend = []
    rscore = []
    while len(tstart) > 1 and len(rscore) < max_after_nms:  # max 100 after nms
        idx = 1
        while idx < len(tstart):  # compare with every prediction in the list.
            if compute_temporal_iou(
                    [tstart[0], tend[0]],
                    [tstart[idx], tend[idx]]) > nms_threshold:
                # rm highly overlapped lower score entries.
                tstart.pop(idx)
                tend.pop(idx)
                tscore.pop(idx)
            else:
                # move to next
                idx += 1
        rstart.append(tstart.pop(0))
        rend.append(tend.pop(0))
        rscore.append(tscore.pop(0))

    if (len(rscore) < max_after_nms
            and len(tstart) >= 1):  # add the last, possibly empty.
        rstart.append(tstart.pop(0))
        rend.append(tend.pop(0))
        rscore.append(tscore.pop(0))

    predictions_after_nms = [
        [st, ed, s] for s, st, ed in zip(rscore, rstart, rend)]
    return predictions_after_nms

Each sublist of e['predictions'] in post_processing_svmr_nms has four elements, [video_idx (int), st (float), ed (float), score (float)], while temporal_non_maximum_suppression expects sublists with three elements, [st (float), ed (float), score (float)]. So tstart, tend, tscore will actually hold video_idx, st, ed, respectively.
I fixed it myself and tested the code (a sketch follows), but it seems this problem has no effect on the VCMR results?
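A hedged sketch of the fix, using the functions quoted above: strip video_idx before NMS and re-attach it afterwards.

# Sketch of the fix: temporal_non_maximum_suppression expects 3-element
# sublists, so peel off video_idx first and restore it after NMS.
def post_processing_svmr_nms_fixed(
        svmr_res, nms_thd=0.6, max_before_nms=1000, max_after_nms=100):
    processed_svmr_res = []
    for e in svmr_res:
        # video_idx is identical across all predictions of one query
        video_idx = e["predictions"][0][0]
        preds = [p[1:] for p in e["predictions"][:max_before_nms]]
        preds = temporal_non_maximum_suppression(
            preds, nms_threshold=nms_thd)[:max_after_nms]
        e["predictions"] = [[video_idx] + p for p in preds]
        processed_svmr_res.append(e)
    return processed_svmr_res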
