MDMMT

Introduction

This repository contains the testing code for the article MDMMT: Multidomain Multimodal Transformer for Video Retrieval.

Presentation from the CVPR 2021 workshop "Large Scale Holistic Video Understanding".

This code helps you to:

  1. Create embeddings with CLIP, irCSN152, and VGGish;
  2. Create caption index files;
  3. Run tests with the created embeddings and caption index files.

Our pretrained model is available here.

Citation

@misc{dzabraev2021mdmmt,
      title={MDMMT: Multidomain Multimodal Transformer for Video Retrieval}, 
      author={Maksim Dzabraev and Maksim Kalashnikov and Stepan Komkov and Aleksandr Petiushko},
      year={2021},
      eprint={2103.10699},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Expected testing results:

MSRVTT
t2v/R1: 22.87123745819398
t2v/R5: 49.67056856187291
t2v/R10: 61.66722408026756
t2v/R50: 83.95819397993311
t2v/MedR: 6.0
t2v/MeanR: 53.69550323486328
v2t/R1: 5.075250836120401
v2t/R5: 13.777591973244148
v2t/R10: 19.695652173913043
v2t/R50: 41.44314381270903
v2t/MedR: 84.0
v2t/MeanR: 695.3896484375


LSMDC
t2v/R1: 17.31731731731732
t2v/R5: 38.23823823823824
t2v/R10: 47.447447447447445
t2v/R50: 72.87287287287288
t2v/MedR: 12.0
t2v/MeanR: 59.398399353027344
v2t/R1: 16.716716716716718
v2t/R5: 37.73773773773774
v2t/R10: 45.545545545545544
v2t/R50: 72.27227227227228
v2t/MedR: 14.0
v2t/MeanR: 60.97697448730469

ActivityNet
t2v/R1: 19.673503557974048
t2v/R5: 45.22812892423608
t2v/R10: 56.7182921724571
t2v/R50: 80.8915864378401
t2v/MedR: 7.0
t2v/MeanR: 72.35956573486328
v2t/R1: 19.715362076182505
v2t/R5: 44.72582670573462
v2t/R10: 57.157806613645874
v2t/R50: 80.87065717873587
v2t/MedR: 7.0
v2t/MeanR: 68.22499084472656

Downloads

mkdir -p ckpts
# https://github.com/facebookresearch/VMZ/
wget https://github.com/bjuncek/VMZ/releases/download/test_models/irCSN_152_ig65m_from_scratch_f125286141.pth -O ckpts/irCSN_152_ig65m_from_scratch_f125286141.pth

# https://github.com/tensorflow/models/tree/master/research/audioset/vggish
wget https://storage.googleapis.com/audioset/vggish_model.ckpt -O ckpts/vggish_model.ckpt

git clone https://github.com/openai/CLIP models/CLIP
git clone https://github.com/tensorflow/models/ models/tensorflow_models

Environment

It is recommended to use conda to install the packages, and to create two environments: the first for audio dumping and the second for video dumping.

Audio environment

Use this environment to produce embeddings with tf_vggish:

tqdm
ffmpeg=4.2.2
tensorflow-gpu
tf_slim
resampy
six
pysoundfile
numpy=1.20.2 # !!! Make sure intel-mkl is not used: it causes a segfault in np.fft.rfft !!!
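
A minimal sketch of building this environment with conda (the environment name, channel choice, and pip/conda split are illustrative assumptions, not the authors' exact setup):

conda create -n mdmmt-audio -c conda-forge python=3.7 tqdm ffmpeg=4.2.2 resampy six pysoundfile numpy=1.20.2  # env name is illustrative
conda activate mdmmt-audio
pip install tensorflow-gpu tf_slim  # assuming these are easier to obtain from PyPI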

Video environment

Use this environment to produce embeddings with CLIP and irCSN152:

tqdm
pytorch=1.7.1 # !!! It is recommended to use pytorch=1.7.1; 1.8+ does not work with CLIP !!!
torchvision
ffmpeg=4.2.2
ftfy
regex
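
A matching sketch for this environment (again, the name and channels are illustrative; for GPU support add a cudatoolkit version that matches your driver):

conda create -n mdmmt-video -c pytorch -c conda-forge python=3.7 pytorch=1.7.1 torchvision ffmpeg=4.2.2 tqdm  # env name is illustrative
conda activate mdmmt-video
pip install ftfy regex  # CLIP's text preprocessing dependencies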

Create lists

Replace <*_DATASET_ROOT>/ with the directory where the raw video files of each dataset are located.

cat lists/LSMDC/fnames.lst | awk '{print "<LSMDC_DATASET_ROOT>/" $0}' > LSMDC.lst
cat lists/ActivityNet/fnames.lst | awk '{print "<ActivityNet_DATASET_ROOT>/" $0}' > ActivityNet_val.lst
cat lists/msrvtt/fnames.lst | awk '{print "<msrvtt_DATASET_ROOT>/" $0}' > msrvtt_test.lst

Embeddings

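# MSRVTT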
python dumper.py \
    --model_type=VMZ_irCSN_152 \
    --gpus=0,1,2,3,4,5,6,7 \
    --dst_prefix=/ssd/ssd_srv79/dza/dumps/msrvtt/VMZ_irCSN_152/test  \
    --lst=msrvtt_test.lst \
    --nworker_per_gpu=2 \
    --per_batch_size=8 \
    --fps=32 \
    --frame_size=224 \
    --frame_crop_size=224 \
    --frames_per_clip=32

python dumper.py \
    --model_type=CLIP \
    --gpus=0,1,2,3,4,5,6,7 \
    --dst_prefix=/ssd/ssd_srv79/dza/dumps/msrvtt/CLIP/test  \
    --lst=msrvtt_test.lst \
    --nworker_per_gpu=8 \
    --per_batch_size=128 \
    --fps=1 \
    --frame_size=228 \
    --frame_crop_size=228 \
    --frames_per_clip=1

PYTHONPATH=\
models/tensorflow_models/research/audioset/vggish:\
models/tensorflow_models/research/audioset/:\
$PYTHONPATH \
python dumper.py \
    --model_type=tf_vggish \
    --gpus=0,1,2,3,4,5,6,7 \
    --dst_prefix=/ssd/ssd_srv79/dza/dumps/msrvtt/tf_vggish/test \
    --lst=msrvtt_test.lst \
    --nworker_per_gpu=2



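# ActivityNet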
python dumper.py \
    --model_type=VMZ_irCSN_152 \
    --gpus=0,1,2,3,4,5,6,7 \
    --dst_prefix=/ssd/ssd_srv79/dza/dumps/ActivityNet/VMZ_irCSN_152/test \
    --lst=ActivityNet_val.lst \
    --nworker_per_gpu=3 \
    --per_batch_size=8 \
    --fps=32 \
    --frame_size=224 \
    --frame_crop_size=224 \
    --frames_per_clip=32

python dumper.py \
    --model_type=CLIP \
    --gpus=0,1,2,3,4,5,6,7 \
    --dst_prefix=/ssd/ssd_srv79/dza/dumps/ActivityNet/CLIP/test \
    --lst=ActivityNet_val.lst \
    --nworker_per_gpu=3 \
    --per_batch_size=128 \
    --fps=1 \
    --frame_size=228 \
    --frame_crop_size=228 \
    --frames_per_clip=1 \
    --num_readers=8

PYTHONPATH=\
models/tensorflow_models/research/audioset/vggish:\
models/tensorflow_models/research/audioset/:\
$PYTHONPATH \
python dumper.py \
    --model_type=tf_vggish \
    --gpus=0,1,2,3,4,5,6,7 \
    --dst_prefix=/ssd/ssd_srv79/dza/dumps/ActivityNet/tf_vggish/test \
    --lst=ActivityNet_val.lst \
    --nworker_per_gpu=3 \
    --per_batch_size=32



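# LSMDC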
python dumper.py \
    --model_type=VMZ_irCSN_152 \
    --gpus=0,1,2,3,4,5,6,7 \
    --dst_prefix=/ssd/ssd_srv79/dza/dumps/LSMDC/VMZ_irCSN_152/test \
    --lst=LSMDC.lst \
    --nworker_per_gpu=2 \
    --per_batch_size=8 \
    --fps=32 \
    --frame_size=224 \
    --frame_crop_size=224 \
    --frames_per_clip=32

python dumper.py \
    --model_type=CLIP \
    --gpus=0,1,2,3,4,5,6,7 \
    --dst_prefix=/ssd/ssd_srv79/dza/dumps/LSMDC/CLIP/test \
    --lst=LSMDC.lst \
    --nworker_per_gpu=2 \
    --per_batch_size=128 \
    --fps=1 \
    --frame_size=228 \
    --frame_crop_size=228 \
    --frames_per_clip=1 \
    --num_readers=8

PYTHONPATH=\
models/tensorflow_models/research/audioset/vggish:\
models/tensorflow_models/research/audioset/:\
$PYTHONPATH \
python dumper.py \
    --model_type=tf_vggish \
    --gpus=0,1,2,3,4,5,6,7 \
    --dst_prefix=/ssd/ssd_srv79/dza/dumps/LSMDC/tf_vggish/test \
    --lst=LSMDC.lst \
    --nworker_per_gpu=2 \
    --per_batch_size=32

Create caption index
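Each --modality argument below appears to follow the pattern NAME:EMBEDDING_DIM:DUMP_PREFIX, pointing at the dumps created in the previous step (2048-dimensional irCSN152, 512-dimensional CLIP, and 128-dimensional VGGish embeddings).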

python create_capts.py \
	--dataset=msrvtt \
	--output_root=/tmp/capts/msrvtt/ \
	--modality=VIDEO:2048:/ssd/ssd_srv79/dza/dumps/msrvtt/VMZ_irCSN_152/test \
	--modality=CLIP:512:/ssd/ssd_srv79/dza/dumps/msrvtt/CLIP/test \
	--modality=tf_vggish:128:/ssd/ssd_srv79/dza/dumps/msrvtt/tf_vggish/test
mkdir -p /tmp/capts/msrvtt/symlinked-feats/
cp lists/msrvtt/test_list_full.txt /tmp/capts/msrvtt/symlinked-feats/test_list_full.txt

python create_capts.py \
	--dataset=ActivityNet \
	--output_root=/tmp/capts/ActivityNet/ \
	--modality=VIDEO:2048:/ssd/ssd_srv79/dza/dumps/ActivityNet/VMZ_irCSN_152/test \
	--modality=CLIP:512:/ssd/ssd_srv79/dza/dumps/ActivityNet/CLIP/test \
	--modality=tf_vggish:128:/ssd/ssd_srv79/dza/dumps/ActivityNet/tf_vggish/test
mkdir -p /tmp/capts/ActivityNet/symlinked-feats/
cp lists/ActivityNet/val.vids /tmp/capts/ActivityNet/symlinked-feats/val.vids

python create_capts.py \
	--dataset=lsmdc_publictest \
	--output_root=/tmp/capts/LSMDC/ \
	--modality=VIDEO:2048:/ssd/ssd_srv79/dza/dumps/LSMDC/VMZ_irCSN_152/test \
	--modality=CLIP:512:/ssd/ssd_srv79/dza/dumps/LSMDC/CLIP/test \
	--modality=tf_vggish:128:/ssd/ssd_srv79/dza/dumps/LSMDC/tf_vggish/test
mkdir -p /tmp/capts/LSMDC/symlinked-feats/
cp lists/LSMDC/test.vids /tmp/capts/LSMDC/symlinked-feats/test.vids

Test

python test.py --dataset_root=/tmp/capts/msrvtt/  --checkpoint=<PATH_TO_MODEL>/mdmmt_3mod.pth  --dataset_name=MSRVTT_full --gpu=2
python test.py --dataset_root=/tmp/capts/LSMDC/  --checkpoint=<PATH_TO_MODEL>/mdmmt_3mod.pth  --dataset_name=lsmdc_publictest --gpu=2
python test.py --dataset_root=/tmp/capts/ActivityNet/  --checkpoint=<PATH_TO_MODEL>/mdmmt_3mod.pth  --dataset_name=ActivityNet --gpu=2

WARNING

Do not use numpy with the MKL backend: np.fft.rfft sometimes produces a segmentation fault.
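
One quick way to check which backend a numpy build links against is to inspect its build configuration (a sanity check only; the output format varies across numpy versions):

python -c "import numpy; numpy.show_config()"
# If the output mentions mkl, one option is the conda-forge OpenBLAS variant, e.g.:
# conda install -c conda-forge numpy "libblas=*=*openblas"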


Issues

Code documentation

I think that if you want other people to read your paper, use your model, share it, and appreciate it, you have to comment the code a bit to explain what is done at each stage. The faster people can train the model, the happier and more satisfied they will be. The code is poorly commented.

Question about paper. Appendix B.

Hello,

First of all, thank you for your results and for the code you shared.
I have a question concerning Appendix B: Datasets combination. In that paragraph you explore how dataset expansion improves model performance, comparing against 'baseline' results obtained on the default datasets. My question is about the model configuration used in these studies:

  • Was the model pretrained on HowTo100M?
  • Which experts did you use?
  • Do you have 'baseline' results for an MDMMT configuration that was not pretrained but used all three experts?

Kind regards,
Vlad

Can you please explain the function prepare_features?

How do you prepare the VGGish, video, and CLIP features with this function?
Do you skip frames or cut some of them?
How do you combine everything with bertvid?

From what I saw, when I have e.g. over 30 frames (of CLIP embeddings), you just take the first 30. What happens to the other frames?

Using OpenCV instead of ffmpeg gives worse results

Hi,
Thanks a lot for this repo,
I'm trying to use OpenCV instead of ffmpeg in the visual_compute_embs function.
When processing the same video with the same texts, the ffmpeg scores are mostly higher than the OpenCV ones.
From the visual debugging I did (saving the numpy arrays as frames), I saw that OpenCV preserves the colors while ffmpeg does not, and yet the ffmpeg evaluation is better.

Any ideas how I can improve OpenCV's performance? And why is this the case?
