
delving-deeper-into-the-decoder-for-video-captioning's Introduction

Delving Deeper into the Decoder for Video Captioning


Table of Contents

  1. Description
  2. Requirement
  3. Manual
  4. Results
    1. Comparison on Youtube2Text
    2. Comparison on MSR-VTT
  5. Data
  6. Citation

Description

This repository contains the source code for the paper Delving Deeper into the Decoder for Video Captioning, which has been accepted by ECAI 2020. The encoder-decoder framework is the most popular paradigm for the video captioning task, but non-negligible problems remain in the decoder of a video captioning model. We propose three methods to improve the performance of the model:

  1. A combination of variational dropout and layer normalization is embedded into the semantic compositional gated recurrent unit to alleviate overfitting.
  2. A unified, flexible method is proposed for evaluating model performance on a validation set, so that the best checkpoint can be selected for testing.
  3. A new training strategy called professional learning is proposed, which develops the strong points of a captioning model and bypasses its weaknesses.

Experiments on the MSVD and MSR-VTT datasets demonstrate that our model achieves the best results on the BLEU, CIDEr, METEOR and ROUGE-L metrics, with significant gains of up to 11.7% on MSVD and 5% on MSR-VTT over previous state-of-the-art models.


If you need more information about how to generate the training, validation and test data for these datasets, please refer to Semantics-AssistedVideoCaptioning.


Professional Learning

Requirement

  1. Python 3.6
  2. TensorFlow-GPU 1.13
  3. pycocoevalcap (Python3)
  4. NumPy
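One way to set up an environment with the packages above (the conda environment name and the exact TensorFlow 1.13 point release are assumptions; any TF 1.13.x GPU build should do):

```shell
# Python 3.6 environment (conda shown as one option).
conda create -n delving-decoder python=3.6 -y
conda activate delving-decoder

# TensorFlow 1.13 with GPU support, plus NumPy.
pip install tensorflow-gpu==1.13.1 numpy

# Python 3 port of the COCO caption evaluation metrics.
pip install pycocoevalcap
```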

Manual

  1. Make sure you have installed all the required packages.
  2. Download files in the Data section.
  3. cd path_to_directory_of_model; mkdir saves
  4. run_model.sh is used for training or testing models. Specify the GPU to use by modifying the CUDA_VISIBLE_DEVICES value. name will be used in the name of the model saved during training. Set the required data paths by modifying the corpus, ecores, tag and ref values. test is the path of a saved model to be tested; leave test unset if you want to train a model.
  5. After configuring the bash file, run bash run_model.sh to train or test.
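Step 4 can be sketched as a bash file like the following. The variable names come from the manual above and the example data paths from the repository's issue reports; the Python entry-point name is an assumption, so adjust it to your checkout:

```shell
#!/usr/bin/env bash
# Sketch of a run_model.sh configuration (MSVD example paths).
export CUDA_VISIBLE_DEVICES=0          # GPU to train/test on
name=msvd_baseline                     # appears in the saved model's name

# Data paths (taken from the repository's issue reports).
corpus=./msvd_feats/msvd_corpus_glove.pkl
ecores=./msvd_feats/msvd_eco_norm.npy
tag=./msvd_feats/msvd_semantic_tag_res_avg.npy
ref=./msvd_feats/msvd_ref3.pkl

# Leave `test` empty to train; set it to a saved model's path to test.
test=

# The Python entry-point name below is an assumption.
python run_model.py \
    --name   "$name" \
    --corpus "$corpus" \
    --ecores "$ecores" \
    --tag    "$tag" \
    --ref    "$ref" \
    ${test:+--test "$test"}
```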

Results

Comparison on Youtube2Text

MSVD Results

Comparison on MSR-VTT

MSR-VTT Results


Data

MSVD

  • MSVD dataset and features: GoogleDrive
    • SHA-256 ca86eb2b90e302a4b7f3197065cad3b9be5285905952b95dbffb61cb0bf79e9c
  • Model Checkpoint: GoogleDrive
    • SHA-256 64089a49fe9de895c9805a85d50160404cb36ccb8c22a70a32fc7ef5a2abfff1

MSRVTT

  • MSRVTT dataset and features: GoogleDrive
    • SHA-256 611b297c4fbbdd58540373986453a991f285aed6cc18914ad930e1e7646f26fb
  • Model Checkpoint: GoogleDrive
    • SHA-256 fb04fd2d29900f7f8a712b6d2352e8227acd30173274b64a38fcea6a608e4a8e
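Downloads can be checked against the digests listed above with a small helper like the one below (the archive filename you pass in is whatever Google Drive gives you; the example filename is a placeholder):

```shell
# Verify a downloaded file against its published SHA-256 digest.
# Usage: verify_sha256 <file> <expected-digest>
verify_sha256() {
    local file="$1" expected="$2"
    local actual
    actual=$(sha256sum "$file" | awk '{print $1}')
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file" >&2
        return 1
    fi
}

# e.g. for the MSVD dataset archive (placeholder filename):
# verify_sha256 msvd_data.tar.gz ca86eb2b90e302a4b7f3197065cad3b9be5285905952b95dbffb61cb0bf79e9c
```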

Citation

@article{chen2020delving,
	title={Delving Deeper into the Decoder for Video Captioning},
	author={Haoran Chen and Jianmin Li and Xiaolin Hu},
	journal={CoRR},
	archivePrefix={arXiv},
	primaryClass={cs.CV},
	url={https://arxiv.org/abs/2001.05614},
	eprint={2001.05614},
	year={2020}
}

delving-deeper-into-the-decoder-for-video-captioning's People

Contributors

wingsbrokenangel


delving-deeper-into-the-decoder-for-video-captioning's Issues

Code for generating the test data

Hi, could you provide the code that generates the following four files?

    --corpus ./msvd_feats/msvd_corpus_glove.pkl \
    --ecores ./msvd_feats/msvd_eco_norm.npy \
    --tag    ./msvd_feats/msvd_semantic_tag_res_avg.npy \
    --ref    ./msvd_feats/msvd_ref3.pkl \

Also, can the test data for MSR-VTT be generated in the same way?

About your glove embeddings

Thank you very much for sharing. Two questions:
1) How were the GloVe word vectors extracted? Is there a link?
2) I see that bos, eos, pad and unk all have corresponding word vectors. How was that done?

run file missing

Hi, I do not find the run_model.sh file in this repo. Where is it? Thanks in advance.

About generating msrvtt_eco_32_feats.npy

After extracting the video frames with the prepare_frames.py file you provided, I set up the environment according to https://github.com/WingsBrokenAngel/ECO-efficient-video-understanding, and downloaded the pretrained weights ECO_full_kinetics.caffemodel (verified against the MD5 checksum) and the matching deploy.prototxt file that you provide at https://github.com/WingsBrokenAngel/Semantics-AssistedVideoCaptioning. However, the msrvtt_eco_32_feats.npy generated by generate_eco_feature.py differs considerably from your msrvtt_resnext_eco.npy in https://github.com/WingsBrokenAngel/Semantics-AssistedVideoCaptioning; in particular, the 512-dimensional ECO_full 3D-convolution features are almost completely different. I also compared the msrvtt_res_32_feats.npy file generated by generate_res_feature.py against the corresponding part of msrvtt_resnext_eco.npy, and there the difference is very small. Did you apply any additional processing when generating msrvtt_eco_32_feats.npy with generate_eco_feature.py? How can I generate the same msrvtt_eco_32_feats.npy file as yours?

how to convert the test result into words

Hi, after testing with the checkpoint, is the output made up of word indexes?
{'testlen': 45403, 'reflen': 46693, 'guess': [45403, 38890, 32377, 25864], 'correct': [40964, 26695, 15019, 7511]}
How can I convert them into words?
Thank you :)

Running Error

What is wrong with it? I just downloaded the MSR-VTT data files and the checkpoint.


Inference on single video

Hi,

do you have a demo.py/ipynb that I can use to run inference on a single video and see the generated captions? If not, could you describe how I can set this up?

Thanks
