
video2commonsense's Introduction

Updates

04/15/2021:

  1. V2C-QA version update: the training/testing split annotation files are updated to the version reported in the paper (5,555 unique answer sets). Unfortunately, we do not currently plan to release the V2C-QA code.
  2. Updated with re-implemented baselines for the V2C-Completion task, their architecture files, and pre-trained checkpoints (see the files in ./other). Note that some of the new baseline numbers differ from those previously reported, due to an inconsistent dataset version and the re-implementations; the V2C-Transformer results remain largely the same.
  3. We have removed the PPL score as an official metric in the arXiv draft, since we find it questionable to use only the 5 ground-truth CMS per caption as the PPL corpus for probability computation. PPL can still be used for relative performance comparison.

01/05/2021:
Note: this implementation was developed on PyTorch 1.1.0; errors have been reported with newer PyTorch versions.

V2C: Understanding the Video by Enriching Captions with Commonsense Knowledge.

Introduction

This repository contains the source code for our EMNLP 2020 long paper: Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning. We present the first work on generating commonsense captions directly from videos to describe latent aspects such as intentions, effects, and attributes. We present a new dataset, "Video-to-Commonsense (V2C)", that contains ∼9k videos of human agents performing various actions, annotated with 3 types of commonsense descriptions. For demos and more information about this project, see the V2C Website.

Task Overview

Given off-the-shelf video representations V, the traditional video captioning task aims to produce textual descriptions T. In addition to generating just the factual aspects of the observable videos, V2C aims to produce the hidden commonsense knowledge C along with the captions. This is possible because of our curated V2C dataset, derived from the MSR-VTT video captioning dataset by further labeling the complementary commonsense aspects of each caption (i.e., the intentions, future effects, and attributes of the person). V2C shows that, given proper annotations, a data-driven model is able to infer plausible pre-conditions or expected results of certain human activities. This provides a novel and previously uninvestigated perspective on video understanding.

We identified 3 tasks in V2C, namely:

V2C-Completion: given video V and a ground-truth caption T, produce C.

V2C-Generation: given video V, produce a caption T along with C.

V2C-QA: given video V and a textual question Q, predict the answers, as in a VQA task.

This repo provides implementations only for the V2C-Completion and V2C-Generation tasks. Please contact Jacob ([email protected]) for captioning-related questions.

The V2C-QA task is formulated as a 5,555-way classification task. Please contact Tejas Gokhale ([email protected]) and Pratyay Banerjee ([email protected]) for V2C-QA related questions.
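Since the V2C-QA code is not released, the sketch below is purely illustrative of what a 5,555-way answer classifier head could look like; the fusion strategy, feature dimensions, and module names are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AnswerClassifierSketch(nn.Module):
    """Illustrative only: a classifier head over a fixed 5,555-answer vocabulary.

    The official V2C-QA model is not released; the fusion strategy and the
    feature dimensions used here are assumptions.
    """

    def __init__(self, video_dim=2048, question_dim=768, hidden_dim=1024, num_answers=5555):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(video_dim + question_dim, hidden_dim),
            nn.ReLU(),
            nn.Dropout(0.1),
        )
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, video_feat, question_feat):
        # Concatenate pooled video and question features, then score every candidate answer.
        fused = self.fuse(torch.cat([video_feat, question_feat], dim=-1))
        return self.classifier(fused)  # logits over the 5,555-answer set


if __name__ == "__main__":
    logits = AnswerClassifierSketch()(torch.randn(2, 2048), torch.randn(2, 768))
    print(logits.shape)  # torch.Size([2, 5555])
```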

V2C Dataset

We released the V2C dataset in V2C_annotations.zip, which consists of:

V2C_annotations.zip
├── msrvtt_new_info.json                      # MSR-VTT captions and token dictionary.
├── v2c_info.json                             # V2C Raw, captions/CMS, and token dictionary.
├── V2C_MSR-VTT_caption.json                  # V2C Raw, captions/CMS after tokenization.
├── train_cvpr_humanRank_V2C_caption.json     # a human re-verified clean split for V2C annotations.
├── v2cqa_train.json                          # train split for V2C-QA, consisting of captions, CMS, and CMS-related questions/answers.
└── v2cqa_test.json                           # test split for V2C-QA.

Note: We use the V2C raw split for the V2C-Completion task and automatic evaluations. We also provide a small-scale but human re-verified set of V2C annotations (see above) with less noise, for future video retrieval, human evaluation, or other related tasks.
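The internal schema of these JSON files is not documented here, so the following minimal sketch only loads the released annotation files and prints their top-level structure (the data/ paths follow the working-directory layout shown later):

```python
import json

# Minimal sketch: load the released annotation files and inspect their
# top-level structure (the detailed schema is not documented in this README).
for name in ["msrvtt_new_info.json",
             "v2c_info.json",
             "V2C_MSR-VTT_caption.json",
             "train_cvpr_humanRank_V2C_caption.json"]:
    with open("data/" + name) as f:
        obj = json.load(f)
    if isinstance(obj, dict):
        print(name, "-> dict with keys:", list(obj.keys())[:5])
    else:
        print(name, "-> list with", len(obj), "entries")
```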

Video Features

We use simple frame-wise ResNet-152 video representations from the MSR-VTT dataset.

NOTE: Using more advanced video features, e.g., I3D or S3D, can yield a clear performance boost, but that is not the objective of this work. Similarly, we suggest applying a self-critical sequence training strategy after first-round training for better CIDEr performance.
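As a rough guide only, here is a minimal sketch of how frame-wise ResNet-152 features (2048-d, matching --dim_vis_feat 2048) could be extracted for a new video with torchvision and OpenCV; the sampling rate, preprocessing, file names, and output layout are assumptions and may differ from the released MSR-VTT features.

```python
import cv2
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T

# Sketch only: extract 2048-d frame-wise ResNet-152 features for one video.
# Frame sampling, resize policy, and output layout are assumptions.
model = models.resnet152(pretrained=True)
model.fc = torch.nn.Identity()          # keep the 2048-d pooled features
model.eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize(256), T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

cap = cv2.VideoCapture("example_video.mp4")   # hypothetical input video
frames, step, idx = [], 16, 0                 # sample every 16th frame (assumption)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if idx % step == 0:
        frames.append(preprocess(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    idx += 1
cap.release()

with torch.no_grad():
    feats = model(torch.stack(frames))        # (num_sampled_frames, 2048)
np.save("data/feats/resnet152/example_video.npy", feats.numpy())
```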

Architecture

Training and Testing

Before training on the V2C-Completion task, organize the working directory as follows, using the features/annotations above:

Video2CommonsenseBaseline
├── data                      
│   ├── feats
│   │   ├── resnet152
│   ├── msrvtt_new_info.json                     
│   ├── v2c_info.json                             
│   ├── V2C_MSR-VTT_caption.json                  
│   ├── train_cvpr_humanRank_V2C_caption.json     
│   └── v2cqa_v1_train.json                       
├── model
├── save                                          # for log/checkpoint output
├── pycocoevalcap                                 # COCO official evaluation scripts
├── others                                        # Files related to baseline models.
├── utils
├── train.py  
├── opts.py     
└── test.py                      

E.g., to start training on the intention prediction task (set --cms 'int'), with 1 RNN video encoder layer and 6 transformer decoder layers (8 attention heads, head dim 64, inner dim 1024), for 600 epochs in CUDA mode, while printing intermediate generation examples:

python train.py --cms 'int' --batch_size 128 --epochs 600 --num_layer 6 --dim_head 64 --dim_inner 1024 --num_head 8 --dim_vis_feat 2048 --dropout 0.1 --rnn_layer 1 --checkpoint_path ./save --info_json data/v2c_info.json --caption_json data/V2C_MSR-VTT_caption.json --print_loss_every 20 --cuda --show_predict 

For evaluations:

python test.py  --cms 'int' --batch_size 64 --num_layer 6 --dim_head 64 --dim_inner 1024 --num_head 8 --dim_vis_feat 2048 --dropout 0.1 --rnn_layer 1 --checkpoint_path ./save --info_json data/v2c_info.json --caption_json data/V2C_MSR-VTT_caption.json --load_checkpoint save/XX/XX.pth --cuda 

For time efficiency, we use just 1 caption per video during testing, but found close results when using all 10×3k video-caption pairs. Optimal hyperparameters would require a grid search; for the numbers in the paper and the pre-trained models, see the Model Zoo below.

Model Zoo

Download MODEL_ZOO.zip for the trained captioning models for intention, effect, and attribute generation. Note: a new "effect" generation checkpoint is attached in the table link.

To reproduce the numbers in the paper:

python test.py  --cms 'int' --batch_size 64 --num_layer 6 --dim_head 64 --dim_inner 1024 --num_head 8 --dim_vis_feat 2048 --dropout 0.1 --rnn_layer 1 --checkpoint_path ./save --info_json data/v2c_info.json --caption_json data/V2C_MSR-VTT_caption.json --load_checkpoint save/model_cap-int.pth --cuda 

Quantitative Results for V2C-completion

Attribute

| Model | CIDEr | B-1 | B-2 | B-3 | B-4 | METEOR | ROUGE-L |
|---|---|---|---|---|---|---|---|
| Attention + Video2Text | - | 36.5 | - | - | - | - | - |
| Transformer Encoder + Decoder | - | 40.7 | - | - | - | - | - |
| Video CMS Transformer | - | 47.3 | - | - | - | - | - |

Effect

| Model | CIDEr | B-1 | B-2 | B-3 | B-4 | METEOR | ROUGE-L |
|---|---|---|---|---|---|---|---|
| Attention + Video2Text | 18.5 | 27.7 | 16.9 | 13.3 | 11.5 | 16.0 | 23.9 |
| Transformer Encoder + Decoder | 37.7 | 35.3 | 26.6 | 23.2 | 21.0 | 21.4 | 31.1 |
| Video CMS Transformer | 40.8 | 36.5 | 28.1 | 24.6 | 22.4 | 22.2 | 32.3 |

Intention

| Model | CIDEr | B-1 | B-2 | B-3 | B-4 | METEOR | ROUGE-L |
|---|---|---|---|---|---|---|---|
| Attention + Video2Text | 23.2 | 54.3 | 40.0 | 27.4 | 24.7 | 19.4 | 45.6 |
| Transformer Encoder + Decoder | 57.4 | 58.3 | 45.7 | 36.3 | 31.1 | 27.4 | 53.2 |
| Video CMS Transformer | 62.0 | 60.8 | 48.4 | 39.1 | 34.1 | 28.5 | 54.6 |

Citations

Please consider citing this paper if you find it helpful:

@inproceedings{fang-etal-2020-video2commonsense,
    title = "{V}ideo2{C}ommonsense: Generating Commonsense Descriptions to Enrich Video Captioning",
    author = "Fang, Zhiyuan  and
      Gokhale, Tejas  and
      Banerjee, Pratyay  and
      Baral, Chitta  and
      Yang, Yezhou",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.emnlp-main.61",
    doi = "10.18653/v1/2020.emnlp-main.61",
    pages = "840--860",
}

License

V2C is released under the MIT license.


video2commonsense's Issues

PPL indicator

cmses = cms_list[random_id].split(';')[1:]
res[eval_id] = [cms]
gts[eval_id] = cmses
eval_id += 1
ppl_corpus = ''
for c in cmses:
    total_cms.add(c.lower())
    ppl_corpus += ' ' + c.lower()
tokens = nltk.word_tokenize(ppl_corpus)
unigram_model = unigram(tokens)
ppl_scores.append(perplexity(c.lower(), unigram_model))
# Compute PPL score
print('Perplexity score: ', sum(ppl_scores) / len(ppl_scores))

Hi @jacobswan1,
this PPL computation evaluates the perplexity of a ground-truth CMS rather than of the predicted sentence.
How does this code produce the value reported in the paper?
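(For context: the unigram and perplexity helpers referenced in the snippet are not shown in the issue. Below is a minimal sketch of what a unigram-model perplexity computation typically looks like; this is an assumption, not the repo's actual code.)

```python
import math
from collections import Counter

import nltk


def unigram(tokens):
    # Sketch (assumed): maximum-likelihood unigram probabilities from a token list.
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def perplexity(sentence, model, eps=1e-12):
    # Sketch (assumed): per-word perplexity of a sentence under the unigram model.
    words = nltk.word_tokenize(sentence)
    log_prob = sum(math.log(model.get(w, eps)) for w in words)
    return math.exp(-log_prob / max(len(words), 1))
```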

Compressing Visual-linguistic Model via Knowledge Distillation

Sorry to bother you, but when will the code for your ICCV 2021 paper 'Compressing Visual-linguistic Model via Knowledge Distillation' be released? I noticed that the link given in the paper is not working, so I took the liberty of asking here.

The question of computing the token prediction Acc.

In train.py, the token prediction accuracy and losses are computed as follows:

non_pad_mask = cap_labels[:, 1:].ne(Constants.PAD)       # mask of non-padding caption tokens
n_word = non_pad_mask.sum().item()                       # number of caption tokens
cms_non_pad_mask = cms_labels[:, 1:].ne(Constants.PAD)   # mask of non-padding CMS tokens
cms_n_word = cms_non_pad_mask.sum().item()               # number of CMS tokens
cap_loss /= n_word                                       # caption loss normalized by caption token count
cms_loss /= n_word                                       # CMS loss also normalized by the caption token count

I'm a bit curious about these calculations. When computing cap_loss and cms_loss, why are they both divided by n_word? And why isn't cms_loss divided by cms_n_word? I'd appreciate your clarification. Thank you!
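For reference, here is a minimal sketch of the per-token normalization the question points at, i.e., dividing each loss by its own non-pad token count (illustrative only; whether this is the intended behavior is exactly what the issue asks):

```python
import torch

def normalized_token_losses(cap_loss, cms_loss, cap_labels, cms_labels, pad_idx):
    # Sketch (assumed): normalize each summed loss by its own non-pad token count.
    n_word = cap_labels[:, 1:].ne(pad_idx).sum().clamp(min=1)
    cms_n_word = cms_labels[:, 1:].ne(pad_idx).sum().clamp(min=1)
    return cap_loss / n_word, cms_loss / cms_n_word
```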

Error when run test.py

Hi, @jacobswan1

cms_list = cms_list.cuda()
throws an error "AttributeError: 'list' object has no attribute 'cuda'". If I remove this line, the error becomes "IndexError: tensors used as indices must be long, byte or bool tensors".

Could you look into this?
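A possible workaround (an assumption, not a confirmed fix) is to convert the Python list of ids into a LongTensor before moving it to the GPU, which would also satisfy the indexing requirement in the follow-up error:

```python
import torch

# Assumed workaround (not a confirmed fix): convert the list of integer ids to a
# LongTensor before calling .cuda(), so it can also be used for tensor indexing.
cms_list = [0, 1, 2]                                     # placeholder for the real list
cms_tensor = torch.as_tensor(cms_list, dtype=torch.long)
if torch.cuda.is_available():
    cms_tensor = cms_tensor.cuda()
```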

test.py error

Hi, when I run test.py, it doesn't show the predicted CMS:

video1
Predicted CMS:
GT CMS Caption: a woman in a colorful scarf is showing how to make a stew

The prediction line is empty. Why is that?

Testing on different dataset

Hello,
thanks for the work. I'm not good at programming, so please let me know if this question is even necessary.
I have tried image captioning before, where I can normally get captions for any random image at test time. How do I do the same for video captioning using your trained model? An example would help, because your dataset contains only the features and captions, not the videos.
