
image-paragraph-captioning's Introduction

Training for Diversity in Image Paragraph Captioning

This repository includes a PyTorch implementation of Training for Diversity in Image Paragraph Captioning. Our code is based on Ruotian Luo's implementation of Self-critical Sequence Training for Image Captioning, available here.

Requirements

  • Python 2.7 (because coco-caption does not support Python 3)
  • PyTorch 0.4 (with torchvision)
  • cider (already included as a submodule)
  • coco-caption (already included as a submodule)

If training from scratch, you also need:

  • spacy (to tokenize words)
  • h5py (to store features)
  • scikit-image (to process images)

To clone this repository with submodules, use:

  • git clone --recurse-submodules https://github.com/lukemelas/image-paragraph-captioning
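
If the repository was already cloned without --recurse-submodules, the submodules can still be fetched afterwards with:

  • git submodule update --init --recursive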

Train your own network

Download and preprocess captions

  • Download captions:
    • Run download.sh in data/captions
  • Preprocess captions for training (part 1):
    • Download spacy English tokenizer with python -m spacy download en
    • First, convert the text into tokens: cd scripts && python prepro_text.py
    • Next, preprocess the tokens into a vocabulary (mapping infrequent words to an UNK token) with the following command. Note that image/vocab information is stored in data/paratalk.json and caption data is stored in data/paratalk_label.h5. A simplified sketch of this thresholding idea appears at the end of this section.
python scripts/prepro_labels.py --input_json data/captions/para_karpathy_format.json --output_json data/paratalk.json --output_h5 data/paratalk
  • Preprocess captions into a coco-captions format for calculating CIDER/BLEU/etc:
    • Run scripts/prepro_captions.py
    • There should be 14,575/2487/2489 images and annotations in the train/val/test splits
    • Comment out line 44 ((Spice(), "SPICE")) in coco-caption/pycocoevalcap/eval.py to disable Spice testing
  • Preprocess ngrams for self-critical training:
python scripts/prepro_ngrams.py --input_json data/captions/para_karpathy_format.json --dict_json data/paratalk.json --output_pkl data/para_train --split train
  • Extract image features using an object detector
    • We make pre-processed features widely available:
      • Download and extract parabu_fc and parabu_att from here into data/bu_data
    • Or generate the features yourself:
      • Download the Visual Genome Dataset
      • Apply the bottom-up attention object detector by Peter Anderson, available here.
      • Use scripts/make_bu_data.py to convert the image features to .npz files for faster data loading
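
For intuition, below is a minimal, self-contained sketch of the vocabulary-thresholding step mentioned above (mapping infrequent words to UNK). It is not the repository's prepro_labels.py; the threshold value and token name are illustrative assumptions only.

from collections import Counter

def build_vocab(tokenized_captions, count_threshold=5):
    # Count how often each word appears across all training captions.
    counts = Counter(word for caption in tokenized_captions for word in caption)
    # Keep words that appear at least `count_threshold` times; the rest become UNK.
    vocab = sorted(w for w, c in counts.items() if c >= count_threshold)
    vocab.append('UNK')
    return vocab

def map_to_unk(caption, vocab):
    known = set(vocab)
    return [w if w in known else 'UNK' for w in caption]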

Train the network

As explained in Self-Critical Sequence Training, training occurs in two steps:

  1. The model is trained with a cross-entropy loss (~30 epochs)
  2. The model is trained with a self-critical loss (30+ epochs)
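
For intuition, the self-critical step optimizes a policy-gradient (SCST) objective in which the greedy-decoded caption's score acts as the baseline. The sketch below illustrates the idea only; it is not the exact loss code used in train.py.

import torch

def scst_loss(sample_logprobs, mask, sample_score, greedy_score):
    # sample_logprobs: (batch, seq_len) log-probabilities of the sampled words
    # mask:            (batch, seq_len) 1 for real words, 0 for padding
    # sample_score:    (batch,) metric (e.g. CIDEr) of each sampled caption
    # greedy_score:    (batch,) metric of each greedy caption (the baseline)
    advantage = (sample_score - greedy_score).unsqueeze(1)  # (batch, 1)
    # REINFORCE with a greedy baseline: raise the probability of words that led
    # to a better-than-greedy caption, lower it otherwise.
    return -(advantage * sample_logprobs * mask).sum() / mask.sum()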

Training hyperparameters may be accessed with python train.py --help.

A reasonable set of hyperparameters is provided in train_xe.sh (for cross-entropy) and train_sc.sh (for self-critical).

mkdir log_xe
./train_xe.sh 

You can then copy the model:

./scripts/copy_model.sh xe sc

And train with self-critical:

mkdir log_sc
./train_sc.sh

Pretrained Network

You can download a pretrained captioning model here.

Citation

In case you would like to cite our paper/code (no obligation at all):

@article{melaskyriazi2018paragraph, 
  title={Training for diversity in image paragraph captioning},
  author={Melas-Kyriazi, Luke and Rush, Alexander and Han, George},
  journal={EMNLP},
  year={2018}
}     

And Ruotian Luo's code, on which this repo is built:

@article{luo2018discriminability,
  title={Discriminability objective for training descriptive captions},
  author={Luo, Ruotian and Price, Brian and Cohen, Scott and Shakhnarovich, Gregory},
  journal={CVPR},
  year={2018}
}


image-paragraph-captioning's Issues

Could you share the input_box_dir?

Thanks for sharing the pretrained features parabu_fc and parabu_att! In order to analyze the results for the corresponding images, could you also share the input_box_dir, which contains the bounding-box locations? Thank you!

Calculating METEOR CIDEr BLEU

For some reason, the final CIDEr/METEOR/BLEU scores on the validation set are not printed at the end of each epoch. Here's what my output looks like when I train:

...
Cider scores: 0.05055747138711029
Cider scores: 0.23762433183792392
Cider scores: 0.2016461526656507
Cider scores: 0.09484974868502609
Cider scores: 0.06714491373311673
Read data: 1.02520489693
iter 42201 (epoch 28), avg_reward = -0.011, data_time = 0.014, time/batch = 1.025
Cider scores: 0.12018177868980709
Cider scores: 0.1537472398105469
Cider scores: 0.02683920305320616
Cider scores: 0.17457595857620622
Cider scores: 0.13120042844450647
...
Cider scores: 0.14903162980352952
Cider scores: 0.1480045371928352
Cider scores: 0.1597950489871271
Cider scores: 0.0681437261069909
Cider scores: 0.05616059625852486
Read data: 1.14169120789
iter 42401 (epoch 29), avg_reward = 0.016, data_time = 0.014, time/batch = 1.142
Cider scores: 0.13915968413271335
Cider scores: 0.2236347322797569
Cider scores: 0.11668309405737172
Cider scores: 0.3331930390082059
Cider scores: 0.0617720056876462
...

Just a bunch of Cider scores per batch, but no CIDEr/METEOR/BLEU between epochs. Is there any way to change this so that the final CIDEr/METEOR/BLEU scores on the validation set are printed once at the end of each epoch? Or is there a way to calculate the final CIDEr/METEOR/BLEU scores on the validation set given the final checkpoint generated by training?
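
For reference, a saved checkpoint is normally scored with eval.py (the same script used with --model and --infos_path in a later issue below). The --language_eval and --split flags and the paths in this sketch are assumptions carried over from Ruotian Luo's self-critical.pytorch, on which this repo is built, so verify them against python eval.py --help:

python eval.py --model log_sc/model-best.pth --infos_path log_sc/infos_-best.pkl --language_eval 1 --split val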

Model Tuning

Thanks for sharing all the work you've done!

I used the bottom-up-attention model from peteanderson80 to extract image features, because I need more features than parabu_att provides, such as box coordinates.

But I found that using the parabu_att extracted from bottom-up-attention only gives a CIDEr of 0.25.

I believe the hyperparameters provided in your code must have been found after extensive tuning, so I'm curious how you went about tuning the model.
Thanks!!!

Loading Pre-trained Models

I'm not entirely sure how train.py loads pre-trained models. PyTorch's documentation recommends torch.load() together with load_state_dict(). I see load_state_dict() used for the optimizer, but neither used for the main model or dp_model variables.

I also see infos = cPickle.load(f) & histories = cPickle.load(f) which seem to resemble torch.load(), but the infos and histories variables don’t seem to be used to influence the model or dp_model variables. How are the weights loaded into the model or dp_model variables?
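
For background (not an answer specific to this repository, which is exactly what this issue asks about), the standard PyTorch pattern for restoring weights is shown below; the module and checkpoint path are placeholders.

import torch
import torch.nn as nn

model = nn.Linear(10, 10)                          # stand-in for the captioning model
torch.save(model.state_dict(), 'checkpoint.pth')   # pretend this is a saved checkpoint
state_dict = torch.load('checkpoint.pth', map_location='cpu')
model.load_state_dict(state_dict)                  # copies the weights into the model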

Problem with Pre-trained model

Hey Luke, thank you very much for sharing your work.

I'm trying to run some inference experiments with the uploaded pretrained model, but the model just generates a UNK word.

I used these flags:
python eval.py --model data/model-best-i84000-score0.314696218186.pth --infos_path data/for_zip/infos_-best.pkl --image_folder data/image

unk

Also, eval_utils.py has the same problem as ruotianluo/self-critical.pytorch#42 (KeyError: 'att_masks').
I found a solution in that issue; it would be great to update the file with this modified code:

tmp = [data['fc_feats'][np.arange(loader.batch_size) * loader.seq_per_img],
       data['att_feats'][np.arange(loader.batch_size) * loader.seq_per_img],
       data['att_masks'][np.arange(loader.batch_size) * loader.seq_per_img] if data.get('att_masks') is not None else None]

Again, thank you for sharing your work!

image features boxes

Thanks for sharing the pretrained features parabu_fc and parabu_att!
In order to analyze the results for the corresponding images, could you share the pretrained features' bounding boxes with us? Thank you!

pre-trained model

Thanks for your research and for sharing it!
I am just wondering which dataset was used for the pre-trained model: MS-COCO or Visual Genome?
If it was MS-COCO, could I get a model pre-trained on the Visual Genome dataset?

Questions about train_sc and repetition penalty

Thank you for your excellent work! It helps me a lot!
But I am confused about the following questions:

  1. The last step in README.md, "And train with self-critical: mkdir log_sc ./train_xe.sh", should probably be "mkdir log_sc ./train_sc.sh", I guess? And if so, the second line in "train_sc.sh" causes an error: "unrecognized arguments: --caption_model topdown". I wonder how to fix this?
  2. "train_sc.sh" sets the option "--block_trigrams" to "1", but in "train.py" line 130 you use the newly defined "opt={'sample_max':0}" rather than the global "opt". My understanding is that "block_trigrams" will therefore still be 0 and the repetition penalty is not used during sampling. Is that right?

Cannot Replicate Results Described in Paper

I cloned the repo last month (before the most recent bug pertaining to the evaluation was fixed), but I made the (one-line?) fix locally. I then trained a model from scratch, and the following are the results I obtained:

epochs = 25 xe / 25 sc (as described in the paper)
Bleu_1: 0.419
Bleu_2: 0.262
Bleu_3: 0.165
Bleu_4: 0.101
METEOR: 0.166
ROUGE_L: 0.313
CIDEr: 0.257

epochs = 30 xe / 170 sc (default in the repo)
Bleu_1: 0.430
Bleu_2: 0.271
Bleu_3: 0.171
Bleu_4: 0.105
METEOR: 0.170
ROUGE_L: 0.312
CIDEr: 0.270

Here are the results the paper claims to achieve (using epochs = 25 xe / 25 sc; note these are reported on a 0-100 scale, so e.g. Bleu_1 43.54 corresponds to 0.4354 above):
Bleu_1: 43.54
Bleu_2: 27.44
Bleu_3: 17.33
Bleu_4: 10.58
METEOR: 17.86
CIDEr: 30.63

Any ideas for this discrepancy?

Calculating METEOR CIDEr BLEU scores

Once the model has been trained, how can the METEOR, CIDEr, and BLEU-1 to BLEU-4 results presented in the paper be obtained?

Also, does this implementation in its default state use repetition penalty? How can this setting be changed?

are you using repetition penalty in xe training?

Hello. Thanks for your work on image paragraph captioning and thanks for sharing the code.
In Table 1, you report that the xe results with the repetition penalty achieve better scores. So do you initialize the self-critical training from the xe model trained with the repetition penalty or without it? In your code, the xe training (the _forward function) does not use the repetition penalty; it is only used in self-critical training (when using the _sample function). That would mean you initialize the self-critical training with an xe model trained without the repetition penalty.

parabu_fc and parabu_att

Thank you for sharing your code, but the link to download parabu_fc and parabu_att is not available. Can you send me a working link? Thanks a lot.

Return value of get_batch()

Line 115 of train.py is:

data = loader.get_batch('train')

Where data is a dictionary with 8 keys: ['labels', 'bounds', 'masks', 'gts', 'att_masks', 'infos', 'att_feats', 'fc_feats']

My understanding is the following:

Labels: the caption, where each word is represented by a number from the ordered vocabulary (why does this always start with 0?)
Bounds: {'wrapped': False, 'it_pos_now': 10, 'it_max': 14574} (not entirely sure what these mean)
Masks: the i'th index is 1 whenever the i'th word exists
Gts: the labels array shifted over to the left by 1, getting rid of the starting 0 in labels
att_masks: not sure
Infos: information about the images in the batch

Is my limited understanding correct? I have no idea what att_feats and fc_feats are. What are they? Is the encoder output stored anywhere in this 'data' dictionary?
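
One quick way to check this empirically (a sketch reusing the loader from the snippet above) is to print the shape or type of every entry in the batch:

data = loader.get_batch('train')
for key, value in data.items():
    # Arrays and tensors report their shape; everything else reports its type.
    print(key, getattr(value, 'shape', type(value)))

In this codebase, fc_feats and att_feats appear to correspond to the pre-extracted bottom-up features (parabu_fc and parabu_att) described in the README, i.e. the image-encoder output fed to the captioning model.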

Empty Drive Link

The Drive folder link that is supposed to contain the object detector is empty. Please look into it.

Visual Genome features

Thanks for your code!
It seems that the VG features are not available from the gdrive links.
Could you please renew the links?
Thanks!

Eval the Best Pretrained Model

Hello, I want to evaluate the best model you provide. The following results are obtained: {'CIDEr': 0.0866294015172791, 'Bleu_4': 0.05030253169596654, 'Bleu_3': 0.08462456687188975, 'Bleu_2': 0.14497869681798176, 'Bleu_1': 0.26704123980031547, 'ROUGE_L': 0.2612504329090221, 'METEOR': 0.13221872262709633}. What do you think about these? Are these results right?
