
neuraltalk's Introduction

NeuralTalk

Warning: Deprecated. Hi there, this code is now quite old and inefficient, and is deprecated. I am leaving it on Github for educational purposes, but if you would like to run or train image captioning I warmly recommend my new code release NeuralTalk2. NeuralTalk2 is written in Torch and is SIGNIFICANTLY (I mean, ~100x+) faster because it is batched and runs on the GPU. It also supports CNN finetuning, which helps a lot with performance.

This project contains Python+numpy source code for learning Multimodal Recurrent Neural Networks that describe images with sentences.

This line of work was recently featured in a New York Times article and has been the subject of multiple academic papers from the research community over the last few months. This code currently implements the models proposed by Vinyals et al. from Google (CNN + LSTM) and by Karpathy and Fei-Fei from Stanford (CNN + RNN). Both models take an image and predict its sentence description with a Recurrent Neural Network (either an LSTM or an RNN).

Overview

The pipeline for the project looks as follows:

  • The input is a dataset of images and 5 sentence descriptions that were collected with Amazon Mechanical Turk. In particular, this code base is set up for Flickr8K, Flickr30K, and MSCOCO datasets.
  • In the training stage, the images are fed as input to the RNN and the RNN is asked to predict the words of the sentence, conditioned on the current word and previous context as mediated by the hidden layers of the neural network. In this stage, the parameters of the network are trained with backpropagation.
  • In the prediction stage, a withheld set of images is passed to the RNN and the RNN generates the sentence one word at a time. The results are evaluated with the BLEU score. The code also includes utilities for visualizing the results in HTML.

Dependencies

Python 2.7, a modern version of numpy/scipy, perl (if you want to do BLEU score evaluation), and the argparse module. Most of these are okay to install with pip. To install all dependencies at once, run the command pip install -r requirements.txt

I only tested this code with Ubuntu 12.04, but I tried to make it as generic as possible (e.g. use of the os module for file system interactions, etc.), so it might work on Windows and Mac relatively easily.

Protip: you really want to link your numpy to a BLAS implementation for its matrix operations. I use virtualenv and link numpy against a system installation of OpenBLAS. Doing this will make this code almost an order of magnitude faster, because it relies very heavily on large matrix multiplies.
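A quick way to sanity-check the numpy/BLAS link (a generic check, not specific to this repo) is to print numpy's build configuration and time a large matrix multiply:

import time
import numpy as np

np.show_config()  # look for openblas / mkl / atlas in the output

A = np.random.randn(2000, 2000)
B = np.random.randn(2000, 2000)
t0 = time.time()
A.dot(B)
print('2000x2000 matmul took %.2fs' % (time.time() - t0))
# with a good BLAS this is a small fraction of a second; with the
# reference implementation it is easily 10x slower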

Getting started

  1. Get the code. $ git clone the repo and install the Python dependencies
  2. Get the data. I don't distribute the data in the Git repo, instead download the data/ folder from here. Also, this download does not include the raw image files, so if you want to visualize the annotations on raw images, you have to obtain the images from Flickr8K / Flickr30K / COCO directly and dump them into the appropriate data folder.
  3. Train the model. Run the training $ python driver.py (see many additional argument settings inside the file) and wait. You'll see that the learning code writes checkpoints into cv/ and periodically reports its status in the status/ folder.
  4. Monitor the training. The status can be inspected manually by reading the JSON and printing whatever you wish in a second process (see the sketch after this list). In practice I run cross-validations on a cluster, so my cv/ folder fills up with a lot of checkpoints that I further filter and inspect with other scripts. I am also including my cluster training status visualization utility, if you'd like to use it. Run a local webserver (e.g. $ python -m SimpleHTTPServer 8123) and then open monitorcv.html in your browser at http://localhost:8123/monitorcv.html, or whatever path the web server tells you. You will have to edit the file to set up the paths properly and point it at the right json files.
  5. Evaluate model checkpoints. To evaluate a checkpoint from cv/, run the eval_sentence_predictions.py script and pass it the path to a checkpoint.
  6. Visualize the predictions. Use the included html file visualize_result_struct.html to visualize the JSON struct produced by the evaluation code. This will visualize the images and their predictions. Note that you'll have to download the raw images from the individual dataset pages and place them into the corresponding data/ folder.
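As a minimal sketch of step 4, you can poll the status JSON from a second process. The exact field names depend on what driver.py writes, so inspect a real file in status/ first and treat this as a template:

import glob
import json

for path in sorted(glob.glob('status/*.json')):
    with open(path) as f:
        status = json.load(f)
    # peek at the structure; the keys written by driver.py may differ,
    # so print them before deciding what to plot or filter on
    summary = sorted(status.keys()) if isinstance(status, dict) else len(status)
    print(path, summary)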

Lastly, note that this is currently research code, so a lot of the documentation is inside individual Python files. If you wish to work with this code, you'll have to get familiar with it and be comfortable reading Python code.

Pretrained model

Some pretrained models can be found in the NeuralTalk Model Zoo. The slightly hairy part is that if you wish to apply these models to some arbitrary new image (one not from Flickr8k/30k/COCO) you have to first extract the CNN features. I use the 16-layer VGG network from Simonyan and Zisserman, because the model is beautiful, powerful and available with Caffe. There is an opportunity to fold the preprocessing and inference into a single nice function that uses the Python wrapper to get the features and then runs the pretrained sentence model. I might add this in the future.

Using the model to predict on new images

The code allows you to easily predict and visualize results of running the model on COCO/Flickr8K/Flickr30K images. If you want to run the code on an arbitrary image (e.g. one on your file system), things get a little more complicated because we first need to pipe your image through the VGG CNN to get the 4096-D activations on top.

Have a look inside the folder example_images for instructions on how to do this. Currently, the code for extracting the raw features from each image is in Matlab, so you will need it installed on your system. Caffe also has a Python wrapper, but I wasn't yet able to use it to exactly reproduce the features I get from Matlab. The example_images folder will walk you through the process, and you will eventually use predict_on_images.py to run the prediction.
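Once the features have been extracted, a minimal check of the resulting vgg_feats.mat (assuming the variable inside is named feats, as in the provided data download; print the keys of the loaded dict to confirm on your own file) looks like:

import scipy.io

mat = scipy.io.loadmat('example_images/vgg_feats.mat')
feats = mat['feats']   # assumed variable name; inspect mat.keys() if it differs
print(feats.shape)     # one 4096-D feature vector per image; check which axis indexes the images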

Using your own data

The input to the system is the data folder, which contains the Flickr8K, Flickr30K and MSCOCO datasets. In particular, each folder (e.g. data/flickr8k) contains a dataset.json file that stores the image paths and sentences in the dataset (all images, sentences, raw preprocessed tokens, splits, and the mappings between images and sentences). Each folder additionally contains vgg_feats.mat, a .mat file that stores the CNN features from all images, one per row, using the VGG Net from ILSVRC 2014. Finally, there is the imgs/ folder that holds the raw images. I also provide the Matlab script that I used to extract the features, which you may find helpful if you wish to use a different dataset. This is inside the matlab_features_reference/ folder; see the Readme file in that folder for more information.
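If you want to plug in your own dataset, here is a rough sketch of the dataset.json shape. The field names are inferred from the provided Flickr8K download; double-check them against the real dataset.json before relying on this:

import json

# hypothetical one-image dataset in the same general shape as data/flickr8k/dataset.json
dataset = {
  'dataset': 'mydata',
  'images': [
    {
      'filename': 'dog_park.jpg',
      'imgid': 0,
      'split': 'train',   # 'train', 'val' or 'test'
      'sentids': [0],
      'sentences': [
        {'imgid': 0, 'sentid': 0,
         'raw': 'A dog runs through a park.',
         'tokens': ['a', 'dog', 'runs', 'through', 'a', 'park']}
      ],
    },
  ],
}

with open('data/mydata/dataset.json', 'w') as f:
    json.dump(dataset, f)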

License

BSD license.

neuraltalk's People

Contributors

alyxb, ericzeiberg, huyouare, karpathy, simov8, vanessad


neuraltalk's Issues

How to generate json for new data?

Hi Andrej,
I was interested in using your algorithm for some new data. Basically, each image is associated with one sentence. Is there a convenient way to generate the json file as in your examples (Flickr8k, etc.)? What is the structure of the json, and is there any way to avoid using the json format?

Thanks!
Wei

list index out of range error

I created a coco_sample directory containing the following files.

  • COCO_val2014_000000463825.jpg
  • model_checkpoint_coco_visionlab43.stanford.edu_lstm_11.14.p (from here)
  • tasks.txt (containing one line COCO_val2014_000000463825.jpg)
  • vgg_feats.mat (from here)

I ran the following command.

python predict_on_images.py coco_sample/model_checkpoint_coco_visionlab43.stanford.edu_lstm_11.14.p -r coco_sample

I got an error message as below.

parsed parameters:
{
"beam_size": 1,
"checkpoint_path": "coco_sample/model_checkpoint_coco_visionlab43.stanford.edu_lstm_11.14.p",
"root_path": "coco_sample"
}
loading checkpoint coco_sample/model_checkpoint_coco_visionlab43.stanford.edu_lstm_11.14.p
image 0/123287:
/home/ec2-user/neuraltalk/imagernn/lstm_generator.py:227: RuntimeWarning: overflow encountered in exp
IFOGf[t,:3*d] = 1.0/(1.0+np.exp(-IFOG[t,:3*d]))
PRED: (-14.587771) a man and a woman sitting on a bench in the middle of a park
image 1/123287:
Traceback (most recent call last):
File "predict_on_images.py", line 109, in
main(params)
File "predict_on_images.py", line 66, in main
img['local_file_path'] = img_names[n]
IndexError: list index out of range

Isn't it possible to run predict_on_images.py on a few images?

Encountered runtime warning while computing logistic function

@karpathy Thanks for open sourcing your image-to-sentences work. I got the code up & running with the Flickr30K dataset but encountered a runtime warning
" RuntimeWarning: overflow encountered in exp"

I have fixed it locally by using the scipy.special.expit function. I have attached the patch below in case you want to "cherry-pick" my commit. Let me know if this patch is useful to you and whether you'd like me to make a PR with the fix:

From d3b8d3401a7ebeae1aff88538f1f5eff440b31cf Mon Sep 17 00:00:00 2001
From: Vimal Thilak
Date: Wed, 3 Dec 2014 15:16:28 -0800
Subject: [PATCH] [bugfix] Fix overflow runtime warning

- Warning encountered in logistic function computation

Signed-off-by: Vimal Thilak

 imagernn/lstm_generator.py | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/imagernn/lstm_generator.py b/imagernn/lstm_generator.py
index 011e333..af6797f 100644
--- a/imagernn/lstm_generator.py
+++ b/imagernn/lstm_generator.py
@@ -1,5 +1,6 @@
 import numpy as np
 import code
+import scipy.special

 from imagernn.utils import initw

@@ -75,7 +76,7 @@ class LSTMGenerator:
       IFOG[t] = Hin[t].dot(WLSTM)
       # non-linearities
-      IFOGf[t,:3*d] = 1.0/(1.0+np.exp(-IFOG[t,:3*d])) # sigmoids; these are the gates
+      IFOGf[t,:3*d] = scipy.special.expit(IFOG[t, :3*d])  # 1.0/(1.0+np.exp(-IFOG[t,:3*d])) # sigmoids; these are the gates
       IFOGf[t,3*d:] = np.tanh(IFOG[t, 3*d:]) # tanh
       # compute the cell activation
@@ -224,7 +225,7 @@ class LSTMGenerator:
     C = np.zeros((1, d))
     Hout = np.zeros((1, d))
     IFOG[t] = Hin[t].dot(WLSTM)
-    IFOGf[t,:3*d] = 1.0/(1.0+np.exp(-IFOG[t,:3*d]))
+    IFOGf[t,:3*d] = scipy.special.expit(-IFOG[t,:3*d])  # 1.0/(1.0+np.exp(-IFOG[t,:3*d]))
     IFOGf[t,3*d:] = np.tanh(IFOG[t, 3*d:])
     C[t] = IFOGf[t,:d] * IFOGf[t, 3*d:] + IFOGf[t,d:2*d] * c_prev
     if tanhC_version:
--
2.0.1
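For reference (this check is not part of the patch above), scipy.special.expit computes the same sigmoid as the original expression while avoiding the overflow warning:

import numpy as np
from scipy.special import expit

x = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])
naive = 1.0 / (1.0 + np.exp(-x))   # emits RuntimeWarning: overflow in exp for very negative inputs
stable = expit(x)                  # numerically stable sigmoid
print(np.allclose(naive, stable))  # True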

image captioning

Hi, I'd like to work on image captioning. I used a novel approach for image segmentation, and now I'd like to use these segmented images as a preprocessing step for image captioning. Can you give me an idea for my next step? And, if possible, may I have the Matlab code for captioning?

Incorrect prediction while testing.

When I evaluate and predict on the example_images dataset you provide, after training on Flickr8k images, I get all wrong outputs. For each of the images the prediction is incorrect. Why is this happening?

MRFs for text segment alignments

Hi Andrej,
Thank you very much for open sourcing the code!
Your paper talks about MRFs for decoding text segment alignments to images, but I couldn't find any code related to that. Am I missing something?

Thanks
Pradeep.

Running On Raw Images

How exactly would I go about getting a trained model's prediction on an image (in some raw format) that I have?

Transfer Learning with word2vec?

Hi Andrej & Fei-fei,
I've been playing around with this and reading through the code -- many thanks for making it such wonderful code to read! I was under the impression that it used pretrained word vector embeddings from Mikolov et al:
[image: slide referencing Mikolov et al. word vectors]

...but I don't see any evidence in the code where these vectors are loaded in. Are the word embeddings learned from scratch, or are they in fact initialized in some way?

Many thanks!
chris moody

Aborting, cost seems to be exploding.

training with flickr8k aborts:

253/15000 batch done in 5.037s. at epoch 0.84. loss cost = 37.447347, reg cost = 0.000001, ppl2 = 26.10 (smooth 48.09)
254/15000 batch done in 5.082s. at epoch 0.85. loss cost = 39.408169, reg cost = 0.000001, ppl2 = 29.19 (smooth 47.91)
255/15000 batch done in 4.914s. at epoch 0.85. loss cost = 140.730310, reg cost = 0.000001, ppl2 = 237360.65 (smooth 2421.03)
Aboring, cost seems to be exploding. Run gradcheck? Lower the learning rate?

predict_on_images.py: error: too few arguments

Hello..
Thanks for the code and the very helpful read me files..
I tried to call predict_on_images.py on the example folder you provided but got this error:
C:\neuraltalk-master>python predict_on_images.py
usage: predict_on_images.py [-h] [-r ROOT_PATH]
predict_on_images.py: error: too few arguments

I would appreciate any help ...

Regards

py_caffe_feat_extract

I think the bicubic implementation has some problems.

The output image contains some obvious artifacts if you visualize it.
It's definitely not the same as Matlab's imresize nor OpenCV's resize (INTER_CUBIC).

I guess the vgg_feats.mat inside example_images was produced by this function.
The results made by py_caffe_feat_extract were also slightly different from the ones made by OpenCV's resize (cubic).
Hope someone can fix the bug in the bicubic implementation some day.

Thanks a lot.

Have you implemented Visual-Semantic Alignments ?

Thanks for your kindness in releasing this code!
It helps me a lot!
I am interested in your CVPR paper: Deep Visual-Semantic Alignments for Generating Image Descriptions. But I did not find anything about Visual-Semantic Alignments in this released code; have I missed something? Thanks!

CAFFE API error

When I tried to run the Python script python_features/extract_features.py today, I met with a problem as follows:

Traceback (most recent call last):
  File "./extract_features.py", line 102, in <module>
    net = caffe.Net(args.model_def, args.model)
Boost.Python.ArgumentError: Python argument types in
    Net.__init__(Net, str, str)
did not match C++ signature:
    __init__(boost::python::api::object, std::string, std::string, int)
    __init__(boost::python::api::object, std::string, int)

Then I searched for this error on the Internet, and found the same issue on Caffe's issue page: Caffe#1905. I think it's an error caused by an update of Caffe's API.
So I changed the code at extract_features.py#101 to: net = caffe.Net(args.model_def, args.model, caffe.TEST). It worked, but a new problem came up:

Traceback (most recent call last):
  File "./extract_features.py", line 102, in <module>
    caffe.set_phase_test()
AttributeError: 'module' object has no attribute 'set_phase_test'

I think the reason is that some of the API calls in python_features/extract_features.py are too old.

Question about usage of RCNN

Hello, I recently read your paper, and very much appreciate about you sharing your codes here.

By the way, your paper indicates that you first extract the top regions obtained by RCNN and then get the CNN features; however, I do not see that object detection part in your implementation. In both the training and test phases, it does not seem to use object detection functionality. Is it because it still works fine using the holistic image?

Thank you.

Problems in gradient check

Hi,

When I try to run the gradient check, for Ws, the gradient check prints "VAL SMALL WARNING". I have printed the numerical gradients and analytical gradients in this case, and found that the numerical gradients are exactly zero while the analytical gradients are on the order of 1e-12.

I am confused by that: since the numerical gradients are zero, that means some words are not in the batch, so changing their values will not affect the cost (in grad_check, we add delta to the word vectors). However, the analytical gradients are not zero, which means these words actually appear in the batch and their word vectors are updated.

Why does this happen?

Thanks.

Bounding box

Hello Andrej,

Great work!

Is it possible to get the bounding box associated with words? Or is that part of the alignment/retrieval model?

Thanks!

training over new dataset

I am training it on a new dataset. I am getting this error at the save checkpoint step:
36/1850 batch done in 2.356s. at epoch 0.97. loss cost = 9.295156, reg cost = 0.000000, ppl2 = 4.59 (smooth 14.32)
evaluating val performance in batches of 100
Traceback (most recent call last):
File "driver.py", line 315, in
main(params)
File "driver.py", line 232, in main
val_ppl2 = eval_split('val', dp, model, params, misc) # perform the evaluation on VAL set
File "/root/neuraltalk/imagernn/imagernn_utils.py", line 38, in eval_split
ppl2 = 2 ** (logppl / logppln)
ZeroDivisionError: integer division or modulo by zero

Confusion about an equation in the paper

Thanks for making the code public. This is a great work!
This issue is not about the code, but I feel a little confused about the 11th equation in the paper, since the relevant code is not available.
[image: equation 11 from the paper]
What does i indicate in the above equation? And does t refer to an index of an image fragment or a sentence fragment? Also, maybe maximizing this term makes more sense? I would really appreciate it if you could point out my misunderstandings here. Thanks!

multiple hosts

Hi Andrej,

I really love this implementation.
The most intriguing part to me is your monitorcv to visualize the cross-validation. It could help a lot during training.

In the code, I found it could show up to 40 results with different host names, but my computer has only one hostname (using Python's gethostname).
I bet it's my lack of related knowledge.
I guess we could run several "hosts" (with different parameters or models) on the same computer, right?

Could you please give some instructions on how to do so?

Thank you so much.
Best,
-Ethan

Use for sentence input to sentence output

Would it be possible to use this code to accept a sentence as input and output the most likely sentence, in order to sustain a dialogue, instead of an image input and sentence output? I believe there is a paper on this. Sorry, this is not an issue; I didn't know where else to comment.

Maybe a mistake in lstm_generator.py

In lstm_generator.py, line 71 (Hin[t,1:1+d] = X[t]) and line 72 (Hin[t,1+d:] = prev) should be exchanged, because the hidden size is d, which is the dimension of prev.
But I don't know why it doesn't raise an error; can anyone explain this?

How can i use this code to train regions & snippets RNN model?

In this code, I only find how to use images and their description sentences to train a multimodal RNN, but I don't see any functions for using the regions & snippets to train the model, as in figure 5 or section 4.3 of the paper.
How can I train my own model? How can I get results like figure 5?

multi-bleu.perl

Hi,

Is this the same script as Moses's multi-bleu.perl? I've seen that there are some modifications to the original version. I've been investigating why my baseline model's (Google NIC with VGG-E) BLEU-2/3/4 performance is really low, and what I've found is that we are not using the same evaluation scripts. I know that this task is different from the machine translation task, though. So, my questions are:

  • What's the intention behind the BLEU evaluation script modification?
  • Do all captioning people evaluate their models with this approach?

Thanks in advance.

Best hyperparameters for RNN model

Hi,

When I try to train the RNN model, the performance is quite poor with the default parameters, since the default values are tuned for the LSTM.

So, could you please share the tuned hyperparameters for the RNN model?

Thanks.

predict_on_images.py error

usage: predict_on_images.py [-h] [-r ROOT_PATH] [-b BEAM_SIZE] checkpoint_path
predict_on_images.py: error: the following arguments are required: checkpoint_path
An exception has occurred, use %tb to see the full traceback.

This error happened. What should I do?

Size of Descriptive Sentences

Hi Andrej,

Is there a limit to the size of the descriptive sentences? Has it been tried with multiple sentences, each describing different features of the image? For example, if an image had the descriptor "A dog in a park. A kite in the sky.", could it generate two sentences if the training data was in a similar format? Or is it better to split the descriptive sentences into several single-sentence examples and show the same image for each (i.e. image A: dog in a park, image A: kite in the sky)?

Also, is the matlab feature extractor GPU enabled?

Thanks!

R Kelly

For extra street credit, please adopt an R Kelly "real talk" meme photo in the Readme.

Why optimizing the Ws matrix directly?

Other approaches like [Show and Tell] use a We matrix for word embeddings and optimize We, but in neuraltalk I found that it directly optimizes Ws, in which each row represents a word. Why do it this way? Or which way performs better?
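For context on the question (not an answer from the author, just an illustration of the two parameterizations being compared): looking up a row of Ws is mathematically the same as multiplying a one-hot word vector by an embedding matrix.

import numpy as np

V, K = 1000, 256                   # illustrative vocabulary size and embedding size
Ws = 0.01 * np.random.randn(V, K)  # one learned row per word, as in neuraltalk
word_index = 42

x_direct = Ws[word_index]          # direct row lookup

one_hot = np.zeros(V)
one_hot[word_index] = 1.0
x_embed = one_hot.dot(Ws)          # "We"-style: one-hot vector times an embedding matrix

print(np.allclose(x_direct, x_embed))  # True -- same vector either way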

eval_sentence_predictions.py: error: too few arguments

~/tf/neuraltalk-master$ python eval_sentence_predictions.py
usage: eval_sentence_predictions.py [-h] [-b BEAM_SIZE]
[--result_struct_filename RESULT_STRUCT_FILENAME]
[-m MAX_IMAGES] [-d DUMP_FOLDER]
checkpoint_path
eval_sentence_predictions.py: error: too few arguments

When I run this script I get this error. Is the checkpoint path wrong, or is it something else? Thank you.

question about dropout implementation

Hi Andrej,

I have been learning a ton about RNNs and their implementation from looking through your code. I have a (perhaps silly) question about your dropout implementation. You claim that your code creates a mask that drops a fraction, drop_prob, of the units and then scales the remaining units by 1/(1-drop_prob). This doesn't seem correct to me, since you are sampling using np.random.randn, which samples from a normal distribution with mean 0 and variance 1.

For example, if you set drop_prob=1 (and ignore the fact that this makes your scale factor infinite) then you should be dropping all the units, but in reality you will be testing the boolean condition np.random.randn(some_shape) < (1-drop_prob). Since np.random.randn gives you negative values half the time (on average), you will only drop half the units (on average).

It seems like you want to be sampling from a uniform distribution from 0 to 1 in order for this to work properly.

Best,
Sam
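For reference, here is a minimal numpy sketch of an inverted-dropout mask drawn from a uniform distribution, as the issue suggests (an illustration, not the repository's code):

import numpy as np

def dropout_mask(shape, drop_prob):
    # each unit survives with probability (1 - drop_prob), and survivors are
    # rescaled so the expected activation stays the same
    keep_prob = 1.0 - drop_prob
    return (np.random.rand(*shape) < keep_prob) / keep_prob

mask = dropout_mask((4, 5), drop_prob=0.5)
# with np.random.randn (standard normal) the comparison would keep roughly half
# the units regardless of drop_prob, which is the behaviour described above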
