
Visual Question Answering in Tensorflow

This is a TensorFlow implementation of the VIS + LSTM visual question answering model from the paper Exploring Models and Data for Image Question Answering by Mengye Ren, Ryan Kiros & Richard Zemel. The model architecture varies slightly from the original: the image embedding is plugged into the last LSTM step (after the question) instead of the first. The LSTM model uses the same hyperparameters as those in the Torch implementation of neural-VQA.

Model architecture (figure)

Requirements

  • TensorFlow
  • h5py (the extracted fc7 features and processed data are stored as h5 files)

Datasets

  • Download the MSCOCO train+val images and VQA data using Data/download_data.sh. Extract all the downloaded zip files inside the Data folder.
  • Download the pretrained VGG-16 TensorFlow model and save it in the Data folder.

Usage

  • Extract the fc7 image features using:
python extract_fc7.py --split=train
python extract_fc7.py --split=val
  • Training

    • Basic usage: python train.py
    • Options
      • rnn_size: Size of LSTM internal state. Default is 512.
      • num_lstm_layers: Number of layers in the LSTM. Default is 2.
      • embedding_size: Size of word embeddings. Default is 512.
      • learning_rate: Learning rate. Default is 0.001.
      • batch_size: Batch size. Default is 200.
      • epochs: Number of full passes through the training data. Default is 50.
      • img_dropout: Dropout for the image embedding layer (probability of dropping the input). Default is 0.5.
      • word_emb_dropout: Dropout for word embeddings. Default is 0.5.
      • data_dir: Directory containing the data h5 files. Default is Data/.
  • Prediction

    • python predict.py --image_path="sample_image.jpg" --question="What is the color of the animal shown?" --model_path="Data/Models/model2.ckpt"
    • Models are saved in Data/Models after each complete pass through the training data. Supply the path of a trained model via the model_path option.
  • Evaluation

    • Run python evaluate.py with the same options that you used for train.py, if they differ from the defaults (see the example after this list).
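For example, a run with non-default hyperparameters and its matching evaluation might look like the following (the option values here are only illustrative):

python train.py --rnn_size=1024 --batch_size=128
python evaluate.py --rnn_size=1024 --batch_size=128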

Implementation Details

  • fc7 ReLU layer features from the pretrained VGG-16 model are used as image embeddings. I did not scale these features, and I am not sure whether doing so would make a difference.
  • Questions are zero-padded to a fixed length so that batch training can be used. Questions are represented as word indices into a question-word vocabulary built during preprocessing.
  • Answers are mapped to a 1000-word answer vocabulary, which covers 87% of the answers across the training and validation datasets.
  • The VIS+LSTM model is defined in vis_lstm.py. The input tensors for training are the fc7 features, the questions (word indices, up to 22 words), and the answers (one-hot vectors of size 1000). The model depicted in the figure is implemented with 2 LSTM layers by default (num_lstm_layers is configurable); a minimal sketch of the forward pass is shown after this list.
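A minimal sketch of this forward pass, written in TensorFlow 1.x style, is shown below. The variable names, question vocabulary size, and initializations are illustrative and do not necessarily match vis_lstm.py; the sketch only traces the data flow described above: the padded question word indices are fed through a stacked LSTM, the fc7 image embedding is plugged in as the final step, and the last output is classified over the 1000-answer vocabulary.

import tensorflow as tf  # TensorFlow 1.x style API assumed

vocab_size      = 15000  # question-word vocabulary size (illustrative)
num_answers     = 1000   # answer vocabulary size
max_words       = 22     # questions are zero-padded to 22 word indices
embedding_size  = 512
rnn_size        = 512
num_lstm_layers = 2

# Training inputs: fc7 image features, padded question indices, one-hot answers
fc7      = tf.placeholder(tf.float32, [None, 4096], name='fc7')
question = tf.placeholder(tf.int32,   [None, max_words], name='question')
answer   = tf.placeholder(tf.float32, [None, num_answers], name='answer')

# Word embedding table and a linear map from fc7 space into the LSTM input space
word_emb_W = tf.Variable(tf.random_uniform([vocab_size, embedding_size], -0.1, 0.1))
img_emb_W  = tf.Variable(tf.random_uniform([4096, embedding_size], -0.1, 0.1))
img_emb_b  = tf.Variable(tf.zeros([embedding_size]))

cell  = tf.nn.rnn_cell.MultiRNNCell(
    [tf.nn.rnn_cell.BasicLSTMCell(rnn_size) for _ in range(num_lstm_layers)])
state = cell.zero_state(tf.shape(question)[0], tf.float32)

with tf.variable_scope('vis_lstm'):
    # Feed the question words first ...
    for t in range(max_words):
        if t > 0:
            tf.get_variable_scope().reuse_variables()
        word_t = tf.nn.embedding_lookup(word_emb_W, question[:, t])
        output, state = cell(word_t, state)
    # ... then plug the image embedding into the last LSTM step
    image = tf.nn.tanh(tf.matmul(fc7, img_emb_W) + img_emb_b)
    output, state = cell(image, state)

# Classify the final LSTM output over the 1000 answers
ans_W  = tf.Variable(tf.random_uniform([rnn_size, num_answers], -0.1, 0.1))
ans_b  = tf.Variable(tf.zeros([num_answers]))
logits = tf.matmul(output, ans_W) + ans_b
loss   = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=answer, logits=logits))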

Results

The model achieved an accuracy of 50.8% on the validation dataset after 12 epochs of training over the entire training dataset.

Sample Predictions

The fun part! Try it for yourself. Make sure you have TensorFlow installed. Download the data files and trained model from this link and save them in the Data/ directory. Also download the pretrained VGG-16 model and save it as Data/vgg16.tfmodel. You can test on any sample image using:

python predict.py --image_path="Data/sample.jpg" --question="Which animal is this?" --model_path="Data/model2.ckpt"
Each row below corresponds to a sample image (the images are not shown here).

Question | Top Answers (left to right)
What color is the signal? | red, green, yellow
What animal is this? | giraffe, cow, horse
What animal is this? | cat, dog, giraffe
What color is the frisbee that is in the dog's mouth? | white, brown, red
What color is the frisbee that is upside down? | red, white, blue
What are they playing with? | frisbee, soccer ball, soccer
What is in the standing person's hand? | bat, glove, ball
What are they doing? | surfing, swimming, parasailing
What sport is this? | skateboarding, parasailing, surfing

References

  • Exploring Models and Data for Image Question Answering, Mengye Ren, Ryan Kiros & Richard Zemel
  • neural-VQA (Torch implementation)
  • Pretrained VGG-16 TensorFlow model
