Create a deep learning architecture with two components: a CNN that transforms the input image into a set of features, and an RNN that turns those features into descriptive text (captions).
The project is broken up into four main parts, one per Python notebook:
- 0_Dataset.ipynb : Load and visualize the COCO dataset used to train the network. The Microsoft Common Objects in COntext (MS COCO) dataset is a large-scale dataset for scene understanding, commonly used to train and benchmark object detection, segmentation, and captioning algorithms.
- 1_Preliminaries.ipynb : Design the CNN-RNN model for automatically generating image captions: a CNN encoder that transforms the input image into a set of features, and an RNN decoder built from LSTM cells that generates captions from those features.
- 2_Training.ipynb : Train the CNN-RNN model.
- 3_Inference.ipynb : Use the trained model to generate captions for images in the test dataset.
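The encoder-decoder pair described above can be sketched in PyTorch. This is a hedged illustration, not the notebooks' exact code: the class names, layer sizes, and the toy convolutional stack are assumptions (a real encoder would typically reuse a pretrained backbone such as ResNet and only train the final projection layer).

```python
import torch
import torch.nn as nn

class EncoderCNN(nn.Module):
    """Toy stand-in for a pretrained CNN backbone: maps an image
    to a single feature vector of size embed_size."""
    def __init__(self, embed_size):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # global average pool -> (batch, 16, 1, 1)
        )
        self.fc = nn.Linear(16, embed_size)

    def forward(self, images):
        x = self.conv(images).flatten(1)   # (batch, 16)
        return self.fc(x)                  # (batch, embed_size)

class DecoderRNN(nn.Module):
    """LSTM decoder: the image feature occupies the first time step,
    followed by embedded caption tokens (teacher forcing at train time)."""
    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Drop the final token; the image feature takes the first slot,
        # so outputs align with the full caption of length T.
        emb = self.embed(captions[:, :-1])                    # (batch, T-1, embed)
        inputs = torch.cat([features.unsqueeze(1), emb], 1)   # (batch, T, embed)
        out, _ = self.lstm(inputs)
        return self.fc(out)                                   # (batch, T, vocab)
```

Feeding the feature vector as the LSTM's first input (rather than as its initial hidden state) is one common design choice for this architecture; the scores at each step are then compared against the ground-truth caption tokens.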
Trained the network for around 10 hours on a GPU, reaching an average per-token (cross-entropy) loss of around 2.
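For context on the loss figure: captioning models like this one are typically trained with per-token cross-entropy, so the reported value is a raw loss, not a percentage. A minimal sketch of how it is computed (the shapes and the vocabulary size of 100 here are illustrative assumptions, not the project's actual values):

```python
import torch
import torch.nn as nn

# Illustrative shapes: batch of 2 captions, 12 tokens each, vocab of 100 words.
scores = torch.randn(2, 12, 100)            # decoder logits (random, i.e. untrained)
targets = torch.randint(0, 100, (2, 12))    # ground-truth token ids
criterion = nn.CrossEntropyLoss()
# CrossEntropyLoss expects (N, C) logits and (N,) targets,
# so the batch and time dimensions are flattened together.
loss = criterion(scores.reshape(-1, 100), targets.reshape(-1))
print(loss.item())  # near ln(100) ≈ 4.6 for random logits; training drives this down
```

An average loss near 2 corresponds to a per-token perplexity of about e² ≈ 7.4, i.e. the model is roughly choosing among seven plausible next words at each step.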