Link to the trained model: https://drive.google.com/file/d/1RBWUOFFjPXy1w-eMFN4mtQuZlm05QaaY/view?usp=sharing
The model file is hosted on Google Drive because files larger than 25 MB could not be uploaded to GitHub.
This GitHub repository contains the code submitted by Team Nightwing for the AI Hackathon hosted by Analytics Club, IIT Bombay in collaboration with EarlySalary. Our model and technique placed second in the hackathon.
We used an end-to-end Deep Siamese Model to recognize faces in the dataset. The model is based on an encoder-decoder paradigm: a Deep-CNN layer acts as the encoder and generates vector embeddings for the input image pair, and a novel decoder layer predicts the similarity between the two input images.
- hackathon.ipynb - A cleaned Jupyter Notebook file containing my implementation of the algorithm, i.e. training and testing.
- run.ipynb - Jupyter Notebook containing code to load the model and the data and create a csv file of the predictions.
- run.py - Python version of the executable run file. You only need to change the file directory variables to match your local system.
- predictions.csv - csv file containing predictions on the test set.
- requirements.txt - Text file containing system requirements to run the code.
- work_summary.pdf - PDF file containing a summary of the work done and detailed information about the problem modelling, the layers used, and the hyperparameters.
We modelled the problem using a Siamese Network paradigm. We took a pair of images and passed both through the same network (equivalently, two copies of the same architecture with shared weights), embedded each image separately, and produced a similarity probability as the output using the Softmax function. This modelling seemed very natural, given the format of the training data.
Additionally, there existed a class imbalance in the training data. The training data only had 36% positive examples - examples of similar faces/people. Thus we used a weighted version of the Cross Entropy Loss as our objective function, to deal with this and improve the model’s performance.
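A minimal sketch of a class-weighted cross-entropy loss in PyTorch, for illustration. The README gives only the 36% positive rate; the inverse-frequency weighting scheme, the label convention (1 = same person), and the dummy batch below are assumptions, not the weights actually used in the submission.

```python
import torch
import torch.nn as nn

# Assumption: label 1 = "same person" (36% of pairs), label 0 = "different".
pos_fraction = 0.36
class_freq = torch.tensor([1.0 - pos_fraction, pos_fraction])

# Inverse-frequency weights, rescaled so they sum to the number of classes.
class_weights = 1.0 / class_freq
class_weights = class_weights / class_weights.sum() * len(class_freq)

# Errors on the rarer positive class now cost more than errors on the negative class.
criterion = nn.CrossEntropyLoss(weight=class_weights)

# Dummy batch of 4 pairs with 2 logits each (illustrative values only).
logits = torch.tensor([[2.0, -1.0], [0.5, 0.3], [-1.0, 1.5], [0.0, 0.0]])
labels = torch.tensor([0, 1, 1, 0])
loss = criterion(logits, labels)
```

With this scheme the minority class receives the larger weight, so the loss is not dominated by the more frequent "different person" pairs.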
The optimizer used is AdamW, which was proposed in the paper “Decoupled Weight Decay Regularization” and is available in PyTorch. AdamW is a modification of Adam that decouples weight decay from the gradient-based update, correcting the way weight decay behaves in adaptive optimizers (Adam itself combines RMSProp with momentum-based gradient descent). Hyperparameter Settings are given in the last section.
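To illustrate what decoupled weight decay means in practice, the sketch below takes one optimizer step with a zero gradient on a single weight. The learning rate and weight decay values are arbitrary illustration values, not the hyperparameters used for the submission, and `one_step` is a hypothetical helper.

```python
import torch

def one_step(optimizer_cls):
    """Run a single optimizer step on one weight with a zero gradient."""
    w = torch.nn.Parameter(torch.tensor([1.0]))
    opt = optimizer_cls([w], lr=0.1, weight_decay=0.5)  # illustrative values
    w.grad = torch.zeros_like(w)  # no signal from the loss at all
    opt.step()
    return w.item()

# AdamW shrinks the weight directly: w <- w * (1 - lr * weight_decay).
w_adamw = one_step(torch.optim.AdamW)

# Adam folds the decay term into the gradient, so it passes through the
# adaptive rescaling and the resulting step no longer depends on weight_decay
# in the same way.
w_adam = one_step(torch.optim.Adam)
```

In this toy case the AdamW weight is scaled by exactly `1 - lr * weight_decay`, while the Adam update is set by the normalised gradient instead.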
A pre-trained MTCNN layer, implemented in facenet_pytorch, has been used to pre-process images by cropping only the facial region of each image. Images in which the MTCNN model could not find any face have simply been resized to 160px * 160px.
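The crop-or-resize fallback described above could look roughly like the following sketch. `crop_or_resize` and its `detector` argument are hypothetical names; the detector is assumed to return a `(3, 160, 160)` tensor or `None` when no face is found, which is how a `facenet_pytorch` `MTCNN(image_size=160)` instance behaves when called on a PIL image. The `(x - 127.5) / 128` scaling of the fallback is an assumption chosen to match that library's default face normalisation.

```python
import numpy as np
import torch
from PIL import Image

def crop_or_resize(img, detector, size=160):
    """Return a (3, size, size) float tensor for one PIL image.

    `detector` is any callable returning a cropped-face tensor or None,
    e.g. an MTCNN(image_size=160) instance from facenet_pytorch.
    """
    face = detector(img)
    if face is not None:
        return face
    # Fallback: no face detected, so resize the whole image instead.
    resized = img.convert("RGB").resize((size, size))
    arr = np.array(resized, dtype=np.float32)      # (H, W, 3), values 0..255
    tensor = torch.from_numpy(arr).permute(2, 0, 1)  # (3, H, W)
    return (tensor - 127.5) / 128.0                # assumed normalisation

# Usage with a stand-in detector that never finds a face.
blank = Image.new("RGB", (300, 200))
fallback = crop_or_resize(blank, detector=lambda im: None)
```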
A pre-trained Inception ResNet V1 layer, implemented in facenet_pytorch, has been used to generate vector embeddings of the cropped input images. The pre-trained model was trained on the VGGFace2 dataset, which contains 3.31 million images of 9131 subjects. We applied transfer learning to this layer: we froze all but the last layer of the model, so that the ResNet could be fine-tuned to better fit our dataset.
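The freezing scheme and the Siamese weight sharing can be sketched as follows. The small `encoder` here is a hypothetical stand-in so the example runs without downloading weights; in the actual pipeline this role is played by the pre-trained Inception ResNet V1, whose layer names differ. `embed_pair` is likewise an illustrative helper.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the pre-trained encoder (512-d embeddings assumed).
encoder = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 512),
)

# Freeze everything, then unfreeze only the last layer for fine-tuning.
for p in encoder.parameters():
    p.requires_grad = False
last_layer = list(encoder.children())[-1]
for p in last_layer.parameters():
    p.requires_grad = True

def embed_pair(img_a, img_b):
    """Siamese forward pass: the same encoder (shared weights) embeds both images."""
    return encoder(img_a), encoder(img_b)

trainable = [name for name, p in encoder.named_parameters() if p.requires_grad]
emb_a, emb_b = embed_pair(torch.randn(2, 3, 160, 160), torch.randn(2, 3, 160, 160))
```

Because a single module is called twice, gradients from both branches accumulate into the same trainable parameters.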
The Neural Tensor Network, along with the RBF Kernel similarity, begins the decoder part of our model. The Neural Tensor Network was used successfully by Bai et al. in SimGNN for Graph Similarity Computation, where it helps estimate Graph Edit Distances. Neural Tensor Networks have also proven their mettle in computing semantic similarity between pairs of word embeddings. This layer takes two image embeddings as input and calculates 'K' similarity scores between them (where K is a hyperparameter).
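A minimal sketch of a Neural Tensor Network layer in the commonly used form `f(h1^T W[1:K] h2 + V [h1; h2] + b)`. The class name, the ReLU activation, the embedding size of 512, and K = 16 are assumptions for illustration; the exact configuration used in the submission is described in work_summary.pdf.

```python
import torch
import torch.nn as nn

class NeuralTensorNetwork(nn.Module):
    """Maps a pair of embeddings to K similarity scores."""

    def __init__(self, dim, k):
        super().__init__()
        # K bilinear forms: h1^T W_k h2 for k = 1..K
        self.bilinear = nn.Bilinear(dim, dim, k, bias=False)
        # Linear term on the concatenated pair: V [h1; h2] + b
        self.linear = nn.Linear(2 * dim, k)

    def forward(self, h1, h2):
        pair = torch.cat([h1, h2], dim=-1)
        # ReLU is an assumed choice of activation.
        return torch.relu(self.bilinear(h1, h2) + self.linear(pair))

ntn = NeuralTensorNetwork(dim=512, k=16)
scores = ntn(torch.randn(4, 512), torch.randn(4, 512))  # shape (4, 16)
```

Each of the K output channels has its own bilinear weight matrix, so the layer can learn K different notions of similarity between the two embeddings.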
The RBF Kernel similarity is a sanity check that supplements the similarity scores of the Neural Tensor Network. It takes in two image embeddings and outputs a single value: the exponential raised to the power of the square of the L2-Norm of the difference between the two vector embeddings. This layer is optional, and the network gives the same performance without it. It is a helper to the NTN layer, not the main layer.
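For reference, a sketch of an RBF kernel similarity in its standard form, `exp(-gamma * ||h1 - h2||^2)`. The description above does not mention a sign or a bandwidth, so the negative exponent and the `gamma` parameter (and the function name) are assumptions taken from the usual definition of the kernel, not from the submitted code.

```python
import torch

def rbf_similarity(h1, h2, gamma=1.0):
    """Standard RBF kernel between batches of embeddings, shape (B, 1)."""
    sq_dist = (h1 - h2).pow(2).sum(dim=-1, keepdim=True)  # squared L2 norm
    return torch.exp(-gamma * sq_dist)

a = torch.randn(4, 512)
same = rbf_similarity(a, a)                       # identical embeddings
other = rbf_similarity(a, a + 0.01, gamma=0.1)    # slightly shifted embeddings
```

With the negative exponent the score lies in (0, 1] and equals 1 only when the two embeddings coincide.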
The outputs of the Neural Tensor Network and the RBF Kernel are concatenated and fed jointly into a feedforward neural network layer, which predicts whether the two images are of the same person. The task is treated as a binary classification problem: same or not-same.
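The concatenation and classification step might be sketched like this. `SimilarityHead`, the hidden size of 32, and K = 16 are hypothetical; random tensors stand in for the K similarity scores and the single kernel value.

```python
import torch
import torch.nn as nn

class SimilarityHead(nn.Module):
    """Feedforward classifier over K NTN scores plus one RBF score."""

    def __init__(self, k, hidden=32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(k + 1, hidden),  # K scores + 1 kernel value
            nn.ReLU(),
            nn.Linear(hidden, 2),      # logits for not-same / same
        )

    def forward(self, ntn_scores, rbf_score):
        features = torch.cat([ntn_scores, rbf_score], dim=-1)
        return self.mlp(features)

head = SimilarityHead(k=16)
logits = head(torch.rand(4, 16), torch.rand(4, 1))  # stand-in inputs
probs = torch.softmax(logits, dim=-1)               # per-pair class probabilities
```

The head returns raw logits so that it can be paired with a cross-entropy loss during training; softmax is applied only when probabilities are needed.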
The model was trained on Google Colab's Free GPU Runtime. It achieved a Training Accuracy of 93.55% and a Test Set Accuracy of 92.88%.