Deployed Web App (not working currently): http://product-matching-webapp.el.r.appspot.com/
YouTube Link: https://www.youtube.com/watch?v=uQq281Uzb9k
E-commerce has seen an incredible surge in users over the past few years, a transition further accelerated by the COVID-19 pandemic. It has therefore become increasingly important for every e-commerce company to provide high-quality search results and recommendations. With millions of third-party sellers operating on their websites, distinguishing between products has become increasingly difficult.
The goal of this project is to develop an efficient strategy for finding similar products on an e-commerce website by utilizing each product's image and text label.
Each image cannot be compared one by one against the whole image dataset: that approach would be computationally expensive and excessively time-intensive due to the sheer number of images. Comparing texts directly may likewise not give the desired outcome.
Hence, fine-tuned pre-trained CNN models are used to generate image embeddings, and a similar approach converts the text data into word embeddings using a TfidfVectorizer and a Transformer. This approach produces an average F1 score of 0.87, compared to a baseline score of 0.55.
Embeddings are vector representations of data formed by converting high-dimensional data (images, text, sound files, etc.) into relatively low-dimensional vectors. They make it easier to perform machine learning on large inputs.
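As a minimal illustration of this dimensionality reduction (using a fixed random projection rather than a trained model, so the numbers are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

# A toy "image": 64x64 RGB, flattened into a 12,288-dimensional vector.
image = rng.random((64, 64, 3)).reshape(-1)

# Illustration only: project the high-dimensional input down to a
# 128-dimensional embedding with a fixed random matrix. Real embeddings
# come from a trained model, but the shape transformation is the same.
projection = rng.normal(size=(image.size, 128))
embedding = image @ projection

print(image.shape)      # (12288,)
print(embedding.shape)  # (128,)
```

The key point is that downstream similarity search operates on the compact 128-dimensional vectors rather than the raw pixels.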
This dataset was provided by Shopee, a Singaporean multinational technology company that focuses mainly on e-commerce.
Rather than building a model for embedding generation from scratch, the best method is to use state-of-the-art image models and fine-tune them on our dataset. Using these pre-trained models without any fine-tuning gives an average result (average F1 score of 0.59), whereas the fine-tuned models perform much better (average F1 score of 0.73). The image below represents the model used to generate image embeddings.
The fine-tuning process is borrowed from facial recognition systems: an ArcFace margin layer is used instead of a softmax layer during fine-tuning.
Unlike softmax, ArcFace explicitly optimizes the feature embeddings to enforce higher similarity between data of the same class, which in turn leads to higher-quality embeddings being generated.
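A minimal NumPy sketch of the ArcFace idea (the shapes and values here are hypothetical; in practice this sits as a layer inside the training graph):

```python
import numpy as np

def arcface_logits(features, weights, labels, s=30.0, m=0.5):
    """Sketch of the ArcFace margin: add an angular margin m to the
    target-class angle before scaling by s, instead of plain softmax logits.
    features: (batch, dim), weights: (dim, classes), labels: (batch,)."""
    # L2-normalise so the dot product is the cosine of the angle theta.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    cos_theta = np.clip(f @ w, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    # Add the margin only on each sample's true class, making that class
    # harder to score highly and pushing same-class embeddings together.
    margin = np.zeros_like(cos_theta)
    margin[np.arange(len(labels)), labels] = m
    return s * np.cos(theta + margin)

rng = np.random.default_rng(0)
logits = arcface_logits(rng.normal(size=(4, 8)), rng.normal(size=(8, 3)),
                        np.array([0, 2, 1, 1]))
print(logits.shape)  # (4, 3)
```

During inference the margin layer is discarded and only the normalised features are kept as embeddings.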
After embedding generation, the goal is to produce accurate predictions using the K-Nearest-Neighbours algorithm and cosine similarity. Due to the large amount of input data, the sklearn framework cannot be used here, as it leads to an out-of-memory error. Hence the RAPIDS library is used: an open-source framework that accelerates data science workflows by providing the ability to execute end-to-end data science and analytics pipelines entirely on GPUs.
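The neighbour search can be sketched on toy data with sklearn; RAPIDS cuML exposes a `NearestNeighbors` class that mirrors this API, so on GPU the same code applies with the import swapped (the embeddings below are made up for illustration):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Four toy 2-D embeddings; rows 0/1 and rows 2/3 point in similar directions.
embeddings = np.array([[1.0, 0.0],
                       [0.9, 0.1],
                       [0.0, 1.0],
                       [0.1, 0.9]])

# Cosine metric: distance = 1 - cosine similarity.
knn = NearestNeighbors(n_neighbors=2, metric="cosine")
knn.fit(embeddings)
distances, indices = knn.kneighbors(embeddings)

# Each row's nearest neighbour is itself (distance ~0); the second
# neighbour is the vector pointing in roughly the same direction.
print(indices[0])  # [0 1]
```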
Predictions from the different image models are merged using either of the prediction approaches discussed later in this document to generate the final image-based predictions.
The product’s text label is converted into word embeddings using two different approaches: a TfidfVectorizer and a Sentence Transformer are each used to encode every text label.
TF-IDF (term frequency–inverse document frequency) is used to understand the relevance of a word in a document. TfidfVectorizer uses the TF-IDF values to generate word embeddings for each and every label.
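A small sketch of this step with scikit-learn's TfidfVectorizer (the product titles are invented examples):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

titles = [
    "blue cotton t-shirt men",
    "blue cotton t-shirt women",
    "stainless steel water bottle",
]

# Each title becomes a sparse TF-IDF vector; rarer terms get higher weight.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(titles)

# Rows are L2-normalised, so the dot product is the cosine similarity.
sim = (tfidf @ tfidf.T).toarray()
print(round(sim[0, 1], 2))  # high: titles share most terms
print(round(sim[0, 2], 2))  # 0.0: no terms in common
```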
Sentence Transformers is a framework that provides an easy way to generate vector representations of text using transformer networks such as BERT and RoBERTa. For this application, a pre-trained transformer is used to generate sentence embeddings for finding the semantic similarity between text data.
After embedding generation, both the TF-IDF and transformer embeddings can be used by either of the prediction approaches for the final prediction calculation. Cosine similarity measures the similarity between two vectors by calculating the cosine of the angle between them, determining whether the two vectors point in the same direction. To generate the final predictions, a minimum similarity threshold is decided, and all data points with a similarity value greater than the threshold are the required predictions. (The higher the similarity value, the closer the relation between data points.)
NearestNeighbour is a common algorithm used to find the required number of nearest data points according to a chosen metric. This allows us to find accurate predictions by deciding a threshold distance: all data points with a distance less than the threshold are the required predictions. (The lower the distance, the closer the relation between data points.)
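The thresholding step described above can be sketched as follows (toy embeddings and an arbitrary threshold, chosen for illustration):

```python
import numpy as np

def predict_matches(embeddings, threshold=0.8):
    """Sketch of the thresholding step: for each item, return the indices
    of all items whose cosine similarity exceeds the chosen threshold."""
    norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = norm @ norm.T  # pairwise cosine similarity matrix
    return [np.where(row > threshold)[0].tolist() for row in sim]

# Items 0 and 1 point in nearly the same direction; item 2 is orthogonal.
emb = np.array([[1.0, 0.0], [0.95, 0.05], [0.0, 1.0]])
print(predict_matches(emb))  # [[0, 1], [0, 1], [2]]
```

The equivalent distance-based formulation keeps neighbours with `1 - similarity` below the threshold instead.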
The first approach performs slightly better than the second. The first approach develops predictions from the merged embeddings (image and text embeddings combined), whereas the second approach merges the independent predictions.
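The difference between the two approaches can be sketched as follows (the shapes and prediction sets are hypothetical):

```python
import numpy as np

# Approach 1: concatenate image and text embeddings, then run a single
# nearest-neighbour search over the combined vectors.
image_emb = np.random.rand(5, 128)  # hypothetical image embeddings
text_emb = np.random.rand(5, 64)    # hypothetical text embeddings
merged_emb = np.concatenate([image_emb, text_emb], axis=1)  # (5, 192)

# Approach 2: predict independently from each modality, then take the
# union of the prediction sets per item.
image_preds = {0: {0, 1}, 1: {0, 1}, 2: {2}}
text_preds = {0: {0, 3}, 1: {1}, 2: {2, 4}}
final_preds = {i: image_preds[i] | text_preds[i] for i in image_preds}
print(final_preds[0])  # {0, 1, 3}
```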
Both prediction methods are implemented using RAPIDS, the open-source library developed by NVIDIA.
The metric used to judge performance is the average F1 score: for each data entry, the F1 score is calculated, and the mean of all F1 scores is then taken. The F1 score measures a test's accuracy and is calculated from the precision and recall of the test.
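Concretely, the row-level F1 and its mean can be computed as below (the prediction and ground-truth sets are made-up examples):

```python
def f1(preds, truth):
    """Row-level F1 from precision and recall over the predicted set."""
    tp = len(set(preds) & set(truth))
    if tp == 0:
        return 0.0
    precision = tp / len(preds)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)

# Hypothetical predictions vs. ground-truth match groups for three rows.
rows = [(["a", "b"], ["a", "b"]),   # perfect match -> F1 = 1.0
        (["a", "b", "c"], ["a"]),   # precision 1/3, recall 1 -> F1 = 0.5
        (["x"], ["y"])]             # no overlap -> F1 = 0.0
mean_f1 = sum(f1(p, t) for p, t in rows) / len(rows)
print(round(mean_f1, 2))  # 0.5
```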
The dataset provides baseline predictions using pHash. A pHash is a fingerprint of a multimedia file derived from various features of its content; if two pHashes are 'close' enough, the corresponding data points are similar.
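"Closeness" between two pHashes is commonly measured as the Hamming distance between their bits; a quick sketch (the hash strings below are made-up 64-bit examples):

```python
def hamming_distance(phash_a, phash_b):
    """Compare two pHash hex strings bit by bit; a small distance
    means the underlying images are likely near-duplicates."""
    a, b = int(phash_a, 16), int(phash_b, 16)
    return bin(a ^ b).count("1")

# Hypothetical 16-character (64-bit) pHashes: the first two differ in
# a single bit, so the images would be treated as near-duplicates.
h1 = "d1c8f2a4b3e09871"
h2 = "d1c8f2a4b3e09873"
print(hamming_distance(h1, h2))  # 1
```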