Deployed Web App (not working currently): http://product-matching-webapp.el.r.appspot.com/
YouTube Link: https://www.youtube.com/watch?v=uQq281Uzb9k
E-commerce has seen an incredible surge in users over the past few years, a transition further accelerated by the COVID-19 pandemic. It has therefore become increasingly important for every e-commerce company to provide high-quality search results and recommendations. With millions of third-party sellers operating on their websites, distinguishing between products has become increasingly difficult.
The goal of this project is to develop an efficient strategy for finding similar products on an e-commerce website by utilizing each product's image and text label.
Each image cannot be compared one by one against the whole image dataset: that approach would be computationally expensive and excessively time-intensive due to the sheer number of images. Comparing texts directly may likewise not give the desired outcome.
Hence, fine-tuned pre-trained CNN models are used to generate image embeddings, and a similar approach converts the text data into word embeddings using a TfidfVectorizer and a Transformer. This approach produces an average F1 score of 0.87, compared to a baseline score of 0.55.
Embeddings are vector representations of data formed by converting high-dimensional data (images, text, sound files, etc.) into relatively low-dimensional vectors. They make it easier to perform machine learning on large inputs.
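As a minimal illustration of this dimensionality reduction (using a fixed random projection rather than a trained model, so the numbers are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

# A toy "image": 64x64 RGB, flattened into a 12,288-dimensional vector.
image = rng.random((64, 64, 3)).reshape(-1)

# Illustration only: project the high-dimensional input down to a
# 128-dimensional embedding with a fixed random matrix. Real embeddings
# come from a trained model, but the shape transformation is the same.
projection = rng.normal(size=(image.size, 128))
embedding = image @ projection

print(image.shape)      # (12288,)
print(embedding.shape)  # (128,)
```

The key point is that downstream similarity search operates on the compact 128-dimensional vectors rather than the raw pixels.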
This dataset was provided by Shopee, a Singaporean multinational technology company that focuses mainly on e-commerce.
Rather than building a model for embedding generation from scratch, the best method is to use state-of-the-art image models and fine-tune them on our dataset. Using these pre-trained models without any fine-tuning gives an average result (average F1 score of 0.59), whereas the fine-tuned models perform much better (average F1 score of 0.73). The image below represents the model used to generate image embeddings.
The fine-tuning process is borrowed from facial recognition systems: an ArcFace margin layer is used instead of a softmax layer during fine-tuning.
Unlike softmax, ArcFace explicitly optimizes the feature embeddings to enforce higher similarity between data of the same class, which in turn leads to higher-quality embeddings being generated.
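A minimal NumPy sketch of the ArcFace idea (the shapes and values here are hypothetical; in practice this sits as a layer inside the training graph):

```python
import numpy as np

def arcface_logits(features, weights, labels, s=30.0, m=0.5):
    """Sketch of the ArcFace margin: add an angular margin m to the
    target-class angle before scaling by s, instead of plain softmax logits.
    features: (batch, dim), weights: (dim, classes), labels: (batch,)."""
    # L2-normalise so the dot product is the cosine of the angle theta.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=0, keepdims=True)
    cos_theta = np.clip(f @ w, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    # Add the margin only on each sample's true class, making that class
    # harder to score highly and pushing same-class embeddings together.
    margin = np.zeros_like(cos_theta)
    margin[np.arange(len(labels)), labels] = m
    return s * np.cos(theta + margin)

rng = np.random.default_rng(0)
logits = arcface_logits(rng.normal(size=(4, 8)), rng.normal(size=(8, 3)),
                        np.array([0, 2, 1, 1]))
print(logits.shape)  # (4, 3)
```

During inference the margin layer is discarded and only the normalised features are kept as embeddings.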
After embedding generation, the goal is to produce accurate predictions using the K-Nearest-Neighbours algorithm and cosine similarity. Due to the large amount of input data, the sklearn framework cannot be used here, as it leads to an out-of-memory error. Hence the RAPIDS library is used: an open-source framework that accelerates data science workflows by providing the ability to execute end-to-end data science and analytics pipelines entirely on GPUs.
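The neighbour search can be sketched on toy data with sklearn; RAPIDS cuML exposes a `NearestNeighbors` class that mirrors this API, so on GPU the same code applies with the import swapped (the embeddings below are made up for illustration):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Four toy 2-D embeddings; rows 0/1 and rows 2/3 point in similar directions.
embeddings = np.array([[1.0, 0.0],
                       [0.9, 0.1],
                       [0.0, 1.0],
                       [0.1, 0.9]])

# Cosine metric: distance = 1 - cosine similarity.
knn = NearestNeighbors(n_neighbors=2, metric="cosine")
knn.fit(embeddings)
distances, indices = knn.kneighbors(embeddings)

# Each row's nearest neighbour is itself (distance ~0); the second
# neighbour is the vector pointing in roughly the same direction.
print(indices[0])  # [0 1]
```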
Predictions from the different image models are merged using either of the prediction approaches discussed later in this document to generate the final image-based predictions.
The product’s text label is converted into word embeddings using two different approaches: a TfidfVectorizer and a Sentence Transformer are each used to encode every text label.
TF-IDF (term frequency–inverse document frequency) is used to understand the relevance of a word in a document. TfidfVectorizer uses the TF-IDF values to generate word embeddings for each and every label.
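A small sketch of this step with scikit-learn's TfidfVectorizer (the product titles are invented examples):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

titles = [
    "blue cotton t-shirt men",
    "blue cotton t-shirt women",
    "stainless steel water bottle",
]

# Each title becomes a sparse TF-IDF vector; rarer terms get higher weight.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(titles)

# Rows are L2-normalised, so the dot product is the cosine similarity.
sim = (tfidf @ tfidf.T).toarray()
print(round(sim[0, 1], 2))  # high: titles share most terms
print(round(sim[0, 2], 2))  # 0.0: no terms in common
```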
Sentence Transformers is a framework that provides an easy way to generate vector representations of text using transformer networks such as BERT and RoBERTa. For this application, a pre-trained transformer is used to generate sentence embeddings for finding the semantic similarity between text data.
After embedding generation, both the TF-IDF and transformer embeddings can be used by either of the prediction approaches for the final prediction calculation. Cosine similarity measures the similarity between two vectors by calculating the cosine of the angle between them, determining whether the two vectors point in the same direction. To generate the final predictions, a minimum similarity threshold is decided, and all data points with a similarity value greater than the threshold are the required predictions. (The higher the similarity value, the closer the relation between data points.)
NearestNeighbour is a common algorithm used to find the required number of nearest data points according to a chosen metric. This allows us to find accurate predictions by deciding a threshold distance: all data points with a distance less than the threshold are the required predictions. (The lower the distance, the closer the relation between data points.)
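The thresholding step described above can be sketched as follows (toy embeddings and an arbitrary threshold, chosen for illustration):

```python
import numpy as np

def predict_matches(embeddings, threshold=0.8):
    """Sketch of the thresholding step: for each item, return the indices
    of all items whose cosine similarity exceeds the chosen threshold."""
    norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = norm @ norm.T  # pairwise cosine similarity matrix
    return [np.where(row > threshold)[0].tolist() for row in sim]

# Items 0 and 1 point in nearly the same direction; item 2 is orthogonal.
emb = np.array([[1.0, 0.0], [0.95, 0.05], [0.0, 1.0]])
print(predict_matches(emb))  # [[0, 1], [0, 1], [2]]
```

The equivalent distance-based formulation keeps neighbours with `1 - similarity` below the threshold instead.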
The first approach performs slightly better than the second. The first approach develops predictions from the merged embeddings (image and text embeddings combined), whereas the second approach merges the independent predictions.
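The difference between the two approaches can be sketched as follows (the shapes and prediction sets are hypothetical):

```python
import numpy as np

# Approach 1: concatenate image and text embeddings, then run a single
# nearest-neighbour search over the combined vectors.
image_emb = np.random.rand(5, 128)  # hypothetical image embeddings
text_emb = np.random.rand(5, 64)    # hypothetical text embeddings
merged_emb = np.concatenate([image_emb, text_emb], axis=1)  # (5, 192)

# Approach 2: predict independently from each modality, then take the
# union of the prediction sets per item.
image_preds = {0: {0, 1}, 1: {0, 1}, 2: {2}}
text_preds = {0: {0, 3}, 1: {1}, 2: {2, 4}}
final_preds = {i: image_preds[i] | text_preds[i] for i in image_preds}
print(final_preds[0])  # {0, 1, 3}
```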
Both prediction methods are implemented using RAPIDS, the open-source library developed by NVIDIA.
The metric used to judge performance is the average F1 score: for each data entry, the F1 score is calculated, and the mean of all F1 scores is then taken. The F1 score measures a test's accuracy and is calculated from the precision and recall of the test.
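Concretely, the row-level F1 and its mean can be computed as below (the prediction and ground-truth sets are made-up examples):

```python
def f1(preds, truth):
    """Row-level F1 from precision and recall over the predicted set."""
    tp = len(set(preds) & set(truth))
    if tp == 0:
        return 0.0
    precision = tp / len(preds)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)

# Hypothetical predictions vs. ground-truth match groups for three rows.
rows = [(["a", "b"], ["a", "b"]),   # perfect match -> F1 = 1.0
        (["a", "b", "c"], ["a"]),   # precision 1/3, recall 1 -> F1 = 0.5
        (["x"], ["y"])]             # no overlap -> F1 = 0.0
mean_f1 = sum(f1(p, t) for p, t in rows) / len(rows)
print(round(mean_f1, 2))  # 0.5
```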
The dataset provides baseline predictions using pHash. A pHash is a fingerprint of a multimedia file derived from various features of its content; if two pHashes are 'close' enough, the corresponding data points are similar.
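"Closeness" between two pHashes is commonly measured as the Hamming distance between their bits; a quick sketch (the hash strings below are made-up 64-bit examples):

```python
def hamming_distance(phash_a, phash_b):
    """Compare two pHash hex strings bit by bit; a small distance
    means the underlying images are likely near-duplicates."""
    a, b = int(phash_a, 16), int(phash_b, 16)
    return bin(a ^ b).count("1")

# Hypothetical 16-character (64-bit) pHashes: the first two differ in
# a single bit, so the images would be treated as near-duplicates.
h1 = "d1c8f2a4b3e09871"
h2 = "d1c8f2a4b3e09873"
print(hamming_distance(h1, h2))  # 1
```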