Categorizing Amazon products

Amazon lists a very large number of products across many categories. The task is to build a classification model that correctly predicts each product's category.

Project structure

  1. Data_cleaning.ipynb: First summarizes each description, keeping only the 2 sentences most relevant to the title. The title and the summarized description are then cleaned, and the resulting dataframe is stored in a csv file that can be used for feature engineering and modeling. Keyword extraction is then run on the cleaned title and description, only the noun keywords are kept, and the resulting dataframe is stored in another csv file, also for feature engineering and modeling.

  2. Modeling_naive_approach.ipynb: The dataset is highly imbalanced. To balance it, the title keywords of each category with fewer than 600 products are grouped together into a document, and cosine similarity is used to merge closely related documents. The entire data is used for modeling, which feeds a large number of features to the model and consequently leads to a very long running time, higher memory usage, and a model that overfits the training data.

  3. Modeling_keywords_naive_approach.ipynb: The major difference from Modeling_naive_approach.ipynb is that this notebook uses only the noun keywords instead of the entire sentences.

  4. Cosine_similarity_keywords.ipynb: The major difference from Modeling_keywords_naive_approach.ipynb is that the threshold of 600 products is raised to 800 products. The cosine similarity step also has a threshold for remapping: if the maximum similarity is not more than 0.35, the category is not remapped to a different category. I have also adjusted the TF-IDF parameters to reduce the size of the matrix fed to the model. The model parameters are tuned as well, to reduce the overfitting present in the 2nd and 3rd notebooks.

  5. Topic_modeling_keywords.ipynb: The major difference from Cosine_similarity_keywords.ipynb is that, instead of using cosine similarity as the feature engineering step to merge categories, I implemented Latent Dirichlet Allocation (LDA) with a threshold of 0.5.

  6. Topic_modeling_keywords_new_vectorization.ipynb: The major difference from Topic_modeling_keywords.ipynb is that, instead of using max_features in TF-IDF, which takes the top features of the entire data, I took the top 10 feature names for each category and used those feature names to vectorize the data.
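
The per-category vocabulary idea from notebook 6 can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the function name `top_terms_per_category`, the toy data, and ranking terms by raw frequency (rather than by TF-IDF score) are all assumptions made to keep the example dependency-free.

```python
from collections import Counter, defaultdict


def top_terms_per_category(docs, labels, top_n=10):
    """Build a vocabulary from the top_n most frequent terms of each category.

    Hypothetical helper: terms are ranked by raw count, which is a
    simplification of ranking by TF-IDF score.
    """
    counts = defaultdict(Counter)
    for text, label in zip(docs, labels):
        counts[label].update(text.lower().split())

    vocab = set()
    for counter in counts.values():
        vocab.update(term for term, _ in counter.most_common(top_n))
    return sorted(vocab)


# Toy data (hypothetical), with top_n=2 so the effect is visible.
docs = [
    "wireless mouse usb mouse",
    "gaming keyboard usb",
    "cotton shirt blue shirt",
    "denim jeans blue",
]
labels = ["electronics", "electronics", "clothing", "clothing"]

vocabulary = top_terms_per_category(docs, labels, top_n=2)
print(vocabulary)  # ['blue', 'mouse', 'shirt', 'usb']
```

A list built this way could then be passed as the `vocabulary` argument of scikit-learn's `TfidfVectorizer`, so that every category contributes features even when it is small.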

P.S. - I tried implementing Linear Discriminant Analysis for dimensionality reduction in the 4th and 5th notebooks, but it gave very poor results.
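
The threshold-guarded remapping described for Cosine_similarity_keywords.ipynb can be sketched as below. This is a sketch under stated assumptions rather than the notebook's implementation: it uses raw keyword counts instead of TF-IDF vectors, it assumes small categories are merged into the most similar large category, and the names `cosine`, `remap_small_categories`, and the toy data are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count mappings."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def remap_small_categories(keywords_by_category, product_counts,
                           min_products=800, min_similarity=0.35):
    """Map each small category to its most similar large category.

    A category keeps its own label when it already has at least
    min_products products, or when its best similarity does not
    exceed min_similarity.
    """
    vectors = {c: Counter(kw) for c, kw in keywords_by_category.items()}
    large = [c for c in vectors if product_counts[c] >= min_products]

    mapping = {}
    for cat in vectors:
        mapping[cat] = cat  # default: keep the original label
        if product_counts[cat] >= min_products or not large:
            continue
        best = max(large, key=lambda c: cosine(vectors[cat], vectors[c]))
        if cosine(vectors[cat], vectors[best]) > min_similarity:
            mapping[cat] = best
    return mapping


# Toy data (hypothetical): one large category and two small ones.
keywords_by_category = {
    "laptops": ["laptop", "screen", "battery", "laptop"],
    "laptop bags": ["laptop", "bag", "sleeve"],
    "garden gnomes": ["gnome", "garden", "statue"],
}
product_counts = {"laptops": 5000, "laptop bags": 300, "garden gnomes": 120}

mapping = remap_small_categories(keywords_by_category, product_counts)
print(mapping)
```

In this toy run, "laptop bags" is similar enough to "laptops" (about 0.47) to be merged, while "garden gnomes" shares no keywords with any large category and keeps its own label.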

Raw Dataset

Caution: While training the models you might run into memory issues. To resolve the error on a Windows system, follow these steps:

a. Press Windows + X and click on System

b. Navigate to 'Advanced system settings' on the right side; a System Properties pop-up will open

c. In the 'Advanced' tab, click on Settings; a Performance Options pop-up will open

d. In the 'Advanced' tab, click on Change; a Virtual Memory pop-up will open

e. Deselect 'Automatically manage paging file size for all drives', select the drive on which the code is running, and then click Custom size

f. Try to keep the maximum size around 33000 MB for the code to run, click on Set, then OK, and restart your system for the changes to take effect

Contributors

siddhantmest