Product Search Relevance

This is the source code for our entry in the Home Depot data science challenge on Kaggle [+]. Our solution achieves an RMSE of 0.456 and placed in the top 3.5% of the leaderboard.
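
For reference, RMSE (root mean squared error) is the square root of the mean squared difference between the predicted and true relevance scores. A minimal sketch in plain Python follows; the function name `rmse` and the sample values are illustrative assumptions, not taken from the repository:

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences of numbers."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("inputs must be non-empty")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical relevance labels and predictions, for illustration only.
print(rmse([3.0, 2.0, 1.0], [2.5, 2.0, 2.0]))
```

Lower is better: a perfect prediction gives an RMSE of 0.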

Team:

  • Arman Akbarian
  • Ali Narimani
  • Hamid Omid

Overview of the ML pipeline:

[Image: diagram of the ML pipeline]

Feature engineering has four parts:

  • basic features: built in xgb.py before the other features are merged
  • extended features: built in feat_eng_extended.py
  • extended tf-idf: built in feat_eng_tfidf.py (followed by model_xgb.py)
  • basic tf-idf: computed inside the pipeline in xgb.py
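
To illustrate what a tf-idf feature can look like for search relevance, here is a minimal sketch that scores each search query against its product title by tf-idf cosine similarity. It uses only the standard library. All function names, the whitespace tokenization, and the smoothed idf formula are assumptions made for this example; the repository's scripts may compute their tf-idf features differently.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: tf-idf weight} dict per whitespace-tokenized document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed idf (an assumption of this sketch) keeps every weight positive.
        vectors.append(
            {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def tfidf_similarity(queries, titles):
    """One similarity score per (query, title) pair; idf is fit on all texts."""
    if len(queries) != len(titles):
        raise ValueError("queries and titles must have the same length")
    vectors = tfidf_vectors(list(queries) + list(titles))
    n = len(queries)
    return [cosine(vectors[i], vectors[n + i]) for i in range(n)]


# Hypothetical rows, for illustration only.
scores = tfidf_similarity(
    ["angle bracket", "deck paint"],
    ["steel angle bracket", "kitchen faucet"],
)
print(scores)
```

A pair that shares no terms scores 0, and the score grows as the query and title share more (and rarer) terms, which is why such a score is a plausible input feature for a relevance model.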

Text preprocessing is done by preprocessing.py.

Misc Files Description:

  • spell_corrector.py: builds spell_corr.py, a Python dictionary for correcting spelling in search queries.
  • data_trimming.py: a pre-cleaning step required by spell_corrector.py and Synonyms.py
  • data_selection.py: feature selection after the extended features are built.
  • nlp_utils.py: NLP utility functions and wrapper classes for quick prototyping
  • ngrams.py: builds n-grams
  • Synonyms.py: finds synonyms of the words in a search query
  • project_params.py: sets global variables for the project (I/O directories, etc.)
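
As a rough illustration of two of the helpers listed above, the sketch below builds word n-grams and applies a dictionary of spelling corrections to a query. The function names, the word-level dictionary layout, and the sample entries are hypothetical; the actual contents of ngrams.py and spell_corr.py may differ.

```python
def word_ngrams(tokens, n):
    """Return all contiguous n-grams of a token list, joined by spaces."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical correction entries (misspelling -> correction), for illustration only.
SAMPLE_CORRECTIONS = {
    "toliet": "toilet",
    "vynal": "vinyl",
}


def correct_query(query, corrections):
    """Replace each whitespace-separated word that has a known correction."""
    return " ".join(corrections.get(word, word) for word in query.split())


fixed = correct_query("vynal toliet seat", SAMPLE_CORRECTIONS)
print(fixed)
print(word_ngrams(fixed.split(), 2))
```

Correcting the query before building n-grams means that misspelled queries produce the same n-grams as correctly spelled ones, so they can match product text.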

Preparation:

In project_params.py, set the following variables to the correct paths:

 - ``input_root_dir``: the path to the original .csv files on your system

 - ``output_root_dir``: a path with a few GB of disk space available for I/O
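
A minimal sketch of what these two settings might look like, together with an early sanity check. The paths are placeholders and the `prepare_dirs` helper is hypothetical; neither is taken from the repository's project_params.py.

```python
import os

# Placeholder paths: replace them with real locations on your machine.
input_root_dir = os.path.expanduser("~/data/home_depot/input")
output_root_dir = os.path.expanduser("~/data/home_depot/output")


def prepare_dirs(input_dir, output_dir):
    """Hypothetical helper: fail early if the input is missing, create the output."""
    if not os.path.isdir(input_dir):
        raise FileNotFoundError(f"input_root_dir does not exist: {input_dir}")
    os.makedirs(output_dir, exist_ok=True)
    return input_dir, output_dir
```

Calling `prepare_dirs(input_root_dir, output_root_dir)` before running the pipeline surfaces a wrong path immediately rather than partway through feature building.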

Requirements:

Get all Requirements

To perform all of the setup steps below at once, you can simply run:

``make install``

Packages

To install the required packages, run:

``pip install -r requirements.txt``

The project needs the following Python packages:

  • numpy
  • pandas
  • scikit-learn
  • nltk

You may also need to install the NLTK data:

  • in Python:

    >>> import nltk
    >>> nltk.download()

  • or from the command line:

    python -m nltk.downloader all

  • or let the makefile take care of everything:

    make install

(In particular, nlp_utils uses the WordNet data.)

Data

The data can be downloaded from the competition's homepage [+].

Initial Test:

Run ``make testutils`` to check that everything works.

Note: This documentation and repository are under construction. I hope to clean up the repository and modify all of the scripts so that the pipeline works perfectly on a single machine.
