
nlp_with_disaster_tweets

This repository contains code comparing Latent Semantic Indexing (LSI) with a baseline model without dimension reduction and with two supervised learning methods that combine LSI with Partial Least Squares: Semantic Indexing based on Partial Least Squares (SIPLS) and Local Semantic Indexing based on Partial Least Squares (LSIPLS). We compare these dimension reduction techniques on a binary classification problem posed on kaggle.com: "Real or Not? NLP with Disaster Tweets" (https://www.kaggle.com/c/nlp-getting-started/overview). Support Vector Machines are used as classifiers (except for SIPLS, which has its own classification method).
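To make the setup concrete, here is a minimal sketch (not the repository's actual code) of the LSI-plus-SVM combination described above, using scikit-learn: TruncatedSVD applied to a TF-IDF matrix is the standard way to compute LSI, and the toy tweets and labels are purely illustrative.

```python
# Illustrative sketch of LSI + SVC on toy "disaster tweet" data.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import SVC

# Tiny made-up corpus: 1 = real disaster, 0 = not a disaster.
texts = [
    "Forest fire near La Ronge",
    "I love fruits",
    "Heard about the earthquake",
    "My car is so fast",
]
labels = [1, 0, 1, 0]

lsi_svc = Pipeline([
    ("tfidf", TfidfVectorizer()),            # term-document matrix
    ("lsi", TruncatedSVD(n_components=2)),   # dimension of the projection space
    ("clf", SVC(kernel="linear")),           # SVM classifier on the LSI features
])
lsi_svc.fit(texts, labels)
print(lsi_svc.predict(["Wildfire spreading fast"]))
```

In the real pipeline the projection dimension (`n_components`) is the quantity being varied; the same pipeline shape applies to the LSIPLS variant, while SIPLS replaces the SVC with its own classification rule.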

To download the corresponding training and test sets, it is necessary to possess a Kaggle account and to agree to the competition rules. Please do so and download the data into a subdirectory called "data". Some participants discovered that the ground truth of the test set was openly available; you can download it, for example, from notebooks that use the original labels of the test set (https://www.kaggle.com/szelee/a-real-disaster-leaked-label). Please place this file in your data folder as well. It should be named submission.csv.

The output of "main.py" will be the training and test scores of each model as a function of the dimension of the space into which the data is projected by LSI, SIPLS or LSIPLS, the hyperparameters chosen by GridSearchCV for the corresponding model, and a plot of those scores. Feel free to use and modify this code in any way you need.

Some constants are defined in global_parameters.py. If, for example, you want to change the maximum dimension of the projection space for which the models are computed, you can do so by changing MAX_DIM. The SVC hyperparameters over which GridSearchCV searches for the best-suited ones for every model that uses SVC for classification can also be found there. If you want to look at the scores and best parameters for the models projecting into spaces of dimensions 1 to 15, you can download the pickle files in which this information is saved from https://www.dropbox.com/s/yescvgmh9hzngcg/Score%20and%20parameter%20files%20for%20dimensions%201%2C...%2C15.rar?dl=0. Save them in your data folder and execute main.py.
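Inspecting the downloaded pickle files is a plain pickle.load; the actual file names and record layout come from the repository's output and are not documented here, so the names and values below are hypothetical, shown as a write-then-read round trip only to demonstrate the pattern.

```python
# Hypothetical sketch: saving and re-loading a score/parameter record with
# pickle. In practice you would open one of the downloaded files from the
# "data" folder instead; its name and structure may differ.
import pickle
from pathlib import Path

record = {"dim": 5, "test_score": 0.79, "best_params": {"C": 1}}  # made-up values

data_dir = Path("data")
data_dir.mkdir(exist_ok=True)
pkl_path = data_dir / "scores_example.pkl"   # hypothetical file name

with open(pkl_path, "wb") as f:
    pickle.dump(record, f)

with open(pkl_path, "rb") as f:
    loaded = pickle.load(f)

print(loaded["dim"], loaded["test_score"], loaded["best_params"])
```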
