TextClassification

This repo contains an IPython notebook that implements several classifiers for text classification.

Method

Importing Libraries:

First, import all the required libraries.

Importing Data:

The given data file is loaded using pandas read_csv, and its two columns are named desc and label.
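As a rough sketch of this loading step, the snippet below reads a two-column CSV and assigns the names desc and label. The in-memory sample rows and the header=None setting are assumptions made for illustration; the actual file name and layout are defined in the notebook.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for the real input data file.
sample_csv = io.StringIO(
    "The battery lasts all day,positive\n"
    "Screen cracked within a week,negative\n"
)

# header=None is an assumption: it treats the first row as data, not as column names.
df = pd.read_csv(sample_csv, header=None, names=["desc", "label"])

print(df.columns.tolist())
print(len(df))
```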

Data Preparation:

The text needs to be preprocessed before going further. It may contain numbers, special characters, unwanted spaces and stop words. None of these are needed for this problem, and they add noise and complexity. Hence, we remove all the special characters, unwanted spaces and stop words from the text.
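One possible way to do this cleaning with only the standard library is sketched below. The clean_text helper and the tiny STOP_WORDS set are hypothetical; the notebook may use a fuller stop-word list or a different order of steps. Dropping digits along with special characters is also an assumption.

```python
import re

# Tiny illustrative stop-word set (hypothetical); a real run would likely use a fuller list.
STOP_WORDS = {"a", "an", "the", "is", "are", "and", "or", "of", "to", "in", "it"}


def clean_text(text: str) -> str:
    """Lowercase, drop non-letter characters, collapse spaces, remove stop words."""
    text = text.lower()
    # Replace digits and special characters with spaces (assumed behaviour).
    text = re.sub(r"[^a-z\s]", " ", text)
    # split() with no arguments also collapses repeated whitespace.
    tokens = text.split()
    return " ".join(token for token in tokens if token not in STOP_WORDS)


print(clean_text("The   price is $20, and the quality is GREAT!!"))
```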

The data is then split into a training set and a test set.
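A minimal sketch of such a split, assuming scikit-learn is available (the README does not name the library used). The toy texts, the 25% test size and the fixed random_state are illustrative choices, not values taken from the notebook.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: eight short texts with two balanced labels.
texts = [
    "great phone", "love this laptop", "excellent camera", "works perfectly",
    "terrible battery", "broke after a day", "awful screen", "waste of money",
]
labels = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"]

# stratify keeps the label proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

print(len(X_train), len(X_test))
```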

Feature Engineering:

Raw text data needs to be transformed into feature vectors. The following methods will be applied to obtain relevant features from our dataset.

  • Count Vectors features
  • TF-IDF Vectors features
    • Word level
    • N-Gram level
    • Character level
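The feature types listed above could be built roughly as follows, assuming scikit-learn's CountVectorizer and TfidfVectorizer. The corpus, the dictionary keys and the n-gram ranges are assumptions for illustration; the notebook's actual settings may differ.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical mini corpus standing in for the cleaned desc column.
corpus = [
    "cheap flights to paris",
    "cheap hotel deals in paris",
    "python list comprehension tutorial",
]

vectorizers = {
    # Raw term counts per document.
    "count": CountVectorizer(),
    # TF-IDF weights over single words.
    "tfidf_word": TfidfVectorizer(analyzer="word"),
    # TF-IDF weights over word 2-grams and 3-grams (assumed range).
    "tfidf_ngram": TfidfVectorizer(analyzer="word", ngram_range=(2, 3)),
    # TF-IDF weights over character 2-grams and 3-grams (assumed range).
    "tfidf_char": TfidfVectorizer(analyzer="char", ngram_range=(2, 3)),
}

# Each value is a sparse matrix with one row per document.
features = {name: vec.fit_transform(corpus) for name, vec in vectorizers.items()}

for name, matrix in features.items():
    print(name, matrix.shape)
```

In practice each vectorizer would be fitted on the training set only and then applied to the test set with transform, so that no test-set vocabulary leaks into the features.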

Model Building:

The final step in the text classification framework is to train a classifier on the features created. Many machine learning models can be used to classify text. We implement the following baseline classifiers:

  • Naive Bayes Classifier (Has proven very effective for text classification on small, simple datasets)
  • Logistic Regression (Works well on small datasets that are linearly separable. Very high-dimensional data tends to be linearly separable, so it works well with text data)
  • Random Forest (Handles high-dimensional and sparse data well)
  • Extreme Gradient Boosting (Popular, fast and accurate. Used for the purpose of comparison)
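A minimal sketch of training several of these baselines on the same features, assuming scikit-learn. The toy data, hyperparameters and dictionary keys are hypothetical. The gradient boosting model is left out here because it typically comes from the separate xgboost package, which this sketch does not assume is installed.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; the notebook trains on the desc and label columns.
train_texts = [
    "great phone", "love this laptop", "excellent camera", "works perfectly",
    "terrible battery", "broke after a day", "awful screen", "waste of money",
]
train_labels = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"]
test_texts = ["excellent laptop", "awful battery"]

models = {
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

predictions = {}
for name, model in models.items():
    # Each pipeline vectorizes the text with word-level TF-IDF, then fits the classifier.
    pipeline = make_pipeline(TfidfVectorizer(), model)
    pipeline.fit(train_texts, train_labels)
    predictions[name] = list(pipeline.predict(test_texts))
    print(name, predictions[name])
```

Wrapping the vectorizer and the classifier in one pipeline keeps the comparison fair, since every model sees features built the same way.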

How to run

  • Download the repository. It contains the input data file and required trained model files along with the source code.

  • cd source/

  • jupyter notebook (This launches the jupyter notebook)

  • Now open the TextClassification.ipynb file and click Run All. This shows the test-set performance for each combination of feature type and model. At the end, the notebook asks you to enter a text to classify.
