NLP-Disaster-Tweet-Classification

Overview

This project focuses on natural language processing (NLP), a branch of machine learning concerned with how computers learn and analyze human language. The Kaggle competition can be found here: https://www.kaggle.com/c/nlp-getting-started/overview and the data was downloaded from https://www.kaggle.com/c/nlp-getting-started/data

Twitter has become an important communication channel in times of emergency. The ubiquitousness of smartphones enables people to announce an emergency they’re observing in real time. Because of this, more agencies are interested in programmatically monitoring Twitter (e.g. disaster relief organizations and news agencies). For this challenge, we are building a machine learning model that predicts which Tweets are about real disasters and which ones aren’t.

Initial findings in the data:

  • We can see from the initial training data EDA that there are 5 columns and 7,613 observations
  • It looks like 'text' (the model input) and 'target' (the label) are the columns we'll focus on, and we can probably drop the other 3. There were also several null values in both 'keyword' and 'location'.
  • Of the 7,613 'text' observations, 7,503 are unique, so there must be 110 duplicates. We can take care of that by dropping them.
  • More than half of the training tweets are NOT true disaster tweets
  • We also have a test set of 4 columns (the same columns minus 'target') and 3,263 observations

image

After removing duplicates from the 'text' column of the tweet dataframe, about 57% of the tweets are labeled 0 (non-disaster) and about 42% are labeled 1 (disaster).
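The checks described above can be sketched with pandas. This is a minimal illustration on a tiny made-up dataframe, not the project's actual notebook code: the rows are hypothetical, and the column names ('id', 'keyword', 'location', 'text', 'target') are assumed to match the Kaggle training file.

```python
import pandas as pd

# Hypothetical stand-in for the Kaggle training file (5 columns).
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "keyword": [None, None, "wildfire", "flood", None],
    "location": [None, "Denver", None, None, "Austin"],
    "text": [
        "Wildfire spreading near the highway",
        "I love this sunny weather",
        "Wildfire spreading near the highway",  # duplicate of row 1
        "Flood warning issued for the valley",
        "This new album is fire",
    ],
    "target": [1, 0, 1, 1, 0],
})

# Shape and null counts per column.
print(df.shape)
print(df.isnull().sum())

# Number of duplicated 'text' values = total rows - unique texts.
n_duplicates = len(df) - df["text"].nunique()
print("duplicate texts:", n_duplicates)

# Keep the first occurrence of each tweet text.
deduped = df.drop_duplicates(subset="text")

# Share of each class after deduplication (0 = non-disaster, 1 = disaster).
class_share = deduped["target"].value_counts(normalize=True)
print(class_share)
```

On the real training data, the same calls would report the column count, the null values in 'keyword' and 'location', and the class balance after duplicates are dropped.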

Text Cleaning & Preprocessing

In natural language processing, text needs to be cleaned and preprocessed before being fed into a model. Removing punctuation and stop words (such as 'I', 'is', 'the', etc.), setting text to lowercase, applying Porter stemming (reducing each word to its root by stripping suffixes such as tense endings), and tokenizing or using a CountVectorizer all help the model process and analyze text more easily.

I created the clean_text() function to take in each line of text, set it to lowercase, remove punctuation, remove English stopwords, stem the words, and then join them back into a string for the return value.
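The original function body is not reproduced here, so the following is only a sketch of what a clean_text() along those lines might look like. It assumes NLTK's PorterStemmer for stemming and, to stay self-contained, uses a small hand-written stopword set (STOP_WORDS is a hypothetical name) in place of a full English stopword list.

```python
import string

from nltk.stem import PorterStemmer

# Small illustrative stopword set (an assumption); a real run would
# likely use a complete English stopword list instead.
STOP_WORDS = {"i", "is", "the", "a", "an", "and", "are", "of", "to", "in"}

stemmer = PorterStemmer()

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_text(text: str) -> str:
    """Lowercase, strip punctuation, drop stopwords, stem, and rejoin."""
    text = text.lower()
    text = text.translate(PUNCT_TABLE)
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    stems = [stemmer.stem(tok) for tok in tokens]
    return " ".join(stems)


cleaned = clean_text("Forest fires are burning near the town!!!")
print(cleaned)
```

With a dataframe, a function like this would typically be applied row by row, for example with the pandas Series.apply method on the 'text' column.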

Model Architecture

First, we will split the training data into an 80/20 training/test split using sklearn's train_test_split. For the model, I'll use sklearn's LogisticRegression, since we are predicting a binary output (0 for a non-disaster tweet, 1 for a disaster tweet).
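A minimal sketch of this setup is shown below. The toy tweets and labels are made up for illustration, and chaining CountVectorizer with LogisticRegression in a pipeline, along with the random_state, stratify, and max_iter settings, are assumptions rather than the project's confirmed configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = disaster tweet, 0 = non-disaster tweet.
texts = [
    "wildfire spreading near the highway",
    "flood warning issued for the valley",
    "earthquake shakes downtown buildings",
    "tornado destroys several homes",
    "hurricane makes landfall tonight",
    "i love this sunny weather",
    "this new album is fire",
    "my exam was a total disaster lol",
    "traffic was a nightmare today",
    "best pizza in town",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# 80/20 split; stratify keeps the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Bag-of-words counts feed a logistic regression classifier.
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print(predictions)
```

Wrapping the vectorizer and classifier in one pipeline means the vocabulary is learned from the training portion only, which avoids leaking information from the held-out tweets.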

Model Evaluation

For the Kaggle competition, submissions are scored with the F1 score, the harmonic mean of precision and recall, computed between the predicted values and the true values.
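As a small illustration of how this metric is computed, the sketch below uses sklearn's f1_score and confusion_matrix on made-up labels (y_true and y_pred here are hypothetical, not the project's results). Note that for binary labels sklearn lays the confusion matrix out as [[TN, FP], [FN, TP]].

```python
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical true and predicted labels (1 = disaster, 0 = non-disaster).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# F1 from the library and from the definition should agree.
f1 = f1_score(y_true, y_pred)
f1_manual = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)
print(f1, f1_manual)
```

Because F1 ignores true negatives, it is less flattering than plain accuracy when the negative class is the majority, as it is in this dataset.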

image

Findings and Conclusion

After making predictions on the X_test data and comparing them to the y_test 'target' data, the F1 score for the model was about 73%. Not bad, but it could probably improve with additional text cleaning or perhaps a different classification model. We can also see in the confusion matrix above that the model correctly predicted 433 true positives (disaster tweets) and 755 true negatives (non-disaster tweets). For the incorrect predictions, there were 194 false negatives and 119 false positives. These counts are consistent with the reported score: F1 = 2(433) / (2(433) + 119 + 194) ≈ 0.73.
