This project trains Logistic Regression and Bernoulli Naive Bayes models on several different training datasets to perform sentiment analysis.
Five training datasets were used to train the classification models: Sentiment140, Apple Twitter Sentiment, Twitter US Airline Sentiment, Depression Sentiment, and Russia invade tweets. The resulting models were then tested on the [Putin tweets] dataset to measure their accuracy in classifying tweets about Russian president Putin. ([Putin tweets] is provided with this project.)
Introduction: There are five .py files: preprocess.py, building_model.py, evaluatemodel.py, predicting.py, and analyzing.py. preprocess.py and evaluatemodel.py are helper files for building_model.py and predicting.py, and analyzing.py is for analyzing our datasets. How to use the [building_model.py]:
First, import the dataset. Then choose the appropriate commands and modify the parameters following the comments, based on the dataset you uploaded. The code will then preprocess the data, split it into train and test sets, and transform X_train into TF-IDF features. Afterward, it creates and evaluates a Bernoulli Naive Bayes model and a Logistic Regression model. Finally, you can save the vectorizer and models as pickle files.
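The pipeline above can be sketched as follows. This is a minimal illustration, not the project's actual code: the toy texts, labels, and pickle file names are assumptions.

```python
# Sketch of the building_model.py pipeline: split, TF-IDF, train two
# classifiers, pickle the artifacts. Toy data stands in for a real dataset.
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

# Toy stand-in for a preprocessed tweet dataset (1 = non-negative, 0 = negative).
texts = ["great flight today", "worst service ever", "love this airline",
         "terrible delay again", "happy with the crew", "awful experience"]
labels = [1, 0, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=42)

# Transform the text into TF-IDF features (fit only on the training split).
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train and evaluate both classifiers.
bnb = BernoulliNB().fit(X_train_tfidf, y_train)
lr = LogisticRegression().fit(X_train_tfidf, y_train)
print("BNB accuracy:", bnb.score(X_test_tfidf, y_test))
print("LR accuracy:", lr.score(X_test_tfidf, y_test))

# Save the vectorizer and models as pickle files.
for name, obj in [("vectorizer", vectorizer), ("bnb", bnb), ("lr", lr)]:
    with open(f"{name}.pkl", "wb") as f:
        pickle.dump(obj, f)
```

Note that the vectorizer must be pickled along with the models: at prediction time, new text has to be transformed with the same fitted vocabulary.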
How to use the [predicting.py]: First, load the vectorizer and models from the pickle files. Second, load the text and labels of the test dataset. Third, use the models to make predictions. Fourth, calculate the specificity scores and other metrics.
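These four steps can be sketched as below. The file names, the tiny inline model (pickled first so the loading step is runnable on its own), and the label convention (0 = negative) are all assumptions for illustration; specificity is computed from the confusion matrix as TN / (TN + FP).

```python
# Sketch of the predicting step: load pickled artifacts, predict, and compute
# specificity alongside accuracy. Not the project's actual code.
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Stand-in: pickle a small vectorizer/model so the loading step is runnable.
train_texts = ["good", "bad", "great", "awful"]
train_labels = [1, 0, 1, 0]
vec = TfidfVectorizer().fit(train_texts)
clf = LogisticRegression().fit(vec.transform(train_texts), train_labels)
with open("vectorizer.pkl", "wb") as f:
    pickle.dump(vec, f)
with open("model.pkl", "wb") as f:
    pickle.dump(clf, f)

# Steps 1-2: load the saved vectorizer/model and the test data.
with open("vectorizer.pkl", "rb") as f:
    vectorizer = pickle.load(f)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)
test_texts = ["good day", "bad day"]
test_labels = [1, 0]

# Step 3: make predictions.
y_pred = model.predict(vectorizer.transform(test_texts))

# Step 4: specificity = TN / (TN + FP), plus standard metrics.
tn, fp, fn, tp = confusion_matrix(test_labels, y_pred, labels=[0, 1]).ravel()
specificity = tn / (tn + fp)
print("accuracy:", accuracy_score(test_labels, y_pred))
print("specificity:", specificity)
```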
How to use the [analyzing.py]: The file has two functions. First, it can create the wordnet plot and list the top negative and non-negative words in a few datasets. Second, it can label a dataset using the VADER model.
- Models: in this task, Logistic Regression overall performed better than Naive Bayes. In the future, other Naive Bayes variants, such as Multinomial, could be explored.
- Data size: in this task, larger training datasets performed slightly better than smaller ones. To better understand the relationship between corpus size and performance, we could try more training datasets on the same topic but with different corpus sizes.
- Topic: In the future, more datasets on different topics could be explored, especially tweets about other controversial political figures.
- Label: The standard for negative versus non-negative content may vary from person to person. A better way to label the testing dataset would be to involve more people in labeling the data.