This project trains Logistic Regression and Bernoulli Naive Bayes models on several different training datasets to perform sentiment analysis.
Five training datasets were used to train the classification models: Sentiment140, Apple Twitter Sentiment, Twitter US Airline Sentiment, Depression Sentiment, and Russia invade tweets. The resulting models were then tested on the [Putin tweets] dataset to measure their accuracy in classifying tweets about Russian president Putin. ([Putin tweets] is provided with this project.)
Introduction: There are five .py files: preprocess.py, building_model.py, evaluatemodel.py, predicting.py, and analyzing.py. preprocess.py and evaluatemodel.py are helper files for building_model.py and predicting.py, and analyzing.py is for analyzing our datasets. How to use the [building_model.py]:
First, import the dataset. Then choose the appropriate commands and modify the parameters following the comments, based on the dataset you uploaded. The code will then preprocess the data, split it into train and test sets, and transform X_train into TF-IDF features. Afterward, it creates and evaluates a Bernoulli Naive Bayes model and a Logistic Regression model. Finally, you can save the vectorizer and models as pickle files.
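The pipeline above can be sketched as follows. This is a minimal illustration, not the project's actual code: the toy texts, labels, and pickle file names are assumptions.

```python
# Sketch of the building_model.py pipeline: split, TF-IDF, train two
# classifiers, pickle the artifacts. Toy data stands in for a real dataset.
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

# Toy stand-in for a preprocessed tweet dataset (1 = non-negative, 0 = negative).
texts = ["great flight today", "worst service ever", "love this airline",
         "terrible delay again", "happy with the crew", "awful experience"]
labels = [1, 0, 1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=42)

# Transform the text into TF-IDF features (fit only on the training split).
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train and evaluate both classifiers.
bnb = BernoulliNB().fit(X_train_tfidf, y_train)
lr = LogisticRegression().fit(X_train_tfidf, y_train)
print("BNB accuracy:", bnb.score(X_test_tfidf, y_test))
print("LR accuracy:", lr.score(X_test_tfidf, y_test))

# Save the vectorizer and models as pickle files.
for name, obj in [("vectorizer", vectorizer), ("bnb", bnb), ("lr", lr)]:
    with open(f"{name}.pkl", "wb") as f:
        pickle.dump(obj, f)
```

Note that the vectorizer must be pickled along with the models: at prediction time, new text has to be transformed with the same fitted vocabulary.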
How to use the [predicting.py]: First, load the vectorizer and models from the pickle files. Second, load the text and labels of the test dataset. Third, use the models to make predictions. Fourth, calculate the specificity scores and other metrics.
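These four steps can be sketched as below. The file names, the tiny inline model (pickled first so the loading step is runnable on its own), and the label convention (0 = negative) are all assumptions for illustration; specificity is computed from the confusion matrix as TN / (TN + FP).

```python
# Sketch of the predicting step: load pickled artifacts, predict, and compute
# specificity alongside accuracy. Not the project's actual code.
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Stand-in: pickle a small vectorizer/model so the loading step is runnable.
train_texts = ["good", "bad", "great", "awful"]
train_labels = [1, 0, 1, 0]
vec = TfidfVectorizer().fit(train_texts)
clf = LogisticRegression().fit(vec.transform(train_texts), train_labels)
with open("vectorizer.pkl", "wb") as f:
    pickle.dump(vec, f)
with open("model.pkl", "wb") as f:
    pickle.dump(clf, f)

# Steps 1-2: load the saved vectorizer/model and the test data.
with open("vectorizer.pkl", "rb") as f:
    vectorizer = pickle.load(f)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)
test_texts = ["good day", "bad day"]
test_labels = [1, 0]

# Step 3: make predictions.
y_pred = model.predict(vectorizer.transform(test_texts))

# Step 4: specificity = TN / (TN + FP), plus standard metrics.
tn, fp, fn, tp = confusion_matrix(test_labels, y_pred, labels=[0, 1]).ravel()
specificity = tn / (tn + fp)
print("accuracy:", accuracy_score(test_labels, y_pred))
print("specificity:", specificity)
```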
How to use the [analyzing.py]: The file has two functions. First, it can create the wordnet plot and list the top negative and non-negative words in a few datasets. Second, it can label a dataset using the VADER model.
- Models: in this task, Logistic Regression overall performed better than Naive Bayes. In the future, other Naive Bayes variants, such as Multinomial, could be explored.
- Data size: in this task, larger training datasets performed slightly better than smaller ones. To better understand the relationship between corpus size and performance, we could try more training datasets on the same topic but with different corpus sizes.
- Topic: In the future, more datasets on different topics could be explored, especially tweets about other controversial political figures.
- Label: The standard for negative versus non-negative content may vary from person to person. A better way to label the testing dataset would be to involve more people in labeling the data.