Rotten Sentiment with Tomatoes

Data Science Immersive Project Five: Sentiment Analysis on Rotten Tomatoes

Introduction

For Project Five in my Data Science Immersive course, I chose to take on the Kaggle competition for sentiment analysis of movie reviews on Rotten Tomatoes.

Rotten Tomatoes has become one of the go-to places for judging whether a movie is any good, based on reviews from professional critics. RT uses the Tomatometer, which represents the percentage of professional critic reviews that are positive for a given film.

Tomatometer

  • When at least 60% of reviews are positive, a red tomato is displayed to indicate its “Fresh” status.
  • Anything less than 60% and a green “splat” is displayed indicating its “Rotten” status.
  • Certified Fresh goes to movies/TV that meet certain strict requirements.
    • A steady Tomatometer score of 75% or higher.
    • At least five reviews from Top Critics.
    • Films in wide release must have a minimum of 80 reviews.
    • Films in limited release must have a minimum of 40 reviews.
    • Only individual seasons of a TV show are eligible, and each must have a minimum of 20 reviews.
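
The Fresh/Rotten rule above is a simple threshold, which can be sketched as follows. This is only an illustration of the 60% cutoff; the function name `tomatometer_status` is hypothetical, and the Certified Fresh requirements are not modeled.

```python
def tomatometer_status(positive_reviews: int, total_reviews: int) -> str:
    """Return "Fresh" or "Rotten" using the 60% threshold described above."""
    if total_reviews <= 0:
        raise ValueError("total_reviews must be positive")
    # Integer comparison avoids floating-point rounding at exactly 60%.
    if positive_reviews * 100 >= 60 * total_reviews:
        return "Fresh"
    return "Rotten"


print(tomatometer_status(60, 100))
print(tomatometer_status(59, 100))
```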

Audience Score

  • Full popcorn bucket - 3.5 stars or higher
  • Tipped over popcorn bucket - Less than 3.5 stars
  • Plus sign for movies that do not have audience ratings

Dataset

The datasets were available on Kaggle and were provided by Rotten Tomatoes. There is a Train set as well as a Test set. The data consists of phrases and sentiment IDs (positive, slightly positive, neutral, slightly negative, or negative).

Process

I used a function to clean the phrases in the Phrase column, removing English stop words and using the WordNet Lemmatizer in NLTK to group together inflections of a word for easier analysis.
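
A minimal sketch of such a cleaning step is below. It is not the project's actual function: the names `clean_phrase` and `make_nltk_cleaner` and the tokenizing regex are assumptions made for illustration. The NLTK pieces are imported lazily, since they need the "stopwords" and "wordnet" corpora to be downloaded first.

```python
import re
from typing import Callable, Iterable


def clean_phrase(
    phrase: str,
    stop_words: Iterable[str],
    lemmatize: Callable[[str], str],
) -> str:
    """Lowercase, tokenize, drop stop words, and lemmatize each remaining token."""
    stop_set = set(stop_words)
    # Assumed tokenization: runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", phrase.lower())
    kept = [lemmatize(token) for token in tokens if token not in stop_set]
    return " ".join(kept)


def make_nltk_cleaner() -> Callable[[str], str]:
    """Build a cleaner backed by NLTK's English stop words and WordNetLemmatizer."""
    # Imported here so clean_phrase itself works without NLTK installed.
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    stop_words = stopwords.words("english")
    lemmatizer = WordNetLemmatizer()
    return lambda phrase: clean_phrase(phrase, stop_words, lemmatizer.lemmatize)


# Toy run with a tiny stop list and an identity "lemmatizer" (both placeholders).
print(clean_phrase("The movie is a Delight", {"the", "is", "a"}, lambda word: word))
```

With NLTK and its corpora installed, the cleaner returned by make_nltk_cleaner() could be applied to each value in the Phrase column.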

Using CountVectorizer(), I had to set max_features to 2000, because anything larger caused memory errors on my laptop. The cap also kept my laptop from having issues with the train/test split and with running all the different models. Unfortunately, limiting the vocabulary with max_features means I will not get the best possible model for submitting to Kaggle.
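
As a rough illustration of the cap, the sketch below vectorizes a few made-up phrases (placeholders, not rows from the Kaggle data). max_features keeps only the most frequent terms, so the feature matrix never has more columns than the cap.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder phrases standing in for the cleaned Phrase column.
phrases = [
    "a series of escapades",
    "good for the goose",
    "the movie is good fun",
    "not a good movie",
]

# Same cap as in the project: at most 2000 vocabulary terms.
vectorizer = CountVectorizer(max_features=2000)
X = vectorizer.fit_transform(phrases)  # sparse matrix, one row per phrase
print(X.shape)

# A much smaller cap makes the truncation visible on toy data.
capped = CountVectorizer(max_features=3).fit(phrases)
print(sorted(capped.vocabulary_))
```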

I ran several different models on my final features, including Multinomial Naive Bayes, Bernoulli Naive Bayes, Random Forest, Extra Trees, Bagging Classifiers, and simple Logistic Regression.
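
A comparison like this can be sketched with scikit-learn as follows. The data here is random placeholder counts with five classes, not the real features, and the hyperparameters are arbitrary, so the printed accuracies mean nothing beyond showing the workflow.

```python
import numpy as np
from sklearn.ensemble import (
    BaggingClassifier,
    ExtraTreesClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Placeholder data: non-negative counts (like CountVectorizer output), 5 labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 20))
y = rng.integers(0, 5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "Multinomial Naive Bayes": MultinomialNB(),
    "Bernoulli Naive Bayes": BernoulliNB(),
    "Random Forest": RandomForestClassifier(n_estimators=20, random_state=0),
    "Extra Trees": ExtraTreesClassifier(n_estimators=20, random_state=0),
    "Bagging": BaggingClassifier(n_estimators=20, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# Fit each model and record its accuracy on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```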

Results

The best model ended up being Random Forest, with an accuracy score of 61.3%. To find the best parameters, I initially ran a GridSearch over several parameters, which took over 14 hours to complete on my desktop at home. I did the same with Logistic Regression. Due to time constraints, I couldn't run GridSearches for all the other classifiers that I used.
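
For reference, a grid search of this kind might look like the sketch below. The parameter grid is hypothetical (the grid actually searched is not recorded here) and the data is random placeholder counts. The number of fits grows as the product of the grid sizes times the number of folds, which is why a large grid on the full feature set can take many hours.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder data: random counts with 5 sentiment-like labels.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(120, 10))
y = rng.integers(0, 5, size=120)

# Hypothetical grid: 2 x 2 = 4 candidates, each fit on 3 folds.
param_grid = {
    "n_estimators": [10, 20],
    "max_depth": [None, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```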

Insert image here with graph on different model results and their Kaggle scores.

While the Random Forest Classifier had the best overall accuracy score at 61.3%, it ended up with the lowest Kaggle Score, 44.9%, when entered into the competition. Conversely, the Multinomial Naive Bayes model had the lowest accuracy score at 56.7%, but the highest Kaggle Score among my models at 48.7%. As you can see, these are not very good results.

Lessons Learned

After the course is completed, I definitely want to come back to this project to get better results. There are several things to try:

  • Neural Networks: I initially tried this but was having issues with the Test Loss associated with my models. The NN also took a very long time to run, so it would be easier on a better computer.
  • Amazon Web Services: Since the resultant dataset was so large (and would be much larger if I tried to use the n-grams parameter), I could only use a small percentage of the features. Running the data and models on an AWS instance would be more practical for this competition.
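
To see why n-grams make the feature set larger, the sketch below compares vocabulary sizes on a few made-up phrases (placeholders, not the Kaggle data). It assumes the n-grams setting refers to CountVectorizer's ngram_range parameter; with (1, 2), every distinct pair of adjacent words becomes a feature in addition to the single words.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Placeholder phrases standing in for the cleaned Phrase column.
phrases = [
    "the movie is good fun",
    "not a good movie",
    "good for the goose",
]

# Single words only.
unigram_vocab = CountVectorizer(ngram_range=(1, 1)).fit(phrases).vocabulary_
# Single words plus pairs of adjacent words.
bigram_vocab = CountVectorizer(ngram_range=(1, 2)).fit(phrases).vocabulary_

print(len(unigram_vocab), len(bigram_vocab))
```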

Resources

Kaggle
Rotten Tomatoes
