The goal of this project is to determine the main factors behind the gap between movie critic reviews and moviegoer ratings. The analysis uses movie characteristics such as genre, synopsis, actors, and runtime, scraped from Rotten Tomatoes. We use the spaCy NLP library to analyze the text of each movie synopsis. With the desired features in hand, we use the scikit-learn machine learning library to model the gap between audience and critic scores. In addition, the scraped movie synopses give us a unique opportunity to use the OpenAI GPT-2 language model to generate synopses for invented movie titles.
- Trung Bui
- Minh Hua
- Quang Pham
Who do you know that didn't love Forrest Gump? Is this the critics "hating the hype"?
Believe the hype!
pandas library
spaCy NLP library
OpenAI GPT-2 https://github.com/openai/gpt-2
Objective: Determine the factors contributing to the gap between audience and critic scores. First, data on movie characteristics is scraped from Rotten Tomatoes. Next, we use pandas to generate important features on studio, actors, etc. Then, we use spaCy and other NLP libraries to analyze the movie synopses. Finally, using scikit-learn, we determine the leading factors behind the gap from the given data.
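The pandas and scikit-learn steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the rows are toy values invented for the example, the column names (`critic_score`, `audience_score`, `genre`, `runtime`) are hypothetical, and a random forest is just one plausible choice of model for ranking features. The scraping and spaCy steps are left out.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy rows invented purely for illustration; the real data is scraped
# from Rotten Tomatoes. Column names here are hypothetical.
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D", "Movie E", "Movie F"],
    "genre": ["Drama", "Comedy", "Horror", "Drama", "Comedy", "Horror"],
    "runtime": [142, 95, 88, 120, 101, 92],
    "critic_score": [71, 45, 30, 90, 60, 55],
    "audience_score": [95, 70, 35, 85, 50, 40],
})

# Target: a positive gap means audiences rated the movie higher than critics.
movies["gap"] = movies["audience_score"] - movies["critic_score"]

# One-hot encode the categorical genre column and keep runtime as numeric.
features = pd.get_dummies(movies[["genre", "runtime"]], columns=["genre"]).astype(float)

# Fit a tree ensemble (an assumed model choice) and read off feature importances.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(features, movies["gap"])

importances = pd.Series(
    model.feature_importances_, index=features.columns
).sort_values(ascending=False)
print(importances)
```

With real data, the same pattern would extend to studio, actor, and synopsis-derived features; importances from six toy rows carry no meaning on their own.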
Bonus: Using the existing synopses and titles, we generate movie synopses based on an invented title and vice versa.
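One way to prepare the scraped titles and synopses for GPT-2 is to write each pair as a short labeled text record, in either order depending on which direction we want to generate. The sketch below is an assumption about how that could look, not the project's actual preprocessing: the `TITLE:` and `SYNOPSIS:` labels and the helper names `format_example` and `build_prompt` are hypothetical. Only `<|endoftext|>` comes from GPT-2 itself, where it serves as the document separator token.

```python
END_OF_TEXT = "<|endoftext|>"  # GPT-2's document separator token


def format_example(title: str, synopsis: str, title_first: bool = True) -> str:
    """Format one (title, synopsis) pair as a training record.

    title_first=True  -> model learns to continue a title with a synopsis.
    title_first=False -> model learns to continue a synopsis with a title.
    The TITLE:/SYNOPSIS: labels are a hypothetical convention.
    """
    title_part = f"TITLE: {title.strip()}"
    synopsis_part = f"SYNOPSIS: {synopsis.strip()}"
    parts = [title_part, synopsis_part] if title_first else [synopsis_part, title_part]
    return "\n".join(parts) + "\n" + END_OF_TEXT


def build_prompt(title: str) -> str:
    """Build a generation prompt for an invented title (title -> synopsis)."""
    return f"TITLE: {title.strip()}\nSYNOPSIS:"


# Example with a made-up title and synopsis.
record = format_example("The Last Lighthouse", "A keeper guards a light no ship can see.")
prompt = build_prompt("Midnight at the Laundromat")
print(record)
print(prompt)
```

Records built this way could be concatenated into a single text file for fine-tuning, and a prompt from `build_prompt` would then be given to the fine-tuned model to complete.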
This is a scatterplot of data generated from a day's worth of images taken between 10:00 and 22:00 (10:00 am to 10:00 pm):
A histogram gives a cleaner view of the same data, with time as the binned variable and the number of people as weights:
- A method to track individuals and tell who is leaving or entering, in order to improve data accuracy (foveated vision)
- Use scraped data on movie and movie reviews to generate movie reviews
- Improve the data sample to include writers, actors' previous performance, budget and box office, etc.