Text Classification of Movie Plot Summaries to Predict Movie Genres

Business and Data Understanding

Movies are one of the most popular forms of entertainment, and large volumes of movie data are generated and shared on the internet every second. The genre of a movie can often be inferred from its synopsis. This project treats genre prediction as a multiclass text classification problem: movie plot summaries are used to predict movie genres. NLP preprocessing techniques were applied to the summaries to improve predictive capability. The dataset, from Kaggle, contains about 34.8k plot summaries scraped from Wikipedia. In this project, only movies that were assigned a single genre were used.

Columns:

  • Release Year - year of release
  • Title - title of the movies
  • Origin/Ethnicity - country of origin of the movies
  • Director - director names associated with the movies
  • Cast - cast names associated with the movies
  • Genre - genre of the movies (the prediction target)
  • Wiki Page - Wikipedia page of the movies
  • Plot - plot summary of the movies

Preprocessing

  • The data was subset to include only the movies that have one genre.
  • I label-encoded the genres.
  • The 'Plot' column was then cleaned using NLP preprocessing techniques: tokenization, lowercasing, removal of stopwords, punctuation, and digits, and lemmatization.
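
The steps above can be sketched roughly as follows. This is a minimal, standard-library illustration and not the notebook's code: the stopword set is a small hypothetical sample, the single-genre check assumes that multi-genre labels contain a separator such as a comma or slash, and lemmatization is left out because it needs an external library such as NLTK or spaCy.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real run would use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "his", "her"}


def has_single_genre(genre):
    """Assumption: labels naming several genres contain ',', '/' or '&'."""
    genre = genre.strip()
    return bool(genre) and re.search(r"[,/&]", genre) is None


def label_encode(genres):
    """Map each distinct genre to an integer, assigned in sorted order."""
    mapping = {g: i for i, g in enumerate(sorted(set(genres)))}
    return [mapping[g] for g in genres], mapping


def clean_plot(text):
    """Lowercase, drop punctuation and digits, tokenize, remove stopwords."""
    text = text.lower()
    # Keep only ASCII letters and whitespace (drops punctuation and digits).
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()
    return [t for t in tokens if t not in STOPWORDS]
```

For example, `clean_plot("The 3 detectives chased the thief!")` yields the tokens `detectives`, `chased`, and `thief`.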

Exploratory Data Analysis

A frequency distribution of the genres showed that the most prevalent genre was 'drama', with 'comedy' following close behind. The least common genre was 'romance'.

Figure: Distribution of the top six genres in the dataset
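
A genre frequency distribution of this kind can be computed with a simple counter. The sketch below is illustrative only; the function name and the sample labels are made up, and the default of six mirrors the six genres shown in the figure.

```python
from collections import Counter


def genre_distribution(genres, top_n=6):
    """Return the top_n (genre, count) pairs, most frequent first."""
    return Counter(genres).most_common(top_n)


# Made-up labels, not the project's data.
sample = ["drama", "comedy", "drama", "horror", "comedy", "drama"]
distribution = genre_distribution(sample)
```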

The visuals below show the word frequency distribution of plot summaries per genre. Quite a few words occurred frequently across several genres.

Figures: Plot word frequencies for drama, comedy, horror, action, thriller, and romance
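
One way to produce per-genre word frequencies, and to check which frequent words are shared across genres, is sketched below. It assumes the plots have already been tokenized; the function names and the `top_n` cutoff are hypothetical and not taken from the notebook.

```python
from collections import Counter, defaultdict


def word_frequencies_by_genre(token_lists, genres):
    """Accumulate one word Counter per genre from tokenized plots."""
    freqs = defaultdict(Counter)
    for tokens, genre in zip(token_lists, genres):
        freqs[genre].update(tokens)
    return dict(freqs)


def shared_top_words(freqs, top_n=20):
    """Words that appear among the top_n most frequent words of every genre."""
    top_sets = [{w for w, _ in c.most_common(top_n)} for c in freqs.values()]
    return set.intersection(*top_sets) if top_sets else set()
```

A non-empty result from `shared_top_words` is a quick signal that the genres share frequent vocabulary.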

Modeling and Results

The metric used in this project is the accuracy score.

The first simple model, which always predicted the most frequent class, achieved an accuracy score of 41% and served as the baseline.
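
A most-frequent-class baseline of this kind can be written in a few lines. The sketch below is an illustration with a hypothetical function name; it is not the notebook's implementation.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Predict the most frequent training label for every test item."""
    majority, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority)
    return majority, hits / len(test_labels)
```

The accuracy of such a baseline equals the share of the majority class in the evaluated labels, which is why it is a natural floor for the other models.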

Grid search was run on both tfidf and count_vectorizer transformed data:

Logistic regression was among the best-performing models in the grid search and performed better than the alternatives on the test set.

The best model was logistic regression on tfidf-transformed summaries, with the following parameters:

  • C = 1.0
  • class_weight = 'balanced'
  • multi_class = 'ovr'
  • penalty = 'l2'
  • solver = 'liblinear'
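
A configuration along these lines could be set up in scikit-learn roughly as shown below. This is a sketch under stated assumptions, not the notebook's code: the toy plots, the grid values for `C`, and the `build_search` helper are hypothetical. The one-vs-rest setting is expressed with `OneVsRestClassifier` because the `multi_class` argument is deprecated in recent scikit-learn releases, so the class weighting only approximates the original setup. `penalty='l2'` is scikit-learn's default and is left implicit.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline


def build_search(cv=2):
    """Grid search over the vectorizer choice and C (hypothetical grid)."""
    base = LogisticRegression(C=1.0, class_weight="balanced", solver="liblinear")
    pipeline = Pipeline([
        ("vect", TfidfVectorizer()),
        ("clf", OneVsRestClassifier(base)),  # one-vs-rest, as in multi_class='ovr'
    ])
    param_grid = {
        "vect": [TfidfVectorizer(), CountVectorizer()],
        "clf__estimator__C": [0.1, 1.0, 10.0],
    }
    return GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=cv)


# Made-up toy corpus, only to show that the pipeline fits and predicts.
plots = [
    "a family struggles with loss and grief",
    "a widow rebuilds her life after tragedy",
    "two brothers reconcile after years of silence",
    "a teacher fights for her students and family",
    "a clumsy waiter ruins a wedding with jokes",
    "friends plan a prank that goes wrong",
    "a bumbling detective trips over every clue",
    "roommates compete in a silly prank contest",
    "a ghost haunts an old house at night",
    "a killer stalks teenagers in the woods",
    "a cursed doll terrorizes a house",
    "vampires hunt a village at night",
]
genres = ["drama"] * 4 + ["comedy"] * 4 + ["horror"] * 4

search = build_search().fit(plots, genres)
predictions = search.predict(["a ghost stalks the old house"])
```

After fitting, `search.best_params_` reports which vectorizer and `C` value scored best under cross-validation.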

Figure: Comparison of model scores

Testing Result

I applied the model to the test data and achieved an accuracy score of 64%, an increase of 23 percentage points over the 41% baseline.

Figure: Confusion matrix of test predictions vs. actual
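
For reference, accuracy and confusion-matrix counts can be derived from paired true and predicted labels as in the sketch below. The helper names are hypothetical, and the example is standard-library only.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def confusion_counts(y_true, y_pred):
    """Count each (actual, predicted) pair; off-diagonal pairs are errors."""
    return Counter(zip(y_true, y_pred))
```

Each `(actual, predicted)` key with differing labels shows which genre a misclassified plot was mistaken for.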

Cosine Similarity of the genres using words in their plot summaries

This showed that the plot words in each genre were highly similar to those in the other genres.

Figure: Cosine similarity of the genres using plot words
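
Cosine similarity between two genres can be illustrated with per-genre word counts, as in the sketch below. It assumes raw count vectors; the notebook may have used a different representation (for example tfidf weights), and the function name is hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

Values near 1 indicate that two genres use nearly the same words in similar proportions, while 0 means they share no words.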

Conclusion

  • The confusion matrix above confirms what I suspected when looking at the most common words among the genres.
  • Because some genres shared many words with other genres, their false predictions were mostly assigned to the genres they shared words with.
  • The decision tree models performed the least favorably.

Navigation

├── data                                                <- folder containing data
│   └── wiki_movie_plots_deduped.csv                    <- contains movie plot data scraped from Wikipedia
├── images                                              <- Visualizations used in README and pdf file
├── .gitignore                                          <- files to ignore
├── README.md                                           <- This README file
├── predict_movie_genre_from_plot_summary.ipynb         <- final notebook
└── predict_movie_genre_from_plot_summary.pdf           <- final presentation

Contact Information

With questions or feedback on this repository, please reach out via:

Contributors

wonuabimbola
