marvel-dialogue-nlp

A machine learning project that will use Natural Language Processing (NLP) to classify who says a line of dialogue

Streamlit App

To view an interactive summary of this project, see its Streamlit app.

About the Project

With over 21 different movies, all spanning the same cinematic universe, Marvel movies are an interesting creation of character, dialogue, and plot. A key defining feature of these movies is the large number of characters represented and developed. I wanted to explore NLP, and figured that exploring the dialogue in these movies would be a very fun thing to do. This project tackles an NLP classification problem: the goal was to create a model that can predict a character's name given a line of their dialogue from a Marvel Cinematic Universe (MCU) movie. Data was taken from Marvel released scripts and transformed into labels of names and feature documents of their dialogue.

Results

In this project, 18 different models were built and compared. Models 1-12 use Naive Bayes, SVM, and Random Forest classifiers in different architecture combinations and can be read about in the old models directory. Model 13 is the Naive Bayes classifier with the best performance and is presented as the production model. Models 14-18 are derived from model 13, but manipulate the data or the larger architecture to try to achieve better results. Model 14 is an ensemble method that trains a model for every character and can be read about in the One vs. Rest Models notebook. Model 15 allows the use of movie titles and authors as features and can be read about in the All Features Model notebook. Models 16, 17, and 18 were inspired by the correlation between the number of words in a line and its correct prediction, shown in the section below. These models attempt to train on less sparse vectors and can be read about in the Word Count Models notebook.

Model 13 uses scikit-learn to implement a Naive Bayes classifier. Hyperparameter selection is done using cross validation (10 folds). The model is also evaluated using cross validation (10 folds). Combined with hyperparameter selection, this results in nested cross validation. Stop words, which are words that provide little value to predictions (I, you, the, a, an, ...), are not removed from the dialogue. Hyperparameter selection showed better performance when keeping all words than when removing NLTK's list of stop words. Words are stemmed using NLTK's SnowballStemmer. Word counts are then weighted by term frequency and inverse document frequency (TF-IDF) using scikit-learn's implementation.
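The setup described above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the project's actual code: `MultinomialNB` is assumed as the Naive Bayes variant, the hyperparameter grid is hypothetical, scikit-learn's built-in English stop word list stands in for NLTK's, and the SnowballStemmer step is left out so the example only needs scikit-learn. The toy lines are invented for illustration and do not come from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def nested_cv_scores(lines, characters, n_folds=10):
    """Return one balanced-accuracy score per outer fold."""
    pipeline = Pipeline([
        ("counts", CountVectorizer()),   # raw word counts per line
        ("tfidf", TfidfTransformer()),   # reweight the counts with TF-IDF
        ("nb", MultinomialNB()),         # assumed Naive Bayes variant
    ])
    # Hypothetical grid: keep or drop stop words, and vary NB smoothing.
    param_grid = {
        "counts__stop_words": [None, "english"],
        "nb__alpha": [0.1, 1.0],
    }
    # Inner loop: choose hyperparameters on each outer training split.
    search = GridSearchCV(pipeline, param_grid, cv=n_folds,
                          scoring="balanced_accuracy")
    # Outer loop: score the tuned model on lines it never saw during tuning.
    return cross_val_score(search, lines, characters, cv=n_folds,
                           scoring="balanced_accuracy")


# Invented toy data (not from mcu.csv), small enough to run quickly.
toy_lines = [
    "the suit needs more power",
    "i built this armor myself",
    "run the armor diagnostics again",
    "the suit is ready to fly",
    "charge the suit and the armor",
    "my armor has more power now",
    "the hammer returns to my hand",
    "asgard is my home",
    "i will defend asgard with the hammer",
    "bring the hammer to asgard",
    "my hammer strikes like thunder",
    "thunder rolls over asgard",
]
toy_characters = ["TONY STARK"] * 6 + ["THOR"] * 6

# Three folds instead of ten because the toy data is tiny.
scores = nested_cv_scores(toy_lines, toy_characters, n_folds=3)
print(scores.mean())
```

Wrapping the grid search inside `cross_val_score` is what makes the cross validation nested: the outer folds never influence which hyperparameters are chosen.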

Model 13 achieves 29.041% balanced accuracy. The model's performance isn't great, but it's still fun to interact with! Over the course of the project it's been shown that more data only results in marginal increases in performance. The project's analysis also shows that accuracy increases as the number of words in a line increases. In other words, it seems that spoken dialogue is too short to predict in this case. The Naive Bayes classifier is a Bag of Words model, meaning that the order of the words is ignored. A model that takes word order into account, such as one built on word embeddings, could possibly increase accuracy. Deep learning also might have success on this dataset.
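To make the Bag of Words point concrete, here is a small sketch using only the standard library. The `bag_of_words` helper is hypothetical and tokenizes on whitespace, which is simpler than what a real vectorizer does.

```python
from collections import Counter


def bag_of_words(line):
    """Count lowercase whitespace-separated tokens, discarding their order."""
    return Counter(line.lower().split())


original = "I am Iron Man"
shuffled = "Man Iron am I"

# Both lines contain the same words, so their bags are identical.
print(bag_of_words(original) == bag_of_words(shuffled))  # True
```

A classifier that only sees these counts cannot tell the two lines apart, and a short line leaves it with very few counts to work from.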

To see the code for the model, see the Production Model notebook.

About the Dataset

This repository contains a newly created dataset to train and test models on, as well as several Jupyter Notebooks that describe the process used to create each .csv. This dataset uses a combination of original scripts and transcripts from the Marvel Cinematic Universe (MCU) movies. The original script .pdfs were obtained from Script Slug, though other copies of the Marvel released scripts can be found online elsewhere. Transcripts were taken from Fandom's Transcripts Wiki. Transcripts were copied and pasted into .txt files, and then processed using pandas. See the table below to find out what movies are in the dataset, and where the dialogue came from.

If you spot a mistake in the dataset, please let me know so I can correct it.

If you would like to use the dataset, it is available on Kaggle.

MCU.csv

The final file, mcu.csv, contains the columns character and line, which hold the dialogue for several movies from the MCU. Additional columns provide features for context, such as movie titles and author indicators. See /data/MCU.ipynb for more details on those features and the dataset's creation.

Individual Movies

For individual movies, the .csv files found in /data/cleaned/ should not be used. Instead, load the file mcu.csv and use pandas or a similar library to select the rows that match a given movie.
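Selecting one movie's rows might look like the sketch below. The column name `movie` is an assumption (check /data/MCU.ipynb for the real name), and a tiny in-memory frame with placeholder lines stands in for `pd.read_csv("mcu.csv")` so the example runs on its own.

```python
import pandas as pd

# Stand-in for pd.read_csv("mcu.csv"); the "movie" column name is assumed.
mcu = pd.DataFrame({
    "character": ["TONY STARK", "THOR", "TONY STARK"],
    "line": ["placeholder line 1", "placeholder line 2", "placeholder line 3"],
    "movie": ["Iron Man", "Thor", "The Avengers"],
})


def lines_for_movie(df, title, movie_column="movie"):
    """Return only the rows whose movie column equals the given title."""
    return df[df[movie_column] == title].reset_index(drop=True)


iron_man = lines_for_movie(mcu, "Iron Man")
print(len(iron_man))  # 1
```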

Other Files of Interest

The file mcu.csv contains all of the data, but there are other files that might be of interest for anyone using this dataset. The file mcu_subset.csv is a subset of the original data, only containing dialogue for ['TONY STARK', 'STEVE ROGERS', 'NATASHA ROMANOFF', 'THOR', 'NICK FURY', 'PEPPER POTTS', 'BRUCE BANNER', 'JAMES RHODES', 'LOKI', 'PETER PARKER'], who are the top 10 characters by number of lines, number of words, and movie appearances.
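A similar subset could be rebuilt from the full data with a filter on the character column. This is only a sketch: the small frame below is a placeholder for the real file, and "SOMEONE ELSE" is an invented label used to show a row being dropped.

```python
import pandas as pd

# The ten character labels listed above.
TOP_CHARACTERS = [
    "TONY STARK", "STEVE ROGERS", "NATASHA ROMANOFF", "THOR", "NICK FURY",
    "PEPPER POTTS", "BRUCE BANNER", "JAMES RHODES", "LOKI", "PETER PARKER",
]


def subset_by_character(df, names=TOP_CHARACTERS):
    """Keep only the rows spoken by one of the given characters."""
    return df[df["character"].isin(names)].reset_index(drop=True)


# Placeholder rows standing in for pd.read_csv("mcu.csv").
mcu = pd.DataFrame({
    "character": ["TONY STARK", "SOMEONE ELSE", "LOKI"],
    "line": ["placeholder line 1", "placeholder line 2", "placeholder line 3"],
})

subset = subset_by_character(mcu)
print(subset["character"].tolist())  # ['TONY STARK', 'LOKI']
```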

The file characters.csv contains statistics summarizing each character's involvement in the MCU. The movie name columns in that table describe how many lines they have in that movie.
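Per-movie line counts of this kind can be derived from the dialogue table with a cross-tabulation. The sketch below shows one way to do it and is not necessarily how characters.csv was built; the `movie` column name and the rows are assumptions for illustration.

```python
import pandas as pd

# Placeholder rows; in practice these would come from mcu.csv.
mcu = pd.DataFrame({
    "character": ["TONY STARK", "TONY STARK", "THOR", "TONY STARK"],
    "movie": ["Iron Man", "Iron Man", "Thor", "The Avengers"],
})

# One row per character, one column per movie, each cell a line count.
lines_per_movie = pd.crosstab(mcu["character"], mcu["movie"])

print(lines_per_movie.loc["TONY STARK", "Iron Man"])  # 2
```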

The file movies.csv contains metadata about the movies used in this project. For details on the creation of these files, see /data/MCU.ipynb.

Movies Included

| Movie | Year | Is Transcript | Lines | Source Link | CSV Issues |
|---|---|---|---|---|---|
| Iron Man | 2008 | | 834 | Script Slug | |
| Iron Man 2 | 2010 | ✔️ | 1010 | Fandom's Transcripts Wiki | Not proofread |
| Thor | 2011 | | 1007 | Script Slug | |
| Captain America: The First Avenger | 2011 | ✔️ | 688 | Fandom's Transcripts Wiki | Not proofread |
| The Avengers | 2012 | | 1027 | Script Slug | |
| Iron Man 3 | 2013 | ✔️ | 1043 | Fandom's Transcripts Wiki | Not proofread |
| Thor: The Dark World | 2013 | ✔️ | 734 | Fandom's Transcripts Wiki | Not proofread, transcript duplication |
| Captain America: The Winter Soldier | 2014 | ✔️ | 841 | Fandom's Transcripts Wiki | Not proofread |
| Avengers: Age of Ultron | 2015 | ✔️ | 980 | Fandom's Transcripts Wiki | Not proofread |
| Ant Man | 2015 | ✔️ | 867 | Fandom's Transcripts Wiki | Not proofread |
| Captain America: Civil War | 2016 | ✔️ | 987 | Fandom's Transcripts Wiki | Not proofread |
| Guardians of the Galaxy: Vol 2 | 2017 | | 993 | Script Slug | |
| Spider-Man: Homecoming | 2017 | ✔️ | 1558 | Fandom's Transcripts Wiki | Not proofread |
| Thor: Ragnarok | 2017 | | 961 | Script Slug | |
| Black Panther | 2018 | | 834 | Script Slug | |
| Avengers: Infinity War | 2018 | ✔️ | 990 | Fandom's Transcripts Wiki | |
| Captain Marvel | 2019 | ✔️ | 775 | Fandom's Transcripts Wiki | Not proofread |
| Avengers: Endgame | 2019 | | 1229 | Script Slug | |

Movies Not Included

| Movie | Year | Source Link | Transcript Issues |
|---|---|---|---|
| The Incredible Hulk | 2008 | Fandom's Transcripts Wiki | Poor / messy transcript |
| Guardians of the Galaxy | 2014 | Fandom's Transcripts Wiki | Poor / messy transcript |
| Doctor Strange | 2016 | Fandom's Transcripts Wiki | Poor / messy transcript |
| Ant-Man and the Wasp | 2018 | Fandom's Transcripts Wiki | Incomplete transcript |
| Spider-Man: Far From Home | 2019 | Fandom's Transcripts Wiki | Incomplete transcript |

Libraries Used

  • numpy
  • pandas
  • sklearn
  • nltk
  • matplotlib
  • seaborn
  • statsmodels

Contributors

  • prestondunton
