marvel-dialogue-nlp

A machine learning project that uses Natural Language Processing (NLP) to classify who says a line of dialogue.

Streamlit App

To view an interactive summary of this project, see its Streamlit app.

About the Project

With more than 21 movies set in the same cinematic universe, the Marvel films are an interesting blend of character, dialogue, and plot. A defining feature of these movies is the large number of characters they introduce and develop. I wanted to explore NLP, and the dialogue in these movies seemed like a fun place to start. The task is an NLP classification problem: build a model that predicts a character's name given a line of their dialogue from a Marvel Cinematic Universe (MCU) movie. Data was taken from Marvel-released scripts and transformed into labels (character names) and feature documents (their dialogue).

Results

In this project, 18 different models were built and compared. Models 1-12 use Naive Bayes, SVM, and Random Forest classifiers in different architecture combinations and can be read about in the old models directory. Model 13 is the Naive Bayes classifier with the best performance and is presented as the production model. Models 14-18 are derived from Model 13, but manipulate the data or the larger architecture to try to achieve better results. Model 14 is an ensemble method that trains a model for every character and can be read about in the One vs. Rest Models notebook. Model 15 allows the use of movie titles and authors as features and can be read about in the All Features Model notebook. Models 16, 17, and 18 were inspired by the correlation between the number of words in a line and its correct prediction, shown in the section below. These models attempt to train on less sparse vectors and can be read about in the Word Count Models notebook.
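
The idea behind Model 14, one model per character, can be illustrated with scikit-learn's OneVsRestClassifier, which fits one binary classifier per label. This is only a minimal sketch and not the code from the One vs. Rest Models notebook: the TF-IDF features, the MultinomialNB base estimator, and the helper name build_one_vs_rest_model are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def build_one_vs_rest_model():
    """Return a pipeline that trains one binary Naive Bayes model per character.

    Hypothetical helper: the vectorizer and the base estimator are assumptions,
    not necessarily what the project's notebook uses.
    """
    return Pipeline([
        # Turn each line of dialogue into a TF-IDF weighted word-count vector.
        ("tfidf", TfidfVectorizer()),
        # Fit a separate "this character vs. everyone else" classifier per label.
        ("ovr", OneVsRestClassifier(MultinomialNB())),
    ])
```

After fitting, the per-character classifiers are available through the estimators_ attribute of the "ovr" step.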

Model 13 uses scikit-learn to implement a Naive Bayes classifier. Hyperparameters are selected with 10-fold cross validation, and the model is evaluated with a second, outer 10-fold cross validation, so the two together form nested cross validation. Stop words, which are common words usually assumed to add little predictive value (I, you, the, a, an, ...), are not removed from the dialogue: hyperparameter selection showed better performance when keeping all words than when removing NLTK's list of stop words. Words are stemmed using NLTK's SnowballStemmer. Word counts are then weighted by term frequency and inverse document frequency (TF-IDF) using scikit-learn's implementation.
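
The nested cross validation described above can be sketched with scikit-learn as follows. This is a hedged illustration rather than the Production Model notebook's code. The choice of MultinomialNB, the grid values, the function name nested_cv_scores, and the use of scikit-learn's built-in English stop word list in place of NLTK's are all assumptions. Stemming is left out to keep the sketch free of NLTK.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def nested_cv_scores(lines, characters, n_folds=10, seed=0):
    """Estimate balanced accuracy with nested cross validation.

    Hypothetical helper: the inner loop picks hyperparameters, the outer
    loop scores the tuned model on data the inner loop never saw.
    """
    pipeline = Pipeline([
        ("counts", CountVectorizer()),   # raw word counts
        ("tfidf", TfidfTransformer()),   # TF-IDF weighting of the counts
        ("nb", MultinomialNB()),         # assumed Naive Bayes variant
    ])
    # Assumed grid: keep all words vs. drop stop words, plus two smoothing values.
    param_grid = {
        "counts__stop_words": [None, "english"],
        "nb__alpha": [0.1, 1.0],
    }
    inner = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    outer = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    search = GridSearchCV(pipeline, param_grid, cv=inner, scoring="balanced_accuracy")
    # Each outer fold re-runs the whole grid search on its own training split.
    return cross_val_score(search, lines, characters, cv=outer, scoring="balanced_accuracy")
```

The function returns one balanced accuracy score per outer fold. Averaging them gives a performance estimate that is not biased by the hyperparameter search.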

Model 13 achieves 29.041% balanced accuracy. The model's performance isn't great, but it's still fun to interact with! Over the course of the project, adding more data produced only marginal increases in performance. Above, it's shown that accuracy increases as the number of words in a line increases. In other words, it seems that spoken dialogue is too short to predict in this case. The Naive Bayes classifier is a Bag of Words model, meaning that the order of the words is ignored. A model that takes word order into account, such as a sequence model built on word embeddings, could possibly increase accuracy. Deep learning also might have success on this dataset.
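
To make the Bag of Words limitation concrete, here is a small sketch using only the Python standard library. The helper bag_of_words and the two example sentences are made up for illustration and do not come from the dataset. It shows that two lines with different meanings can produce exactly the same features once word order is discarded.

```python
from collections import Counter


def bag_of_words(line):
    """Count lowercase, whitespace-separated tokens, discarding their order."""
    return Counter(line.lower().split())


# Same words, different meaning: a Bag of Words model cannot tell these apart.
first = bag_of_words("you trust me not him")
second = bag_of_words("you trust him not me")
same_features = first == second  # True, because only the counts are kept
```

With lines this short there are few counts to work with, which fits the observation that longer lines are easier to classify.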

To see the code for the model, see the Production Model notebook.

About the Dataset

This repository contains a newly created dataset to train and test models on, as well as several Jupyter Notebooks that describe the process used to create each .csv file. The dataset combines original scripts and transcripts from the Marvel Cinematic Universe (MCU) movies. The original script PDFs were obtained from Script Slug, though other copies of the Marvel-released scripts can be found elsewhere online. Transcripts were taken from Fandom's Transcripts Wiki, copied and pasted into .txt files, and then processed using pandas. See the table below to find out which movies are in the dataset and where their dialogue came from.

If you spot a mistake in the dataset, please let me know so I can correct it.

If you would like to use the dataset, it is available on Kaggle.

MCU.csv

The final file, mcu.csv, contains the columns character and line, which hold the dialogue for several movies from the MCU. Additional columns provide context, such as movie titles and author indicators. See /data/MCU.ipynb for more details on those features and the dataset's creation.

Individual Movies

For individual movies, the .csv files found in /data/cleaned/ should not be used. Instead, load the file mcu.csv and use pandas or a similar library to select the rows that match a given movie.
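
A minimal pandas sketch of that selection is shown below. The column name movie and the helper select_movie are assumptions made for illustration. Check /data/MCU.ipynb for the actual column that holds the movie title.

```python
import pandas as pd


def select_movie(mcu, title, movie_column="movie"):
    """Return only the rows of `mcu` whose movie column equals `title`.

    `movie_column` is an assumed column name, not confirmed by the dataset docs.
    """
    return mcu[mcu[movie_column] == title].reset_index(drop=True)


# Usage, assuming mcu.csv sits in the working directory:
# mcu = pd.read_csv("mcu.csv")
# iron_man = select_movie(mcu, "Iron Man")
```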

Other Files of Interest

The file mcu.csv contains all of the data, but there are other files that might be of interest for anyone using this dataset. The file mcu_subset.csv is a subset of the original data, only containing dialogue for ['TONY STARK', 'STEVE ROGERS', 'NATASHA ROMANOFF', 'THOR', 'NICK FURY', 'PEPPER POTTS', 'BRUCE BANNER', 'JAMES RHODES', 'LOKI', 'PETER PARKER'], who are the top 10 characters by number of lines, number of words, and movie appearances.
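
A similar subset can be approximated directly from mcu.csv. The sketch below ranks characters by number of lines only, whereas mcu_subset.csv is described as using lines, words, and movie appearances, so treat it as an illustration rather than the exact procedure. The helper name top_characters_subset is hypothetical.

```python
import pandas as pd


def top_characters_subset(mcu, n=10):
    """Keep only dialogue spoken by the `n` characters with the most lines."""
    # value_counts() sorts characters from most to fewest lines.
    top = mcu["character"].value_counts().head(n).index
    return mcu[mcu["character"].isin(top)]
```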

The file characters.csv contains statistics summarizing each character's involvement in the MCU. Each movie-name column in that table gives the number of lines the character has in that movie.
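
Per-movie line counts of this kind can be derived from mcu.csv with a cross-tabulation. This is a sketch under assumptions, not the code that built characters.csv: the column name movie and the helper lines_per_movie are hypothetical.

```python
import pandas as pd


def lines_per_movie(mcu, movie_column="movie"):
    """Count lines for every character (rows) in every movie (columns)."""
    # crosstab fills character/movie pairs that never occur with 0.
    return pd.crosstab(mcu["character"], mcu[movie_column])
```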

The file movies.csv contains metadata about the movies used in this project. For details on the creation of these files, see /data/MCU.ipynb.

Movies Included

Movie	Year	Is Transcript	Lines	Source Link	CSV Issues
Iron Man	2008	❌	834	Script Slug
Iron Man 2	2010	✔️	1010	Fandom's Transcripts Wiki	Not proofread
Thor	2011	❌	1007	Script Slug
Captain America: The First Avenger	2011	✔️	688	Fandom's Transcripts Wiki	Not proofread
The Avengers	2012	❌	1027	Script Slug
Iron Man 3	2013	✔️	1043	Fandom's Transcripts Wiki	Not proofread
Thor: The Dark World	2013	✔️	734	Fandom's Transcripts Wiki	Not proofread, transcript duplication
Captain America: The Winter Soldier	2014	✔️	841	Fandom's Transcripts Wiki	Not proofread
Avengers: Age of Ultron	2015	✔️	980	Fandom's Transcripts Wiki	Not proofread
Ant Man	2015	✔️	867	Fandom's Transcripts Wiki	Not proofread
Captain America: Civil War	2016	✔️	987	Fandom's Transcripts Wiki	Not proofread
Guardians of the Galaxy: Vol 2	2017	❌	993	Script Slug
Spider-Man: Homecoming	2017	✔️	1558	Fandom's Transcripts Wiki	Not proofread
Thor: Ragnarok	2017	❌	961	Script Slug
Black Panther	2018	❌	834	Script Slug
Avengers: Infinity War	2018	✔️	990	Fandom's Transcripts Wiki
Captain Marvel	2019	✔️	775	Fandom's Transcripts Wiki	Not proofread
Avengers: Endgame	2019	❌	1229	Script Slug

Movies not included

Movie	Year	Source Link	Transcript Issues
The Incredible Hulk	2008	Fandom's Transcripts Wiki	Poor / messy transcript
Guardians of the Galaxy	2014	Fandom's Transcripts Wiki	Poor / messy transcript
Doctor Strange	2016	Fandom's Transcripts Wiki	Poor / messy transcript
Ant-Man and the Wasp	2018	Fandom's Transcripts Wiki	Incomplete transcript
Spider-Man: Far From Home	2019	Fandom's Transcripts Wiki	Incomplete transcript

Libraries Used

numpy
pandas
sklearn
nltk
matplotlib
seaborn
statsmodels
