
GoodReads-Multilabel-Genre-Prediction

A multilabel genre predictor for books, built as a DSAI project.

In this project, we predict the genres (a self-curated top 30) that a book can be classified into, based on its plot description, the brightness (luminance) of its cover image, and its number of ratings (numRatings). The dataset is obtained from Zenodo.


Libraries

Libraries used:

  • General: numpy, pandas, matplotlib, seaborn, ast
  • Models & Classifiers: sklearn
  • Textual Data: nltk, unicodedata
  • Image Processing: imageio, scipy.cluster.vq, PIL

Jupyter Notebook Files

  1. Data Preprocessing, Scraping & Cleaning
  2. Exploratory Analysis
  3. Models & Results

Genre list

top30genrelist = ['fiction', 'fantasy', 'romance', 'young adult', 'contemporary',
    'adult', 'nonfiction', 'history', 'novels', 'mystery', 'historical fiction',
    'audiobook', 'science fiction', 'paranormal', 'literature', 'adventure',
    'classics', 'thriller', 'childrens', 'magic', 'humor', 'contemporary romance',
    'crime', 'suspense', 'middle grade', 'chick lit', 'biography', 'teen',
    'horror', 'philosophy']
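The genre sets are multi-hot encoded into fixed-length 0/1 label vectors for the classifiers. A minimal sketch using sklearn's MultiLabelBinarizer, with a truncated genre list for illustration only:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# truncated genre list, for illustration only
genre_list = ['fiction', 'fantasy', 'romance', 'young adult', 'mystery']

# passing `classes` fixes the column order to match the genre list
mlb = MultiLabelBinarizer(classes=genre_list)
y = mlb.fit_transform([['fiction', 'mystery'], ['fantasy']])
print(y)
# each row is one book; each column is one genre
```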

Project Overview

Scaling & Split

IterativeStratification, from the skmultilearn library, was used to train/test split the dataset. It considers each label individually and distributes samples label by label, starting with the label that has the fewest positive samples and working up to the labels with the most, so that each label's positive examples are spread proportionally across the train and test sets.

MinMax scaling was utilised.

We scale our data because many machine learning algorithms are sensitive to feature scales.

For algorithms such as logistic regression and neural networks, which use gradient descent as an optimisation technique, scaling helps gradient descent converge more quickly towards the minima.

For distance-based algorithms such as SVM, which use distances between data points to determine similarity, features with larger magnitudes would otherwise dominate the distance computation and receive higher effective weightage.

Multilabel Classifiers

Three techniques are used to classify the books into multiple genre labels:

  • Binary Relevance: one classifier is fitted per genre via OneVsRest. Each classifier independently predicts the presence or absence of its genre, and the union of all predicted genres is taken as the multilabel output. Logistic regression was used as the base classifier.

  • Label Powerset: in this approach, the multilabel problem is transformed into a multiclass problem, with a single multiclass classifier trained on all unique label (genre) combinations found in the training data. Each plot in the test set is classified into one of these unique combinations. Naive Bayes was used as the classifier.

  • Label Powerset with Clustering: a clustering technique (k-means) is used to reduce the number of possible label combinations to a manageable number of classes. Linear SVC was then used as the classifier.
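As a hedged sketch of the Binary Relevance approach (with toy plots and genres rather than the project's data, so the actual notebooks will differ), the TF-IDF + OneVsRest + logistic regression pipeline can be wired up in scikit-learn as:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# toy corpus of plot descriptions and their genre sets (illustrative only)
plots = [
    "a detective hunts a killer in the city",
    "a young wizard learns magic at school",
    "two strangers fall in love in paris",
    "a spaceship crew explores a distant planet",
]
genres = [["mystery", "crime"], ["fantasy", "magic"], ["romance"], ["science fiction"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(genres)  # multi-hot label matrix, one column per genre

# Binary Relevance: one logistic-regression classifier per genre via OneVsRest
clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("ovr", OneVsRestClassifier(LogisticRegression(max_iter=1000))),
])
clf.fit(plots, Y)
pred = clf.predict(["a murder investigation in a small town"])
```

The Label Powerset variants can be built analogously by wrapping the base estimator (e.g. Naive Bayes or Linear SVC) in skmultilearn's LabelPowerset transformer instead of OneVsRestClassifier.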


Results

Models were finally evaluated on precision, recall, and F1-score.

| Pipeline                                             | Precision | Recall | F1-Score |
|------------------------------------------------------|-----------|--------|----------|
| TF-IDF + Binary Relevance + Logistic Regression      | 0.65      | 0.26   | 0.29     |
| TF-IDF + Label Powerset + Naive Bayes                | 0.58      | 0.42   | 0.43     |
| TF-IDF + Label Powerset with Clustering + Linear SVC | 0.29      | 0.26   | 0.27     |
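For reference, multilabel precision, recall, and F1 can be computed per book and then averaged; the `samples` averaging shown below is one common choice, though the averaging used in the notebooks is not stated here. A toy sketch:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# toy multilabel predictions for 3 books over 4 genres (illustrative only)
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [1, 0, 0, 1]])

# 'samples' averaging: compute the metric per book, then average across books
p = precision_score(y_true, y_pred, average="samples", zero_division=0)
r = recall_score(y_true, y_pred, average="samples", zero_division=0)
f1 = f1_score(y_true, y_pred, average="samples", zero_division=0)
```

Here every predicted genre is correct (precision 1.0), but some true genres are missed, so recall and F1 are lower, mirroring the high-precision/low-recall pattern in the table above.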

Things we learnt from the project

  • Web Scraping
  • Image Feature Extraction using Pillow
  • Textual Analysis: Using NLTK library to clean strings
  • Multi-Hot Binary Encoding
  • Word Cloud
  • Iterative Stratification for Train-Test Split
  • MinMax Scaling
  • TF-IDF Vectorizer
  • Multi-Label Classification Algorithms: Binary Relevance, Label Powerset, Label Powerset with Clustering
  • Logistic Regression, Naïve Bayes, Linear SVC
  • Concepts of Precision, Recall and F1-Score


Contribution

Nathaniel:

  • Extracting RGB using the Pillow library, General Data Cleaning
  • Genre Analysis and Cleaning
  • Genre Distribution
  • Word Cloud
  • Iterative Stratification (Train/Test Split)
  • Logistic Regression
  • Model Analysis (Precision, Recall, F1-Score)

Marcus:

  • Web Scraping
  • Multi-Hot Binary Encoding of Genre, General Data Cleaning
  • Genre Analysis and Cleaning
  • Multi-Genre Distribution plot
  • Min-Max Scaling
  • Linear Support Vector Machine Classifier and K-Means Clustering

Yan Chi:

  • Cleaning description using NLTK, General Data Cleaning
  • Genre Analysis and Cleaning
  • Heatmap plot
  • Analysing Numerical Variables with respect to genre
  • TF-IDF Vectorizing
  • Naïve Bayes

by Nathaniel, Marcus & Yan Chi

