The Die-Hard Rock Fan Data Scientist

Name	Date
Hamza Benhmani	June 22, 2019

Note that the opinions expressed in this project are not mine.

Research Question

Would it be possible for me, an opinionated rock fan, to prove what is and is not rock?

Abstract

Spotify's publicly accessible API exposes useful information about the audio features of the songs on the platform. Some people have used this API to compile interesting datasets, such as the one used in this project (see References).

The music genre debate is an ongoing one. It is very difficult to define clearly what constitutes a given genre, or to say which genre or genres a given song belongs to. This is particularly hard for me, as someone who thinks that only rock music is good music and that, more often than not, people mistakenly describe bad music as rock.

We will look at the data in the dataset mentioned above to see whether we can train machines to judge this properly and save us the argument time. I trust the rock classification of the songs in the list, given that I spent months going through its content.

The result is a model that can, to some extent, predict whether a song is a rock song from its audio features, but it needs to be improved.

Introduction

Adri Molina, who made the dataset available on Kaggle, compiled a list of songs with their audio features using the spotipy library (a Python library for the Spotify Web API). Each row is a song with a music genre assigned to it.

According to Adri, this work is a simplification of the HTML list generated by this GitHub project, which was itself inspired by and based on the website Every Noise at Once.

Here is some information on the dataset after some cleanup:

RangeIndex: 131554 entries, 0 to 131553
Data columns (total 12 columns):
Danceability        131554 non-null float64
Energy              131554 non-null float64
Key                 131554 non-null float64
Loudness            131554 non-null float64
Mode                131554 non-null float64
Speechness          131554 non-null float64
Acousticness        131554 non-null float64
Instrumentalness    131554 non-null float64
Liveness            131554 non-null float64
Valence             131554 non-null float64
Tempo               131554 non-null float64
Genre               131554 non-null object
dtypes: float64(11), object(1)
memory usage: 12.0+ MB

Shape:

(131554, 12)

Columns

['Danceability', 'Energy', 'Key', 'Loudness', 'Mode', 'Speechness', 'Acousticness', 'Instrumentalness', 'Liveness', 'Valence', 'Tempo', 'Genre']

Methods

Data Cleansing

Before exploring different machine learning models and tuning their parameters, I needed to clean and prepare the dataset in order to get good results. The main data manipulations were:

The Genre column had 625 unique values, which is too many for my needs, so I collapsed them into a small set by creating a list of genres I want to track:
```
['rock', 'pop', 'electro', 'folk', 'hiphop', 'metal', 'rap']
```
I then rewrote any value containing one of these keywords to just the keyword itself, and dropped the rows whose genre contained none of them. This left me with 49798 rows, compared to the initial 131554.
Since I'm only interested in separating the greatest, and in fact the only good, music genre from the rest, I next transformed the Genre target to 1 if the song is categorised as a rock song, and 0 if not 🤘
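
The two steps above can be sketched roughly as follows. This is a dependency-free illustration, not the project's actual code: the function names and sample genres are made up, and it assumes that when a genre contains several keywords (for example "pop rock") the first keyword in the list wins, which the description above does not specify.

```python
# Keywords taken from the list above; everything else here is illustrative.
GENRE_KEYWORDS = ['rock', 'pop', 'electro', 'folk', 'hiphop', 'metal', 'rap']


def collapse_genre(raw_genre, keywords=GENRE_KEYWORDS):
    """Return the first keyword contained in raw_genre, or None if none match."""
    lowered = raw_genre.lower()
    for keyword in keywords:
        # Plain substring match, so 'trap' also matches the keyword 'rap'.
        if keyword in lowered:
            return keyword
    return None


def to_rock_target(raw_genres):
    """Drop genres with no keyword, then map the rest to 1 (rock) or 0 (not rock)."""
    collapsed = (collapse_genre(genre) for genre in raw_genres)
    return [1 if genre == 'rock' else 0 for genre in collapsed if genre is not None]


# Hypothetical genre labels, only for demonstration.
sample = ['indie rock', 'dance pop', 'classical', 'trap rap', 'pop rock']
print(to_rock_target(sample))
```

Because the match is a plain substring test, the order of the keyword list and accidental matches inside longer words both affect the final labels, so they are worth checking on the real genre values.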

Machine Learning Model and Tuning

I chose to work with the K-Nearest Neighbors algorithm because this is a predictive analysis built on the assumption that similar songs, in other words songs belonging to the same genre, lie close to each other in the space of audio features. I used the classifier variant (K-Nearest Neighbors Classifier) because my dependent variable, whether or not the song is a rock song, is a binary target.
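
To make the idea concrete, here is a minimal, dependency-free sketch of a K-Nearest Neighbors classifier using Euclidean distance and a majority vote. It is only an illustration of the technique: the function name, the two-feature toy data and the distance metric are assumptions, and the project itself may have relied on a library implementation.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, n_neighbors=4):
    """Predict a label for query by majority vote among its nearest neighbours."""
    # Euclidean distance from the query to every training point.
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    distances.sort(key=lambda pair: pair[0])
    # Count the labels of the closest points; on a tied vote, the label that
    # appears first (i.e. belongs to the nearer neighbour) wins.
    votes = Counter(label for _, label in distances[:n_neighbors])
    return votes.most_common(1)[0][0]


# Made-up (energy, acousticness) pairs; 1 = rock, 0 = not rock.
train_X = [(0.90, 0.10), (0.80, 0.20), (0.85, 0.15),
           (0.20, 0.90), (0.30, 0.80), (0.25, 0.85)]
train_y = [1, 1, 1, 0, 0, 0]

print(knn_predict(train_X, train_y, (0.82, 0.18)))
print(knn_predict(train_X, train_y, (0.28, 0.82)))
```

Since the method relies on distances, features with large ranges such as Loudness and Tempo can dominate features bounded between 0 and 1 unless the columns are rescaled first.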

Results

To choose the optimal value of n_neighbors, I plotted a multi-line chart of the error rate and F1 score for n_neighbors from 1 to 10, and got the following results:

Based on this, I decided to use n_neighbors = 4, which gives good prediction results without overfitting the model.
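
A sweep like the one described above could look roughly like this. It is a small, self-contained sketch on made-up data: it tracks only the error rate on a held-out set, stops at 6 neighbours because the toy training set has only six points, and uses hypothetical helper names.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k training points closest to query."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def error_rate(train, test, k):
    """Fraction of held-out points whose predicted label is wrong."""
    wrong = sum(knn_predict(train, features, k) != label for features, label in test)
    return wrong / len(test)


# Made-up (features, label) pairs; 1 = rock, 0 = not rock.
train = [((0.9, 0.1), 1), ((0.8, 0.2), 1), ((0.7, 0.3), 1),
         ((0.2, 0.9), 0), ((0.3, 0.8), 0), ((0.1, 0.7), 0)]
test = [((0.85, 0.15), 1), ((0.25, 0.85), 0), ((0.6, 0.4), 1)]

# Evaluate each candidate value and keep the one with the lowest error.
errors = {k: error_rate(train, test, k) for k in range(1, 7)}
best_k = min(errors, key=errors.get)

for k, err in errors.items():
    print(f"n_neighbors={k}: error rate {err:.2f}")
print("lowest error at n_neighbors =", best_k)
```

Measuring the error on data the model has not seen matters here: with a single neighbour, every training point is its own nearest neighbour, so the training error alone would always favour n_neighbors = 1.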

Here is an example of the F1 score and error rate obtained with this setting:

F1 Score: 0.5613426823119307
Error Rate: 0.21474469305794608

And the confusion matrix, which helps explain these middling results:

[[13112   669]
 [ 3105   544]]
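
To read the matrix, it helps to break it down into per-class metrics. The sketch below assumes the common convention that rows are true labels and columns are predicted labels, both in the order [not rock, rock]; the text does not state the layout, so treat the derived numbers as conditional on that assumption.

```python
# Confusion matrix from the run above. ASSUMPTION: rows are true labels and
# columns are predicted labels, both ordered [not rock, rock].
matrix = [[13112, 669],
          [3105, 544]]

(tn, fp), (fn, tp) = matrix
total = tn + fp + fn + tp

error_rate = (fp + fn) / total           # share of all songs misclassified
rock_precision = tp / (tp + fp)          # predicted rock that really is rock
rock_recall = tp / (tp + fn)             # actual rock songs that were found
rock_f1 = 2 * tp / (2 * tp + fp + fn)    # harmonic mean of the two above

print(f"songs evaluated: {total}")
print(f"error rate:      {error_rate:.3f}")
print(f"rock precision:  {rock_precision:.3f}")
print(f"rock recall:     {rock_recall:.3f}")
print(f"rock F1:         {rock_f1:.3f}")
```

Under that assumption, only 544 of the 3649 rock songs in the test set are recognised as rock, so most of the remaining error comes from rock songs labelled as something else. These figures need not match the F1 score quoted above, which may have been averaged over both classes or taken from a different run.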

Discussion

The method solves the problem to some degree, with middling accuracy, but it still needs to be improved. Some ways this could be done are:

Keep looking for an algorithm that works better with this problem and dataset.
Add more relevant information to the dataset in order to get more detailed and accurate results.

References

https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/

https://www.kaggle.com/grasslover/spotify-music-genre-list

https://www.quora.com/How-are-musical-genres-defined

http://everynoise.com/engenremap.html

https://en.wikipedia.org/wiki/F1_score

https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761
