song_popularity_detection's Introduction

Song Popularity

Overview

Music has been an integral part of our culture throughout human history. Being able to identify the factors that make a song popular has a major impact on businesses that thrive on popular music, namely radio stations, record labels, and digital and physical music marketplaces. Predicting popular songs can also be applied to the problem of predicting preferred songs for a given population.

Dataset

The data was part of a competition arranged by ML Space and hosted on Kaggle. Based on features like acousticness, danceability, energy, instrumentalness, key, liveness, loudness, etc., we had to train a model that assigns a popularity score to each song, with 1 being the highest popularity score and 0 the lowest. There were 15 numerical features in this dataset. The model performance metric was the area under the ROC curve (AUC).

EDA

Below are some of the key observations derived from analyzing the data:

  1. The data suffered from a high number of Null values in both the train and test sets. Below is the distribution of Null values over each feature for both train and test sets.

Of the 15 features, 7 contain Null values in almost 10% of their records. Overall, approximately 22000 records contain a Null value in at least one column.
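A per-feature Null summary like the one described above can be computed with pandas. The sketch below uses a tiny made-up frame; the column choice and values are illustrative assumptions, not the competition data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the train set.
train = pd.DataFrame({
    "energy": [0.8, np.nan, 0.5, 0.9],
    "key": [1.0, 5.0, np.nan, np.nan],
    "loudness": [-5.2, -7.1, -6.0, -4.4],
})

# Fraction of Null values in each feature.
null_fraction = train.isna().mean()

# Number of records with a Null in at least one column.
rows_with_null = int(train.isna().any(axis=1).sum())

print(null_fraction)
print(rows_with_null)
```

Running the same two lines on the test set gives the matching summary for comparison.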

  1. The data suffered heavily from multicollinearity. Multicollinearity is the occurrence of high intercorrelations among two or more independent variables in a multiple regression model, and it can lead to skewed or misleading results. Below is a heat map showing the multicollinearity in the data I am working with.

Note: A strong positive correlation exists between loudness and energy. Strong negative correlations exist between loudness and acousticness, and between acousticness and energy.
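The correlation matrix behind such a heat map can be sketched as follows. The data here is synthetic and deliberately constructed to mimic the relationships noted above; the coefficients and the 0.8 threshold are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic features: loudness rises with energy, acousticness falls with it.
energy = rng.uniform(0, 1, n)
loudness = -20 + 15 * energy + rng.normal(0, 1, n)
acousticness = 1 - energy + rng.normal(0, 0.1, n)

df = pd.DataFrame({
    "energy": energy,
    "loudness": loudness,
    "acousticness": acousticness,
})

# Pearson correlation between every pair of features.
corr = df.corr()

# List each feature pair whose absolute correlation exceeds the threshold.
threshold = 0.8  # assumed cut-off for "strong" correlation
cols = list(corr.columns)
strong_pairs = [
    (a, b, round(float(corr.loc[a, b]), 2))
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(strong_pairs)
```

The `corr` frame can be passed directly to a plotting library to draw the heat map.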

  1. Class imbalance and class overlap - Class imbalance is very common in classification problems. In this data, non-popular songs outnumber popular songs: there are roughly 2 non-popular songs for every popular song.

In addition, we found a class overlap issue. Class overlap is caused by ambiguous regions in the data where the prior probabilities of two or more classes are approximately equal. As a result, the distribution of non-popular songs across a feature overlaps the distribution of popular songs over the same feature. Below is the distribution of popular and non-popular songs over a few features.

Feature Handling

  1. Instrumentalness was observed to be closely clustered around the mean. The distribution of instrumentalness is shown below. To obtain a smoother distribution, the log of instrumentalness was used as a feature.
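A log transform of this kind might look like the sketch below. The exact transform used in the project is not stated; the sample values and the small offset `eps` (added so that log(0) is never evaluated) are assumptions.

```python
import numpy as np

# Hypothetical instrumentalness values in [0, 1], tightly bunched near zero.
instrumentalness = np.array([0.0, 1e-5, 3e-4, 2e-3, 0.01, 0.85])

eps = 1e-6  # assumed offset so that zero values stay finite after the log
log_instrumentalness = np.log(instrumentalness + eps)

# The transform keeps the ordering but spreads out the tightly packed values.
print(np.ptp(instrumentalness), np.ptp(log_instrumentalness))
```

`np.log1p` is another common choice when zeros are present, although it compresses small values less strongly.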

  1. Null imputation - Different imputation techniques were tried and compared based on model performance.
  • Simple Imputation technique - Imputes the missing data with the mean value of each column.
  • KNN Imputer technique - Imputes missing values using the weighted or unweighted mean of the desired number of nearest neighbors.
  • Iterative Imputer - This is part of an experimental library. The technique uses round-robin linear regression, modeling each feature with missing values as a function of the other features, in turn.

The error for each imputation technique is documented in the Model Accuracy Data.xlsx file.
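The three imputation techniques listed above map onto scikit-learn's `SimpleImputer`, `KNNImputer` and `IterativeImputer`. The sketch below applies each one to a small made-up matrix; the data and the parameter values (`n_neighbors=2`, `max_iter=10`) are assumptions, not the settings used in the project.

```python
import numpy as np
# IterativeImputer is experimental and must be enabled before it is imported.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

# Hypothetical feature matrix with two missing entries.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, np.nan, 6.0],
    [5.0, 6.0, 9.0],
    [7.0, 8.0, 12.0],
])

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "knn": KNNImputer(n_neighbors=2, weights="uniform"),
    "iterative": IterativeImputer(max_iter=10, random_state=0),
}

# Fit each imputer on X and fill in the missing values.
imputed = {name: imp.fit_transform(X) for name, imp in imputers.items()}

for name, filled in imputed.items():
    print(name)
    print(filled)
```

In a real comparison each imputer would be fitted on the train set only and then applied to the test set with `transform`, so that no test information leaks into the imputation.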

  1. Sampling issue - Used a SMOTE (Synthetic Minority Oversampling Technique) pipeline to cater for the class imbalance problem. The pipeline combines SMOTE with random undersampling of the majority class. First, the minority class is oversampled to a set percentage of the number of examples in the majority class (e.g. about 1,000); then random undersampling reduces the number of examples in the majority class to 50 percent more than the minority class (e.g. about 2,000).
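The idea of oversampling followed by undersampling can be illustrated with a simplified, plain-NumPy version of SMOTE. This is a sketch of the technique and not the project's pipeline: the helper names, the synthetic data and the target counts (200 and 300) are all assumptions. Libraries such as imbalanced-learn provide ready-made `SMOTE` and `RandomUnderSampler` classes for the same purpose.

```python
import numpy as np


def smote_oversample(X_min, n_new, k=5, rng=None):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbors (simplified SMOTE)."""
    rng = np.random.default_rng(rng)
    # Pairwise squared distances among minority samples.
    d2 = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(d2, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbors[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])


def undersample(X_maj, n_keep, rng=None):
    """Keep a random subset of n_keep majority samples."""
    rng = np.random.default_rng(rng)
    keep = rng.choice(len(X_maj), size=n_keep, replace=False)
    return X_maj[keep]


# Hypothetical imbalanced data: 1000 majority vs 50 minority samples.
rng = np.random.default_rng(0)
X_maj = rng.normal(0.0, 1.0, size=(1000, 2))
X_min = rng.normal(2.0, 1.0, size=(50, 2))

# Step 1: oversample the minority class from 50 to 200 samples.
X_min_new = np.vstack([X_min, smote_oversample(X_min, 150, k=5, rng=1)])
# Step 2: undersample the majority class to 50 percent more than the minority.
X_maj_new = undersample(X_maj, int(1.5 * len(X_min_new)), rng=2)

X_res = np.vstack([X_maj_new, X_min_new])
y_res = np.concatenate([
    np.zeros(len(X_maj_new), dtype=int),
    np.ones(len(X_min_new), dtype=int),
])
print(np.bincount(y_res))
```

Resampling should be applied to the training folds only; the validation and test data keep their original class ratio.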

Model

Trained models such as SVC, KNN and RandomForest with both the Iterative and KNN imputers. I also used GridSearchCV for hyperparameter tuning. In addition to these models, I tried training XGBoost and LGBM, which accept missing values and handle them internally based on the available data.
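A grid search of the kind mentioned above can be sketched with scikit-learn as follows. The synthetic dataset, the parameter grid and the fold count are assumptions made for illustration; the grids actually searched in the project are not given here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with roughly a 2:1 class ratio.
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.67, 0.33], random_state=0
)

# Hypothetical parameter grid: 2 x 2 = 4 candidate settings.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    scoring="roc_auc",  # select the best setting by cross-validated AUC
    cv=3,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

The same pattern works for SVC or KNN by swapping the estimator and the grid.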

Performance

The performance metric used is the area under the ROC curve (AUC). The scores are documented in Model Accuracy Data.xlsx, and the highlighted rows are the ones yielding the best performance for each model.
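As a reminder of how the metric is computed, here is a minimal sketch using scikit-learn on four made-up predictions. The labels and scores are illustrative and unrelated to the results in the spreadsheet.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels (1 = popular) and predicted popularity scores.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC is the share of (popular, non-popular) pairs ranked in the right order:
# 3 of the 4 pairs here, so 0.75.
auc = roc_auc_score(y_true, y_score)

# Points of the ROC curve itself, one per decision threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

print(auc)
```

Because AUC depends only on the ranking of the scores, it is unaffected by the decision threshold and is less sensitive to class imbalance than plain accuracy.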
