
Data Scientist Nanodegree

Spark for Big Data

Project: Analysing Customer Churn with PySpark

Table of Contents


I. Definition

Project Overview

You might have heard of the two music streaming giants: Apple Music and Spotify. Which one is better? That depends on multiple factors, like the UI/UX of the app, the frequency of new content, user-curated playlists and subscriber count. The factor we are studying here is the churn rate, which has a direct impact on subscriber count and on the long-term growth of the business.

So what is this churn rate anyway?

For a business, the churn rate is a measure of the number of customers leaving the service or downgrading their subscription plan within a given period of time.

Problem Statement

Imagine you are working on the data team for a popular digital music service similar to Spotify or Pandora. Millions of users stream their favourite songs through your service every day, either using the free tier that plays advertisements between songs, or using the premium subscription model where they stream music ad-free but pay a monthly flat rate. Users can upgrade, downgrade or cancel their service at any time, so it is crucial to make sure your users love the service.

In this project, our aim is to identify customer churn for Sparkify (a Spotify-like fictional music streaming service). We do not look at the variety of music the service provides (genres, curated playlists, top regional charts); instead, we explore user behaviour and how we can identify the likelihood that a user will churn. Churned customers are those who downgrade their service, i.e. go from a paid subscription to free, or leave the service entirely.

Our target variable is isChurn. It cannot be read directly from the JSON file, so we will create it through feature engineering: the isChurn column is 1 for users who visited the Cancellation Confirmation page and 0 otherwise.
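
A minimal sketch of how this label could be engineered, assuming the raw event log has been loaded into a Spark DataFrame named df (names here are illustrative, not the notebook's exact code):

```python
from pyspark.sql import functions as F
from pyspark.sql import Window

# Flag every event where the user confirmed a cancellation, then
# propagate the flag to all of that user's rows via a window max.
churn_event = F.when(F.col("page") == "Cancellation Confirmation", 1).otherwise(0)
user_window = Window.partitionBy("userId")

df_labeled = df.withColumn("isChurn", F.max(churn_event).over(user_window))
```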

Metrics

Out of 225 unique users, only 52 churned (about 23%). Accuracy would not be a good metric given this imbalance, so we will use the F1-Score to evaluate our model.

F1-Score is the harmonic mean of precision and recall:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

Precision = True Positive / (True Positive + False Positive)

Recall = True Positive / (True Positive + False Negative)
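
In PySpark this metric can be computed with the built-in multiclass evaluator; a sketch, assuming a predictions DataFrame with an isChurn label column and a prediction column:

```python
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Evaluate with F1 instead of accuracy, to account for the class imbalance.
evaluator = MulticlassClassificationEvaluator(
    labelCol="isChurn", predictionCol="prediction", metricName="f1"
)
f1_score = evaluator.evaluate(predictions)  # `predictions` is assumed to exist
```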

II. Analysis

Data Exploration

The input data file is mini_sparkify_event_data.json in the data directory. Our feature space has 286500 rows and 18 columns; data/metadata.xlsx contains information about the features.
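
A sketch of loading the event log and checking its shape (the SparkSession setup is assumed, not copied from the notebook):

```python
from pyspark.sql import SparkSession

# Create a local Spark session and read the JSON event log.
spark = SparkSession.builder.appName("Sparkify").getOrCreate()
df = spark.read.json("data/mini_sparkify_event_data.json")

print((df.count(), len(df.columns)))  # expected: (286500, 18)
```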

A preview of data:

data_preview

I used the toPandas() method above because 18 columns cannot be displayed in a user-friendly way by PySpark's built-in .show() method. Still, if you would like to see what it looks like with .show(), here you go (you will not like it).
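
Roughly what that preview looks like in code (a sketch, not the notebook's exact cell):

```python
# A small sample converted to pandas renders nicely in a notebook...
df.limit(5).toPandas()

# ...whereas the PySpark-native preview of all 18 columns is hard to read.
df.show(5, truncate=False)
```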

Feature Space

Univariate Plots

  1. Distribution of pages

    page_dist_2

    Cancellations are rare; they are what we have to predict.

    We will remove the churn-defining pages, Cancel and Cancellation Confirmation, in our modelling section to avoid lookahead bias.

    The most commonly browsed pages involve activities like adding a song to a playlist, visiting the home page, and giving a thumbs up.

    I have not included the NextSong category because it accounts for around 80% of the events and would heavily skew the distribution.

  2. Distribution of levels (free or paid)

    level_dist

    70% of churned users are paying customers. Customer retention is relatively more important for paid users because they are directly tied to the company's revenue.

  3. Song length

    songs_dist

    No additional insight is available from this; it just shows that most songs are around 4 minutes long.

  4. What type of device is the user streaming from?

    devices_dist

    This is what we expected: Windows is the most used platform.

Multivariate Plots
  1. Gender distribution

    gen_dist

    Male users outnumber female users.

  2. Distribution of pages based on churn

    page_churn_dist

    No strong conclusion can be drawn from this graph. It shows the same common actions as the univariate plot above.

  3. Distribution of hour based on churn

    hod_churn_dist diff_hod_churn_dist

    We can see that non-churn users are more active during the daytime.

  4. Behaviour across weekdays

    hod_churn_dist

    Activity is higher on weekdays, especially for churned users; however, the difference is not significant.

  5. Behaviour at the month level

    dom_churn_dist diff_dom_churn_dist

    Non-churn users are generally less active at the start of the month compared to churn users, and the opposite is the case at the end of the month.

Target Space

For page column, we have 22 distinct values:

  1. About
  2. Add Friend
  3. Add to Playlist
  4. Cancel
  5. Cancellation Confirmation
  6. Downgrade
  7. Error
  8. Help
  9. Home
  10. Login
  11. Logout
  12. NextSong
  13. Register
  14. Roll Advert
  15. Save Settings
  16. Settings
  17. Submit Downgrade
  18. Submit Registration
  19. Submit Upgrade
  20. Thumbs Down
  21. Thumbs Up
  22. Upgrade

We are interested in the fifth category, Cancellation Confirmation. Our target variable, isChurn, is 1 if the user visited the Cancellation Confirmation page and 0 otherwise.

Data Visualisation

churn_dist

This imbalance suggests that we should not use accuracy as our evaluation metric. We will use the F1 score instead and apply under-sampling to further optimise it.
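
A sketch of how the user-level churn distribution can be computed, assuming the df_labeled frame from the earlier isChurn sketch:

```python
# One row per user, then count churned vs. non-churned users
# (expected: roughly 173 non-churned vs. 52 churned out of 225).
user_churn = df_labeled.select("userId", "isChurn").distinct()
user_churn.groupBy("isChurn").count().show()
```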

III. Methodology

Data Preprocessing

  1. Handling null values

nan_cols

First, we will handle the null values in some columns. Two distinct null counts are observed: 8346 and 58392.

58392 is about 20% of the data (286500 rows), while 8346 is only about 3%. So we keep the columns with ~3% nulls and check whether the missing values in the ~20% columns can be imputed in some way.

nan_cols_count

These are the columns with ~20% missing values. They are difficult to impute, so we will drop the rows with null values in these columns.
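
A sketch of that clean-up step; the column names passed to dropna are placeholders for whichever columns the null counts identify, not the notebook's exact list:

```python
from pyspark.sql import functions as F

# Count nulls per column to see which columns are affected.
df.select(
    [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in df.columns]
).show()

# Drop rows that are null in the hard-to-impute (~20%-null) columns.
cols_with_many_nulls = ["artist", "song", "length"]  # placeholder names
df_clean = df.dropna(subset=cols_with_many_nulls)
```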

Implementation

We use the same training and testing features for all the models. PySpark's ML library (pyspark.ml) provides the most common machine learning classification algorithms; others, like Tree Predictions in Gradient Boosting to Improve Prediction Accuracy, are still in development.

The ones we'll be using are Logistic Regression, Random Forest Classifier and Gradient Boosting Tree Classifier, as sketched below.
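
A sketch of how these three classifiers are instantiated in pyspark.ml, assuming the engineered features have been assembled into a "features" vector column:

```python
from pyspark.ml.classification import (
    LogisticRegression,
    RandomForestClassifier,
    GBTClassifier,
)

# The three models compared in this project.
lr = LogisticRegression(labelCol="isChurn", featuresCol="features")
rf = RandomForestClassifier(labelCol="isChurn", featuresCol="features")
gbt = GBTClassifier(labelCol="isChurn", featuresCol="features")
```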

Refinement

Since the class distribution is highly imbalanced, we will perform random undersampling to optimise our F1 score.
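
A minimal sketch of random undersampling with sampleBy; features_df and the 0.3 fraction are assumptions, not the notebook's exact values:

```python
# Keep every churned user (label 1) and roughly 30% of the
# non-churned majority (label 0) so the classes are closer in size.
balanced_df = features_df.sampleBy("isChurn", fractions={0: 0.3, 1: 1.0}, seed=42)
```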

F1 is the harmonic mean of precision and recall (see the Metrics section above for how precision and recall are calculated):

F1 = 2 × (Precision × Recall) / (Precision + Recall)

This article will deepen your understanding of why to use the F1 score when evaluating a model on an imbalanced data set.

Comparison of average metrics before and after undersampling.

Model                              | Average Metrics Before | Average Metrics After
Logistic Regression                | 0.717, 0.684, 0.681    | 0.486, 0.344, 0.192
Random Forest Classifier           | 0.710, 0.699, 0.698    | 0.540, 0.537, 0.499
Gradient Boosting Tree Classifier  | 0.710, 0.705, 0.684    | 0.629, 0.627, 0.616

GBTClassifier provided a fairly good F1 score of 0.64 after undersampling.

IV. Conclusion

Reflection

I enjoyed the data pre-processing part of the project. For the data visualisation part, instead of using for loops to build arrays for our bar charts, I first converted the Spark DataFrame to a Pandas DataFrame using the toPandas() method; data visualisation is much easier from then on.
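
A sketch of that workflow with Plotly Express (the repository's helper.py handles the actual plotting, so this is illustrative only):

```python
import plotly.express as px

# Aggregate in Spark, then hand the small result to pandas/Plotly.
page_counts = df.groupBy("page").count().toPandas()
fig = px.bar(page_counts, x="page", y="count", title="Distribution of pages")
fig.show()
```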

The shape of our final data is just 225 x 32, which is too small to generalise our model. Just 225 users for a streaming company? That's nothing. The full 12 GB dataset might provide more useful results. If you want more statistically significant results, I suggest you run this notebook on Amazon EMR against the 12 GB dataset. I have skipped that part for now because it costs about $30 for one week.

Challenges

Some of the challenges which I faced in this project are:

  • PySpark's official documentation is sparse compared to Pandas'
  • For the sake of mastering Spark, we only used the most common machine learning classification models rather than more advanced ones
  • The highly imbalanced data led to a poor F1 score
  • If you run this notebook on your local machine without any change, it will take around an hour to run completely

V. Files

  1. Folders gbtModel, lrModel and rfModel

    Saved models before under-sampling

  2. Folders new_gbt_model, new_lr_model and new_rf_model

    Saved models after under-sampling

  3. helper.py

    Helper functions for Plotly visualisations

VI. Software Requirements

This project uses Python 3.6.6; the necessary libraries are listed in the requirements.txt file.

VII. References

VIII. Acknowledgements

Thanks to Udacity for providing such a challenging project to work on. At first, I was vexed by functional programming in Python, but the instructors were very clear in their approach. Now I am looking for more projects that are built atop Spark.
