Blog is available at ==> https://medium.com/@kapur_naveen/sparkify-udacity-capstone-project-1e8959772937
Git repository link ==> https://github.com/nakapoor/Sparkify-Churn-Prediction
Sparkify is a music app. The data provided for this analysis contains information about users as they interact with the app. A single user can have many entries, one per action. Along with this activity data, the data set also identifies a group of churned users through their account cancellations.
The purpose of the project is to understand the user activity data, identify the characteristics of churned users from their behavior, and support measures to retain users who are likely to churn. Because the goal is to stop user churn, the model must be good at capturing true positives, that is, users who are about to churn. The F1 score balances precision and recall (the true positive rate), so I use the F1 score as the metric for evaluating model performance.
These are the steps taken to achieve that purpose.
The data contains a few records where the userId is missing; these may be guest users. I drop every record where the userId is missing.
The data set does not contain a target attribute, so we need to understand the data and define one. Defining churn: the page value "Cancellation Confirmation" marks the action that completes the cancellation of a user's subscription, so I mark every user with a "Cancellation Confirmation" page entry as a churned user.
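The filtering step can be sketched in plain Python over a list of records, rather than the Spark DataFrame used in the project. The sample rows and the helper name `drop_missing_user_ids` are hypothetical, and treating both `None` and a blank string as "missing" is an assumption about how guest rows appear in the data.

```python
def drop_missing_user_ids(records):
    """Keep only records whose userId is present and non-blank."""
    kept = []
    for record in records:
        user_id = record.get("userId")
        # Assumption: a missing id shows up as None or as an empty/blank string.
        if user_id is None or str(user_id).strip() == "":
            continue
        kept.append(record)
    return kept


# Hypothetical sample rows, for illustration only.
sample = [
    {"userId": "125", "page": "NextSong"},
    {"userId": "", "page": "Home"},
    {"userId": None, "page": "Login"},
]
cleaned = drop_missing_user_ids(sample)
print(len(cleaned))
```

Only the row that carries a real userId survives the filter.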
    df.filter(df.page=="Cancellation Confirmation").select("userId").dropDuplicates().show(10)
    +------+
    |userId|
    +------+
    |   125|
    |    51|
    |    54|
    |100014|
    |   101|
    |    29|
    |100021|
    |    87|
    |    73|
    |     3|
    +------+
    df.select(["userId", "page", "time", "level", "song", "sessionId"]).where(df.userId == "125").sort("time").show(50)
    +------+--------------------+-------------------+-----+--------------------+---------+
    |userId|                page|               time|level|                song|sessionId|
    +------+--------------------+-------------------+-----+--------------------+---------+
    |   125|            NextSong|2018-10-12 04:05:44| free|    paranoid android|      174|
    |   125|            NextSong|2018-10-12 04:11:21| free|Hypnotize(Album V...|      174|
    |   125|            NextSong|2018-10-12 04:15:11| free|       I'm On My Way|      174|
    |   125|            NextSong|2018-10-12 04:18:34| free|Leader Of Men (Al...|      174|
    |   125|            NextSong|2018-10-12 04:22:04| free|       Love You Down|      174|
    |   125|            NextSong|2018-10-12 04:28:35| free|Don't Leave Me Be...|      174|
    |   125|            NextSong|2018-10-12 04:32:08| free|     They're Red Hot|      174|
    |   125|            NextSong|2018-10-12 04:35:06| free|                Kota|      174|
    |   125|         Roll Advert|2018-10-12 04:35:17| free|                null|      174|
    |   125|              Cancel|2018-10-12 04:35:18| free|                null|      174|
    |   125|Cancellation Conf...|2018-10-12 04:35:18| free|                null|      174|
    +------+--------------------+-------------------+-----+--------------------+---------+
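The churn definition illustrated by user 125 can be sketched in plain Python as follows. This is not the project's Spark code: the helper names are hypothetical, user "999" is an invented non-churned user, and flagging every event of a churned user (instead of only the final one) is an assumption about how the churn column is filled.

```python
CHURN_PAGE = "Cancellation Confirmation"


def churned_user_ids(events):
    """Return the set of userIds with at least one cancellation-confirmation event."""
    return {e["userId"] for e in events if e["page"] == CHURN_PAGE}


def add_churn_flag(events):
    """Copy each event and attach a per-user churn flag."""
    churned = churned_user_ids(events)
    # Assumption: the flag is a property of the user, so all their events carry it.
    return [dict(e, churn=e["userId"] in churned) for e in events]


events = [
    {"userId": "125", "page": "NextSong"},
    {"userId": "125", "page": "Cancel"},
    {"userId": "125", "page": "Cancellation Confirmation"},
    {"userId": "999", "page": "NextSong"},  # hypothetical non-churned user
]
labelled = add_churn_flag(events)
for row in labelled:
    print(row["userId"], row["page"], row["churn"])
```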
    df_withchurn.dropDuplicates(["userId", "gender"]).groupby(["churn", "gender"]).count().sort("churn").show()
    +-----+------+-----+
    |churn|gender|count|
    +-----+------+-----+
    |false|     M|   89|
    |false|     F|   84|
    | true|     F|   20|
    | true|     M|   32|
    +-----+------+-----+
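To read the counts above as rates, a small calculation helps. The sketch below uses only the four counts from the table; the function name `churn_rate` is a hypothetical helper.

```python
# Counts copied from the churn/gender table: (gender, churned) -> users.
counts = {
    ("M", False): 89,
    ("F", False): 84,
    ("F", True): 20,
    ("M", True): 32,
}


def churn_rate(counts, gender=None):
    """Share of churned users, optionally restricted to one gender."""
    churned = sum(n for (g, c), n in counts.items() if c and gender in (None, g))
    total = sum(n for (g, c), n in counts.items() if gender in (None, g))
    return churned / total


for g in ("M", "F"):
    print(g, round(churn_rate(counts, g), 3))
print("all", round(churn_rate(counts), 3))
```

With these counts, roughly 26% of male users and 19% of female users churned, about 23% overall (52 of 225 users), so the churned class is clearly the minority.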
I then created new features, as mentioned below:
For modeling, I am using three algorithms:
a) Logistic regression
b) DecisionTreeClassifier
c) GBTClassifier
a) Logistic regression
Precision : 0.25
Recall : 0.17647058823529413
F1 Score : 0.20689655172413793
b) DecisionTreeClassifier
Precision : 0.5625
Recall : 0.5294117647058824
F1 Score : 0.5454545454545455
c) GBTClassifier
Precision : 0.5
Recall : 0.47058823529411764
F1 Score : 0.48484848484848486
As discussed above, it is important to identify users who are about to churn, so we need a model that is good at capturing true positives. The F1 score balances precision and recall (the true positive rate), so F1 is the metric used to compare the models.
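For reference, F1 is the harmonic mean of precision and recall. The sketch below recomputes it from the precision and recall values reported above; the helper name `f1_score` and the zero-division guard are my own additions, not necessarily how the project computed the metric.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs as reported for the validation set.
reported = {
    "Logistic regression": (0.25, 0.17647058823529413),
    "DecisionTreeClassifier": (0.5625, 0.5294117647058824),
    "GBTClassifier": (0.5, 0.47058823529411764),
}
scores = {name: f1_score(p, r) for name, (p, r) in reported.items()}
best = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: {score:.4f}")
print("best:", best)
```

The recomputed values agree with the reported F1 scores, and the harmonic mean penalizes a model whose precision and recall are both low, as with logistic regression here.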
I started with the simplest linear algorithm, logistic regression. Its F1 score was very low (0.21), so I moved to a tree-based algorithm, and the evaluation metrics improved sharply: the F1 score jumped to 0.55. Out of curiosity I also tried a more advanced algorithm, GBTClassifier, but DecisionTreeClassifier still shows the best recall (true positive rate) and the best F1 score of the three algorithms, so I prefer to stick with DecisionTreeClassifier for model building.
    +--------------+----------+-------+---------+
    |Val Results   |Precision |Recall |F1 Score |
    +--------------+----------+-------+---------+
    |LR classifiers|  0.2500  | 0.1764|  0.2068 |
    |DT Classifiers|  0.5625  | 0.5294|  0.5454 |
    |GBT Classifier|  0.5000  | 0.4705|  0.4848 |
    +--------------+----------+-------+---------+
Conclusion: DecisionTreeClassifier performs best of the three algorithms. Looking at the results, a lot of work is still required to improve the predictions. A major challenge is that the training results from cross-validation look good, but the results on the validation set are much weaker than the training results.
For LR: training 0.701, testing 0.250
For DT Classifiers: training 0.811, testing 0.545
To improve the model, I am considering the two approaches below:
a) Undersampling to optimize the F1 score.
b) Building ensemble models for this problem.
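As a rough illustration of the undersampling idea, the sketch below randomly drops majority-class users until both classes are the same size. It is plain Python, not the project's Spark pipeline: the function name `undersample`, the fixed seed, and the synthetic user ids are assumptions, while the class sizes (52 churned, 173 retained) come from the churn/gender counts above.

```python
import random


def undersample(rows, label_key="churn", seed=42):
    """Randomly drop majority-class rows until both classes have equal size."""
    positives = [r for r in rows if r[label_key]]
    negatives = [r for r in rows if not r[label_key]]
    minority, majority = sorted((positives, negatives), key=len)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    sampled_majority = rng.sample(majority, len(minority))
    balanced = minority + sampled_majority
    rng.shuffle(balanced)
    return balanced


# Synthetic users: ids are made up, only the class sizes mirror the data.
users = [{"userId": i, "churn": i < 52} for i in range(225)]
balanced = undersample(users)
print(len(balanced), sum(u["churn"] for u in balanced))
```

In practice the resampling would normally be applied to the training split only, so that the validation scores still reflect the real class balance.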