Predicting churn on “Sparkify”, a music streaming example

Edwin Hung
8 min read · Aug 25, 2021

Service businesses spend tons of money and time on customer retention through efforts such as improving quality and customer service. To determine which strategy is most effective, it is vital to understand customer behavior and to identify which customers are likely to churn.

Project Overview

We will look into such a case: a music streaming application named Sparkify, much like Spotify or Pandora. This project is the capstone project of the Udacity Data Scientist Nanodegree; the contents and dataset are provided by Insight Data Science. The data we will analyze is the event log of Sparkify users, and since this data is rather large, we use Apache Spark, a big data analytics tool, for data wrangling and model building. The goal of this project is to build a model that predicts the churn of Sparkify users and to evaluate its performance.

Project Statement

Imagine we are the data team of Sparkify, focused on understanding our customers. A portion of Sparkify users cancel their accounts, and we hope to anticipate which users will churn so that other teams can offer remedies and improve our service. Our approach is to extract useful information about user behavior from the dataset and feed it into machine learning models capable of predicting user churn.

To build a machine learning model, we first need a preliminary exploration and cleaning of the data. Then we define the target, which in this case is churn. Next, we proceed to exploratory data analysis to gain insight into user behavior. Based on that analysis, we can extract features from the dataset and feed them into our models.

We split the dataset into training, test, and validation sets. Once the machine learning models are built and fine-tuned, we pick the best-performing model based on test-set evaluation. In the end, this model produces the final result on the validation set, which the model has never “seen”, to give us a sense of how it will perform in the real world.

Metrics

Since we are determining whether a user churns, this is a binary classification problem. The metric used for performance evaluation is accuracy, the percentage of correct predictions of the target class.

Exploratory Data Analysis

Data Exploration

The data we are given is a user activity log containing action records. Each row is one action performed by a user. Data fields include user information, song, artist, the page the user is on, and a timestamp, as shown below.

There are 286,500 records, so we will use Spark, a big data analytics tool, for our data analysis and modeling. Looking at the statistical summary of userId, we notice empty strings: 8,346 records have an empty userId. These should be guests who are just exploring Sparkify, since they only visit pages like “Home”, “About”, and “Register”. We remove guests from the dataset because churn doesn’t apply to them.
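As a minimal sketch of this cleaning step (the file path and DataFrame names here are illustrative, not taken from the project code):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Sparkify").getOrCreate()

# Load the event log; the path is illustrative.
df = spark.read.json("mini_sparkify_event_data.json")

# Guests show up as empty-string userIds; churn doesn't apply to them.
df_clean = df.filter(df.userId != "")
```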

We also report the percentage of missing values in each data field. The artist, length, and song fields have the highest percentages, because these fields are empty when users are not listening to songs.

In the exploratory data analysis step, we first define churn, which can happen to both paid and free users: a user churns when they reach the “Cancellation Confirmation” page, meaning the account is canceled. Since we are predicting churned users, we add a Churn column such that every record row of a user who reaches “Cancellation Confirmation” is flagged 1, and 0 otherwise. The churn rate of the dataset turns out to be about 16%.
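One way to express this labeling in PySpark (a sketch, assuming the cleaned frame `df_clean` from above) is a window over `userId` that propagates the cancellation flag to every row of each user:

```python
from pyspark.sql import Window
from pyspark.sql import functions as F

# 1 on the cancellation event, 0 elsewhere.
cancel_event = F.when(F.col("page") == "Cancellation Confirmation", 1).otherwise(0)

# A per-user window lets the flag reach every row of that user.
user_window = Window.partitionBy("userId")
df_churn = df_clean.withColumn("Churn", F.max(cancel_event).over(user_window))
```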

Churned users are a small portion of the dataset

Data Visualization

Examining the two groups of users, we find that users who stayed and users who churned listened to 1,108 and 699 songs on average, respectively. The following bar plot shows the average number of visits to specific pages by the two user groups.

Stayed users likely spend more time on Sparkify, so it makes sense that they see the first three types of pages more often. It is interesting that churned users are shown more roll advertisements. It is worth checking whether advertisements are a driving factor of churn, though we may need additional data.
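For reference, the per-group averages above come down to a grouped aggregation; a minimal sketch, assuming the labeled frame `df_churn` defined earlier and the page value “NextSong” marking each song play:

```python
from pyspark.sql import functions as F

avg_songs = (df_churn
    .filter(F.col("page") == "NextSong")   # one row per song played
    .groupBy("Churn", "userId").count()    # songs per user
    .groupBy("Churn").agg(F.avg("count").alias("avg_songs_per_user")))
avg_songs.show()
```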

Methodology

Data Preprocessing

Next, we want to extract features for our machine learning model. The songs and pages above are all indicative of user behavior, so we track counts of these metrics over time for each user, meaning the features are chronologically cumulative. They include, for each user, the number of songs listened to, the number of thumbs-downs, the number of Help pages visited, the number of errors encountered, and the number of roll advertisements seen.
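These running counts can be built with a cumulative window ordered by the log’s timestamp column (`ts` is an assumption about the column name); a sketch for three of the features:

```python
from pyspark.sql import Window
from pyspark.sql import functions as F

# Running (cumulative) window per user, ordered by event time.
cum_window = (Window.partitionBy("userId")
                    .orderBy("ts")
                    .rowsBetween(Window.unboundedPreceding, Window.currentRow))

def cum_count(page_name):
    """Cumulative count of visits to a given page."""
    return F.sum((F.col("page") == page_name).cast("int")).over(cum_window)

df_feat = (df_churn
    .withColumn("num_songs", cum_count("NextSong"))
    .withColumn("num_thumbs_down", cum_count("Thumbs Down"))
    .withColumn("num_ads", cum_count("Roll Advert")))
```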

In addition, we label-encode the page column using StringIndexer from the PySpark ML library. To dig deeper into users’ listening behavior, the number of songs listened to per day per user is a good measurement, so we add it to our feature list. Since the number of songs is on a different scale than the other features, we normalize them with Normalizer. The Churn column is the target, and the features are assembled into a single vector, as Spark ML requires.
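A sketch of these transformations with PySpark ML (the feature list is an illustrative subset, and `songs_per_day` is a hypothetical name for the per-day listening feature):

```python
from pyspark.ml.feature import Normalizer, StringIndexer, VectorAssembler

# Label-encode the page column.
indexer = StringIndexer(inputCol="page", outputCol="page_index")

# Spark ML expects all features assembled into a single vector column.
assembler = VectorAssembler(
    inputCols=["num_songs", "num_thumbs_down", "num_ads", "songs_per_day"],
    outputCol="raw_features")

# Normalizer rescales each row's feature vector; p=2 gives the unit L2 norm.
normalizer = Normalizer(inputCol="raw_features", outputCol="features", p=2.0)
```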

In the modeling process, we split the dataset and pick a winning model based on test-set performance. The dataset is randomly split into three sets: training, test, and validation (80%, 10%, and 10%, respectively). The validation set is kept completely out of the model-building process, which gives us a more realistic performance evaluation.
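The split itself is one line with Spark’s randomSplit (a sketch; `df_model` stands for the assembled features-plus-label frame, and the seed is illustrative):

```python
# 80/10/10 split; a fixed seed keeps the split reproducible.
train, test, validation = df_model.randomSplit([0.8, 0.1, 0.1], seed=42)
```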

Implementation

The three models used are Logistic Regression, Random Forest, and Linear Support Vector Classification. We fine-tune two hyperparameters for each model via grid search, using 3-fold cross-validation on the training set. CrossValidator lets us implement all the fine-tuning steps easily and automatically selects the best model, which we then evaluate on the test set. The figure below shows the hyperparameter ranges for the grid search.

Algorithms from left to right are: Logistic Regression, Random Forest, and Linear Support Vector

The cross-validation evaluation metric is F1 score, which balances precision and recall. Because churned users are a small portion of our data, any model sees a large number of true negatives, so accuracy alone could look deceptively high.
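Putting the tuning together for one of the algorithms (Random Forest shown; the grid values here are placeholders, not the exact ranges from the figure above):

```python
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

rf = RandomForestClassifier(labelCol="Churn", featuresCol="features")

# Two hyperparameters per model; these values are illustrative.
param_grid = (ParamGridBuilder()
              .addGrid(rf.numTrees, [20, 50])
              .addGrid(rf.maxDepth, [5, 10])
              .build())

# F1 is available through the multiclass evaluator.
f1_evaluator = MulticlassClassificationEvaluator(labelCol="Churn", metricName="f1")

cv = CrossValidator(estimator=rf, estimatorParamMaps=param_grid,
                    evaluator=f1_evaluator, numFolds=3)
cv_model = cv.fit(train)  # bestModel is selected automatically
```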

Here are the parameters and scores of the best model for each of the three algorithms from cross-validation.

Refinement

The area under the ROC curve was the initial metric, as it is the default metric of Spark’s evaluator. However, given the imbalance in the target distribution, F1 score seemed the better choice. The switch also paid off: performance on both the test and validation sets ended up slightly higher.

Results

Model Evaluation and Validation

We take the best model from cross-validation for each of the three algorithms and evaluate them on the test set. All three models score more than 83% accuracy, and the scores are close. The Random Forest model has the highest accuracy at 84.65% on predicting churn, and its accuracy on the validation set is 84.71%, which is very good.
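Scoring the chosen model on the held-out sets amounts to swapping the evaluator’s metric (a sketch, reusing the fitted `cv_model` from above):

```python
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

acc_evaluator = MulticlassClassificationEvaluator(
    labelCol="Churn", metricName="accuracy")

test_accuracy = acc_evaluator.evaluate(cv_model.transform(test))
val_accuracy = acc_evaluator.evaluate(cv_model.transform(validation))
```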

Since the Random Forest model is our pick for the final result, we report its feature importances in the bar chart below.
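Random Forest models in Spark expose per-feature importances directly on the fitted model; a sketch (`feature_names` is a hypothetical list matching the VectorAssembler column order):

```python
best_rf = cv_model.bestModel

# featureImportances is a Spark ML vector; convert it for plain iteration.
for name, score in zip(feature_names, best_rf.featureImportances.toArray()):
    print(f"{name}: {score:.3f}")
```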

Justification

The number of ads shown to a user ranks at the top of the Random Forest model’s feature importances. This echoes our finding in exploratory data analysis that churned users see more roll ads than stayed users on average. We may need more data, or input from other teams, to figure out the reasons behind it.

The high accuracy on the validation set is likely due to a proper random split of the dataset before any transformation; feature normalization also helps.

Conclusion

Reflection

Like a typical data science project, we examine the dataset by understanding the data fields, looking for invalid or missing values, and cleaning them. In exploratory data analysis, we define the target and the features required to build machine learning models; here, data visualization is very helpful for grasping relationships between variables. Then we preprocess the data: extracting features, normalizing scales, and splitting the dataset for training and validation. With that, we are ready to build models. Cross-validation is important to make sure the performance results we get are not due to chance, and we fine-tune the models to further improve performance. Finally, we pick the winning model based on test-set performance and report its result on the validation set.

I have learned a lot working on this project, particularly about the stories that the data and models tell under the hood. A thorough data analysis before building models is essential, because we may find interesting insights, like the number of roll ads in this case. It is also critical to look at the parameters of the models after they are built; feature importance gives us a better understanding of how the features relate to predicting churn.

Further Improvement

The churn field we predict is event-based and doesn’t directly tell us whether a user will churn. A potential solution is to take the average of the predicted churn column for each user and treat it as that user’s likelihood of churning. This column could also serve as a feature for a different model that predicts which users churn.
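A sketch of that per-user aggregation (assuming `predictions` is the output of the fitted model’s transform, with the standard `prediction` column):

```python
from pyspark.sql import functions as F

# Collapse event-level predictions into one churn likelihood per user.
user_scores = (predictions
    .groupBy("userId")
    .agg(F.avg("prediction").alias("churn_likelihood")))
```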

The above score may have room for improvement, because we could extract more interesting features from the dataset, such as user cohort. Users who joined Sparkify recently may behave differently than users who joined a year ago. Dividing users into cohorts gives us a new feature that may help model performance, as it is not correlated with the other features.

Acknowledgment

To see how each step, including data visualization and modeling, is performed in actual code, please go to my GitHub repo here.
