
Predict Customer Churn using Watson Machine Learning and Jupyter Notebooks on Cloud Pak for Data

In this Code Pattern, we use IBM Cloud Pak for Data to go through the whole data science pipeline to solve a business problem and predict customer churn using a Telco customer churn dataset. Cloud Pak for Data is an interactive, collaborative, cloud-based environment where data scientists, developers, and others interested in data science can use tools (such as RStudio, Jupyter Notebooks, and Spark) to collaborate, share, and gather insight from their data, as well as build and deploy machine learning and deep learning models.

When the reader has completed this Code Pattern, they will understand how to:

  • Use Jupyter Notebooks to load, visualize, and analyze data
  • Run Notebooks in IBM Cloud Pak for Data
  • Build, test, and deploy a machine learning model using Spark MLlib on Cloud Pak for Data
  • Deploy a selected machine learning model to production using Cloud Pak for Data
  • Create a front-end application to interface with and consume your deployed model

architecture diagram

Flow

  1. User loads the Jupyter notebook into the Cloud Pak for Data platform.
  2. Telco customer churn data set is loaded into the Jupyter Notebook, either directly from the GitHub repo, or as virtualized data after following the Data Virtualization Tutorial from the IBM Cloud Pak for Data Learning Path.
  3. Preprocess the data, build machine learning models and save to Watson Machine Learning on Cloud Pak for Data.
  4. Deploy a selected machine learning model into production on the Cloud Pak for Data platform and obtain a scoring endpoint.
  5. Use the model for churn prediction using a frontend application.

Included components

  • IBM Cloud Pak for Data: The data and AI platform used throughout this code pattern.
  • Watson Machine Learning: The Cloud Pak for Data service used to save, deploy, and score the machine learning model.

Featured technologies

  • Jupyter Notebooks: An open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and explanatory text.
  • Pandas: An open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
  • Seaborn: A Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
  • Spark MLlib: Apache Spark's scalable machine learning library.

Prerequisites

  • Access to an IBM Cloud Pak for Data deployment with the Watson Machine Learning service installed.

Steps

  1. Create a new Project
  2. Create a Space for Machine Learning Deployments
  3. Upload the dataset if you are not on the Cloud Pak for Data Learning Path.
  4. Import notebook to Cloud Pak for Data
  5. Run the notebook
  6. Deploying the model using the Cloud Pak for Data UI
  7. Testing the model
  8. Create a Python Flask app that uses the model

1. Create a new project

  • Launch a browser and navigate to your Cloud Pak for Data deployment.

  • Go to the (☰) menu and click Projects:

(☰) Menu -> Projects

  • Click on New project. In the dialog that pops up, select the project type as Analytics project and click Next:

Start a new project

  • Click on the top tile for Create an empty project:

Create an empty project

  • Give the project a unique name, an optional description and click Create:

Pick a name

2. Create a Space for Machine Learning Deployments

Before we create a machine learning model, we will have to set up a deployment space where we can save and deploy the model.

Follow the steps in this section to create a new deployment space. If you already have a deployment space set up, you can skip this section and follow the steps to upload the dataset.

  • Navigate to the left-hand (☰) hamburger menu and choose Analyze -> Analytics deployments:

(☰) Menu -> Analytics deployments

  • Click on New deployment space +:

Add New deployment space

  • Click on the top tile for 'Create an empty space':

Create empty deployment space

  • Give your deployment space a unique name, an optional description, then click Create.

Create New deployment space

3. Upload the dataset

  • Clone this repository to get a local copy of the dataset:

git clone https://github.com/IBM/telco-customer-churn-on-icp4d/
cd telco-customer-churn-on-icp4d

  • In your project, on the Assets tab, click the 01/00 icon and then the Load tab. Either drag the data/Telco-Customer-Churn.csv file from the cloned repository onto the window, or navigate to it using browse for files to upload:

Add data set

4. Import notebook to Cloud Pak for Data

  • In your project, either click the Add to project + button, and choose Notebook, or, if the Notebooks section exists, to the right of Notebooks click New notebook +:

Add notebook

  • On the next screen, select the From URL tab, give your notebook a name and an optional description, provide the following URL as the Notebook URL, and choose the Python 3.6 environment as the Runtime:
https://raw.githubusercontent.com/IBM/telco-customer-churn-on-icp4d/master/notebooks/Telco-customer-churn-ICP4D.ipynb

Add notebook name and URL

  • When the Jupyter notebook is loaded and the kernel is ready, we can start executing cells.

Notebook loaded

Important: Make sure that you stop the kernel of your notebook(s) when you are done, in order to conserve memory resources!

Stop kernel

Note: The Jupyter notebook included in the project has been cleared of output. If you would like to see the notebook that has already been completed with output, refer to examples/Telco-customer-churn-ICP4D-with-output.ipynb.

5. Run the notebook

Spend some time looking through the sections of the notebook to get an overview. A notebook is composed of text (markdown or heading) cells and code cells. The markdown cells provide comments on what the code is designed to do.

You will run cells individually by highlighting each cell and then either clicking the Run button at the top of the notebook or using the keyboard shortcut (Shift + Enter, though this can vary by platform). While a cell is running, an asterisk ([*]) shows up to the left of the cell. When the cell has finished executing, a sequential number appears (e.g. [17]).

Please note that some of the comments in the notebook are directions for you to modify specific sections of the code. Make any changes as indicated before running the cell.

Notebook sections

With the notebook open, you will notice:

  • Section 1.0 Install required packages installs some of the libraries we are going to use in the notebook (many libraries come pre-installed on Cloud Pak for Data). Note that we upgrade the installed version of the Watson Machine Learning Python client. Check the output of the first code cell to confirm that the Python packages were installed successfully. An illustrative install cell is sketched after the figure below.

Install required packages
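
For orientation only, an install cell of this kind typically looks like the following. The authoritative package list and versions are in the notebook itself, so treat these package names as placeholders:

# Run in a notebook code cell -- the exact packages and versions are in the notebook.
!pip install --upgrade ibm-watson-machine-learning
!pip install pandas seaborn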

  • Section 2.0 Load and Clean data will load the data set we will use to build out the machine learning model. In order to import the data into the notebook, we are going to use the code generation capability of Watson Studio.
    • Highlight the code cell shown in the image below by clicking it. Ensure you place the cursor below the commented line.
    • Click the 01/00 "Find data" icon in the upper right of the notebook to find the data asset you need to import.
    • If you are following the Cloud Pak for Data Learning Path, choose the Files tab, and pick the virtualized data set that has all three joined tables (i.e. User<xyz>.BILLINGPRODUCTSCUSTOMERS). Click Insert to code and choose pandas DataFrame.

Add remote Pandas DataFrame

  • Otherwise, if you are using this notebook without virtualized data, you can use the Telco-Customer-Churn.csv file version of the data set that has been included in this project and was uploaded to the Cloud Pak for Data project in Step 3. Choose the Files tab. Select the Telco-Customer-Churn.csv file. Click Insert to code and choose pandas DataFrame.

Add local Pandas DataFrame

  • The code to bring the data into the notebook environment and create a Pandas DataFrame will be added to the cell.
  • Run the cell and you will see the first five rows of the dataset.

Generated code to handle Pandas DataFrame

IMPORTANT: Since we are using generated code to import the data, you will need to update the next cell to assign the df variable. Copy the variable name that was generated in the previous cell (it will look like df_data_1, df_data_2, etc.) and assign it to the df variable (for example, df = df_data_1).
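
For reference, the end result looks roughly like the sketch below. The generated cell actually uses the project's data-access helpers rather than a plain read_csv, and the variable name it produces may differ, but the important part is the final assignment to df:

import pandas as pd

# Watson Studio inserts a cell that loads the CSV into a DataFrame named
# something like df_data_1; reading the file directly is the equivalent idea.
df_data_1 = pd.read_csv("Telco-Customer-Churn.csv")

# Assign the generated variable to df, which the rest of the notebook uses.
df = df_data_1
df.head()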

  • Continue to run the remaining cells in section 2 to explore and clean the data (typical cleaning steps are sketched below).
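
The notebook contains the authoritative cleaning cells; as a hedged sketch, cleaning for this dataset usually means converting TotalCharges to a numeric type and dropping values that are not useful as features:

# TotalCharges is read as a string and is blank for brand-new customers;
# coerce it to numeric and drop the few rows that cannot be converted.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
df = df.dropna(subset=["TotalCharges"])

# customerID is an identifier, not a predictive feature.
df = df.drop(columns=["customerID"])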

  • Section 3.0 Create a model contains the cells that build the model pipeline.

    • We will split our data into training and test sets, encode the categorical string values, create a model using the Random Forest Classifier algorithm, and evaluate the model against the test set. A simplified sketch of this pipeline follows the figure below.
    • Run all the cells in section 3 to build the model.

Building the pipeline and model
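
The notebook indexes every categorical column and assembles all of the features; the following is a reduced sketch of the same idea using Spark MLlib. Only a handful of columns are shown, so the column choices and stage names here are illustrative rather than the notebook's exact code:

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame(df)  # df is the pandas DataFrame loaded earlier

# Index the label and (for illustration) one categorical column; the notebook
# indexes every categorical string column in the dataset.
label_indexer = StringIndexer(inputCol="Churn", outputCol="label")
contract_indexer = StringIndexer(inputCol="Contract", outputCol="Contract_IX")

assembler = VectorAssembler(
    inputCols=["Contract_IX", "tenure", "MonthlyCharges", "TotalCharges"],
    outputCol="features",
)

rf = RandomForestClassifier(labelCol="label", featuresCol="features")
pipeline = Pipeline(stages=[label_indexer, contract_indexer, assembler, rf])

# Split into training and test sets, fit the pipeline, and evaluate on the test set.
train, test = spark_df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)

predictions = model.transform(test)
evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy"
)
print("Accuracy:", evaluator.evaluate(predictions))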

  • Section 4.0 Save the model will save the model to your project.

  • We will be saving and deploying the model to the Watson Machine Learning service within our Cloud Pak for Data platform. In the next code cell, be sure to update the wml_credentials variable.

    • The url should be the hostname of the Cloud Pak for Data instance.
    • The username and password should be the same credentials you used to log into Cloud Pak for Data.
  • Update the MODEL_NAME variable and provide a unique and easily identifiable model name. Next, update the DEPLOYMENT_SPACE_NAME variable, providing the name of your deployment space which was created in Step 2 above.

Provide model and deployment space name

Update WML credentials

Continue to run the cells in the section to save the model to Cloud Pak for Data. We'll be able to test it out with the Cloud Pak for Data tools in just a few minutes!
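
As a minimal sketch of what these cells do, assuming the ibm-watson-machine-learning Python client: the notebook contains the authoritative code, and fields such as instance_id, version, and the software specification name vary with your Cloud Pak for Data release, so treat the values below as placeholders.

from ibm_watson_machine_learning import APIClient

# Placeholders -- use your own cluster hostname and Cloud Pak for Data login.
wml_credentials = {
    "url": "https://<cluster-url>",
    "username": "<username>",
    "password": "<password>",
    "instance_id": "wml_local",   # assumption: value depends on your CP4D release
    "version": "3.5",             # assumption: value depends on your CP4D release
}
MODEL_NAME = "my-telco-churn-model"            # any unique, recognizable name
DEPLOYMENT_SPACE_NAME = "my-deployment-space"  # the space created in Step 2

client = APIClient(wml_credentials)

# Find the deployment space by name and make it the default target.
spaces = client.spaces.get_details()["resources"]
space_id = next(s["metadata"]["id"] for s in spaces
                if s["entity"]["name"] == DEPLOYMENT_SPACE_NAME)
client.set.default_space(space_id)

# Store the trained Spark pipeline model (from the sketch above) in the space.
software_spec_uid = client.software_specifications.get_uid_by_name("spark-mllib_2.4")
metadata = {
    client.repository.ModelMetaNames.NAME: MODEL_NAME,
    client.repository.ModelMetaNames.TYPE: "mllib_2.4",
    client.repository.ModelMetaNames.SOFTWARE_SPEC_UID: software_spec_uid,
}
stored_model = client.repository.store_model(
    model=model, meta_props=metadata, training_data=train, pipeline=pipeline)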

Note: You can use the following cell for cleaning up any previously created models and deployments.

Clean up models and deployments

6. Deploying the model using the Cloud Pak for Data UI

Now that we have created a model and saved it to our repository, we will want to deploy the model so it can be used by others.

We will be creating an online deployment. This type of deployment will make an instance of the model available to make predictions in real time via an API.

Although we use the Cloud Pak for Data UI to deploy the model here, the same can also be done programmatically.
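
For example, the same Watson Machine Learning Python client used in the notebook can create the online deployment. This is a hedged sketch, assuming the client and stored_model objects from the save step above:

# Assumes `client` and `stored_model` from the notebook's save-model cells.
model_uid = client.repository.get_model_uid(stored_model)

deployment_props = {
    client.deployments.ConfigurationMetaNames.NAME: "telco-churn-online-deployment",
    client.deployments.ConfigurationMetaNames.ONLINE: {},
}
deployment = client.deployments.create(artifact_uid=model_uid, meta_props=deployment_props)

# The scoring endpoint is what external applications call to get predictions.
scoring_url = client.deployments.get_scoring_href(deployment)
print(scoring_url)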

  • Navigate to the left-hand (☰) hamburger menu and choose Analyze -> Analytics deployments:

Analytics Analyze deployments

  • Choose the deployment space you set up previously by clicking on the name of the space.

Deployment space

  • In your space overview, click the model name that you just built in the notebook:

select model

  • Click Create deployment on the top-right corner.

Actions Deploy model

  • On the 'Create a deployment' screen, choose Online for the Deployment Type, give the Deployment a name and an optional description and click Create:

Online Deployment Create

  • The Deployment will show as In progress and then switch to Deployed when done.

Status Deployed

7. Testing the model

Cloud Pak for Data offers tools to quickly test out Watson Machine Learning models. We begin with the built-in tooling.

  • Click on the deployment. The Deployment API reference tab shows how to use the model using cURL, Java, JavaScript, Python, and Scala. Click on the corresponding tab to get the code snippet in the language that you want to use:

Deployment API reference

Test the saved model with built-in tooling

  • To get to the built-in test tool, click on the Test tab. Click on the Provide input data as JSON icon and paste the following data under Body:
{
   "input_data":[
      {
         "fields":[
            "gender",
            "SeniorCitizen",
            "Partner",
            "Dependents",
            "tenure",
            "PhoneService",
            "MultipleLines",
            "InternetService",
            "OnlineSecurity",
            "OnlineBackup",
            "DeviceProtection",
            "TechSupport",
            "StreamingTV",
            "StreamingMovies",
            "Contract",
            "PaperlessBilling",
            "PaymentMethod",
            "MonthlyCharges",
            "TotalCharges"
         ],
         "values":[
            [
               "Female",
               0,
               "No",
               "No",
               1,
               "No",
               "No phone service",
               "DSL",
               "No",
               "No",
               "No",
               "No",
               "No",
               "No",
               "Month-to-month",
               "No",
               "Bank transfer (automatic)",
               25.25,
               25.25
            ]
         ]
      }
   ]
}
  • Click the Predict button and the model will be called with the input data. The results will display in the Result window. Scroll down to the bottom (Line #114) to see either a "Yes" or a "No" for Churn:

Testing the deployed model

Test the deployed model with cURL

Now that the model is deployed, we can also test it from external applications. One way to invoke the model API is using the cURL command.

NOTE: Windows users will need the cURL command. It is recommended to download Git Bash for this, as you will also get other useful tools and will be able to easily use the shell environment variables in the following steps. Also note that if you are not using Git Bash, you may need to change the export commands to set commands.

  • In a terminal window (or command prompt in Windows), run the following command to get a token to access the API. Use your Cloud Pak for Data cluster username and password:
curl -k -X GET https://<cluster-url>/v1/preauth/validateAuth -u <username>:<password>
  • A JSON string will be returned with a value for "accessToken" that will look similar to this:
{"username":"snyk","role":"Admin","permissions":["access_catalog","administrator","manage_catalog","can_provision"],"sub":"snyk","iss":"KNOXSSO","aud":"DSX","uid":"1000331002","authenticator":"default","accessToken":"eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9.eyJ1c2VybmFtZSI6InNueWstYWRtaW4iLCJyb2xlIjoiQWRtaW4iLCJwZXJtaXNzaW9ucyI6WyJhZG1pbmlzdHJhdG9yIiwiY2FuX3Byb3Zpc2lvbiIsIm1hbmFnZV9jYXRhbG9nIiwibWFuYWdlX3F1YWxpdHkiLCJtYW5hZ2VfaW5mb3JtYXRpb25fYXNzZXRzIiwibWFuYWdlX2Rpc2NvdmVyeSIsIm1hbmFnZV9tZXRhZGF0YV9pbXBvcnQiLCJtYW5hZ2VfZ292ZXJuYW5jZV93b3JrZmxvdyIsIm1hbmFnZV9jYXRlZ29yaWVzIiwiYXV0aG9yX2dvdmVycmFuY2VfYXJ0aWZhY3RzIiwiYWNjZXNzX2NhdGFsb2ciLCJhY2Nlc3NfaW5mb3JtYXRpb25fYXNzZXRzIiwidmlld19xdWFsaXR5Iiwic2lnbl9pbl9vbmx5Il0sInN1YiI6InNueWstYWRtaW4iLCJpc3MiOiJLTk9YU1NPIiwiYXVkIjoiRFNYIiwidWlkIjoiMTAwMDMzMTAwMiIsImF1dGhlbnRpY2F0b3IiOiJkZWZhdWx0IiwiaWp0IjoxNTkyOTI3MjcxLCJleHAiOjE1OTI5NzA0MzV9.MExzML-45SAWhrAK6FQG5gKAYAseqdCpublw3-OpB5OsdKJ7isMqXonRpHE7N7afiwU0XNrylbWZYc8CXDP5oiTLF79zVX3LAWlgsf7_E2gwTQYGedTpmPOJgtk6YBSYIB7kHHMYSflfNSRzpF05JdRIacz7LNofsXAd94Xv9n1T-Rxio2TVQ4d91viN9kTZPTKGOluLYsRyMEtdN28yjn_cvjH_vg86IYUwVeQOSdI97GHLwmrGypT4WuiytXRoQiiNc-asFp4h1JwEYkU97ailr1unH8NAKZtwZ7-yy1BPDOLeaR5Sq6mYNIICyXHsnB_sAxRIL3lbBN87De4zAg","_messageCode_":"success","message":"success"}
  • Use the export command to save the "accessToken" part of this response in the terminal window to a variable called WML_AUTH_TOKEN.
export WML_AUTH_TOKEN=<value-of-access-token>
  • Back on the model deployment page, gather the URL to invoke the model from the API reference by copying the Endpoint, and export it to a variable called URL:

Model Deployment Endpoint

export URL=https://blahblahblah.com

Now run this curl command from a terminal window to invoke the model with the same payload that was used previously:

curl -k -X POST --header 'Content-Type: application/json' --header 'Accept: application/json' --header "Authorization: Bearer $WML_AUTH_TOKEN" -d '{"input_data": [{"fields": ["gender","SeniorCitizen","Partner","Dependents","tenure","PhoneService","MultipleLines","InternetService","OnlineSecurity","OnlineBackup","DeviceProtection","TechSupport","StreamingTV","StreamingMovies","Contract","PaperlessBilling","PaymentMethod","MonthlyCharges","TotalCharges"],"values": [["Female",0,"No","No",1,"No","No phone service","DSL","No","No","No","No","No","No","Month-to-month","No","Bank transfer (automatic)",25.25,25.25]]}]}' $URL

A JSON string similar to the one below will be returned, including a "Yes" or "No" at the end indicating the prediction of whether the customer will churn.

{"predictions":[{"fields":["gender","SeniorCitizen","Partner","Dependents","tenure","PhoneService","MultipleLines","InternetService","OnlineSecurity","OnlineBackup","DeviceProtection","TechSupport","StreamingTV","StreamingMovies","Contract","PaperlessBilling","PaymentMethod","MonthlyCharges","TotalCharges","gender_IX","Partner_IX","Dependents_IX","PhoneService_IX","MultipleLines_IX","InternetService_IX","OnlineSecurity_IX","OnlineBackup_IX","DeviceProtection_IX","TechSupport_IX","StreamingTV_IX","StreamingMovies_IX","Contract_IX","PaperlessBilling_IX","PaymentMethod_IX","label","features","rawPrediction","probability","prediction","predictedLabel"],"values":[["Female",0,"No","No",1,"No","No phone service","DSL","No","No","No","No","No","No","Month-to-month","No","Bank transfer (automatic)",25.25,25.25,1.0,0.0,0.0,1.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,0.0,[18,[0,4,5,6,14,15,16,17],[1.0,1.0,2.0,1.0,1.0,2.0,25.25,25.25]],[10.806165651100262,9.193834348899738],[0.5403082825550131,0.45969171744498694],0.0,"No"]]}]}

8. Create a Python Flask app that uses the model

You can also access the online model deployment directly through the REST API. This allows you to use your model for inference in any of your apps. For this code pattern, we'll be using a Python Flask application to collect information, score it against the model, and show the results.
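
As an illustration of that REST call from Python, here is a hedged sketch (not the Flask app's exact code); the URLs are placeholders that correspond to the .env settings described below:

import os
import requests

# Placeholder values -- the Flask app reads the real ones from the .env file.
AUTH_URL = "https://<cluster-url>/v1/preauth/validateAuth"
MODEL_URL = "https://<cluster-url>/v4/deployments/<deployment-id>/predictions"
USERNAME = os.environ.get("AUTH_USERNAME", "<username>")
PASSWORD = os.environ.get("AUTH_PASSWORD", "<password>")

# 1. Get a short-lived access token using basic auth (same call as the cURL step).
token = requests.get(AUTH_URL, auth=(USERNAME, PASSWORD), verify=False).json()["accessToken"]

# 2. Score the same payload used in the cURL example.
payload = {
    "input_data": [{
        "fields": ["gender", "SeniorCitizen", "Partner", "Dependents", "tenure",
                   "PhoneService", "MultipleLines", "InternetService", "OnlineSecurity",
                   "OnlineBackup", "DeviceProtection", "TechSupport", "StreamingTV",
                   "StreamingMovies", "Contract", "PaperlessBilling", "PaymentMethod",
                   "MonthlyCharges", "TotalCharges"],
        "values": [["Female", 0, "No", "No", 1, "No", "No phone service", "DSL",
                    "No", "No", "No", "No", "No", "No", "Month-to-month", "No",
                    "Bank transfer (automatic)", 25.25, 25.25]],
    }]
}
response = requests.post(MODEL_URL, json=payload,
                         headers={"Authorization": f"Bearer {token}"}, verify=False)
print(response.json())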

Install dependencies

NOTE: This application requires Python 3.6 or later; the instructions below assume Python 3.6+.

The general recommendation for Python development is to use a virtual environment (venv). To install and initialize a virtual environment, use the venv module:

In a terminal, go to the flaskapp folder within the cloned repo directory.

git clone https://github.com/IBM/telco-customer-churn-on-icp4d/
cd telco-customer-churn-on-icp4d/flaskapp

Initialize a virtual environment with venv.

# Create the virtual environment using Python. 
# Note, it may be named python3 on your system.
python -m venv venv       # Python 3.X

# Source the virtual environment. Use one of the two commands depending on your OS.
source venv/bin/activate  # Mac or Linux
./venv/Scripts/activate   # Windows PowerShell

TIP To terminate the virtual environment use the deactivate command.

Finally, install the Python requirements.

pip install -r requirements.txt

Update environment variables

It is best practice to store configurable information as environment variables instead of hard-coding important values. To reference our model and supply credentials, we will pass these values in via a file; the key-value pairs in this file are read as environment variables.

Copy the env.sample file to .env.

cp env.sample .env

Edit the .env file and fill in the MODEL_URL as well as the AUTH_URL, AUTH_USERNAME, and AUTH_PASSWORD.

  • MODEL_URL is your web service URL for scoring which you got from the section above
  • AUTH_URL is the preauth url of your CloudPak4Data and will look like this: https://<cluster_url>/v1/preauth/validateAuth
  • AUTH_USERNAME is your username with which you login to the CloudPak4Data environment
  • AUTH_PASSWORD is your password with which you login to the CloudPak4Data environment

NOTE: Alternatively, you can fill in the AUTH_TOKEN instead of AUTH_URL, AUTH_USERNAME, and AUTH_PASSWORD. You will have generated this token in the section above. However, since tokens expire after a few hours and you would need to restart your app to update the token, this option is not suggested. Instead, if you use the username/password option, the app can generate a new token every time for you so it will always use a non-expired token.

# Copy this file to .env.
# Edit the .env file with the required settings before starting the app.

# 1. Required: Provide your web service URL for scoring.
# E.g., MODEL_URL=https://<cluster_url>/v4/deployments/<deployment_space_guid>/predictions
MODEL_URL=


# 2. Required: fill in EITHER section A OR B below:

# ### A: Authentication using username and password
#   Fill in the authentication URL, your CloudPak4Data username, and CloudPak4Data password.
#   Example:
#     AUTH_URL=<cluster_url>/v1/preauth/validateAuth
#     AUTH_USERNAME=my_username
#     AUTH_PASSWORD=super_complex_password
AUTH_URL=
AUTH_USERNAME=
AUTH_PASSWORD=

# ### B: (advanced) Provide your bearer token.
#   Uncomment the "AUTH_TOKEN=" line below and fill in your bearer token.
#   You can generate this token by following the lab instructions. This token should start with "Bearer ".
#   Note that these tokens expire after a few hours, so you'll need to generate a new one later.
#   Example:
#       AUTH_TOKEN=Bearer abCdwFghIjKLMnO1PqRsTuV2wWX3YzaBCDE4.fgH1r2... (and so on, tokens are long).
# AUTH_TOKEN=


# Optional: You can override the server's host and port here.
HOST=0.0.0.0
PORT=5000

Start the application

Start the flask server by running the following command:

python telcochurn.py

Use your browser to go to http://0.0.0.0:5000 and try it out.

TIP: Use ctrl+c to stop the Flask server when you are done.

Sample output

Enter some sample values into the form:

Input a bunch of data...

Click the Submit button and the churn percentage is returned:

Get the churn percentage as a result

Learn more

  • Artificial Intelligence Code Patterns: Enjoyed this Code Pattern? Check out our other AI Code Patterns.
  • Data Analytics Code Patterns: Enjoyed this Code Pattern? Check out our other Data Analytics Code Patterns.
  • AI and Data Code Pattern Playlist: Bookmark our playlist with all of our Code Pattern videos.
  • With Watson: Want to take your Watson app to the next level? Looking to utilize Watson Brand assets? Join the With Watson program to leverage exclusive brand, marketing, and tech resources to amplify and accelerate your Watson embedded commercial solution.
  • IBM Watson Studio: Master the art of data science with IBM's Watson Studio.

License

This code pattern is licensed under the Apache Software License, Version 2. Separate third party code objects invoked within this code pattern are licensed by their respective providers pursuant to their own separate licenses. Contributions are subject to the Developer Certificate of Origin, Version 1.1 (DCO) and the Apache Software License, Version 2.

Apache Software License (ASL) FAQ


