
Integrating PCA in Pipelines - Lab

Introduction

In a previous section, you learned how to use pipelines in scikit-learn to combine several supervised learning steps into a single, manageable workflow. In this lab, you will integrate PCA along with classifiers in a pipeline.

Objectives

In this lab you will:

  • Integrate PCA in scikit-learn pipelines

The Data Science Workflow

You will be following the data science workflow:

  1. Initial data inspection, exploratory data analysis, and cleaning
  2. Feature engineering and selection
  3. Create a baseline model
  4. Create a machine learning pipeline and compare results with the baseline model
  5. Interpret the model and draw conclusions

Initial data inspection, exploratory data analysis, and cleaning

You'll use a dataset created by the Otto group, which was also used in a Kaggle competition. The description of the dataset is as follows:

The Otto Group is one of the world’s biggest e-commerce companies, with subsidiaries in more than 20 countries, including Crate & Barrel (USA), Otto.de (Germany) and 3 Suisses (France). They are selling millions of products worldwide every day, with several thousand products being added to their product line.

A consistent analysis of the performance of their products is crucial. However, due to their global infrastructure, many identical products get classified differently. Therefore, the quality of product analysis depends heavily on the ability to accurately cluster similar products. The better the classification, the more insights the Otto Group can generate about their product range.

In this lab, you'll use a dataset containing:

  • A column id, which is an anonymous id unique to a product
  • 93 columns feat_1, feat_2, ..., feat_93, which are the various features of a product
  • A column target - the class of a product

The dataset is stored in the 'otto_group.csv' file. Import this file into a DataFrame called data, and then:

  • Check for missing values
  • Check the distribution of columns
  • ... and anything else that comes to mind to explore the data
# Your code here
# Your code here
# Your code here
# Your code here
# Your code here
# Your code here
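One way these cells might look (a minimal sketch, assuming 'otto_group.csv' sits in the working directory; exactly which checks you run is up to you):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
data = pd.read_csv('otto_group.csv')

# Basic inspection: shape, first rows
print(data.shape)
print(data.head())

# Check for missing values across all columns
print(data.isna().sum().sum())

# Class balance of the target
print(data['target'].value_counts())

# Distribution of the columns: histograms of every numeric feature
data.drop(columns='id').hist(figsize=(16, 16), bins=20)
plt.show()
```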

If you look at the histograms, you can tell that a lot of the features are zero-inflated: most variables consist mostly of zeros, with occasional higher values here and there. The data are far from normally distributed, but for most machine learning techniques this is not an issue.

# Your code here
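A sketch of how this cell could quantify the zero inflation visible in the histograms (reuses the `data` frame loaded above):

```python
# Fraction of zeros per feature column
feature_cols = [c for c in data.columns if c.startswith('feat_')]
zero_fraction = (data[feature_cols] == 0).mean().sort_values(ascending=False)

# Summary of how zero-inflated the features are
print(zero_fraction.describe())
print(zero_fraction.head())
```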

Because there are so many zeros, most values above zero will look like outliers. The safe decision for this data is to keep all observations and see what happens: the data are sparse, so the occasional high values may be highly informative. Moreover, without an intuitive meaning for each feature, we can't tell whether a value of ~260 is actually an outlier.

# Your code here
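One way this cell might inspect how extreme the largest values are (the selected statistics are an illustrative choice):

```python
# Compare median, mean, and max per feature to see how skewed the tails are
summary = data[feature_cols].describe().T
print(summary[['50%', 'mean', 'max']].sort_values('max', ascending=False).head(10))
```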

Feature engineering and selection with PCA

Have a look at the correlation structure of your features using a heatmap.

# Your code here
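A minimal sketch, assuming seaborn is available and `feature_cols` is defined as above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation structure of the 93 features, visualized as a heatmap
corr = data[feature_cols].corr()
plt.figure(figsize=(14, 12))
sns.heatmap(corr, center=0, cmap='coolwarm')
plt.show()
```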

Use PCA to reduce the dimensionality of your features while still keeping 80% of the explained variance.

# Your code here
# Your code here
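One possible approach (standardizing before PCA is a common choice so that no single feature dominates the components, not the only valid one):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = data[feature_cols]
y = data['target']

# Standardize so every feature contributes on the same scale
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components, then count how many reach 80% cumulative variance
pca = PCA()
pca.fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cumulative >= 0.80)) + 1
print(f'{n_components} components retain {cumulative[n_components - 1]:.1%} of the variance')
```

Note that scikit-learn can also do this selection for you: passing a float, as in `PCA(n_components=0.80)`, keeps just enough components to reach 80% explained variance, which is convenient inside a pipeline.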

Create a train-test split with a test size of 40%

This is a relatively big dataset, so you can afford to assign 40% to the test set. Set the random_state to 42.

# Your code here
# Your code here
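A sketch, reusing `X` and `y` from the PCA cell above:

```python
from sklearn.model_selection import train_test_split

# 60/40 split, reproducible via random_state=42
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.40, random_state=42)
```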

Create a baseline model

Create your baseline model in a pipeline setting. In the pipeline:

  • Your first step will be to reduce the feature space with PCA, keeping only the number of components needed to retain 80% of the explained variance (as determined above)
  • Your second step will be to build a basic logistic regression model

Make sure to fit the model using the training set and test the result by obtaining the accuracy using the test set. Set the random_state to 123.

# Your code here
# Your code here
# Your code here
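A sketch of the baseline pipeline; the `max_iter=1000` setting is an illustrative choice to help the solver converge on this many samples, not part of the lab's instructions:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Baseline: scale -> PCA keeping 80% of the variance -> logistic regression
baseline_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.80)),
    ('clf', LogisticRegression(random_state=123, max_iter=1000)),
])

baseline_pipe.fit(X_train, y_train)
print('Baseline test accuracy:', baseline_pipe.score(X_test, y_test))
```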

Create a pipeline consisting of a linear SVM, a simple decision tree, and a simple random forest classifier

Repeat the above, but now create three different pipelines:

  • One for a standard linear SVM
  • One for a default decision tree
  • One for a random forest classifier
# Your code here
# ⏰ This cell may take several minutes to run
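A sketch that loops over the three classifiers, reusing the `Pipeline`, `StandardScaler`, and `PCA` imports from the baseline cell:

```python
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# One pipeline per classifier, all sharing the same scale -> PCA front end
classifiers = {
    'linear SVM': LinearSVC(random_state=123),
    'decision tree': DecisionTreeClassifier(random_state=123),
    'random forest': RandomForestClassifier(random_state=123),
}

for name, clf in classifiers.items():
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('pca', PCA(n_components=0.80)),
        ('clf', clf),
    ])
    pipe.fit(X_train, y_train)
    print(f'{name}: test accuracy = {pipe.score(X_test, y_test):.3f}')
```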

Pipeline with grid search

Construct two pipelines with grid search:

  • One for random forests - aim for a grid of around 40 candidate models
  • One for the AdaBoost algorithm

Random Forest pipeline with grid search

# Your code here 
# imports
# Your code here
# ⏰ This cell may take a long time to run!
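One illustrative grid that yields exactly 40 candidates (2 × 5 × 2 × 2); the specific hyperparameter values are assumptions, not the lab's official grid:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

rf_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.80)),
    ('clf', RandomForestClassifier(random_state=123)),
])

# 2 x 5 x 2 x 2 = 40 candidate models; parameter names are prefixed
# with the pipeline step name ('clf__')
rf_param_grid = {
    'clf__n_estimators': [30, 100],
    'clf__max_depth': [4, 6, 8, 10, None],
    'clf__min_samples_split': [2, 10],
    'clf__min_samples_leaf': [1, 4],
}

rf_gridsearch = GridSearchCV(rf_pipe, rf_param_grid, cv=3, n_jobs=-1)
rf_gridsearch.fit(X_train, y_train)
print(rf_gridsearch.best_params_)
print('Best CV accuracy:', rf_gridsearch.best_score_)
```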

Use your grid search object along with .cv_results_ to get the full result overview:

# Your code here 
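A sketch: `cv_results_` is a dict of arrays, so wrapping it in a DataFrame makes it readable.

```python
import pandas as pd

# Full overview of every candidate model, best-ranked first
results = pd.DataFrame(rf_gridsearch.cv_results_)
cols = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
print(results[cols].sort_values('rank_test_score'))
```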

AdaBoost

# Your code here
# ⏰ This cell may take several minutes to run
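A sketch along the same lines as the random forest search; the grid values are again illustrative choices:

```python
from sklearn.ensemble import AdaBoostClassifier

ada_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.80)),
    ('clf', AdaBoostClassifier(random_state=123)),
])

# 3 x 2 = 6 candidate models
ada_param_grid = {
    'clf__n_estimators': [50, 100, 250],
    'clf__learning_rate': [0.5, 1.0],
}

ada_gridsearch = GridSearchCV(ada_pipe, ada_param_grid, cv=3, n_jobs=-1)
ada_gridsearch.fit(X_train, y_train)
print(ada_gridsearch.best_params_)
print('Best CV accuracy:', ada_gridsearch.best_score_)
```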

Use your grid search object along with .cv_results_ to get the full result overview:

# Your code here 

Level-up (Optional): SVM pipeline with grid search

As extra level-up work, construct a pipeline with grid search for support vector machines.

  • Make sure your grid isn't too big. You'll see it takes quite a while to fit SVMs with non-linear kernel functions!
# Your code here
# ⏰ This cell may take a very long time to run!
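A deliberately small illustrative grid (2 × 3 = 6 candidates), since non-linear kernels are expensive on a dataset this size; the kernel and C values are assumptions:

```python
from sklearn.svm import SVC

svm_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.80)),
    ('clf', SVC(random_state=123)),
])

# Keep the grid small to limit runtime
svm_param_grid = {
    'clf__kernel': ['linear', 'rbf'],
    'clf__C': [0.1, 1, 10],
}

svm_gridsearch = GridSearchCV(svm_pipe, svm_param_grid, cv=3, n_jobs=-1)
svm_gridsearch.fit(X_train, y_train)
print(svm_gridsearch.best_params_)
print('Best CV accuracy:', svm_gridsearch.best_score_)
```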

Use your grid search object along with .cv_results_ to get the full result overview:

# Your code here 

Note

Note that this solution is only one of many options. The results in the Random Forest and AdaBoost models show that there is a lot of improvement possible by tuning the hyperparameters further, so make sure to explore this yourself!

Summary

Great! You've gotten a lot of practice in using PCA in pipelines. What algorithm would you choose and why?
