
Integrating PCA in Pipelines - Lab

Introduction

In a previous section, you learned how to use pipelines in scikit-learn to combine several supervised learning steps into a single, manageable workflow. In this lab, you will integrate PCA along with classifiers in a pipeline.

Objectives

In this lab you will:

  • Integrate PCA in scikit-learn pipelines

The Data Science Workflow

You will be following the data science workflow:

  1. Initial data inspection, exploratory data analysis, and cleaning
  2. Feature engineering and selection
  3. Create a baseline model
  4. Create a machine learning pipeline and compare results with the baseline model
  5. Interpret the model and draw conclusions

Initial data inspection, exploratory data analysis, and cleaning

You'll use a dataset created by the Otto group, which was also used in a Kaggle competition. The description of the dataset is as follows:

The Otto Group is one of the world’s biggest e-commerce companies, with subsidiaries in more than 20 countries, including Crate & Barrel (USA), Otto.de (Germany) and 3 Suisses (France). They are selling millions of products worldwide every day, with several thousand products being added to their product line.

A consistent analysis of the performance of their products is crucial. However, due to their global infrastructure, many identical products get classified differently. Therefore, the quality of product analysis depends heavily on the ability to accurately cluster similar products. The better the classification, the more insights the Otto Group can generate about their product range.

In this lab, you'll use a dataset containing:

  • A column id, which is an anonymous id unique to a product
  • 93 columns feat_1, feat_2, ..., feat_93, which are the various features of a product
  • A column target - the class of a product

The dataset is stored in the 'otto_group.csv' file. Import this file into a DataFrame called data, and then:

  • Check for missing values
  • Check the distribution of columns
  • ... and anything else that comes to mind to explore the data
# Your code here
# Your code here
# Your code here
# Your code here
# Your code here
# Your code here
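One way these cells might look (a minimal sketch, assuming 'otto_group.csv' sits in the working directory; exactly which checks you run is up to you):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset
data = pd.read_csv('otto_group.csv')

# Basic inspection: shape, first rows
print(data.shape)
print(data.head())

# Check for missing values across all columns
print(data.isna().sum().sum())

# Class balance of the target
print(data['target'].value_counts())

# Distribution of the columns: histograms of every numeric feature
data.drop(columns='id').hist(figsize=(16, 16), bins=20)
plt.show()
```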

If you look at the histograms, you can tell that a lot of the features are zero-inflated: most variables consist mostly of zeros, with occasional higher values here and there. The data are far from normally distributed, but for most machine learning techniques this is not an issue.

# Your code here
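A sketch of how this cell could quantify the zero inflation visible in the histograms (reuses the `data` frame loaded above):

```python
# Fraction of zeros per feature column
feature_cols = [c for c in data.columns if c.startswith('feat_')]
zero_fraction = (data[feature_cols] == 0).mean().sort_values(ascending=False)

# Summary of how zero-inflated the features are
print(zero_fraction.describe())
print(zero_fraction.head())
```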

Because there are so many zeros, most values above zero will look like outliers. The safe decision for this data is to keep all observations and see what happens: the data are sparse, so the occasional high values may be highly informative. Moreover, without an intuitive meaning for each feature, we can't tell whether a value of ~260 is actually an outlier.

# Your code here
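One way this cell might inspect how extreme the largest values are (the selected statistics are an illustrative choice):

```python
# Compare median, mean, and max per feature to see how skewed the tails are
summary = data[feature_cols].describe().T
print(summary[['50%', 'mean', 'max']].sort_values('max', ascending=False).head(10))
```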

Feature engineering and selection with PCA

Have a look at the correlation structure of your features using a heatmap.

# Your code here
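A minimal sketch, assuming seaborn is available and `feature_cols` is defined as above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation structure of the 93 features, visualized as a heatmap
corr = data[feature_cols].corr()
plt.figure(figsize=(14, 12))
sns.heatmap(corr, center=0, cmap='coolwarm')
plt.show()
```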

Use PCA to reduce the dimensionality of your features while still keeping 80% of the explained variance.

# Your code here
# Your code here
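One possible approach (standardizing before PCA is a common choice so that no single feature dominates the components, not the only valid one):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = data[feature_cols]
y = data['target']

# Standardize so every feature contributes on the same scale
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components, then count how many reach 80% cumulative variance
pca = PCA()
pca.fit(X_scaled)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cumulative >= 0.80)) + 1
print(f'{n_components} components retain {cumulative[n_components - 1]:.1%} of the variance')
```

Note that scikit-learn can also do this selection for you: passing a float, as in `PCA(n_components=0.80)`, keeps just enough components to reach 80% explained variance, which is convenient inside a pipeline.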

Create a train-test split with a test size of 40%

This is a relatively big dataset, so you can afford to assign 40% to the test set. Set the random_state to 42.

# Your code here
# Your code here
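A sketch, reusing `X` and `y` from the PCA cell above:

```python
from sklearn.model_selection import train_test_split

# 60/40 split, reproducible via random_state=42
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.40, random_state=42)
```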

Create a baseline model

Create your baseline model in a pipeline setting. In the pipeline:

  • Your first step will be to reduce the feature space with PCA, keeping only the number of components needed to retain 80% of the explained variance (as determined above)
  • Your second step will be to build a basic logistic regression model

Make sure to fit the model using the training set and test the result by obtaining the accuracy using the test set. Set the random_state to 123.

# Your code here
# Your code here
# Your code here
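A sketch of the baseline pipeline; the `max_iter=1000` setting is an illustrative choice to help the solver converge on this many samples, not part of the lab's instructions:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Baseline: scale -> PCA keeping 80% of the variance -> logistic regression
baseline_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.80)),
    ('clf', LogisticRegression(random_state=123, max_iter=1000)),
])

baseline_pipe.fit(X_train, y_train)
print('Baseline test accuracy:', baseline_pipe.score(X_test, y_test))
```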

Create a pipeline consisting of a linear SVM, a simple decision tree, and a simple random forest classifier

Repeat the above, but now create three different pipelines:

  • One for a standard linear SVM
  • One for a default decision tree
  • One for a random forest classifier
# Your code here
# ⏰ This cell may take several minutes to run
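A sketch that loops over the three classifiers, reusing the `Pipeline`, `StandardScaler`, and `PCA` imports from the baseline cell:

```python
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# One pipeline per classifier, all sharing the same scale -> PCA front end
classifiers = {
    'linear SVM': LinearSVC(random_state=123),
    'decision tree': DecisionTreeClassifier(random_state=123),
    'random forest': RandomForestClassifier(random_state=123),
}

for name, clf in classifiers.items():
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('pca', PCA(n_components=0.80)),
        ('clf', clf),
    ])
    pipe.fit(X_train, y_train)
    print(f'{name}: test accuracy = {pipe.score(X_test, y_test):.3f}')
```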

Pipeline with grid search

Construct two pipelines with grid search:

  • One for random forests - aim for a grid of around 40 candidate models
  • One for the AdaBoost algorithm

Random Forest pipeline with grid search

# Your code here 
# imports
# Your code here
# ⏰ This cell may take a long time to run!
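One illustrative grid that yields exactly 40 candidates (2 × 5 × 2 × 2); the specific hyperparameter values are assumptions, not the lab's official grid:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

rf_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.80)),
    ('clf', RandomForestClassifier(random_state=123)),
])

# 2 x 5 x 2 x 2 = 40 candidate models; parameter names are prefixed
# with the pipeline step name ('clf__')
rf_param_grid = {
    'clf__n_estimators': [30, 100],
    'clf__max_depth': [4, 6, 8, 10, None],
    'clf__min_samples_split': [2, 10],
    'clf__min_samples_leaf': [1, 4],
}

rf_gridsearch = GridSearchCV(rf_pipe, rf_param_grid, cv=3, n_jobs=-1)
rf_gridsearch.fit(X_train, y_train)
print(rf_gridsearch.best_params_)
print('Best CV accuracy:', rf_gridsearch.best_score_)
```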

Use your grid search object along with .cv_results_ to get the full result overview:

# Your code here 
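A sketch: `cv_results_` is a dict of arrays, so wrapping it in a DataFrame makes it readable.

```python
import pandas as pd

# Full overview of every candidate model, best-ranked first
results = pd.DataFrame(rf_gridsearch.cv_results_)
cols = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
print(results[cols].sort_values('rank_test_score'))
```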

AdaBoost

# Your code here
# ⏰ This cell may take several minutes to run
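A sketch along the same lines as the random forest search; the grid values are again illustrative choices:

```python
from sklearn.ensemble import AdaBoostClassifier

ada_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.80)),
    ('clf', AdaBoostClassifier(random_state=123)),
])

# 3 x 2 = 6 candidate models
ada_param_grid = {
    'clf__n_estimators': [50, 100, 250],
    'clf__learning_rate': [0.5, 1.0],
}

ada_gridsearch = GridSearchCV(ada_pipe, ada_param_grid, cv=3, n_jobs=-1)
ada_gridsearch.fit(X_train, y_train)
print(ada_gridsearch.best_params_)
print('Best CV accuracy:', ada_gridsearch.best_score_)
```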

Use your grid search object along with .cv_results_ to get the full result overview:

# Your code here 

Level-up (Optional): SVM pipeline with grid search

As extra level-up work, construct a pipeline with grid search for support vector machines.

  • Make sure your grid isn't too big. You'll see it takes quite a while to fit SVMs with non-linear kernel functions!
# Your code here
# ⏰ This cell may take a very long time to run!
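A deliberately small illustrative grid (2 × 3 = 6 candidates), since non-linear kernels are expensive on a dataset this size; the kernel and C values are assumptions:

```python
from sklearn.svm import SVC

svm_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.80)),
    ('clf', SVC(random_state=123)),
])

# Keep the grid small to limit runtime
svm_param_grid = {
    'clf__kernel': ['linear', 'rbf'],
    'clf__C': [0.1, 1, 10],
}

svm_gridsearch = GridSearchCV(svm_pipe, svm_param_grid, cv=3, n_jobs=-1)
svm_gridsearch.fit(X_train, y_train)
print(svm_gridsearch.best_params_)
print('Best CV accuracy:', svm_gridsearch.best_score_)
```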

Use your grid search object along with .cv_results_ to get the full result overview:

# Your code here 

Note

Note that this solution is only one of many options. The results in the Random Forest and AdaBoost models show that there is a lot of improvement possible by tuning the hyperparameters further, so make sure to explore this yourself!

Summary

Great! You've gotten a lot of practice in using PCA in pipelines. What algorithm would you choose and why?
