Credit Risk Analysis

Overview of Project

Fast Lending is a peer-to-peer lending services company that wants to use machine learning to predict credit risk. The goal is a quicker, more reliable loan experience and more accurate identification of good loan candidates, which should lower default rates. Credit risk is an inherently unbalanced classification problem because good loans far outnumber risky loans. This project therefore used the Python libraries imbalanced-learn and scikit-learn to train and evaluate several supervised machine learning models on unbalanced classes and determine which performs best.

Results

The models are evaluated on their effectiveness based on these four scores:

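  • Balanced Accuracy Score: the average of the recall obtained on each class (high risk and low risk), which measures how accurately the model predicts credit risk when the classes are unbalanced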
  • Precision Score:
    • For High Risk: True Positive/(True Positive + False Positive)
    • For Low Risk: True Negative/(True Negative + False Negative)

  • Recall (or Sensitivity) Score:
    • For High Risk: True Positive/(True Positive + False Negative)
    • For Low Risk: True Negative/(True Negative + False Positive)
  • F1 Score (harmonic mean of precision and recall): 2 * (Precision * Recall)/(Precision + Recall)
    • Best score is 1.0 and the worst score is 0.0.
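
The formulas above can be sketched in a few lines of plain Python. This is only an illustration: the function name `classification_scores` and the confusion-matrix counts are hypothetical and are not taken from the project's results. It assumes every denominator is non-zero.

```python
def classification_scores(tp, fp, fn, tn):
    """Per-class precision, recall and F1, plus balanced accuracy.

    "Positive" is the high risk class; "negative" is the low risk class.
    Assumes no denominator is zero.
    """
    def f1(precision, recall):
        # Harmonic mean of precision and recall.
        return 2 * (precision * recall) / (precision + recall)

    high_precision = tp / (tp + fp)
    high_recall = tp / (tp + fn)
    low_precision = tn / (tn + fn)
    low_recall = tn / (tn + fp)
    return {
        "high_risk": {
            "precision": high_precision,
            "recall": high_recall,
            "f1": f1(high_precision, high_recall),
        },
        "low_risk": {
            "precision": low_precision,
            "recall": low_recall,
            "f1": f1(low_precision, low_recall),
        },
        # Balanced accuracy is the mean of the two per-class recalls.
        "balanced_accuracy": (high_recall + low_recall) / 2,
    }


# Hypothetical counts chosen only to show the calculation.
scores = classification_scores(tp=80, fp=920, fn=20, tn=8980)
print(scores["high_risk"])
print(scores["balanced_accuracy"])
```

With counts like these, high risk recall is high while high risk precision stays low, which is the same pattern the models below show.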

Naive Random Oversampling

RandomOverSampler randomly selects instances of the minority class, with replacement, and adds them to the training set until the majority and minority classes are the same size.
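
A minimal sketch of the idea behind naive random oversampling, written with the standard library only. It is not imbalanced-learn's implementation; the function name `random_oversample` and the toy data are assumptions made for illustration.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=1):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    new_rows, new_labels = list(rows), list(labels)
    for label, count in counts.items():
        # Rows of this class that can be drawn again (sampling with replacement).
        pool = [row for row, y in zip(rows, labels) if y == label]
        for _ in range(target - count):
            new_rows.append(rng.choice(pool))
            new_labels.append(label)
    return new_rows, new_labels


# Toy data: 8 low risk rows and 2 high risk rows.
rows = [[i] for i in range(10)]
labels = ["low_risk"] * 8 + ["high_risk"] * 2

resampled_rows, resampled_labels = random_oversample(rows, labels)
print(Counter(resampled_labels))  # each class now has 8 rows
```

Only the training set would be resampled this way; the test set keeps its original class balance so that the scores reflect real conditions.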

  • Balanced Accuracy Score: 64.4%
  • Precision Score:
    • High Risk: 0.01 (1% of the predicted high risk applicants are actually high risk)
    • Low Risk: 1.00 (100% of the predicted low risk applicants are actually low risk)
  • Recall Score:
    • High Risk: 0.69 (69% of high risk applicants are classified as high risk)
    • Low Risk: 0.59 (59% of low risk applicants are classified as low risk)
  • F1 Score:
    • High Risk: 0.02
    • Low Risk: 0.74

SMOTE

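Synthetic Minority Oversampling Technique (SMOTE), like RandomOverSampler, increases the size of the minority class. Instead of duplicating existing instances, it synthesizes new ones by interpolating between a minority instance and one of its nearest minority-class neighbors.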
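
The interpolation step can be sketched as follows. This is a simplified, standard-library illustration and not imbalanced-learn's code; the function name `smote_like_samples`, the choice of `k=2` and the toy points are assumptions. It needs at least two minority points.

```python
import math
import random


def smote_like_samples(minority, n_new, k=2, seed=1):
    """Create n_new synthetic points on segments between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority points to the base point, excluding the base itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class with two features per point.
minority = [[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]]
new_points = smote_like_samples(minority, n_new=5)
print(new_points)
```

Every synthetic point lies on a straight line between two existing minority points, so it stays inside the region the minority class already occupies.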

  • Balanced Accuracy Score: 66.3%
  • Precision Score:
    • High Risk: 0.01 (1% of the predicted high risk applicants are actually high risk)
    • Low Risk: 1.00 (100% of the predicted low risk applicants are actually low risk)
  • Recall Score:
    • High Risk: 0.63 (63% of high risk applicants are classified as high risk)
    • Low Risk: 0.69 (69% of low risk applicants are classified as low risk)
  • F1 Score:
    • High Risk: 0.02
    • Low Risk: 0.82

ClusterCentroids

ClusterCentroids is an undersampling algorithm that decreases the size of the majority class by replacing clusters of majority-class samples with synthetic data points, the cluster centroids, that represent them.
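
A rough sketch of centroid-based undersampling, assuming a plain k-means loop with a deterministic start (the first `k` points). The function name `kmeans_centroids` and the toy data are hypothetical, and the real library delegates clustering to scikit-learn rather than to code like this.

```python
import math


def kmeans_centroids(points, k, iterations=10):
    """Return k centroids that summarise the given points (basic k-means)."""
    centroids = [list(p) for p in points[:k]]  # deterministic start: first k points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid.
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = [sum(axis) / len(members) for axis in zip(*members)]
    return centroids


# Toy data: 6 majority points in two groups, 2 minority points.
majority = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [5.0, 5.0], [5.2, 4.9], [4.9, 5.3]]
minority = [[2.0, 2.0], [3.0, 3.0]]

# Shrink the majority class to the size of the minority class.
majority_reduced = kmeans_centroids(majority, k=len(minority))
print(majority_reduced)
```

The reduced majority class contains points that never appeared in the original data, which is one reason undersampling this way can discard useful detail.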

  • Balanced Accuracy Score: 54.5%
  • Precision Score:
    • High Risk: 0.01 (1% of the predicted high risk applicants are actually high risk)
    • Low Risk: 1.00 (100% of the predicted low risk applicants are actually low risk)
  • Recall Score:
    • High Risk: 0.69 (69% of high risk applicants are classified as high risk)
    • Low Risk: 0.40 (40% of low risk applicants are classified as low risk)
  • F1 Score:
    • High Risk: 0.01
    • Low Risk: 0.57

SMOTEENN

SMOTEENN combines the Synthetic Minority Oversampling Technique (SMOTE) with Edited Nearest Neighbors (ENN). It first oversamples the minority class with SMOTE, then undersamples by dropping the outliers of each class, meaning samples whose nearest neighbors mostly belong to a different class.
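
The cleaning half of this combination can be sketched as below. This is a simplified illustration of the Edited Nearest Neighbors idea using the standard library; the function name, the choice of `k=3` and the toy data are assumptions, and imbalanced-learn's defaults may differ.

```python
import math
from collections import Counter


def edited_nearest_neighbours(rows, labels, k=3):
    """Keep only samples whose k nearest neighbours mostly share their label."""
    kept_rows, kept_labels = [], []
    for i, (row, label) in enumerate(zip(rows, labels)):
        others = [j for j in range(len(rows)) if j != i]
        nearest = sorted(others, key=lambda j: math.dist(row, rows[j]))[:k]
        majority_label = Counter(labels[j] for j in nearest).most_common(1)[0][0]
        if majority_label == label:
            kept_rows.append(row)
            kept_labels.append(label)
    return kept_rows, kept_labels


# Toy data: two well separated groups plus one low risk row
# sitting in the middle of the high risk group.
rows = [[0.0], [0.1], [0.2], [0.3], [5.0], [5.1], [5.2], [5.3], [5.15]]
labels = ["low_risk"] * 4 + ["high_risk"] * 4 + ["low_risk"]

clean_rows, clean_labels = edited_nearest_neighbours(rows, labels)
print(len(rows), "->", len(clean_rows))
```

In this toy data only the misplaced low risk row at 5.15 is dropped, because all three of its nearest neighbours are high risk.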

  • Balanced Accuracy Score: 64.5%
  • Precision Score:
    • High Risk: 0.01 (1% of the predicted high risk applicants are actually high risk)
    • Low Risk: 1.00 (100% of the predicted low risk applicants are actually low risk)
  • Recall Score:
    • High Risk: 0.72 (72% of high risk applicants are classified as high risk)
    • Low Risk: 0.57 (57% of low risk applicants are classified as low risk)
  • F1 Score:
    • High Risk: 0.02
    • Low Risk: 0.72

BalancedRandomForestClassifier

BalancedRandomForestClassifier is a random forest in which each tree is trained on a balanced bootstrap sample: the majority class is randomly undersampled so that the sample contains as many majority-class instances as minority-class instances.
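
A minimal sketch of the balanced bootstrap that each tree would be trained on, assuming the sample holds as many majority rows as there are minority rows. The function name `balanced_bootstrap` and the toy data are hypothetical; fitting the trees themselves is left out.

```python
import random
from collections import Counter


def balanced_bootstrap(rows, labels, minority_label, rng):
    """Draw a bootstrap sample with equal numbers of minority and majority rows."""
    minority = [r for r, y in zip(rows, labels) if y == minority_label]
    majority = [(r, y) for r, y in zip(rows, labels) if y != minority_label]
    n = len(minority)
    # Sample with replacement from each class, n rows apiece.
    sample = [(rng.choice(minority), minority_label) for _ in range(n)]
    sample += [rng.choice(majority) for _ in range(n)]
    return sample


# Toy data: 20 low risk rows and 4 high risk rows.
rows = [[i] for i in range(24)]
labels = ["low_risk"] * 20 + ["high_risk"] * 4

rng = random.Random(1)
# One balanced sample per tree; three trees here for illustration.
samples = [balanced_bootstrap(rows, labels, "high_risk", rng) for _ in range(3)]
for sample in samples:
    print(Counter(label for _, label in sample))
```

Because every tree sees a different random slice of the majority class, the forest as a whole still uses much of the majority data even though each tree is trained on a small balanced sample.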

  • Balanced Accuracy Score: 78.8%
  • Precision Score:
    • High Risk: 0.04 (4% of the predicted high risk applicants are actually high risk)
    • Low Risk: 1.00 (100% of the predicted low risk applicants are actually low risk)
  • Recall Score:
    • High Risk: 0.67 (67% of high risk applicants are classified as high risk)
    • Low Risk: 0.91 (91% of low risk applicants are classified as low risk)
  • F1 Score:
    • High Risk: 0.07
    • Low Risk: 0.95

EasyEnsembleClassifier

EasyEnsembleClassifier builds an ensemble of classifiers by repeatedly resampling the majority class. The classifiers are adaptive boosting (AdaBoost) learners, each trained on a different balanced sample obtained through random undersampling.
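
The ensemble structure can be sketched as follows. To keep the example short, a one-feature threshold rule stands in for the AdaBoost learners, so this shows only the resample-and-vote pattern, not the real classifier. All names and the toy data are assumptions, and the stand-in learner assumes high risk rows have larger feature values.

```python
import random
from statistics import mean


def fit_threshold(sample):
    """Stand-in learner: threshold halfway between the two class means."""
    high = mean(x for x, y in sample if y == 1)
    low = mean(x for x, y in sample if y == 0)
    return (high + low) / 2


def easy_ensemble_sketch(data, n_estimators=5, seed=1):
    """Train one stand-in learner per balanced subset of the data."""
    rng = random.Random(seed)
    minority = [d for d in data if d[1] == 1]
    majority = [d for d in data if d[1] == 0]
    thresholds = []
    for _ in range(n_estimators):
        # All minority rows plus an equally sized random draw of majority rows.
        subset = minority + rng.sample(majority, len(minority))
        thresholds.append(fit_threshold(subset))
    return thresholds


def predict(thresholds, x):
    """Majority vote across the learners: 1 = high risk, 0 = low risk."""
    votes = sum(x > t for t in thresholds)
    return 1 if votes > len(thresholds) / 2 else 0


# Toy data: (feature, label) pairs with 20 low risk and 3 high risk rows.
data = [(float(i), 0) for i in range(20)] + [(25.0, 1), (27.0, 1), (30.0, 1)]
thresholds = easy_ensemble_sketch(data)
print(predict(thresholds, 28.0), predict(thresholds, 2.0))
```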

  • Balanced Accuracy Score: 92.5%
  • Precision Score:
    • High Risk: 0.07 (7% of the predicted high risk applicants are actually high risk)
    • Low Risk: 1.00 (100% of the predicted low risk applicants are actually low risk)
  • Recall Score:
    • High Risk: 0.91 (91% of high risk applicants are classified as high risk)
    • Low Risk: 0.94 (94% of low risk applicants are classified as low risk)
  • F1 Score:
    • High Risk: 0.14
    • Low Risk: 0.97

Summary

Ranking of models from most accurate to least accurate for identifying high risk candidates (precision, recall and F1 Score are for the high risk class):

  • EasyEnsembleClassifier: 92.5% balanced accuracy, 7% precision, 91% recall and 14% F1 Score
  • BalancedRandomForestClassifier: 78.8% balanced accuracy, 4% precision, 67% recall and 7% F1 Score
  • SMOTE: 66.3% balanced accuracy, 1% precision, 63% recall and 2% F1 Score
  • SMOTEENN: 64.5% balanced accuracy, 1% precision, 72% recall and 2% F1 Score
  • RandomOverSampler: 64.4% balanced accuracy, 1% precision, 69% recall and 2% F1 Score
  • ClusterCentroids: 54.5% balanced accuracy, 1% precision, 69% recall and 1% F1 Score
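
As a quick consistency check, the ranking can be reproduced from the recall values reported in the Results section, since balanced accuracy is the mean of the two per-class recalls. The recalls are rounded to two decimals, so the recomputed values match the reported scores only to within about half a percentage point. The variable names are illustrative.

```python
# (high risk recall, low risk recall) as reported in the Results section.
recalls = {
    "RandomOverSampler": (0.69, 0.59),
    "SMOTE": (0.63, 0.69),
    "ClusterCentroids": (0.69, 0.40),
    "SMOTEENN": (0.72, 0.57),
    "BalancedRandomForestClassifier": (0.67, 0.91),
    "EasyEnsembleClassifier": (0.91, 0.94),
}

# Balanced accuracy is the average of the per-class recalls.
balanced_accuracy = {
    name: (high + low) / 2 for name, (high, low) in recalls.items()
}

# Sort from highest to lowest balanced accuracy.
ranking = sorted(balanced_accuracy, key=balanced_accuracy.get, reverse=True)
for name in ranking:
    print(f"{name}: {balanced_accuracy[name]:.1%}")
```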

After evaluating all six techniques, the EasyEnsembleClassifier has the highest balanced accuracy score (92.5%) as well as the highest precision (7%) and recall (91%) for identifying high risk candidates. The EasyEnsembleClassifier is therefore the recommended algorithm for this credit risk data set. Its downside is low precision: although it catches 91% of the actual high risk candidates, only 7% of the candidates it flags as high risk really are high risk, so most of the flagged candidates are in fact low risk. Since lenders would rather classify a low risk candidate as high risk than a high risk candidate as low risk, this downside is not significant enough to rule out the algorithm. It can also be mitigated by a separate, second-stage algorithm that sorts through the candidates flagged as high risk in order to rule out the low risk ones.
