data245's Introduction

10/28/21 - 10/29/21
I took a closer look at the features (to understand the data) and removed redundant/duplicate variables.
Then, I checked the correlation between the numerical variables; some pairs were strongly correlated, such as Amount and MonthlyPayment.
However, these pairs weren't perfectly correlated, and RandomForest ranked them as useful/important, so they were kept.
After cleaning, the full dataset consisted of 130 variables, while the truncated dataset contained the 19 "important" features selected by RandomForest.
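
The correlation check and the RandomForest-based selection described above could look roughly like the sketch below. It runs on synthetic data from make_classification rather than the Bondora table, and the column names, the 0.8 correlation cutoff, and top_k = 5 are illustrative assumptions, not values taken from the project script.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the cleaned loan table (NOT the Bondora data).
X_arr, y = make_classification(
    n_samples=500, n_features=12, n_informative=4, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(12)])

# Absolute pairwise Pearson correlation between the numerical columns.
corr = X.corr().abs()
# Keep only the upper triangle so each pair is listed once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# Pairs above an (assumed) 0.8 cutoff are candidates for closer inspection.
strong_pairs = upper.stack().loc[lambda s: s > 0.8]

# Rank features by impurity-based importance from a random forest.
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)

# Keep the top_k features as the "truncated" dataset (top_k is hypothetical).
top_k = 5
selected = importances.head(top_k).index.tolist()
X_truncated = X[selected]
```

Strongly correlated pairs are only flagged here, not dropped, which matches the decision above to keep pairs such as Amount and MonthlyPayment when the forest still ranks them as important.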

The estimators tested were LR, DT, MLP, SVC, SGD, HGB, and RF. HistGradientBoosting and RandomForest performed similarly and outperformed the base models.
This held for precision (~0.75), recall (~0.70), Matthews correlation coefficient (~0.45), and AUC (~0.70).
I also tested LightGBM, which was the inspiration for scikit-learn's HistGradientBoostingClassifier.
Its performance was essentially identical to that of HistGradientBoosting.

Then, the best models (HGB, RF, and GBM, i.e. LightGBM) were tuned using HalvingGridSearchCV.
There are more details in the script, but in general the learning rate, tree depth, and regularization parameters were tuned.
The performance uplift from hyperparameter tuning was very small.

I used the plotting functions from scikit-learn's metrics module to display the confusion matrices, ROC curves, and precision-recall curves.
Also, RF and GBM have feature importances built in, so those were extracted and plotted as well.
Finally, the decision trees in RF can be plotted individually, so I went ahead and did that.

RandomForest was slightly better than the boosting models:
Macro-averaged precision 0.76
Macro-averaged recall 0.72
Matthews correlation coefficient 0.48
AUC 0.72

Important features are mostly numerical, such as Interest, Age, IncomeTotal, and MonthlyPayment.




10/16/21

Changed to the Bondora dataset (https://ieee-dataport.org/open-access/bondora-peer-peer-lending-data#files)

The main issue with the LendingClub dataset was that the quality of the data was not sufficient to train good models.
Specifically, when testing several models:
  LogisticRegression, LinearSVC, MLPClassifier, SGDClassifier, DecisionTreeClassifier, RandomForestClassifier, HistGradientBoostingClassifier
the resulting accuracy (~0.5), AUC score (~0.5), and Matthews correlation coefficient (~0) were low. The models were no better than random guessing.

The Bondora set also contains P2P lending data, covering loans from March 2009 to January 2020.
Extra preprocessing was done on the available dataset and features were selected using RandomForest.

Contributors: groth00, taolizhen
