
gcforest's Introduction

Deep Forest in Python

Status : not under active development

What's New

version 0.1.6 : corrected max_features=1 for the completely random forest (correction thanks to sevenguin).
version 0.1.5 : remove layer when accuracy gets worse (behavior corrected thanks to felixwzh).
version 0.1.4 : faster slicing method.

Presentation

gcForest is a deep forest algorithm proposed in Zhou and Feng 2017 ( https://arxiv.org/abs/1702.08835 ). It uses multi-grain scanning to slice the input data and a cascade structure whose layers each contain several random forests (see the paper for details).
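As a rough illustration of the slicing step, the sketch below cuts each 1D sample into overlapping windows with a stride of 1. The function name `slice_sequence` and the toy data are hypothetical and are not taken from this repository; the sketch only shows the general idea of multi-grain scanning on sequence data.

```python
import numpy as np

def slice_sequence(X, window):
    """Cut every 1D sample of X (n_samples, n_features) into overlapping windows.

    Returns an array of shape (n_samples * n_windows, window),
    where n_windows = n_features - window + 1 (stride of 1).
    """
    n_samples, n_features = X.shape
    n_windows = n_features - window + 1
    # Row i of idx holds the feature indices covered by window i
    idx = np.arange(n_windows)[:, None] + np.arange(window)[None, :]
    # Fancy indexing gives (n_samples, n_windows, window); flatten the first two axes
    return X[:, idx].reshape(n_samples * n_windows, window)

X = np.arange(12).reshape(2, 6)      # 2 toy samples with 6 features each
slices = slice_sequence(X, window=3)
print(slices.shape)                  # (8, 3): 4 windows per sample
```

Each window would then be treated as a training instance carrying the label of the sample it came from.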

This gcForest implementation was first developed as a classifier, and it is designed so that the multi-grain scanning module and the cascade structure can be used separately. During development I took particular care to write the code so that future parallelization should be fairly straightforward to implement.
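To give an idea of what a single cascade layer does on its own, here is a minimal sketch built only on scikit-learn. It is not the code from this repository: the function name `cascade_layer`, the choice of two forests and the tree count are assumptions made for illustration. Each forest produces a class-probability vector per sample (taken from the out-of-bag estimate), and these vectors are appended to the input features for the next layer.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier

def cascade_layer(X, y, n_trees=100, seed=0):
    """Fit two forests and return X augmented with their OOB class probabilities."""
    forests = [
        # Ordinary random forest (default max_features)
        RandomForestClassifier(n_estimators=n_trees, oob_score=True,
                               random_state=seed),
        # max_features=1: each split considers one random feature, standing in
        # for the "completely random" forest (an approximation, see lead-in)
        RandomForestClassifier(n_estimators=n_trees, max_features=1,
                               oob_score=True, random_state=seed + 1),
    ]
    probas = [f.fit(X, y).oob_decision_function_ for f in forests]
    # Each forest contributes one probability per class for every sample
    return np.hstack([X] + probas)

X, y = load_digits(return_X_y=True)
augmented = cascade_layer(X, y)
print(X.shape, "->", augmented.shape)
```

On the digits data this adds 2 x 10 probability columns to the 64 pixel features. A full cascade would stack such layers and stop adding them once accuracy no longer improves.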

You can find the official release of the code used in Zhou and Feng 2017 here.

Prerequisites

The code was developed under Python 3.x. You will need the following installed on your computer to run it:

  • Python 3.x
  • Numpy >= 1.12.0
  • Scikit-learn >= 0.18.1
  • jupyter >= 1.0.0 (only needed to run the tutorial notebook)

You can install all of them using pip install :

$ pip3 install -r requirements.txt

Using gcForest

The syntax follows the scikit-learn style, with a .fit() function to train the algorithm and a .predict() function to predict the class of new samples. You can find two examples in the Jupyter notebook included in the repository.

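from GCForest import gcForest    # explicit import instead of a wildcard

gcf = gcForest(**kwargs)         # kwargs: the constructor options you want to set
gcf.fit(X_train, y_train)        # train on labelled data
y_pred = gcf.predict(X_test)     # predicted class for each test sample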

Saving and Loading Models

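With joblib you can easily save your model to disk and load it later. Note that the copy bundled as sklearn.externals.joblib was removed in scikit-learn 0.23, so on recent versions joblib must be imported directly. Just proceed as follows:
To save :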

import joblib  # on scikit-learn < 0.21, use: from sklearn.externals import joblib
joblib.dump(gcf, 'name_of_file.sav')

To load :

gcf = joblib.load('name_of_file.sav')

Notes

I wrote the code from scratch in two days, and although I have tested it on several cases, I obviously cannot certify that it is 100% bug-free. Feel free to test it and send me your feedback about any improvement and/or modification!

Known Issues

Memory consumption when slicing data. There is now a short naive calculation illustrating the issue in the notebook. So far, the input data is sliced in a single step to train the Random Forest for the Multi-Grain Scanning. This may require a lot of memory, depending on the size of the data set and the number of slices requested, and can result in memory crashes (at least on my Intel Core 2 Duo).
The memory consumption when slicing data is more complicated than it seems. A related StackOverflow post can be found here. The main problem is that the sliced array is not contiguous with the original data, which forces a copy to be made in memory.
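The scale of the problem can be sketched with a naive count of the bytes needed when every window of every sample is materialized at once. This is not the calculation from the notebook; the helper `sliced_bytes` and the data set dimensions below are hypothetical and chosen only to show the order of magnitude.

```python
def sliced_bytes(n_samples, n_features, window, itemsize=8):
    """Bytes needed to hold every 1D window of every sample at once (stride 1)."""
    n_windows = n_features - window + 1
    return n_samples * n_windows * window * itemsize

# Hypothetical data set: 10,000 samples of 1,000 float64 features, window of 100
original = 10_000 * 1_000 * 8
sliced = sliced_bytes(10_000, 1_000, 100)

print(original / 1e9, "GB before slicing")
print(sliced / 1e9, "GB after slicing")
print("blow-up factor:", sliced / original)
```

Under these assumptions the sliced copy is roughly 90 times larger than the original array (about 7.2 GB against 0.08 GB), which is why slicing everything in one step can exhaust memory.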

OOB score error. During Random Forest training, the Out-Of-Bag (OOB) technique is used to obtain the prediction probabilities. This technique can sometimes raise an error when one or several samples are used to train all the trees, since such samples have no out-of-bag prediction.
A potential solution is to use cross-validation instead of the OOB score, although it slows down training. In practice, simply increasing the number of trees and re-running the training (and crossing fingers) is often enough.
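The cross-validation alternative could look like the sketch below, which uses scikit-learn's `cross_val_predict` to obtain a probability vector for every sample without relying on out-of-bag estimates. It is an illustration under assumed settings (digits data, 50 trees, 3 folds) and is not part of this repository's code.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

X, y = load_digits(return_X_y=True)
forest = RandomForestClassifier(n_estimators=50, random_state=0)

# Every sample is predicted by a model that never saw it during training,
# so no sample can end up without a probability estimate
proba = cross_val_predict(forest, X, y, cv=3, method="predict_proba")
print(proba.shape)

# cross_val_predict works on clones, so fit the forest itself for later use
forest.fit(X, y)
```

The cost is one extra fit per fold, which is the slowdown mentioned above.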

Built With

  • PyCharm community edition
  • memory_profiler library

License

This project is licensed under the MIT License (see LICENSE for details)

Early Results

(will be updated as new results come out)

  • Scikit-learn handwritten digits classification :
    training time ~ 5min
    accuracy ~ 98%

gcforest's People

Contributors

pylablanche
