
Fast Retraining

In this repo we compare two of the fastest libraries for boosted decision trees: XGBoost and LightGBM. We evaluate them across datasets from several domains and of different sizes.

On July 25, 2017, we published a blog post evaluating both libraries and discussing the benchmark results. The post is Lessons Learned From Benchmarking Fast Machine Learning Algorithms.

Installation and Setup

The installation instructions can be found here.

Project

The folder experiments contains the six experiments of the project, developed with both the CPU and GPU versions of the libraries:

  • Airline
  • BCI
  • Football
  • Planet Kaggle
  • Fraud Detection
  • HIGGS

The folder experiments/libs contains the common code of the project.

Benchmark

The following table summarizes the training times (in seconds) and the speed ratios obtained in the experiments:

| Dataset | Experiment | Data size | Features | xgb time: CPU (GPU) | xgb_hist time: CPU (GPU) | lgb time: CPU (GPU) | ratio xgb/lgb: CPU (GPU) | ratio xgb_hist/lgb: CPU (GPU) |
|---|---|---|---|---|---|---|---|---|
| Football | Link CPU / Link GPU | 19673 | 46 | 2.27 (7.09) | 2.47 (4.58) | 0.58 (0.97) | 3.90 (7.26) | 4.25 (4.69) |
| Fraud Detection | Link CPU / Link GPU | 284807 | 30 | 4.34 (5.80) | 2.01 (1.64) | 0.66 (0.29) | 6.58 (19.74) | 3.04 (5.58) |
| BCI | Link CPU / Link GPU | 20497 | 2048 | 11.51 (12.93) | 41.84 (42.69) | 7.31 (2.76) | 1.57 (4.67) | 5.72 (15.43) |
| Planet Kaggle | Link CPU / Link GPU | 40479 | 2048 | 313.89 (-) | 2115.28 (2028.43) | 194.57 (317.68) | 1.61 (-) | 10.87 (6.38) |
| HIGGS | Link CPU / Link GPU | 11000000 | 28 | 2996.16 (-) | 121.21 (114.88) | 119.34 (71.87) | 25.10 (-) | 1.01 (1.59) |
| Airline | Link CPU / Link GPU | 115069017 | 13 | - (-) | 1242.09 (1271.91) | 1056.20 (645.40) | - (-) | 1.17 (1.97) |
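The ratio columns are simply the quotient of the corresponding training times. A minimal sketch using the rounded Football CPU figures from the table (the last digit can differ slightly from the table, which is computed from unrounded times):

```python
# Speed ratio = XGBoost time / LightGBM time (times in seconds).
xgb_time = 2.27   # Football, xgb, CPU
lgb_time = 0.58   # Football, lgb, CPU

ratio = xgb_time / lgb_time
print(f"xgb/lgb ratio: {ratio:.2f}")  # ~3.9, matching the table's 3.90
```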

The next table summarizes the performance results using the F1 score.

| Dataset | Experiment | Data size | Features | xgb F1: CPU (GPU) | xgb_hist F1: CPU (GPU) | lgb F1: CPU (GPU) |
|---|---|---|---|---|---|---|
| Football | Link / Link | 19673 | 46 | 0.458 (0.470) | 0.460 (0.472) | 0.459 (0.470) |
| Fraud Detection | Link / Link | 284807 | 30 | 0.824 (0.821) | 0.802 (0.814) | 0.813 (0.811) |
| BCI | Link / Link | 20497 | 2048 | 0.110 (0.093) | 0.142 (0.120) | 0.137 (0.138) |
| Planet Kaggle | Link / Link | 40479 | 2048 | 0.805 (-) | 0.822 (0.822) | 0.822 (0.821) |
| HIGGS | Link / Link | 11000000 | 28 | 0.763 (-) | 0.767 (0.767) | 0.768 (0.767) |
| Airline | Link / Link | 115069017 | 13 | - (-) | 0.741 (0.745) | 0.732 (0.745) |
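For reference, the F1 score reported above is the harmonic mean of precision and recall. A minimal sketch of the formula:

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 = harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# With equal precision and recall, F1 equals that common value.
print(round(f1_score(0.8, 0.8), 3))  # 0.8
```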

The experiments were run on an Azure NV24 VM with 24 cores, 224 GB of memory, and 4 NVIDIA M60 GPUs, using Ubuntu 16.04 for both the CPU and GPU runs.

Contributing

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

fast_retraining's People

Contributors

microsoftopensource, miguelgfierro, msalvaris, msftgits


fast_retraining's Issues

General plan outline

Speeding up Machine Learning Applications with LightGBM library in Real Time domains

Short description
The speed of a machine learning algorithm can be crucial in problems that require retraining in real time. Useful domains include IoT, sports result prediction, predictive maintenance, and healthcare. Microsoft recently open sourced LightGBM, a decision tree library that outperforms other libraries in both speed and performance. In this talk we will demo several applications using LightGBM.

Abstract
In some applications, training and retraining times need to be kept below 5 seconds to be useful. Such applications are often referred to as real-time and include, but are not limited to, IoT, sports result prediction, predictive maintenance, and healthcare. Algorithms that allow fast retraining are fundamental to enabling such applications and can open up new business opportunities. One reason for retraining is feature degradation: previously useful features stop being informative, which is often observed as sensors age or as information becomes out of date.
LightGBM is a new open source library created by Microsoft that is set to become the new standard in decision tree algorithms. Depending on the application, it can be anywhere from 4 to 10 times faster than XGBoost while offering higher accuracy. It has already proven useful in several Kaggle competitions.
In this talk we will explore this promising library, compare it with the current state of the art, and demo a business case of a real-time application.

Slides

  1. Concept drift (5 minutes)
  • Introduction
  • Intro to concept drift
  • Ways to combat concept drift
  2. XGBoost & LightGBM (10-15 minutes)
  • Intro to the unreasonable effectiveness of XGBoost
  • What is LightGBM
  • Summary of XGBoost vs LightGBM training speed, execution speed and accuracy
    • Planet Kaggle
    • Flights
    • ...
  • Why is LightGBM faster
  3. Real-time applications (10 minutes)
  • Real-time applications
    • BCI
    • Airplane
    • IoT?
    • ...
  • Business case?

LightGBMError: b'GPU Tree Learner was not enabled in this build. Recompile with CMake option -DUSE_GPU=1'

```
LightGBMError                             Traceback (most recent call last)
~/h2o4gpu/testsxgboost/06_HIGGS_GPU.py in ()
    242
    243 with Timer() as train_t:
--> 244     lgbm_clf_pipeline = lgb.train(params, lgb_train, num_boost_round=num_rounds)
    245
    246 with Timer() as test_t:

~/.pyenv/versions/3.6.1/lib/python3.6/site-packages/lightgbm/engine.py in train(params, train_set, num_boost_round, valid_sets, valid_names, fobj, feval, init_model, feature_name, categorical_feature, early_stopping_rounds, evals_result, verbose_eval, learning_rates, keep_training_booster, callbacks)
    178 # construct booster
    179 try:
--> 180     booster = Booster(params=params, train_set=train_set)
    181     if is_valid_contain_train:
    182         booster.set_train_data_name(train_data_name)

~/.pyenv/versions/3.6.1/lib/python3.6/site-packages/lightgbm/basic.py in __init__(self, params, train_set, model_file, silent)
   1251     train_set.construct().handle,
   1252     c_str(params_str),
-> 1253     ctypes.byref(self.handle)))
   1254 # save reference to data
   1255 self.train_set = train_set

~/.pyenv/versions/3.6.1/lib/python3.6/site-packages/lightgbm/basic.py in _safe_call(ret)
    46
    47 if ret != 0:
--> 48     raise LightGBMError(_LIB.LGBM_GetLastError())
    49
    50

LightGBMError: b'GPU Tree Learner was not enabled in this build. Recompile with CMake option -DUSE_GPU=1'
```

The cause is unknown. The error only happens with the Airline and HIGGS experiments, i.e. the larger datasets, even though the system has a 1080 Ti with all of its memory free. Smaller datasets such as the credit fraud and football ones show no such issue. It is as if the error message were wrong and LightGBM actually ran out of memory.
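If the cause really is missing GPU support in the build, rebuilding LightGBM with the GPU tree learner enabled, as the error message suggests, would look roughly like this (a sketch; paths, job count, and install options are assumptions, and the GPU build requires OpenCL headers and drivers):

```shell
# Rebuild LightGBM with the GPU tree learner enabled.
git clone --recursive https://github.com/microsoft/LightGBM
cd LightGBM
mkdir build && cd build
cmake -DUSE_GPU=1 ..
make -j4

# Reinstall the Python package against the freshly built native library.
cd ../python-package
python setup.py install --precompile
```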

NameError: name 'generate_validation_files' is not defined while trying planet notebook.

LightGBM version: 2.0.5

```
NameError                                 Traceback (most recent call last)
~/h2o4gpu/testsxgboost/04_PlanetKaggle_GPU.py in ()
     51
     52
---> 53 X_train, y_train, X_test, y_test = load_planet_kaggle()
     54
     55

~/h2o4gpu/testsxgboost/libs/loaders.py in load_planet_kaggle()
    212 if not os.listdir(val_path):
    213     logger.info('Validation folder is empty, moving files...')
--> 214     generate_validation_files(train_path, val_path)
    215
    216 logger.info('Reading in labels')

NameError: name 'generate_validation_files' is not defined
```

Solution: add the missing import in loaders.py:

```python
from libs.planet_kaggle import to_multi_label_dict, get_file_count, enrich_with_feature_encoding, featurise_images, generate_validation_files
```

More experiments

  • HIGGS CPU
  • HIGGS GPU
  • Amazon GPU
  • BCI GPU
  • Football GPU
  • Fraud GPU

Blog writeup

Some advice from David:

Probably the most important advice I can give is this: know your audience. Think of the person who’s going to have the most interest in what you have to say, and write for that one person. As you mention, providing the code they can use to replicate what you do is important. I also feel like it’s better to focus on specifics rather than generalities: benchmarks/results from a specific analysis of a specific dataset on specific hardware are much more interesting IMO than “what might be”.

My other advice is keep it short. This is harder than it sounds – it’s tempting to go into every detail. But your post will have more impact the more people read it, and not many people will take the time to read more than a page or so. My usual target is 4-6 paragraphs, plus images/videos (and do include at least one image). Keep the details for the Github repo, documentation or whitepapers, and link to them as needed.

Lastly, follow journalistic principles and summarize your entire post in the first paragraph. Make sure your main point is included there – ideally in the first sentence – along with any “hero” links you’d like the reader to follow. That way a reader will know immediately if they’re interested in your content, and even if not, they still come away with the main point.

Improvements

  • Add the GPU results to the histogram
  • Divide the table of results into two
  • Add the computer features below the benchmark

Errors while trying football notebook

```
Generating match features...
Generating match labels...
Generating bookkeeper data...
/home/jon/.pyenv/versions/3.6.1/lib/python3.6/site-packages/pandas/core/indexing.py:297: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
/home/jon/.pyenv/versions/3.6.1/lib/python3.6/site-packages/pandas/core/indexing.py:477: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
(19673, 48)
CPU times: user 10min 50s, sys: 7.58 s, total: 10min 58s
Wall time: 10min 58s

In [9]: feables = convert_cols_categorical_to_numeric(feables)
        feables.head()

[autoreload of six failed: Traceback (most recent call last):
  File "/home/jon/.pyenv/versions/3.6.1/lib/python3.6/site-packages/IPython/extensions/autoreload.py", line 246, in check
  File "/home/jon/.pyenv/versions/3.6.1/lib/python3.6/site-packages/IPython/extensions/autoreload.py", line 385, in superreload
  File "/home/jon/.pyenv/versions/3.6.1/lib/python3.6/site-packages/IPython/extensions/autoreload.py", line 324, in update_generic
  File "/home/jon/.pyenv/versions/3.6.1/lib/python3.6/site-packages/IPython/extensions/autoreload.py", line 276, in update_class
  File "/home/jon/.pyenv/versions/3.6.1/lib/python3.6/site-packages/six.py", line 93, in __get__
    setattr(obj, self.name, result)  # Invokes __set__.
AttributeError: 'NoneType' object has no attribute 'cStringIO'
]
```
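The SettingWithCopyWarning above comes from chained indexing inside the feature-generation code; the pandas-recommended fix is a single .loc assignment. A minimal illustration with toy data (not the actual football features):

```python
import pandas as pd

df = pd.DataFrame({"team": ["A", "B"], "goals": [1, 2]})

# Chained indexing such as df[df["goals"] > 1]["team"] = "C" may write to a
# temporary copy and triggers SettingWithCopyWarning. A single .loc call
# selects rows and column in one step and updates the original frame:
df.loc[df["goals"] > 1, "team"] = "C"
print(df["team"].tolist())  # ['A', 'C']
```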

XGBoost GPU benchmarks

Hi, I am the author of the XGBoost GPU algorithms.

Your benchmarks of my GPU hist algorithm are simply running on the CPU. The reason for this is the 'tree_method':'hist' parameter is overriding the selection of the GPU updater. This was fixed some time ago but it seems you are using an older commit. The correct usage would now be to set 'tree_method':'gpu_hist'. I would appreciate if you can update your benchmarks, I think you might find my algorithm far more competitive.

I also noticed that the XGBoost CPU hist algorithm has not had the number of bins set correctly, so you would be comparing 256 bins for XGBoost against 63 bins for LightGBM. This was due to a mistake in our documentation regarding the naming of the parameter that I have noted in dmlc/xgboost#2567.

Thanks
Rory
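Following Rory's comment, the corrected benchmark settings would look roughly like this (a sketch; the objective and other values are assumptions, not the benchmark's actual configuration):

```python
# XGBoost: select the GPU histogram algorithm directly. In older commits,
# 'tree_method': 'hist' overrode the GPU updater and silently ran on CPU.
xgb_params = {
    "objective": "binary:logistic",
    "tree_method": "gpu_hist",  # was "hist" + GPU updater, which ran on CPU
    "max_bin": 63,              # match LightGBM's bin count for a fair test
}

# LightGBM: 63 bins, as in the original benchmark.
lgb_params = {
    "objective": "binary",
    "device": "gpu",
    "max_bin": 63,
}
print(xgb_params["tree_method"])  # gpu_hist
```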

Comparison experiment using BCI dataset

  • Download dataset
  • Create data loader and preprocessing
  • Create initial xgboost pipeline
  • Create initial lightgbm pipeline
  • Create comparison on test set
  • Add docs

General plan outline

Contents of MVP2:

  • Optimize airline experiment
  • Add comparison with XGBoost hist
  • Add Guolin experiments and run them on the same machine
  • Update notebooks with new parameters
  • Add experiments with GPU
  • Change notebooks with bokeh
  • Blog writeup

Planning Strata

Planning kick-off meeting. This is the first MVP; after it we iterate depending on the time we have.

Methodology (based on TDSP):

  • Experiment in notebooks; common libraries in Python
  • Develop in branches; merge to master via a PR accepted by an external person (not the committer)
  • Try to make atomic commits in case anybody wants to cherry-pick
  • Master always contains working code
  • FOCUS: get to MVP1 as fast as possible, then iterate
  • When someone picks a task, they assign it to themselves and change its state in Projects

TODO MVP1:

  • Start adding cards to the GitHub project
  • Ask Ned about the airline and sensor datasets
  • Set up a private GitHub repo: Azure
  • Set up the DSVM + libraries
  • Put the data in a file share
  • Timer class
  • Create conda environment variables for data and other settings
  • Data loaders
  • Experiments with LightGBM (sklearn API)
  • Compare with XGBoost
  • Parameter tuning framework
  • Plotter of model vs retrained model performance
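The "timer class" task above refers to the Timer used throughout the notebooks (e.g. `with Timer() as train_t:` in the tracebacks earlier). A minimal sketch of such a context manager, not necessarily the repo's exact implementation:

```python
from time import perf_counter

class Timer:
    """Context manager that measures elapsed wall-clock time in seconds."""

    def __enter__(self):
        self._start = perf_counter()
        return self

    def __exit__(self, *exc):
        # Store elapsed seconds; do not suppress exceptions (implicit None).
        self.interval = perf_counter() - self._start

with Timer() as t:
    sum(range(1000))
print(f"elapsed: {t.interval:.6f}s")
```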

Comparison experiment using football dataset

  • Download dataset and put it on fileshare
  • Create data loader
  • Initial preprocessing
  • Create initial xgboost pipeline
  • Create initial lightgbm pipeline
  • Create comparison on test set
  • Find concept drift
