
Fast Retraining

In this repo we compare two of the fastest libraries for boosted decision trees: XGBoost and LightGBM. We evaluate them across datasets from several domains and of different sizes.

On July 25, 2017, we published a blog post evaluating both libraries and discussing the benchmark results. The post is Lessons Learned From Benchmarking Fast Machine Learning Algorithms.

Installation and Setup

The installation instructions can be found here.

Project

The folder experiments contains the six experiments of the project, developed with both the CPU and GPU versions of the libraries:

  • Airline
  • BCI
  • Football
  • Planet Kaggle
  • Fraud Detection
  • HIGGS

The folder experiments/libs contains the common code of the project.

Benchmark

The following table summarizes the training times (in seconds) and the speed ratios obtained in the experiments:

| Dataset | Experiment | Data size | Features | xgb time: CPU (GPU) | xgb_hist time: CPU (GPU) | lgb time: CPU (GPU) | ratio xgb/lgb: CPU (GPU) | ratio xgb_hist/lgb: CPU (GPU) |
|---|---|---|---|---|---|---|---|---|
| Football | Link CPU / Link GPU | 19673 | 46 | 2.27 (7.09) | 2.47 (4.58) | 0.58 (0.97) | 3.90 (7.26) | 4.25 (4.69) |
| Fraud Detection | Link CPU / Link GPU | 284807 | 30 | 4.34 (5.80) | 2.01 (1.64) | 0.66 (0.29) | 6.58 (19.74) | 3.04 (5.58) |
| BCI | Link CPU / Link GPU | 20497 | 2048 | 11.51 (12.93) | 41.84 (42.69) | 7.31 (2.76) | 1.57 (4.67) | 5.72 (15.43) |
| Planet Kaggle | Link CPU / Link GPU | 40479 | 2048 | 313.89 (-) | 2115.28 (2028.43) | 194.57 (317.68) | 1.61 (-) | 10.87 (6.38) |
| HIGGS | Link CPU / Link GPU | 11000000 | 28 | 2996.16 (-) | 121.21 (114.88) | 119.34 (71.87) | 25.10 (-) | 1.01 (1.59) |
| Airline | Link CPU / Link GPU | 115069017 | 13 | - (-) | 1242.09 (1271.91) | 1056.20 (645.40) | - (-) | 1.17 (1.97) |
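The ratio columns are simply the quotient of the corresponding training times. A minimal sketch using the rounded Football CPU figures from the table (the last digit can differ slightly from the table, which is computed from unrounded times):

```python
# Speed ratio = XGBoost time / LightGBM time (times in seconds).
xgb_time = 2.27   # Football, xgb, CPU
lgb_time = 0.58   # Football, lgb, CPU

ratio = xgb_time / lgb_time
print(f"xgb/lgb ratio: {ratio:.2f}")  # ~3.9, matching the table's 3.90
```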

The next table summarizes the performance results using the F1 score.

| Dataset | Experiment | Data size | Features | xgb F1: CPU (GPU) | xgb_hist F1: CPU (GPU) | lgb F1: CPU (GPU) |
|---|---|---|---|---|---|---|
| Football | Link / Link | 19673 | 46 | 0.458 (0.470) | 0.460 (0.472) | 0.459 (0.470) |
| Fraud Detection | Link / Link | 284807 | 30 | 0.824 (0.821) | 0.802 (0.814) | 0.813 (0.811) |
| BCI | Link / Link | 20497 | 2048 | 0.110 (0.093) | 0.142 (0.120) | 0.137 (0.138) |
| Planet Kaggle | Link / Link | 40479 | 2048 | 0.805 (-) | 0.822 (0.822) | 0.822 (0.821) |
| HIGGS | Link / Link | 11000000 | 28 | 0.763 (-) | 0.767 (0.767) | 0.768 (0.767) |
| Airline | Link / Link | 115069017 | 13 | - (-) | 0.741 (0.745) | 0.732 (0.745) |
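For reference, the F1 score reported above is the harmonic mean of precision and recall. A minimal sketch of the formula:

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 = harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# With equal precision and recall, F1 equals that common value.
print(round(f1_score(0.8, 0.8), 3))  # 0.8
```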

The experiments were run on an Azure NV24 VM with 24 cores, 224 GB of memory, and 4 NVIDIA M60 GPUs, using Ubuntu 16.04 for both the CPU and GPU runs.

Contributing

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

fast_retraining's People

Contributors

microsoftopensource, miguelgfierro, msalvaris, msftgits


fast_retraining's Issues

General plan outline

Speeding up Machine Learning Applications with LightGBM library in Real Time domains

Short description
The speed of a machine learning algorithm can be crucial in problems that require retraining in real time. Useful domains include IoT, sports result prediction, predictive maintenance, and healthcare. Microsoft recently open sourced LightGBM, a decision tree library that outperforms other libraries in both speed and performance. In this talk we will demo several applications using LightGBM.

Abstract
In some applications, training and retraining times need to be kept below 5 seconds to be useful. Such applications are often referred to as real-time and include, but are not limited to, IoT, sports result prediction, predictive maintenance, and healthcare. Algorithms that allow fast retraining are fundamental to enabling such applications and can open up new business opportunities. One reason for retraining is feature degradation: previously useful features stop being informative, which is often observed as sensors age or as information becomes out of date.
LightGBM is a new open source library created by Microsoft that is set to become the new standard in decision tree algorithms. Depending on the application, it can be anywhere from 4 to 10 times faster than XGBoost while offering higher accuracy. It has already proven useful in several Kaggle competitions.
In this talk we will explore this promising library, compare it with the current state of the art, and demo a business case of a real-time application.

Slides

  1. Concept drift (5 minutes)
  • Introduction
  • Intro to concept drift
  • Ways to combat concept drift
  2. XGBoost & LightGBM (10-15 minutes)
  • Intro to the unreasonable effectiveness of XGBoost
  • What is LightGBM
  • Summary of XGBoost vs LightGBM training speed, execution speed and accuracy
    • Planet Kaggle
    • Flights
    • ...
  • Why is LightGBM faster
  3. Real-time applications (10 minutes)
  • Real-time applications
    • BCI
    • Airplane
    • IoT?
    • ...
  • Business case?

LightGBMError: b'GPU Tree Learner was not enabled in this build. Recompile with CMake option -DUSE_GPU=1'

```
LightGBMError                             Traceback (most recent call last)
~/h2o4gpu/testsxgboost/06_HIGGS_GPU.py in ()
    242
    243 with Timer() as train_t:
--> 244     lgbm_clf_pipeline = lgb.train(params, lgb_train, num_boost_round=num_rounds)
    245
    246 with Timer() as test_t:

~/.pyenv/versions/3.6.1/lib/python3.6/site-packages/lightgbm/engine.py in train(params, train_set, num_boost_round, valid_sets, valid_names, fobj, feval, init_model, feature_name, categorical_feature, early_stopping_rounds, evals_result, verbose_eval, learning_rates, keep_training_booster, callbacks)
    178 # construct booster
    179 try:
--> 180     booster = Booster(params=params, train_set=train_set)
    181     if is_valid_contain_train:
    182         booster.set_train_data_name(train_data_name)

~/.pyenv/versions/3.6.1/lib/python3.6/site-packages/lightgbm/basic.py in __init__(self, params, train_set, model_file, silent)
   1251     train_set.construct().handle,
   1252     c_str(params_str),
-> 1253     ctypes.byref(self.handle)))
   1254 # save reference to data
   1255 self.train_set = train_set

~/.pyenv/versions/3.6.1/lib/python3.6/site-packages/lightgbm/basic.py in _safe_call(ret)
    46
    47 if ret != 0:
--> 48     raise LightGBMError(_LIB.LGBM_GetLastError())
    49
    50

LightGBMError: b'GPU Tree Learner was not enabled in this build. Recompile with CMake option -DUSE_GPU=1'
```

The cause is unknown. The error only happens with the Airline and HIGGS experiments, i.e. the larger datasets, even though the system has a 1080 Ti with all of its memory free. Smaller datasets such as the credit fraud and football ones show no such issue. It is as if the error message were wrong and LightGBM actually ran out of memory.
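If the cause really is missing GPU support in the build, rebuilding LightGBM with the GPU tree learner enabled, as the error message suggests, would look roughly like this (a sketch; paths, job count, and install options are assumptions, and the GPU build requires OpenCL headers and drivers):

```shell
# Rebuild LightGBM with the GPU tree learner enabled.
git clone --recursive https://github.com/microsoft/LightGBM
cd LightGBM
mkdir build && cd build
cmake -DUSE_GPU=1 ..
make -j4

# Reinstall the Python package against the freshly built native library.
cd ../python-package
python setup.py install --precompile
```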

NameError: name 'generate_validation_files' is not defined while trying planet notebook.

LightGBM version: 2.0.5

```
NameError                                 Traceback (most recent call last)
~/h2o4gpu/testsxgboost/04_PlanetKaggle_GPU.py in ()
     51
     52
---> 53 X_train, y_train, X_test, y_test = load_planet_kaggle()
     54
     55

~/h2o4gpu/testsxgboost/libs/loaders.py in load_planet_kaggle()
    212 if not os.listdir(val_path):
    213     logger.info('Validation folder is empty, moving files...')
--> 214     generate_validation_files(train_path, val_path)
    215
    216 logger.info('Reading in labels')

NameError: name 'generate_validation_files' is not defined
```

Solution: add the missing import in loaders.py:

```python
from libs.planet_kaggle import to_multi_label_dict, get_file_count, enrich_with_feature_encoding, featurise_images, generate_validation_files
```

More experiments

  • HIGGS CPU
  • HIGGS GPU
  • Amazon GPU
  • BCI GPU
  • Football GPU
  • Fraud GPU

Blog writeup

Some advice from David:

Probably the most important advice I can give is this: know your audience. Think of the person who’s going to have the most interest in what you have to say, and write for that one person. As you mention, providing the code they can use to replicate what you do is important. I also feel like it’s better to focus on specifics rather than generalities: benchmarks/results from a specific analysis of a specific dataset on specific hardware are much more interesting IMO than “what might be”.

My other advice is keep it short. This is harder than it sounds – it’s tempting to go into every detail. But your post will have more impact the more people read it, and not many people will take the time to read more than a page or so. My usual target is 4-6 paragraphs, plus images/videos (and do include at least one image). Keep the details for the Github repo, documentation or whitepapers, and link to them as needed.

Lastly, follow journalistic principles and summarize your entire post in the first paragraph. Make sure your main point is included there – ideally in the first sentence – along with any “hero” links you’d like the reader to follow. That way a reader will know immediately if they’re interested in your content, and even if not, they still come away with the main point.

Improvements

  • Add the GPU results to the histogram
  • Divide the table of results into two
  • Add the computer features below the benchmark

Errors while trying football notebook

```
Generating match features...
Generating match labels...
Generating bookkeeper data...
/home/jon/.pyenv/versions/3.6.1/lib/python3.6/site-packages/pandas/core/indexing.py:297: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
/home/jon/.pyenv/versions/3.6.1/lib/python3.6/site-packages/pandas/core/indexing.py:477: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
(19673, 48)
CPU times: user 10min 50s, sys: 7.58 s, total: 10min 58s
Wall time: 10min 58s

In [9]: feables = convert_cols_categorical_to_numeric(feables)
        feables.head()

[autoreload of six failed: Traceback (most recent call last):
  File "/home/jon/.pyenv/versions/3.6.1/lib/python3.6/site-packages/IPython/extensions/autoreload.py", line 246, in check
  File "/home/jon/.pyenv/versions/3.6.1/lib/python3.6/site-packages/IPython/extensions/autoreload.py", line 385, in superreload
  File "/home/jon/.pyenv/versions/3.6.1/lib/python3.6/site-packages/IPython/extensions/autoreload.py", line 324, in update_generic
  File "/home/jon/.pyenv/versions/3.6.1/lib/python3.6/site-packages/IPython/extensions/autoreload.py", line 276, in update_class
  File "/home/jon/.pyenv/versions/3.6.1/lib/python3.6/site-packages/six.py", line 93, in __get__
    setattr(obj, self.name, result)  # Invokes __set__.
AttributeError: 'NoneType' object has no attribute 'cStringIO'
]
```
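The SettingWithCopyWarning above comes from chained indexing inside the feature-generation code; the pandas-recommended fix is a single .loc assignment. A minimal illustration with toy data (not the actual football features):

```python
import pandas as pd

df = pd.DataFrame({"team": ["A", "B"], "goals": [1, 2]})

# Chained indexing such as df[df["goals"] > 1]["team"] = "C" may write to a
# temporary copy and triggers SettingWithCopyWarning. A single .loc call
# selects rows and column in one step and updates the original frame:
df.loc[df["goals"] > 1, "team"] = "C"
print(df["team"].tolist())  # ['A', 'C']
```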

XGBoost GPU benchmarks

Hi, I am the author of the XGBoost GPU algorithms.

Your benchmarks of my GPU hist algorithm are simply running on the CPU. The reason for this is the 'tree_method':'hist' parameter is overriding the selection of the GPU updater. This was fixed some time ago but it seems you are using an older commit. The correct usage would now be to set 'tree_method':'gpu_hist'. I would appreciate if you can update your benchmarks, I think you might find my algorithm far more competitive.

I also noticed that the XGBoost CPU hist algorithm has not had the number of bins set correctly, so you would be comparing 256 bins for XGBoost against 63 bins for LightGBM. This was due to a mistake in our documentation regarding the naming of the parameter that I have noted in dmlc/xgboost#2567.

Thanks
Rory
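Following Rory's comment, the corrected benchmark settings would look roughly like this (a sketch; the objective and other values are assumptions, not the benchmark's actual configuration):

```python
# XGBoost: select the GPU histogram algorithm directly. In older commits,
# 'tree_method': 'hist' overrode the GPU updater and silently ran on CPU.
xgb_params = {
    "objective": "binary:logistic",
    "tree_method": "gpu_hist",  # was "hist" + GPU updater, which ran on CPU
    "max_bin": 63,              # match LightGBM's bin count for a fair test
}

# LightGBM: 63 bins, as in the original benchmark.
lgb_params = {
    "objective": "binary",
    "device": "gpu",
    "max_bin": 63,
}
print(xgb_params["tree_method"])  # gpu_hist
```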

Comparison experiment using BCI dataset

  • Download dataset
  • Create data loader and preprocessing
  • Create initial xgboost pipeline
  • Create initial lightgbm pipeline
  • Create comparison on test set
  • Add docs

General plan outline

Contents of MVP2:

  • Optimize airline experiment
  • Add comparison with XGBoost hist
  • Add Guolin experiments and run them on the same machine
  • Update notebooks with new parameters
  • Add experiments with GPU
  • Change notebooks with bokeh
  • Blog writeup

Planning Strata

Planning kick-off meeting. This is the first MVP; after it we iterate depending on the time we have.

Methodology (based on TDSP):

  • Experiment in notebooks; common libraries in Python
  • Develop in branches; merge to master via a PR accepted by an external person (not the committer)
  • Try to make atomic commits in case anybody wants to cherry-pick
  • Master always contains working code
  • FOCUS: get to MVP1 as fast as possible, then iterate
  • When someone picks a task, they assign it to themselves and change its state in Projects

TODO MVP1:

  • Start adding cards to the GitHub project
  • Ask Ned about the airline and sensor datasets
  • Set up a private GitHub repo: Azure
  • Set up the DSVM + libraries
  • Put the data in a file share
  • Timer class
  • Create conda environment variables for data and other settings
  • Data loaders
  • Experiments with LightGBM (sklearn API)
  • Compare with XGBoost
  • Parameter tuning framework
  • Plotter of model vs retrained model performance
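The "timer class" task above refers to the Timer used throughout the notebooks (e.g. `with Timer() as train_t:` in the tracebacks earlier). A minimal sketch of such a context manager, not necessarily the repo's exact implementation:

```python
from time import perf_counter

class Timer:
    """Context manager that measures elapsed wall-clock time in seconds."""

    def __enter__(self):
        self._start = perf_counter()
        return self

    def __exit__(self, *exc):
        # Store elapsed seconds; do not suppress exceptions (implicit None).
        self.interval = perf_counter() - self._start

with Timer() as t:
    sum(range(1000))
print(f"elapsed: {t.interval:.6f}s")
```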

Comparison experiment using football dataset

  • Download dataset and put it on fileshare
  • Create data loader
  • Initial preprocessing
  • Create initial xgboost pipeline
  • Create initial lightgbm pipeline
  • Create comparison on test set
  • Find concept drift
