
lightgbm_ray's Introduction

Distributed LightGBM on Ray


LightGBM-Ray is a distributed backend for LightGBM, built on top of distributed computing framework Ray.

LightGBM-Ray

  • enables multi-node and multi-GPU training,
  • integrates seamlessly with the distributed hyperparameter optimization library Ray Tune,
  • comes with fault tolerance handling mechanisms, and
  • supports distributed dataframes and distributed data loading.

All releases are tested on large clusters and workloads.

This package is based on XGBoost-Ray. As of now, XGBoost-Ray is a dependency for LightGBM-Ray.

Installation

You can install the latest LightGBM-Ray release from PIP:

pip install "lightgbm_ray"

If you'd like to install the latest master, use this command instead:

pip install "git+https://github.com/ray-project/lightgbm_ray.git#egg=lightgbm_ray"

Usage

LightGBM-Ray provides a drop-in replacement for LightGBM's train function. To pass data, a RayDMatrix object is required (this class is shared with XGBoost-Ray). You can also use a scikit-learn interface - see the next section.

Just as with the original lgbm.train() function, the training parameters are passed as the params dictionary.

Ray-specific distributed training parameters are configured with a lightgbm_ray.RayParams object. For instance, you can set the num_actors property to specify how many distributed actors you would like to use.

Here is a simplified example (which requires sklearn):

Training:

from lightgbm_ray import RayDMatrix, RayParams, train
from sklearn.datasets import load_breast_cancer

train_x, train_y = load_breast_cancer(return_X_y=True)
train_set = RayDMatrix(train_x, train_y)

evals_result = {}
bst = train(
    {
        "objective": "binary",
        "metric": ["binary_logloss", "binary_error"],
    },
    train_set,
    evals_result=evals_result,
    valid_sets=[train_set],
    valid_names=["train"],
    verbose_eval=False,
    ray_params=RayParams(num_actors=2, cpus_per_actor=2))

bst.booster_.save_model("model.lgbm")
print("Final training error: {:.4f}".format(
    evals_result["train"]["binary_error"][-1]))

Prediction:

from lightgbm_ray import RayDMatrix, RayParams, predict
from sklearn.datasets import load_breast_cancer
import lightgbm as lgbm

data, labels = load_breast_cancer(return_X_y=True)

dpred = RayDMatrix(data, labels)

bst = lgbm.Booster(model_file="model.lgbm")
pred_ray = predict(bst, dpred, ray_params=RayParams(num_actors=2))

print(pred_ray)

scikit-learn API

LightGBM-Ray also features a scikit-learn API that fully mirrors the pure LightGBM scikit-learn API, providing a complete drop-in replacement. The following estimators are available:

  • RayLGBMClassifier
  • RayLGBMRegressor

Example usage of RayLGBMClassifier:

from lightgbm_ray import RayLGBMClassifier, RayParams
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

seed = 42

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.25, random_state=42)

clf = RayLGBMClassifier(
    n_jobs=2,  # In LightGBM-Ray, n_jobs sets the number of actors
    random_state=seed)

# scikit-learn API will automatically convert the data
# to RayDMatrix format as needed.
# You can also pass X as a RayDMatrix, in which case
# y will be ignored.

clf.fit(X_train, y_train)

pred_ray = clf.predict(X_test)
print(pred_ray)

pred_proba_ray = clf.predict_proba(X_test)
print(pred_proba_ray)

# It is also possible to pass a RayParams object
# to fit/predict/predict_proba methods - will override
# n_jobs set during initialization

clf.fit(X_train, y_train, ray_params=RayParams(num_actors=2))

pred_ray = clf.predict(X_test, ray_params=RayParams(num_actors=2))
print(pred_ray)

Things to keep in mind:

  • n_jobs parameter controls the number of actors spawned. You can pass a RayParams object to the fit/predict/predict_proba methods as the ray_params argument for greater control over resource allocation. Doing so will override the value of n_jobs with the value of ray_params.num_actors attribute. For more information, refer to the Resources section below.
  • By default n_jobs is set to 1, which means the training will not be distributed. Make sure to either set n_jobs to a higher value or pass a RayParams object as outlined above in order to take advantage of LightGBM-Ray's functionality.
  • After calling fit, additional evaluation results (e.g. training time, number of rows, callback results) will be available under the additional_results_ attribute (see the sketch after this list).
  • eval_ arguments are supported, but early stopping is not.
  • LightGBM-Ray's scikit-learn API is based on LightGBM 3.2.1. While we try to support older LightGBM versions, please note that this library is only fully tested and supported for LightGBM >= 3.2.1.
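
For instance, continuing the RayLGBMClassifier example above (a minimal sketch; clf was fit there, and the exact contents of the dictionary depend on your setup):

# After any of the fit() calls above, extra per-training information
# collected by LightGBM-Ray is exposed on the fitted estimator.
print(clf.additional_results_)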

For more information on the scikit-learn API, refer to the LightGBM documentation.

Data loading

Data is passed to LightGBM-Ray via a RayDMatrix object.

The RayDMatrix lazily loads data and stores it sharded in the Ray object store. The Ray LightGBM actors then access these shards to run their training on them.

A RayDMatrix supports various data and file types, such as Pandas DataFrames, Numpy arrays, CSV files, and Parquet files.

Example loading multiple parquet files:

import glob
from lightgbm_ray import RayDMatrix, RayFileType

# We can also pass a list of files
path = list(sorted(glob.glob("/data/nyc-taxi/*/*/*.parquet")))

# This argument will be passed to `pd.read_parquet()`
columns = [
    "passenger_count",
    "trip_distance", "pickup_longitude", "pickup_latitude",
    "dropoff_longitude", "dropoff_latitude",
    "fare_amount", "extra", "mta_tax", "tip_amount",
    "tolls_amount", "total_amount"
]

dtrain = RayDMatrix(
    path, 
    label="passenger_count",  # Will select this column as the label
    columns=columns,
    # ignore=["total_amount"],  # Optional list of columns to ignore
    filetype=RayFileType.PARQUET)

Hyperparameter Tuning

LightGBM-Ray integrates with Ray Tune to provide distributed hyperparameter tuning for your distributed LightGBM models. You can run multiple LightGBM-Ray training runs in parallel, each with a different hyperparameter configuration, and each training run parallelized by itself. All you have to do is move your training code to a function, and pass the function to tune.run. Internally, train will detect if tune is being used and will automatically report results to tune.

Example using LightGBM-Ray with Ray Tune:

from lightgbm_ray import RayDMatrix, RayParams, train
from sklearn.datasets import load_breast_cancer

num_actors = 2
num_cpus_per_actor = 2

ray_params = RayParams(
    num_actors=num_actors, cpus_per_actor=num_cpus_per_actor)

def train_model(config):
    train_x, train_y = load_breast_cancer(return_X_y=True)
    train_set = RayDMatrix(train_x, train_y)

    evals_result = {}
    bst = train(
        params=config,
        dtrain=train_set,
        evals_result=evals_result,
        valid_sets=[train_set],
        valid_names=["train"],
        verbose_eval=False,
        ray_params=ray_params)
    bst.booster_.save_model("model.lgbm")

from ray import tune

# Specify the hyperparameter search space.
config = {
    "objective": "binary",
    "metric": ["binary_logloss", "binary_error"],
    "eta": tune.loguniform(1e-4, 1e-1),
    "subsample": tune.uniform(0.5, 1.0),
    "max_depth": tune.randint(1, 9)
}

# Make sure to use the `get_tune_resources` method to set the `resources_per_trial`
analysis = tune.run(
    train_model,
    config=config,
    metric="train-binary_error",
    mode="min",
    num_samples=4,
    resources_per_trial=ray_params.get_tune_resources())
print("Best hyperparameters", analysis.best_config)

Also see examples/simple_tune.py for another example.

Fault tolerance

LightGBM-Ray leverages the stateful Ray actor model to enable fault tolerant training. Currently, only non-elastic training is supported.

Non-elastic training (warm restart)

When an actor or node dies, LightGBM-Ray will retain the state of the remaining actors. In non-elastic training, the failed actors will be replaced as soon as resources are available again. Only these actors will reload their parts of the data. Training will resume once all actors are ready for training again.

You can configure this mode in the RayParams:

from lightgbm_ray import RayParams

ray_params = RayParams(
    max_actor_restarts=2,    # How many times failed actors may be restarted. Default = 0
)

Resources

By default, LightGBM-Ray tries to determine the number of CPUs available and distributes them evenly across actors.

It is important to note that distributed LightGBM needs at least two CPUs per actor to function efficiently (without blocking). Therefore, by default, at least two CPUs will be assigned to each actor, and an exception will be raised if an actor has fewer than two CPUs. It is possible to override this check by setting the allow_less_than_two_cpus argument to True, though it is not recommended, as it will negatively impact training performance.

In the case of very large clusters or clusters with many different machine sizes, it makes sense to limit the number of CPUs per actor by setting the cpus_per_actor argument. Consider always setting this explicitly.

The number of LightGBM actors always has to be set manually with the num_actors argument.
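
For CPU-only training, a minimal configuration following the guidance above might look like the sketch below (adjust the numbers to your cluster):

from lightgbm_ray import RayParams

# 4 actors with 4 CPUs each - i.e. 16 CPUs reserved in total, and at least
# the two CPUs per actor that distributed LightGBM requires.
ray_params = RayParams(num_actors=4, cpus_per_actor=4)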

Multi GPU training

LightGBM-Ray enables multi GPU training. The LightGBM core backend will automatically handle communication. All you have to do is start one actor per GPU and set LightGBM's device_type to a GPU-compatible option, e.g. gpu (see the LightGBM documentation for more details).

For instance, if you have 2 machines with 4 GPUs each, you will want to start 8 remote actors, and set gpus_per_actor=1. There is usually no benefit in allocating less (e.g. 0.5) or more than one GPU per actor.

You should divide the CPUs evenly across actors per machine, so if your machines have 16 CPUs in addition to the 4 GPUs, each actor should have 4 CPUs to use.

from lightgbm_ray import RayParams

ray_params = RayParams(
    num_actors=8,
    gpus_per_actor=1,
    cpus_per_actor=4,   # Divide evenly across actors per machine
)

How many remote actors should I use?

This depends on your workload and your cluster setup. Generally there is no inherent benefit of running more than one remote actor per node for CPU-only training. This is because LightGBM core can already leverage multiple CPUs via threading.

However, there are some cases when you should consider starting more than one actor per node:

  • For multi GPU training, each GPU should have a separate remote actor. Thus, if your machine has 24 CPUs and 4 GPUs, you will want to start 4 remote actors with 6 CPUs and 1 GPU each (see the sketch after this list).
  • In a heterogeneous cluster, you might want to find the greatest common divisor for the number of CPUs. E.g. for a cluster with three nodes of 4, 8, and 12 CPUs, respectively, you should set the number of actors to 6 and the CPUs per actor to 4.
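
For example, for the single machine described in the first bullet (24 CPUs and 4 GPUs), the configuration could look like the following sketch, using the RayParams arguments introduced above:

from lightgbm_ray import RayParams

# One actor per GPU; the machine's 24 CPUs divided evenly between the 4 actors.
ray_params = RayParams(
    num_actors=4,
    gpus_per_actor=1,
    cpus_per_actor=6,
)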

Distributed data loading

LightGBM-Ray can leverage both centralized and distributed data loading.

In centralized data loading, the data is partitioned by the head node and stored in the object store. Each remote actor then retrieves their partitions by querying the Ray object store. Centralized loading is used when you pass centralized in-memory dataframes, such as Pandas dataframes or Numpy arrays, or when you pass a single source file, such as a single CSV or Parquet file.

from lightgbm_ray import RayDMatrix

# This will use centralized data loading, as only one source file is specified
# `label_col` is a column in the CSV, used as the target label
ray_params = RayDMatrix("./source_file.csv", label="label_col")

In distributed data loading, each remote actor loads their data directly from the source (e.g. local hard disk, NFS, HDFS, S3), without a central bottleneck. The data is still stored in the object store, but locally to each actor. This mode is used automatically when loading data from multiple CSV or Parquet files. Please note that we do not check or enforce partition sizes in this case - it is your job to make sure the data is evenly distributed across the source files.

from lightgbm_ray import RayDMatrix

# This will use distributed data loading, as four source files are specified
# Please note that you cannot schedule more than four actors in this case.
# `label_col` is a column in the Parquet files, used as the target label
ray_params = RayDMatrix([
    "hdfs:///tmp/part1.parquet",
    "hdfs:///tmp/part2.parquet",
    "hdfs:///tmp/part3.parquet",
    "hdfs:///tmp/part4.parquet",
], label="label_col")

Lastly, LightGBM-Ray supports distributed dataframe representations, such as Ray Datasets, Modin and Dask dataframes (used with Dask on Ray). Here, LightGBM-Ray will check on which nodes the distributed partitions are currently located, and will assign partitions to actors in order to minimize cross-node data transfer. Please note that we also assume here that partition sizes are uniform.

from lightgbm_ray import RayDMatrix

# This will try to allocate the existing Modin partitions
# to co-located Ray actors. If this is not possible, data will
# be transferred across nodes
ray_params = RayDMatrix(existing_modin_df)

Data sources

The following data sources can be used with a RayDMatrix object.

Type             | Centralized loading | Distributed loading
Numpy array      | Yes                 | No
Pandas dataframe | Yes                 | No
Single CSV       | Yes                 | No
Multi CSV        | Yes                 | Yes
Single Parquet   | Yes                 | No
Multi Parquet    | Yes                 | Yes
Ray Dataset      | Yes                 | Yes
Petastorm        | Yes                 | Yes
Dask dataframe   | Yes                 | Yes
Modin dataframe  | Yes                 | Yes

Memory usage

Details coming soon.

Best practices

In order to reduce peak memory usage, consider the following suggestions:

  • Store data as float32 or less. More precision is often not needed, and keeping data in a smaller format will help reduce peak memory usage for initial data loading.
  • Pass the dtype when loading data from CSV. Otherwise, floating point values will be loaded as np.float64 by default, increasing peak memory usage by 33% (see the sketch below).
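
A minimal sketch of the second point, assuming a hypothetical train.csv with a label_col target column: read the CSV with float32 dtypes up front and pass the resulting dataframe to RayDMatrix, so the values are never materialized as float64.

import numpy as np
import pandas as pd
from lightgbm_ray import RayDMatrix

# "train.csv" and "label_col" are placeholders for your own file and label column.
df = pd.read_csv("train.csv", dtype=np.float32)
dtrain = RayDMatrix(df, label="label_col")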

Placement Strategies

LightGBM-Ray leverages Ray's Placement Group API (https://docs.ray.io/en/master/placement-group.html) to implement placement strategies for better fault tolerance.

By default, a SPREAD strategy is used for training, which attempts to spread all of the training workers across the nodes in a cluster on a best-effort basis. This improves fault tolerance since it minimizes the number of worker failures when a node goes down, but comes at the cost of increased inter-node communication. To disable this strategy, set the RXGB_USE_SPREAD_STRATEGY environment variable to 0. If disabled, no particular placement strategy will be used.

When LightGBM-Ray is used with Ray Tune for hyperparameter tuning, a PACK strategy is used. This strategy attempts to place all workers for each trial on the same node on a best-effort basis. This means that if a node goes down, it will be less likely to impact multiple trials.

When placement strategies are used, LightGBM-Ray will wait for 100 seconds for the required resources to become available, and will fail if the required resources cannot be reserved and the cluster cannot autoscale to increase the number of resources. You can change the RXGB_PLACEMENT_GROUP_TIMEOUT_S environment variable to modify how long this timeout should be.
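
Both settings above are plain environment variables, so they can be set in the shell or in Python before training starts, as in this sketch (the values are examples only):

import os

# Disable the SPREAD placement strategy (default: enabled).
os.environ["RXGB_USE_SPREAD_STRATEGY"] = "0"

# Wait up to 300 seconds (instead of the default 100) for placement group resources.
os.environ["RXGB_PLACEMENT_GROUP_TIMEOUT_S"] = "300"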

More examples

For complete end-to-end examples, please have a look at the examples folder in the repository.


lightgbm_ray's Issues

import RayLGBMClassifier error

I'm getting the following error when I try to import RayLGBMClassifier.

Traceback (most recent call last):
  File "/home/mforeman/miniconda3/envs/rapids-23.06/classifiers4.py", line 14, in <module>
    from lightgbm_ray import RayLGBMClassifier
  File "/home/mforeman/miniconda3/envs/rapids-23.06/lib/python3.10/site-packages/lightgbm_ray/__init__.py", line 1, in <module>
    from lightgbm_ray.main import RayParams, train, predict
  File "/home/mforeman/miniconda3/envs/rapids-23.06/lib/python3.10/site-packages/lightgbm_ray/main.py", line 55, in <module>
    from xgboost_ray.main import (
  File "/home/mforeman/miniconda3/envs/rapids-23.06/lib/python3.10/site-packages/xgboost_ray/__init__.py", line 1, in <module>
    from xgboost_ray.main import RayParams, predict, train
  File "/home/mforeman/miniconda3/envs/rapids-23.06/lib/python3.10/site-packages/xgboost_ray/main.py", line 76, in <module>
    from xgboost_ray.matrix import (
  File "/home/mforeman/miniconda3/envs/rapids-23.06/lib/python3.10/site-packages/xgboost_ray/matrix.py", line 36, in <module>
    from ray.data.dataset import Dataset as RayDataset
  File "/home/mforeman/miniconda3/envs/rapids-23.06/lib/python3.10/site-packages/ray/data/__init__.py", line 5, in <module>
    from ray.data._internal.compute import ActorPoolStrategy
  File "/home/mforeman/miniconda3/envs/rapids-23.06/lib/python3.10/site-packages/ray/data/_internal/compute.py", line 8, in <module>
    from ray.data._internal.delegating_block_builder import DelegatingBlockBuilder
  File "/home/mforeman/miniconda3/envs/rapids-23.06/lib/python3.10/site-packages/ray/data/_internal/delegating_block_builder.py", line 4, in <module>
    from ray.data._internal.arrow_block import ArrowBlockBuilder
  File "/home/mforeman/miniconda3/envs/rapids-23.06/lib/python3.10/site-packages/ray/data/_internal/arrow_block.py", line 22, in <module>
    from ray.data._internal.numpy_support import (
  File "/home/mforeman/miniconda3/envs/rapids-23.06/lib/python3.10/site-packages/ray/data/_internal/numpy_support.py", line 5, in <module>
    from ray.air.util.tensor_extensions.utils import create_ragged_ndarray
  File "/home/mforeman/miniconda3/envs/rapids-23.06/lib/python3.10/site-packages/ray/air/__init__.py", line 1, in <module>
    from ray.air.checkpoint import Checkpoint
  File "/home/mforeman/miniconda3/envs/rapids-23.06/lib/python3.10/site-packages/ray/air/checkpoint.py", line 22, in <module>
    from ray.air._internal.remote_storage import (
  File "/home/mforeman/miniconda3/envs/rapids-23.06/lib/python3.10/site-packages/ray/air/_internal/remote_storage.py", line 142, in <module>
    _cached_fs: Dict[tuple, Tuple[float, pyarrow.fs.FileSystem]] = {}
AttributeError: 'NoneType' object has no attribute 'fs'

Support for CrossValidation: Enhancement Request

I am using RayDP with Spark and am using this package with Ray Tune for hyperparameter optimization with the LightGBM regressor. Unless there is something I'm missing, there's no way to use LightGBM's native cross-validation as in Ray's examples; this would be a huge help to model accuracy when training large models.

"grpc_message":"Received message larger than max

_InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.RESOURCE_EXHAUSTED
details = "Received message larger than max (210771146 vs. 104857600)"
debug_error_string = "{"created":"@1646188124.309444695","description":"Error received from peer ipv4:10.207.183.32:40455","file":"src/core/lib/surface/call.cc","file_line":1074,"grpc_message":"Received message larger than max (210771146 vs. 104857600)","grpc_status":8}"

where are the examples

The examples folder is empty and the links to these examples are all broken. Please provide an updated link to the examples, thank you!

Unintuitive naming of RayDMatrix class

I've just realized that the RayDMatrix class exported by lightgbm_ray is actually xgboost_ray.matrix.RayDMatrix, rather than lightgbm_ray.matrix.RayDMatrix.

I understand code re-use, but name re-use? It is apparently by design (as per docs), but in my opinion it violates the principle of least astonishment and thus should be changed to a more intuitive lightgbm_ray.matrix.RayDMatrix.

To reproduce:

from lightgbm_ray import RayParams as test1, RayDMatrix as test2
print([test1, test2])

Output:

[<class 'lightgbm_ray.main.RayParams'>, <class 'xgboost_ray.matrix.RayDMatrix'>]

Error when running example

Setup

conda create --name lgbm python=3.8
conda activate lgbm
conda install lightgbm
pip install lightgbm_ray

Script:

# light_ray.py
from lightgbm_ray import RayDMatrix, RayParams, train
from sklearn.datasets import load_breast_cancer

train_x, train_y = load_breast_cancer(return_X_y=True)
train_set = RayDMatrix(train_x, train_y)

evals_result = {}
bst = train(
    {
        "objective": "binary",
        "metric": ["binary_logloss", "binary_error"],
    },
    train_set,
    num_boost_round=10,
    evals_result=evals_result,
    valid_sets=[train_set],
    valid_names=["train"],
    verbose_eval=False,
    ray_params=RayParams(num_actors=2, cpus_per_actor=2))


bst.booster_.save_model("model.lgbm")

Exception:

% python light_ray.py 
Traceback (most recent call last):
  File "light_ray.py", line 1, in <module>
    from lightgbm_ray import RayDMatrix, RayParams, train
  File "/Users/will/opt/anaconda3/envs/lgbm/lib/python3.8/site-packages/lightgbm_ray/__init__.py", line 1, in <module>
    from lightgbm_ray.main import RayParams, train, predict
  File "/Users/will/opt/anaconda3/envs/lgbm/lib/python3.8/site-packages/lightgbm_ray/main.py", line 44, in <module>
    from lightgbm import LGBMModel, LGBMRanker, Booster
  File "/Users/will/opt/anaconda3/envs/lgbm/lib/python3.8/site-packages/lightgbm/__init__.py", line 8, in <module>
    from .basic import Booster, Dataset, register_logger
  File "/Users/will/opt/anaconda3/envs/lgbm/lib/python3.8/site-packages/lightgbm/basic.py", line 95, in <module>
    _LIB = _load_lib()
  File "/Users/will/opt/anaconda3/envs/lgbm/lib/python3.8/site-packages/lightgbm/basic.py", line 86, in _load_lib
    lib = ctypes.cdll.LoadLibrary(lib_path[0])
  File "/Users/will/opt/anaconda3/envs/lgbm/lib/python3.8/ctypes/__init__.py", line 459, in LoadLibrary
    return self._dlltype(name)
  File "/Users/will/opt/anaconda3/envs/lgbm/lib/python3.8/ctypes/__init__.py", line 381, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: dlopen(/Users/will/opt/anaconda3/envs/lgbm/lib/python3.8/site-packages/lightgbm/lib_lightgbm.so, 6): Library not loaded: /usr/local/opt/libomp/lib/libomp.dylib
  Referenced from: /Users/will/opt/anaconda3/envs/lgbm/lib/python3.8/site-packages/lightgbm/lib_lightgbm.so
  Reason: image not found

Interaction constraints not working

Hi,

I've been testing using the interaction_constraints parameter from lightgbm (see here).

Unfortunately, passing in the list of constraints causes training to fail with a SIGSEGV error.

Example:

#%%
# set up and load boston data
import numpy as np
import pandas as pd
from lightgbm_ray import RayLGBMRegressor, RayParams, RayDMatrix
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import ray
import os

boston = load_boston()
x, y = boston.data, boston.target
df = pd.DataFrame(x, columns= boston.feature_names)

# make into dmatrix
train_df_with_target = df.copy()
train_df_with_target['target'] = y

train_set = RayDMatrix(
    data=train_df_with_target,
    label = 'target'
    )

# set params and ray params
params = {
    'boosting_type': 'goss',
    'objective': 'regression',
    'metric': 'rmse',
    'num_leaves': 10,
    'max_depth': 4,
    'learning_rate': 0.05,
    'verbose': 1
}

ray_params = RayParams(
    num_actors=2,
    cpus_per_actor = 2,
    )



#%% set up constraint (age cannot interact with any other feature)
constrained_feature = 'AGE'
other_features = [x for x,y in enumerate(df.columns) if y != constrained_feature ]
constrained_feature_idx = [x for x,y in enumerate(df.columns) if y == constrained_feature ]

constraint = [constrained_feature_idx, other_features]


#%% fit model
mod_ray_constrained = RayLGBMRegressor(
    random_state=100,
    interaction_constraints = constraint,
    **params
)


mod_ray_constrained.fit(train_set,
    y='target',
    eval_set = [(train_set, 'target')],
    eval_names=["train"],
    ray_params=ray_params)

The constrained model fit returns the error:

(_RemoteRayLightGBMActor pid=266, ip=10.99.13.194) *** SIGSEGV received at time=1673976819 on cpu 3 ***
(_RemoteRayLightGBMActor pid=266, ip=10.99.13.194) PC: @ 0x7fedc74926f7 (unknown) (unknown)
(_RemoteRayLightGBMActor pid=266, ip=10.99.13.194) @ 0x7fedc750d420 (unknown) (unknown)
(_RemoteRayLightGBMActor pid=266, ip=10.99.13.194) [2023-01-17 09:33:39,994 E 266 292] logging.cc:361: *** SIGSEGV received at time=1673976819 on cpu 3 ***
(_RemoteRayLightGBMActor pid=266, ip=10.99.13.194) [2023-01-17 09:33:39,994 E 266 292] logging.cc:361: PC: @ 0x7fedc74926f7 (unknown) (unknown)
(_RemoteRayLightGBMActor pid=266, ip=10.99.13.194) [2023-01-17 09:33:39,994 E 266 292] logging.cc:361: @ 0x7fedc750d420 (unknown) (unknown)
(_RemoteRayLightGBMActor pid=266, ip=10.99.13.194) Fatal Python error: Segmentation fault

Running this code with non-distributed lightgbm works fine, as does the above code with interaction constraints removed.

Fix client tests being flaky (timing out)

Client tests in test_end_to_end.py often time out during GitHub Actions CI (though not always). This should be fixed.

They don't seem to time out locally. My guess is that it's due to fewer cores being available on the CI runner.

Ray Tune custom callback based on model structure

I have some code that uses a callback to stop a Ray Tune trial if the complexity of the model (total leaves in the model) exceeds a given threshold. This works fine with a normal lightgbm model but fails when I use a lightgbm_ray model.

In the below code, "use_distributed" can be toggled to True to reproduce the error.

I presume the error is because the correct way of passing the metrics back to tune is with the TuneReportCheckpointCallback() from ray.tune.integration.lightgbm. I've played around with this, but it seems like I can only access the metrics reported by the lightgbm model. I can't add the "total_leaves" as a metric because it relies on accessing the model itself, not just the data and predictions.

Is it possible to report total_leaves to ray tune with lightgbm_ray?

#%%
# set up and load boston data
import numpy as np
import pandas as pd
import os
import lightgbm
from lightgbm_ray import RayLGBMRegressor, RayParams, RayDMatrix
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import ray
from ray.air import session
from ray import tune
from ray.tune.search.optuna import OptunaSearch


ray.shutdown()
## Initialise ray:
if ray.is_initialized() == False:
    service_host = os.environ['RAY_HEAD_SERVICE_HOST']
    service_port = os.environ['RAY_HEAD_SERVICE_PORT']
    ray.init(
        f'ray://{service_host}:{service_port}'
    )

use_distributed = False
out_dir = '/path/to/output_folder'

boston = load_boston()
x, y = boston.data, boston.target
df = pd.DataFrame(x, columns= boston.feature_names)

# make into dmatrix
if use_distributed:
    actors = 2
    ray_params = RayParams(
        num_actors= actors,
        cpus_per_actor = 2,
    )


    train_df_with_target = df.copy()
    train_df_with_target['target'] = y

    train_set = RayDMatrix(
        data=train_df_with_target,
        label = 'target'
        )
else:
    actors = 1
    

# set params and ray params
params = {
    'boosting_type': 'goss',
    'objective': 'regression',
    'metric': 'rmse',
    'n_estimators':100,
    'num_leaves': 6,
    'max_depth': 3,
    'learning_rate': tune.quniform(0.05,0.1, 0.01),
    'verbose': 1
}




#%% define function to count total leaves in model
def leaves_callback(env):
    model = env.model

    mod_dump = model.dump_model()
    tree_info = mod_dump['tree_info']
    num_leaves = 0
    num_iterations = 0
    for tree in tree_info:
        num_leaves += tree['num_leaves']
        num_iterations += 1

    session.report({'total_leaves': num_leaves,
                    "rmse_train":  env.evaluation_result_list[0][2],
                    'num_iterations': num_iterations})

# define trainable
def trainable(params):
    if use_distributed:
        mod_ray = RayLGBMRegressor(
            random_state=100,
            **params
        )


        mod_ray.fit(train_set,
            y='target',
            eval_set = [(train_set, 'target')],
            eval_names=["train"],
            ray_params=ray_params,
            callbacks = [leaves_callback])
    else:
        mod = lightgbm.LGBMRegressor(
            random_state=100,
            **params
        )

        mod.fit(X = x,
            y=y,
            eval_set = [(x, y)],
            eval_names=["train"],
            callbacks = [leaves_callback])


#%% RUN TUNING

resources = [{'CPU': 2.0} for x in range(actors+1)] + [{'CPU': 1.0}]

analysis = tune.Tuner(
    tune.with_resources(
            trainable,
            tune.PlacementGroupFactory(
                resources,
                strategy='PACK')
        ),
    tune_config=tune.TuneConfig(
        metric="rmse_train",
        mode= "min",
        search_alg=OptunaSearch(),
        num_samples=5),
        
    run_config= ray.air.RunConfig(local_dir=out_dir,
                                name = 'test_callback',
                                stop = {'total_leaves': 300}),
    param_space= params,     
    )


results = analysis.fit()

If I toggle use_distributed to True, I get the following error:

(_RemoteRayLightGBMActor pid=585, ip=10.99.15.76) File "/opt/conda/lib/python3.9/site-packages/ray/air/session.py", line 61, in report
(_RemoteRayLightGBMActor pid=585, ip=10.99.15.76) _get_session().report(metrics, checkpoint=checkpoint)
(_RemoteRayLightGBMActor pid=585, ip=10.99.15.76) AttributeError: 'NoneType' object has no attribute 'report'

If I toggle use_distributed to False, I get the expected result:

(TunerInternal pid=2096) +--------------------+------------+-----------------+-----------------+--------+------------------+----------------+--------------+------------------+
(TunerInternal pid=2096) | Trial name | status | loc | learning_rate | iter | total time (s) | total_leaves | rmse_train | num_iterations |
(TunerInternal pid=2096) |--------------------+------------+-----------------+-----------------+--------+------------------+----------------+--------------+------------------|
(TunerInternal pid=2096) | trainable_a895fa72 | TERMINATED | 10.99.5.8:2131 | 0.05 | 56 | 0.24444 | 300 | 3.44845 | 56 |
(TunerInternal pid=2096) | trainable_aa0be088 | TERMINATED | 10.99.5.8:2131 | 0.1 | 61 | 0.296924 | 302 | 2.82896 | 61 |
(TunerInternal pid=2096) | trainable_aa3ab41c | TERMINATED | 10.99.5.8:2131 | 0.08 | 60 | 0.354107 | 301 | 2.89081 | 60 |
(TunerInternal pid=2096) | trainable_aa6d4d32 | TERMINATED | 10.99.15.76:749 | 0.07 | 59 | 0.310418 | 300 | 2.99355 | 59 |
(TunerInternal pid=2096) | trainable_aa89c7a0 | TERMINATED | 10.99.5.8:2131 | 0.05 | 56 | 0.265122 | 300 | 3.44845 | 56 |
(TunerInternal pid=2096) +--------------------+------------+-----------------+-----------------+--------+------------------+----------------+--------------+------------------+

Ray lightgbm reproducibility issue

@Yard1 Hi, I was trying LightGBM-Ray on a large dataset with 3 num_actors and 3 CPUs per actor. With this setup, the result keeps changing across different runs. Can you guide me on how to make results reproducible in LightGBM-Ray?

I have set the following seeds:

Lightgbm random state seed

import numpy as np
np.random.seed(seed)

import random as python_random
python_random.seed(seed)

Any more seeds or parameters to set?
