Is it possible to have a node in a LightGBM model that has no samples assigned to it during training?
Given a dataset `X` with label `y`, while performing one round of boosting on `X` and `y`, I don't believe it's possible for LightGBM to produce any leaf nodes matching 0 samples in `X`.
There are checks in LightGBM's tree learner that explicitly prevent the addition of splits that would result in 0 samples on one side of the split.
Here's a minimal example using `lightgbm==4.3.0` showing how to trigger those checks, by providing a "forced" split that is impossible to satisfy.
```python
import lightgbm as lgb
import numpy as np
import json
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=10_000, n_features=1, centers=3)

with open("forced_splits.json", "w") as f:
    f.write(json.dumps(
        {"feature": 0, "threshold": np.max(X) + 1.0}
    ))

# construct the estimator + fit the model
bst = lgb.train(
    params={
        "forcedsplits_filename": "forced_splits.json",
        "objective": "multiclass",
        "min_gain_to_split": 0.0,
        "min_data_in_leaf": 0,
        "num_classes": 3,
        "num_iterations": 10,
        "num_leaves": 31,
        "verbose": 1
    },
    train_set=lgb.Dataset(X, label=y, params={"min_data_in_bin": 1})
)
```
```text
lightgbm.basic.LightGBMError: Check failed: (best_split_info.right_count) > (0) at /Users/jlamb/repos/LightGBM/lightgbm-python/src/treelearner/serial_tree_learner.cpp, line 856 .
```
Ok, so given that I just trained a LightGBM model on dataset `X`, every leaf in every tree will match at least one sample in `X`?
No.
LightGBM supports "training continuation", where you start from an already-created model and then perform additional boosting rounds to add trees to that model.
It's not necessary that the training dataset used to produce that initial model is the same as the one used to generate those new trees. If the distribution of features is very different in those two datasets, it's possible for earlier trees (from the pre-trained model) to have leaves that match 0 samples in the newly-provided dataset.
For more, see https://stackoverflow.com/questions/73664093/lightgbm-train-vs-update-vs-refit/73669068#73669068.
So you CANNOT assume that every leaf in every tree in a LightGBM model `m` will match at least one record in training dataset `X`, unless every tree in the model was created from `X`.
A LightGBM model can be created from multiple, heterogeneous datasets, so you cannot assume that for a given model there is such a thing as "the" single training dataset.
Wait, you can set `min_child_samples = 0`?
Yes. In shap/shap#3574, you mentioned the parameter `min_child_samples` several times. That is an alias for `min_data_in_leaf`, which must be in the range `>= 0` (parameter docs). Setting it to `0` is supported as a way to say "use other mechanisms to prevent overfitting instead", like:
- `min_gain_to_split` = minimum total change in the loss function that a split must provide to be added to the tree
- `max_depth`, `num_leaves` = limits on the number of leaves, regardless of how many samples fall in them
@jameslamb Thank you very much! Really appreciate your help!