[Question] SHAP library: Is it possible to have a node in LightGBM that has no coverage (no samples assigned to it)? (about microsoft/lightgbm)

Comments (2)

jameslamb commented on May 25, 2024

> possible to have a node in LightGBM that has no samples assigned to it during training

Given a dataset X with label y, while performing one round of boosting on X and y I don't believe it's possible for LightGBM to produce any leaf nodes matching 0 samples in X.

These checks explicitly prevent the addition of splits that result in 0 samples on one side of the split.

LightGBM/src/treelearner/serial_tree_learner.cpp

Line 846 in 28536a0

CHECK_GT(best_split_info.left_count, 0);

LightGBM/src/treelearner/serial_tree_learner.cpp

Line 856 in 28536a0

CHECK_GT(best_split_info.right_count, 0);

LightGBM/src/treelearner/serial_tree_learner.cpp

Line 880 in 28536a0

CHECK_GT(best_split_info.right_count, 0);

Here's a minimal example using lightgbm==4.3.0 showing how to trigger those checks, by providing a "forced" split that is impossible to satisfy.

import lightgbm as lgb
import numpy as np
import json
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=10_000, n_features=1, centers=3)

with open("forced_splits.json", "w") as f:
    f.write(json.dumps(
        {"feature": 0, "threshold": np.max(X) + 1.0}
    ))

# train the model; the impossible forced split should trip the check
bst = lgb.train(
    params={
        "forcedsplits_filename": "forced_splits.json",
        "objective": "multiclass",
        "min_gain_to_split": 0.0,
        "min_data_in_leaf": 0,
        "num_classes": 3,
        "num_iterations": 10,
        "num_leaves": 31,
        "verbose": 1
    },
    train_set=lgb.Dataset(X, label=y, params={"min_data_in_bin": 1})
)

lightgbm.basic.LightGBMError: Check failed: (best_split_info.right_count) > (0) at /Users/jlamb/repos/LightGBM/lightgbm-python/src/treelearner/serial_tree_learner.cpp, line 856 .

Ok, so given that I just trained a LightGBM model on dataset `X`, every leaf in every tree will match at least one sample in `X`?

No.

LightGBM supports "training continuation", where you start from an already-created model and then perform additional boosting rounds to add trees to that model.

It's not necessary that the training dataset used to produce that initial model is the same as the one used to generate those new trees. If the distribution of features is very different in those two datasets, it's possible for earlier trees (from the pre-trained model) to have leaves that match 0 samples in the newly-provided dataset.

For more, see https://stackoverflow.com/questions/73664093/lightgbm-train-vs-update-vs-refit/73669068#73669068.

So you CANNOT assume that every leaf in every tree in a LightGBM model m will match at least one record in training dataset X, unless every tree in the model was created from X.

A LightGBM model can be created from multiple, heterogeneous datasets, so you cannot assume that for a given model there is such a thing as "the" single training dataset.

Wait, you can set `min_child_samples = 0`?

Yes. In shap/shap#3574, you mentioned the parameter min_child_samples several times. That is an alias for min_data_in_leaf, which must be >= 0 (see the parameter docs).

Setting it to 0 is supported as a way to say "use other mechanisms to prevent overfitting instead", like:

min_gain_to_split = minimum total change in the loss function that a split must provide to be added to the tree
max_depth, num_leaves = limits on the number of leaves, regardless of how many samples fall in them

NegatedObjectIdentity commented on May 25, 2024

@jameslamb Thank you very much! Really appreciating your help!
