Comments (2)

jameslamb commented on May 25, 2024

possible to have a node in LightGBM that has no samples assigned to it during training

Given a dataset X with label y, I don't believe a single round of boosting on X and y can produce a leaf node that matches 0 samples in X.

These checks explicitly prevent the addition of splits that result in 0 samples on one side of the split.

CHECK_GT(best_split_info.left_count, 0);

CHECK_GT(best_split_info.right_count, 0);
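
The invariant those checks enforce can be illustrated outside of LightGBM. The following is only a Python sketch of the idea, not LightGBM's implementation: the function name check_split and the toy values are hypothetical, and it assumes a numerical split where values less than or equal to the threshold go left.

```python
def check_split(values, threshold):
    """Partition values at threshold; refuse a split that leaves one side empty."""
    left = [v for v in values if v <= threshold]
    right = [v for v in values if v > threshold]
    # Analogous in spirit to the left_count / right_count checks quoted above.
    if len(left) == 0 or len(right) == 0:
        raise ValueError(
            f"split at {threshold} puts {len(left)} samples left "
            f"and {len(right)} right"
        )
    return left, right


values = [1.0, 2.0, 3.0]

# A threshold inside the data range gives two non-empty sides.
print(check_split(values, 1.5))

# A threshold above every value leaves the right side empty and is rejected.
try:
    check_split(values, max(values) + 1.0)
except ValueError as err:
    print(f"rejected: {err}")
```

The second call mirrors the kind of unsatisfiable threshold (the feature's maximum plus 1.0) used in the forced-split example in this comment.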

Here's a minimal example using lightgbm==4.3.0 showing how to trigger those checks, by providing a "forced" split that is impossible to satisfy.

import lightgbm as lgb
import numpy as np
import json
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=10_000, n_features=1, centers=3)

with open("forced_splits.json", "w") as f:
    f.write(json.dumps(
        {"feature": 0, "threshold": np.max(X) + 1.0}
    ))

# train the model, applying the forced split from forced_splits.json
bst = lgb.train(
    params={
        "forcedsplits_filename": "forced_splits.json",
        "objective": "multiclass",
        "min_gain_to_split": 0.0,
        "min_data_in_leaf": 0,
        "num_classes": 3,
        "num_iterations": 10,
        "num_leaves": 31,
        "verbose": 1
    },
    train_set=lgb.Dataset(X, label=y, params={"min_data_in_bin": 1})
)

Running that fails with the following error.

lightgbm.basic.LightGBMError: Check failed: (best_split_info.right_count) > (0) at /Users/jlamb/repos/LightGBM/lightgbm-python/src/treelearner/serial_tree_learner.cpp, line 856 .

Ok, so given that I just trained a LightGBM model on dataset X, every leaf in every tree will match at least one sample in X?

No.

LightGBM supports "training continuation", where you start from an already-created model and then perform additional boosting rounds to add trees to that model.

It's not necessary that the training dataset used to produce that initial model is the same as the one used to generate those new trees. If the distribution of features is very different in those two datasets, it's possible for earlier trees (from the pre-trained model) to have leaves that match 0 samples in the newly-provided dataset.

For more, see https://stackoverflow.com/questions/73664093/lightgbm-train-vs-update-vs-refit/73669068#73669068.

So you CANNOT assume that every leaf in every tree in a LightGBM model m will match at least one record in training dataset X, unless every tree in the model was created from X.

A LightGBM model can be created from multiple, heterogeneous datasets, so you cannot assume that for a given model there is such a thing as "the" single training dataset.
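
If you want to check this for a particular model and dataset rather than assume it, one option is to count how many rows land in each leaf. The sketch below assumes you already have one row of leaf indices per sample (one index per tree) and the number of leaves in each tree. The helper name empty_leaves and the toy inputs are hypothetical, and leaf ids are assumed to run from 0 to num_leaves - 1.

```python
from collections import Counter


def empty_leaves(leaf_index_rows, num_leaves_per_tree):
    """Return {tree_index: [leaf ids that no row landed in]}.

    leaf_index_rows: one row per sample, holding one leaf id per tree.
    num_leaves_per_tree: number of leaves in each tree, in tree order.
    """
    result = {}
    for tree_idx, n_leaves in enumerate(num_leaves_per_tree):
        # Count how many samples fall into each leaf of this tree.
        counts = Counter(row[tree_idx] for row in leaf_index_rows)
        missing = [leaf for leaf in range(n_leaves) if counts[leaf] == 0]
        if missing:
            result[tree_idx] = missing
    return result


# Toy data: 4 samples, 2 trees with 3 leaves each.
# Every sample falls into leaf 2 of the second tree.
leaf_index_rows = [
    [0, 2],
    [1, 2],
    [0, 2],
    [2, 2],
]
print(empty_leaves(leaf_index_rows, [3, 3]))  # {1: [0, 1]}
```

With LightGBM itself, the leaf-index rows would presumably come from Booster.predict(X, pred_leaf=True) and the per-tree leaf counts from the tree_info entries of Booster.dump_model(). For a model built through training continuation, a non-empty result for the earlier trees is what you'd expect to see when X is only the newer dataset.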

Wait, you can set min_child_samples = 0?

Yes. In shap/shap#3574, you mentioned the parameter min_child_samples several times. That is an alias for min_data_in_leaf, which must be >= 0 (see the parameter docs).

Setting it to 0 is supported as a way to say "use other mechanisms to prevent overfitting instead", like:

  • min_gain_to_split = minimum total change in the loss function that a split must provide to be added to the tree
  • max_depth, num_leaves = limits on the depth and number of leaves of each tree, regardless of how many samples fall in them
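
As an illustration of the list above, a parameter set along these lines turns off the per-leaf minimum and leans on the other controls instead. The objective and every numeric value here are arbitrary placeholders chosen for the example, not recommendations.

```python
# Hypothetical values for illustration only; tune for your own data.
params = {
    "objective": "regression",  # assumed task, not taken from the discussion
    "min_data_in_leaf": 0,      # alias: min_child_samples; no per-leaf minimum
    "min_gain_to_split": 0.1,   # minimum gain a split must provide to be added
    "max_depth": 6,             # cap on the depth of each tree
    "num_leaves": 31,           # cap on the number of leaves in each tree
}

print(params)
```

A dictionary like this could then be passed as the params argument of lgb.train(), the same way as in the forced-split example earlier in this comment.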

NegatedObjectIdentity commented on May 25, 2024

@jameslamb Thank you very much! Really appreciating your help!
