crflynn / skgrf

scikit-learn compatible Python bindings for grf (generalized random forests) C++ random forest library

Home Page: https://skgrf.readthedocs.io/en/stable/

License: GNU General Public License v3.0

Makefile 0.36% Python 86.07% Dockerfile 0.30% Cython 13.27%

machine-learning random-forest scikit-learn generalized-random-forest

skgrf's Introduction

skgrf

skgrf provides scikit-learn compatible Python bindings to the C++ random forest implementation, grf, using Cython.

The latest release of skgrf uses version 2.1.0 of grf.

skgrf is still in development. Please create issues for any discrepancies or errors. PRs welcome.

Documentation

Installation

skgrf is available on PyPI and can be installed via pip:

pip install skgrf

Estimators

GRFForestCausalRegressor
GRFForestInstrumentalRegressor
GRFForestLocalLinearRegressor
GRFForestQuantileRegressor
GRFForestRegressor
GRFBoostedForestRegressor
GRFForestSurvival

Usage

GRFForestRegressor

The GRFForestRegressor predictor uses grf's RegressionPredictionStrategy class.

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from skgrf.ensemble import GRFForestRegressor

X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

forest = GRFForestRegressor()
forest.fit(X_train, y_train)

predictions = forest.predict(X_test)
print(predictions)
# [31.81349144 32.2734354  16.51560285 11.90284392 39.69744341 21.30367911
#  19.52732937 15.82126562 26.49528961 11.27220097 16.02447197 20.01224404
#  ...
#  20.70674263 17.09041289 12.89671205 20.79787926 21.18317924 25.45553279
#  20.82455595]

GRFForestQuantileRegressor

The GRFForestQuantileRegressor predictor uses grf's QuantilePredictionStrategy class.

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from skgrf.ensemble import GRFForestQuantileRegressor

X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

forest = GRFForestQuantileRegressor(quantiles=[0.1, 0.9])
forest.fit(X_train, y_train)

predictions = forest.predict(X_test)
print(predictions)
# [[21.9 50. ]
#  [ 8.5 24.5]
#  ...
#  [ 8.4 18.6]
#  [ 8.1 20. ]]

License

skgrf is licensed under GPLv3.

Development

To develop locally, it is recommended to have asdf, make, and a C++ compiler already installed. After cloning, run make setup. This will set up the grf submodule, install Python and poetry from .tool-versions, install dependencies using poetry, copy the grf source code into skgrf, and then build and install skgrf in the local virtualenv.

To format code, run make fmt. This will run isort and black against the .py files.

To run tests and inspect coverage, run make test or make xtest for testing in parallel.

To rebuild in place after making changes, run make build.

To create python package artifacts, run make dist.

To build and view documentation, run make docs.

skgrf's Issues

How should tuning be implemented?

GRF includes tuning facilities for many of the estimators. In particular, the following estimators have tuning parameter options:

Regression forest
Causal forest
Instrumental forest
Local linear forest
Boosted forest
Causal survival forest

In addition, some forests use tuning implicitly, and/or pass tuning parameters down into internal forests.

Causal forest performs tuning but also passes tune params down into the orthogonalization forests (regression and boosted) in which tuning is performed separately.
Instrumental forest performs tuning but also passes tune params down into the orthogonalization regression forest, in which tuning is performed separately.
Boosted forest uses tune params on the initial forest, but not on the boosted ones.

Scikit-learn also provides facilities for hyperparameter tuning under the model_selection module. This raises the question: when and where in skgrf should tuning be implemented, if at all? The options are:

Make skgrf a true port of R-grf. This means implementing tuning exactly as it exists in the R lib, ignoring sklearn model selection, and hardcoding tuning in the same way.
Ignore R-grf's tuning entirely, allowing users to utilize the model_selection module. This means, however, that the implementations for Causal, Instrumental, and Boosted forests would differ from what exists in R.
Selectively implement R-grf's tuning, in order to maintain parity with R-grf's implicit tuning. This is the current implementation.
Refactor some of the estimators to allow more fine-grained control of tuning separate components, removing tuning from skgrf and allowing users to tune with model_selection objects.
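
For illustration, the second option (leaving tuning to model_selection) might look roughly like the sketch below. It uses scikit-learn's own RandomForestRegressor as a stand-in so that it runs without skgrf installed; swapping in a skgrf estimator is an assumption, and the parameter names in the grid are scikit-learn's, not necessarily skgrf's.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data, used only to keep the sketch self-contained.
X, y = make_regression(n_samples=200, n_features=5, noise=1.0, random_state=0)

# Stand-in estimator: a skgrf forest could presumably take its place,
# but the grid keys would then have to match that estimator's parameters.
param_grid = {"n_estimators": [20, 50], "min_samples_leaf": [1, 5]}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
```

This kind of external search only tunes the outer estimator. It would not reproduce the separate tuning that R-grf performs inside the orthogonalization forests, which is the gap the fourth option tries to close.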

Installation error: metadata preparation error (Google Colab environment)

Hi, I am trying to install skgrf in the Google Colab environment, but the following error occurs:

Collecting skgrf
Using cached skgrf-0.3.0.tar.gz (1.8 MB)
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
Collecting scikit-learn<1.0,>=0.23.0 (from skgrf)
Using cached scikit-learn-0.24.2.tar.gz (7.5 MB)
Installing build dependencies ... done
Getting requirements to build wheel ... done
error: subprocess-exited-with-error

× Preparing metadata (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
Preparing metadata (pyproject.toml) ... error
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

Increase default n_estimators in causal forest

per: #28 (comment)

You might also want to increase the default number of trees in causal forest, 100 will typically not give great kernel weights. If you are using 100 as the default because of sklearn, I can imagine it's that low because of speed (which isn't much of an issue since grf is faster)

AttributeError: module 'skgrf.grf' has no attribute 'regression_train'

I get this error when trying to use GRFForestRegressor to fit. If I try a local linear regressor, I get the same error but for 'll_regression_train'. I've tried checking that all packages are the right version and followed the guidelines from all the examples I can find, but it doesn't seem to work.
I am working in a cloud computing environment on Microsoft Azure databricks.

Fix wheels badge

Issue with numpy<1.20

First, I wanted to say thank you so much for building this! I'd been using grf with rpy for a while, but am so happy to be away from the R dependency!

Second, I'm having an issue with the py38 wheel with numpy<1.20. While I can install skgrf, when I try to use it numpy complains about ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject. Upgrading to numpy >= 1.20 fixes the problem. Unfortunately I also need to install tensorflow, which has a strong dependency on numpy<1.20 (thanks google).

My assumption is that this is caused by skgrf being compiled with numpy>=1.20 (I've been able to use the py36 wheels successfully, since those would have been built on numpy<1.20). I don't know how hard it is to build/maintain wheels that use numpy<1.20 or if you have other suggestions for how best to proceed.

I have tried installing from source and building from main, but pip was having trouble with both, and it's beyond my abilities to debug exactly what was wrong.

So I'm wondering if you have thoughts on the best way around this.

Steps to reproduce:

Fresh env with py3.8
pip install numpy==1.19.5 skgrf
python; from skgrf import grf – ValueError
pip uninstall -y numpy
pip install numpy==1.20
python; from skgrf import grf – Succeeds
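
A small guard along these lines could surface the mismatch before the import fails. This is only a sketch: the 1.20 threshold is taken from the report above rather than from any documented skgrf requirement, and the helper names are made up for illustration.

```python
from importlib import metadata


def parse_version(version):
    """Turn '1.19.5' into (1, 19, 5), ignoring suffixes such as 'rc1'."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum="1.20"):
    """Return True if the installed version is at least the minimum."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '1.20' and '1.20.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def installed_numpy_version():
    """Return the installed numpy version string, or None if numpy is absent."""
    try:
        return metadata.version("numpy")
    except metadata.PackageNotFoundError:
        return None


print(meets_minimum("1.19.5"))  # the failing setup from the steps above
print(meets_minimum("1.20"))    # the working setup from the steps above
```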

Move cpp extern refs into build sources

Allow missing values in X?

All grf forests (except local linear) support splitting with missing X values (IEEE NaNs) https://grf-labs.github.io/grf/REFERENCE.html#missing-values.

Ideally it should only require a light change in the wrappers when doing fit/predict: calling into sklearn.utils.check_X_y/sklearn.utils.check_array instead of _validate_input and passing force_all_finite='allow-nan' for X (though we have to be sure other wrapper logic still works). I can send a PR later if there is interest.
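
A minimal sketch of the validation step, independent of skgrf's wrappers. The helper name is hypothetical. Newer scikit-learn releases rename the keyword to ensure_all_finite, so the sketch tries that name first and falls back to force_all_finite on older versions.

```python
import numpy as np
from sklearn.utils import check_array


def validate_allowing_nan(X):
    """Validate X as a 2D array, accepting NaN but still rejecting inf."""
    try:
        # Keyword name used by newer scikit-learn releases.
        return check_array(X, ensure_all_finite="allow-nan")
    except TypeError:
        # Older releases only know the keyword named in this issue.
        return check_array(X, force_all_finite="allow-nan")


X = np.array([[1.0, np.nan], [3.0, 4.0]])
out = validate_allowing_nan(X)
print(out)
```

With the default setting, check_array would raise on the NaN entry. Infinite values still raise a ValueError under 'allow-nan'.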

explainer.shap_values crashes kernel

Hi,

Amazing package! I was wondering if you have any insight into why calling explainer.shap_values consistently crashes my python kernel. I don't think it's a memory issue.

Thanks for your time; really appreciate it.

-Bernie

add check_estimator tests

https://scikit-learn.org/stable/modules/generated/sklearn.utils.estimator_checks.check_estimator.html

AttributeError: 'GRFForestRegressor' object has no attribute '_validate_data'

Trying to reproduce the example from the README:

from skgrf.ensemble import GRFForestCausalRegressor

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from skgrf.ensemble import GRFForestRegressor

X, y = load_boston(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

forest = GRFForestRegressor()
forest.fit(X_train, y_train)

predictions = forest.predict(X_test)
print(predictions)

This gives the error

---------------------------------------------------------------
AttributeError                Traceback (most recent call last)
<ipython-input-8-897d3385e0c1> in <module>
      7 
      8 forest = GRFForestRegressor()
----> 9 forest.fit(X_train, y_train)
     10 
     11 predictions = forest.predict(X_test)

~/anaconda3/lib/python3.7/site-packages/skgrf/ensemble/regressor.py in fit(self, X, y, sample_weight, cluster, compute_oob_predictions)
    116         :param array1d cluster: optional cluster assignments for input samples
    117         """
--> 118         X, y = self._validate_data(X, y)
    119         self._check_num_samples(X)
    120         self._check_n_features(X, reset=True)

AttributeError: 'GRFForestRegressor' object has no attribute '_validate_data'
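
One way to narrow this down is to check whether the installed scikit-learn provides the private method at all. The sketch below is only a diagnostic. It assumes the error comes from a scikit-learn release whose BaseEstimator lacks _validate_data, which the traceback does not confirm.

```python
import sklearn
from sklearn.base import BaseEstimator

# _validate_data is a private helper that only some scikit-learn
# releases define on BaseEstimator; skgrf's fit() calls it here.
has_validate_data = hasattr(BaseEstimator, "_validate_data")

print("scikit-learn version:", sklearn.__version__)
print("BaseEstimator._validate_data available:", has_validate_data)
```

If the second line prints False, the installed scikit-learn is probably outside the range this skgrf release was built against.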

setup wheel build repo

https://github.com/crflynn/skgrf-wheels following https://github.com/crflynn/skranger-wheels

Add impurity attribute to Tree

Installation error: building wheel fails

Hi, thank you so much for developing this package. Unfortunately, I am running into a bit of a problem with the installation via 'pip install skgrf'. I have tried it in multiple environments with various python versions (all >= 3.7.10) on two different systems, and I always received more or less the same error message.
I would really appreciate any advice on how to overcome this problem.

The full error message (with Python version 3.8.13) is:

Collecting skgrf
  Using cached skgrf-0.3.0.tar.gz (1.8 MB)
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting scikit-learn<1.0,>=0.23.0
  Using cached scikit_learn-0.24.2-cp38-cp38-win_amd64.whl (6.9 MB)
Requirement already satisfied: scipy>=0.19.1 in c:\user\anaconda3\envs\test_pkgs\lib\site-packages (from scikit-learn<1.0,>=0.23.0->skgrf) (1.9.3)
Requirement already satisfied: joblib>=0.11 in c:\user\anaconda3\envs\test_pkgs\lib\site-packages (from scikit-learn<1.0,>=0.23.0->skgrf) (1.2.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in c:\user\anaconda3\envs\test_pkgs\lib\site-packages (from scikit-learn<1.0,>=0.23.0->skgrf) (3.1.0)
Requirement already satisfied: numpy>=1.13.3 in c:\user\anaconda3\envs\test_pkgs\lib\site-packages (from scikit-learn<1.0,>=0.23.0->skgrf) (1.23.4)
Building wheels for collected packages: skgrf
  Building wheel for skgrf (pyproject.toml): started
  Building wheel for skgrf (pyproject.toml): finished with status 'error'
Failed to build skgrf
Note: you may need to restart the kernel to use updated packages.
  error: subprocess-exited-with-error
  
  Building wheel for skgrf (pyproject.toml) did not run successfully.
  exit code: 1
  
  [25 lines of output]
  A setup.py file already exists. Using it.
  Traceback (most recent call last):
    File "C:\User\AppData\Local\Temp\pip-install-g5fwsbx8\skgrf_aa7c8fee85a9467799deccfdcd073228\setup.py", line 2, in <module>
      from setuptools import setup
  ModuleNotFoundError: No module named 'setuptools'
  Traceback (most recent call last):
    File "C:\User\Anaconda3\envs\test_pkgs\lib\site-packages\pip\_vendor\pep517\in_process\_in_process.py", line 363, in <module>
      main()
    File "C:\User\Anaconda3\envs\test_pkgs\lib\site-packages\pip\_vendor\pep517\in_process\_in_process.py", line 345, in main
      json_out['return_val'] = hook(**hook_input['kwargs'])
    File "C:\User\Anaconda3\envs\test_pkgs\lib\site-packages\pip\_vendor\pep517\in_process\_in_process.py", line 261, in build_wheel
      return _build_backend().build_wheel(wheel_directory, config_settings,
    File "C:\User\AppData\Local\Temp\pip-build-env-uyym_whd\overlay\Lib\site-packages\poetry\core\masonry\api.py", line 67, in build_wheel
      return WheelBuilder.make_in(poetry, Path(wheel_directory))
    File "C:\User\AppData\Local\Temp\pip-build-env-uyym_whd\overlay\Lib\site-packages\poetry\core\masonry\builders\wheel.py", line 80, in make_in
      wb.build(target_dir=directory)
    File "C:\User\AppData\Local\Temp\pip-build-env-uyym_whd\overlay\Lib\site-packages\poetry\core\masonry\builders\wheel.py", line 114, in build
      self._build(zip_file)
    File "C:\User\AppData\Local\Temp\pip-build-env-uyym_whd\overlay\Lib\site-packages\poetry\core\masonry\builders\wheel.py", line 168, in _build
      self._run_build_command(setup)
    File "C:\User\AppData\Local\Temp\pip-build-env-uyym_whd\overlay\Lib\site-packages\poetry\core\masonry\builders\wheel.py", line 206, in _run_build_command
      subprocess.check_call(
    File "C:\User\Anaconda3\envs\test_pkgs\lib\subprocess.py", line 364, in check_call
      raise CalledProcessError(retcode, cmd)
  subprocess.CalledProcessError: Command '['C:/User/Anaconda3/envs/test_pkgs/python.exe', 'C:\\User\\AppData\\Local\\Temp\\pip-install-g5fwsbx8\\skgrf_aa7c8fee85a9467799deccfdcd073228\\setup.py', 'build', '-b', 'C:\\User\\AppData\\Local\\Temp\\pip-install-g5fwsbx8\\skgrf_aa7c8fee85a9467799deccfdcd073228\\build']' returned non-zero exit status 1.
  [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for skgrf
ERROR: Could not build wheels for skgrf, which is required to install pyproject.toml-based projects
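
The traceback points at a missing setuptools in the interpreter that runs setup.py. A quick way to check whether a given environment can import it is sketched below. The helper name is made up, and because pip may run the build in an isolated environment, a positive result here does not guarantee that the build will see the module.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


print("setuptools importable:", module_available("setuptools"))
```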

Forest Classifier

Probability forest implemented but not yet released in grf

ATE+Clustering

Hey! Two questions:

Is there an analogous function to average_treatment_effect() in R?
Does GRFForestCausalRegressor allow clustering?

Thanks!
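
For reference, a doubly robust average treatment effect can be computed by hand from a fitted forest's outputs. The sketch below is not part of skgrf: the function name is hypothetical, it assumes a binary treatment, it ignores clustering, and it expects the caller to supply per-sample effect estimates (tau_hat) together with the outcome and treatment estimates (y_hat, w_hat) used for orthogonalization.

```python
import numpy as np


def doubly_robust_ate(y, w, tau_hat, y_hat, w_hat):
    """Average of augmented inverse-propensity scores and its standard error."""
    y, w, tau_hat, y_hat, w_hat = (
        np.asarray(a, dtype=float) for a in (y, w, tau_hat, y_hat, w_hat)
    )
    # Weight that corrects for unequal treatment probabilities.
    debiasing_weights = (w - w_hat) / (w_hat * (1.0 - w_hat))
    # Residual of the outcome after accounting for the estimated effect.
    residual = y - (y_hat + tau_hat * (w - w_hat))
    scores = tau_hat + debiasing_weights * residual
    estimate = scores.mean()
    std_err = scores.std(ddof=1) / np.sqrt(len(scores))
    return estimate, std_err


# Synthetic check: a constant effect of 2 with perfectly fitted nuisances.
rng = np.random.default_rng(0)
n = 500
w = rng.integers(0, 2, size=n).astype(float)
w_hat = np.full(n, 0.5)
y_hat = rng.normal(size=n)
tau_hat = np.full(n, 2.0)
y = y_hat + tau_hat * (w - w_hat)

ate, se = doubly_robust_ate(y, w, tau_hat, y_hat, w_hat)
print(ate, se)
```

Treatment propensities close to 0 or 1 make the weights blow up, so in practice the estimate is only as reliable as the overlap in the data.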

dockerfile and tests in repo

Add performance benchmarks

SHAP on Forest Causal Regressor

Hi,
thanks for this very useful package.
I would like to compute SHAP values of a causal forest model.
I ran code similar to the example given in https://skgrf.readthedocs.io/en/latest/tree/tree_interface.html#shap, except that my model is:

cf = GRFForestCausalRegressor(enable_tree_details = True)
cf.fit(X, Y, W, Y_hat_nomissing, W_hat_nomissing)

This works fine. However, SHAP considers as predictions of my model the predicted values of Y, whereas I would like to consider as predictions the conditional effects of my treatment variable W on the target Y, which are typically the values given by cf.predict().
I wonder if there exists a way of achieving my desired outcome?