
shapr's Introduction

shapr

[Badges: CRAN status, CRAN downloads, R build status, lifecycle: experimental, license: MIT, DOI]

Brief NEWS

Breaking change (June 2023)

As of version 0.2.3.9000, the development version of shapr (the master branch on GitHub from June 2023) has been substantially restructured, introducing a new syntax for explaining models and thereby a range of breaking changes. Essentially, a single function (explain()) now replaces the two functions used previously (shapr() and explain()). The CRAN version of shapr (v0.2.2) still uses the old syntax. See the NEWS for details. The examples below use the new syntax. A version of this README with the syntax of the CRAN version (v0.2.2) is available here.

Python wrapper

As of version 0.2.3.9100 (master branch on GitHub from June 2023), we provide a Python wrapper (shaprpy) which allows explaining Python models with the methodology implemented in shapr, directly from Python. The wrapper is available here. See also the details in the NEWS.

Introduction

The most common machine learning task is to train a model which is able to predict an unknown outcome (response variable) based on a set of known input variables/features. When using such models for real-life applications, it is often crucial to understand why a certain set of features leads to exactly that prediction. However, explaining predictions from complex, or seemingly simple, machine learning models is a practical and ethical question, as well as a legal issue. Can I trust the model? Is it biased? Can I explain it to others? We want to explain individual predictions from a complex machine learning model by learning simple, interpretable explanations.

Shapley values constitute the only prediction explanation framework with a solid theoretical foundation (Lundberg and Lee (2017)). Unless the true distribution of the features is known and there are fewer than, say, 10-15 features, these Shapley values need to be estimated/approximated. Popular methods like Shapley Sampling Values (Štrumbelj and Kononenko (2014)), SHAP/Kernel SHAP (Lundberg and Lee (2017)), and to some extent TreeSHAP (Lundberg, Erion, and Lee (2018)), assume that the features are independent when approximating the Shapley values for prediction explanation. This may lead to very inaccurate Shapley values, and consequently wrong interpretations of the predictions. Aas, Jullum, and Løland (2021) extend and improve the Kernel SHAP method of Lundberg and Lee (2017) to account for the dependence between the features, resulting in significantly more accurate approximations to the Shapley values. See the paper for details.

This package implements the methodology of Aas, Jullum, and Løland (2021).

The following methodology/features are currently implemented:

  • Native support for explaining predictions from models fitted with the following functions: stats::glm, stats::lm, ranger::ranger, xgboost::xgboost/xgboost::xgb.train and mgcv::gam.
  • Accounting for feature dependence
    • assuming the features are Gaussian (approach = 'gaussian', Aas, Jullum, and Løland (2021))
    • with a Gaussian copula (approach = 'copula', Aas, Jullum, and Løland (2021))
    • using the Mahalanobis distance based empirical (conditional) distribution approach (approach = 'empirical', Aas, Jullum, and Løland (2021))
    • using conditional inference trees (approach = 'ctree', Redelmeier, Jullum, and Aas (2020)).
    • using the endpoint match method for time series (approach = 'timeseries', Jullum, Redelmeier, and Aas (2021))
    • using the joint distribution approach for models with purely categorical data (approach = 'categorical', Redelmeier, Jullum, and Aas (2020))
    • assuming all features are independent (approach = 'independence', mainly for benchmarking)
  • Combining any of the above methods.
  • Explain forecasts from time series models at different horizons with explain_forecast() (R only)
  • Batch computation to reduce memory consumption significantly
  • Parallelized computation using the future framework (R only; see the sketch below).
  • Progress bar showing computation progress, using the progressr package; must be activated by the user (see the sketch below).
  • Optional use of the AICc criterion of Hurvich, Simonoff, and Tsai (1998) when optimizing the bandwidth parameter in the empirical (conditional) approach of Aas, Jullum, and Løland (2021).
  • Functionality for visualizing the explanations. (R only)
  • Support for models not supported natively.
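
For the parallelization and progress bar items above, a minimal sketch of the activation code (standard future/progressr usage, assuming the development-version explain() syntax):

library(future)
library(progressr)

# Run the Shapley value computations in two parallel R sessions
plan(multisession, workers = 2)

# Activate progress reporting for subsequent explain() calls
handlers(global = TRUE)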

Note that the prediction outcome must be numeric. All approaches except approach = 'categorical' work for numeric features, but unless the models are very Gaussian-like, we recommend approach = 'ctree' or approach = 'empirical', especially if there are discretely distributed features. When the models contain both numeric and categorical features, we recommend approach = 'ctree'. For models with a smaller number of categorical features (without many levels) and a decent training set, we recommend approach = 'categorical'. For (binary) classification based on time series models, we suggest using approach = 'timeseries'. To explain forecasts of time series models (at different horizons), we recommend using explain_forecast() instead of explain(). The former has a more suitable input syntax for explaining those kinds of forecasts. See the vignette for details and further examples.

Unlike SHAP and TreeSHAP, we decompose probability predictions directly, i.e. not via log-odds transformations, to ease interpretability.

Installation

To install the current stable release from CRAN (note, using the old explanation syntax), use

install.packages("shapr")

To install the current development version (with the new explanation syntax), use

remotes::install_github("NorskRegnesentral/shapr")

If you would like to install all packages of the models we currently support, use

remotes::install_github("NorskRegnesentral/shapr", dependencies = TRUE)

If you would also like to build and view the vignette locally, use

remotes::install_github("NorskRegnesentral/shapr", dependencies = TRUE, build_vignettes = TRUE)
vignette("understanding_shapr", "shapr")

You can always check out the latest version of the vignette here.

Example

shapr supports computation of Shapley values with any predictive model which takes a set of numeric features and produces a numeric outcome.

The following example shows how a simple xgboost model is trained using the airquality dataset, and how shapr explains the individual predictions.

library(xgboost)
library(shapr)

data("airquality")
data <- data.table::as.data.table(airquality)
data <- data[complete.cases(data), ]

x_var <- c("Solar.R", "Wind", "Temp", "Month")
y_var <- "Ozone"

ind_x_explain <- 1:6
x_train <- data[-ind_x_explain, ..x_var]
y_train <- data[-ind_x_explain, get(y_var)]
x_explain <- data[ind_x_explain, ..x_var]

# Looking at the dependence between the features
cor(x_train)
#>            Solar.R       Wind       Temp      Month
#> Solar.R  1.0000000 -0.1243826  0.3333554 -0.0710397
#> Wind    -0.1243826  1.0000000 -0.5152133 -0.2013740
#> Temp     0.3333554 -0.5152133  1.0000000  0.3400084
#> Month   -0.0710397 -0.2013740  0.3400084  1.0000000

# Fitting a basic xgboost model to the training data
model <- xgboost(
  data = as.matrix(x_train),
  label = y_train,
  nrounds = 20,
  verbose = FALSE
)

# Specifying the phi_0, i.e. the expected prediction without any features
p0 <- mean(y_train)

# Computing the actual Shapley values with kernelSHAP accounting for feature dependence using
# the empirical (conditional) distribution approach with bandwidth parameter sigma = 0.1 (default)
explanation <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "empirical",
  prediction_zero = p0
)
#> Note: Feature classes extracted from the model contains NA.
#> Assuming feature classes from the data are correct.
#> Setting parameter 'n_batches' to 2 as a fair trade-off between memory consumption and computation time.
#> Reducing 'n_batches' typically reduces the computation time at the cost of increased memory consumption.

# Printing the Shapley values for the test data.
# For more information about the interpretation of the values in the table, see ?shapr::explain.
print(explanation$shapley_values)
#>        none    Solar.R      Wind      Temp      Month
#> 1: 43.08571 13.2117337  4.785645 -25.57222  -5.599230
#> 2: 43.08571 -9.9727747  5.830694 -11.03873  -7.829954
#> 3: 43.08571 -2.2916185 -7.053393 -10.15035  -4.452481
#> 4: 43.08571  3.3254595 -3.240879 -10.22492  -6.663488
#> 5: 43.08571  4.3039571 -2.627764 -14.15166 -12.266855
#> 6: 43.08571  0.4786417 -5.248686 -12.55344  -6.645738

# Finally we plot the resulting explanations
plot(explanation)
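
As recommended above for models with mixed numeric and categorical features, the ctree approach uses the same syntax; a minimal sketch reusing the objects from the example (all features here are numeric, so this is purely illustrative):

explanation_ctree <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "ctree",
  prediction_zero = p0
)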

See the vignette for further examples.

Contribution

All feedback and suggestions are very welcome. Details on how to contribute can be found here. If you have any questions or comments, feel free to open an issue here.

Please note that the ‘shapr’ project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

References

Aas, Kjersti, Martin Jullum, and Anders Løland. 2021. “Explaining Individual Predictions When Features Are Dependent: More Accurate Approximations to Shapley Values.” Artificial Intelligence 298.

Hurvich, Clifford M, Jeffrey S Simonoff, and Chih-Ling Tsai. 1998. “Smoothing Parameter Selection in Nonparametric Regression Using an Improved Akaike Information Criterion.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 60 (2): 271–93.

Jullum, Martin, Annabelle Redelmeier, and Kjersti Aas. 2021. “Efficient and Simple Prediction Explanations with groupShapley: A Practical Perspective.” In Proceedings of the 2nd Italian Workshop on Explainable Artificial Intelligence, 28–43. CEUR Workshop Proceedings.

Lundberg, Scott M, Gabriel G Erion, and Su-In Lee. 2018. “Consistent Individualized Feature Attribution for Tree Ensembles.” arXiv Preprint arXiv:1802.03888.

Lundberg, Scott M, and Su-In Lee. 2017. “A Unified Approach to Interpreting Model Predictions.” In Advances in Neural Information Processing Systems, 4765–74.

Redelmeier, Annabelle, Martin Jullum, and Kjersti Aas. 2020. “Explaining Predictive Models with Mixed Features Using Shapley Values and Conditional Inference Trees.” In International Cross-Domain Conference for Machine Learning and Knowledge Extraction, 117–37. Springer.

Štrumbelj, Erik, and Igor Kononenko. 2014. “Explaining Prediction Models and Individual Predictions with Feature Contributions.” Knowledge and Information Systems 41 (3): 647–65.

shapr's People

Contributors

andersloland, aredelmeier, camiling, jenswahl, lhbo, martinju, nikolase90, rawanmahdi


shapr's Issues

First release and JOSS

This is a list of the things we must do before releasing a version to CRAN and submitting a paper to JOSS

Variable in scope

There seems to be a bug regarding our use of data.table and global variables.

This chunk of code throws an error about the m variable not being defined when used in w_shapley, although it seems to be passed as it should be.

l.1 <- prepare_kernelShap(
    m = 3,
    Xtrain = as.data.frame(matrix(rnorm(30), ncol = 3)),
    Xtest = as.data.frame(matrix(rnorm(30), ncol = 3)),
    exact = FALSE,
    nrows = 10,
    scale = FALSE)

Defining m as a global variable fixes the issue:

m <- 3  # define m as a global variable
l.2 <- prepare_kernelShap(
    m = 3,
    Xtrain = as.data.frame(matrix(rnorm(30), ncol = 3)),
    Xtest = as.data.frame(matrix(rnorm(30), ncol = 3)),
    exact = FALSE,
    nrows = 10,
    scale = FALSE)

Warning: Setting m equal to something else gives an incorrect answer even if it is not passed to the function, so this is actually quite dangerous.

The w_shapley function is called within a data.table call, and apparently it does not look for m as defined within the scope of the function where it is present, but rather in the global scope... I can't see what we are doing wrong here. Am I missing something obvious?

Please take a look when you are back from vacation, @nikolase90.
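
For reference, a minimal sketch with hypothetical functions that reproduces this kind of error. Under R's lexical scoping, a free variable in w() is looked up in w()'s enclosing environment (here the global environment), not in the frame of the function making the data.table call:

library(data.table)

w <- function(x) x * m              # 'm' is a free variable in w()
f <- function(dt) {
  m <- 3                            # local 'm' is NOT visible to w()
  dt[, w(x)]
}
f(data.table(x = 1:3))              # Error: object 'm' not found
m <- 10
f(data.table(x = 1:3))              # runs, but silently uses the global m = 10

# The safe fix: pass m explicitly as an argument
w2 <- function(x, m) x * m
f2 <- function(dt) {
  m <- 3
  dt[, w2(x, m)]                    # 'm' is found in f2's frame
}
f2(data.table(x = 1:3))             # 3 6 9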

Improve documentation

The following needs to be done:

  • All arguments in functions should be properly documented.
  • All functions should have good examples (OK, not all, but as many as possible).
  • There should be zero warnings when running devtools::check_man()

Add contribution paragraph in README

Should include the following:

  • How we use testthat and covr
  • Style of code and linting (i.e. using styler & lintr)
  • Documentation of functions using roxygen2
  • When to update NEWS.md

Switch to solely using data.table internally

After fixing #120, I suggest we switch to using data.table throughout the package, only converting to matrix when calling cpp functions. The combination of data.frame, data.table and matrix objects of various dimensions complicates maintenance. See also #123
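
A sketch of the intended pattern (the C++ function name below is a hypothetical placeholder):

library(data.table)

dt <- data.table(x1 = rnorm(5), x2 = rnorm(5))
dt[, x3 := x1 + x2]   # all data manipulation stays in data.table

# Convert only at the C++ boundary:
# res <- rcpp_distance_matrix(as.matrix(dt))   # hypothetical cpp function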

Tests for functions in R/sampling.R

Steps:

  • Rename file tests/testthat/test-sample_combinations.R to tests/testthat/test-sampling.R
  • Add tests for the following two functions:
    • sample_gaussian() (located in R/sampling.R)
    • sample_copula() (located in R/sampling.R)

Note that it is not necessary to create tests for sample_combinations since they are already written.
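
A skeleton for such a test file (the actual calls and assertions must be filled in from the signatures in R/sampling.R):

# tests/testthat/test-sampling.R (skeleton)
library(testthat)

test_that("sample_gaussian returns output of the expected dimensions", {
  # out <- sample_gaussian(...)   # fill in arguments from R/sampling.R
  # expect_equal(nrow(out), n_samples)
  skip("not yet implemented")
})

test_that("sample_copula returns output of the expected dimensions", {
  # out <- sample_copula(...)
  # expect_equal(nrow(out), n_samples)
  skip("not yet implemented")
})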

Add tests using testthat

If you're working on one of these files, please add the url to your branch or the pull request. Mark the box if the changes are merged with master.

R-files

  • clustering.R
  • explanation.R #128
  • features.R #70
  • observations.R #95
  • plot.R #86
  • predictions.R #109
  • sampling.R #92
  • shapley.R #128
  • transformation.R #77
  • utils.R Currently there are zero functions in R/utils.R.
  • models.R #129

src-files

  • AICc.cpp
  • distance.cpp
  • impute_data.cpp #78
  • weighted_matrix.cpp #73

The following files should not be tested

  • R/shapr-package.R
  • R/zzz.R

[review-expectopatronum] References

Hi @expectopatronum
Just opening an issue to handle the reference comments from openjournals/joss-reviews#2027 (comment)

  • 1. Lundberg, S. (2019): I am not sure if this is correct/necessary. On their Github page they state which references should be used, this one doesn't seem to be there. (in case this actually should be there, I think SHAP should be capitalized)
  • 2. Maybe this also needs to be cited (very recent): https://raw.githubusercontent.com/slundberg/shap/master/docs/references/tree_explainer.bib
  • 3. Pedersen & Benesty: Title should be "lime: Local Interpretable Model-Agnostic Explanations"
  • 4. Furthermore, capitalization of the titles and conference names is not consistent (some are capitalized, others are all lower case).

Simplify user modification of the empirical approach

Currently the user must supply the kernelshap function with a list specifying all details of the empirical conditional approach. There are default values for the full list, but if the user just needs to change one of these values, they need to specify all the other default values manually, which is very inconvenient. Thus, if an element of the list is not supplied, the default value should be used.

I think the easiest way out of this is to start with the default list at the top of the function and modify it with the elements of the list supplied in the function call.
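
A minimal sketch of that pattern using utils::modifyList() (the argument names below are illustrative, not the actual internals):

# Full set of defaults, defined at the top of the function
default_args <- list(type = "fixed_sigma", fixed_sigma = 0.1, n_samples_aicc = 1000)

# The user only supplies the value(s) they want to change
user_args <- list(fixed_sigma = 0.5)

# Defaults are overridden only where the user supplied something
args <- utils::modifyList(default_args, user_args)
# args: type = "fixed_sigma", fixed_sigma = 0.5, n_samples_aicc = 1000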

Update vignette

  • Include section on "Advanced usage" or so including how to explain a custom model. Use the example script under inst.

  • For the "Advanced usage" section include an example with the combined approach. See the example in the documentation in #134 .

  • May also add an example with the independence version here (so people easily can see that the results differ).

  • Fix broken link to the comparison with Lundberg's Python implementation.

Require input data to be of class data.frame

When expanding the package to handle categorical variables, it is beneficial to only allow input data (training and testing) of class data.frame (or data.table). This allows us to stop the procedure if one tries to use the existing numeric methods when categorical variables are provided (or potentially one-hot encode them in that case).

Agree, @nikolase90?
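
A hedged sketch of such a guard (illustrative, not the actual implementation):

check_input_data <- function(x) {
  # data.table inherits from data.frame, so both pass this check
  if (!is.data.frame(x)) {
    stop("Training and testing data must be a data.frame or data.table.", call. = FALSE)
  }
  data.table::as.data.table(x)
}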

Enhancements

  • Update shapr with feature labels
  • Add check that uses exact = FALSE if m is greater than a given number
  • Find a way to re-use D matrix if you run code again
  • Add n_samples as an argument in explain
  • Fix documentation for type in explain.combined
  • Don't copy code in explain.combined
  • Rename argument noSamp as n_combinations in feature_combinations
  • Rename x_test in prepare_data.copula
  • Add check for single prediction in shapr and explain (want to check that correct features are passed into the functions)

Require specific format in sample_** functions

Currently, the sample_gaussian and sample_copula functions can take input in a number of different formats. We should restrict this to a single format to reduce the possibility of errors. Both px1 and 1xp are currently OK per the documentation, but px1 will actually fail if m = 0 or m = p.

I suggest using data.table all the way, or 1xp matrix.

Ref #120 and #122

Change header at the pkgdown site

I suggest changing the Article header on the pkgdown site to "Vignettes", OR (if we plan to put the JOSS paper here as well when it is accepted), changing the title on the button you click to open the vignette to something like "Vignette: Understanding shapr", describing that this is a vignette.

Parallelization

We should add 4 arguments for parallelization:

  1. One concerning parallelization of the predictions method (which is passed to prediction_vector)
  2. One for parallelization of the sampling method when either the Gaussian or copula method is used.
  3. One which concerns parallelization over test samples in compute_kshap
  4. One for parallelization of distance computation in prepare_kshap.

We should also add a test checking that either argument 3, or both of arguments 1 and 2, are set to 1 core, to avoid parallelization within parallelization.
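
A sketch of such a check (the argument names are illustrative):

check_parallel_args <- function(n_cores_pred, n_cores_sample, n_cores_samples) {
  # If parallelizing over test samples (argument 3), the inner methods
  # (arguments 1 and 2) must run on a single core, and vice versa
  if (n_cores_samples > 1 && (n_cores_pred > 1 || n_cores_sample > 1)) {
    stop("Nested parallelization: use multiple cores either across test samples ",
         "or within the prediction/sampling methods, not both.", call. = FALSE)
  }
}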

Fix redundant arguments

A list of potentially redundant function arguments that we could remove to simplify code.

  • We may delete the reduce_dim argument of feature_combinations (always set it to TRUE)
  • If we delete the reduce_dim argument, we probably don't need the use_shapley_weights_in_W argument in weight_matrix, as the no-column is always 1 (check!!!), and then we can simplify this code by always using the shapley_weight column only as the weights.
  • Remove type in inv_gaussian_transform

Package development

Just leaving a few possible tasks for further improvements of the package

  • Should perhaps parallelize the distance computations in prepare_kernelShap, but computing 10000x100x1002 takes a minute or two and is 7.5 GB large, so much more than that cannot be done anyway without spilling to disk (could do that, though?)
  • If we are using the Gaussian approach, the computation of the distances is not necessary, and the computations could be sped up significantly.
  • We should add three arguments for parallelization: one concerning parallelization of the predictions method (which is passed to pred_vector), a second concerning parallelization over test samples in compute_kernelShap (add a check that at least one of these is set to 1), and a third for parallelization of the distance computation in prepare_kernelShap.

Output of compute_kshap

Currently we output a matrix with the explanations in Kshap. Maybe a data.table with original column names is better?

Also, adding the actual predictions for the test data in the output list would be good.
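
A sketch of that suggestion (the object names are hypothetical):

# Instead of a bare matrix, return a data.table that keeps the feature names,
# plus the predictions for the test data
kshap_dt <- data.table::as.data.table(kshap_matrix)   # kshap_matrix: the current output
data.table::setnames(kshap_dt, c("none", colnames(Xtest)))

output <- list(
  kshap = kshap_dt,
  pred_test = pred_vector(model, Xtest)   # pred_vector as referenced in the issues above
)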

Vignette fails with uninformative error

When package gbm is missing, the vignette fails to knit, and the error message doesn't make it very clear that gbm is the culprit.

I suggest adding library(gbm) somewhere before line 399, so that a more informative error message is printed.
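
Alternatively, a sketch of an explicit early check that fails with a clearer message (standard R idiom):

if (!requireNamespace("gbm", quietly = TRUE)) {
  stop("Package 'gbm' is needed to build this vignette. ",
       "Please install it with install.packages(\"gbm\").")
}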

Add linting of package

The following files should be adjusted:

  • clustering.R
  • explanation.R
  • features.R
  • models.R
  • observations.R
  • plot.R
  • predictions.R
  • sampling.R
  • shapley.R
  • shapr-package.R
  • transformation.R
  • utils.R
