
kernelshap's People

Contributors

mayer79, pbiecek


kernelshap's Issues

"no applicable method for 'predict' when using 'ranger' random forest

I'm trying to calculate SHAP values for a random forest with 2300 trees, and I keep getting this message.

Error in kernelshap_one(x = X[i, , drop = FALSE], v1 = v1[i, , drop = FALSE], :
task 1 failed - "no applicable method for 'predict' applied to an object of class "ranger""

The demo with the diamonds dataset works fine, so I don't know why it doesn't work with my own data.

The rf syntax looks like this:

rf <- ranger(
  OverOrdAltiAltFactor ~ .,
  data = df,
  num.trees = 2300,
  probability = TRUE
)

The syntax for SHAP calculation:
shaps <- kernelshap(rf, sample_big, bg_X = sample_small, parallel = TRUE)

If I don't use probability = TRUE with ranger, I get an error that "predictions must be a vector, matrix, data.frame or 2D array".

I can't share the data unfortunately.
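
A hedged guess at the cause: with parallel = TRUE, the parallel workers may not have the ranger package attached, so predict() cannot dispatch on objects of class "ranger". If your kernelshap version offers a parallel_args argument (an assumption here), exporting the package to the workers might help:

# Untested sketch: make "ranger" available on the parallel workers so that
# predict() can dispatch on the "ranger" class.
shaps <- kernelshap(
  rf, sample_big, bg_X = sample_small,
  parallel = TRUE,
  parallel_args = list(.packages = "ranger")
)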

Exclude some columns from SHAP calculations

In some situations, X contains non-feature columns. An example is a model

lm(y ~ . - x1, data = X).

Its predict function will throw an error if "x1" is dropped from X. Hence, to use kernelshap on this model, we would need to calculate SHAP values also for column "x1", which is unnecessary.

It would therefore be nice to have an argument feature_names = colnames(X) to restrict the feature set in such cases.
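
For illustration, a call with the proposed (hypothetical) argument could look like this:

# Hypothetical call with the proposed feature_names argument: "x1" stays in X
# so that predict() works, but receives no SHAP values.
fit <- lm(y ~ . - x1, data = X)
shp <- kernelshap(
  fit, X, bg_X = bg_X,
  feature_names = setdiff(colnames(X), "x1")
)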

Asymmetric Shapley values

Does this also work for calculating asymmetric Shapley values, which allow incorporating causal information from domain knowledge? The GitHub repository of a package implementing asymmetric Shapley values is here:

https://github.com/nredell/shapFlex

It is based on the findings of this paper:

https://proceedings.neurips.cc/paper/2020/file/0d770c496aa3da6d2c3f2bd19e7b9d6b-Paper.pdf

There is also a YouTube talk by the researchers explaining the theory:

https://www.youtube.com/watch?app=desktop&v=7d13f4UaAn0

Remove inner covariance updates

The core function uses a nested loop to update A and b. The inner loop can be replaced by a vectorized update, which makes the algorithm easier to read.
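
A rough sketch of the vectorized update (the names A, b, Z, vz, and v0_ext follow the notation used elsewhere in these issues and are assumptions here):

# Sketch only: accumulate a whole batch of on-off vectors Z at once
# instead of looping over its rows and columns.
A <- A + crossprod(Z) / nrow(Z)               # replaces the inner loop over feature pairs
b <- b + crossprod(Z, vz - v0_ext) / nrow(Z)  # replaces the inner loop over features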

Naming of multiresponse models

The result of kernelshap() applied to a multiresponse model contains an unnamed list S of SHAP values. It would be nice if the elements were named according to the prediction function's output. The same applies to SE.

"hybrid" strategy

SHAP kernel weights put at least 75% of the mass on vectors z with sum(z) in {1, p-1}.

The current sampling strategy draws vectors z from this distribution. Depending on p, at least 75% of the draws fall into this category, leaving little mass for all other cases. This is very wasteful, as there are only 2p distinct z vectors with sum(z) in {1, p-1}.

A hybrid between the exact approach (listing all possible z vectors and weighting them accordingly) and the sampling approach (drawing z vectors from the right distribution) seems much better:

  1. Enumerate the 2p vectors with sum(z) in {1, p-1}.
  2. Draw the remaining m - 2p vectors according to the paired sampling strategy (see the sketch below).

Weight corrections as in the exact case will be necessary to weight up the exact part and (uniformly) weight down the random part.
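
A rough sketch of how one batch of the hybrid draw could be assembled (all names are illustrative, and the weight correction described above is omitted):

# Illustrative only: p features, m on-off vectors in total.
p <- 10L
m <- 2L * p + 32L

# 1. Exact part: all 2p vectors z with sum(z) in {1, p-1}
Z_exact <- rbind(diag(p), 1 - diag(p))

# 2. Sampled part: draw the remaining m - 2p vectors from the SHAP kernel
#    distribution restricted to 2 <= sum(z) <= p - 2, using paired sampling
S <- 2:(p - 2)
probs <- (p - 1) / (choose(p, S) * S * (p - S))
probs <- probs / sum(probs)
n_half <- (m - 2L * p) %/% 2L
sizes <- sample(S, n_half, replace = TRUE, prob = probs)
Z_half <- t(vapply(sizes, function(k) {
  z <- numeric(p)
  z[sample(p, k)] <- 1
  z
}, numeric(p)))
Z_sampled <- rbind(Z_half, 1 - Z_half)  # add complements (paired sampling)

Z <- rbind(Z_exact, Z_sampled)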

The Python implementation goes a step further and also enumerates all z with sum(z) in {2, p-2}, etc. While we could do the same, we prefer the much simpler approach outlined here and rely on the iterative sampling logic of Covert and Lee (2021). A drawback of also enumerating the cases with sum(z) in {2, p-2} etc. is that the sample size m must be relatively large.

Comparing this hybrid approach with the Python logic: Python uses a larger m, while we use a smaller m but iterate until the SHAP values are sufficiently precise. Both make sense.

kernelshap calculation makes R abort

I'm trying to calculate SHAP values for random forests from the ranger package. The random forests have between 1000 and 8000 trees, and the datasets between 5000 and 37000 rows. The only model I managed to calculate SHAP values for had 850 trees; the rest all crashed R. I find it odd that none of the tutorials or vignettes mention that this could happen, so I am wondering whether something is wrong with my models or whether this is expected. I cannot post the data due to privacy issues (government data), but the variables are all numerical except one binary variable. One group of datasets has 8 variables (there are datasets for different years), the other 32. The dependent variable is a 3-level factor. I'm using a 32 GB MacBook Pro and R 4.1.2.

This is the code I am using:

ntree <- 850

set.seed(199)
model <- ranger(
  Satisfaction ~ .,
  data = df20,
  importance = "permutation",
  num.trees = ntree20,
  mtry = 3,
  probability = TRUE,
  case.weights = class_weights[df20$Satisfaction]
)

set.seed(199)
xvars <- colnames(df20[-9])
X <- df20[sample(nrow(df20), 500), xvars]

bg_X <- df20[sample(nrow(df20), 200), ]

shap_rf <- kernelshap(model, X, bg_X = bg_X)

Remove ks_extract()

I am proposing to remove the ks_extract() function. It allows extracting objects like S from a "kernelshap" object x, but the usual extractors x$S or x[["S"]] suffice.

Parallel processing

Thanks for this awesome package!

I was wondering if it might be possible to have an option for parallel processing? I'm thinking specifically of the big for loop:

for (i in seq_len(n)) res[[i]] <- kernelshap_one(...)

which could easily be replaced with foreach:

res <- foreach(i = seq_len(n)) %dopar% kernelshap_one(...)

The downside is, you'd lose the progress bar. (There are some methods for doing parallel progress bars in R, but they all kind of suck in my experience.) The upside is, you can speed up the slowest part of the task by a factor of n_cores. Perhaps a logical parallel argument could allow users to toggle between the two?
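
For illustration, here is the minimal pattern of such a swap with a doParallel backend (square() is just a stand-in for the per-row kernelshap_one() call):

# Minimal sketch of replacing a sequential loop by foreach/%dopar%
# (doParallel backend assumed; not the package's actual implementation).
library(foreach)
library(doParallel)

square <- function(i) i^2  # stand-in for kernelshap_one()
n <- 8L

cl <- makeCluster(2)
registerDoParallel(cl)
res <- foreach(i = seq_len(n)) %dopar% square(i)
stopCluster(cl)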

suppress message in kernelshap

Thank you for developing such a nice package! However, when I apply this function, I have a problem suppressing messages.

The script looks like this:
library(kernelshap)
ks <- kernelshap(model_final, X = training_data, bg_X = training_data, verbose = FALSE)

output:
1/1 [==============================] - 0s 21ms/step
1/1 [==============================] - 0s 18ms/step
19/19 [==============================] - 0s 1ms/step
19/19 [==============================] - 0s 2ms/step
(many more "19/19 [...] - 0s Xms/step" lines follow)

I tried two ways to suppress these messages, but neither works:
suppressMessages({ ks <- kernelshap(model_final, X = training_data, bg_X = training_data, verbose = FALSE) })

invisible(capture.output(kernelshap(model_final, X = training_data, bg_X = training_data, verbose = FALSE)))
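
The progress lines most likely come from Keras' own predict() rather than from kernelshap itself, so a possible (untested) workaround is a custom pred_fun that silences Keras:

# Untested workaround: pass verbose = 0 to predict() inside a custom pred_fun
# so the Keras progress output is silenced at its source.
ks <- kernelshap(
  model_final,
  X = training_data,
  bg_X = training_data,
  pred_fun = function(object, newdata) predict(object, newdata, verbose = 0),
  verbose = FALSE
)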

Allow background dataset with 1 row

Currently, the background dataset is not allowed to have only a single row. This is not ideal as there are certain important cases where this is explicitly wanted:

  1. In cases where there is a natural "null" value: e.g., MNIST digits (a value of 0 or 255 means no color). Here, you would want to pass a one-row data.frame or matrix containing all 0 or 255.
  2. In models that can naturally deal with missing values. Here, you would want to pass a one-row data.frame or matrix containing all NAs.

In such cases, the algorithm will run extremely fast.
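
For illustration, the kind of call this would enable (hypothetical, since a single background row is currently rejected):

# Hypothetical once a one-row background is allowed: explain against a single
# all-zero row, e.g., a blank MNIST-style image.
bg <- X[1, , drop = FALSE]
bg[, ] <- 0
shp <- kernelshap(fit, X, bg_X = bg)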

Optional feature selection

This confused me for a bit because I assumed the L1 penalty was part of the optimization itself (i.e., just another constraint in Eq. 6 of the Covert & Lee paper). But a closer look at the shap code reveals that this is not the case. They just fit a lasso for feature selection, then remove some column(s) from Z and run the regular pipeline.

We could implement this with glmnet. We'd need an l1 argument, with possible values NULL (the default, no regularization), "aic", "bic" (for adaptive feature selection), or some integer less than p to set a target number of variables. Then within kernelshap_one, we'd have something like the following (after Z and vz are defined):

# Optional feature selection (needs the 'glmnet' package)
if (!is.null(l1)) {
  # Fit lasso on the on-off matrix Z
  fit <- glmnet::glmnet(x = Z, y = (vz - v0_ext), weights = w, intercept = FALSE)
  if (l1 %in% c("aic", "bic")) {
    # Compute information criteria for adaptive feature selection
    y_hat <- predict(fit, newx = Z, s = fit$lambda)
    eps <- y_hat - (vz - v0_ext)
    rmse <- sqrt(colMeans(eps^2))
    ll <- sapply(seq_along(fit$lambda), function(l) {
      sum(dnorm(eps[, l], sd = rmse[l], log = TRUE))
    })
    aic <- 2 * fit$df - 2 * ll
    bic <- log(nrow(Z)) * fit$df - 2 * ll
    k <- if (l1 == "aic") which.min(aic) else which.min(bic)
    keep <- predict(fit, s = fit$lambda[k], type = "nonzero")$s1
  } else {
    # Or select some fixed number of features. This can be tricky when
    # no value of lambda delivers precisely the target number of nonzeros
    if (l1 %in% fit$df) {
      k <- max(which(fit$df == l1))
      keep <- predict(fit, s = fit$lambda[k], type = "nonzero")$s1
    } else {
      k <- min(which(fit$df >= l1))
      beta <- abs(as.numeric(coef(fit, s = fit$lambda[k]))[-1])  # drop the intercept row
      keep <- order(beta, decreasing = TRUE)[1:l1]
    }
  }
  Z <- Z[, keep]
}

This updated Z is then used to define A and b. Of course, we'd need to set the beta for dropped features to 0 prior to output.

What do you think? I can create a pull request if you like.

Keep track of dimnames

kernelshap involves working with different matrices of different shapes.

We should consistently use feature names for the "p" dimension and prediction names (if available) for the "K" dimension. This would make the code easier to read.

Does kernelshap support long short-term memory (LSTM) models?

I am trying to explain a long short-term memory (LSTM) model built with the keras package, but it does not work. The code is:
library(keras)
library(kernelshap)

set.seed(123)
x <- matrix(runif(100), ncol = 5)
y <- runif(20)

n_samples <- 20
n_timesteps <- 2
n_features <- 5

x_samples <- array(0, dim = c(n_samples, n_timesteps, n_features))
y_samples <- array(0, dim = c(n_samples, 1))

for (i in 1:(n_samples - n_timesteps + 1)) {
  x_samples[i, , ] <- x[i:(i + n_timesteps - 1), ]
  y_samples[i, ] <- y[i + n_timesteps - 1]
}

model <- keras_model_sequential()
model %>%
  layer_lstm(units = 32, input_shape = c(n_timesteps, n_features)) %>%
  layer_dense(units = 1) %>%
  compile(loss = "mean_squared_error", optimizer = optimizer_adam())

model %>% fit(x_samples, y_samples, epochs = 20, batch_size = 1, verbose = 0)

ks <- kernelshap(model, X = x_samples, bg_X = x_samples, verbose = FALSE)

Error in kernelshap.default(model, X = x_samples, bg_X = x_samples, verbose = F) :
is.matrix(X) || is.data.frame(X) is not TRUE

Adding predictions to output

kernelshap() calculates predictions on the explainer data X. For some applications, it would be convenient to return the $(n \times K)$ matrix of predictions. $n$ is the number of rows in X, $K$ is the dimensionality of a single prediction (usually 1).

Thus I am proposing to add a matrix of predictions to the output of kernelshap().

Remove dependency on MASS

I would like to reduce the dependency footprint of "kernelshap". One candidate is MASS::ginv(), which is the only function imported from "MASS". Thus, if the MASS license permits, I would love to provide ginv() directly, without importing MASS.
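
For illustration, a minimal Moore-Penrose inverse built on base R's svd(); this is a sketch of what an internal replacement could look like, not necessarily a line-by-line copy of MASS::ginv():

# Minimal sketch of a generalized (Moore-Penrose) inverse via SVD, base R only.
# Singular values below a relative tolerance are treated as zero.
ginv_sketch <- function(X, tol = sqrt(.Machine$double.eps)) {
  s <- svd(X)
  keep <- s$d > tol * max(s$d)
  if (!any(keep)) {
    return(array(0, dim = rev(dim(X))))
  }
  s$v[, keep, drop = FALSE] %*% (t(s$u[, keep, drop = FALSE]) / s$d[keep])
}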

Reduce number of extract functions

The package contains a large number of ks_*() extraction functions. To avoid cluttering the namespace, they should be replaced by a single ks_extract(x, what = "S") function.

doRNG orphaned

The CRAN test results of "kernelshap" show two warnings due to the orphaned dependency "doRNG":

https://cran.r-project.org/web/packages/doRNG/index.html

We used it to produce reproducible results under parallel computing. Since kernelshap's results are hardly affected by randomness, we can replace it by plain %dopar% (e.g., with a "doFuture" backend). "doFuture" has a very stable maintainer.

Fix inconsistency between paired and simple sampling

With paired sampling, 2m on-off vectors are drawn; with simple sampling, only m. This makes a direct comparison unfair.

To fix this inconsistency, we modify paired sampling to draw only m/2 on-off vectors and then add their complements. To ensure identical results for the (default) paired strategy, the "auto" values of m are doubled.

This will leave most applications of kernelshap() unchanged, but fixes the inconsistency.

Automatic selection of background data

A pain point in using {kernelshap} is the manual preparation of the background data bg_X. Most applications have a relatively large explanation dataset X and a model m, but no background data. It would be convenient to derive the background data automatically from X, so that a SHAP analysis would start with:

shp <- kernelshap(m, X)

Suggestion: Set the default bg_X = NULL. In this case, use the following logic (sketched in code after the list):

  • If nrow(X) <= 200 -> bg_X = X
  • If nrow(X) < 20 -> warning("X is too small to be used as background data. Please pass a larger background data via 'bg_X'.")
  • If nrow(X) > 200 -> message("X is too large to be used as background data. We randomly select 200 rows from it.") and do subsampling

If bg_X = NULL and the user wants to pass a vector of case weights bg_w, the latter would need to correspond to X (and would have to be subsampled correspondingly).
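
A minimal sketch of that default logic (the helper name, the cutoff of 200 rows, and the exact messages are assumptions, not the final implementation):

# Sketch of the proposed default handling of bg_X (names are illustrative).
prepare_bg <- function(X, bg_X = NULL, bg_w = NULL, bg_n = 200L) {
  if (!is.null(bg_X)) {
    return(list(bg_X = bg_X, bg_w = bg_w))
  }
  n <- nrow(X)
  if (n < 20L) {
    warning("X is too small to be used as background data. ",
            "Please pass a larger background data via 'bg_X'.")
  }
  if (n <= bg_n) {
    return(list(bg_X = X, bg_w = bg_w))
  }
  message("X is too large to be used as background data. ",
          "We randomly select ", bg_n, " rows from it.")
  ix <- sample(n, bg_n)
  list(bg_X = X[ix, , drop = FALSE], bg_w = if (!is.null(bg_w)) bg_w[ix])
}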

Ping @pbiecek

Error in unserialize(socklist[[n]]) : error reading from connection

I am dealing with a large dataset (27000 x 140). When I first try kernelshap without parallel computation, there is no movement in the progress bar (it stays at 0). When I turn on parallel processing as described on the home page of your repo, I get the error:

Error in unserialize(socklist[[n]]) : error reading from connection

What should I do? I am using Windows with a Core i7-12700K.

Exact solutions

I think we can adapt the existing code for exact solutions to extend the approach beyond $p = 5$ while actually speeding things up. It might go something like this.

  1. We don't need any data to compute A, which only depends on p. So a simple utils function will do the trick. (Fun fact, the formula for off-diagonal values in Covert & Lee's Appx. A is actually wrong. You can verify this by comparing it with brute force results as in your existing crossprod(Z) / n_Z when Z is exact.)
# Here's the proposed utils function
off_diag <- function(p) {
  S <- 1:(p - 1)
  probs <- (p - 1) / (choose(p, S) * S * (p - S))
  probs <- probs / sum(probs)
  c_pr <- (0.5 * S^2 - 0.5 * S) / choose(p, 2)
  sum(probs * c_pr)
}

# Then A is just given by
if (exact) {
  A <- diag(0.5, nrow = p)
  A[A == 0] <- off_diag(p)
  ...
}
  2. The tougher part is computing b without any extra rows beyond the $2^p - 2$ required for complete enumeration of Z-vectors. The trick is just to use those weights that are currently computed as part of create_exact_Z, but deliver them along with the matrix Z. This can be done outside the kernelshap_one loop to avoid unnecessary computation.
# Another utils function, this one to replace create_exact_Z
complete_Z <- function(p) {
  # Create Z
  z <- do.call(expand.grid, replicate(p, 0:1, simplify = FALSE))
  z <- z[2:(nrow(z) - 1), ]
  z <- as.matrix(z)
  dimnames(z) <- NULL
  rs <- rowSums(z)
  # Compute weights
  S <- 1:(p - 1)
  probs <- (p - 1) / (choose(p, S) * S * (p - S))
  probs <- probs / sum(probs)
  w <- probs / as.vector(table(rs))
  w <- w[rs]
  out <- list('Z' = z, 'w' = w)
  return(out)
}

# Then we can resume the code above
if (exact) {
  A <- diag(0.5, nrow = p)
  A[A == 0] <- off_diag(p)
  cz <- complete_Z(p)
  Z <- cz$Z
  w <- cz$w
  b <- crossprod(Z, w * (vz - v0_ext))
}

Of course, the value functions depend on the data, so b will need to be computed within the kernelshap_one loop. Everything else can be done outside, though, and requires only $2^p - 2$ calls to pred_fun per sample, which is fewer than the current create_exact_Z implementation. This also allows for arbitrary $p$ if people are willing to wait for exact results.

Just a thought, anyway! Happy to create a pull request from the current branch if you're interested.

Interface change: "strategy"

Currently, there are four sampling strategies implemented:

  1. exact = TRUE
  2. exact = "auto"
  3. paired_sampling = TRUE (and exact = FALSE)
  4. paired_sampling = FALSE (and exact = FALSE)

For the user, it is relatively difficult to see how these arguments interact. Furthermore, additional strategies might be added in the future.

A clearer interface would replace the two arguments "exact" and "paired_sampling" by a single one, "sampling_strategy" (illustrated after the list):

  1. sampling_strategy = "auto" (same as exact = "auto")
  2. sampling_strategy = "exact" (same as exact = TRUE)
  3. sampling_strategy = "paired" (same as exact = FALSE with paired_sampling = TRUE)
  4. sampling_strategy = "simple" (same as exact = FALSE with paired_sampling = FALSE)
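
Hypothetical calls with the proposed interface:

# Hypothetical, assuming sampling_strategy replaces "exact" and "paired_sampling":
shp_exact  <- kernelshap(fit, X, bg_X = bg, sampling_strategy = "exact")
shp_simple <- kernelshap(fit, X, bg_X = bg, sampling_strategy = "simple")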

Ping @dswatson
