Comments (14)
- I think this is possible. Without working code and data, I can't say more, I am afraid.
- A discrete feature will appear like this if it (a) does not interact with any other feature or (b) if it only interacts with other discrete features.
from shapviz.
Thanks for the test, highly appreciated!
- Regarding the vertical lines: this is not at all a problem. The beeswarm plot just uses the SHAP value density as range for the vertical scatter. If the "density" is discrete (as with discrete features interacting only with discrete features), you will get vertical lines.
- Regarding the other issue: Are still both categories ("Yes", "No") of the weird feature above 0? If yes: are there missing values in this feature? (I think not). And: does the model response have more than two categories? Then, it would be interesting to see the beeswarm plot of all response classes.
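A toy sketch of the vertical-lines mechanism (purely illustrative, not from this thread): in an additive model, a binary feature contributes only two distinct SHAP values, so its beeswarm row collapses into two vertical lines.

```r
# Illustrative only: a binary feature whose effect is purely additive.
# Its SHAP values are f(x) minus the average of f(x) and therefore take
# just two distinct values -> the beeswarm scatter shows vertical lines.
set.seed(1)
x <- rbinom(1000, size = 1, prob = 0.5)
shap_x <- 2 * x - 2 * mean(x)   # SHAP values of the additive term 2 * x
length(unique(shap_x))          # only 2 distinct values
```

With an interacting continuous feature, the SHAP values would spread out and the vertical lines would disappear.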
from shapviz.
I think one could write Tidymodels code that would do this. But the current code is not proper. E.g., calling prep() once on the prediction data X_pred and once on the display data X will produce different SMOTE rows, I think.
from shapviz.
Unfortunately, I can't share actual data for this example.
I would be happy to share as much information/code as possible. Which parts of the code would be of interest in order to judge the two points?
from shapviz.
Point 2 is solved, see answer. Point 1 might be a consequence of using different preprocessing for training and for the SHAP calculations. You can try to fit the model without the Tidymodels stuff and see if the results make more sense (or are identical). Btw., you don't need to pass an X.
Wild guess: I think your Tidymodels workflow is not consistent with the prep() logic. Put differently: something seems to be wrong. Maybe you can add the code that produces the model?
from shapviz.
Thanks again @mayer79
I very much appreciate your support.
Setup
set.seed(53422)
split <- initial_split(data = data, prop = 0.7, strata = "target")
train_model <- training(split)
set.seed(64575)
cv_folds <-
vfold_cv(data = train_model,
v = 10,
repeats = 3,
strata = "target")
set.seed(44444)
ctrl_grid <-
control_grid(save_pred = TRUE, save_workflow = TRUE,
allow_par = TRUE, verbose = TRUE, parallel_over = "everything")
Models:
set.seed(79743)
boost_tree_xgboost_spec <-
boost_tree(tree_depth = tune(),
trees = tune(),
learn_rate = tune(),
min_n = tune(),
stop_iter = tune()) %>%
set_engine('xgboost', importance = TRUE, validation = 0.2) %>%
set_mode('classification')
set.seed(79743)
rand_forest_ranger_spec <-
rand_forest(trees = tune(),
mtry = tune(),
min_n = tune()) %>%
set_engine('ranger', importance = "impurity", verbose = TRUE) %>%
set_mode('classification')
set.seed(79743)
decision_tree_rpart_spec <-
decision_tree(tree_depth = tune(),
min_n = tune(),
cost_complexity = tune()) %>%
set_engine('rpart') %>%
set_mode('classification')
Recipes:
basic_rec <-
recipe(formula = target ~ ., data = train_model) %>%
step_integer(all_nominal_predictors(), zero_based = TRUE)
smote_rec <-
recipe(formula = target ~ ., data = train_model) %>%
step_integer(all_nominal_predictors(), zero_based = TRUE) %>%
step_smote(target , seed = 44444)
Modeling:
Workflow-Map
models = list(xgb = boost_tree_xgboost_spec, rf = rand_forest_ranger_spec, dt = decision_tree_rpart_spec)
preproc = list(basic = basic_rec, smote = smote_rec)
wflowset <-
workflow_set(preproc = preproc, models = models, cross = TRUE)
set.seed(44444)
wflowmap <-
workflow_map(object = wflowset,
fn = "tune_grid",
resamples = cv_folds,
grid = 30,
metrics = classification_metrics,
control = ctrl_grid,
verbose = TRUE,
seed = 54321)
Finalizing
workflow_results_res <-
wflowmap %>%
rank_results(rank_metric = "roc_auc") %>%
filter(.metric == "roc_auc") %>%
select(wflow_id, model, .config, metric = mean, rank) %>%
group_by(wflow_id) %>%
slice_min(rank, with_ties = FALSE) %>%
ungroup() %>%
arrange(rank)
workflow_results_res_id_best <-
workflow_results_res %>%
slice_min(rank, with_ties = FALSE) %>%
pull(wflow_id)
workflow_results_res_best <-
wflowmap %>%
extract_workflow_set_result(id = workflow_results_res_id_best) %>%
select_best(metric = "roc_auc")
set.seed(2023)
workflow_results_res_best_fit <-
wflowmap %>%
extract_workflow(workflow_results_res_id_best) %>%
finalize_workflow(workflow_results_res_best) %>%
last_fit(split = split, metrics = classification_metrics)
Explainability
X = bake(prep(smote_rec), new_data = train_model)
X_pred = bake(prep(smote_rec), has_role("predictor"), new_data = train_model, composition = "matrix")
set.seed(2023)
SHAP_VIZ_Explainer <-
shapviz::shapviz(object = workflow_results_res_best_fit %>% extract_fit_engine(), X_pred = X_pred, X = X)
names(SHAP_VIZ_Explainer) <- c("Class1", "Class2", "Class3")
shapviz::sv_importance(object = SHAP_VIZ_Explainer, kind = "beeswarm", max_display = 10, show_numbers = FALSE)
from shapviz.
Does the same happen for a non-SMOTE model?
from shapviz.
Good point. I've tested this with the basic recipe, meaning no SMOTE applied to the training data during preprocessing.
- The prominent pattern of vertical lines (2) is not present anymore.
- The feature with the weird distribution (1) is much less important overall, and no other feature has a similar distribution of SHAP values.
from shapviz.
Hi @mayer79, sorry for the delayed response.
Please find attached the beeswarm plots for both versions (top: SMOTE / bottom: no SMOTE) for all three classes.
There are no missing values in the data and the target variable does indeed have 3 categories.
from shapviz.
Oh wow, thanks - the plots without SMOTE look perfectly fine. So I think we can close the issue? There is no good reason to use SMOTE on tabular data anyway.
Probably, if SMOTE were applied not only to the training data but also to the explanation data, the drift from 0 would vanish as well.
from shapviz.
Just for my better understanding:
- This "effect", which may relate to SMOTE, is only present in one variable.
- Tidymodels used SMOTE only on the training data, but not on the evaluation of the testing data via last_fit(), which resulted in the extracted fitted engine used in shapviz.
- SMOTE is then applied to the training data via the recipe in the X_pred step in shapviz.
What would be the explanation data in this case?
from shapviz.
@viv-analytics : I don't think it is an issue of Tidymodels, rather of your code. The SMOTE step should be fitted once, and then applied 1:1 to the explanation data. In your code, SMOTE is repeatedly fitted: once for the modeling, then for the explanation data X_pred, and even separately for the display data X. In this specific application, there is no reason to define a "display" data X, as it is identical to X_pred.
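A minimal sketch of the "fit once, apply 1:1" idea, reusing the object names from the code above (an assumed rewrite, not tested against the actual data; note that step_smote() has skip = TRUE by default, so it is not applied when baking with new_data):

```r
# Prep (fit) the recipe a single time ...
prepped <- prep(smote_rec)

# ... and bake the explanation data once from that same prepped recipe
X_pred <- bake(prepped, has_role("predictor"),
               new_data = train_model, composition = "matrix")

# No separate display data needed: shapviz() uses X = X_pred by default
shap_obj <- shapviz::shapviz(
  object = extract_fit_engine(workflow_results_res_best_fit),
  X_pred = X_pred
)
```

This way, the same preprocessing is applied consistently, instead of refitting SMOTE for each of the model, X_pred, and X.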
from shapviz.
@mayer79 : Thanks for clarifying this point.
I'm not entirely sure how this could be achieved in a way that shapviz digests correctly. Therefore, I'd appreciate any hint.
Thanks in advance
from shapviz.
My advice for SMOTE: never use it on tabular data. Rather, optimize the right metric, e.g., logloss.
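For instance, instead of rebalancing the classes with SMOTE, the tuning could optimize a proper scoring rule (a sketch, assuming the yardstick package; classification_metrics is the metric set referenced in the code above):

```r
library(yardstick)

# Tune against log loss (a proper scoring rule) rather than
# compensating for class imbalance via SMOTE resampling
classification_metrics <- metric_set(mn_log_loss, roc_auc)
```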
I will close the Issue for now.
from shapviz.