
Comments (14)

mayer79 commented on May 18, 2024
  1. I think this is possible. Without working code and data, I can't say more, I am afraid.
  2. A discrete feature will appear like this if it (a) does not interact with any other feature or (b) interacts only with other discrete features.

from shapviz.

mayer79 commented on May 18, 2024

Thanks for the test, highly appreciated!

  • Regarding the vertical lines: this is not at all a problem. The beeswarm plot just uses the SHAP value density as range for the vertical scatter. If the "density" is discrete (as with discrete features interacting only with discrete features), you will get vertical lines.
  • Regarding the other issue: are both categories ("Yes", "No") of the weird feature still above 0? If yes: are there missing values in this feature? (I think not.) And does the model response have more than two categories? If so, it would be interesting to see the beeswarm plots of all response classes.
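
To illustrate the point about vertical lines, here is a minimal base-R sketch. It assumes a toy additive model (not the model from this thread), for which the SHAP value of a feature is simply its own effect minus the average of that effect. The names `x_disc`, `x_cont` and the effect functions are made up for illustration.

```r
set.seed(1)
n <- 1000

# Hypothetical features: one discrete (3 levels), one continuous
x_disc <- sample(c(0, 1, 2), n, replace = TRUE)
x_cont <- runif(n)

# Assumed additive toy model: f(x) = 1.5 * x_disc + sin(2 * pi * x_cont).
# For an additive model, the SHAP value of feature j is
# f_j(x_j) - mean(f_j(X_j)), so no SHAP package is needed here.
shap_disc <- 1.5 * x_disc - mean(1.5 * x_disc)
shap_cont <- sin(2 * pi * x_cont) - mean(sin(2 * pi * x_cont))

# The discrete feature has only one SHAP value per level ...
length(unique(shap_disc))

# ... while the continuous feature has (almost) as many as observations
length(unique(shap_cont))
```

In a beeswarm plot, each distinct SHAP value of the discrete feature would collapse into one vertical stack of points, which is the pattern described above.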

mayer79 commented on May 18, 2024

I think one could write Tidymodels code that does this, but the current code is not correct. E.g., calling prep() once for the prediction data X_pred and once for the display data X will produce different SMOTE rows, I think.

viv-analytics commented on May 18, 2024

Unfortunately, I can't share actual data for this example.

I would be happy to share as much information/code as possible. Which parts of the code would be of interest in order to judge the two points?

mayer79 commented on May 18, 2024

Point 2 is solved, see answer. Point 1 might be a consequence of using a different preprocessing for training and SHAP calculations. You can try to fit the model without the Tidymodel stuff and see if the results would make more sense (or are identical).

Btw., you don't need to pass an X.

Wild guess: I think your Tidymodel workflow is not consistent with the prep() logic. Put differently: something seems to be wrong. Maybe you can add the code that produces the model?
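
A minimal sketch of what "fitting without the Tidymodels layer" could look like, assuming `iris` as a stand-in for the real three-class data and arbitrary, untuned hyperparameters. It also shows that `X` can be left out of the shapviz() call.

```r
library(xgboost)
library(shapviz)

# Toy stand-in for the real data: 3-class target, numeric feature matrix
X_pred <- data.matrix(iris[, -5])
y <- as.integer(iris$Species) - 1L  # xgboost expects class labels 0, 1, 2

set.seed(1)
fit <- xgb.train(
  params = list(objective = "multi:softprob", num_class = 3, learning_rate = 0.1),
  data = xgb.DMatrix(X_pred, label = y),
  nrounds = 50
)

# X is omitted: shapviz() then uses X_pred for the displayed feature values
sv <- shapviz(fit, X_pred = X_pred)

# One beeswarm panel per response class
sv_importance(sv, kind = "beeswarm")
```

If the beeswarm plots from such a direct fit look sensible while the Tidymodels version does not, the preprocessing is the likely culprit.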

viv-analytics commented on May 18, 2024

Thanks again @mayer79
I very much appreciate your support.

Setup

library(tidymodels)  # rsample, recipes, parsnip, workflows, workflowsets, tune, yardstick
library(themis)      # provides step_smote()

set.seed(53422)
split <- initial_split(data = data, prop = 0.7, strata = "target")

train_model <- training(split)

set.seed(64575)
cv_folds <-
  vfold_cv(data    = train_model,
           v       = 10,
           repeats = 3,
           strata  = "target")

set.seed(44444)
ctrl_grid <-
  control_grid(save_pred = TRUE, save_workflow = TRUE,
               allow_par = TRUE, verbose = TRUE, parallel_over = "everything")

Models:

set.seed(79743)
boost_tree_xgboost_spec <-
  boost_tree(tree_depth = tune(),
             trees      = tune(),
             learn_rate = tune(),
             min_n      = tune(),
             stop_iter  = tune()) %>%
  set_engine('xgboost', importance = TRUE, validation = 0.2) %>%
  set_mode('classification')

set.seed(79743)
rand_forest_ranger_spec <-
  rand_forest(trees = tune(),
              mtry  = tune(),
              min_n = tune()) %>%
  set_engine('ranger', importance = "impurity", verbose = TRUE) %>%
  set_mode('classification')

set.seed(79743)
decision_tree_rpart_spec <-
  decision_tree(tree_depth      = tune(),
                min_n           = tune(),
                cost_complexity = tune()) %>%
  set_engine('rpart') %>%
  set_mode('classification')

Recipes:

basic_rec <- 
	recipe(formula = target ~ ., data = train_model) %>% 
	step_integer(all_nominal_predictors(), zero_based = TRUE)

smote_rec <-
	recipe(formula = target ~ ., data = train_model) %>% 
	step_integer(all_nominal_predictors(), zero_based = TRUE) %>% 
	step_smote(target , seed = 44444)

Modeling:

Workflow-Map

models  = list(xgb = boost_tree_xgboost_spec, rf  = rand_forest_ranger_spec, dt  = decision_tree_rpart_spec)

preproc = list(basic = basic_rec, smote  = smote_rec)

wflowset <-
  workflow_set(preproc = preproc, models = models, cross = TRUE)

set.seed(44444)
wflowmap <-
  workflow_map(object    = wflowset,
               fn        = "tune_grid",
               resamples = cv_folds,
               grid      = 30,
               metrics   = classification_metrics,
               control   = ctrl_grid,
               verbose   = TRUE,
               seed      = 54321)

Finalizing

workflow_results_res <-
  wflowmap %>%
  rank_results(rank_metric = "roc_auc") %>%
  filter(.metric == "roc_auc") %>%
  select(wflow_id, model, .config, metric = mean, rank) %>%
  group_by(wflow_id) %>%
  slice_min(rank, with_ties = FALSE) %>%
  ungroup() %>%
  arrange(rank)

# ID of the best-ranked workflow
workflow_results_res_id_best <-
  workflow_results_res %>%
  slice_min(rank, with_ties = FALSE) %>%
  pull(wflow_id)

# NOTE: the original post piped from an undefined object `workflow_results`
# in the next two steps; `wflowmap` (the tuned workflow set from above) is
# assumed to be what was meant.
workflow_results_res_best <-
  wflowmap %>%
  extract_workflow_set_result(id = workflow_results_res_id_best) %>%
  select_best(metric = "roc_auc")

set.seed(2023)
workflow_results_res_best_fit <-
  wflowmap %>%
  extract_workflow(workflow_results_res_id_best) %>%
  finalize_workflow(workflow_results_res_best) %>%
  last_fit(split = split, metrics = classification_metrics)

Explainability

X = bake(prep(smote_rec ), new_data = train_model )
X_pred = bake(prep(smote_rec ), has_role("predictor"), new_data = train_model, composition = "matrix")

set.seed(2023)
SHAP_VIZ_Explainer <- 
          shapviz::shapviz(object = workflow_results_res_best_fit %>% extract_fit_engine(), X_pred = X_pred, X = X)
names(SHAP_VIZ_Explainer) <- c("Class1", "Class2", "Class3")

shapviz::sv_importance(object = SHAP_VIZ_Explainer, kind  = "beeswarm", max_display  = 10, show_numbers = FALSE)

mayer79 commented on May 18, 2024

Does the same happen for a non-SMOTE model?

viv-analytics commented on May 18, 2024

Good point. I've tested this with the basic recipe, i.e., with no SMOTE applied to the training data during preprocessing.

  • The prominent pattern of vertical lines (2) is no longer present.
  • The feature with the weird distribution (1) is much less important overall, and no other feature has a similar distribution of SHAP values.

viv-analytics commented on May 18, 2024

Hi @mayer79, sorry for the delayed response.

Please find attached the beeswarm plots for both versions (top: SMOTE / bottom: no SMOTE) for all three classes.

There are no missing values in the data and the target variable does indeed have 3 categories.


mayer79 commented on May 18, 2024

Oh wow, thanks - the plots without SMOTE look perfectly fine. So I think we can close the issue? There is no good reason to use SMOTE on tabular data anyway.

Probably, if SMOTE were applied not only to the training data but also to the explanation data, the drift from 0 would vanish as well.

viv-analytics commented on May 18, 2024

Just for my better understanding:

  • This "effect", which may relate to SMOTE, is only present in one variable.
  • Tidymodels applied SMOTE only to the training data, not when evaluating the test data via last_fit(). The fitted engine extracted from that last_fit() result is what is passed to shapviz.
  • SMOTE is then applied again to the training data via the recipe when building X_pred for shapviz.

What would be the explanation data in this case?

mayer79 commented on May 18, 2024

@viv-analytics: I don't think it is an issue of Tidymodels, but rather of your code. The SMOTE step should be fitted once and then applied 1:1 to the explanation data. In your code, SMOTE is fitted repeatedly: once for the modeling, then again for the explanation data X_pred, and even separately for the display data X. In this specific application, there is no reason to define a "display" data X, as it is identical to X_pred.
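
One way to "fit once, then apply" could look like the sketch below. It is only an illustration under assumptions: `iris` and a single `step_normalize()` stand in for the thread's data and recipe, and the hyperparameters are arbitrary. The idea is to reuse the recipe that was prepped while the workflow was fitted, rather than calling prep() again. For a last_fit() result like the one above, the fitted workflow can presumably be obtained with extract_workflow() first.

```r
library(tidymodels)
library(shapviz)

# Toy stand-in for the thread's data and recipe (assumption)
rec <- recipe(Species ~ ., data = iris) %>%
  step_normalize(all_numeric_predictors())

spec <- boost_tree(trees = 50) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

set.seed(1)
wf_fit <- workflow(rec, spec) %>% fit(data = iris)

# Reuse the recipe prepped during fitting instead of calling prep() again
rec_prepped <- extract_recipe(wf_fit)
X_pred <- bake(rec_prepped, new_data = iris, all_predictors(),
               composition = "matrix")

# No separate display data X: it would be identical to X_pred here
sv <- shapviz(extract_fit_engine(wf_fit), X_pred = X_pred)
```

Because the preprocessing is estimated exactly once, the model and the SHAP calculation see the same transformation of the features.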

viv-analytics commented on May 18, 2024

@mayer79 : Thanks for clarifying this point.
I'm not entirely sure how this could be achieved in a way that shapviz digests correctly, so I'd appreciate any hint.

Thanks in advance

mayer79 commented on May 18, 2024

My advice for SMOTE: never use it on tabular data. Rather, optimize the right metric, e.g., log loss.

I will close the issue for now.
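
For readers unfamiliar with the metric mentioned above, here is a small base-R sketch of multiclass log loss. The helper `log_loss()` and the toy probabilities are made up for illustration. In Tidymodels, the corresponding yardstick metric is `mn_log_loss()`, which can be included in a `metric_set()` and used for tuning.

```r
# Hypothetical helper: mean negative log-probability of the true class.
# truth: factor; prob: matrix with one column per factor level, in level order.
log_loss <- function(truth, prob, eps = 1e-15) {
  p_true <- prob[cbind(seq_along(truth), as.integer(truth))]
  -mean(log(pmax(p_true, eps)))  # clip to avoid log(0)
}

truth <- factor(c("a", "b", "c", "a"), levels = c("a", "b", "c"))

# Confident and correct predictions
prob_sharp <- rbind(c(0.90, 0.05, 0.05),
                    c(0.05, 0.90, 0.05),
                    c(0.05, 0.05, 0.90),
                    c(0.90, 0.05, 0.05))

# Uninformative predictions: every class gets probability 1/3
prob_flat <- matrix(1 / 3, nrow = 4, ncol = 3)

log_loss(truth, prob_sharp)  # equals -log(0.9)
log_loss(truth, prob_flat)   # equals log(3)
```

Log loss rewards well-calibrated probabilities directly, so class imbalance is reflected in the predicted probabilities instead of being "corrected" by resampling the training data.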
