Comments (14)
- I think this is possible. Without working code and data, I can't say more, I am afraid.
- A discrete feature will appear like this if it (a) does not interact with any other feature or (b) if it only interacts with other discrete features.
from shapviz.
Thanks for the test, highly appreciated!
- Regarding the vertical lines: this is not at all a problem. The beeswarm plot just uses the SHAP value density as range for the vertical scatter. If the "density" is discrete (as with discrete features interacting only with discrete features), you will get vertical lines.
- Regarding the other issue: Are still both categories ("Yes", "No") of the weird feature above 0? If yes: are there missing values in this feature? (I think not). And: does the model response have more than two categories? Then, it would be interesting to see the beeswarm plot of all response classes.
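A toy sketch of the vertical-lines mechanism (purely illustrative, not from this thread): in an additive model, a binary feature contributes only two distinct SHAP values, so its beeswarm row collapses into two vertical lines.

```r
# Illustrative only: a binary feature whose effect is purely additive.
# Its SHAP values are f(x) minus the average of f(x) and therefore take
# just two distinct values -> the beeswarm scatter shows vertical lines.
set.seed(1)
x <- rbinom(1000, size = 1, prob = 0.5)
shap_x <- 2 * x - 2 * mean(x)   # SHAP values of the additive term 2 * x
length(unique(shap_x))          # only 2 distinct values
```

With an interacting continuous feature, the SHAP values would spread out and the vertical lines would disappear.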
from shapviz.
I think one could write Tidymodels code that would do this. But the current code is not proper. E.g., calling prep() once on the prediction data X_pred and once on the display data X will produce different SMOTE rows, I think.
from shapviz.
Unfortunately, I can't share actual data for this example.
I would be happy to share as much information/code as possible. Which parts of the code would be of interest in order to judge the two points?
from shapviz.
Point 2 is solved, see answer. Point 1 might be a consequence of using different preprocessing for training and for the SHAP calculations. You can try to fit the model without the Tidymodels stuff and see if the results make more sense (or are identical). Btw., you don't need to pass an X.
Wild guess: I think your Tidymodels workflow is not consistent with the prep() logic. Put differently: something seems to be wrong. Maybe you can add the code that produces the model?
from shapviz.
Thanks again @mayer79
I very much appreciate your support.
Setup
set.seed(53422)
split <- initial_split(data = data, prop = 0.7, strata = "target")
train_model <- training(split)
set.seed(64575)
cv_folds <-
vfold_cv(data = train_model,
v = 10,
repeats = 3,
strata = "target")
set.seed(44444)
ctrl_grid <-
control_grid(save_pred = TRUE, save_workflow = TRUE,
allow_par = TRUE, verbose = TRUE, parallel_over = "everything")
Models:
set.seed(79743)
boost_tree_xgboost_spec <-
boost_tree(tree_depth = tune(),
trees = tune(),
learn_rate = tune(),
min_n = tune(),
stop_iter = tune()) %>%
set_engine('xgboost', importance = TRUE, validation = 0.2) %>%
set_mode('classification')
set.seed(79743)
rand_forest_ranger_spec <-
rand_forest(trees = tune(),
mtry = tune(),
min_n = tune()) %>%
set_engine('ranger', importance = "impurity", verbose = TRUE) %>%
set_mode('classification')
set.seed(79743)
decision_tree_rpart_spec <-
decision_tree(tree_depth = tune(),
min_n = tune(),
cost_complexity = tune()) %>%
set_engine('rpart') %>%
set_mode('classification')
Recipes:
basic_rec <-
recipe(formula = target ~ ., data = train_model) %>%
step_integer(all_nominal_predictors(), zero_based = TRUE)
smote_rec <-
recipe(formula = target ~ ., data = train_model) %>%
step_integer(all_nominal_predictors(), zero_based = TRUE) %>%
step_smote(target , seed = 44444)
Modeling:
Workflow-Map
models = list(xgb = boost_tree_xgboost_spec, rf = rand_forest_ranger_spec, dt = decision_tree_rpart_spec)
preproc = list(basic = basic_rec, smote = smote_rec)
wflowset <-
workflow_set(preproc = preproc, models = models, cross = TRUE)
set.seed(44444)
wflowmap <-
workflow_map(object = wflowset,
fn = "tune_grid",
resamples = cv_folds,
grid = 30,
metrics = classification_metrics,
control = ctrl_grid,
verbose = TRUE,
seed = 54321)
Finalizing
workflow_results_res <-
wflowmap %>%
rank_results(rank_metric = "roc_auc") %>%
filter(.metric == "roc_auc") %>%
select(wflow_id, model, .config, metric = mean, rank) %>%
group_by(wflow_id) %>%
slice_min(rank, with_ties = FALSE) %>%
ungroup() %>%
arrange(rank)
workflow_results_res_id_best <-
workflow_results_res %>%
slice_min(rank, with_ties = FALSE) %>%
pull(wflow_id)
workflow_results_res_best <-
wflowmap %>%
extract_workflow_set_result(id = workflow_results_res_id_best) %>%
select_best(metric = "roc_auc")
set.seed(2023)
workflow_results_res_best_fit <-
wflowmap %>%
extract_workflow(workflow_results_res_id_best) %>%
finalize_workflow(workflow_results_res_best) %>%
last_fit(split = split, metrics = classification_metrics)
Explainability
X = bake(prep(smote_rec), new_data = train_model)
X_pred = bake(prep(smote_rec), has_role("predictor"), new_data = train_model, composition = "matrix")
set.seed(2023)
SHAP_VIZ_Explainer <-
shapviz::shapviz(object = workflow_results_res_best_fit %>% extract_fit_engine(), X_pred = X_pred, X = X)
names(SHAP_VIZ_Explainer) <- c("Class1", "Class2", "Class3")
shapviz::sv_importance(object = SHAP_VIZ_Explainer, kind = "beeswarm", max_display = 10, show_numbers = FALSE)
from shapviz.
Does the same happen for a non-SMOTE model?
from shapviz.
Good point. I've tested this with the basic recipe, meaning no SMOTE applied to the training data during preprocessing.
- The prominent pattern of vertical lines (2) is not present anymore.
- The feature with the weird distribution (1) is much less important overall, and no other feature has a similar distribution of SHAP values.
from shapviz.
Hi @mayer79, sorry for the delayed response.
Please find attached the beeswarm plots for both versions (top: SMOTE / bottom: no SMOTE) for all three classes.
There are no missing values in the data and the target variable does indeed have 3 categories.
from shapviz.
Oh wow, thanks - the plots without SMOTE look perfectly fine. So I think we can close the issue? There is no good reason to use SMOTE on tabular data anyway.
Probably, if SMOTE were applied not only to the training data but also to the explanation data, the drift from 0 would vanish as well.
from shapviz.
Just for my better understanding:
- This "effect", which may relate to SMOTE, is only present in one variable.
- Tidymodels used SMOTE only on the training data, but not on the evaluation of the testing data via last_fit(), which resulted in the extracted fitted engine used in shapviz.
- SMOTE is then applied to the training data via the recipe in the X_pred step in shapviz.
What would be the explanation data in this case?
from shapviz.
@viv-analytics : I don't think it is an issue of Tidymodels, rather of your code. The SMOTE step should be fitted once, and then applied 1:1 to the explanation data. In your code, SMOTE is repeatedly fitted: once for the modeling, then for the explanation data X_pred, and even separately for the display data X. In this specific application, there is no reason to define a "display" data X, as it is identical to X_pred.
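A minimal sketch of the "fit once, apply 1:1" idea, reusing the object names from the code above (an assumed rewrite, not tested against the actual data; note that step_smote() has skip = TRUE by default, so it is not applied when baking with new_data):

```r
# Prep (fit) the recipe a single time ...
prepped <- prep(smote_rec)

# ... and bake the explanation data once from that same prepped recipe
X_pred <- bake(prepped, has_role("predictor"),
               new_data = train_model, composition = "matrix")

# No separate display data needed: shapviz() uses X = X_pred by default
shap_obj <- shapviz::shapviz(
  object = extract_fit_engine(workflow_results_res_best_fit),
  X_pred = X_pred
)
```

This way, the same preprocessing is applied consistently, instead of refitting SMOTE for each of the model, X_pred, and X.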
from shapviz.
@mayer79 : Thanks for clarifying this point.
I'm not entirely sure how this could be achieved in a way that shapviz digests correctly. Therefore, I'd appreciate any hint.
Thanks in advance
from shapviz.
My advice for SMOTE: never use it on tabular data. Rather, optimize the right metric, e.g., logloss.
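For instance, instead of rebalancing the classes with SMOTE, the tuning could optimize a proper scoring rule (a sketch, assuming the yardstick package; classification_metrics is the metric set referenced in the code above):

```r
library(yardstick)

# Tune against log loss (a proper scoring rule) rather than
# compensating for class imbalance via SMOTE resampling
classification_metrics <- metric_set(mn_log_loss, roc_auc)
```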
I will close the Issue for now.
from shapviz.