modeloriented / xspliner


Explain black box with GLM

Home Page: https://ModelOriented.github.io/xspliner/

R 100.00%

xspliner's Issues

Choose global statistics for summary comparison with bare blackbox

Which parameters of the pdp::partial function should be available?

What graphics should be available for the solution?

The plots should be implemented as S3 plot methods.

Ideas based on case:

factorMerger graphics (when used on variable)

For quantitative:

data points
pdp (ale)
pdp (ale) approximation (when used on variable)
pdp (ale) derivative on separate axis

For qualitative

Factor Merger

For xspliner
Comparison on

probs, responses (heatmaps?)

Extract useful functions from previous code version.

Temporarily move them into deprecated.R and do not export them. Once adapted to the new library version, they will be moved into the current scripts.

Approx as monotonic spline.
Approx with partially constant function.
factorMerger approx

Add factorMerger for xf options

Based on #6

Prepare short gh graphics with instructions

Rewrite old examples using current functions

Specify which variables should not be transformed.

Some variables, mainly integer ones, have few unique values, so GAM cannot approximate them with splines. In this case we get an error and the algorithm stops. It would be great to be able to specify which variables should not be transformed.
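
One way to find such variables up front is to count distinct values per numeric column. The sketch below is only an illustration in base R: the helper name few_unique_vars and the threshold min_unique = 10 are assumptions, not part of xspliner.

```r
# Hypothetical helper (not part of xspliner): names of numeric columns
# that have fewer than `min_unique` distinct values.
few_unique_vars <- function(data, min_unique = 10) {
  is_num <- vapply(data, is.numeric, logical(1))
  n_unique <- vapply(data[is_num], function(col) length(unique(col)), integer(1))
  names(n_unique)[n_unique < min_unique]
}

df <- data.frame(
  x = seq(0, 1, length.out = 50),             # 50 distinct values
  cyl = rep(c(4L, 6L, 8L), length.out = 50),  # 3 distinct values
  g = factor(rep(c("a", "b"), 25))            # not numeric, ignored
)
few_unique_vars(df)
```

Columns reported this way could then be left as raw terms in the formula instead of being wrapped in xs.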

Shouldn't I use glm instead of gam in final model build?

I think I should. There is no good reason to use mgcv::gam in the final model when I don't use splines there.
One drawback, though: with gam it is easy to compare two gam models.

Maybe I should add a parameter for choosing one?

DALEX integrations

Integrate code with DALEX (result of explain function)

Remove previous version code

Add MI2 pkgdown

Basic fixes

add authors (Przemysław Biecek)
lower package version
rename functions
complete NEWS file

Use link function only when passed

By default it should be NULL, and the link should be extracted from the family parameter.
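
A minimal sketch of that default, assuming a hypothetical helper named resolve_link (not an xspliner function). It accepts the family as a name, a generator function, or a family object, the same three forms glm accepts.

```r
# Hypothetical helper: use `link` when given, otherwise take it from `family`.
resolve_link <- function(link = NULL, family = gaussian()) {
  if (is.character(family)) family <- get(family, mode = "function")
  if (is.function(family)) family <- family()
  if (is.null(link)) family$link else link
}

resolve_link(family = binomial())                   # link taken from the family
resolve_link(family = "poisson")                    # family given by name
resolve_link(link = "probit", family = binomial())  # explicit link wins
```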

If xs or xf is specified with some params, use them instead of the default ones.

Include estimated coefficients in plot

Currently only the bare transformation is plotted, which makes interpretation inaccurate at the scale level.
It would be great to add a flag for this to the plot method.

When the transition is chosen automatically, display a message about it.

Add ale plot for xs options

Based on #6

Automatic decision if xs or xf should be used or raw variable

It was implemented in the previous version.
The idea is to compare the performance of:
lm(y ~ pred_var) and lm(y ~ approx(pred_var)) and choose better option.

Important:
Make decision rule general (passed as parameter?).

NOTE:
How to do it without recalculating response approximation?

Idea:
It can be a parameter passed to xs and xf, for example choose = "automatic", that performs the decision in the backend.

Within approx_with_splines, data needs to be renamed in order to recalculate the gam. Allow keeping the original predictor names.
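
The comparison with a decision rule passed as a parameter could look roughly like the sketch below. The function name choose_transformation, the use of R-squared as the score, and the stand-in approx_fun (here a plain square instead of a spline fitted to the response approximation) are all assumptions made for illustration.

```r
# Hypothetical sketch: compare lm(y ~ x) with lm(y ~ approx(x)) and let a
# user-supplied rule decide. `rule` receives both R-squared values.
choose_transformation <- function(y, x, approx_fun,
                                  rule = function(raw, transformed) transformed > raw) {
  r2 <- function(fit) summary(fit)$r.squared
  raw_fit <- lm(y ~ x)
  x_approx <- approx_fun(x)
  approx_fit <- lm(y ~ x_approx)
  if (rule(r2(raw_fit), r2(approx_fit))) "transformed" else "raw"
}

set.seed(1)
x <- runif(200, -2, 2)
y <- x^2 + rnorm(200, sd = 0.1)
choose_transformation(y, x, approx_fun = function(v) v^2)
```

A stricter rule, for example requiring a minimum gain in R-squared, can be passed through rule without touching the comparison itself.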

Add variables extracting from blackbox

When we use the formula y ~ . in xspliner, the full formula is built automatically from all variables in data. But the black box could have been built on a smaller set of variables.
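
For black boxes that expose a terms method (lm, glm and other formula-based models), the predictor set can be read from the model itself. The helper name model_predictors is an assumption, and models without a terms method would need another route.

```r
# Hypothetical helper: predictor names actually used by a fitted model.
# Works only for models that provide a terms() method.
model_predictors <- function(model) {
  all.vars(delete.response(terms(model)))
}

bb <- lm(mpg ~ wt + hp, data = mtcars)
model_predictors(bb)

# Expand "y ~ ." using only the variables the black box was built on
surrogate_formula <- reformulate(model_predictors(bb), response = "mpg")
surrogate_formula
```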

Define S3 summary method

Existing one is based on mgcv::gam. Should it be extended?
What can be added?

Ideas:

performance comparison with bare gam and bare blackbox?

Add unittests

Create pred function for ALE method to return 'link' every time
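
A sketch of such a prediction function for a glm black box. The two-argument form (X.model, newdata) follows what the ALEPlot package describes for its pred.fun argument. The name pred_link is an assumption, and type = "link" is specific to glm-like models, so other black boxes would need their own wrapper.

```r
# Hypothetical prediction wrapper: always return predictions on the link scale.
pred_link <- function(X.model, newdata) {
  as.numeric(predict(X.model, newdata = newdata, type = "link"))
}

fit <- glm(am ~ wt, data = mtcars, family = binomial())
link_pred <- pred_link(fit, mtcars)

# Sanity check: the inverse logit of the link-scale output equals the response scale
resp_pred <- predict(fit, newdata = mtcars, type = "response")
all.equal(plogis(link_pred), unname(resp_pred))
```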

Should xs and xf be one function?

It could be based on variable type.

Add cheatsheets

Rewrite final code to S3

Add general way for passing xs and xf approximations

Currently it is not general. Only pdp can be used for type = "pdp".
How to extend this?

Include only formula variables in its Environment

Description: During formula preprocessing the parent.frame environment is used, so it can become huge (e.g. a huge global env).
Let's use just the variables passed as parameters in the raw formula.

Idea how to solve: use all.vars function to get the names.
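
A sketch of that idea in base R. The function name slim_formula_env is an assumption. The new environment's parent is set to the search path below the global environment, so functions from attached packages still resolve while unrelated objects are left behind; that choice is also an assumption.

```r
# Hypothetical helper: rebuild the formula environment from only the
# variables the formula mentions.
slim_formula_env <- function(formula, env = environment(formula)) {
  var_names <- all.vars(formula)
  found <- mget(var_names, envir = env, inherits = TRUE, ifnotfound = list(NULL))
  found <- Filter(Negate(is.null), found)  # drop names that were not found
  environment(formula) <- list2env(found, parent = parent.env(globalenv()))
  formula
}

make_formula <- function() {
  big <- numeric(1e6)  # unrelated object that would otherwise be captured
  k <- 5
  x <- 1:10
  y <- 1:10
  y ~ xs(x, spline_opts = list(k = k))
}

f <- slim_formula_env(make_formula())
ls(environment(f))
```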

Add plotting transition comparing for many models

If models are passed to compare_with, add an option to plot the pdp-s for them as well on one plot.

Add print and summary methods

Allow automatic transformation for raw variables formula

When passing y ~ x + z + t formula, it could be useful to automatically consider spline transformation for each of them.

For this case, a common method and spline parameters need to be defined.

Add a type = c("classification", "regression") parameter, and also a quantitatives parameter that specifies which variables should be used with xf.

Add stats for models comparison

Consider "pseudo r-squared" https://christophm.github.io/interpretable-ml-book/global.html#theory-4
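
The statistic measures how much of the variance of the black box predictions the surrogate reproduces: 1 - SSE / SST, where SSE compares surrogate and black box predictions and SST is the spread of the black box predictions around their mean. A minimal sketch, with the function name surrogate_r2 being an assumption:

```r
# Hypothetical helper: R-squared of surrogate predictions measured against
# black box predictions (not against the observed response).
surrogate_r2 <- function(blackbox_pred, surrogate_pred) {
  sse <- sum((surrogate_pred - blackbox_pred)^2)
  sst <- sum((blackbox_pred - mean(blackbox_pred))^2)
  1 - sse / sst
}

bb_pred <- c(1, 2, 3, 4)
surrogate_r2(bb_pred, bb_pred)                # perfect surrogate
surrogate_r2(bb_pred, rep(mean(bb_pred), 4))  # surrogate that only predicts the mean
```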

When calling xp_gam xs and xf functions are passed for global environment

Make package robust for mistakes

Current ideas:

usage of xs on factor

mgcv::gam uses global environment for model object

As a result, prediction uses the global environment instead of the formula environment.

Package crashes on duplicated variables

See:

formula <- log(y) ~
    xs(x, method_opts = list(type = type)) * z + xf(t) + w ^ 2 + I(z ^ 2)
extract_formula_var_names(formula, data) # gives c("y", "x", "z", "t", "w")
get_formula_raw_components(formula_terms) # gives
#   c("xs(x, method_opts = list(type = type))", "z", "xf(t)", "w", "I(z^2)")

Find out what logic should be implemented here to match the z variable.

Monotonic spline approximation doesn't always work

In some cases the result is not monotonic (it may depend on grid.resolution). Check the Boston data and the rm variable.
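
A grid-based check can help reproduce such cases. The sketch below uses base R interpolating splines as a stand-in (stats::splinefun with the "natural" and "hyman" methods), not the approximation used in the package. The helper name is_monotonic and the toy data are assumptions.

```r
# Hypothetical helper: is f non-decreasing or non-increasing on a fine grid?
is_monotonic <- function(f, lower, upper, n = 1000, tol = 1e-8) {
  grid <- seq(lower, upper, length.out = n)
  steps <- diff(f(grid))
  all(steps >= -tol) || all(steps <= tol)
}

# Increasing, step-like data: an unconstrained cubic spline oscillates near the jump
x <- c(0, 1, 2, 3, 4, 5)
y <- c(0, 0.1, 0.2, 3, 3.1, 3.2)
plain <- splinefun(x, y, method = "natural")
mono <- splinefun(x, y, method = "hyman")  # monotone interpolation of monotone data

is_monotonic(plain, 0, 5)
is_monotonic(mono, 0, 5)
```

Running the same check over several values of n shows whether a reported violation depends on the grid resolution.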

Make functionality work without passing data (raw variables from env)

The functionality is based on iterating over all variables that should possibly be found in data. If you pass no data, it is not determined which variables are "data sourced" and which are parameters. Find a way to distinguish data from parameters.

See:

x <- rnorm(10)
y <- rnorm(10)
oko <- 10
get_formula_details(y ~ xs(x, spline_opts = list(k = oko)))

Idea? Assume that data variables are vectors of the same length. This doesn't cover all cases, but it covers a large part of them.
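
A sketch of that heuristic: treat every variable whose length equals the length of the response as data, and everything else as a parameter. The function name split_data_and_params is an assumption. It expects a two-sided formula whose first variable is the response, and it misclassifies parameters when the data has length one.

```r
# Hypothetical helper: split formula variables into data and parameters
# by comparing their lengths with the length of the response.
split_data_and_params <- function(formula, env = environment(formula)) {
  var_names <- all.vars(formula)
  vars <- mget(var_names, envir = env, inherits = TRUE)
  n <- length(vars[[var_names[1]]])  # first variable is taken as the response
  is_data <- vapply(vars, function(v) is.atomic(v) && length(v) == n, logical(1))
  list(data = names(vars)[is_data], params = names(vars)[!is_data])
}

x <- rnorm(10)
y <- rnorm(10)
oko <- 10
split_data_and_params(y ~ xs(x, spline_opts = list(k = oko)))
```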

Add vignettes or Rmd instructions

Add link function parameter.

It's actually quite simple using the pdp package.
When the link is applied to a continuous response variable directly in the formula, we just use the response (transformed with the link) in pdp and for spline estimation.

When we use the family parameter (as in glm), pdp returns output transformed with the link. We then use that output to estimate the spline.

Finally, we pass the raw formula link or family to the final model (it probably shouldn't be mgcv::gam; glm is enough).