mlr-org / mlr3gallery

Case studies using mlr3

Home Page: https://mlr3gallery.mlr-org.com

CSS 24.33% JavaScript 0.30% R 3.79% TeX 8.72% Dockerfile 0.21% HTML 62.65%

mlr3 case-studies machine-learning preprocessing mlr3pipelines r

mlr3gallery's Issues

add tags to posts on landing page

The landing page no longer shows which topics / tags a post covers.
It would be nice to show them in small, grey italic text under the title
of each post on the landing page.

Tuning usecase uses outdated API

https://mlr3gallery.mlr-org.com/posts/2020-03-11-mlr3tuning-tutorial-german-credit/

TuningInstance is now TuningInstanceSingleCrit, and some arguments have changed as well (term() became trm(), measures became measure, etc.).

The gallery should be checked after each API change, otherwise it's very confusing for people trying to learn mlr3 from these examples.
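For orientation, a minimal sketch of the renamed API. It assumes an mlr3tuning release that still exports TuningInstanceSingleCrit (later releases may have renamed it again); the task, learner, search space and budget are illustrative choices, not taken from the post.

```r
library(mlr3)
library(mlr3tuning)
library(paradox)

# Illustrative setup (assumption): tune rpart's cp on the german_credit task
task = tsk("german_credit")
learner = lrn("classif.rpart")
search_space = ps(cp = p_dbl(lower = 0.001, upper = 0.1))

# Note the singular `measure` argument and the trm() helper
instance = TuningInstanceSingleCrit$new(
  task = task,
  learner = learner,
  resampling = rsmp("holdout"),
  measure = msr("classif.ce"),
  search_space = search_space,
  terminator = trm("evals", n_evals = 5)
)

tuner = tnr("random_search")
tuner$optimize(instance)

n_evals = instance$archive$n_evals
```

After optimization, the evaluated configurations are available in instance$archive.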

Do not attach unneeded packages

Some blog posts start with a series of library() calls. Many of them are not required because their namespace is loaded automatically. Examples are smotefamily, mlr3misc or mlr3data.
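A small sketch of the point, using mlr3misc as the example (the toy list is made up for illustration): attaching mlr3 already loads the namespaces it imports, and anything needed from them can be reached with `::`.

```r
library(mlr3)

# mlr3 imports mlr3misc, so its namespace is loaded once mlr3 is attached
misc_loaded = "mlr3misc" %in% loadedNamespaces()

# Functions from a loaded but unattached package remain reachable via ::
ids = mlr3misc::map_chr(list(list(id = "a"), list(id = "b")), "id")
```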

Broken links

The URLs of the posts have changed, which resulted in broken links, e.g. https://mlr3gallery.mlr-org.com/posts/2020-03-11-mlr3pipelines-tutorial-german-credit/ linking to part I/II.

Use downlit

Not sure what would be required to make this work. Should also be used in the book.

https://github.com/r-lib/downlit

Revert dependencies back to CRAN versions once packages are stable again

Ensure compat with rmarkdown 2.11

WIP in branch rmarkdown.

Blocked by #116

Add link to github repository in header/navbar (nt)

Do not use hard links, instead use the helper functions from mlr3book

The ref() function is more robust when it comes to renames (works with aliases, throws an error instead of creating broken links).

Why PipeOpTuneThreshold reduce the classif.ce?

In the post about TuneThreshold, classif.ce was used as the tunethreshold.measure. Why, then, is the classif.ce in the benchmark result higher for classif.rpart.tunethreshold than for the no_tuning model?

gallery deployment setup suboptimal

OK, it's kind of two things:

It is a bit unclear for a new or edited post which files should be pushed where.
I don't understand why we are required to push so many files in general.

This is from a current edit, where I simply fixed some typos:

    modified:   _posts/2020-03-18-iris-mlr3-basics/2020-03-18-iris-mlr3-basics.Rmd
    modified:   _posts/2020-03-18-iris-mlr3-basics/2020-03-18-iris-mlr3-basics.html
    modified:   docs/index.html
    modified:   docs/index.xml
    modified:   docs/posts/2020-03-18-iris-mlr3-basics/index.html
    modified:   docs/posts/posts.json
    modified:   docs/sitemap.xml

Analysis:

I am touching 7 (!) files. Shouldn't I push just one, the Rmd?
The HTML is now included in there twice? I am not even sure if that's our current setup.
IMHO it should be in the PR zero times. Even if we for some reason require the user to push the HTML, it should be documented in the README where (the README mentions docs, but not whether the HTML should also be in _posts).
I am touching global files. That's an invitation to "merge conflict hell", especially as we are not merging PRs super fast here.

--->

Why don't we have a normal "deploy setup" with Travis? The user pushes the Rmd only and Travis builds the rest. I guess Travis then needs to self-push the compiled HTML under "docs".

Impute Missing Variables article: `mlr_pipeops_missind` seems to accept only one other imputation operator at a time

While running the code for the classification example "Impute Missing Variables", I added a third imputation operator, namely mlr_pipeops_imputesample.

My graph operator - before adding the learner - looks like this:

graph = po('copy', length(po_list)) %>>% gunion(po_list) %>>% po('featureunion')

where the pipe operators list is

po_list <- list(
  imp_missing <- po('missind'),
  imp_num <- po('imputehist', param_vals = list(affect_columns = selector_type('numeric'))),
  imp_samp <- po('imputesample', param_vals = list(affect_columns = selector_missing()))
)

and the extracted indices are:

ids <- map_chr(po_list, `[[`, 'id')

[1] "missind"      "imputehist"   "imputesample"

as suggested in the article.

As a short reminder, the dataset used in this article covers diabetes cases among Pima Indians:

> task$data()
     diabetes age glucose insulin mass pedigree pregnant pressure triceps
  1:      pos  50     148      NA 33.6    0.627        6       72      35
  2:      neg  31      85      NA 26.6    0.351        1       66      29
  3:      pos  32     183      NA 23.3    0.672        8       64      NA
  4:      neg  21      89      94 28.1    0.167        1       66      23
  5:      pos  33     137     168 43.1    2.288        0       40      35

The data has missing values in some of the variables

> task$missings()

diabetes      age  glucose  insulin     mass pedigree pregnant pressure  triceps 
       0        0        5      374       11        0        0       35      227

When training the graph with all three operators, I get the following error:

> graph$train(task)

 Error in task_data(self, rows, cols, data_format, ordered) : 
  Assertion on 'cols' failed: Must be a subset of {'diabetes','missing_glucose','missing_insulin','missing_mass','missing_pressure','missing_triceps'}, but is {'age'}.

while the graph plot shows all three operators

Selecting only one operator at a time to pair with missind, for example imputehist, imputes the missing data just as in the example:

ids <- map_chr(po_list, `[[`, 'id')

[1] "missind"      "imputehist"

> graph$train(task)[[1]]$data()

   diabetes missing_glucose missing_insulin missing_mass missing_pressure
  1:      pos         present         missing      present          present
  2:      neg         present         missing      present          present
  3:      pos         present         missing      present          present
  4:      neg         present         present      present          present
  5:      pos         present         present      present          present
    missing_triceps age pedigree pregnant glucose   insulin mass pressure  triceps
  1:         present  50    0.627        6     148 118.83233 33.6       72 35.00000
  2:         present  31    0.351        1      85 212.17043 26.6       66 29.00000
  3:         missing  32    0.672        8     183  29.31409 23.3       64 10.59325
  4:         present  21    0.167        1      89  94.00000 28.1       66 23.00000
  5:         present  33    2.288        0     137 168.00000 43.1       40 35.00000

and the graph is plotted correctly

Pairing missind with imputesample also works:

> ids
[1] "missind"      "imputesample"

> graph$train(task)[[1]]$data()

     diabetes missing_glucose missing_insulin missing_mass missing_pressure
  1:      pos         present         missing      present          present
  2:      neg         present         missing      present          present
  3:      pos         present         missing      present          present
  4:      neg         present         present      present          present
  5:      pos         present         present      present          present
       missing_triceps age pedigree pregnant glucose insulin mass pressure triceps
  1:         present  50    0.627        6     148     116 33.6       72      35
  2:         present  31    0.351        1      85     168 26.6       66      29
  3:         missing  32    0.672        8     183     108 23.3       64       7
  4:         present  21    0.167        1      89      94 28.1       66      23
  5:         present  33    2.288        0     137     168 43.1       40      35

as well as the plot:

Removing missind from the list throws the same error as before.

While using missind, I tried different selectors for imputesample, such as selector_type, selector_name (naming only variables with missing data) and selector_grep (likewise), to no avail. The only selector that worked was selector_missing().

What is necessary to make more than one imputation operator work simultaneously with missind? Please advise!
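One arrangement that may sidestep the problem, offered as an unconfirmed sketch rather than an answer from the maintainers: keep missind in its own branch and chain the imputers one after another in the second branch, so that each original feature reaches featureunion only once. It assumes the built-in pima task and default settings for both imputers.

```r
library(mlr3)
library(mlr3pipelines)

task = tsk("pima")

# Branch 1 emits only the missing-indicator columns.
# Branch 2 chains both imputers, so every original feature is emitted once.
graph = po("copy", 2) %>>%
  gunion(list(
    po("missind"),
    po("imputehist") %>>% po("imputesample")
  )) %>>%
  po("featureunion")

imputed = graph$train(task)[[1]]
remaining_missing = sum(imputed$missings())
```

In this chain imputesample only has work to do for columns that imputehist leaves untouched (for example factors); on the all-numeric pima data it is effectively a no-op.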

Deployment: Install packages from yaml frontmatter

Each document has a YAML front-matter that lists the needed packages. These should be automatically installed by whatever deploy method we are using.
It seems redundant to put these in the fictitious DESCRIPTION file.

Render large data tables as HTML

There are packages for this, we should use them for larger tables.

check and fix posts using german_credit task

related to mlr-org/mlr3#514

affects the following posts:

2020-03-11-basics-german-credit (runs fine because no specific use of the internal structure of variables is made here)
2020-03-11-mlr3pipelines-tutorial-german-credit (should run fine but is no longer very sensible because now there are only 3 integer features being used and the filter is using 3 features, therefore needs some minor changes)
2020-03-11-mlr3tuning-tutorial-german-credit (runs fine but uses data from here https://github.com/mlr-org/mlr-outreach/raw/master/2019_madrid/mlr3tuning/randomsearch_3600.rds which uses the old german_credit; not sure whether other outreach files need to be changed/looked at as well)
2020-03-30-stratification-blocking (runs fine, no changes needed)
2020-01-31-encode-factors-for-xgboost (needs minor wording changes because there is now more than one ordered factor and unemployment_duration is now a factor)

Pro tips: mlr3 & R6

Illustration of basic components of mlr3 and how they are composed.
E.g. Tasks, measures, resampling and syntax
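A minimal sketch of how these components fit together, as a possible starting point for such a post. The iris task, the rpart learner and 3-fold cross-validation are illustrative assumptions, not part of the proposal.

```r
library(mlr3)

# Each building block is an R6 object created through a short constructor
task = tsk("iris")
learner = lrn("classif.rpart")
resampling = rsmp("cv", folds = 3)
measure = msr("classif.ce")

# resample() composes task, learner and resampling;
# the measure is applied to the result afterwards
rr = resample(task, learner, resampling)
score = rr$aggregate(measure)
```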

Add option to NOT render a post, but instead have the user upload rendered html

A frequently occurring situation is that we want to use datasets from Kaggle,
but we cannot easily access them during CI because getting them requires logging in to Kaggle and accepting competition rules.
In this case, we might want to allow people to simply upload the full, rendered HTML (and state in the text how to get the data).

Could this be solved via an additional flag in the post's YAML header?

This also holds for other cases, where building the post would require a lot of additional setup / overhead (e.g. running something on a GPU which is not available on CI).

Remove title prefix ("mlr3 Use Cases: ")

"Practical Tuning Series - Build an Automated Machine Learning System" errors

https://github.com/mlr-org/mlr3gallery/runs/4059593765?check_suite_focus=true#step:16:9389

kknn.k is actually defined but reported as missing during execution.

@be-marc Do you have an idea what might be going on here?

"Bikesharing demand" post fails

Looks like due to the new pipelines update.

@be-marc Could you have a look?

https://github.com/mlr-org/mlr3gallery/runs/4246980745?check_suite_focus=true

Also is there some way you could watch the CI status without me having to open issues for failing builds? :)

Unify titles

We should try to find a unified style for the titles of the posts. They look a bit chaotic at the moment.

House Prices in King County Example

I have two comments regarding this example:

Under the section titled "Engineering Features: Mutate Zip Codes" the house price is being regressed against its own median value in each zip code, obviously resulting in a lower RMSE. This was to be expected as the price and its group median are correlated while the zipcode remains present among features (regressors).

I would have thought that the median price becomes the new target in the task_train and task_test while the house price is removed from features list and the impact-coded zipcode replaces the high-cardinality zipcode factor.

Q: Would it be a huge mistake if the median price and/or the impact-coding of the zipcode were carried out in the kc_housing dataset instead of at task level? Information leak occurs anyway.

I think this comment relates to the Task$new() instance in general. The object

task_train$data() is of the data.table class.

Yet, the following, short and perfectly legit operation does not preserve the med_price variable:

task_train$data()[, med_price := median(price), by = 'zipcode']

and instead, the contorted workaround presented in the example is needed.

Q: What invalidates the above operation in mlr3?

Please advise, thank you!
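On the second question, a short sketch of the behaviour as I understand it: $data() returns a new data.table fetched from the task's backend, so := changes only that copy, while Task$cbind() appends columns to the task itself. The mtcars task and the med_mpg column are stand-ins chosen for illustration, and the sketch does not address the information-leak concern.

```r
library(mlr3)
library(data.table)

task = tsk("mtcars")

# $data() hands back a fresh data.table, so := only changes that copy
dt = task$data()
dt[, med_mpg := median(mpg), by = "cyl"]
in_task_before = "med_mpg" %in% task$feature_names

# Task$cbind() appends the new column to the task; rows are matched by
# position because no row-id column is supplied
task$cbind(dt[, .(med_mpg)])
in_task_after = "med_mpg" %in% task$feature_names
```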

cannot run your code from your article

I copied your code from the article "Practical Tuning Series - Build an Automated Machine Learning System".
When I run this line: graph_learner$param_set$values$branch.selection = to_tune(c("kknn", "svm", "ranger")),
I get this error:
Error in self$assert(xs) : Assertion on 'xs' failed: branch.selection: tune token invalid: to_tune(c("kknn", "svm", "ranger")) generates points that are not compatible with param selection. Bad value: [1] "kknn" Parameter: id class lower upper levels default 1: selection ParamInt 1 3 <NoDefault[3]>.
How should I correct it?
Thank you.
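Judging only from the error message, selection is an integer parameter with range 1 to 3, which is what a branch operator built from a count (rather than from named options) would produce. The sketch below contrasts the two constructions; that the reporter's graph was built this way is an assumption.

```r
library(mlr3pipelines)
library(paradox)

# Named options: selection becomes a categorical parameter
branch_named = po("branch", options = c("kknn", "svm", "ranger"))
class_named = branch_named$param_set$class[["selection"]]

# A bare count: selection becomes an integer parameter with range 1..3
branch_count = po("branch", options = 3)
class_count = branch_count$param_set$class[["selection"]]

# With named options the character tune token is accepted
branch_named$param_set$values$selection = to_tune(c("kknn", "svm", "ranger"))
```

Inside a graph learner the same parameter appears with the operator id as prefix, i.e. branch.selection.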

Publish on r-bloggers

We should cherry-pick some posts from the gallery and publish them on r-bloggers via the mlr3 blog (maybe after polishing them a bit). Additionally, we should try to get the gallery into r-bloggers and r-weekly so that new posts are automatically published there.

@berndbischl your 2 cents?