mlr3gallery's People

Contributors

be-marc, berndbischl, danielsaggau, github-actions[bot], giuseppec, henrifnk, jakob-r, mllg, pat-s, pfistfl, pkopper, sumny, web-flow

mlr3gallery's Issues

add tags to posts on landing page

We have more or less lost the information about which topics / tags a post covers.
It would be good to show them in small, grey italics under the title
on the landing page.

Do not attach unneeded packages

Some blog posts start with a series of library() calls. Many of them are not required because their namespace is loaded automatically. Examples are smotefamily, mlr3misc or mlr3data.
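
The difference between a loaded namespace and an attached package can be checked directly. A minimal sketch, using the base package tools as a stand-in for packages such as mlr3misc (the stand-in is an assumption, chosen only so that the snippet runs without extra installs):

```r
# `tools` ships with R and is normally not attached in a fresh session.
attached_before = "package:tools" %in% search()

# Calling a function via `::` loads the namespace on demand ...
ext = tools::file_ext("2020-03-18-iris-mlr3-basics.Rmd")

# ... but it does not change what is on the search path.
loaded = isNamespaceLoaded("tools")
attached_after = "package:tools" %in% search()
```

The same reasoning would apply to a post that attaches mlr3 and only reaches the other packages through it or through `pkg::fun()` calls.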

gallery deployment setup suboptimal

OK, it's kind of two things:

  1. it is a bit unclear for a new or edited post which files should be pushed where

  2. I don't understand why we require to push so many files in general

this is from a current edit, where I simply fixed some typos

    modified:   _posts/2020-03-18-iris-mlr3-basics/2020-03-18-iris-mlr3-basics.Rmd
    modified:   _posts/2020-03-18-iris-mlr3-basics/2020-03-18-iris-mlr3-basics.html
    modified:   docs/index.html
    modified:   docs/index.xml
    modified:   docs/posts/2020-03-18-iris-mlr3-basics/index.html
    modified:   docs/posts/posts.json
    modified:   docs/sitemap.xml

Analysis:

  • I am touching 7 (!) files, although I should only have to push one: the Rmd.
  • The HTML now appears to be included twice? I am not even sure whether that is our current setup.
    IMHO it should be in the PR zero times. Even if we for some reason require the user to push the HTML, the README should document where. (The README mentions docs, but not whether the HTML should also be in _posts.)
  • I am touching global files. That's an invitation to "merge conflict hell", especially as we are not merging PRs super fast here.

--->

Why don't we have a normal "deploy setup" with Travis? The user pushes only the Rmd, and Travis builds the rest. I guess Travis then needs to push the compiled HTML itself under "docs".

Impute Missing Variables article: `mlr_pipeops_missind` seems to accept only one other imputation operator at a time

While running the code for the classification example "Impute Missing Variables", I added a third imputation operator, namely mlr_pipeops_imputesample.

My graph operator - before adding the learner - looks like this:

graph = po('copy', length(po_list)) %>>% gunion(po_list) %>>% po('featureunion')

where the pipe operators list is

po_list <- list(
  imp_missing <- po('missind'),
  imp_num <- po('imputehist', param_vals = list(affect_columns = selector_type('numeric'))),
  imp_samp <- po('imputesample', param_vals = list(affect_columns = selector_missing()))
)

and the extracted ids are:

ids <- map_chr(po_list, `[[`, 'id')

[1] "missind"      "imputehist"   "imputesample"

as suggested in the article.

As a short reminder, the dataset used in this article contains diabetes cases among Pima Indians:

> task$data()
     diabetes age glucose insulin mass pedigree pregnant pressure triceps
  1:      pos  50     148      NA 33.6    0.627        6       72      35
  2:      neg  31      85      NA 26.6    0.351        1       66      29
  3:      pos  32     183      NA 23.3    0.672        8       64      NA
  4:      neg  21      89      94 28.1    0.167        1       66      23
  5:      pos  33     137     168 43.1    2.288        0       40      35

The data has missing values in some of the variables:

> task$missings()

diabetes      age  glucose  insulin     mass pedigree pregnant pressure  triceps 
       0        0        5      374       11        0        0       35      227 

When training the graph with all three operators, I get the following error:

> graph$train(task)

 Error in task_data(self, rows, cols, data_format, ordered) : 
  Assertion on 'cols' failed: Must be a subset of {'diabetes','missing_glucose','missing_insulin','missing_mass','missing_pressure','missing_triceps'}, but is {'age'}. 

while the graph plot shows all three operators

[image: plot of the graph]

Selecting only one operator at a time to pair with missind, for example imputehist, imputes the missing data just as in the example:

ids <- map_chr(po_list, `[[`, 'id')

[1] "missind"      "imputehist"

> graph$train(task)[[1]]$data()

   diabetes missing_glucose missing_insulin missing_mass missing_pressure
  1:      pos         present         missing      present          present
  2:      neg         present         missing      present          present
  3:      pos         present         missing      present          present
  4:      neg         present         present      present          present
  5:      pos         present         present      present          present
    missing_triceps age pedigree pregnant glucose   insulin mass pressure  triceps
  1:         present  50    0.627        6     148 118.83233 33.6       72 35.00000
  2:         present  31    0.351        1      85 212.17043 26.6       66 29.00000
  3:         missing  32    0.672        8     183  29.31409 23.3       64 10.59325
  4:         present  21    0.167        1      89  94.00000 28.1       66 23.00000
  5:         present  33    2.288        0     137 168.00000 43.1       40 35.00000

and the graph is plotted correctly

[image: plot of the graph]

Pairing missind with imputesample also works:

> ids
[1] "missind"      "imputesample"

> graph$train(task)[[1]]$data()

     diabetes missing_glucose missing_insulin missing_mass missing_pressure
  1:      pos         present         missing      present          present
  2:      neg         present         missing      present          present
  3:      pos         present         missing      present          present
  4:      neg         present         present      present          present
  5:      pos         present         present      present          present
       missing_triceps age pedigree pregnant glucose insulin mass pressure triceps
  1:         present  50    0.627        6     148     116 33.6       72      35
  2:         present  31    0.351        1      85     168 26.6       66      29
  3:         missing  32    0.672        8     183     108 23.3       64       7
  4:         present  21    0.167        1      89      94 28.1       66      23
  5:         present  33    2.288        0     137     168 43.1       40      35

as well as the plot:

[image: plot of the graph]

Removing missind from the list throws the same error as before.

While using missind, I tried different selectors for imputesample such as selector_type, selector_name (naming only variables with missing data), selector_grep (idem), to no avail. The only selector that worked was selector_missing().

What is necessary to have more than one imputation operator work simultaneously with missind? Please advise!
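
One layout that may be worth trying is to keep missind in its own branch and chain the imputers one after the other in a second branch, so that only one branch outputs the original features. This is a sketch under that assumption, not a confirmed diagnosis of the error above; it uses the pima task shipped with mlr3 and the default settings of imputesample:

```r
library(mlr3)
library(mlr3pipelines)

task = tsk("pima")

# Branch 2: the imputers run sequentially; imputesample handles whatever
# imputehist left untouched.
imputers = po("imputehist", affect_columns = selector_type("numeric")) %>>%
  po("imputesample")

# Branch 1 (missind) contributes only the indicator columns.
graph = po("copy", outnum = 2) %>>%
  gunion(list(po("missind"), imputers)) %>>%
  po("featureunion")

imputed = graph$train(task)[[1]]
imputed$missings()
```

With this layout the feature sets of the two branches do not overlap, so featureunion only has to join the indicator columns to the imputed features.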

Deployment: Install packages from yaml frontmatter

Each document has a YAML front-matter that lists the needed packages. These should be automatically installed by whatever deploy method we are using.
It seems redundant to put these in the fictitious DESCRIPTION file.
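
A minimal sketch of how a deploy script might collect the packages from a post, assuming the front matter lists them under a field called `packages` (the field name and the example post are hypothetical):

```r
library(rmarkdown)

# Write a tiny post with a hypothetical `packages` field in its front matter.
post = tempfile(fileext = ".Rmd")
writeLines(c(
  "---",
  "title: Example post",
  "packages:",
  "  - mlr3",
  "  - mlr3pipelines",
  "---",
  "",
  "Post body."
), post)

# Read the front matter and work out which packages are not installed yet.
required = yaml_front_matter(post)$packages
to_install = setdiff(required, rownames(installed.packages()))

# A deploy script could then run install.packages(to_install)
# whenever length(to_install) > 0.
```

Looping this over all files in _posts would give the full set of packages without a separate DESCRIPTION file.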

check and fix posts using german_credit task

related to mlr-org/mlr3#514

affects the following posts:

  • 2020-03-11-basics-german-credit (runs fine because no specific use of the internal structure of variables is made here)

  • 2020-03-11-mlr3pipelines-tutorial-german-credit (should run fine but is no longer very sensible because now there are only 3 integer features being used and the filter is using 3 features, therefore needs some minor changes)

  • 2020-03-11-mlr3tuning-tutorial-german-credit (runs fine but uses data from here https://github.com/mlr-org/mlr-outreach/raw/master/2019_madrid/mlr3tuning/randomsearch_3600.rds which uses the old german_credit; not sure whether other outreach files need to be changed/looked at as well)

  • 2020-03-30-stratification-blocking (runs fine, no changes needed)

  • 2020-01-31-encode-factors-for-xgboost (needs minor wording changes because there is now more than one ordered factor and unemployment_duration is now a factor)

Pro tips: mlr3 & R6

Illustration of basic components of mlr3 and how they are composed.
E.g. Tasks, measures, resampling and syntax

Add option to NOT render a post, but instead have the user upload rendered html

A frequently occurring situation is that we want to use datasets from Kaggle,
but we can not easily access them during CI because getting them requires logging in to Kaggle and accepting competition rules.
In this case, we might want to allow people to simply upload the full, rendered HTML (and state in the text how to get the data).

Could this be solved via an additional flag in the post's YAML header?

This also holds for other cases, where building the post would require a lot of additional setup / overhead (e.g. running something on a GPU which is not available on CI).
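
A minimal sketch of such a flag, assuming it is called `render` and defaults to rendering when absent (the flag name, the helper `should_render()` and the example posts are hypothetical):

```r
library(rmarkdown)

# TRUE unless the front matter sets the hypothetical flag `render: false`.
should_render = function(path) {
  flag = yaml_front_matter(path)$render
  is.null(flag) || isTRUE(flag)
}

# Two throwaway posts: one pre-rendered (e.g. Kaggle data), one regular.
prerendered = tempfile(fileext = ".Rmd")
writeLines(c("---", "title: Kaggle post", "render: false", "---", "Body."), prerendered)

regular = tempfile(fileext = ".Rmd")
writeLines(c("---", "title: Regular post", "---", "Body."), regular)

should_render(prerendered)
should_render(regular)
```

The CI job would then skip posts for which `should_render()` is FALSE and publish the uploaded HTML as is.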

Unify titles

We should try to find a unified style for the titles of the posts. They look a bit chaotic ATM.

House Prices in King County Example

I have two comments regarding this example:

  1. Under the section titled "Engineering Features: Mutate Zip Codes" the house price is being regressed against its own median value in each zip code, obviously resulting in a lower RMSE. This was to be expected as the price and its group median are correlated while the zipcode remains present among features (regressors).

I would have thought that the median price becomes the new target in the task_train and task_test while the house price is removed from features list and the impact-coded zipcode replaces the high-cardinality zipcode factor.

Q: Would it be a huge mistake if the median price and/or the impact-coding of the zipcode were carried out in the kc_housing dataset instead of at the task level? Information leak occurs anyway.

  2. I think this comment relates to the Task$new() instance in general. The object

task_train$data() is of the data.table class.

Yet, the following, short and perfectly legit operation does not preserve the med_price variable:

task_train$data()[, med_price := median(price), by = 'zipcode']

and instead, the contorted workaround presented in the example is needed.

Q: What invalidates the above operation in mlr3?

Please advise, thank you!
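
For the second question, the behaviour can be reproduced on a small task: $data() returns a copy of the data held by the backend, so := only modifies that copy. A minimal sketch with made-up toy data (the `houses` table is an assumption for illustration, and the information-leak concern from the first question is not addressed here):

```r
library(mlr3)
library(data.table)

# Toy stand-in for the housing data.
houses = data.table(
  price = c(100, 200, 300, 500),
  zipcode = factor(c("a", "a", "b", "b")),
  sqft = c(10, 20, 30, 40)
)
task = as_task_regr(houses, target = "price", id = "houses")

# := works on the copy returned by $data(); the task itself is unchanged.
dt = task$data()
dt[, med_price := median(price), by = "zipcode"]
in_task_before = "med_price" %in% task$feature_names

# $cbind() adds the new column to the task; rows are matched in the
# order of task$row_ids because no row id column is supplied.
task$cbind(dt[, "med_price"])
in_task_after = "med_price" %in% task$feature_names
```

Here `in_task_before` reflects the task after the := call and `in_task_after` the task after $cbind().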

cannot run your code from your article

I copied your code from the article "Practical Tuning Series - Build an Automated Machine Learning System".
When I run this code: graph_learner$param_set$values$branch.selection = to_tune(c("kknn", "svm", "ranger")),
I get this error:
Error in self$assert(xs) : Assertion on 'xs' failed: branch.selection: tune token invalid: to_tune(c("kknn", "svm", "ranger")) generates points that are not compatible with param selection. Bad value: [1] "kknn" Parameter: id class lower upper levels default 1: selection ParamInt 1 3 <NoDefault[3]>.
How should I correct it?
Thank you.
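
The error message reports selection as a ParamInt with range 1 to 3, which suggests the branch was created with a number of options rather than with option names. A minimal sketch of the difference, assuming that is what happened in the code being run:

```r
library(mlr3pipelines)

# Branch created with a count: `selection` is an integer parameter,
# so character values such as "kknn" are rejected.
branch_int = po("branch", options = 3)
class_int = branch_int$param_set$class[["selection"]]

# Branch created with names: `selection` is a categorical parameter
# whose levels are the option names.
branch_chr = po("branch", options = c("kknn", "svm", "ranger"))
class_chr = branch_chr$param_set$class[["selection"]]
levels_chr = branch_chr$param_set$levels[["selection"]]
```

With the named variant, to_tune(c("kknn", "svm", "ranger")) matches the parameter's levels; with the integer variant, the tune token would have to be given as integers instead.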

Publish on r-bloggers

We should cherry-pick some posts from the gallery and publish them on r-bloggers via the mlr3 blog (maybe after polishing them a bit). Additionally, we should try to get the gallery into r-bloggers and r-weekly so that new posts are automatically published there.

@berndbischl your 2 cents?
