
origami's Introduction

R/tlverse: Your One Stop for Targeted Learning in R

The tlverse is an umbrella R package for targeted learning with the tlverse ecosystem of R packages.

library(tlverse) loads the following core packages:

  • sl3 for Ensemble Machine (Super) Learning
  • tmle3 for Targeted Minimum Loss-based Estimation (TMLE)

as well as the following helper packages:

  • delayed for parallelizing dependent tasks
  • origami for cross-validation

and packages for individual tmle3 parameters:

  • tmle3mopttx for targeted learning and variable importance with optimal individualized (categorical) treatments
  • tmle3shift for targeted learning and variable importance with stochastic interventions

Installation

The tlverse ecosystem of packages is currently hosted at https://github.com/tlverse and is not yet on CRAN. You can use the devtools package to install them:

install.packages("devtools")
devtools::install_github("tlverse/tlverse")

The tlverse depends on a large number of other packages that are also hosted on GitHub. Because of this, you may see the following error:

Error: HTTP error 403.
  API rate limit exceeded for 71.204.135.82. (But here's the good news:
  Authenticated requests get a higher rate limit. Check out the documentation
  for more details.)

  Rate limit remaining: 0/60
  Rate limit reset at: 2019-03-04 19:39:05 UTC

  To increase your GitHub API rate limit
  - Use `usethis::browse_github_pat()` to create a Personal Access Token.
  - Use `usethis::edit_r_environ()` and add the token as `GITHUB_PAT`.

This just means that R tried to install too many packages from GitHub in too short a window. To fix this, you need to tell R to authenticate to GitHub as your user (you'll need a GitHub account). Follow these two steps, sketched in code below:

  1. Use usethis::browse_github_pat() to create a Personal Access Token.
  2. Use usethis::edit_r_environ() and add the token as GITHUB_PAT.
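
For concreteness, a sketch of those two steps (the token value is whatever GitHub generates for you; newer usethis versions supersede browse_github_pat() with create_github_token()):

```r
# step 1: opens github.com/settings/tokens in the browser so you can
# generate a Personal Access Token
usethis::browse_github_pat()

# step 2: opens ~/.Renviron; add a line of the form
#   GITHUB_PAT=<token copied from the browser>
# then restart R so the environment variable is picked up
usethis::edit_r_environ()
```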

origami's People

Contributors

ck37, imalenica, jeremyrcoyle, jucheng1992, katrinleinweber, nhejazi, rachaelvp


origami's Issues

removed from CRAN due to failing tests in R 3.6+

On or around 6 March, the CRAN maintainers sent notification that origami's unit tests were failing due to a change in the default PRNG in R-devel (3.6 at the time). As of 22 April, the package has been removed from CRAN (https://cran.r-project.org/package=origami). We should fix this ASAP.

Dear maintainer,

Please see the problems shown on
<https://cran.r-project.org/web/checks/check_results_origami.html>.

The check problems with current R-devel are from

     • The default method for generating from a discrete uniform
       distribution (used in sample(), for instance) has been changed.
       ...
       The previous method can be requested using RNGkind() or
       RNGversion() if necessary for reproduction of old results.

To make your package successfully pass the checks for current R-devel
and R-release you may find it most convenient to insert

  suppressWarnings(RNGversion("3.5.0"))

before calling set.seed() in your example, vignette and test code (where
the difference in RNG sample kinds matters, of course).

Note that this ensures using the (old) non-uniform "Rounding" sampler
for all 3.x versions of R, and does not add an R version dependency.
Note also that the new "Rejection" sampler which R will use from 3.6.0
onwards by default is definitely preferable over the old one, so that
the above should really only be used as a temporary measure for
reproduction of the previous behavior (and the run time tests relying on
it).

Please correct before 2019-03-20 to safely retain your package on CRAN.
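
As a sketch, the temporary fix described above amounts to pinning the old sampler before each seed in the affected examples, vignettes, and tests (the seed value here is a placeholder):

```r
# request the pre-3.6 "Rounding" sampler so old fold assignments reproduce;
# suppressWarnings() silences the warning R >= 3.6.0 emits when the
# non-default sampler is selected
suppressWarnings(RNGversion("3.5.0"))
set.seed(479)  # placeholder; keep whatever seed the test already uses
```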

Forgotten future.seed = TRUE in future_lapply()

The new future 1.16.0 has an optional check for cases where a parallel RNG is forgotten. This check is disabled by default, but it can be set to signal a warning or an error. With errors enabled, it reveals this error when building your vignette:

$ export R_FUTURE_RNG_ONMISUSE=error
$ R CMD build origami 
* checking for file 'origami/DESCRIPTION' ... OK
* preparing 'origami':
* checking DESCRIPTION meta-information ... OK
* installing the package to build vignettes
* creating vignettes ... ERROR
--- re-building ‘generalizedCV.Rmd’ using rmarkdown
Quitting from lines 249-254 (generalizedCV.Rmd) 
Error: processing vignette 'generalizedCV.Rmd' failed with diagnostics:
UNRELIABLE VALUE: Future ('future_lapply-1') unexpectedly generated random numbers
without specifying argument '[future.]seed'. There is a risk that those random numbers are
not statistically sound and the overall results might be invalid. To fix this, specify
argument '[future.]seed', e.g. 'seed=TRUE'. This ensures that proper, parallel-safe random
numbers are produced via the L'Ecuyer-CMRG method. To disable this check, set option
'future.rng.onMisuse' to "ignore".
--- failed re-building ‘generalizedCV.Rmd’

SUMMARY: processing the following file failed:
  ‘generalizedCV.Rmd’

Error: Vignette re-building failed.
Execution halted

The fix is to make sure to pass future_lapply(..., future.seed = TRUE).

This is nothing urgent, but you might want to fix this now, since CV functions that rely on random numbers should really use a parallel-safe RNG. There will be several more releases of the future package before this becomes an error by default, if ever; the plan is to enable warnings in the not-too-distant future. For background, see HenrikBengtsson/future#353.
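
A minimal sketch of the corrected call (the toy computation is illustrative):

```r
library(future.apply)

# future.seed = TRUE requests parallel-safe L'Ecuyer-CMRG RNG streams,
# one per element, instead of silently reusing the session RNG
res <- future_lapply(seq_len(4), function(i) rnorm(1), future.seed = TRUE)
```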

future::values() is defunct and will soon be removed

Hi. You use

importFrom(future,values)

Searching your code, it doesn't look like you're actually calling values(), so I think you can drop the import. If you do use values() somewhere, please use value() from future instead; it's 100% backward-compatible.

values() has been deprecated since future 1.20.0 (2020-10-30) and defunct since future 1.23.0 (2021-10-30). Now I'd like to remove values() from future completely to clean up the API. I'd appreciate it if you could fix this sooner rather than later. Thanks.
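
If I'm reading the NAMESPACE correctly, the fix would presumably be a one-liner, either deleting the import or switching to the singular form:

```r
# in NAMESPACE (or the corresponding roxygen @importFrom tag)
importFrom(future, value)
```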

v-fold cross-validation fails when V = 1

On occasion, the tools in the origami package will be used to build estimation routines that support different cross-validation schemes while hiding origami itself from the end-user. In such cases, certain cross-validation parameters (e.g., the number of folds) might be exposed to the end-user, and origami can behave in unexpected ways. One such example comes up with folds_vfold, where the end-user might simply be asked to select a number of folds. It's perfectly plausible for the user to set V = 1, which is equivalent to a lack of cross-validation.

In such a case, cross_validate fails, likely because make_folds does not generate an object matching its expectations. While it could be left up to the package designer to write custom code that explicitly handles the V = 1 case, it seems that this should be handled internally by origami, perhaps by simply generating a folds list of length 2, with the training and validation sub-lists being exactly identical (i.e., seq_len(n) for n being nrow() of the data). A minimal example is given in this test: https://github.com/tlverse/origami/blob/foldsvfold%3D1/tests/testthat/test-foldsvfold_equals_1.R
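
A minimal sketch of the failure mode (the sample size is arbitrary):

```r
library(origami)

# with V = 1, the returned object does not match the fold structure that
# cross_validate() expects, so downstream evaluation fails
folds <- make_folds(n = 20, fold_fun = folds_vfold, V = 1)
```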

folds_rolling_origin on multivariate data

Hello @jeremyrcoyle and @nhejazi, can we apply folds_rolling_origin to multivariate data, where we have one target and a few independent variables? I have given a reproducible example below, but I get this error:

Error in seq.int(first_window, last_window, by = batch) :
  'to' must be of length 1

library(origami)
library(tidymodels)

# core packages
library(tidyverse)
library(lubridate)
library(timetk)
library(Quandl)  # needed for Quandl(); missing from the original example

df1 <- Quandl(code = "FRED/PINCOME", type = "raw", collapse = "monthly",
              order = "asc", end_date = "2017-12-31")
df2 <- Quandl(code = "FRED/GDP", type = "raw", collapse = "monthly",
              order = "asc", end_date = "2017-12-31")

per <- df1 %>% rename(PI = Value) %>% select(-Date)
gdp <- df2 %>% rename(GDP = Value)

data <- cbind(gdp, per)

data1 <- tk_augment_differences(
  .data = data,
  .value = GDP:PI,
  .lags = 1,
  .differences = 1,
  .log = TRUE,
  .names = "auto"
) %>%
  select(-GDP, -PI) %>%
  rename(GDP = GDP_lag1_diff1, PI = PI_lag1_diff1) %>%
  drop_na()

horizon    <- 15
lag_period <- 15

data_pre_full <- data1 %>%
  # add the future window
  bind_rows(
    future_frame(.data = ., .date_var = Date, .length_out = horizon)
  ) %>%
  # add lags
  tk_augment_lags(.value = GDP:PI, .lags = lag_period)

data_prepared_tbl <- data_pre_full %>%
  filter(!is.na(GDP)) %>%
  dplyr::select(-GDP:-PI) %>%
  drop_na()

folds <- folds_rolling_origin(
  data_prepared_tbl,
  first_window = 50, validation_size = 1, gap = 0, batch = 10
)
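
If I'm reading the error right, folds_rolling_origin() expects the sample size n as its first argument, so passing the data frame itself makes last_window non-scalar inside seq.int(). A hedged workaround is to let make_folds() extract the number of rows (the multivariate columns don't matter to the fold structure, since folds only index rows):

```r
# workaround sketch: make_folds() accepts data and forwards the number of
# rows to the fold function
folds <- make_folds(
  data_prepared_tbl,
  fold_fun = folds_rolling_origin,
  first_window = 50, validation_size = 1, gap = 0, batch = 10
)
```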

Repeated cross-validation

It would be nice to support repeated cross-validation, since it's a good way to improve precision over plain k-fold cross-validation. It's also what caret generally recommends (example), so a lot of people tend to use it.

Ideally, it would be possible to add repeats in an iterative manner; a rough sketch follows.
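
As one way this could look with the current API (the repeat count and data are illustrative, and fold numbering across repeats is left as-is):

```r
library(origami)

# pool several independent V-fold partitions into a single folds list;
# more repeats can be added iteratively by appending another call's result
repeats <- 5
fold_sets <- lapply(seq_len(repeats), function(r) make_folds(mtcars, V = 10))
all_folds <- unlist(fold_sets, recursive = FALSE)
```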

Test conditionally on 'forecast' package - otherwise check error

I think your call to library(forecast) at the top of tests/testthat/test-overall_timeseries.R needs to be using:

```r
if (require("forecast")) {
   ...
}
```

instead, because the 'forecast' package is optional (listed only in Suggests). Without the package installed, you get:

checking tests ... ERROR
Running the tests in ‘tests/testthat.R’ failed.
Last 13 lines of output:
  Version: 1.0.0
  > library(data.table)
  > test_check("origami")
  ── 1. Error: (unknown) (@test-overall_timeseries.R#1)  ─────────────────────────
  there is no package called 'forecast'
  1: library(forecast) at testthat/test-overall_timeseries.R:1
  2: stop(txt, domain = NA)
  
  ══ testthat results  ═══════════════════════════════════════════════════════════
  OK: 24 SKIPPED: 0 FAILED: 1
  1. Error: (unknown) (@test-overall_timeseries.R#1) 
  
  Error: testthat unit tests failed
  Execution halted
  Error while shutting down parallel: unable to terminate some child processes
checking re-building of vignette outputs ... WARNING

Error in re-building vignettes:
  ...
Warning in engine$weave(file, quiet = quiet, encoding = enc) :
  Pandoc (>= 1.12.3) and/or pandoc-citeproc not available. Falling back to R Markdown v1.
Quitting from lines 286-321 (generalizedCV.Rmd) 
Error: processing vignette 'generalizedCV.Rmd' failed with diagnostics:
there is no package called 'forecast'
Execution halted
checking package dependencies ... NOTE

Package suggested but not available for checking: ‘forecast’
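
For what it's worth, testthat also has a built-in guard that accomplishes the same conditional skipping, if you'd rather not wrap the whole test file in require():

```r
# at the top of tests/testthat/test-overall_timeseries.R
testthat::skip_if_not_installed("forecast")
library(forecast)
```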

Partial argument names

Hi. FYI, with

options(warnPartialMatchArgs = TRUE)

I get warnings like:

Warning in rep(seq_len(V), length = n) :
  partial argument match of 'length' to 'length.out'

They come from, for instance,

folds <- rep(seq_len(V), length = n)

which should be:

rep(seq_len(V), length.out = n)

clearer code styling in wrappers (via JOSS)

This should be good to go as far as JOSS is concerned, but I have a few additional comments that I hope you'll find helpful. The package looks very useful, but the examples don't quite feel right to me. I'd sort of compare it to the idea of something being "pythonic" in python. As you probably know, there are basically two distinct styles in R, base and tidy. This package isn't in the tidy philosophy (which is fine), but it doesn't feel quite right with base either. If we take your linear models example, you could swap out the use of strings for formulas, so the first few lines might look like:

  f <- formula(reg_form)
  mf <- model.frame(f, data=data)
  out_var_ind <- 1

  # split up data into training and validation sets
  train_data <- training(mf)
  valid_data <- validation(mf)

This way you could actually automate all of this for the user, and then the cv_lm() function would only need to do the fitting. Then in cross_validate(), you would simply pass reg_form = mpg ~ . as an argument. You could also play the game of attaching the data to the folds object (sort of like how lm() itself does) to simplify argument passing. You could even do this in a separate formula interface and leave the current, more general interface as is.

There may be some issues here that I haven't considered and you already have, making this harder than my gut instinct is leading me to believe. But if not, I think it's worth considering touching up the interface to use formulas.
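
Filling the suggestion out into a self-contained sketch against mtcars (the formula-passing style here mirrors the package vignette; names are illustrative):

```r
library(origami)

# cv_fun that receives the data and a regression formula; the split into
# training and validation rows uses origami's fold-aware helpers
cv_lm <- function(fold, data, reg_form) {
  train_data <- training(data)
  valid_data <- validation(data)
  mod <- lm(as.formula(reg_form), data = train_data)
  preds <- predict(mod, newdata = valid_data)
  list(coef = data.frame(t(coef(mod))), SE = (preds - valid_data$mpg)^2)
}

folds <- make_folds(mtcars)
results <- cross_validate(cv_lm, folds, data = mtcars, reg_form = "mpg ~ .")
```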

unequal cluster sizes

Currently we handle clustering by making folds for the clusters and then expanding them into folds for the individuals. This can result in unequal fold sizes when the clusters are of unequal size. We should implement, at least as an option, a way to balance the fold sizes.
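
A small illustration of the imbalance (the cluster layout is contrived):

```r
library(origami)

# one cluster of 50 observations and ten clusters of 5: folds are built
# over the 11 clusters, so whichever fold receives cluster 1 ends up far
# larger than the others
cluster_ids <- c(rep(1, 50), rep(2:11, each = 5))
folds <- make_folds(cluster_ids = cluster_ids, V = 5)
sapply(folds, function(fold) length(fold$validation_set))
```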

example with future and parallel CV?

Is it possible to provide a quick working example of parallel CV evaluation using the future back-end?

Presumably, parallel CV is working? If it's not implemented yet, I'm happy to close this issue. Thanks so much!
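
A minimal sketch of what I'd expect to work, given that cross_validate() dispatches folds through future_lapply() (the toy cv_fun here is mine, not from the package docs):

```r
library(origami)
library(future)

# choose a parallel backend before calling cross_validate(); origami's
# fold loop runs through future_lapply(), so selecting a plan should be
# all that's needed for parallel fold evaluation
plan(multisession)

# toy cv_fun: mean of the training portion of x for each fold
cv_mean <- function(fold, x) list(mu = mean(training(x)))

folds <- make_folds(n = 100)
results <- cross_validate(cv_mean, folds, x = rnorm(100))
```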

make_folds is unnecessarily slow for big data and v-fold cross-validation

In the interest of generality, a fold is currently defined as training and validation index vectors. This allows some observations to be neither training nor validation, which is relevant for some cross-validation schemes, but not relevant for the most common v-fold case. This doesn't matter for small data, but for big sample sizes (say 1e6 observations), this takes a lot of time, and generates quite a bit more data than is really necessary to describe the folds in v-fold cross-validation.

Therefore, let's modify things so that a missing training index vector implicitly defines a training set as all observations not in the validation set.

Let's also try to optimize make_folds for big data a bit. It's particularly slow when cluster_ids are specified, for example: make_folds(cluster_ids = seq_len(1e6), V = 10).
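
As a purely hypothetical sketch of the proposed representation (make_fold_implicit is not an existing origami function):

```r
# represent a fold by its validation indices only; the training set is
# implicitly the complement of the validation set and is materialized
# only when actually requested
make_fold_implicit <- function(v, validation_set, n) {
  list(
    v = v,
    validation_set = validation_set,
    training_set = function() setdiff(seq_len(n), validation_set)
  )
}
```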

Stratified CV breaks in a particular setting

Stratified CV breaks when the number of events for some stratum is greater than V and for others it's less than V. This is the error: Error in strata_folds[[strata]][[v]] : subscript out of bounds.

Here's an example; it fails because the number of level-1 events is < V and the number of level-0 events is > V.

strata_ids <- c(rep(0,15), rep(1,9)) 
folds <- origami::make_folds(strata_ids = strata_ids) 

None of these examples fail:

strata_ids1 <- c(rep(0,15), rep(1,15)) # even prevalence, both levels > V 
folds1 <- origami::make_folds(strata_ids = strata_ids1)

strata_ids2 <- c(rep(0,20), rep(1,10)) # uneven prevalence, both levels > V 
folds2 <- origami::make_folds(strata_ids = strata_ids2)

strata_ids3 <- c(rep(0,5), rep(1,5)) # even prevalence, both levels < V 
folds3 <- origami::make_folds(strata_ids = strata_ids3)

strata_ids4 <- c(rep(0,8), rep(1,9)) # uneven prevalence, both levels < V 
folds4 <- origami::make_folds(strata_ids = strata_ids4)

Added-value of origami compared to existing packages?

Hi there,

I have just been made aware of your package here.

I am the maintainer of another CV framework (sperrorest) with a lot of partitioning methods (see the full list here).

I am about to deprecate my package and integrate all partition functions into mlr, because it is a much bigger framework providing all kinds of possibilities when it comes to learning, error estimation, tuning, etc., and it provides good documentation.

I would like to ask

  1. Are you aware of packages like mlr or caret? (I guess you are.)
  2. What are the benefits of origami?

Please do not be offended by this issue; I am really curious about your motivation and reasons as both our packages seem to be somewhat similar.

Error when passing optional arguments to `cross_validate`

Minimal example below, taken from the vignette. The function cross_validate currently errors out whenever I try to pass any optional arguments to cv_fun. The error appears to originate downstream in the future package. This is crucial to my functionality; I have to pass optional arguments to cv_fun. Any advice on workarounds?

library(origami)
cvlm <- function(fold, mydata) {
    train_data <- training(mydata)
    valid_data <- validation(mydata)

    mod <- lm(mpg ~ ., data = train_data)
    preds <- predict(mod, newdata = valid_data)
    list(coef = data.frame(t(coef(mod))), SE = ((preds - valid_data$mpg)^2))
}

# cross-validated estimate
folds <- make_folds(mtcars)
results <- cross_validate(cvlm, folds, mtcars)
results <- cross_validate(cvlm, folds, mydata = mtcars)

In general, I think it might not be best practice to provide an example in which the function cvlm depends on an object (mtcars) defined in its calling environment. This creates quite a bit of confusion and makes the code hard to read. That's just my opinion, though.

future_lapply(): please use future_lapply() from the future.apply package

Hi, future 1.7.0 is now on CRAN, and I'm starting the process of deprecating future::future_lapply(), which has moved to the future.apply package. Please upgrade your code to use:

importFrom(future.apply, future_lapply)

or

future.apply::future_lapply()

This should be a straightforward update without any surprises.

In the next release of the future package, future::future_lapply() will produce a formal "deprecated" warning and in the release after that it will produce a "defunct" error.

DOI Issues

The DOI listed for the package references package version 1.0 instead of 0.8.0 and links to the wrong release tag under "related identifiers". Additionally, there is a fourth author listed under the DOI who is not listed in the package DESCRIPTION (Cheng; Chris Kennedy).

disable globals by default

Having the core function cross_validate use globals by default seems to cause unexpected problems in downstream dependencies. We should consider disabling it by default. Idea originally by @jeremyrcoyle.
