tidymodels / rsample

Classes and functions to create and summarize resampling objects

Home Page: https://rsample.tidymodels.org

License: Other

R 100.00%

rsample's Introduction

rsample

Overview

The rsample package provides functions to create different types of resamples and corresponding classes for their analysis. The goal is to have a modular set of methods that can be used for:

resampling for estimating the sampling distribution of a statistic
estimating model performance using a holdout set

The scope of rsample is to provide the basic building blocks for creating and analyzing resamples of a data set, but this package does not include code for modeling or calculating statistics. The Working with Resample Sets vignette gives a demonstration of how rsample tools can be used when building models.

Note that resampled data sets created by rsample are directly accessible in a resampling object but do not contain much overhead in memory. Since the original data is not modified, R does not make an automatic copy.

For example, creating 50 bootstraps of a data set does not create an object that is 50-fold larger in memory:

library(rsample)
library(mlbench)

data(LetterRecognition)
lobstr::obj_size(LetterRecognition)
#> 2,644,640 B

set.seed(35222)
boots <- bootstraps(LetterRecognition, times = 50)
lobstr::obj_size(boots)
#> 6,686,776 B

# Object size per resample
lobstr::obj_size(boots)/nrow(boots)
#> 133,735.5 B

# Fold increase is <<< 50
as.numeric(lobstr::obj_size(boots)/lobstr::obj_size(LetterRecognition))
#> [1] 2.528426

Created on 2022-02-28 by the reprex package (v2.0.1)

The 50 bootstrap samples together use less than 3 times the memory of the original data set.

Installation

To install it, use:

install.packages("rsample")

Or install the development version from GitHub with:

# install.packages("pak")
pak::pak("tidymodels/rsample")

Contributing

This project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

For questions and discussions about tidymodels packages, modeling, and machine learning, please post on Posit Community.
If you think you have encountered a bug, please submit an issue.
Either way, learn how to create and share a reprex (a minimal, reproducible example), to clearly communicate about your code.
Check out further details on contributing guidelines for tidymodels packages and how to get help.

rsample's Issues

Consider renaming `fill()` to not clash with `tidyr::fill()`

Currently, loading rsample masks fill(). Is there a better name?

library(rsample)
#> Warning: package 'rsample' was built under R version 3.4.4
#> Loading required package: broom
#> Warning: package 'broom' was built under R version 3.4.4
#> Loading required package: tidyr
#> Warning: package 'tidyr' was built under R version 3.4.4
#> 
#> Attaching package: 'rsample'
#> The following object is masked from 'package:tidyr':
#> 
#>     fill
library(tidyr)

mtcars <- mtcars[1:5,]
mtcars$mpg[1] <- NA

fill(mtcars, mpg, .direction = "up")
#> Error in UseMethod("fill"): no applicable method for 'fill' applied to an object of class "data.frame"

tidyr::fill(mtcars, mpg, .direction = "up")
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2

Created on 2018-08-15 by the reprex package (v0.2.0).

retain classes when row filtering

Currently, because of how [ works, filtering a row from a tibble strips the rsample specific classes:

> class(vfold_cv(mtcars))
[1] "vfold_cv"   "rset"       "tbl_df"     "tbl"        "data.frame"
> class(vfold_cv(mtcars) %>% dplyr::filter(id != "Fold01"))
[1] "tbl_df"     "tbl"        "data.frame"
> class(vfold_cv(mtcars)[-1,])
[1] "tbl_df"     "tbl"        "data.frame"
> class(vfold_cv(mtcars)[,-2])
[1] "tbl_df"     "tbl"        "data.frame"
> # but 
> class(vfold_cv(mtcars) %>% dplyr::select(-id))
[1] "vfold_cv"   "rset"       "tbl_df"     "tbl"        "data.frame"

A new method might work by first saving the non-data-frame classes, calling [.tbl_df, and then restoring the lost classes, instead of re-implementing the method and keeping 99% of it the same.
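The save-and-restore idea can be sketched in base R. This is only an illustration, not rsample's implementation: `strip_df` is a hypothetical parent class whose `[` method drops extra classes (a stand-in for the tibble behavior shown above), and `demo_rset` and `demo_cv` are made-up class names.

```r
# Hypothetical parent class that strips subclasses on subsetting,
# mimicking the tibble behaviour described in this issue.
`[.strip_df` <- function(x, ...) {
  out <- NextMethod()
  if (is.data.frame(out)) {
    class(out) <- c("strip_df", "data.frame")
  }
  out
}

# Save the extra classes, delegate to the next `[` method, then restore them.
`[.demo_rset` <- function(x, ...) {
  extra <- setdiff(class(x), c("strip_df", "data.frame"))
  out <- NextMethod()
  # Only data frame results get the classes back; dropped vectors pass through
  if (is.data.frame(out)) {
    class(out) <- c(extra, setdiff(class(out), extra))
  }
  out
}

demo <- structure(
  mtcars[1:5, 1:3],
  class = c("demo_cv", "demo_rset", "strip_df", "data.frame")
)

# Row filtering now keeps the resampling classes
class(demo[-1, ])
```

The method body stays short because all of the real subsetting work is still done by the inherited method.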

Nested resampling vignette

Per topepo/caret#70, add a simple example of the analysis of nested resampling as a vignette.

Hardcoded rset_att limits extendability

Filtering attributes against a hardcoded list in maybe_rset means this package can't be extended in ways that add new attributes. This is also unexpected, as new_rset doesn't warn or error if passed attributes outside that list.

Here's an example:

rset <- rsample::vfold_cv(mtcars)

names(attributes(rset))
#> [1] "row.names" "class"     "names"     "v"         "repeats"   "strata"

names(attributes(dplyr::mutate(rset, x = 1:n())))
#> [1] "names"     "class"     "row.names" "v"         "repeats"   "strata"


custom_rset <- rsample:::new_rset(
  splits = list(
    rsample:::rsplit(mtcars, 1:10, 11:32),
    rsample:::rsplit(mtcars, 1:22, 23:32)
  ),
  ids = c("Slice1", "Slice2"),
  attrib = list(repeats = 1, foo = "bar"),
  subclass = c("custom", "rset")
)

names(attributes(custom_rset))
#> [1] "row.names" "class"     "names"     "repeats"   "foo"

# foo is dropped
names(attributes(dplyr::mutate(custom_rset, x = 1:n())))
#> [1] "names"     "class"     "row.names" "repeats"

Created on 2018-07-10 by the reprex package (v0.2.0).

Session info

devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.4.4 (2018-03-15)
#>  system   x86_64, linux-gnu           
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  tz       America/Detroit             
#>  date     2018-07-10
#> Packages -----------------------------------------------------------------
#>  package    * version    date       source                          
#>  abind        1.4-5      2016-07-21 CRAN (R 3.4.4)                  
#>  assertthat   0.2.0      2017-04-11 CRAN (R 3.4.4)                  
#>  backports    1.1.2      2017-12-13 CRAN (R 3.4.4)                  
#>  base       * 3.4.4      2018-03-16 local                           
#>  bindr        0.1.1      2018-03-13 CRAN (R 3.4.4)                  
#>  bindrcpp   * 0.2.2      2018-03-29 CRAN (R 3.4.4)                  
#>  broom        0.4.5      2018-07-03 cran (@0.4.5)                   
#>  class        7.3-14     2015-08-30 CRAN (R 3.4.0)                  
#>  compiler     3.4.4      2018-03-16 local                           
#>  CVST         0.2-2      2018-05-26 CRAN (R 3.4.4)                  
#>  datasets   * 3.4.4      2018-03-16 local                           
#>  ddalpha      1.3.4      2018-06-23 CRAN (R 3.4.4)                  
#>  DEoptimR     1.0-8      2016-11-19 CRAN (R 3.4.4)                  
#>  devtools     1.13.5     2018-02-18 CRAN (R 3.4.4)                  
#>  digest       0.6.15     2018-01-28 CRAN (R 3.4.4)                  
#>  dimRed       0.1.0      2017-05-04 CRAN (R 3.4.4)                  
#>  dplyr        0.7.6      2018-06-29 cran (@0.7.6)                   
#>  DRR          0.0.3      2018-01-06 CRAN (R 3.4.4)                  
#>  evaluate     0.10.1     2017-06-24 CRAN (R 3.4.4)                  
#>  foreign      0.8-70     2018-04-23 CRAN (R 3.4.4)                  
#>  geometry     0.3-6      2015-09-09 CRAN (R 3.4.4)                  
#>  glue         1.2.0      2017-10-29 CRAN (R 3.4.4)                  
#>  gower        0.1.2      2017-02-23 CRAN (R 3.4.4)                  
#>  graphics   * 3.4.4      2018-03-16 local                           
#>  grDevices  * 3.4.4      2018-03-16 local                           
#>  grid         3.4.4      2018-03-16 local                           
#>  htmltools    0.3.6      2017-04-28 CRAN (R 3.4.4)                  
#>  ipred        0.9-6      2017-03-01 CRAN (R 3.4.4)                  
#>  kernlab      0.9-26     2018-04-30 CRAN (R 3.4.4)                  
#>  knitr        1.20       2018-02-20 CRAN (R 3.4.4)                  
#>  lattice      0.20-35    2017-03-25 CRAN (R 3.3.3)                  
#>  lava         1.6.1      2018-03-28 CRAN (R 3.4.4)                  
#>  lubridate    1.7.4      2018-04-11 CRAN (R 3.4.4)                  
#>  magic        1.5-8      2018-01-26 CRAN (R 3.4.4)                  
#>  magrittr     1.5        2014-11-22 CRAN (R 3.4.4)                  
#>  MASS         7.3-50     2018-04-30 CRAN (R 3.4.4)                  
#>  Matrix       1.2-14     2018-04-09 CRAN (R 3.4.4)                  
#>  memoise      1.1.0      2017-04-21 CRAN (R 3.4.4)                  
#>  methods    * 3.4.4      2018-03-16 local                           
#>  mnormt       1.5-5      2016-10-15 CRAN (R 3.4.4)                  
#>  nlme         3.1-137    2018-04-07 CRAN (R 3.4.4)                  
#>  nnet         7.3-12     2016-02-02 CRAN (R 3.4.0)                  
#>  parallel     3.4.4      2018-03-16 local                           
#>  pillar       1.2.3      2018-05-25 CRAN (R 3.4.4)                  
#>  pkgconfig    2.0.1      2017-03-21 CRAN (R 3.4.4)                  
#>  pls          2.6-0      2016-12-18 CRAN (R 3.4.4)                  
#>  plyr         1.8.4      2016-06-08 CRAN (R 3.4.4)                  
#>  prodlim      2018.04.18 2018-04-18 CRAN (R 3.4.4)                  
#>  psych        1.8.4      2018-05-06 CRAN (R 3.4.4)                  
#>  purrr        0.2.5      2018-05-29 CRAN (R 3.4.4)                  
#>  R6           2.2.2      2017-06-17 CRAN (R 3.4.4)                  
#>  Rcpp         0.12.17    2018-05-18 CRAN (R 3.4.4)                  
#>  RcppRoll     0.3.0      2018-06-05 CRAN (R 3.4.4)                  
#>  recipes      0.1.3      2018-07-09 Github (topepo/recipes@b1b5da9) 
#>  reshape2     1.4.3      2017-12-11 CRAN (R 3.4.4)                  
#>  rlang        0.2.0.9001 2018-07-06 Github (tidyverse/rlang@b4f810f)
#>  rmarkdown    1.10       2018-06-11 CRAN (R 3.4.4)                  
#>  robustbase   0.93-1     2018-06-23 CRAN (R 3.4.4)                  
#>  rpart        4.1-13     2018-02-23 CRAN (R 3.4.3)                  
#>  rprojroot    1.3-2      2018-01-03 CRAN (R 3.4.4)                  
#>  rsample      0.0.2      2017-11-12 CRAN (R 3.4.4)                  
#>  sfsmisc      1.1-2      2018-03-05 CRAN (R 3.4.4)                  
#>  splines      3.4.4      2018-03-16 local                           
#>  stats      * 3.4.4      2018-03-16 local                           
#>  stringi      1.2.3      2018-06-12 cran (@1.2.3)                   
#>  stringr      1.3.1      2018-05-10 CRAN (R 3.4.4)                  
#>  survival     2.42-3     2018-04-16 CRAN (R 3.4.4)                  
#>  tibble       1.4.2      2018-01-22 CRAN (R 3.4.4)                  
#>  tidyr        0.8.1      2018-05-18 CRAN (R 3.4.4)                  
#>  tidyselect   0.2.4      2018-02-26 CRAN (R 3.4.4)                  
#>  timeDate     3043.102   2018-02-21 CRAN (R 3.4.4)                  
#>  tools        3.4.4      2018-03-16 local                           
#>  utils      * 3.4.4      2018-03-16 local                           
#>  withr        2.1.2      2018-03-15 CRAN (R 3.4.4)                  
#>  yaml         2.1.19     2018-05-01 CRAN (R 3.4.4)

Is there a reason to only allow certain attributes? I'd be happy to remove that restriction in a PR if not.

assessment and analysis functions on objects other than rsplit

I recently had great trouble fixing a bug in which both assessment and analysis returned the analysis set. The reason appeared to be that my rsplit element was still wrapped in a list. See the following minimal example.

library(tidyverse)
library(rsample)
mtcars_spl <- mtcars %>% vfold_cv()
mtcars_spl$splits[1] %>% assessment()
mtcars_spl$splits[1] %>% analysis()

This is because as.data.frame is called by both functions. Wouldn't it be better if they only work on rsplit objects and throw an informative error when called on other objects?
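One way such a guard could look, sketched in base R. `check_rsplit()` is a hypothetical helper, not an rsample function; the sketch relies only on split objects carrying the class "rsplit".

```r
# Hypothetical guard: fail early with an informative message
check_rsplit <- function(x) {
  if (!inherits(x, "rsplit")) {
    stop(
      "`x` must be an `rsplit` object, not an object of class <",
      class(x)[1], ">. Did you use `[` where `[[` was intended?",
      call. = FALSE
    )
  }
  invisible(x)
}

# A one-element list, like `mtcars_spl$splits[1]`, is rejected
not_a_split <- list(structure(list(), class = "rsplit"))
msg <- tryCatch(
  check_rsplit(not_a_split),
  error = function(e) conditionMessage(e)
)
msg
```

Calling the guard at the top of both accessor functions would turn the silent wrong result into an immediate error.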

Combination of rsample with Amelia for missing values

Hi! I was just wondering if it'd be possible to use the rsample package with Amelia, where multiple imputation is applied: m values are imputed for each missing cell in a data matrix, creating m "completed" data sets. Given the usual uncertainty in the imputed values, Amelia offers some measure of confidence for whatever metric is computed, but I'm not sure if this can be combined with or extended by the rsample and tidyposterior packages, which would be ideal. Any comments on this are highly appreciated!

Kind regards,
Roberto

summary and analysis code

Basic mean, sd, summary, and other methods for naive summaries of columns that are not splits, ids, or other complex objects. Maybe default to all numeric columns, and make sure to filter out the apparent sample when bootstrapping. Also use tidyselect for tidyverse-style name specifications.
Add more complex methods for these summaries that are bias corrected (such as 632, 632+).
Consider an as.boot or similar function to convert the tibble output to the objects used in other packages.
Also consider importing those functions and writing wrappers to do traditional resampling statistics from the tibble.
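For the bias-corrected summaries, the .632 estimator is a weighted average of the apparent (resubstitution) error and the mean out-of-bag error, with weights 0.368 and 0.632. A minimal base R sketch follows; `boot_632()` is a hypothetical name and the inputs are made-up numbers. The .632+ variant also needs a no-information error rate and is not shown.

```r
# Hypothetical helper: combine apparent and out-of-bag error estimates
boot_632 <- function(apparent, oob) {
  stopifnot(length(apparent) == 1, length(oob) >= 1)
  0.368 * apparent + 0.632 * mean(oob)
}

# Made-up error rates: one apparent estimate, two out-of-bag estimates
est <- boot_632(apparent = 0.10, oob = c(0.30, 0.50))
est
```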

what should be exposed to the function in `along`?

Right now, it runs map on a single column in the tibble (e.g. splits). It should perhaps have the whole row of the tibble exposed so that other columns can be accessed. This might future-proof the function for when models are created using information in other columns.

Also... until purrr is parallelized, it might make sense to use foreach to operate across rows instead of using map.

Add `tidytemplate` to Suggests + Remotes

If you try to build the pkgdown site, you'll get an error if you don't have the not-yet-on-CRAN tidytemplate; should add this to Suggests + Remotes.

I can do this.

Error in initial_split {rsample} documentation

The example in the current initial_split documentation is:

set.seed(1353)
car_split <- mc_cv(mtcars)
train_data <- training(car_split)
test_data <- testing(car_split)

I think the correct version should be:

set.seed(1353)
car_split <- mc_cv(mtcars)
train_data <- training(car_split$splits[[1]])
test_data <- testing(car_split$splits[[1]])

Now, if the initial_split function is used instead of mc_cv, then the correct version is:

set.seed(1353)
car_split <- initial_split(mtcars)
train_data <- training(car_split)
test_data <- testing(car_split)

Importing `recipes` introduces many implicit dependencies

At the moment, installing rsample (supposing tidyverse is installed) causes many packages to be installed as dependencies of recipes. I had 20 extra packages installed (excluding recipes itself):

List of extra packages

‘numDeriv’, ‘SQUAREM’, ‘abind’, ‘lava’, ‘kernlab’, ‘CVST’, ‘DEoptimR’, ‘magic’, ‘prodlim’, ‘DRR’, ‘robustbase’, ‘sfsmisc’, ‘geometry’, ‘ipred’, ‘dimRed’, ‘timeDate’, ‘ddalpha’, ‘gower’, ‘RcppRoll’, ‘pls’

I understand the reason for facilitating the use of recipes with rsample by providing the prepper() wrapper, but I think 20 packages not explicitly needed for core functionality is a bit too much for one simple wrapper. Maybe it is a good idea to move recipes back to Suggests, mention it together with the wrapper in a vignette, and possibly add rsample to the Imports of recipes (as it already has many imported packages)? That would keep rsample more lightweight.

Also, the current situation might be inconvenient for package authors wanting to import rsample.

rset constructor

I think you should have something like this:

new_rset <- function(splits, ..., subclass = character()) {
  stopifnot(is.list(splits))

  structure(list(
    splits = splits,
    ...
  ), class = c(subclass, "rset"))
}

That's also a good place to document the details of the object structure.

add a forecast example to vignettes

Use an example from the repo page.

Add a leave-group-out resampling method

Similar to caret's groupKFold.
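A rough base R sketch of what grouped fold assignment could look like, assuming every row carries a group label. `group_folds()` is a hypothetical name, and this is neither caret's nor rsample's implementation. Whole groups are assigned to folds, so no group is split between the analysis and assessment sets.

```r
# Hypothetical helper: assign whole groups to folds
group_folds <- function(groups, v = 5) {
  uniq <- unique(groups)
  v <- min(v, length(uniq))
  # Randomly give each group a fold number, as evenly as possible
  fold_of_group <- sample(rep_len(seq_len(v), length(uniq)))
  names(fold_of_group) <- as.character(uniq)
  fold_id <- unname(fold_of_group[as.character(groups)])
  lapply(seq_len(v), function(k) {
    list(
      analysis = which(fold_id != k),
      assessment = which(fold_id == k)
    )
  })
}

set.seed(1)
g <- rep(letters[1:6], each = 3)
folds <- group_folds(g, v = 3)

# Groups held out in the first fold
unique(g[folds[[1]]$assessment])
```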

Simplified bootstrapping protocols for simple cases

I'm following up on our conversation on Twitter. I'd like to propose that there is a use case for a very basic bootstrapping interface that imposes minimal cognitive load on the user. In the simplest scenario, I want to be able to take an existing dplyr or ggplot pipeline and simply insert the word "bootstrap" somewhere and have a bootstrapped version of that pipeline. This would be easy to do with a function that generates a bootstrapped version of the input data frame, but in general generating bootstraps on the fly will be more memory efficient, in particular for very large data sets or larger numbers of bootstraps.

First, let me provide some examples of what I mean by going from an existing pipeline to a bootstrapped one and how I have solved this problem with some functions I have written, which are available here.

library(dplyr)
library(broom)
library(ggplot2)

# without bootstrap
iris %>% group_by(Species) %>%
  summarize(
    mean_sepal_length = mean(Sepal.Length),
    mean_petal_length = mean(Sepal.Length)
  )
#> # A tibble: 3 x 3
#>   Species    mean_sepal_length mean_petal_length
#>   <fct>                  <dbl>             <dbl>
#> 1 setosa                  5.01              5.01
#> 2 versicolor              5.94              5.94
#> 3 virginica               6.59              6.59

# with bootstrap
iris %>% group_by(Species) %>%
  ungeviz::bootstrap_summarize(3,
    mean_sepal_length = mean(Sepal.Length),
    mean_petal_length = mean(Sepal.Length)
  )
#> # A tibble: 9 x 4
#>   Species    .draw mean_sepal_length mean_petal_length
#>   <fct>      <int>             <dbl>             <dbl>
#> 1 setosa         1              5.07              5.07
#> 2 setosa         2              4.98              4.98
#> 3 setosa         3              4.95              4.95
#> 4 versicolor     1              5.99              5.99
#> 5 versicolor     2              6.03              6.03
#> 6 versicolor     3              5.99              5.99
#> 7 virginica      1              6.55              6.55
#> 8 virginica      2              6.71              6.71
#> 9 virginica      3              6.59              6.59

# without bootstrap
iris %>% group_by(Species) %>%
  filter(Species != "setosa") %>%
  do(
    tidy(lm(Sepal.Length ~ Petal.Length, data = .))
  )
#> # A tibble: 4 x 6
#> # Groups:   Species [2]
#>   Species    term         estimate std.error statistic  p.value
#>   <fct>      <chr>           <dbl>     <dbl>     <dbl>    <dbl>
#> 1 versicolor (Intercept)     2.41     0.446       5.39 2.08e- 6
#> 2 versicolor Petal.Length    0.828    0.104       7.95 2.59e-10
#> 3 virginica  (Intercept)     1.06     0.467       2.27 2.77e- 2
#> 4 virginica  Petal.Length    0.996    0.0837     11.9  6.30e-16

# with bootstrap
iris %>% group_by(Species) %>%
  filter(Species != "setosa") %>%
  ungeviz::bootstrap_do(3,
    tidy(lm(Sepal.Length ~ Petal.Length, data = .))
  )
#> # A tibble: 12 x 7
#>    Species    .draw term         estimate std.error statistic  p.value
#>    <fct>      <int> <chr>           <dbl>     <dbl>     <dbl>    <dbl>
#>  1 versicolor     1 (Intercept)    2.70      0.462     5.84   4.47e- 7
#>  2 versicolor     1 Petal.Length   0.752     0.105     7.14   4.45e- 9
#>  3 versicolor     2 (Intercept)    2.49      0.354     7.04   6.43e- 9
#>  4 versicolor     2 Petal.Length   0.806     0.0843    9.55   1.12e-12
#>  5 versicolor     3 (Intercept)    2.33      0.453     5.14   5.06e- 6
#>  6 versicolor     3 Petal.Length   0.848     0.104     8.12   1.45e-10
#>  7 virginica      1 (Intercept)    0.0159    0.540     0.0295 9.77e- 1
#>  8 virginica      1 Petal.Length   1.17      0.0953   12.3    1.80e-16
#>  9 virginica      2 (Intercept)    1.28      0.586     2.18   3.42e- 2
#> 10 virginica      2 Petal.Length   0.945     0.106     8.89   1.05e-11
#> 11 virginica      3 (Intercept)    0.987     0.447     2.21   3.20e- 2
#> 12 virginica      3 Petal.Length   1.02      0.0801   12.7    5.72e-17

# without bootstrap
mtcars %>%
  ggplot(aes(hp, mpg)) + geom_smooth(se = FALSE)
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'

# with bootstrap
mtcars %>%
  ungeviz::bootstrap(10) %>%
  ggplot(aes(hp, mpg, group = .draw)) + geom_smooth(se = FALSE)
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'

I think all of these cases are very intuitive. I wrote bootstrap_summarize() and bootstrap_do() because they can summarize and do on bootstrap samples generated on the fly. I wrote bootstrap(), which simply generates a bootstrapped dataset, because I found that I needed it, for example in plotting.

One case I don't know how to handle is the following:

library(dplyr)
library(tidyr)
library(purrr)
library(broom)

iris %>% group_by(Species) %>%
  filter(Species != "setosa") %>%
  nest() %>%
  mutate(
    model = map(data, ~lm(Sepal.Length ~ Petal.Length, data = .x)),
    coef = map(model, tidy)
  ) %>%
  select(Species, coef) %>%
  unnest()
#> # A tibble: 4 x 6
#>   Species    term         estimate std.error statistic  p.value
#>   <fct>      <chr>           <dbl>     <dbl>     <dbl>    <dbl>
#> 1 versicolor (Intercept)     2.41     0.446       5.39 2.08e- 6
#> 2 versicolor Petal.Length    0.828    0.104       7.95 2.59e-10
#> 3 virginica  (Intercept)     1.06     0.467       2.27 2.77e- 2
#> 4 virginica  Petal.Length    0.996    0.0837     11.9  6.30e-16

I now frequently see this kind of workflow recommended instead of the do() workflow I used above, but it's not entirely clear to me why. It requires quite a bit more code, and much more mental processing.

In any case, I'd be happy to contribute these functions (or variants thereof) to rsample. Maybe the bootstrap() function should be called bootstrap_df() to highlight that it creates a data frame and to separate it from the existing bootstraps() function.

@hadley proposed alternatively that a bootstrap() function could set up virtual bootstraps that then could be processed by summarize(). I like this idea in principle, but the question is what functions have to be modified. Both summarize() and do() should respect these virtual bootstraps I think. The ggplot package could get a fortify() method that simply expands virtual bootstraps to actual bootstraps. But what about nest() and mutate()? Would we expect the following pattern to work? Users would probably expect that it would.

iris %>% group_by(Species) %>%
  filter(Species != "setosa") %>%
  bootstrap(10) %>%
  nest() %>%
  mutate(
    model = map(data, ~lm(Sepal.Length ~ Petal.Length, data = .x)),
    coef = map(model, tidy)
  ) %>%
  select(Species, coef) %>%
  unnest()

Write an RViews post for `rsample`

Best practices when using `rsample` in modelling packages

I'd like to convince a professor of mine to use rsample in a modelling package she's working on. The hope would be to deal with one new abstraction from rsample and gain several new types of resampling in the process.

The model currently has two options for users to specify data (with a recipes interface coming soon!):

a data matrix X together with a response vector y
a formula formula together with a data frame data

The user specifies how many folds to use and potentially how many times to repeat the cross validation via some arguments to the model fitting function.

My initial thought was to extend the formula interface to allow the data argument to be an rset. But I don't think there's a clean parallel for the matrix interface since rset generators need a data frame.

An alternative might be to write a resampling specification in the vein of recipes that could then be applied to either a matrix or a data frame. This is more appealing to me since it keeps the data entry and model calibration options separate. In our case at least, this model calibration specification could come as a package deal including resampling, loss function, loss strategy (i.e. min RMSE vs sparsest within 1-se) and some other custom tuning knobs.

The workflow might then look something like this

trained_model <- 
  model(formula/X,
        data/y,
        calibration = new_calibration(method = "vfold_cv",
                                      v = 5,
                                      loss = "rmse",
                                      rule = "1-se",
                                      custom_option = "blah"),
        ...)

What do you think is the best direction to go here?

(Aside: is there a way to extract train/test indices from an rsplit object? This would let us reuse a lot of existing code).

Block bootstrapping (eg time series)

I can't quite make out how one might use this package to do block bootstraps (eg resampling disjoint groups, or a moving/overlapping block bootstrap) that might be used for time-series data.

Am I just missing how to do that, or if not, are there any plans to add that functionality?
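For reference, the index generation for a moving (overlapping) block bootstrap can be sketched in a few lines of base R. `moving_block_indices()` is a hypothetical helper, not part of rsample. It draws block start positions with replacement, expands each start into a run of consecutive row indices, and trims the result to the original length.

```r
# Hypothetical helper: row indices for one moving block bootstrap resample
moving_block_indices <- function(n, block_size) {
  stopifnot(block_size >= 1, block_size <= n)
  n_blocks <- ceiling(n / block_size)
  # Every start position whose block fits inside 1..n is a candidate
  starts <- sample.int(n - block_size + 1, n_blocks, replace = TRUE)
  idx <- unlist(lapply(starts, function(s) seq.int(s, length.out = block_size)))
  # Trim so the resample has the same length as the original series
  idx[seq_len(n)]
}

set.seed(42)
idx <- moving_block_indices(n = 20, block_size = 5)
idx
```

Indexing a time-ordered data frame with `idx` keeps the within-block ordering intact, which is the point of block resampling.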

add recipes example in live code

Concordance in Survival Analysis example

I'm not an expert on this, but shouldn't the function coxph be used instead of survreg?

mod_fit <- function(x, form, ...) 
  survreg(form, data = analysis(x), ...)

I was actually following this example for my own analysis, but several references (like this one) mention that the concordance index should be between 0.5 and 1, otherwise the model is really poor. Since I was getting values around 0.3, I was wondering what I was doing wrong and if the models that I was getting were really that bad. Then I found out that survreg fits a parametric survival regression model while coxph fits a Cox proportional hazards regression model. Using the second one you get more reasonable concordance indexes. Also, using the latter and a similar coding scheme as get_concord, you can get the calibration slopes to further validate the models.

splits function

I'm not sure I get the advantage of

splits(bt_resamples, .elem = "splits")[[1]]

over

bt_resamples$splits$splits[[1]]

Maybe we could reduce the need for it with a better print method:

i.e. instead of

# A tibble: 3 x 2
        splits         id
        <list>      <chr>
1 <S3: rsplit> Bootstrap1
2 <S3: rsplit> Bootstrap2
3 <S3: rsplit> Bootstrap3

We could have

# A tibble: 3 x 2
        splits         id
        <list>      <chr>
1  <32/12/32> Bootstrap1
2  <32/12/32> Bootstrap2
3  <32/12/32> Bootstrap3

Although maybe a little more structure would help, i.e. <32/12 : 32>?

Merge additional samplers implemented in the `resamplr` package?

In https://github.com/jrnold/resamplr, I had implemented quite a few samplers extending the resample objects from the modelr package. This included: bootstrap, jackknife, random test/train sets, k-fold cross-validation, leave-one-out and leave-p-out cross-validation, time-series cross validation, time-series k-fold cross validation, permutations, rolling windows. They also generally respected grouped data frames from clustered or stratified sampling. However, my package seems redundant now. I was wondering if you were interested in me incorporating any samplers from resamplr that aren't in rsample if I were to submit a pull request.

nested CV: oversampled data for inner loop but not for outer loop

Hi,

I have an imbalanced 3-class problem and hence I want to use (SMOTE) oversampling in order to train/tune the classifiers more reliably.
But, I would like to evaluate my classifiers in the outside loop using the original data, because I know that, in the future, my new data will be imbalanced.

Looking at e.g. nested_cv() and its source code, I found no way to do this easily, since there is only one data argument and both the inside and outside resamples are built on the same data.

Do you have any ideas on how to do this using your very handy framework?

Expose "constructors" for rsample & rset objects

Lately I've been making splits and rsets in some non-standard ways for internal (work) packages. I've been getting along fine using new_rset/make_splits/rsplit, but there's always a little discomfort in depending on triple-colon hidden functions in another package.

Maybe this is already on the roadmap, but I would greatly appreciate exported functions for making custom objects from this package. I suspect you'll want to wait until you've solidified the API some more, but hopefully by making a note of it here you might consider @export-ing new_rset/make_splits when the time is right. Thanks!

remote data reference?

Since we are always referring to the same data set it would be advantageous to make the data object a reference. I could see:

an assignment of the data to the global environment or a custom environment, or
a data.table object

If implemented, it should be optional.

@hadley any thoughts? I'm sure that this occurred to you in modelr

write an assignment method for splits

Once an rset object is created, there should be a shortcut to get to the split objects:

bootstraps <- boot(Sacramento, times = 2000, oob = FALSE)
# Get to the resamples using
bootstraps$splits$splits

The extraction function splits currently accesses this but we should have a shortcut to assign one or more columns to the tibble bootstraps$splits sort of like

rownames(x) <- some_vector

we should be able to do

add(rset_obj, value) <- along(rset_obj, func)

(or maybe include or append or some other verb would be better).

This should work if the rhs is a vector or matrix/data.frame/tibble although another verb might be better for the multivariate case (such as join, or even cbind)

conversion function for caret resampling indices

rsample2caret or something like that to generate the index and indexOut list with proper labels. Make a generic method.

Bayesian analysis of model resampling results

Similar to this paper and caret's resamples using stan.

Add `shuffle` arg to initial_split

I'm a big fan of initial_split(); feels very natural to make one "big" split, then e.g. k-fold the training set of the big split.

Would it make sense to allow for ordered/non-shuffling samples, for use in timeseries problems where further sampling would be done by rolling_origin?

The easiest solution is probably to add an arg like shuffle = FALSE to initial_split, which would mirror how scikit-learn does it (http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).

It's easy to imagine going deeper, allowing for e.g. the skip arg of rolling_origin, or an order_by (similar to tidymodels/recipes#171), but starting small with a shuffle arg might be a good start.
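The non-shuffled variant amounts to taking the first rows for training and the rest for testing. A minimal base R sketch of the index logic; `ordered_split_indices()` is a hypothetical name, and the `prop` default of 0.75 is an assumption chosen for illustration.

```r
# Hypothetical helper: ordered (non-shuffled) train/test indices
ordered_split_indices <- function(n, prop = 0.75) {
  n_train <- floor(n * prop)
  stopifnot(n_train >= 1, n_train < n)
  list(
    analysis = seq_len(n_train),
    assessment = seq.int(n_train + 1, n)
  )
}

idx <- ordered_split_indices(nrow(mtcars), prop = 0.75)
range(idx$analysis)
range(idx$assessment)
```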

Add context() calls to test files.

This package doesn't play nice when running tests-only, e.g. devtools::test() or w/ RStudio. If all tests pass it's fine, but if one is broken you get unhelpful errors and no sense of locality, making TDD very difficult.

Adding one context() call to each test_*.R file should fix this.

I can do this.

stratification on numeric data

caret currently stratifies by dicing the data into percentiles. I think the rsample algorithm should be something like:

if #unique < 5
|  determine if there are values that occur infrequently and pool
|  convert to factor
else 
|  determine #breaks for percentiles based on sample size
|  convert data to bins
end

Regardless of data type, we should check to see if the stratification failed and, if so, do an unstratified split with a corresponding warning.
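The pseudocode above could be fleshed out roughly as follows in base R. This is a sketch, not rsample's implementation: `make_strata_sketch()` and its arguments `max_breaks`, `min_per_bin`, and `pool` are hypothetical, and their defaults are arbitrary. Returning NULL with a warning stands in for "fall back to an unstratified split".

```r
# Hypothetical helper turning a vector into a stratification factor
make_strata_sketch <- function(x, max_breaks = 4, min_per_bin = 20, pool = 0.1) {
  if (length(unique(x)) < 5) {
    # Few distinct values: pool the infrequent ones, then treat as a factor
    freq <- table(x) / length(x)
    rare <- names(freq)[freq < pool]
    out <- as.character(x)
    out[out %in% rare] <- "pooled"
    out <- factor(out)
  } else {
    # Many distinct values: the number of percentile bins depends on sample size
    n_breaks <- max(2, min(max_breaks, floor(length(x) / min_per_bin)))
    probs <- seq(0, 1, length.out = n_breaks + 1)
    cuts <- unique(quantile(x, probs = probs, names = FALSE))
    out <- cut(x, breaks = cuts, include.lowest = TRUE)
  }
  if (nlevels(out) < 2) {
    warning("Stratification failed; falling back to an unstratified split.")
    return(NULL)
  }
  out
}

few <- make_strata_sketch(c(rep(1, 50), rep(2, 45), rep(3, 5)))
many <- make_strata_sketch(1:100)
table(few)
table(many)
```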

should rsplits have an extra class for the split type?

It might be nice to have this so that methods can be made to compute the complement of an analysis set and other stuff.

rolling_origin() should allow for an expanding window + sliding window in the same call

This Uber post talks about how useful it can be to expand a window up until you have enough data, and then proceed to use a sliding window. I think that could be useful and I don't think rolling_origin() does this at the moment. https://eng.uber.com/forecasting-introduction/
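To make the request concrete, here is a base-R sketch that produces row indices for such a scheme. `expand_then_slide()` and its `max_window` argument are hypothetical; `initial` and `assess` are only meant to echo the rolling_origin() arguments of the same name.

```r
# Hypothetical helper: grow the analysis window until it holds `max_window`
# rows, then slide it forward at that fixed size.
expand_then_slide <- function(n, initial = 5, assess = 1, max_window = 8) {
  stopifnot(n - assess >= initial, max_window >= initial)
  stops <- seq(initial, n - assess)
  lapply(stops, function(stop) {
    start <- max(1, stop - max_window + 1)
    list(
      analysis   = start:stop,
      assessment = (stop + 1):(stop + assess)
    )
  })
}

splits <- expand_then_slide(n = 12)
length(splits)         # 7
splits[[1]]$analysis   # 1 2 3 4 5            (still expanding)
splits[[7]]$analysis   # 4 5 6 7 8 9 10 11    (sliding, 8 rows)
```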

permutations

The package should have a permute sampler that has a dplyr variable selector argument for choosing which variables should be shuffled. An original (or better named) option would signal that a copy of the original data is contained as a resample.

This would not happen for the first release unless selectr is already on CRAN.

This isn't really a resample, so maybe it shouldn't be here.

Typo in "Recipes with rsample" vignette

Hi,

I accidentally posted this in the recipes repo before I realized this was an rsample vignette:
tidymodels/recipes#98

The problem is in this vignette:

https://cran.r-project.org/web/packages/rsample/vignettes/Recipes_and_rsample.html

Right before the "Working with Resamples" section there is this sentence:

The next section will explore recipes using bootstrap resampling is used for modeling:

I believe they mean

The next section will explore recipes when bootstrap resampling is used for modeling:

The next section will explore recipes using bootstrap resampling for modeling:

[Feature Request] Shuffle - for permutation tests

It would be nice to have a function similar to bootstraps, but with replace = FALSE, that allows users to specify a value column and a key column. The use case for a function like this would be permutation tests.

Following the example from the website, I could see something like

median_diff <- function(splits) {
  # extract the resampled data from the split before computing the statistic
  x <- analysis(splits)
  median(x$MonthlyIncome[x$Gender == "Female"]) - median(x$MonthlyIncome[x$Gender == "Male"])
}

shuffle_resamples <- shuffle(attrition, key = Gender, value = MonthlyIncome, times = 500)

shuffle_resamples$wage_diff <- map_dbl(shuffle_resamples$splits, median_diff)

Because the keys are shuffled without replacement, this should create a null distribution that can be compared with the sample median_diff.
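For reference, the same permutation idea can be sketched in base R without any new rsample function. The data below are simulated stand-ins for the attrition columns (an assumption for this example), and `median_diff()` here takes a plain data frame rather than a split.

```r
set.seed(123)
# Simulated stand-in for the attrition data: a two-level key and a value column
dat <- data.frame(
  Gender = rep(c("Female", "Male"), each = 50),
  MonthlyIncome = c(rnorm(50, mean = 5000, sd = 800),
                    rnorm(50, mean = 5200, sd = 800))
)

median_diff <- function(d) {
  median(d$MonthlyIncome[d$Gender == "Female"]) -
    median(d$MonthlyIncome[d$Gender == "Male"])
}

observed <- median_diff(dat)

# Shuffle the key without replacement and recompute the statistic each time
null_dist <- replicate(500, {
  shuffled <- dat
  shuffled$Gender <- sample(shuffled$Gender)  # sample() permutes by default
  median_diff(shuffled)
})

# Two-sided permutation p-value
p_value <- mean(abs(null_dist) >= abs(observed))
```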

An idea for an example: rolling_origin() over irregular time chunks

This is not an issue, more of an example idea for rolling_origin().

I've been working closely with the tsibble team on some of their rolling functions, and we worked on rolling over irregular calendar periods. I think this could be a neat addition for rolling_origin() and luckily it's already built in. This could be a neat example for the docs somewhere, and I think the idea overall is pretty powerful, especially for time series modeling work where you want to ensure you are using calendar windows rather than fixed windows of say 5 periods.

library(rsample)
library(riingo)
library(tidyverse)
library(tsibble)

aapl <- riingo::riingo_prices("AAPL", start_date = "2000-01-01")

aapl
#> # A tibble: 4,658 x 14
#>    ticker date                close  high   low  open volume adjClose
#>    <chr>  <dttm>              <dbl> <dbl> <dbl> <dbl>  <int>    <dbl>
#>  1 AAPL   2000-01-03 00:00:00 112.  112.  102.  105.  4.78e6     3.56
#>  2 AAPL   2000-01-04 00:00:00 102.  111.  101.  108.  4.57e6     3.26
#>  3 AAPL   2000-01-05 00:00:00 104   111.  103   104.  6.95e6     3.30
#>  4 AAPL   2000-01-06 00:00:00  95   107    95   106.  6.86e6     3.02
#>  5 AAPL   2000-01-07 00:00:00  99.5 101    95.5  96.5 4.11e6     3.16
#>  6 AAPL   2000-01-10 00:00:00  97.8 102.   94.8 102   4.51e6     3.10
#>  7 AAPL   2000-01-11 00:00:00  92.8  99.4  90.5  95.9 3.94e6     2.95
#>  8 AAPL   2000-01-12 00:00:00  87.2  95.5  86.5  95   8.71e6     2.77
#>  9 AAPL   2000-01-13 00:00:00  96.8  98.8  92.5  94.5 9.22e6     3.07
#> 10 AAPL   2000-01-14 00:00:00 100.  102.   99.4 100   3.49e6     3.19
#> # ... with 4,648 more rows, and 6 more variables: adjHigh <dbl>,
#> #   adjLow <dbl>, adjOpen <dbl>, adjVolume <int>, divCash <dbl>,
#> #   splitFactor <dbl>

# Nest monthly
aapl_monthly_nested <- aapl %>%
  mutate(ym = yearmonth(date)) %>%
  nest(-ym)

aapl_monthly_nested
#> # A tibble: 223 x 2
#>          ym data              
#>       <mth> <list>            
#>  1 2000 Jan <tibble [20 × 14]>
#>  2 2000 Feb <tibble [20 × 14]>
#>  3 2000 Mar <tibble [23 × 14]>
#>  4 2000 Apr <tibble [19 × 14]>
#>  5 2000 May <tibble [22 × 14]>
#>  6 2000 Jun <tibble [22 × 14]>
#>  7 2000 Jul <tibble [20 × 14]>
#>  8 2000 Aug <tibble [23 × 14]>
#>  9 2000 Sep <tibble [20 × 14]>
#> 10 2000 Oct <tibble [22 × 14]>
#> # ... with 213 more rows

# Then roll over the months
# 5 months in each slice (irregular number of days in each 5 month set)
aapl_rolled <- aapl_monthly_nested %>%
  rolling_origin(cumulative = FALSE)

# Now you can map over irregular rolling subsets
map(aapl_rolled$splits, ~ analysis(.x)) %>%
  head
#> [[1]]
#> # A tibble: 5 x 2
#>         ym data              
#>      <mth> <list>            
#> 1 2000 Jan <tibble [20 × 14]>
#> 2 2000 Feb <tibble [20 × 14]>
#> 3 2000 Mar <tibble [23 × 14]>
#> 4 2000 Apr <tibble [19 × 14]>
#> 5 2000 May <tibble [22 × 14]>
#> 
#> [[2]]
#> # A tibble: 5 x 2
#>         ym data              
#>      <mth> <list>            
#> 1 2000 Feb <tibble [20 × 14]>
#> 2 2000 Mar <tibble [23 × 14]>
#> 3 2000 Apr <tibble [19 × 14]>
#> 4 2000 May <tibble [22 × 14]>
#> 5 2000 Jun <tibble [22 × 14]>
#> 
#> [[3]]
#> # A tibble: 5 x 2
#>         ym data              
#>      <mth> <list>            
#> 1 2000 Mar <tibble [23 × 14]>
#> 2 2000 Apr <tibble [19 × 14]>
#> 3 2000 May <tibble [22 × 14]>
#> 4 2000 Jun <tibble [22 × 14]>
#> 5 2000 Jul <tibble [20 × 14]>
#> 
#> [[4]]
#> # A tibble: 5 x 2
#>         ym data              
#>      <mth> <list>            
#> 1 2000 Apr <tibble [19 × 14]>
#> 2 2000 May <tibble [22 × 14]>
#> 3 2000 Jun <tibble [22 × 14]>
#> 4 2000 Jul <tibble [20 × 14]>
#> 5 2000 Aug <tibble [23 × 14]>
#> 
#> [[5]]
#> # A tibble: 5 x 2
#>         ym data              
#>      <mth> <list>            
#> 1 2000 May <tibble [22 × 14]>
#> 2 2000 Jun <tibble [22 × 14]>
#> 3 2000 Jul <tibble [20 × 14]>
#> 4 2000 Aug <tibble [23 × 14]>
#> 5 2000 Sep <tibble [20 × 14]>
#> 
#> [[6]]
#> # A tibble: 5 x 2
#>         ym data              
#>      <mth> <list>            
#> 1 2000 Jun <tibble [22 × 14]>
#> 2 2000 Jul <tibble [20 × 14]>
#> 3 2000 Aug <tibble [23 × 14]>
#> 4 2000 Sep <tibble [20 × 14]>
#> 5 2000 Oct <tibble [22 × 14]>

# When writing the function, you access the full rolling months' worth of data 
# with this bind_rows idea
.x <- analysis(aapl_rolled$splits[[1]])
bind_rows(.x$data)
#> # A tibble: 104 x 14
#>    ticker date                close  high   low  open volume adjClose
#>    <chr>  <dttm>              <dbl> <dbl> <dbl> <dbl>  <int>    <dbl>
#>  1 AAPL   2000-01-03 00:00:00 112.  112.  102.  105.  4.78e6     3.56
#>  2 AAPL   2000-01-04 00:00:00 102.  111.  101.  108.  4.57e6     3.26
#>  3 AAPL   2000-01-05 00:00:00 104   111.  103   104.  6.95e6     3.30
#>  4 AAPL   2000-01-06 00:00:00  95   107    95   106.  6.86e6     3.02
#>  5 AAPL   2000-01-07 00:00:00  99.5 101    95.5  96.5 4.11e6     3.16
#>  6 AAPL   2000-01-10 00:00:00  97.8 102.   94.8 102   4.51e6     3.10
#>  7 AAPL   2000-01-11 00:00:00  92.8  99.4  90.5  95.9 3.94e6     2.95
#>  8 AAPL   2000-01-12 00:00:00  87.2  95.5  86.5  95   8.71e6     2.77
#>  9 AAPL   2000-01-13 00:00:00  96.8  98.8  92.5  94.5 9.22e6     3.07
#> 10 AAPL   2000-01-14 00:00:00 100.  102.   99.4 100   3.49e6     3.19
#> # ... with 94 more rows, and 6 more variables: adjHigh <dbl>,
#> #   adjLow <dbl>, adjOpen <dbl>, adjVolume <int>, divCash <dbl>,
#> #   splitFactor <dbl>

Created on 2018-07-10 by the reprex package (v0.2.0).

Add an attribute for the resample identifier(s)

e.g. id, id2, fold etc

issues when loading a rsample tibble with rdata

Hi,

I know this isn't a great contribution, but I wanted to report a problem (possibly a bug) I found when loading a big dataset. My dataset has 86 rows and 7000 columns (a gene expression dataset). When I saved my environment and tried to load it again, the memory limit of my machine (32 GB, Windows 10) was exceeded. I understand that rsample stores large data memory-efficiently while working; perhaps this doesn't hold when loading because R needs to read each dataset separately?
I was using the usual commands:
save from the RStudio Environment option
load("my_env.Rdata")

Error message (classic):
Memory Allocation “Error: cannot allocate vector of size 75.1 Mb”

I already increased memory.limit, checked the CPU Performance.

carlos,
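One possible explanation, offered as an assumption rather than something confirmed in this thread: in memory, every split can point at the same data frame, but R's serialization writes an ordinary vector out once per reference, so that sharing may be lost on save() and load(). The base-R sketch below shows the effect on a plain list that holds the same vector five times.

```r
x <- rnorm(1e5)

one  <- length(serialize(x, NULL))
many <- length(serialize(list(x, x, x, x, x), NULL))

# In memory the five list elements refer to the same vector,
# but the serialized form stores the values once per element.
many / one
```

If this is what happens with a saved resampling object, saving the original data and the row indices separately, then rebuilding the resamples after loading, would avoid the blow-up.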

group-preserving sampling

I don't know what to call this, but I'll try to explain my use-case.

I've got a set of data I want to split up for cross-validation (assume v-folding). These observations have a grouping variable, and I want to ensure all groups are kept together, and never split up, when sampling here. Almost an opposite of strata.

As an example, we could use mtcars and the cyl variable; there's 3 unique values (4, 6, 8), so a 3-fold of this type should produce one fold where the assessment is only cyl = 4, another where cyl = 6, etc.

To do that now I have to k-fold on just distinct(mtcars, cyl), and then do something hacky to "expand" those folds.

Would it be possible to combine nested_cv with #23 to achieve this? If not, and this is worthy of inclusion, I'd be happy to help code it up.

Here's my hack:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(purrr)
library(rsample)
#> Loading required package: broom

# suppose we want to keep cylinder-groups together
# we'll vfold those instead of the whole thing
initial_fold <- mtcars %>%
  distinct(cyl) %>%
  vfold_cv(v = 3)

# take a look
initial_fold %>%
  pull(splits) %>%
  map(assessment)
#> $`1`
#>   cyl
#> 2   4
#> 
#> $`2`
#>   cyl
#> 3   8
#> 
#> $`3`
#>   cyl
#> 1   6

# take an existing rsplit object and expand given the original it was subsampled from
expand_split <- function(split, orig) {
  
  vars <- colnames(split$data)
  
  rows_in  <- which(pull(orig, vars) %in% pull(analysis(split), vars))
  rows_out <- which(pull(orig, vars) %in% pull(assessment(split), vars))
  
  rsample:::rsplit(orig, sample(rows_in), sample(rows_out))  # can't forget to shuffle
}

# now apply that to each split
expanded_fold <- initial_fold %>%
  mutate(splits = map(splits, expand_split, mtcars))

# take a look
expanded_fold %>%
  pull(splits) %>%
  map(assessment)
#> $`1`
#>                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
#> Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
#> Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
#> Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
#> Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
#> Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
#> Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
#> Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
#> Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
#> Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
#> Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
#> 
#> $`2`
#>                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
#> Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
#> Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
#> Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
#> Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
#> Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
#> Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
#> Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
#> Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
#> Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
#> Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
#> Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
#> Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
#> AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
#> 
#> $`3`
#>                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
#> Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
#> Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#> Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
#> Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#> Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#> Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4

# other side
expanded_fold %>%
  pull(splits) %>%
  map(analysis)
#> $`1`
#>                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#> Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
#> Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
#> Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#> AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
#> Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
#> Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
#> Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
#> Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
#> Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
#> Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
#> Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
#> Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
#> Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
#> Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#> Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
#> Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
#> Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
#> Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
#> Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
#> Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
#> 
#> $`2`
#>                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Merc 280C      17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
#> Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
#> Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
#> Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
#> Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
#> Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#> Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#> Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
#> Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#> Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
#> Valiant        18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
#> Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
#> Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
#> Merc 280       19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
#> Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
#> Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
#> Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
#> Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
#> 
#> $`3`
#>                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
#> Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
#> Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
#> Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
#> Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
#> Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
#> Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
#> Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
#> Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
#> Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
#> Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
#> Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
#> Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
#> Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
#> Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
#> AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
#> Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
#> Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
#> Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
#> Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
#> Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
#> Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
#> Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
#> Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
#> Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2

Also available in this gist.

Tried to make a multi-variable version, but it's a lot harder to get "indices of rows in tibble x that match tibble y" than I expected; dplyr pushes that all down into Rcpp-land for the *_join functions :(
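One way to get "indices of rows in x that match y" on several columns without a join is to build a composite key per row and compare keys. This is a sketch only: `match_rows()` is a hypothetical helper, and pasting values into strings assumes the columns convert to character identically in both tables (true here because `y` is derived from `x`).

```r
# Hypothetical helper: which rows of `x` match any row of `y` on `vars`?
match_rows <- function(x, y, vars = intersect(names(x), names(y))) {
  # One string key per row; "\r" is an unlikely separator inside real values
  key <- function(d) do.call(paste, c(unname(as.list(d[vars])), sep = "\r"))
  which(key(x) %in% key(y))
}

# Keep (cyl, am) groups together: hold out the first two distinct combinations
groups   <- unique(mtcars[c("cyl", "am")])
held_out <- groups[1:2, ]
rows_out <- match_rows(mtcars, held_out)
rows_in  <- setdiff(seq_len(nrow(mtcars)), rows_out)

unique(mtcars[rows_out, c("cyl", "am")])
```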

along, split, data structures etc

Is there any way we could make rset a tibble, instead of including a tibble? I think that would make it possible to eliminate along() and split().

Allow for resampling of vectors

Max - Thanks for all of the excellent work you do!

I was writing some code to calculate bootstrapped statistics and found that I wanted to do this on a vector instead of a data frame.

Cheers.

Minimal, reproducible example:

# Setup
library(rsample)
library(purrr)

# What I want to accomplish
statistic <- rnorm(1000)
bootstrapped_mean <- map_dbl(bootstraps(statistic)$splits, ~ mean(analysis(.x)))
#> Error: `num` should be > 0
summary(bootstrapped_mean)
#> Error in summary(bootstrapped_mean): object 'bootstrapped_mean' not found

# What I have to do
df = data.frame(statistic = rnorm(1000))
bootstrapped_mean <- map_dbl(bootstraps(df)$splits, ~ mean(analysis(.x)$statistic))
summary(bootstrapped_mean)
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
#> -0.10787 -0.09432 -0.06594 -0.06679 -0.04551  0.04240
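For comparison, the computation itself needs only base R when the input is a bare vector. This sketch does not use rsample at all; the 25 replicates are chosen to mirror the default `times = 25` of bootstraps().

```r
set.seed(1)
statistic <- rnorm(1000)

# Each bootstrap replicate resamples the vector with replacement
boot_means <- replicate(25, mean(sample(statistic, replace = TRUE)))
summary(boot_means)
```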

Session info

devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.5.0 (2018-04-23)
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  English_United States.1252  
#>  tz       America/New_York            
#>  date     2018-05-19
#> Packages -----------------------------------------------------------------
#>  package    * version    date       source        
#>  abind        1.4-5      2016-07-21 CRAN (R 3.5.0)
#>  assertthat   0.2.0      2017-04-11 CRAN (R 3.5.0)
#>  backports    1.1.2      2017-12-13 CRAN (R 3.5.0)
#>  base       * 3.5.0      2018-04-23 local         
#>  bindr        0.1.1      2018-03-13 CRAN (R 3.5.0)
#>  bindrcpp     0.2.2      2018-03-29 CRAN (R 3.5.0)
#>  broom      * 0.4.4      2018-03-29 CRAN (R 3.5.0)
#>  class        7.3-14     2015-08-30 CRAN (R 3.5.0)
#>  compiler     3.5.0      2018-04-23 local         
#>  CVST         0.2-1      2013-12-10 CRAN (R 3.5.0)
#>  datasets   * 3.5.0      2018-04-23 local         
#>  ddalpha      1.3.3      2018-04-30 CRAN (R 3.5.0)
#>  DEoptimR     1.0-8      2016-11-19 CRAN (R 3.5.0)
#>  devtools     1.13.5     2018-02-18 CRAN (R 3.5.0)
#>  digest       0.6.15     2018-01-28 CRAN (R 3.5.0)
#>  dimRed       0.1.0      2017-05-04 CRAN (R 3.5.0)
#>  dplyr        0.7.4      2017-09-28 CRAN (R 3.5.0)
#>  DRR          0.0.3      2018-01-06 CRAN (R 3.5.0)
#>  evaluate     0.10.1     2017-06-24 CRAN (R 3.5.0)
#>  foreign      0.8-70     2017-11-28 CRAN (R 3.5.0)
#>  geometry     0.3-6      2015-09-09 CRAN (R 3.5.0)
#>  glue         1.2.0      2017-10-29 CRAN (R 3.5.0)
#>  gower        0.1.2      2017-02-23 CRAN (R 3.5.0)
#>  graphics   * 3.5.0      2018-04-23 local         
#>  grDevices  * 3.5.0      2018-04-23 local         
#>  grid         3.5.0      2018-04-23 local         
#>  htmltools    0.3.6      2017-04-28 CRAN (R 3.5.0)
#>  ipred        0.9-6      2017-03-01 CRAN (R 3.5.0)
#>  kernlab      0.9-26     2018-04-30 CRAN (R 3.5.0)
#>  knitr        1.20       2018-02-20 CRAN (R 3.5.0)
#>  lattice      0.20-35    2017-03-25 CRAN (R 3.5.0)
#>  lava         1.6.1      2018-03-28 CRAN (R 3.5.0)
#>  lubridate    1.7.4      2018-04-11 CRAN (R 3.5.0)
#>  magic        1.5-8      2018-01-26 CRAN (R 3.5.0)
#>  magrittr     1.5        2014-11-22 CRAN (R 3.5.0)
#>  MASS         7.3-49     2018-02-23 CRAN (R 3.5.0)
#>  Matrix       1.2-14     2018-04-13 CRAN (R 3.5.0)
#>  memoise      1.1.0      2017-04-21 CRAN (R 3.5.0)
#>  methods    * 3.5.0      2018-04-23 local         
#>  mnormt       1.5-5      2016-10-15 CRAN (R 3.5.0)
#>  nlme         3.1-137    2018-04-07 CRAN (R 3.5.0)
#>  nnet         7.3-12     2016-02-02 CRAN (R 3.5.0)
#>  parallel     3.5.0      2018-04-23 local         
#>  pillar       1.2.2      2018-04-26 CRAN (R 3.5.0)
#>  pkgconfig    2.0.1      2017-03-21 CRAN (R 3.5.0)
#>  plyr         1.8.4      2016-06-08 CRAN (R 3.5.0)
#>  prodlim      2018.04.18 2018-04-18 CRAN (R 3.5.0)
#>  psych        1.8.4      2018-05-06 CRAN (R 3.5.0)
#>  purrr      * 0.2.4      2017-10-18 CRAN (R 3.5.0)
#>  R6           2.2.2      2017-06-17 CRAN (R 3.5.0)
#>  Rcpp         0.12.16    2018-03-13 CRAN (R 3.5.0)
#>  RcppRoll     0.2.2      2015-04-05 CRAN (R 3.5.0)
#>  recipes      0.1.2      2018-01-11 CRAN (R 3.5.0)
#>  reshape2     1.4.3      2017-12-11 CRAN (R 3.5.0)
#>  rlang        0.2.0      2018-02-20 CRAN (R 3.5.0)
#>  rmarkdown    1.9        2018-03-01 CRAN (R 3.5.0)
#>  robustbase   0.93-0     2018-04-24 CRAN (R 3.5.0)
#>  rpart        4.1-13     2018-02-23 CRAN (R 3.5.0)
#>  rprojroot    1.3-2      2018-01-03 CRAN (R 3.5.0)
#>  rsample    * 0.0.2      2017-11-12 CRAN (R 3.5.0)
#>  sfsmisc      1.1-2      2018-03-05 CRAN (R 3.5.0)
#>  splines      3.5.0      2018-04-23 local         
#>  stats      * 3.5.0      2018-04-23 local         
#>  stringi      1.2.2      2018-05-02 CRAN (R 3.5.0)
#>  stringr      1.3.0      2018-02-19 CRAN (R 3.5.0)
#>  survival     2.41-3     2017-04-04 CRAN (R 3.5.0)
#>  tibble       1.4.2      2018-01-22 CRAN (R 3.5.0)
#>  tidyr      * 0.8.0      2018-01-29 CRAN (R 3.5.0)
#>  tidyselect   0.2.4      2018-02-26 CRAN (R 3.5.0)
#>  timeDate     3043.102   2018-02-21 CRAN (R 3.5.0)
#>  tools        3.5.0      2018-04-23 local         
#>  utils      * 3.5.0      2018-04-23 local         
#>  withr        2.1.2      2018-03-15 CRAN (R 3.5.0)
#>  yaml         2.1.19     2018-05-01 CRAN (R 3.5.0)

Easier flattening via unnest.rset method

In some cases working with a list column of tibbles is overkill. For example:

 bootstraps(mtcars) %>% 
   mutate(train = map(splits, assessment)) %>%
   unnest(train) %>% 
   group_by(id) %>% 
   summarize(mean = mean(mpg))

I propose adding an rset method for unnest() so that only a single line is needed to flatten an rset:

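unnest.rset <- function (data, dataset = c("assessment", "analysis"), ...) {
  # resolve the choice to a single function name, then pipe from the rset itself
  dataset <- match.arg(dataset)
  data %>%
    mutate(train = map(splits, get(dataset))) %>%
    unnest(train) %>% 
    group_by(id)
}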

or an unnest_by() once that comes to dplyr.

Remove deprecated fill() in 0.0.4

This issue serves as a reminder to remove fill() in v0.0.4.

Package vignettes

From current pkgdown versions

as.data.frame.rsplit

I think you can simplify to:

as.data.frame.rsplit <- function(x, data = "analysis", ...) {
  x$data[as.integer(x, data = data), , drop = FALSE]
}

Might be a good idea for as.integer to be a bit stricter:

as.integer.rsplit <- function(x, data = c("analysis", "assessment"), ...) {
  data <- match.arg(data)
  ...
}

(that would also allow you to use prefixes)
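A short base-R illustration of the prefix behaviour mentioned above. `which_data()` is a hypothetical stand-in that shows only the argument handling, not the actual method.

```r
# Hypothetical stand-in for the method, showing only the argument check
which_data <- function(data = c("analysis", "assessment")) {
  match.arg(data)
}

which_data()             # "analysis": the first choice is the default
which_data("assess")     # "assessment": a unique prefix is accepted
try(which_data("test"))  # error: not one of the allowed choices
```

Note that an ambiguous prefix such as "a" also raises an error, since it matches both choices.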

library(tidyverse)
library(rsample)
#> Loading required package: broom
#> 
#> Attaching package: 'rsample'
#> The following object is masked from 'package:tidyr':
#> 
#>     fill

tibble(
  id = LETTERS[1:6],
  val = 1:6
) %>%
  rolling_origin(initial = 5, assess = 1)  # defaults
#> Error: There should be at least 6 nrows in `data`

Created on 2018-09-07 by the reprex package (v0.2.0).

Session info

devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.4.4 (2018-03-15)
#>  system   x86_64, linux-gnu           
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  tz       America/Detroit             
#>  date     2018-09-07
#> Packages -----------------------------------------------------------------
#>  package    * version    date       source                             
#>  assertthat   0.2.0      2017-04-11 CRAN (R 3.4.4)                     
#>  backports    1.1.2      2017-12-13 CRAN (R 3.4.4)                     
#>  base       * 3.4.4      2018-03-16 local                              
#>  bindr        0.1.1      2018-03-13 CRAN (R 3.4.4)                     
#>  bindrcpp     0.2.2      2018-03-29 CRAN (R 3.4.4)                     
#>  broom      * 0.5.0      2018-07-17 CRAN (R 3.4.4)                     
#>  cellranger   1.1.0      2016-07-27 CRAN (R 3.4.4)                     
#>  cli          1.0.0      2017-11-05 CRAN (R 3.4.4)                     
#>  colorspace   1.3-2      2016-12-14 CRAN (R 3.4.4)                     
#>  compiler     3.4.4      2018-03-16 local                              
#>  crayon       1.3.4      2017-09-16 CRAN (R 3.4.4)                     
#>  datasets   * 3.4.4      2018-03-16 local                              
#>  devtools     1.13.6     2018-06-27 CRAN (R 3.4.4)                     
#>  digest       0.6.16     2018-08-22 CRAN (R 3.4.4)                     
#>  dplyr      * 0.7.6      2018-06-29 cran (@0.7.6)                      
#>  evaluate     0.11       2018-07-17 CRAN (R 3.4.4)                     
#>  forcats    * 0.3.0      2018-02-19 CRAN (R 3.4.4)                     
#>  ggplot2    * 3.0.0      2018-07-03 cran (@3.0.0)                      
#>  glue         1.3.0      2018-07-17 CRAN (R 3.4.4)                     
#>  graphics   * 3.4.4      2018-03-16 local                              
#>  grDevices  * 3.4.4      2018-03-16 local                              
#>  grid         3.4.4      2018-03-16 local                              
#>  gtable       0.2.0      2016-02-26 CRAN (R 3.4.4)                     
#>  haven        1.1.2      2018-06-27 CRAN (R 3.4.4)                     
#>  hms          0.4.2.9000 2018-07-03 Github (tidyverse/hms@2e0a39a)     
#>  htmltools    0.3.6      2017-04-28 CRAN (R 3.4.4)                     
#>  httr         1.3.1      2017-08-20 CRAN (R 3.4.4)                     
#>  jsonlite     1.5        2017-06-01 CRAN (R 3.4.4)                     
#>  knitr        1.20       2018-02-20 CRAN (R 3.4.4)                     
#>  lattice      0.20-35    2017-03-25 CRAN (R 3.3.3)                     
#>  lazyeval     0.2.1      2017-10-29 CRAN (R 3.4.4)                     
#>  lubridate    1.7.4      2018-04-11 CRAN (R 3.4.4)                     
#>  magrittr     1.5        2014-11-22 CRAN (R 3.4.4)                     
#>  memoise      1.1.0      2017-04-21 CRAN (R 3.4.4)                     
#>  methods    * 3.4.4      2018-03-16 local                              
#>  modelr       0.1.2      2018-05-11 CRAN (R 3.4.4)                     
#>  munsell      0.5.0      2018-06-12 CRAN (R 3.4.4)                     
#>  nlme         3.1-137    2018-04-07 CRAN (R 3.4.4)                     
#>  pillar       1.3.0      2018-07-14 CRAN (R 3.4.4)                     
#>  pkgconfig    2.0.2      2018-08-16 CRAN (R 3.4.4)                     
#>  plyr         1.8.4      2016-06-08 CRAN (R 3.4.4)                     
#>  purrr      * 0.2.5      2018-05-29 CRAN (R 3.4.4)                     
#>  R6           2.2.2      2017-06-17 CRAN (R 3.4.4)                     
#>  Rcpp         0.12.18    2018-07-23 CRAN (R 3.4.4)                     
#>  readr      * 1.2.0      2018-07-06 Github (tidyverse/readr@4b2e93a)   
#>  readxl       1.1.0      2018-04-20 CRAN (R 3.4.4)                     
#>  rlang        0.2.2      2018-08-16 cran (@0.2.2)                      
#>  rmarkdown    1.10       2018-06-11 CRAN (R 3.4.4)                     
#>  rprojroot    1.3-2      2018-01-03 CRAN (R 3.4.4)                     
#>  rsample    * 0.0.2.9000 2018-09-07 Github (tidymodels/rsample@69e9782)
#>  rvest        0.3.2      2016-06-17 CRAN (R 3.4.4)                     
#>  scales       1.0.0      2018-08-09 CRAN (R 3.4.4)                     
#>  stats      * 3.4.4      2018-03-16 local                              
#>  stringi      1.2.4      2018-07-20 CRAN (R 3.4.4)                     
#>  stringr    * 1.3.1      2018-05-10 CRAN (R 3.4.4)                     
#>  tibble     * 1.4.2      2018-01-22 CRAN (R 3.4.4)                     
#>  tidyr      * 0.8.1      2018-05-18 CRAN (R 3.4.4)                     
#>  tidyselect   0.2.4      2018-02-26 CRAN (R 3.4.4)                     
#>  tidyverse  * 1.2.1      2017-11-14 CRAN (R 3.4.4)                     
#>  tools        3.4.4      2018-03-16 local                              
#>  utils      * 3.4.4      2018-03-16 local                              
#>  withr        2.1.2      2018-03-15 CRAN (R 3.4.4)                     
#>  xml2         1.2.0      2018-01-24 CRAN (R 3.4.4)                     
#>  yaml         2.2.0      2018-07-25 CRAN (R 3.4.4)

The inequality is at https://github.com/tidymodels/rsample/blob/master/R/rolling_origin.R#L42. I'm hoping I can just remove the `=`; I'll check whether that breaks any tests and whether any should be added to prevent regressions.
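To make the off-by-one concrete, here is a minimal sketch of the two forms of the comparison. It is not copied from the rsample source: the helper `enough_rows()` and the values `initial = 5` and `assess = 1` are assumptions chosen only so that the required row count matches the 6 in the error message.

```r
# Hypothetical stand-in for the row-count check; names and values are
# illustrative and are not taken from rsample.
enough_rows <- function(n, initial, assess, inclusive = TRUE) {
  needed <- initial + assess
  # inclusive = TRUE mimics an error condition of `n <= needed`,
  # which rejects a data set with exactly `needed` rows.
  # inclusive = FALSE mimics `n < needed`, which accepts it.
  too_few <- if (inclusive) n <= needed else n < needed
  !too_few
}

# Six rows with six required: rejected by `<=`, accepted by `<`.
with_equals <- enough_rows(6, initial = 5, assess = 1, inclusive = TRUE)
without_equals <- enough_rows(6, initial = 5, assess = 1, inclusive = FALSE)
```

Under these assumptions, dropping the `=` is what lets a data set with exactly the required number of rows through, which is the boundary case a regression test would need to cover.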

Don't save the holdout index for some resampling types

When the holdout is the complement of the data being retained, derive the holdout on the fly. This wouldn't work for nested resampling or rolling origin resampling, but it can otherwise save memory.
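The idea can be sketched in base R as follows. This is only an illustration of deriving the complement on demand, not rsample's implementation; `complement_index()`, `in_id`, and `out_id` are hypothetical names used for this example.

```r
# Hypothetical helper (not part of rsample): compute the holdout rows as the
# complement of the retained (analysis) row indices instead of storing them.
complement_index <- function(in_id, n) {
  setdiff(seq_len(n), unique(in_id))
}

set.seed(1)
n <- 10

# A bootstrap-style analysis set: rows sampled with replacement.
in_id <- sample.int(n, size = n, replace = TRUE)

# The holdout is every row that was never sampled, derived on the fly.
out_id <- complement_index(in_id, n)
```

Only `in_id` would need to be stored per resample. In rolling origin resampling the assessment set is a window of rows following the analysis set rather than every remaining row, so a plain complement like this would not reproduce it.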
