I'm following up on our conversation on Twitter. I'd like to propose that there is a use case for a very basic bootstrapping interface that imposes minimal cognitive load on the user. In the simplest scenario, I want to be able to take an existing dplyr or ggplot pipeline and simply insert the word "bootstrap" somewhere and have a bootstrapped version of that pipeline. This would be easy to do with a function that generates a bootstrapped version of the input data frame, but in general generating bootstraps on the fly will be more memory efficient, in particular for very large data sets or larger numbers of bootstraps.
First, let me provide some examples of what I mean by going from an existing pipeline to a bootstrapped one and how I have solved this problem with some functions I have written, which are available here.
library(dplyr)
library(broom)
library(ggplot2)
# without bootstrap
iris %>% group_by(Species) %>%
summarize(
mean_sepal_length = mean(Sepal.Length),
mean_petal_length = mean(Sepal.Length)
)
#> # A tibble: 3 x 3
#> Species mean_sepal_length mean_petal_length
#> <fct> <dbl> <dbl>
#> 1 setosa 5.01 5.01
#> 2 versicolor 5.94 5.94
#> 3 virginica 6.59 6.59
# with bootstrap
iris %>% group_by(Species) %>%
ungeviz::bootstrap_summarize(3,
mean_sepal_length = mean(Sepal.Length),
mean_petal_length = mean(Sepal.Length)
)
#> # A tibble: 9 x 4
#> Species .draw mean_sepal_length mean_petal_length
#> <fct> <int> <dbl> <dbl>
#> 1 setosa 1 5.07 5.07
#> 2 setosa 2 4.98 4.98
#> 3 setosa 3 4.95 4.95
#> 4 versicolor 1 5.99 5.99
#> 5 versicolor 2 6.03 6.03
#> 6 versicolor 3 5.99 5.99
#> 7 virginica 1 6.55 6.55
#> 8 virginica 2 6.71 6.71
#> 9 virginica 3 6.59 6.59
# without bootstrap
iris %>% group_by(Species) %>%
filter(Species != "setosa") %>%
do(
tidy(lm(Sepal.Length ~ Petal.Length, data = .))
)
#> # A tibble: 4 x 6
#> # Groups: Species [2]
#> Species term estimate std.error statistic p.value
#> <fct> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 versicolor (Intercept) 2.41 0.446 5.39 2.08e- 6
#> 2 versicolor Petal.Length 0.828 0.104 7.95 2.59e-10
#> 3 virginica (Intercept) 1.06 0.467 2.27 2.77e- 2
#> 4 virginica Petal.Length 0.996 0.0837 11.9 6.30e-16
# with bootstrap
iris %>% group_by(Species) %>%
filter(Species != "setosa") %>%
ungeviz::bootstrap_do(3,
tidy(lm(Sepal.Length ~ Petal.Length, data = .))
)
#> # A tibble: 12 x 7
#> Species .draw term estimate std.error statistic p.value
#> <fct> <int> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 versicolor 1 (Intercept) 2.70 0.462 5.84 4.47e- 7
#> 2 versicolor 1 Petal.Length 0.752 0.105 7.14 4.45e- 9
#> 3 versicolor 2 (Intercept) 2.49 0.354 7.04 6.43e- 9
#> 4 versicolor 2 Petal.Length 0.806 0.0843 9.55 1.12e-12
#> 5 versicolor 3 (Intercept) 2.33 0.453 5.14 5.06e- 6
#> 6 versicolor 3 Petal.Length 0.848 0.104 8.12 1.45e-10
#> 7 virginica 1 (Intercept) 0.0159 0.540 0.0295 9.77e- 1
#> 8 virginica 1 Petal.Length 1.17 0.0953 12.3 1.80e-16
#> 9 virginica 2 (Intercept) 1.28 0.586 2.18 3.42e- 2
#> 10 virginica 2 Petal.Length 0.945 0.106 8.89 1.05e-11
#> 11 virginica 3 (Intercept) 0.987 0.447 2.21 3.20e- 2
#> 12 virginica 3 Petal.Length 1.02 0.0801 12.7 5.72e-17
# without bootstrap
mtcars %>%
ggplot(aes(hp, mpg)) + geom_smooth(se = FALSE)
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
![](https://camo.githubusercontent.com/c8c268b7aca085edd8c90798ff0af355b2f04aaa4a271358c97a5bea9801e348/68747470733a2f2f692e696d6775722e636f6d2f7873446f6e47562e706e67)
# with bootstrap
mtcars %>%
ungeviz::bootstrap(10) %>%
ggplot(aes(hp, mpg, group = .draw)) + geom_smooth(se = FALSE)
#> `geom_smooth()` using method = 'loess' and formula 'y ~ x'
![](https://camo.githubusercontent.com/8ec1bc57345ef47f41e6214cb8f0e1d8dfba93158be156c7f85952517fbc2586/68747470733a2f2f692e696d6775722e636f6d2f4b5970774335592e706e67)
I think all of these cases are very intuitive. I wrote bootstrap_summarize()
and bootstrap_do()
because they can summarize and do on bootstrap samples generated on the fly. I wrote bootstrap()
, which simply generates a bootstrapped dataset, because I found that I needed it, for example in plotting.
One case I don't know how to handle is the following:
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
iris %>% group_by(Species) %>%
filter(Species != "setosa") %>%
nest() %>%
mutate(
model = map(data, ~lm(Sepal.Length ~ Petal.Length, data = .x)),
coef = map(model, tidy)
) %>%
select(Species, coef) %>%
unnest()
#> # A tibble: 4 x 6
#> Species term estimate std.error statistic p.value
#> <fct> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 versicolor (Intercept) 2.41 0.446 5.39 2.08e- 6
#> 2 versicolor Petal.Length 0.828 0.104 7.95 2.59e-10
#> 3 virginica (Intercept) 1.06 0.467 2.27 2.77e- 2
#> 4 virginica Petal.Length 0.996 0.0837 11.9 6.30e-16
I now frequently see this kind of workflow recommended instead of the do()
workflow I used above, but it's not entirely clear to me why. It requires quite a bit more code, and much more mental processing.
In any case, I'd be happy to contribute these functions (or variants thereof) to rsample. Maybe the bootstrap()
function should be called bootstrap_df()
to highlight that it creates a data frame and to separate it from the existing bootstraps()
function.
@hadley proposed alternatively that a bootstrap()
function could set up virtual bootstraps that then could be processed by summarize()
. I like this idea in principle, but the question is what functions have to be modified. Both summarize()
and do()
should respect these virtual bootstraps I think. The ggplot package could get a fortify()
method that simply expands virtual bootstraps to actual bootstraps. But what about nest()
and mutate()
? Would we expect the following pattern to work? Users would probably expect that it would.
iris %>% group_by(Species) %>%
filter(Species != "setosa") %>%
bootstrap(10) %>%
nest() %>%
mutate(
model = map(data, ~lm(Sepal.Length ~ Petal.Length, data = .x)),
coef = map(model, tidy)
) %>%
select(Species, coef) %>%
unnest()