
Comments (8)

mikemahoney218 commented on September 14, 2024

Changing the helper to append rows is dramatically faster:

balance_prop_helper <- function(prop, data_ind, v, replace) {

  freq_table <- vec_count(data_ind$..group, sort = "location")
  n <- nrow(freq_table)

  purrr::map(
    seq_len(v),
    function(x) {
      row_idx <- sample.int(nrow(freq_table), n, replace = replace)
      work_table <- freq_table[row_idx, ]
      cumulative_proportion <- cumsum(work_table$count) / sum(freq_table$count)

      # if bootstrapping with small groups,
      # it's possible to repeatedly sample only small groups,
      # meaning you construct bootstrap sets with too few observations
      #
      # so here we loop until we've gotten the right number of groups
      while (max(cumulative_proportion) <= 1) {
        row_idx <- sample.int(nrow(freq_table), n, replace = replace)
        work_table <- rbind(work_table, freq_table[row_idx, ])
        cumulative_proportion <- cumsum(work_table$count) / sum(freq_table$count)
      }

      crosses_target <- which(cumulative_proportion > prop)[[1]]
      is_closest <- cumulative_proportion[c(crosses_target, crosses_target - 1)]
      is_closest <- which.min(abs(is_closest - prop)) - 1
      crosses_target <- crosses_target - is_closest
      out <- work_table[seq_len(crosses_target), ]
      out$assignment <- x
      out
    }
  ) %>%
    list_rbind()
}

That said, I haven't checked this to make sure it actually works in all the ways we care about: CI suggests it may make us slightly worse at matching strata, and it triggers a new "many-to-many" warning in dplyr::right_join(). I'd need to spend more time digging in to get this over the line (but if anyone gets a chance to look first, hopefully this helps!)


EmilHvitfeldt commented on September 14, 2024

Hello @StevenWallaert, do you mind clarifying a bit? Below, I was able to apply group_bootstraps() to a data.frame with 10000 rows and 500 groups. Does this mirror your setup?

library(rsample)
library(modeldata)
library(dplyr)

ames <- slice_sample(ames, n = 10000, replace = TRUE)

ames$group <- sample(1:500, size = 10000, replace = TRUE)

tictoc::tic()
group_bootstraps(ames, group)
#> # Group bootstrap sampling 
#> # A tibble: 25 × 2
#>    splits               id         
#>    <list>               <chr>      
#>  1 <split [9991/3667]>  Bootstrap01
#>  2 <split [9998/3715]>  Bootstrap02
#>  3 <split [10007/3741]> Bootstrap03
#>  4 <split [9991/3528]>  Bootstrap04
#>  5 <split [9996/3837]>  Bootstrap05
#>  6 <split [10005/3593]> Bootstrap06
#>  7 <split [9998/3668]>  Bootstrap07
#>  8 <split [10007/3974]> Bootstrap08
#>  9 <split [10001/3729]> Bootstrap09
#> 10 <split [10001/3612]> Bootstrap10
#> # ℹ 15 more rows
tictoc::toc()
#> 7.219 sec elapsed


StevenWallaert commented on September 14, 2024

Hey, thanks for your reply.
I tried your code above and got the following result:

> tictoc::tic()
> group_bootstraps(ames, group)
# Group bootstrap sampling 
# A tibble: 25 × 2
   splits               id         
   <list>               <chr>      
 1 <split [10002/4060]> Bootstrap01
 2 <split [10001/3692]> Bootstrap02
 3 <split [10004/3750]> Bootstrap03
 4 <split [10008/3592]> Bootstrap04
 5 <split [10001/3844]> Bootstrap05
 6 <split [9993/3562]>  Bootstrap06
 7 <split [10008/3613]> Bootstrap07
 8 <split [10002/3590]> Bootstrap08
 9 <split [10004/3675]> Bootstrap09
10 <split [9994/3379]>  Bootstrap10
# ℹ 15 more rows
# ℹ Use `print(n = ...)` to see more rows
> tictoc::toc()
11.04 sec elapsed

However, when I do the following with df as my data frame, I get:

> tictoc::tic()
> df %>% 
+   group_bootstraps(group = ID)
# Group bootstrap sampling 
# A tibble: 25 × 2
   splits               id         
   <list>               <chr>      
 1 <split [10605/3971]> Bootstrap01
 2 <split [10613/3715]> Bootstrap02
 3 <split [10611/3827]> Bootstrap03
 4 <split [10608/3748]> Bootstrap04
 5 <split [10611/3605]> Bootstrap05
 6 <split [10611/3707]> Bootstrap06
 7 <split [10594/3907]> Bootstrap07
 8 <split [10600/4025]> Bootstrap08
 9 <split [10627/3705]> Bootstrap09
10 <split [10613/3780]> Bootstrap10
# ℹ 15 more rows
# ℹ Use `print(n = ...)` to see more rows
> tictoc::toc()
120.94 sec elapsed

That's roughly ten times longer. My df is nothing special: 10607 rows and 17 columns (ames has 75 columns).

The problem is that I need to go up to 2000 resamples and I don't have a clue how long that will take. I usually abort because I'm afraid it will take too long. Maybe a progress bar option would be helpful?
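
One way to get a ballpark before committing to 2000 resamples is to time a small run and extrapolate; this is only a rough sketch, assuming the run time scales roughly linearly with times:

t25 <- system.time(df %>% group_bootstraps(group = ID, times = 25))[["elapsed"]]
t25 / 25 * 2000  # very rough estimate, in seconds, for 2000 resamples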


mikemahoney218 commented on September 14, 2024

Do you have a ton of groups? It seems like something in the per-group implementation is inefficient (as the person who wrote that code, I'd believe it 😛 ) and the function takes dramatically longer as the number of groups increases:

bench::press(
  n = c(2, 5, 10, 50, 100, 1000, 2000),
  {
    df <- modeldata::sim_noise(
      num_samples = 10607, 
      num_vars = 16,
      outcome = "classification", 
      num_classes = n
    )
    bench::mark(
      rsample::group_bootstraps(df, class)
    )
  }
)
#> # A tibble: 7 × 7
#>   expression                    n      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                <dbl> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 rsample::group_bootstrap…     2  61.68ms  77.85ms  10.1       78.75MB  28.5   
#> 2 rsample::group_bootstrap…     5  71.09ms 143.22ms   8.43       75.5MB  30.3   
#> 3 rsample::group_bootstrap…    10   55.7ms  82.39ms  10.9       75.76MB  30.8   
#> 4 rsample::group_bootstrap…    50  70.44ms  83.85ms  11.2       82.37MB  28.0   
#> 5 rsample::group_bootstrap…   100 227.88ms 256.82ms   3.89     103.97MB  17.5   
#> 6 rsample::group_bootstrap…  1000    1.55m    1.55m   0.0107     7.27GB   0.430 
#> 7 rsample::group_bootstrap…  2000    12.1m    12.1m   0.00138   46.19GB   0.0565

Created on 2024-02-13 with reprex v2.0.2

(I see now you said 500 groups -- I'm wondering if they're unbalanced, and whether the issue comes up when some groups are particularly small)


mikemahoney218 commented on September 14, 2024

Yeah, I think the cause might be unbalanced classes (look at the difference in memory usage, wowza):

set.seed(1107)
df <- modeldata::sim_noise(
  num_samples = 10607, 
  num_vars = 16
)

df$group_1 <- sample.int(500, 10607, TRUE)
df$group_2 <- sample(c(
  rep(1, 4243), 
  rep(2, 4243), 
  sample.int(2121, 2121, TRUE)
), 10607)

bench::mark(
  rsample::group_bootstraps(df, group_1),
  rsample::group_bootstraps(df, group_2),
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 rsample::group_bootstraps(df, group… 12.2s  12.2s   0.0820     1.69GB   0.164 
#> 2 rsample::group_bootstraps(df, group…  7.4m   7.4m   0.00225   29.02GB   0.0901

Created on 2024-02-13 with reprex v2.0.2


mikemahoney218 commented on September 14, 2024

Attached is a profile (readable via profvis::profvis()) of group_bootstraps(df, group_2). Almost all of the time is being spent subsetting freq_table here:

[profvis screenshot: nearly all of the elapsed time is spent in the freq_table[row_idx, ] subsetting]
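
For anyone who wants to regenerate that profile, a minimal sketch (assuming the df and group_2 objects from the previous comment are still in the session and profvis is installed):

profvis::profvis(rsample::group_bootstraps(df, group_2))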

Looking at the source of balance_prop_helper(), I see a comment that suggests what's going wrong:

balance_prop_helper <- function(prop, data_ind, v, replace) {

  freq_table <- vec_count(data_ind$..group, sort = "location")

  # Calculate how many groups to sample each iteration
  # If sampling with replacement,
  # set `n` to the number of resamples we'd need
  # if we somehow got the smallest group every time.
  # If sampling without replacement, just reshuffle all the groups.
  n <- nrow(freq_table)
  if (replace) n <- n * prop * sum(freq_table$count) / min(freq_table$count)
  n <- ceiling(n)

  purrr::map(
    seq_len(v),
    function(x) {
      row_idx <- sample.int(nrow(freq_table), n, replace = replace)
      work_table <- freq_table[row_idx, ]
      cumulative_proportion <- cumsum(work_table$count) / sum(freq_table$count)
      crosses_target <- which(cumulative_proportion > prop)[[1]]
      is_closest <- cumulative_proportion[c(crosses_target, crosses_target - 1)]
      is_closest <- which.min(abs(is_closest - prop)) - 1
      crosses_target <- crosses_target - is_closest
      out <- work_table[seq_len(crosses_target), ]
      out$assignment <- x
      out
    }
  ) %>%
    list_rbind()
}

That n calculation is the problem. If you're working with a table that has 10607 rows and 500 groups, the worst case scenario for bootstrapping happens when a group has only one member, where n becomes 500 * 1 * 10607 / 1 == 5303500. That means for each bootstrap iteration, we're generating a 5,303,500 row data frame (before throwing out almost all of it), which becomes quite the performance hit.
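
To make that concrete, here is the n calculation with those worst-case numbers plugged in (prop is 1 here, as in the arithmetic quoted above):

n_groups   <- 500    # nrow(freq_table)
prop       <- 1      # the proportion used in the worst case above
total_rows <- 10607  # sum(freq_table$count)
smallest   <- 1      # min(freq_table$count): a group with a single member
ceiling(n_groups * prop * total_rows / smallest)
#> [1] 5303500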

So I think it would make sense to not do this: get rid of the n over-sampling and instead sample groups only as many times as they actually need to be replicated.

A few other notes:

  • I think this only impacts bootstrapping, as n only gets inflated if replace = TRUE
  • If feasible, you might also consider lumping your smallest groups together into an "other" category, which should speed this up (a rough sketch follows below)
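
A rough sketch of that lumping idea (forcats is assumed to be available, ID is the grouping column name from the comments above, and min = 20 is an arbitrary threshold):

df$ID_lumped <- forcats::fct_lump_min(factor(df$ID), min = 20, other_level = "other")
rsample::group_bootstraps(df, ID_lumped)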

group_bootstraps.zip


StevenWallaert commented on September 14, 2024

Yes indeed, I have many small groups (between 1 and 120 members, 90% between 3 and 40 members). For now I work around it by sampling group ids with replacement and then joining the original tibble.
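
A minimal sketch of that join-based workaround (ID is the grouping column from the earlier comments; the relationship argument needs dplyr >= 1.1.1 and only declares the expected many-to-many match):

ids  <- unique(df$ID)
boot <- tibble::tibble(ID = sample(ids, length(ids), replace = TRUE))
# every sampled id brings along all of its rows; ids drawn twice are duplicated wholesale
boot_sample <- dplyr::left_join(boot, df, by = "ID", relationship = "many-to-many")
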
I'm not sure I entirely understand the n calculation issue (but that is ok).
The suggestion to lump groups together won't work for constructing percentile confidence intervals (as the bootstrapping should try to replicate the original sampling procedure as much as possible). However, for things like model selection the suggestion is probably fine.


PalimPalimPalim commented on September 14, 2024

I have observed the same behaviour; my workaround is to use tidyr's nest() and unnest().

library(rsample)
library(dplyr)
library(tidyr)

dat <- tibble(x = rep(1:1000, 2), y = 1:2000)

f <- function(df, column){
  tibble(
    "estimate" = mean(pull(df, {{column}})),
    "term" = "mean"
  )
}


# Version 1 using rsample::group_bootstraps
start_time <- Sys.time()

dat %>%
  group_bootstraps(group = x, times = 10) %>%
  mutate(mean_stats = purrr::map(splits, ~ f(analysis(.), y))) %>% 
  int_pctl(mean_stats)

print(difftime(Sys.time(), start_time, units = "secs"))

# Version 2 using tidyr nest and unnest
start_time <- Sys.time()
dat %>%
  nest(.by = x) %>%
  bootstraps(times = 10) %>%
  mutate(mean_stats = purrr::map(splits, ~ f(unnest(analysis(.), data), y))) %>% 
  int_pctl(mean_stats)

print(difftime(Sys.time(), start_time, units = "secs"))

Based on my understanding of group_bootstraps, both versions should be equivalent, but the version using nest and unnest is roughly 20 times quicker.

