
Comments (13)

DavisVaughan commented on August 9, 2024

FYI @ClaytonJY, I haven't forgotten about this; I've just been busy trying to get a new pkg on CRAN with my free time. Hopefully I'll get to it in the next few days.


rwarnung commented on August 9, 2024

Thank you both (@DavisVaughan and @ClaytonJY)! This nesting technique seems to be the solution. I will play with it some more, but this is really elegant. Please go ahead and collect the bounty! Thank you!


topepo commented on August 9, 2024

That's excellent! @mdancho84 and I spoke about adding something like this a while back. Would you like to do a PR and add some notes to rolling_origin and maybe this vignette? It might be good to have a good time series data set in the package to use for examples.


DavisVaughan commented on August 9, 2024

Sure I can:

  • Add an example to rolling_origin()
  • Add a note at the bottom of the vignette, recreating the rolling slices using rolling_origin(drinks_nested_yearly, initial = 20, assess = 1, cumulative = FALSE) and explain the benefits of this with irregular time series (even though the drinks data set looks to be regular)

Would the drinks dataset from FRED be a good one to include? I can redownload it and include it as well.
(Random side note: I'm also adding a few ideas to fredr, a better R pkg for FRED data.)
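
As a rough sketch of the nested approach from the list above, assuming drinks is a data frame with a monthly date column (the drinks data and the drinks_nested_yearly name are illustrative here, not a final API):

library(dplyr)
library(tidyr)
library(lubridate)
library(rsample)

# nest by year so each row of the resampled object holds one year of data
drinks_nested_yearly <- drinks %>%
  mutate(year = year(date)) %>%
  group_by(year) %>%
  nest()

# 20 years of analysis data, 1 year of assessment data per slice
rolling_origin(drinks_nested_yearly, initial = 20, assess = 1, cumulative = FALSE)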


ClaytonJY commented on August 9, 2024

@DavisVaughan this nesting trick is super cool. Any thoughts on how to extend this to respect gaps in non-adjacent forecasting horizons? If I want to use things I know this morning to predict tomorrow's closing price, I can't use yesterday's observations, since the response won't be known until after close today.

I was discussing this with Max a bit in #43, and included the code I use to do that now, but per your technique here, nest() + rolling_origin() (+ filter() for selective sampling) could do everything but "respect the gap". My only thought is to avoid computing the response pre-sampling and write a custom panel-data-compatible recipe (step_panel_lead()?) that drops rows where dplyr::lead() comes up NA. Any clues on a simpler approach?


DavisVaughan commented on August 9, 2024

Can you lay out a full example for me? I looked through the code in the vc_resample function but can't seem to piece together why it's useful (I'm sure it is). Beyond needing a full example, a few thoughts:

  • If you are standing at 9:30am 2018-07-13, and predicting 4pm 2018-07-14, why can't you use data from yesterday, 2018-07-12? It seems like at 9:30am today you would have yesterday's close.

  • Standing at today, the current vc_resample function returns all the rows past the split date. This is kind of interesting; I definitely don't think rolling_origin allows this right now. Is this purposeful and useful? Are you really predicting that many days out?


ClaytonJY commented on August 9, 2024

First, the questions:

  1. those two points are across different lengths of time; 9:30am today to 4pm tomorrow is more than a day (30.5 hours), but 30.5 hours from 9:30am yesterday is 4pm today, 6.5 hours later than where I'm standing, so I can't train on that. The most recent data I can train on is from two days ago, because the last closing price I have where I'm standing is yesterday's close, and 30.5 hours earlier than that is 9:30am two days ago.
  2. with a true backtest I'd at least fix the horizon, even if long, which vc_resample isn't doing. It grabs all the after-observations so I can compare test observations across models trained at different times (split_dates): is my model actually learning from recent examples? How much? If last month's model isn't much worse than last week's model at predicting today, that tells me something. If you train models A and B one month apart, but for each only predict the month after training, you don't know how much of the difference comes from changes in the model (controllable) vs. changes in the data-generating process (not controllable).

Suppose we're only tracking one thing over time, and we've got closing prices for every day this week (arriving at 4pm, say), as well as some strange feature, magic, that we know at 9:30am each day.

We could have that in a tibble like

library(tidyverse)

tbl <- tibble(
  date  = c(as.Date("2018-07-13") - 4:0),
  close = seq(100, by = 100, length.out = length(date)),
  magic = seq_along(close)
)

tbl
#> # A tibble: 5 x 3
#>   date       close magic
#>   <date>     <dbl> <int>
#> 1 2018-07-09   100     1
#> 2 2018-07-10   200     2
#> 3 2018-07-11   300     3
#> 4 2018-07-12   400     4
#> 5 2018-07-13   500     5

We want to predict the percentage change in closing price from today's close to tomorrow's, so we compute that, then use rolling_origin to make some time-dependent splits.

tbl <- tbl %>%
  mutate(pct_change = (lead(close) - close) / close) %>%
  filter(complete.cases(.))  # drops last row only

library(rsample)
#> Loading required package: broom
#> 
#> Attaching package: 'rsample'
#> The following object is masked from 'package:tidyr':
#> 
#>     fill

rset <- rolling_origin(tbl, initial = 1)

rset
#> # Rolling origin forecast resampling 
#> # A tibble: 3 x 2
#>   splits       id    
#>   <list>       <chr> 
#> 1 <S3: rsplit> Slice1
#> 2 <S3: rsplit> Slice2
#> 3 <S3: rsplit> Slice3

Let's look at one of them:

list(analysis(rset$splits[[1]]), assessment(rset$splits[[1]]))
#> [[1]]
#> # A tibble: 1 x 4
#>   date       close magic pct_change
#>   <date>     <dbl> <int>      <dbl>
#> 1 2018-07-09   100     1          1
#> 
#> [[2]]
#> # A tibble: 1 x 4
#>   date       close magic pct_change
#>   <date>     <dbl> <int>      <dbl>
#> 1 2018-07-10   200     2        0.5

This violates the information barrier: if you could compute the response for 7/9, that means it's post-close on 7/10, which is already much later (6.5 hours) than you would have wanted to make the prediction for the 7/10 observation. Or, vice versa: if you want to make a prediction for 7/10, it's that morning, so you don't yet have the close you'd need (7/10's) to compute the response for the 7/9 observation.

Created on 2018-07-13 by the reprex package (v0.2.0).

vc_resample is one way to respect this gap, via custom rsample-ing. Another option would be a custom recipe that drops the appropriate training rows. A third option is to not pre-compute the response and have a custom recipe do the pct_change = (lead(close) - close) / close on each side of the split independently and drop NAs, so long as you make the assess arg of rolling_origin one longer, so 2 instead of the default 1 in this case.
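
Here's a rough sketch of that third option, assuming tbl_raw is the original five-row tibble from above, before pct_change was computed (prep_side() is a hypothetical helper, not an rsample function):

library(dplyr)
library(rsample)

# finish each side of the split independently: compute the response, then
# drop the rows where lead() ran off the end
prep_side <- function(df) {
  df %>%
    mutate(pct_change = (lead(close) - close) / close) %>%
    filter(!is.na(pct_change))
}

# assess = 2 instead of the default 1, so the assessment side still has one
# usable row after its last row is dropped
rset <- rolling_origin(tbl_raw, initial = 2, assess = 2)

train <- prep_side(analysis(rset$splits[[1]]))
test  <- prep_side(assessment(rset$splits[[1]]))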

Lemme know if that makes sense!


DavisVaughan commented on August 9, 2024
  1. Oh, I see: so you're saying you need today's closing price (obtained at 4pm) as the dependent variable that goes along with yesterday's independent variables (answering the question: are yesterday's features predictive of today's closing price?). On the other hand, you could just run today's model at 4:01pm today, so you'd have that data point. But I guess if you need to make trading decisions today based on your forecast of tomorrow's price, that won't work. (The exception is if you are not using today's close in your model; see the example below.)

  2. I think rolling_origin() enforces the fact that your assessment set has to be the same size for every slice. It would have to change to incorporate all of what you might want to do here (I'm not against the idea, just stating a fact). I think with a suitably long data set you could still use a fixed assessment size that is really long and be able to compare last month's model with last week's; see the sketch after this list.
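
A sketch of that workaround, assuming a long daily data frame prices (the name is illustrative): every slice assesses on the same fixed number of future rows, so slices made a week apart share most of their assessment window.

library(rsample)

# 100 rows of analysis data, always the next 30 rows for assessment;
# skip = 6 starts consecutive slices 7 rows (one week of daily data) apart
ro <- rolling_origin(prices, initial = 100, assess = 30, skip = 6, cumulative = FALSE)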

Is the reasoning below not correct in your mind? Are you using the close price in your model to predict tomorrow's pct change? That would complicate things; otherwise I think this reasoning is sound.

library(tidyverse)
library(rsample)

tbl <- tibble(
  date  = c(as.Date("2018-07-13") - 4:0),
  close = seq(100, by = 100, length.out = length(date)),
  magic = seq_along(close)
)

tbl_2 <- tbl %>%
  
  # This is what actually happens tomorrow
  mutate(pct_change = (close / lag(close) - 1)) %>%
  
  # We back up what actually happens tomorrow so we can use today's info to predict it
  mutate(pct_change_tomorrow = lead(pct_change)) %>%
  
  # Let's remove pct_change now because that's confusing otherwise
  select(-pct_change) %>%
  
  slice(-nrow(.)) %>%
  
  # If I'm standing at 9:30am today, I can use `magic` today to predict tomorrow's change.
  # I cannot use `close` to predict tomorrow's change because I get it at 4pm,
  # but that's irrelevant because it's not in the model; it's just a feature
  # used to calculate what I'll be predicting.
  select(-close)

# At this point I'm not violating the info barrier, right?
# I just created the response variable using info I'll get at 4pm, but I'm not
# using that 4pm info in my model, so I'm OK.
tbl_2
#> # A tibble: 4 x 3
#>   date       magic pct_change_tomorrow
#>   <date>     <int>               <dbl>
#> 1 2018-07-09     1               1    
#> 2 2018-07-10     2               0.5  
#> 3 2018-07-11     3               0.333
#> 4 2018-07-12     4               0.25

ro <- rolling_origin(tbl_2, 1, 1, cumulative = FALSE)

# This should be fine
analysis(ro$splits[[1]])
#> # A tibble: 1 x 3
#>   date       magic pct_change_tomorrow
#>   <date>     <int>               <dbl>
#> 1 2018-07-09     1                   1
assessment(ro$splits[[1]])
#> # A tibble: 1 x 3
#>   date       magic pct_change_tomorrow
#>   <date>     <int>               <dbl>
#> 1 2018-07-10     2                 0.5

In implementation: I'm standing at 2018-07-14 and I've got this trained model, so when I get a new magic number at 9:30am I can say, OK, this model tells me the prediction for the change from today's close (whatever that will be) to tomorrow's close (whatever that will be) is going to be XXX.


DavisVaughan commented on August 9, 2024

Another variant: you can also use yesterday's 4pm close as a feature.

library(tidyverse)
library(rsample)
#> Loading required package: broom
#> 
#> Attaching package: 'rsample'
#> The following object is masked from 'package:tidyr':
#> 
#>     fill

tbl <- tibble(
  date  = c(as.Date("2018-07-13") - 4:0),
  close = seq(100, by = 100, length.out = length(date)),
  magic = seq_along(close)
)

tbl_2 <- tbl %>%
  
  # This is what actually happens tomorrow
  mutate(pct_change = (close / lag(close) - 1)) %>%
  
  # We back up what actually happens tomorrow so we can use today's info to predict it
  mutate(pct_change_tomorrow = lead(pct_change)) %>%
  
  # Let's remove pct_change now because that's confusing otherwise
  select(-pct_change) %>%
  
  slice(-nrow(.)) %>%
  
  # We CAN use the close from the end of the day before as an extra feature
  # if we are standing at 9:30am today
  mutate(close_yesterday = lag(close)) %>%
  
  # If I'm standing at 9:30am today, I can use `magic` today to predict tomorrow's change.
  # I cannot use `close` to predict tomorrow's change because I get it at 4pm,
  # but that's irrelevant because it's not in the model; it's just a feature
  # used to calculate what I'll be predicting.
  select(-close)

# At this point I'm not violating the info barrier, right?
# I just created the response variable using info I'll get at 4pm, but I'm not
# using that 4pm info in my model, so I'm OK.
tbl_2
#> # A tibble: 4 x 4
#>   date       magic pct_change_tomorrow close_yesterday
#>   <date>     <int>               <dbl>           <dbl>
#> 1 2018-07-09     1               1                  NA
#> 2 2018-07-10     2               0.5               100
#> 3 2018-07-11     3               0.333             200
#> 4 2018-07-12     4               0.25              300

ro <- rolling_origin(tbl_2, 1, 1, cumulative = FALSE)

# This should be fine
analysis(ro$splits[[1]])
#> # A tibble: 1 x 4
#>   date       magic pct_change_tomorrow close_yesterday
#>   <date>     <int>               <dbl>           <dbl>
#> 1 2018-07-09     1                   1              NA
assessment(ro$splits[[1]])
#> # A tibble: 1 x 4
#>   date       magic pct_change_tomorrow close_yesterday
#>   <date>     <int>               <dbl>           <dbl>
#> 1 2018-07-10     2                 0.5             100

Created on 2018-07-14 by the reprex package (v0.2.0).


ClaytonJY commented on August 9, 2024
  1. Answering the question, are yesterday's features predictive of today's closing price?

    That's right, and yes, waiting until after 4pm / today's close is too late. The exception isn't quite right, though: when training, if I need today's close in the same row as today's magic, it doesn't matter whether it's a predictor or the outcome; if I need it, I need it. You're correct that needing today's close as a feature makes things worse, but that's on the prediction/assessment side only; it would force me to wait until after close today to predict, which is too late.

  2. If I do observation censoring (which does still need to happen) outside of the sampling (e.g. with a custom recipe or a custom model-fitting function), the biggest remaining difference would be the assess piece, which could be fixed by allowing something like assess = Inf or assess = -1. Making assess bigger isn't quite what I want; that cuts out both later-trained models and the overlap across models trained further apart.

In the example, notice your tbl_2 is identical to my tbl, and your first split is identical to mine; if there's a problem in one, there's a problem in both. I suspect your multi-step response construction is where you got confused; it's equivalent to what I did, though thinking of it as the also-equivalent lead(close) / close - 1 is perhaps even clearer.
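
For what it's worth, the equivalence is easy to check on the toy five-row tbl (the column names here are just labels):

library(dplyr)

tbl %>%
  mutate(
    multi_step = lead(close / lag(close) - 1),   # the multi-step construction, collapsed
    one_step   = (lead(close) - close) / close   # the direct construction
  )
# the two new columns agree on every row where both are defined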

To train on 7/9, I need the 7/10 close to compute the response, so it must be after 4pm on 7/10, which is too late to care about a prediction made with 7/10 features. After training through the 7/9 features, which I can do only after close on 7/10, the first prediction I could make in practice would be with the 7/11 features, since it would already be too late to care about the 7/10 prediction.

In your second example, you need to make initial = 2 so you don't start with that incomplete training set (the first row's close_yesterday is NA), but you're correct that using yesterday's close as a feature is no problem.


Maybe this will be clearer with a longer window: suppose I want to use today's magic to predict the close in two days. We can also simplify the response to just be that future close, without any of the percent-change stuff; for the issues at hand it doesn't matter what the response formula is, what matters is how far away the furthest close we need is.

library(tidyverse)
library(rsample)
#> Loading required package: broom
#> 
#> Attaching package: 'rsample'
#> The following object is masked from 'package:tidyr':
#> 
#>     fill

tbl <- tibble(
  date  = c(as.Date("2018-07-13") - 4:0),
  close = seq(100, by = 100, length.out = length(date)),
  magic = seq_along(close)
)

tbl
#> # A tibble: 5 x 3
#>   date       close magic
#>   <date>     <dbl> <int>
#> 1 2018-07-09   100     1
#> 2 2018-07-10   200     2
#> 3 2018-07-11   300     3
#> 4 2018-07-12   400     4
#> 5 2018-07-13   500     5

window <- 2

tbl <- tbl %>%
  mutate(
    lead_date = lead(date, window),
    response  = lead(close, window)
  ) %>% 
  filter(complete.cases(.))          # drops <window> rows from end

tbl
#> # A tibble: 3 x 5
#>   date       close magic lead_date  response
#>   <date>     <dbl> <int> <date>        <dbl>
#> 1 2018-07-09   100     1 2018-07-11      300
#> 2 2018-07-10   200     2 2018-07-12      400
#> 3 2018-07-11   300     3 2018-07-13      500

ro <- rolling_origin(tbl, 1, 1, cumulative = FALSE)

analysis(ro$splits[[1]])
#> # A tibble: 1 x 5
#>   date       close magic lead_date  response
#>   <date>     <dbl> <int> <date>        <dbl>
#> 1 2018-07-09   100     1 2018-07-11      300

assessment(ro$splits[[1]])
#> # A tibble: 1 x 5
#>   date       close magic lead_date  response
#>   <date>     <dbl> <int> <date>        <dbl>
#> 1 2018-07-10   200     2 2018-07-12      400

Created on 2018-07-14 by the reprex package (v0.2.0).

Now I can't train on that analysis row until after close on 7/11, so the first thing I'd want in assessment would have a date of 7/12. Conversely, when I'm ready to make a prediction using magic from the morning of 7/10, the last close I could have trained on would be from 7/9, so the associated magic would be from 7/7 (ignoring the weekend), even if training happened before today's magic came in (as it should). Thus two dates' worth of rows can't be trained on.

This is where long-term forecasting gets tricky: if you predict X days out, you also lose X days of training data (or X-1 if you don't have a time gap between features and responses, e.g. waiting until post-close to predict), because you don't have the associated responses yet.
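
For concreteness, a minimal sketch of that censoring, using the lead_date column from the example above (censor_analysis() is a hypothetical helper, not an rsample function):

library(dplyr)
library(rsample)

# keep only analysis rows whose response (the close on lead_date) is already
# observable before the morning of the first assessment row
censor_analysis <- function(split) {
  train <- analysis(split)
  test  <- assessment(split)
  filter(train, lead_date < min(test$date))
}

# on the split above this returns zero rows: the 7/09 row's response
# (the 7/11 close) isn't known by the morning of 7/10
censor_analysis(ro$splits[[1]])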

We could make our 9:30am/4pm distinction more explicit and store features separately from prices, and we would if we were using tools like flyingfox/zipline to really make sure we never look ahead, but the idea here is to avoid getting into all that.

Does that make the need for observation censoring any more clear?


Another thing about my vc_resample that can't be solved with nest + rolling_origin is accounting for different lengths of weeks (if not training each day). If I want to retrain only on weekends, skip can't handle last week having only 4 trading days instead of 5. That's why I operate on date/lead_date to make my splits: I can control my training intervals according to an irregular calendar (see the sketch below). Post-split filtering of the rset object would do it too, but at the cost of generating way more than I need.
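
A sketch of that date-driven splitting (plain lists rather than a real rset; prices is a hypothetical daily table with a date column, and the dates are illustrative):

library(dplyr)

# split on specific Fridays from an explicit, possibly irregular,
# trading calendar, rather than relying on a fixed `skip`
split_dates <- as.Date(c("2018-07-06", "2018-07-13"))

splits <- lapply(split_dates, function(d) {
  list(
    train = filter(prices, date <= d),
    test  = filter(prices, date >  d)
  )
})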

Conceptually, I think it'd make more sense to apply the information barrier to a no-leads-included table during resampling, and then generate the response and the lead/lag features, and drop rows, in a recipe; but there's so much boilerplate involved it'd be even more code than I have here, and I understand rsample a bit better than recipes right now. Maybe I'll explore this more in the future.
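
If I did go the recipes route, the core might look something like this (a sketch only; it assumes a recipes version with step_mutate() and step_naomit(), omits prep()/bake(), and glosses over everything else a real step_panel_lead() would need):

library(recipes)

# generate the response and drop incomplete rows inside the recipe, so each
# side of a split is finished independently
rec <- recipe(~ ., data = tbl) %>%
  step_mutate(pct_change = (dplyr::lead(close) - close) / close) %>%
  step_naomit(pct_change)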


rwarnung commented on August 9, 2024

Hi, I have a very similar concern. It is about combining rolling-origin forecast resampling and group v-fold cross-validation in rsample.
I have asked the question on SO.
In fact my example trains and assesses on whole months only, but the aim, and the application I have in mind, are rather general (group can be some factor, and the time series structure should nevertheless be preserved).
I don't have the solution, but the following was my example:

## generate some data
library(tidyverse)
library(lubridate)
library(rsample)
my_dates = seq(as.Date("2018/1/1"), as.Date("2018/8/20"), "days")
some_data = data_frame(dates = my_dates) 
some_data$values = runif(length(my_dates))
some_data = some_data %>% mutate(month = as.factor(month(dates))) 

This gives data of the following form:

# A tibble: 232 x 3
   dates      values month 
   <date>      <dbl> <fctr>
 1 2018-01-01 0.235  1     
 2 2018-01-02 0.363  1     
 3 2018-01-03 0.146  1     
 4 2018-01-04 0.668  1     
 5 2018-01-05 0.0995 1     
 6 2018-01-06 0.163  1     
 7 2018-01-07 0.0265 1     
 8 2018-01-08 0.273  1     
 9 2018-01-09 0.886  1     
10 2018-01-10 0.239  1  

Then we can, for example, produce samples that take 20 weeks of data and test on the following 5 weeks (the skip parameter skips some extra rows):

rolling_origin_resamples <- rolling_origin(
  some_data,
  initial    = 7*20,
  assess     = 7*5,
  cumulative = TRUE,
  skip       = 7
)

We can check the data with the following code and see no overlap:

rolling_origin_resamples$splits[[1]] %>% analysis %>% tail
# A tibble: 6 x 3
  dates       values month 
  <date>       <dbl> <fctr>
1 2018-05-15 0.678   5     
2 2018-05-16 0.00112 5     
3 2018-05-17 0.339   5     
4 2018-05-18 0.0864  5     
5 2018-05-19 0.918   5     
6 2018-05-20 0.317   5 

### test data of first split:
rolling_origin_resamples$splits[[1]] %>% assessment
# A tibble: 6 x 3
  dates      values month 
  <date>      <dbl> <fctr>
1 2018-05-21  0.912 5     
2 2018-05-22  0.403 5     
3 2018-05-23  0.366 5     
4 2018-05-24  0.159 5     
5 2018-05-25  0.223 5     
6 2018-05-26  0.375 5   

Alternatively we can split by months:

## sampling by month:
gcv_resamples = group_vfold_cv(some_data, group = "month", v = 5)
gcv_resamples$splits[[1]]  %>% analysis %>% select(month) %>% summary
gcv_resamples$splits[[1]] %>% assessment %>% select(month) %>% summary

The solution by an SO user was a partial answer, not using rsample. First, split the data into a list by month:

df <- split(some_data, some_data$month)

Then lapply along the list elements, defining train and test sets:

df <- lapply(seq_along(df)[-length(df)], function(x) {
  train <- do.call(rbind, df[1:x])
  test  <- df[[x + 1]]  # [[ ]] extracts the data frame itself, not a one-element list
  list(train = train,
       test = test)
})

The result df is a list of 7 elements, each containing train and test data frames.


ClaytonJY commented on August 9, 2024

@rwarnung if I understand correctly, you want to apply rolling-forward CV at the month level instead of the date level; per @DavisVaughan's solution above, we can achieve that with some nesting:

library(tidyverse)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#> 
#>     date
library(rsample)
#> Loading required package: broom
#> 
#> Attaching package: 'rsample'
#> The following object is masked from 'package:tidyr':
#> 
#>     fill

# same data generation as before
my_dates = seq(as.Date("2018/1/1"), as.Date("2018/8/20"), "days")
some_data = data_frame(dates = my_dates)
some_data$values = runif(length(my_dates))
some_data = some_data %>% mutate(month = as.factor(month(dates)))

# need nest()
library(tidyr)

# nest by month, then resample
rset <- some_data %>%
  group_by(month) %>%
  nest() %>%
  rolling_origin(initial = 1)

# doesn't show which month is which :(
rset
#> # Rolling origin forecast resampling 
#> # A tibble: 7 x 2
#>   splits       id    
#>   <list>       <chr> 
#> 1 <S3: rsplit> Slice1
#> 2 <S3: rsplit> Slice2
#> 3 <S3: rsplit> Slice3
#> 4 <S3: rsplit> Slice4
#> 5 <S3: rsplit> Slice5
#> 6 <S3: rsplit> Slice6
#> 7 <S3: rsplit> Slice7

# only January (31 days)
analysis(rset$splits[[1]])$data
#> [[1]]
#> # A tibble: 31 x 2
#>    dates      values
#>    <date>      <dbl>
#>  1 2018-01-01  0.179
#>  2 2018-01-02  0.719
#>  3 2018-01-03  0.119
#>  4 2018-01-04  0.889
#>  5 2018-01-05  0.429
#>  6 2018-01-06  0.269
#>  7 2018-01-07  0.600
#>  8 2018-01-08  0.792
#>  9 2018-01-09  0.760
#> 10 2018-01-10  0.804
#> # ... with 21 more rows

# only February (28 days)
assessment(rset$splits[[1]])$data
#> [[1]]
#> # A tibble: 28 x 2
#>    dates      values
#>    <date>      <dbl>
#>  1 2018-02-01 0.645 
#>  2 2018-02-02 0.233 
#>  3 2018-02-03 0.321 
#>  4 2018-02-04 0.0927
#>  5 2018-02-05 0.750 
#>  6 2018-02-06 0.302 
#>  7 2018-02-07 0.861 
#>  8 2018-02-08 0.713 
#>  9 2018-02-09 0.0454
#> 10 2018-02-10 0.656 
#> # ... with 18 more rows

Created on 2018-08-24 by the reprex package (v0.2.0).


This does add some workflow overhead, as you have to "unpack" the splits ($data). It also has the downside of hiding information about what is in each split, but you could add a mutate step to extract some info from each split (e.g. the month of the assessment set) and add it as a new column.

I should also point out that a factor is probably a bad way to store month in this case; I'd suggest either letting it be an integer, or using floor_date so each value is the first date of the month. This also makes it easier to follow my suggestion above to pull data into a new column of the rset; a sketch of that is below.
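
That mutate step might look something like this (a sketch, assuming assess = 1 so each assessment set is a single nested month; note dplyr verbs may return a plain tibble rather than an rset):

library(dplyr)
library(purrr)

# pull the month of each assessment set into its own column
rset <- rset %>%
  mutate(assess_month = map_chr(splits, ~ as.character(assessment(.x)$month)))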

(@DavisVaughan the nest thing is all you; feel free to go take his bounty on SO.)


github-actions commented on August 9, 2024

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.
