Comments (13)
FYI @ClaytonJY I haven't forgotten about this, just been busy trying to get a new pkg on CRAN with my free time. Hopefully I'll get to it in the next few days.
from rsample.
Thank you both (@DavisVaughan and @ClaytonJY)! This nesting technique seems to be the solution. I will play with it some more, but this is really elegant. Please go ahead and collect the bounty. Thank you!
from rsample.
That's excellent! @mdancho84 and I spoke about adding something like this a while back. Would you like to do a PR and add some notes to `rolling_origin` and maybe this vignette? It might be good to have a good time series data set in the package to use for examples.
from rsample.
Sure I can:
- Add an example to `rolling_origin()`
- Add a note at the bottom of the vignette, recreating the rolling slices using `rolling_origin(drinks_nested_yearly, initial = 20, assess = 1, cumulative = FALSE)` and explain the benefits of this with irregular time series (even though the `drinks` data set looks to be regular)
Would the `drinks` dataset from FRED be a good one to include? I can redownload it and include it as well.
(Random side note, also adding a few ideas to a better R pkg for FRED data, `fredr`)
from rsample.
@DavisVaughan this nesting trick is super cool. Any thoughts on how to extend this to respect gaps in non-adjacent forecasting horizons? If I want to use things I know this morning to predict tomorrow's closing price, I can't use yesterday's observations, since the response won't be known until after close today.
I was discussing this with Max a bit in #43, and included the code I use to do that now, but per your technique here `nest()` + `rolling_origin()` (+ `filter()` for selective sampling) could do everything but "respect the gap". My only thought is to avoid computing the response pre-sampling, and write a custom panel-data-compatible recipe (`step_panel_lead()`?) which drops rows when `dplyr::lead()` comes up `NA`. Any clues on a simpler approach?
from rsample.
Can you lay out a full example for me? I looked through the code in the `vc_resample` function but can't seem to piece together why it's useful (I'm sure it is). Beyond needing a full example, a few thoughts are:
- If you are standing at 9:30am 2018-07-13, and predicting 4pm 2018-07-14, why can't you use data from yesterday, 2018-07-12? It seems like at 9:30am today you would have yesterday's close.
- If you are standing at today, the current `vc_resample` function returns all the rows past the split date. This is kind of interesting, I definitely don't think `rolling_origin` allows this right now. Is this purposeful and useful? Are you really predicting that many days out?
from rsample.
First, the questions:
- Those two points are across different lengths of time; 9:30am today to 4pm tomorrow is more than a day, 30.5 hours, but 30.5 hours from 9:30am yesterday is 4pm today, 6.5 hours later than where I'm standing, so I can't train on that. The most recent data I can train on is from two days ago, because the last closing price I have where I'm standing is yesterday's close, and 30.5 hours earlier is 2 days ago @ 9:30am.
- With a true backtest I'd at least fix the horizon, even if long, which `vc_resample` isn't doing. It grabs all the after-observations so I can compare test observations across models trained at different times (`split_dates`); is my model actually learning from recent examples? How much? If last month's model isn't much worse than last week's model at predicting today, that tells me something. If you train model A and B one month apart, but for each only predict the month after training, you don't know how much of that is because of changes in model (controllable) vs. changes in data generation process (not controllable).
Suppose we're only tracking one thing over time, and we've got closing prices for every day this week (e.g. 4pm arrival) as well as some strange feature `magic`, that we know at 9:30 am each day.
We could have that in a tibble like
library(tidyverse)
tbl <- tibble(
date = c(as.Date("2018-07-13") - 4:0),
close = seq(100, by = 100, length.out = length(date)),
magic = seq_along(close)
)
tbl
#> # A tibble: 5 x 3
#> date close magic
#> <date> <dbl> <int>
#> 1 2018-07-09 100 1
#> 2 2018-07-10 200 2
#> 3 2018-07-11 300 3
#> 4 2018-07-12 400 4
#> 5 2018-07-13 500 5
We want to predict the percentage change in closing price between today's close and tomorrow's, so we compute that, then use `rolling_origin` to make some time-dependent splits.
tbl <- tbl %>%
mutate(pct_change = (lead(close) - close) / close) %>%
filter(complete.cases(.)) # drops last row only
library(rsample)
#> Loading required package: broom
#>
#> Attaching package: 'rsample'
#> The following object is masked from 'package:tidyr':
#>
#> fill
rset <- rolling_origin(tbl, initial = 1)
rset
#> # Rolling origin forecast resampling
#> # A tibble: 3 x 2
#> splits id
#> <list> <chr>
#> 1 <S3: rsplit> Slice1
#> 2 <S3: rsplit> Slice2
#> 3 <S3: rsplit> Slice3
Let's look at one of them
list(analysis(rset$splits[[1]]), assessment(rset$splits[[1]]))
#> [[1]]
#> # A tibble: 1 x 4
#> date close magic pct_change
#> <date> <dbl> <int> <dbl>
#> 1 2018-07-09 100 1 1
#>
#> [[2]]
#> # A tibble: 1 x 4
#> date close magic pct_change
#> <date> <dbl> <int> <dbl>
#> 1 2018-07-10 200 2 0.5
This violates the information barrier; if you could compute the response for 7/9, that means it's post-close on 7/10, so it's already much later (6.5 hours) than you would have wanted to make the prediction for the 7/10 observation. Or, vice versa, if you want to make a prediction for 7/10, it's that morning, so you don't have the close you'd need (7/10) to compute the response for the 7/9 observation.
Created on 2018-07-13 by the reprex package (v0.2.0).
`vc_resample` is one way to respect this gap, via custom `rsample`-ing. Another option would be to instead make a custom recipe where the recipe drops the appropriate training rows. A third option is to not pre-compute the response and have a custom recipe do the `pct_change = (lead(close) - close) / close` on each side of the split independently and drop NA's, so long as you make the `assess` arg of `rolling_origin` one longer, so 2 instead of the default 1 in this case.
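A minimal sketch of that third option, using a plain function rather than a real recipe step. `add_response()` is a hypothetical helper, not part of rsample or recipes, and `initial = 2` is an added assumption so that the analysis side still has a row left after its last row is dropped.

```r
library(dplyr)
library(rsample)

# Raw table: no response column yet, so nothing is computed across the split.
tbl <- tibble(
  date  = as.Date("2018-07-13") - 4:0,
  close = seq(100, by = 100, length.out = 5),
  magic = 1:5
)

# Hypothetical helper: build the response from the rows on ONE side of a
# split only, then drop the row whose lead() is NA.
add_response <- function(df) {
  df %>%
    mutate(pct_change = (lead(close) - close) / close) %>%
    filter(!is.na(pct_change))
}

# initial = 2 and assess = 2 so each side keeps one row after the drop.
rset <- rolling_origin(tbl, initial = 2, assess = 2)

train_1 <- add_response(analysis(rset$splits[[1]]))
test_1  <- add_response(assessment(rset$splits[[1]]))

train_1  # 2018-07-09 only; its response needs the 2018-07-10 close
test_1   # 2018-07-11 only; 2018-07-10 falls in the gap
```

Because each side computes its own leads, the last row of the analysis set is always censored, which is what opens the one-day gap between the last training row and the first assessment row.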
Lemme know if that makes sense!
from rsample.
- Oh I see, so you're saying you need today's closing price (gotten at 4pm) as the dependent variable that goes along with yesterday's independent variables. (Answering the question, are yesterday's features predictive of today's closing price?) On the other hand, you could just run today's model at 4:01pm today so you'd have that data point. But I guess if you need to make trading decisions today based on your forecast of tomorrow's price, that won't work. (The exception is if you are not using today's close in your model, see the example below.)
- I think `rolling_origin()` enforces the fact that your assessment set has to be of the same size for every slice. It would have to change to incorporate all of what you might want to do here (I'm not against the idea, just stating a fact). I think with a suitably long data set you could still do a fixed assessment size that is really long and be able to compare last month's model with last week's.
Is this reasoning below not correct in your mind? Are you using the `close` price in your model to predict tomorrow's pct change? That would complicate things, otherwise I think this reasoning is sound.
library(tidyverse)
library(rsample)
tbl <- tibble(
date = c(as.Date("2018-07-13") - 4:0),
close = seq(100, by = 100, length.out = length(date)),
magic = seq_along(close)
)
tbl_2 <- tbl %>%
# This is what actually happens tomorrow
mutate(pct_change = (close / lag(close) - 1)) %>%
# We back up what actually happens tomorrow so we can use today's info to predict it
mutate(pct_change_tomorrow = lead(pct_change)) %>%
# Let's remove pct_change now because thats confusing otherwise
select(-pct_change) %>%
slice(-nrow(.)) %>%
# If I'm standing at 9:30am today, I can use `magic` today to predict tomorrow's change
# I cannot use `close` to predict tomorrow's change because i get that at 4pm
# but thats irrelevant because its not in the model and its just a feature
# that calculates what ill be predicting.
select(-close)
# At this point im not violating info barrier?
# I just created the response variable using info Ill get at 4pm, but im not
# using that 4pm info in my model so im ok.
tbl_2
#> # A tibble: 4 x 3
#> date magic pct_change_tomorrow
#> <date> <int> <dbl>
#> 1 2018-07-09 1 1
#> 2 2018-07-10 2 0.5
#> 3 2018-07-11 3 0.333
#> 4 2018-07-12 4 0.25
ro <- rolling_origin(tbl_2, 1, 1, cumulative = FALSE)
# This should be fine
analysis(ro$splits[[1]])
#> # A tibble: 1 x 3
#> date magic pct_change_tomorrow
#> <date> <int> <dbl>
#> 1 2018-07-09 1 1
assessment(ro$splits[[1]])
#> # A tibble: 1 x 3
#> date magic pct_change_tomorrow
#> <date> <int> <dbl>
#> 1 2018-07-10 2 0.5
In implementation, I'm standing at `2018-07-14` and I've got this trained model so when I get a new magic number at 9:30am I can say, ok this model tells me the prediction for the change in today's close (whatever that will be) to tomorrow's close (whatever that will be) is going to be XXX.
from rsample.
Another variant is that you can use the close from yesterday at 4pm.
library(tidyverse)
library(rsample)
#> Loading required package: broom
#>
#> Attaching package: 'rsample'
#> The following object is masked from 'package:tidyr':
#>
#> fill
tbl <- tibble(
date = c(as.Date("2018-07-13") - 4:0),
close = seq(100, by = 100, length.out = length(date)),
magic = seq_along(close)
)
tbl_2 <- tbl %>%
# This is what actually happens tomorrow
mutate(pct_change = (close / lag(close) - 1)) %>%
# We back up what actually happens tomorrow so we can use today's info to predict it
mutate(pct_change_tomorrow = lead(pct_change)) %>%
# Let's remove pct_change now because thats confusing otherwise
select(-pct_change) %>%
slice(-nrow(.)) %>%
# We CAN use the close from the end of the day before as an extra feature
# if we are standing at 9:30am today
mutate(close_yesterday = lag(close)) %>%
# If I'm standing at 9:30am today, I can use `magic` today to predict tomorrow's change
# I cannot use `close` to predict tomorrow's change because i get that at 4pm
# but thats irrelevant because its not in the model and its just a feature
# that calculates what ill be predicting.
select(-close)
# At this point im not violating info barrier?
# I just created the response variable using info Ill get at 4pm, but im not
# using that 4pm info in my model so im ok.
tbl_2
#> # A tibble: 4 x 4
#> date magic pct_change_tomorrow close_yesterday
#> <date> <int> <dbl> <dbl>
#> 1 2018-07-09 1 1 NA
#> 2 2018-07-10 2 0.5 100
#> 3 2018-07-11 3 0.333 200
#> 4 2018-07-12 4 0.25 300
ro <- rolling_origin(tbl_2, 1, 1, cumulative = FALSE)
# This should be fine
analysis(ro$splits[[1]])
#> # A tibble: 1 x 4
#> date magic pct_change_tomorrow close_yesterday
#> <date> <int> <dbl> <dbl>
#> 1 2018-07-09 1 1 NA
assessment(ro$splits[[1]])
#> # A tibble: 1 x 4
#> date magic pct_change_tomorrow close_yesterday
#> <date> <int> <dbl> <dbl>
#> 1 2018-07-10 2 0.5 100
Created on 2018-07-14 by the reprex package (v0.2.0).
from rsample.
- "Answering the question, are yesterday's features predictive of today's closing price?"
That's right, and yes, waiting until after 4pm/today's close is too late. The exception isn't quite right: when training, if I need today's `close` in the same row as today's `magic`, it doesn't make a difference if it's for a predictor or the outcome; if I need it, I need it. You're correct that needing today's close as a feature makes things worse, but that's on the prediction/assessment side only; it would force me to wait until after close today, which is too late.
- If I do observation censoring (which does still need to happen) outside of the sampling (e.g. with a custom recipe or a custom model fitting function), the biggest remaining difference would be with this `assess` piece, which could be fixed by allowing something like `assess = Inf` or `assess = -1`. Making `assess` bigger isn't quite what I want; that cuts out both later-trained models and overlap across models further apart.
In the example, notice your `tbl_2` is identical to my `tbl` and your first split is identical to mine; if there's a problem in one, there's a problem in both. I suspect your multi-step response construction is where you got confused; it's equivalent to what I did, but thinking of it as the also-equivalent `lead(close) / close - 1` is perhaps even more clear.
To train on 7/9, I need 7/10 close to compute the response, so it must be after 4pm 7/10, which is too late to care about a prediction made with 7/10 features. After training through the 7/9 features, which I can do only after close on 7/10, the first prediction I could make in practice would be with the 7/11 features, since it would already be too late to care about the 7/10 prediction.
In your second example, you need to make `initial = 2` so you don't start with that incomplete training set, but you're correct that using yesterday's close as a feature is no problem.
Maybe this would be more clear with a longer window: suppose I want to use today's `magic` to predict the close in two days. We can also simplify the response to just be that future close, without any of the percent-change stuff; for the issues at hand it doesn't matter what our formula is there, what matters is how far away the furthest close we need is.
library(tidyverse)
library(rsample)
#> Loading required package: broom
#>
#> Attaching package: 'rsample'
#> The following object is masked from 'package:tidyr':
#>
#> fill
tbl <- tibble(
date = c(as.Date("2018-07-13") - 4:0),
close = seq(100, by = 100, length.out = length(date)),
magic = seq_along(close)
)
tbl
#> # A tibble: 5 x 3
#> date close magic
#> <date> <dbl> <int>
#> 1 2018-07-09 100 1
#> 2 2018-07-10 200 2
#> 3 2018-07-11 300 3
#> 4 2018-07-12 400 4
#> 5 2018-07-13 500 5
window <- 2
tbl <- tbl %>%
mutate(
lead_date = lead(date, window),
response = lead(close, window)
) %>%
filter(complete.cases(.)) # drops <window> rows from end
tbl
#> # A tibble: 3 x 5
#> date close magic lead_date response
#> <date> <dbl> <int> <date> <dbl>
#> 1 2018-07-09 100 1 2018-07-11 300
#> 2 2018-07-10 200 2 2018-07-12 400
#> 3 2018-07-11 300 3 2018-07-13 500
ro <- rolling_origin(tbl, 1, 1, cumulative = FALSE)
analysis(ro$splits[[1]])
#> # A tibble: 1 x 5
#> date close magic lead_date response
#> <date> <dbl> <int> <date> <dbl>
#> 1 2018-07-09 100 1 2018-07-11 300
assessment(ro$splits[[1]])
#> # A tibble: 1 x 5
#> date close magic lead_date response
#> <date> <dbl> <int> <date> <dbl>
#> 1 2018-07-10 200 2 2018-07-12 400
Created on 2018-07-14 by the reprex package (v0.2.0).
Now I can't train on that `analysis` row until after close on 7/11, so the first thing I'd want in `assessment` would have a date of 7/12. Conversely, when I'm ready to make a prediction using `magic` from the morning of 7/10, the last close I could have trained on would be from 7/9, so the associated `magic` would be from 7/7 (ignoring weekend), even if training happened before today's `magic` came in (as it should). Thus two dates can't be trained on.
This is where long-term forecasting gets tricky; if you predict X days out, you also lose X days of training data (or X-1 if you don't have a time gap between features and responses, like waiting till post-close to predict), because you don't have their associated responses yet.
We could make our 9:30/4pm more explicit & store features separate from prices, and would if using tools like `flyingfox`/`zipline` to really make sure we're never looking ahead, but the idea here is to avoid getting into all that.
Does that make the need for observation censoring any more clear?
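To make the censoring concrete, here is a rough base-R sketch under a few stated assumptions: a longer ten-day toy table than the one above, a two-row horizon, and a hypothetical helper `censored_split()` (this is not `vc_resample` and not an rsample function). It treats `split_date` as "standing just after that day's close".

```r
# Toy data: ten consecutive dates (weekends ignored, as in the example above).
tbl <- data.frame(
  date  = as.Date("2018-07-09") + 0:9,
  close = seq(100, by = 100, length.out = 10),
  magic = 1:10
)

window <- 2

# The response is the close `window` rows ahead; lead_date is the day that
# close becomes known. The last `window` rows have no response and are dropped.
tbl$lead_date <- c(tbl$date[-seq_len(window)], rep(as.Date(NA), window))
tbl$response  <- c(tbl$close[-seq_len(window)], rep(NA_real_, window))
tbl <- tbl[!is.na(tbl$response), ]

# Hypothetical helper: train only on rows whose response is already known at
# split_date, and assess only on rows whose features arrive after split_date.
censored_split <- function(df, split_date) {
  list(
    analysis   = df[df$lead_date <= split_date, ],
    assessment = df[df$date > split_date, ]
  )
}

s <- censored_split(tbl, as.Date("2018-07-13"))
s$analysis$date    # 2018-07-09 to 2018-07-11
s$assessment$date  # 2018-07-14 to 2018-07-16
```

The rows dated 2018-07-12 and 2018-07-13 land in neither set, which is the "two dates can't be trained on" effect for a two-day horizon.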
Another thing about my `vc_resample` that can't be solved with `nest` + `rolling_origin` is accounting for different lengths of weeks (if not training each day). If I want to train only on weekends, `skip` can't handle last week only having 4 trading days instead of 5. That's why I operate on `date`/`lead_date` to make my splits; I can control my training intervals according to an irregular calendar. Post-split filtering of the `rset` object would do it too, but at the cost of generating way more than what I need.
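As an illustration of the irregular-calendar point, the sketch below derives one retraining date per week directly from the trading dates instead of counting rows. The two-week calendar and the 2018-07-04 closure are assumptions chosen only to produce a 4-day week; `split_dates` here is just a local variable name.

```r
# Weekdays from 2018-07-02 to 2018-07-13, minus an assumed market holiday.
all_days   <- seq(as.Date("2018-07-02"), as.Date("2018-07-13"), by = "day")
is_weekday <- as.POSIXlt(all_days)$wday %in% 1:5
trading_days <- all_days[is_weekday & all_days != as.Date("2018-07-04")]

# Monday of each trading day's week, used as a grouping key.
week_start <- trading_days - (as.POSIXlt(trading_days)$wday - 1)

# Last trading day in each week: the point after which to retrain.
split_dates <- as.Date(
  as.vector(tapply(as.numeric(trading_days), format(week_start), max)),
  origin = "1970-01-01"
)

table(format(week_start))  # 4 trading days in the first week, 5 in the second
split_dates                # "2018-07-06" "2018-07-13"
```

A fixed row-based `skip` would assume five rows per week and drift after the short week, whereas dates picked this way stay aligned with the calendar.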
Conceptually, I think it'd make more sense to only worry about applying the right information barrier to a no-leads-included table with the resampling and then generate the response, lead/lag features, and drop rows in a recipe, but there's so much boilerplate involved it'd be even more code than I have here, and I understand `rsample` a bit better than `recipes` right now. Maybe I'll explore this more in the future.
from rsample.
Hi, I have a very similar concern. It is about combining rolling origin forecast resampling and group v-fold cross-validation in rsample.
I have asked the question on SO.
In fact my example is training and assessing on whole months only, but the aim and the application that I think of are rather general (the group can be some factor, and nevertheless the time series structure should be preserved).
I don't have the solution but the following was my example:
## generate some data
library(tidyverse)
library(lubridate)
library(rsample)
my_dates = seq(as.Date("2018/1/1"), as.Date("2018/8/20"), "days")
some_data = data_frame(dates = my_dates)
some_data$values = runif(length(my_dates))
some_data = some_data %>% mutate(month = as.factor(month(dates)))
This gives data of the following form
A tibble: 232 x 3
dates values month
<date> <dbl> <fctr>
1 2018-01-01 0.235 1
2 2018-01-02 0.363 1
3 2018-01-03 0.146 1
4 2018-01-04 0.668 1
5 2018-01-05 0.0995 1
6 2018-01-06 0.163 1
7 2018-01-07 0.0265 1
8 2018-01-08 0.273 1
9 2018-01-09 0.886 1
10 2018-01-10 0.239 1
Then we can e.g. produce samples that take 20 weeks of data and test on the following 5 weeks (the parameter `skip` skips some extra rows):
rolling_origin_resamples <- rolling_origin(
some_data,
initial = 7*20,
assess = 7*5,
cumulative = TRUE,
skip = 7
)
We can check the data with the following code and see no overlap:
rolling_origin_resamples$splits[[1]] %>% analysis %>% tail
# A tibble: 6 x 3
dates values month
<date> <dbl> <fctr>
1 2018-05-15 0.678 5
2 2018-05-16 0.00112 5
3 2018-05-17 0.339 5
4 2018-05-18 0.0864 5
5 2018-05-19 0.918 5
6 2018-05-20 0.317 5
### test data of first split:
rolling_origin_resamples$splits[[1]] %>% assessment
# A tibble: 6 x 3
dates values month
<date> <dbl> <fctr>
1 2018-05-21 0.912 5
2 2018-05-22 0.403 5
3 2018-05-23 0.366 5
4 2018-05-24 0.159 5
5 2018-05-25 0.223 5
6 2018-05-26 0.375 5
Alternatively we can split by months:
## sampling by month:
gcv_resamples = group_vfold_cv(some_data, group = "month", v = 5)
gcv_resamples$splits[[1]] %>% analysis %>% select(month) %>% summary
gcv_resamples$splits[[1]] %>% assessment %>% select(month) %>% summary
The solution by an SO user was a partial answer and did not use rsample:
# split data into a list by month
df <- split(some_data, some_data$month)
# lapply along list elements, defining train and test sets
df <- lapply(seq_along(df)[-length(df)], function(x) {
  train <- do.call(rbind, df[1:x])
  test <- df[[x + 1]]  # `[[` so that test is a data frame, not a one-element list
  list(train = train, test = test)
})
The result df is a list of 7 elements, each containing train and test data frames.
from rsample.
@rwarnung if I understand correctly, you want to apply rolling-forward CV at the month level instead of the date level; per @DavisVaughan solution above, we can achieve that with some nesting:
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#>
#> date
library(rsample)
#> Loading required package: broom
#>
#> Attaching package: 'rsample'
#> The following object is masked from 'package:tidyr':
#>
#> fill
# same data generation as before
my_dates = seq(as.Date("2018/1/1"), as.Date("2018/8/20"), "days")
some_data = data_frame(dates = my_dates)
some_data$values = runif(length(my_dates))
some_data = some_data %>% mutate(month = as.factor(month(dates)))
# need nest()
library(tidyr)
# nest by month, then resample
rset <- some_data %>%
group_by(month) %>%
nest() %>%
rolling_origin(initial = 1)
# doesn't show which month is which :(
rset
#> # Rolling origin forecast resampling
#> # A tibble: 7 x 2
#> splits id
#> <list> <chr>
#> 1 <S3: rsplit> Slice1
#> 2 <S3: rsplit> Slice2
#> 3 <S3: rsplit> Slice3
#> 4 <S3: rsplit> Slice4
#> 5 <S3: rsplit> Slice5
#> 6 <S3: rsplit> Slice6
#> 7 <S3: rsplit> Slice7
# only January (31 days)
analysis(rset$splits[[1]])$data
#> [[1]]
#> # A tibble: 31 x 2
#> dates values
#> <date> <dbl>
#> 1 2018-01-01 0.179
#> 2 2018-01-02 0.719
#> 3 2018-01-03 0.119
#> 4 2018-01-04 0.889
#> 5 2018-01-05 0.429
#> 6 2018-01-06 0.269
#> 7 2018-01-07 0.600
#> 8 2018-01-08 0.792
#> 9 2018-01-09 0.760
#> 10 2018-01-10 0.804
#> # ... with 21 more rows
# only February (28 days)
assessment(rset$splits[[1]])$data
#> [[1]]
#> # A tibble: 28 x 2
#> dates values
#> <date> <dbl>
#> 1 2018-02-01 0.645
#> 2 2018-02-02 0.233
#> 3 2018-02-03 0.321
#> 4 2018-02-04 0.0927
#> 5 2018-02-05 0.750
#> 6 2018-02-06 0.302
#> 7 2018-02-07 0.861
#> 8 2018-02-08 0.713
#> 9 2018-02-09 0.0454
#> 10 2018-02-10 0.656
#> # ... with 18 more rows
Created on 2018-08-24 by the reprex package (v0.2.0).
This does add some workflow overhead, as you have to "unpack" the splits (`$data`). It also has the downside of hiding information about what is in each split, but you could add a mutate step to extract some info from each split (e.g. month of assessment set) and add it as a new column.
I should also point out that factor is probably a bad way to store `month` in this case; I'd suggest either letting it be an integer, or using `floor_date` so each is the first date of the month. This also makes it easier to follow my suggestion above to pull data into a new column of the `rset`.
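A rough sketch of both suggestions together, assuming tidyr 1.0 or later for the `nest(data = ...)` syntax. The column names `train_months` and `assess_month` are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(lubridate)
library(rsample)

set.seed(1)
some_data <- tibble(
  dates  = seq(as.Date("2018-01-01"), as.Date("2018-08-20"), by = "day"),
  values = runif(length(dates))
) %>%
  # store month as the first date of the month rather than a factor
  mutate(month = floor_date(dates, "month"))

# one row per month, then roll forward over months
rset <- some_data %>%
  nest(data = c(dates, values)) %>%
  rolling_origin(initial = 1)

# label each slice with the months it trains on and the month it assesses
rset_labelled <- rset %>%
  mutate(
    train_months = map(splits, ~ analysis(.x)$month),
    assess_month = as.Date(map_chr(splits, ~ format(assessment(.x)$month)))
  )

rset_labelled$assess_month

# "unpack" the first analysis set back into daily rows (January only)
jan <- analysis(rset$splits[[1]]) %>% unnest(data)
```

With the month stored as a date, the label column sorts and filters naturally, so slices can be picked by assessment month without opening each split.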
(@DavisVaughan the nest thing is all you, feel free to go take his bounty on SO)
from rsample.
This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.
from rsample.