dateselectors.jl's Introduction

DateSelectors

Usage

DateSelectors.jl simplifies the partitioning of a collection of dates into non-contiguous validation and holdout sets, in line with best practices for tuning hyper-parameters in time-series machine learning.

The package exports the partition function, which assigns dates to the validation and holdout sets according to the DateSelector. The available DateSelectors are:

  1. NoneSelector: assigns all dates to the validation set.
  2. RandomSelector: randomly draws a subset of dates without replacement.
  3. PeriodicSelector: draws contiguous subsets of days periodically from the collection.

A notable trait of the DateSelectors is that the selection is invariant to the start and end dates of the collection itself. Thus you can shift the start and end dates, e.g. by a week, and the days in the overlapping period will consistently be placed into holdout or validation as before. The only thing that controls whether a date is selected is the parameters of the DateSelector itself.
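
One way to obtain this kind of invariance is to decide each date's fate from the date itself rather than from its position in the input range. The following is a minimal sketch of that idea using only the `Dates` standard library; it is not the package's implementation, and `SKETCH_EPOCH`, `hash_in_holdout`, and `sketch_partition` are hypothetical names chosen for illustration.

```julia
using Dates

# Assumption: blocks are counted from a fixed epoch, never from the input range.
const SKETCH_EPOCH = Date(1900, 1, 1)

# A date is held out if the hash of (seed, block index) falls below the fraction.
# Note: `hash` values may differ between Julia versions, so this is only a sketch.
function hash_in_holdout(date::Date, seed::Integer, holdout_fraction::Real, block_days::Integer)
    block = fld(Dates.value(date - SKETCH_EPOCH), block_days)
    u = hash((seed, block)) / (typemax(UInt) + 1.0)  # map the hash to roughly [0, 1]
    return u < holdout_fraction
end

function sketch_partition(dates, seed, holdout_fraction, block_days)
    holdout = [d for d in dates if hash_in_holdout(d, seed, holdout_fraction, block_days)]
    validation = [d for d in dates if !hash_in_holdout(d, seed, holdout_fraction, block_days)]
    return (validation=validation, holdout=holdout)
end

# Shift the range by a week: dates in the overlap keep their assignment.
a = sketch_partition(Date(2019, 1, 1):Day(1):Date(2019, 12, 31), 42, 0.1, 3)
b = sketch_partition(Date(2019, 1, 8):Day(1):Date(2020, 1, 7), 42, 0.1, 3)
overlap = Date(2019, 1, 8):Day(1):Date(2019, 12, 31)
same_on_overlap = intersect(a.holdout, overlap) == intersect(b.holdout, overlap)
```

Because the block index is anchored to a fixed epoch, `same_on_overlap` holds for any shift of the input range, which is the property described above.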

See the examples in the docs for more info.

dateselectors.jl's People

Contributors

arnaudh, eperim, fchorney, glennmoy, iamed2, ianlmgoddard, morris25, mzgubic, nickrobinson251, oxinabox, rofinn, samusz, thomasgudjonwright

dateselectors.jl's Issues

Exclude specified dates from selection

We might want to exclude certain dates from our datasets. Something like:

using Dates, DateSelectors

date_range = Date(2018, 1, 1):Day(1):Date(2020, 1, 1)

# Normal date selection
selector = RandomSelector(42, 0.10, Day(3))
validation, holdout = partition(date_range, selector)

# But we don't want these dates
bad_dates = [Date(2019, 1, 1), Date(2019, 12, 1)]

# Do selection as usual, just don't return any of the bad dates
selector_exc = DateExclusionSelector(selector, bad_dates)
validation_exc, holdout_exc = partition(date_range, selector_exc)

# The following should hold:
@assert validation_exc == setdiff(validation, bad_dates)
@assert holdout_exc == setdiff(holdout, bad_dates)
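
Until something like the proposed `DateExclusionSelector` exists, the same result can be approximated by filtering after partitioning. The sketch below is self-contained and uses hand-written stand-in partitions instead of calling the package; `exclude_dates` is a hypothetical helper, not part of DateSelectors.jl.

```julia
using Dates

# Hypothetical helper: drop unwanted dates from an existing partition.
function exclude_dates(validation, holdout, bad_dates)
    return (
        validation = setdiff(validation, bad_dates),
        holdout = setdiff(holdout, bad_dates),
    )
end

# Stand-in partition, written by hand for illustration.
validation = [Date(2019, 1, 1), Date(2019, 1, 2), Date(2019, 12, 1)]
holdout = [Date(2019, 1, 3), Date(2019, 1, 4)]
bad_dates = [Date(2019, 1, 1), Date(2019, 12, 1)]

result = exclude_dates(validation, holdout, bad_dates)
```

Since the exclusion happens after selection, the assignment of every remaining date is unchanged, which is what the assertions in the proposal require.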

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

More sophisticated sampling techniques

from @glennmoy

The idea behind randomly/periodically selecting weekly blocks is that it allows us to adequately sample a 2-year period with some statistical guarantees about the proportional representation of weekdays/weekends and seasons within the validation and holdout sets.

The implication is that this provides (albeit somewhat weaker) guarantees about the distribution of the underlying grid state, seasonality effects, and our performance over the period.

In the ensembling squad, this assumption was undermined by one Problem, as our model performance varied a lot between years and one year happened to be sampled more than the other.

The current RandomSelector is therefore not robust enough to provide any guarantees about the statistics of our returns, which are necessary to provide a reliable baseline against which we can compare optimised models.

This issue is more to document the concern and some possible avenues for taking this forward with different, and more sophisticated, selectors in future.

As a simple remedy to the above, we might instead have done something like:

  • Cluster the dates by season
  • Within each cluster, sort the dates by some difficulty measure
  • Systematically select dates (e.g. every second date or in blocks of 7) to ensure (roughly) proportional statistics

This would retain the same seasonal guarantees as before and somewhat weaken the weekday/weekend guarantee, but with the benefit of more similar return statistics. This is just a simple example; perhaps there's an easier/better way of doing it.
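
The three steps listed above could be sketched roughly as follows. Everything here is an assumption made for illustration: the month-based `season` labelling, the `toy_difficulty` measure, and the function `stratified_systematic` are hypothetical and not part of the package.

```julia
using Dates

# Assumption: seasons are approximated by calendar month.
season(d::Date) = month(d) in (12, 1, 2) ? :winter :
                  month(d) in (3, 4, 5) ? :spring :
                  month(d) in (6, 7, 8) ? :summer : :autumn

# Cluster by season, sort each cluster by difficulty, then send every
# `every`-th date of the sorted cluster to holdout.
function stratified_systematic(dates, difficulty; every::Integer=2)
    validation, holdout = Date[], Date[]
    for s in (:winter, :spring, :summer, :autumn)
        cluster = sort(filter(d -> season(d) == s, collect(dates)); by=difficulty)
        for (i, d) in enumerate(cluster)
            push!(i % every == 0 ? holdout : validation, d)
        end
    end
    return (validation=sort(validation), holdout=sort(holdout))
end

# Toy difficulty measure (purely illustrative): the day of the month.
toy_difficulty(d::Date) = day(d)

dates = Date(2019, 1, 1):Day(1):Date(2019, 12, 31)
result = stratified_systematic(dates, toy_difficulty; every=2)
```

With `every=2`, each season contributes about half of its dates to holdout, spread evenly across the difficulty ordering. Selecting blocks of 7 instead would only change the inner loop.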

Moreover, if we ever wish to discriminate by other criteria, e.g. grid regimes, the example gets more complicated but the same principle applies.

Add rejection sampling capability

It would be useful to be able to select the seed for (e.g.) the RandomSelector based on user-specified criteria. This would allow an implementation of more sophisticated sampling methods (cf. #7) via rejection sampling, with a compact representation (the seed) of the final date set. Something like the following:

function acceptance(validation_dates, holdout_dates)
    # We want 1st of January to be in the validation dates
    return Date(2021, 1, 1) in validation_dates
end

# Gives a seed with the desired properties to use with RandomSelector
seed = rejection_sample(RandomSelector, date_range, acceptance)
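
A minimal, self-contained sketch of how such a `rejection_sample` might work is given below. It does not call the package: `toy_random_partition` is a hypothetical stand-in for partitioning with a seeded random selector (and, unlike the real selectors, makes no attempt at start-date invariance), and the signature differs from the proposal above in that no selector type is passed.

```julia
using Dates, Random

# Stand-in for a seeded random partition (not the package's algorithm).
function toy_random_partition(dates, seed::Integer, holdout_fraction::Real)
    rng = MersenneTwister(seed)
    mask = rand(rng, length(dates)) .< holdout_fraction
    all_dates = collect(dates)
    return (validation=all_dates[.!mask], holdout=all_dates[mask])
end

# Try seeds in order and return the first whose partition is accepted.
function rejection_sample(dates, acceptance; holdout_fraction=0.1, max_tries=10_000)
    for seed in 1:max_tries
        validation, holdout = toy_random_partition(dates, seed, holdout_fraction)
        acceptance(validation, holdout) && return seed
    end
    error("no acceptable seed found in $max_tries tries")
end

# We want 1st of January to be in the validation dates.
acceptance(validation_dates, holdout_dates) = Date(2021, 1, 1) in validation_dates

date_range = Date(2021, 1, 1):Day(1):Date(2021, 12, 31)
seed = rejection_sample(date_range, acceptance)
```

The returned seed is the compact representation: re-running the partition with it reproduces a date set that satisfies the acceptance criterion.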

Precomputed Selection represented as a BitString

An idea @glennmoy and I were discussing

@glennmoy

> could we not generate a random bit string once, to represent whether a block will go in the validation/holdout?
> 
> This is just pseudocode but the idea would be e.g. 
> - take a large enough date range to capture any reasonable dates we would consider: date_range = `Date(1900, 1, 1):block_size:Date(2100, 1, 1)`
> - Generate a bit string of `length(date_range)` once with equal number of 0s/1s 
> - Mask the `date_range` using the bit string into global_validation/global_holdout
> - Intersect with the dates we provide to get the validation/holdout we care about
> 
> * only uses one random number 
> * no loops
> * exact fraction gets allocated (modulo block size)
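
The pseudocode above could be made concrete along the following lines. This is only a sketch under stated assumptions: the weekly block size, the seed, and the helper name `mask_in_holdout` are all chosen for illustration, and the mask is built with a seeded shuffle instead of being stored.

```julia
using Dates, Random

# Assumed global range start and block size.
global_start = Date(1900, 1, 1)
global_stop = Date(2100, 1, 1)
block_days = 7

# One shuffle yields a mask with (as near as possible) equal numbers of 0s and 1s.
n_blocks = cld(Dates.value(global_stop - global_start) + 1, block_days)
mask = shuffle(MersenneTwister(42), vcat(trues(n_blocks ÷ 2), falses(n_blocks - n_blocks ÷ 2)))

# Look up the block a date falls into, counted from the fixed global start.
function mask_in_holdout(date::Date, mask, start::Date, block_days::Integer)
    block = fld(Dates.value(date - start), block_days) + 1
    return mask[block]
end

# "Intersecting" with the dates we care about is then just a lookup per date.
dates = Date(2019, 1, 1):Day(1):Date(2019, 12, 31)
holdout = [d for d in dates if mask_in_holdout(d, mask, global_start, block_days)]
validation = [d for d in dates if !mask_in_holdout(d, mask, global_start, block_days)]
```

The exact half/half split holds over the whole global range; any particular sub-range, such as the year above, will only match it approximately.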

@oxinabox replied

If we could actually fit it into something like a UInt1024 (which would come from https://github.com/rfourquet/BitIntegers.jl),
that would cover us for 1024 blocks, which with a block size of 1 day is not really enough.

But it would be nice because one could generate the bitstring separately from constructing the DateSelector, and pass it into the DateSelector instead of the seed.
And that would be safe against the random number selector changing.
Such a bit string would look like 0x328d_8a1e_86f8_5c55_3cfe_223a_5904_02fd (128 bits shown here; a full 1024-bit string would have 256 hex digits).

The complicated bit would be that to achieve invariance to the start date we probably need to cover 200 years of days (roughly 73,000 bits).
So it would need to be something like an Int65536 or larger, which would not work great with BitIntegers.
But we could make it work.
Probably would want to Base64 encode it.
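
As a rough sketch of the Base64 idea, a plain `BitVector` can stand in for the very wide integer and be packed into bytes before encoding. This uses only the `Base64` standard library; `encode_mask` and `decode_mask` are hypothetical helper names, and the bit order (least-significant bit first within each byte) is an arbitrary choice made for this example.

```julia
using Base64

# Pack the mask into bytes (8 bits per byte, least-significant bit first),
# then Base64-encode the bytes into a string.
function encode_mask(mask::AbstractVector{Bool})
    bytes = zeros(UInt8, cld(length(mask), 8))
    for (i, bit) in enumerate(mask)
        bit && (bytes[(i - 1) ÷ 8 + 1] |= UInt8(1) << ((i - 1) % 8))
    end
    return base64encode(bytes)
end

# Reverse the packing; `n` is the number of bits, since padding bits are not stored.
function decode_mask(encoded::AbstractString, n::Integer)
    bytes = base64decode(encoded)
    return BitVector([((bytes[(i - 1) ÷ 8 + 1] >> ((i - 1) % 8)) & 0x01) == 0x01 for i in 1:n])
end

mask = BitVector([true, false, true, true, false, false, false, true, true, false])
encoded = encode_mask(mask)
decoded = decode_mask(encoded, length(mask))
```

A 200-year daily mask of roughly 73,000 bits would come to about 12,000 Base64 characters this way, so a coarser block size keeps the string much shorter.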

`partition` to accept a vector of dates

Disclaimer: I am not familiar with the design of this package so I might be missing something obvious.

That said, why are the inputs to partition restricted to Intervals and ranges of dates? IIUC allowing a vector of dates would solve #14 and remove the need for #17 since we could just filter the input dates.
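
To illustrate the suggestion, here is a small sketch of partitioning an arbitrary vector of dates after filtering it. It assumes a selection rule that depends only on the date itself; `partition_dates` and `toy_is_holdout` are hypothetical names and do not reflect the package's API.

```julia
using Dates

# Hypothetical: partition any vector of dates with a per-date rule.
function partition_dates(dates::AbstractVector{Date}, is_holdout)
    return (validation=filter(!is_holdout, dates), holdout=filter(is_holdout, dates))
end

# Toy rule (an assumption for illustration): hold out every fourth ISO week.
toy_is_holdout(d::Date) = week(d) % 4 == 0

# Filter the unwanted dates out first, then partition what is left.
all_dates = collect(Date(2019, 1, 1):Day(1):Date(2019, 3, 31))
bad_dates = [Date(2019, 1, 22), Date(2019, 2, 14)]
result = partition_dates(setdiff(all_dates, bad_dates), toy_is_holdout)
```

Because the rule looks only at each date, removing dates from the input does not change how the remaining dates are assigned.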
