
tscv's People

Contributors

cclauss, lawsonmcw, wenjiez


tscv's Issues

Revamped version of GapWalkForward: GapRollForward

The current implementation is based on legacy K-Fold cross-validation and requires an explicit value for the n_splits parameter. This puts the burden of calculating the desired value of n_splits on the user.

A better implementation should allow the user to instantiate a GapWalkForward class without specifying a value for n_splits. Instead, it can deduce the right value from the other inputs.

It is theoretically desirable to keep both channels of kickstarting a GapWalkForward class. In practice, however, it is hard to maintain both within a single class. Therefore, I have decided to deprecate the n_splits channel and implement a new class dubbed GapRollForward in v0.1.0 -- the version after the next.
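To make the contrast concrete, here is a minimal sketch of the two channels. The GapWalkForward call mirrors the current API; the GapRollForward parameter names (min_train_size, min_test_size, roll_size) are illustrative assumptions, not the final interface.

```python
from tscv import GapWalkForward

# Current channel: the user must work out n_splits by hand.
cv_old = GapWalkForward(n_splits=5, gap_size=2, test_size=10)

# Hypothetical size-based channel: the number of splits would be deduced
# from the window sizes instead.  Parameter names below are assumptions.
# cv_new = GapRollForward(min_train_size=50, min_test_size=10,
#                         gap_size=2, roll_size=10)
```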

Intuition on setting the number of gaps

If, for example, I have data without gaps, when and why would I still create a break between my train and validation sets? I have seen the argument for setting gaps when the period that needs to be predicted may be N days after the training period. Are there other reasons? And if so, what is the intuition for knowing how many gaps to include before/after the training set?
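One common intuition, sketched below with illustrative numbers: the gap mirrors the forecast horizon (or the span of any rolling-window features), so nothing the model sees at training time overlaps the period the test set is meant to emulate.

```python
import numpy as np
from tscv import gap_train_test_split

# One year of daily data, already in time order (synthetic placeholder).
X = np.arange(365).reshape(-1, 1)
y = np.arange(365)

# If the model will be deployed to forecast 7 days ahead, a 7-sample gap
# between the training and test windows mimics that horizon and keeps
# rolling-window features in the training set from leaking into the test period.
X_train, X_test, y_train, y_test = gap_train_test_split(
    X, y, test_size=30, gap_size=7
)
```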

Improve the user experience of `gap_train_test_split`

  • (Optional) better error message when invalid/misspelt keyword parameters are given
  • check_consistent_length to ensure that all arguments have the same number of rows
  • Verify that the gap_size argument is a real number
  • Improve the warning message when both train_size and test_size arguments are given:

The train_size argument is overridden by test_size; in case of nonzero 'gap_size', an explicit value should be provided and cannot be implied by '1 - train_size - test_size'.
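For concreteness, a rough sketch of what the consistency and type checks in the list above might look like; the helper name is hypothetical and this is not the package's actual implementation.

```python
import numbers

from sklearn.utils import indexable
from sklearn.utils.validation import check_consistent_length


def _validate_split_args(*arrays, gap_size=0):
    """Sketch of the proposed input checks (illustrative only)."""
    arrays = indexable(*arrays)           # make every argument indexable
    check_consistent_length(*arrays)      # all arguments share the same number of rows
    if not isinstance(gap_size, numbers.Real):
        raise TypeError(f"gap_size must be a real number, got {gap_size!r}")
    return arrays
```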

Error when importing TSCV GapWalkForward

I have been using TSCV's GapWalkForward successfully with Python 3.7.

Suddenly I am getting the following error:

```
ImportError                               Traceback (most recent call last)
in
     41 #Modeling
     42
---> 43 from tscv import GapWalkForward
     44 from sklearn.utils import shuffle
     45 from sklearn.model_selection import KFold

~\Anaconda3\envs\py37\lib\site-packages\tscv\__init__.py in <module>
----> 1 from .split import GapCrossValidator
      2 from .split import GapLeavePOut
      3 from .split import GapKFold
      4 from .split import GapWalkForward
      5 from .split import gap_train_test_split

~\Anaconda3\envs\py37\lib\site-packages\tscv\split.py in <module>
      7
      8 import numpy as np
----> 9 from sklearn.utils import indexable, safe_indexing
     10 from sklearn.utils.validation import _num_samples
     11 from sklearn.base import _pprint

ImportError: cannot import name 'safe_indexing' from 'sklearn.utils'
```

Any insight? I get this when simply importing GapWalkForward.
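For context, scikit-learn deprecated the public safe_indexing helper in favour of the private _safe_indexing and removed the public name in 0.24, which is what triggers this ImportError. A minimal compatibility shim along these lines (a sketch, not necessarily how tscv resolved it) would be:

```python
# Import safe_indexing from wherever the installed scikit-learn provides it.
try:
    from sklearn.utils import safe_indexing                    # scikit-learn < 0.24
except ImportError:
    from sklearn.utils import _safe_indexing as safe_indexing  # scikit-learn >= 0.22
```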

Documentation

Documentation and examples do not address the splitting of the data set into training and test sets.

If using one of the cross-validators, does the data set need to be sorted in time order? Is there a way to designate a datetime column so the class understands on what basis to sequentially split the data?
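As far as I can tell, the splitters operate purely on row order (like scikit-learn's TimeSeriesSplit), so the data should be sorted by its time column before splitting rather than designating a datetime column. An illustrative sketch, with the file name, column names, gap sizes, and GapKFold's gap_before/gap_after keywords taken as assumptions for the example:

```python
import pandas as pd
from tscv import GapKFold

df = pd.read_csv("sales.csv")                         # hypothetical dataset
df = df.sort_values("date").reset_index(drop=True)    # put rows in time order first

X = df[["feature_a", "feature_b"]].to_numpy()         # hypothetical feature columns
y = df["target"].to_numpy()

cv = GapKFold(n_splits=5, gap_before=2, gap_after=2)  # gap sizes are arbitrary here
for train_idx, test_idx in cv.split(X):
    pass  # fit on X[train_idx], y[train_idx]; evaluate on X[test_idx], y[test_idx]
```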

Import error with latest sklearn version

Hi guys, this issue occurred after upgrading to scikit-learn 1.1.3:

```
ImportError: cannot import name '_pprint' from 'sklearn.base'

/.venv/lib/python3.10/site-packages/tscv/_split.py:19 in <module>

   16 import numpy as np
   17 from sklearn.utils import indexable
   18 from sklearn.utils.validation import _num_samples, check_consistent_length
❱  19 from sklearn.base import _pprint
   20 from sklearn.utils import _safe_indexing
   21
   22
```

Could you please fix it?

Kind regards,
Jim
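For anyone hitting this before a fix lands: _pprint is a private scikit-learn helper that appears to be used only to format parameter dictionaries in the splitter's __repr__. One stopgap (a sketch, not tscv's actual fix) is to vendor a minimal replacement instead of importing it:

```python
def _pprint(params, offset=0, printer=repr):
    """Minimal stand-in for the removed sklearn.base._pprint helper.

    Renders a parameter dictionary as "key=value" pairs.  This sketch ignores
    the line-wrapping the original helper performs; it only needs to be good
    enough for a readable __repr__.
    """
    return ", ".join(f"{key}={printer(value)}"
                     for key, value in sorted(params.items()))
```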

Implement Rep-Holdout

Thank you for this repository and the implemented CV methods, especially GapRollForward. I was looking for exactly this package.

I was wondering if you are interested in implementing another CV method for time series, called Rep-Holdout. It is used in this evaluation paper (https://arxiv.org/abs/1905.11744) and has good performance compared to all other CV methods - some of which you have implemented here.

As I understand it, it is somewhat like sklearn.model_selection.TimeSeriesSplit but with a randomized selection of all possible folds. Here is the description from the paper as an image:

[Image: description of Rep-Holdout from the paper]


The authors provided code in R, but it is written very differently from how it would need to look in Python. I adapted your functions to implement it in Python, but I am not the best coder and it really only serves my purpose of tuning a specific model. Seeing as the performance of Rep-Holdout is good and - to me at least - it makes sense for time series cross-validation, maybe you are interested in adding this function to your package?
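For readers who have not seen the paper, a rough sketch of the idea (randomly placed train/test windows, one pair per repetition) might look like the following; the function name, parameters, and defaults are my own assumptions, not an API proposal:

```python
import numpy as np


def rep_holdout_split(n_samples, n_reps=10, train_size=0.6, test_size=0.1,
                      gap_size=0, random_state=None):
    """Yield (train_idx, test_idx) pairs in the spirit of Rep-Holdout.

    For each repetition a cut-off point is sampled uniformly at random from
    the region that leaves room for a full training window before it and a
    full (gap + test) window after it.
    """
    rng = np.random.default_rng(random_state)
    train_len = int(n_samples * train_size)
    test_len = int(n_samples * test_size)
    low, high = train_len, n_samples - gap_size - test_len
    for _ in range(n_reps):
        cut = rng.integers(low, high)                  # random split point
        train_idx = np.arange(cut - train_len, cut)    # window before the cut
        test_idx = np.arange(cut + gap_size, cut + gap_size + test_len)
        yield train_idx, test_idx
```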

Make it work with cross_val_predict

Is it possible to somehow make the CV work with the cross_val_predict function? For example, if I try:

```python
cv = GapWalkForward(n_splits=3, gap_size=1, test_size=2)
cross_val_predict(estimator=SGDClassifier(), X=X_sample, y=y_bin_sample, cv=cv, n_jobs=6)
```

it returns an error:

```
ValueError: cross_val_predict only works for partitions
```

but I would like to have the predictions so I can compute a confusion matrix and other statistics.

Is it possible to make it work with your cross-validators?
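cross_val_predict requires every sample to land in exactly one test fold, which walk-forward splits (with overlapping training windows and gaps) do not satisfy. One workaround is to iterate over the splits yourself and collect the per-fold predictions; a sketch, with synthetic data standing in for X_sample and y_bin_sample from the question:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix
from tscv import GapWalkForward

# Synthetic stand-ins for the data in the question.
rng = np.random.RandomState(0)
X_sample = rng.randn(40, 3)
y_bin_sample = (X_sample[:, 0] > 0).astype(int)

cv = GapWalkForward(n_splits=3, gap_size=1, test_size=2)
y_true, y_pred = [], []
for train_idx, test_idx in cv.split(X_sample):
    model = SGDClassifier().fit(X_sample[train_idx], y_bin_sample[train_idx])
    y_pred.append(model.predict(X_sample[test_idx]))
    y_true.append(y_bin_sample[test_idx])

# Confusion matrix over all out-of-fold predictions.
print(confusion_matrix(np.concatenate(y_true), np.concatenate(y_pred)))
```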

[Docs] Use this package for Nested Cross-Validation

This issue documents the way to use this package for nested cross-validation. If you have any questions, feel free to comment below.

Flat cross-validation vs. nested cross-validation

To clarify the meaning of these two terms in this specific issue, let me first describe them.

Flat cross-validation

Let us use 5-Fold as an example. In a 5-Fold flat cross-validation, you split the dataset into 5 subsets. Each time, you train a model on 4 of them and test it on the remaining one. Afterwards, you average the 5 scores yielded by the 5 test subsets.

```
ooooo: training subset
*****: test subset

ooooo ooooo ooooo ooooo *****
ooooo ooooo ooooo ***** ooooo
ooooo ooooo ***** ooooo ooooo
ooooo ***** ooooo ooooo ooooo
***** ooooo ooooo ooooo ooooo
```

Reasonably, the model you train depends on both the algorithm you use and the hyperparameters you input. Therefore, the averaged score provides a criterion to evaluate both the algorithms and the hyperparameters. I will later explain whether these evaluations are accurate enough, but for now it suffices to understand the basic procedure.

Nested cross-validation

In contrast to flat cross-validation, which evaluates both the algorithms and the hyperparameters in one fell swoop, nested cross-validation evaluates them in a hierarchical fashion. In the upper level, it evaluates the algorithms; in the lower level, it evaluates the hyperparameters within each algorithm.

Let us still use the 5-Fold setup. First we, likewise, split the dataset into 5 subsets. Let us call this the macro split, which allows us to run the same experiment 5 times. In each run, we further split the training set into 5 sub-subsets. Let us call this the micro split. If the whole dataset has 25 samples, then the macro split sets 20 samples for training and 5 samples for testing in each run, and the micro split further splits the 20 training samples into 16 for training and 4 for testing.

Macro split:

```
12345 12345 12345 12345 *****   =>  further split to micro split -- No. 1
12345 12345 12345 ***** 12345   =>  further split to micro split -- No. 2
12345 12345 ***** 12345 12345   =>  further split to micro split -- No. 3
12345 ***** 12345 12345 12345   =>  further split to micro split -- No. 4
***** 12345 12345 12345 12345   =>  further split to micro split -- No. 5
```

(Indicative) micro split -- No. 1 (5 in total):

```
1111 2222 3333 4444 xxxx
1111 2222 3333 xxxx 5555
1111 2222 xxxx 4444 5555
1111 xxxx 3333 4444 5555
xxxx 2222 3333 4444 5555
```

In the upper-level macro split, we choose a target algorithm and dive into the lower-level micro split. With the target algorithm fixed, we vary the hyperparameters, obtain an evaluation for each, and choose the optimal one. Then, we return to the upper level, fix the hyperparameters at their optimal values, and evaluate the target algorithm. Finally, we choose another target algorithm and repeat the same procedure.

Let us call this 5x5 nested cross-validation. Of course, you can in general use an m x n nested cross-validation. The essence is to separate the evaluation of the algorithm from the evaluation of the hyperparameters.
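In scikit-learn terms, the 5x5 procedure can be written by nesting a grid search (micro split) inside an outer scorer (macro split). The estimator, grid, and dataset below are placeholders chosen only to make the sketch runnable:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = make_regression(n_samples=25, n_features=4, random_state=0)

inner = KFold(n_splits=5)   # micro split: picks the hyperparameter
outer = KFold(n_splits=5)   # macro split: scores the tuned algorithm

tuned = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=inner)
scores = cross_val_score(tuned, X, y, cv=outer)   # one score per macro fold
print(scores.mean())
```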

Use nested cross-validation for time series

In time series cross-validation, you need to introduce gaps, which makes the problem tricky. Luckily, there is an easy workaround: the 2xn nested cross-validation comes for free:

2x4 nested cross-validation
---------------------------------

Macro split:

```
ooooo ooooo ooooo ooooo gap *****
***** gap ooooo ooooo ooooo ooooo
```

Micro split -- No. 1 (2 in total):

```
oooo oooo oooo gap ****
ooo ooo gap **** gap ooo
ooo gap **** gap ooo ooo
**** gap oooo oooo oooo
```

You can use my package tscv for this kind of 2xn nested cross-validation.
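Concretely, a 2x4 gap-aware nested cross-validation might be wired up like this. This is a sketch: the gap sizes, estimator, grid, and dataset are placeholders, and GapKFold's gap_before/gap_after keywords are assumed from the package's documentation:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, cross_val_score
from tscv import GapKFold

X, y = make_regression(n_samples=120, n_features=4, random_state=0)

# Macro split: 2 folds; micro split: 4 folds; both leave gaps around the test block.
outer = GapKFold(n_splits=2, gap_before=5, gap_after=5)
inner = GapKFold(n_splits=4, gap_before=5, gap_after=5)

tuned = GridSearchCV(Lasso(), {"alpha": [0.01, 0.1, 1.0]}, cv=inner)
scores = cross_val_score(tuned, X, y, cv=outer)   # one score per macro fold
```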

Why nested cross-validation?

The reason is that algorithms with more hyperparameters have an edge in flat cross-validation. The dimension of the hyperparameter space can be seen as the algorithm's capacity for "bribery": the more hyperparameters an algorithm owns, the more severely it can compromise the test dataset. Flat cross-validation, by nature, favours algorithms with rich hyperparameters. In contrast, nested cross-validation puts every algorithm on the same starting line. That is why nested cross-validation is preferred when comparing algorithms with significantly different numbers of hyperparameters.

Then, does nested cross-validation provide an accurate way to evaluate the final chosen model? No. Although it helps you pick the best algorithm and its hyperparameters, the resulting model's performance is not what is being measured. To explain this, we would need some advanced statistics; to avoid bloating this issue, I will only mention here that model(x*) is different from model(x)|x=x*. The good news, however, is that if your algorithm does not have too many hyperparameters, the cross-validation error will not be too far away from the resulting model's error. Therefore, an algorithm with better performance in nested cross-validation likely leads to a model with better generalization error.

Release 0.0.4 for GridSearch compat

Would it be possible to issue a new release on PyPI to include the latest changes from this commit, which aligns the get_n_splits method signature with the abstract method signature required by GridSearchCV?
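For reference, the signature GridSearchCV expects matches the abstract get_n_splits on scikit-learn's BaseCrossValidator, i.e. three optional arguments; MySplitter below is just a hypothetical stand-in for a custom splitter:

```python
class MySplitter:
    def __init__(self, n_splits=5):
        self.n_splits = n_splits

    # GridSearchCV calls get_n_splits(X, y, groups), so the splitter must
    # accept these (optional) arguments even if it ignores them.
    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits
```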

split.py depends on deprecated / newly private method `_safe_indexing` in scikit-learn 0.24.0

Just flagging a minor issue:

We found this after running `poetry update` on our dependencies, which inadvertently bumped scikit-learn to 0.24.0. This broke code we have that uses tscv.

Relevant scikit-learn source code from version 0.23.0:
https://github.com/scikit-learn/scikit-learn/blob/0.23.0/sklearn/utils/__init__.py#L274-L275

The method has been made private in scikit-learn 0.24.0: https://github.com/scikit-learn/scikit-learn/blob/0.24.0/sklearn/utils/__init__.py#L271

I did not investigate further; we pinned scikit-learn to 0.23.0, and that's OK for now, but some refactoring may be in order to move off the private method.

Does this work with sklearn 1.2?

There seem to be some changes in the sklearn library that are causing compatibility issues with tscv. Just wondering if I'm doing something wrong or if tscv currently doesn't support sklearn 1.2.
