
heidelbergcement / hcrystalball

152 stars · 11 watchers · 19 forks · 11.12 MB

A library that unifies the API for most commonly used libraries and modeling techniques for time-series forecasting in the Python ecosystem.

Home Page: https://hcrystalball.readthedocs.io/

License: MIT License

Languages: Jupyter Notebook 1.50%, Python 98.46%, Dockerfile 0.04%
Topics: sklearn, fbprophet, pmdarima, sarimax, time-series-forecasting, time-series, model-selection, cross-validation, statsmodels, tbats

hcrystalball's People

Contributors

cdeil, github-actions[bot], michalchromcak, pavelkrizek


hcrystalball's Issues

[FEATURE] Add fit, predict parallel flow for multiple time-series data

Is your feature request related to a problem? Please describe.
Once the best models are selected via the functional API (run_model_selection or select_model_general) or the object-oriented API (ModelSelector.select_model), one usually needs to re-fit the selected models (e.g. daily) and predict with them. This part is currently missing, even though such a core flow could be called directly from hcrystalball.

Describe the solution you'd like

  • Run CV with the persistence of ModelSelectorResults or minimal setup (only best_model might suffice here)

  • Load persisted models

  • Get the new training data

  • Run the fit_predict flow, which takes data for the subset of partitions for which one has models (a minimal sketch follows after this list)

    • Split new data training data
    • Run fit mapped over the partitions
    • Run predict mapped over the partitions
    • Reduce predictions to 1 dataframe
    • Return predictions
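
A minimal sketch of such a flow, assuming hypothetical inputs (dicts keyed by partition label holding persisted models, fresh training data, and prediction frames); the names are illustrative and not part of the current API. Mapping fit/predict over partitions with e.g. joblib could parallelize the two middle steps.

import pandas as pd

def fit_predict_flow(models, train_data, pred_frames):
    """Re-fit persisted models per partition and reduce predictions to one dataframe.

    models      -- dict: partition label -> selected (persisted) model
    train_data  -- dict: partition label -> (X_train, y_train) with fresh data
    pred_frames -- dict: partition label -> X with the future index to predict
    """
    predictions = []
    # run only over the partitions for which a model exists
    for label in models.keys() & train_data.keys():
        X_train, y_train = train_data[label]
        model = models[label]
        model.fit(X_train, y_train)            # re-fit on the new training data
        preds = model.predict(pred_frames[label])
        preds["partition"] = label             # keep track of the origin
        predictions.append(preds)
    return pd.concat(predictions)              # reduce to a single dataframe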

Describe alternatives you've considered
Adopt loading of persisted models within the flow; but since models might be stored on disk or in a database, let's leave that out of scope.

Additional context
We should ensure consistency between CV and fit-predict (frequency should be the same, horizon might change I guess, ...)

[FEATURE] Ensure random state for estimators

Is your feature request related to a problem? Please describe.
To make results reproducible, models should be initialized with a random state.

Describe the solution you'd like
Running the same model's fit and predict twice gives the same results if a random state is passed.
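
The property being asked for, demonstrated with plain scikit-learn (whether and how random_state gets forwarded through the hcrystalball wrappers is exactly what this feature request would define):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.arange(100).reshape(-1, 1)
y = np.sin(np.arange(100) / 5.0)

# two runs with the same random_state must produce identical predictions
preds_1 = RandomForestRegressor(random_state=42).fit(X, y).predict(X)
preds_2 = RandomForestRegressor(random_state=42).fit(X, y).predict(X)
assert np.array_equal(preds_1, preds_2)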

Describe alternatives you've considered
N/A

Additional context
N/A

[OTHER] Question: AR coefficients, sklearn lags understanding

Hello.

A very simple question.

Once I've estimated an autoregression as a wrapped linear model, say as

from sklearn.linear_model import LinearRegression
from hcrystalball.wrappers import get_sklearn_wrapper

model_lr = get_sklearn_wrapper(LinearRegression, lags=10)
mlr = model_lr.fit(X[:-10], y[:-10])

how do I get the autoregression coefficients?

I cannot find the equivalent of scikit-learn's .coef_, like

mlr = LinearRegression().fit(X, y)
mlr.coef_

Thanks.
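
For reference, the same kind of lagged regression built directly in scikit-learn (an illustrative sketch, not hcrystalball's internal code), where .coef_ holds one autoregression coefficient per lag:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

y = pd.Series(np.random.default_rng(0).normal(size=200))
lags = 10

# lag matrix: column lag_i holds y shifted by i + 1 steps
X_lagged = pd.concat({f"lag_{i}": y.shift(i + 1) for i in range(lags)}, axis=1).dropna()
y_target = y.loc[X_lagged.index]

ar = LinearRegression().fit(X_lagged, y_target)
print(ar.coef_)  # autoregression coefficients, one per lag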

[FEATURE] Update to newer python version by default

Is your feature request related to a problem? Please describe.
Missing compatibility with newer Python features.

Describe the solution you'd like
The default environment that CI and the notebooks run against should use the latest (conda-resolved) Python version.

Describe alternatives you've considered
N/A

Additional context
N/A

Choose exact lags in 'get_sklearn_wrapper'

I'm trying to replicate statsmodels' AutoReg with this wrapper and ridge regression. Results seem to be on point; however, statsmodels gives the option to choose the exact lags to be included in the model.

Is there such an option in the get_sklearn_wrapper method?
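
For comparison, the statsmodels behavior being replicated, where lags can be an explicit list of lag orders (this is statsmodels' API, not hcrystalball's):

import numpy as np
import pandas as pd
from statsmodels.tsa.ar_model import AutoReg

index = pd.date_range("2020-01-01", periods=200, freq="D")
y = pd.Series(np.random.default_rng(0).normal(size=200), index=index)

# lags can be an int (use lags 1..n) or an explicit list of lag orders
res = AutoReg(y, lags=[1, 7, 14]).fit()
print(res.params)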

[FEATURE] Add support for Greykite

Is your feature request related to a problem? Please describe.
No.

Describe the solution you'd like
Add support for the recently open-sourced Greykite by LinkedIn.

Introductory Blog post, on LinkedIn Engineering Blog
Source, on Github
Paper, published on arxiv

Describe alternatives you've considered
I can try coding it up myself by using the wrapper class as a template, and make a pull request after I get the code to pass the unit tests, but I'm fairly new to Python, and it might take me a week or more to make the code clean. As Greykite was open-sourced less than a week ago, I wonder if someone has already implemented it in hcrystalball.

Additional context
The linked blog post states that the main advantages of Greykite over Prophet are speed and out-of-the-box accuracy. It would be useful to many if this were implemented in HCrystalBall.

[Image: table showing benchmark comparison of Silverkite against auto-ARIMA and Prophet]

[FEATURE] Add readable progress indication also to the console output

Is your feature request related to a problem? Please describe.
When running the model selection e.g. from the python script, there is a lack of progress indication.

Describe the solution you'd like
Use tqdm not only in the Jupyter environment, but also in the console. This might make tqdm a hard dependency.

Additional context
This is especially useful for long-running sessions. Make sure to look at the parallel execution to place the progress bar correctly. Preferably it stays always visible (top/bottom of the output?) while the logs remain scrollable.
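
A minimal sketch of the console variant, assuming the standard tqdm package (how it would be wired into the parallel model selection is left open here):

import time
from tqdm import tqdm          # plain console progress bar
# from tqdm.auto import tqdm   # would pick the notebook widget when available

for partition in tqdm(range(20), desc="model selection"):
    time.sleep(0.1)            # placeholder for fitting one partition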

[BUG] Pre-commit does not strip notebooks' kernel information making docs build fail

Describe the bug
The notebooks' kernelspec name can be changed locally by contributors, causing the docs build to fail.

To Reproduce
Create an ipykernel named hcrystalball, execute the notebooks with it, and push your changes.

Expected behavior
The docs build is independent of the developer's local setup.

Screenshots
N/A

Additional context
For ease of development
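
A hedged sketch of what the pre-commit hook could do: normalize the kernelspec in every committed notebook so the docs build does not depend on local kernel names (nbstripout with extra metadata keys is an alternative; the exact hook wiring is not decided here):

import json
import sys

# replace whatever local kernel name a contributor used with a canonical one
CANONICAL = {"display_name": "Python 3", "language": "python", "name": "python3"}

for path in sys.argv[1:]:
    with open(path, encoding="utf-8") as f:
        nb = json.load(f)
    nb.setdefault("metadata", {})["kernelspec"] = CANONICAL
    with open(path, "w", encoding="utf-8") as f:
        json.dump(nb, f, indent=1, ensure_ascii=False)
        f.write("\n")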

[OTHER] Feedback on reduction to regression `get_sklearn_wrapper`

Hi, following our conversation yesterday, I have two questions about the get_sklearn_wrapper:

1. API: Why do you dynamically create the init and pass the class rather than simply using composition and passing the object?

forecaster = get_sklearn_wrapper(DummyRegressor, lags=3)  # you do this
forecaster = get_sklearn_wrapper(DummyRegressor(), lags=3)  # why not this?
  • This would avoid the issues you seem to have with local class definitions and pickling
  • This would allow you to pass composite models like pipelines or GridSearchCV into the wrapper
  • I really like the dynamic init creator, but my intuition tells me it's misapplied here.

2. Algorithm: Why not use standard recursive strategy?

from hcrystalball.wrappers import get_sklearn_wrapper
from sklearn.dummy import DummyRegressor
import pandas as pd
import numpy as np

index = pd.date_range("2000", periods=13, freq="Y")
y_train = pd.Series(np.arange(10), index=index[:-3])
X_train = pd.DataFrame(index=y_train.index)
X_test = pd.DataFrame(index=index[-3:])

model = get_sklearn_wrapper(DummyRegressor, lags=3)
model.fit(X_train, y_train)
model.predict(X_test)
# >>> 2010-12-31	7.0
# >>> 2011-12-31	7.0
# >>> 2012-12-31	7.0

# you use the first 3 values as lagged variables, the DummyRegressor simply computes the mean of the rest
# so shouldn't the result be the following?
y_train.iloc[3:].mean()  # >>> 6.0

# the problem seems to be in the way you generate the target series 
# you increase the gap between lagged variables and target to match the length of 
# forecasting horizon 
X, y = model._transform_data_to_tsmodel_input_format(X_train, y_train, len(X_test))
pd.concat([X, pd.Series(y, index=X.index, name="y")], axis=1).head()
#	lag_0	lag_1	lag_2	y
#5	2.0	1.0	0.0	5
#6	3.0	2.0	1.0	6
#7	4.0	3.0	2.0	7
#8	5.0	4.0	3.0	8
#9	6.0	5.0	4.0	9

Hope this helps!
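
For context, a sketch of the standard recursive strategy the comment refers to: fit a one-step-ahead model on lagged values and feed each prediction back in as a lag for the next step (illustrative only, not hcrystalball's current implementation):

import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor

index = pd.date_range("2000", periods=13, freq="Y")
y_train = pd.Series(np.arange(10, dtype=float), index=index[:-3])
lags, horizon = 3, 3

# one-step-ahead training set: target at t, features y[t-1], y[t-2], y[t-3]
X = pd.concat({f"lag_{i}": y_train.shift(i + 1) for i in range(lags)}, axis=1).dropna()
y = y_train.loc[X.index]
model = DummyRegressor().fit(X.values, y)  # DummyRegressor predicts the mean of y

# recursive forecasting: append each prediction and reuse it as a lag
history = list(y_train.values)
preds = []
for _ in range(horizon):
    features = np.array(history[-lags:][::-1]).reshape(1, -1)
    next_val = model.predict(features)[0]
    preds.append(next_val)
    history.append(next_val)

print(preds)  # [6.0, 6.0, 6.0] -- matches y_train.iloc[3:].mean()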

[FEATURE] Add detrend transformer to model_selection for pipelines with random forest

Is your feature request related to a problem? Please describe.
For some of the time series, pipelines with random forest models do not perform well due to trend in the data (they can't predict values bigger than those seen during training).

Describe the solution you'd like
Add a detrend transformer to make such pipelines viable.

Describe alternatives you've considered
Implement/import detrend from sktime

Additional context
Let's decide whether to implement detrend ourselves or import it from other libraries, along with the possibility of reusing other provided transformers (e.g. sktime, scikit-lego, etc.).
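
A minimal sketch of such a transformer, assuming a simple linear trend fitted on the training target (illustrative; whether to implement it like this or import it from sktime is exactly the open question above):

import numpy as np

class LinearDetrend:
    """Remove a linear trend from the target before fitting, add it back after predict.

    Standalone sketch; wiring it into hcrystalball pipelines (which would need to
    transform y rather than X) is left to the library's own conventions.
    """

    def fit(self, y):
        t = np.arange(len(y))
        self.slope_, self.intercept_ = np.polyfit(t, np.asarray(y, dtype=float), deg=1)
        self.n_train_ = len(y)
        return self

    def transform(self, y):
        t = np.arange(len(y))
        return np.asarray(y, dtype=float) - (self.slope_ * t + self.intercept_)

    def inverse_transform(self, y_pred):
        # extend the fitted trend line into the forecast horizon
        t = np.arange(self.n_train_, self.n_train_ + len(y_pred))
        return np.asarray(y_pred, dtype=float) + (self.slope_ * t + self.intercept_)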

[FEATURE] Implement deprecated sktime interface on hcrystalball

Is your feature request related to a problem? Please describe.
The current sktime interface to hcrystalball is going to be deprecated soon. A solution to keep using hcrystalball with sktime would be to move the wrapper from the sktime package to the hcrystalball package.

Describe the solution you'd like
Copy the existing code from https://github.com/alan-turing-institute/sktime/blob/main/sktime/forecasting/hcrystalball.py to hcrystalball. Tests from sktime are now also portable, which means they can be imported from the package and run on that class. See here: https://www.sktime.org/en/stable/developer_guide/testing_framework.html

[BUG] Handling of country_code_column in ModelSelector

Describe the bug
When calling init on ModelSelector, country_code_column can be optionally supplied. This value is used neither in ModelSelector.select_model nor in ModelSelector.create_gridsearch.

To Reproduce

>>> from hcrystalball.model_selection import ModelSelector

>>> ms = ModelSelector(frequency='D', horizon=10, country_code_column="region")
>>> ms.create_gridsearch()
>>> ms.grid_search.estimator.steps
[('exog_passthrough', 'passthrough'),
 ('holiday', 'passthrough'),
 ('model', 'passthrough')]

Here the ('holiday') step should contain a HolidayTransformer, not 'passthrough'.

Expected behavior
One ModelSelector instance per country_code_column value -> country_code_column should be supplied only at init time and later used in further methods (same as e.g. frequency).


[BUG] Exception with scikit-learn >=1.3.0

Describe the bug
This package raises an exception when trying to import hcrystalball.metrics with scikit-learn >=1.3.0.

To Reproduce
Install hcrystalball into a new (Python >=3.8) environment along with scikit-learn>=1.3.0.
Attempt to import hcrystalball.metrics.

Note the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\...\lib\site-packages\hcrystalball\metrics\__init__.py", line 1, in <module>
    from ._scorer import get_scorer
  File "C:\...\lib\site-packages\hcrystalball\metrics\_scorer.py", line 5, in <module>
    from sklearn.metrics import SCORERS
ImportError: cannot import name 'SCORERS' from 'sklearn.metrics' (C:\...\lib\site-packages\sklearn\metrics\__init__.py)

Expected behavior
no exception

Screenshots
N/A

Additional context
SCORERS has been removed from sklearn.metrics as of 1.3.0
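
A hedged sketch of a version-compatible import for _scorer.py (the actual fix in the library may differ):

try:
    # scikit-learn < 1.3 exposed a module-level SCORERS dict
    from sklearn.metrics import SCORERS

    def _get_base_scorer(name):
        return SCORERS[name]

except ImportError:
    # scikit-learn >= 1.3 removed SCORERS; get_scorer is the supported accessor
    from sklearn.metrics import get_scorer as _get_base_scorer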

[FEATURE] Add wrapper for Theta model from statsmodels

Is your feature request related to a problem? Please describe.
I would like to see the Theta forecasting model of Assimakopoulos and Nikolopoulos (2000) in hcrystalball. This method performed particularly well in forecasting competitions (M3) and can be of interest to other practitioners.

Describe the solution you'd like
Have a wrapper for the Theta model from statsmodels https://www.statsmodels.org/dev/generated/statsmodels.tsa.forecasting.theta.ThetaModel.html#statsmodels.tsa.forecasting.theta.ThetaModel with a unified interface.
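
For reference, the statsmodels API such a wrapper would sit on top of (shown directly with statsmodels, not the proposed hcrystalball interface):

import numpy as np
import pandas as pd
from statsmodels.tsa.forecasting.theta import ThetaModel

index = pd.date_range("2015-01-01", periods=60, freq="MS")
y = pd.Series(np.linspace(10, 40, 60) + 5 * np.sin(np.arange(60) / 6), index=index)

res = ThetaModel(y, period=12).fit()
print(res.forecast(steps=10))  # 10-step-ahead point forecast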

[FEATURE] Create more robust ci check for python versions

Is your feature request related to a problem? Please describe.
Embedding HCrystalBall into a project workflow might show unexpected behavior, as we do not check for Python version compatibility (only 3.7 now).

Describe the solution you'd like
Re-think CI to also add Python 3.6, 3.7, 3.8, and 3.9 to the testing matrix.

Describe alternatives you've considered
None

Additional context
The CI code would need to change, as a conda environment is built from environment.yml, which currently pins python=3.7.6. It is also not clear right now how to easily pass/enforce different Python versions there.

[FEATURE] Non-continuous index in predict

Is your feature request related to a problem? Please describe.
Sometimes it might be handy to be able to pass an index with gaps to the predict methods, e.g. wanting forecasts only 1, 3, and 7 days ahead.

Describe the solution you'd like

X_pred = pd.DataFrame(index=pd.DatetimeIndex(["2020-01-01", "2020-01-03", "2020-01-07"]))
preds = model.predict(X_pred)
assert X_pred.index.to_list() == preds.index.to_list()
  • Some models support that natively; for others, the only way might be to predict for the full range and later select only the X_pred values from the results:

full_ind = pd.date_range(X_pred.index.min(), X_pred.index.max())
preds = model.predict(pd.DataFrame(index=full_ind).merge(X_pred, left_index=True, right_index=True, how="left"))
return preds.loc[X_pred.index.to_list()]

  • SklearnWrapper might just pick the exact requested steps (and thus support a list of concrete lags) - especially handy when optimize_for_horizon is set to True

Describe alternatives you've considered
Make it uniform for all models, while "wasting" computational time, but having just one implementation.

Additional context
Request partially raised in order to be compliant with sktime, but we could also discuss its usefulness in general.

[FEATURE] Support python 3.6

Is your feature request related to a problem? Please describe.
There seems to be no problem in using Python 3.6, so why not keep supporting it.

Describe the solution you'd like
Change python_requires in setup.cfg from >=3.7 to >=3.6.

Describe alternatives you've considered
Do not support

Additional context
Adds compatibility with sktime

[FEATURE] Do not allow for empty X_train, y_train in SklearnWrapper after lag creation

Is your feature request related to a problem? Please describe.
If a user calls fit and later predict on the SklearnWrapper with training data of length 51, horizon 50, and lags 3, our implementation leaves them with empty X_train and y_train, because lag creation introduces unavoidable data cuts.

Describe the solution you'd like
SklearnWrapper should raise ValueError in case there is nothing left in the data for fitting.

Describe alternatives you've considered
We might consider implementing other approaches (without lags) but should start with fixing this one.

Additional context
The issue occurred when implementing HCrystalBallForecaster in sktime
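
A minimal sketch of the proposed guard inside the wrapper's fit path (the helper name and exact message are illustrative):

def _check_nonempty_after_lag_creation(X_train, y_train, lags, horizon):
    """Raise instead of silently fitting on an empty training set."""
    if len(X_train) == 0 or len(y_train) == 0:
        raise ValueError(
            "No training data left after lag creation: "
            f"lags={lags} and horizon={horizon} consumed all observations. "
            "Provide a longer training series or reduce lags/horizon."
        )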


Optional dependencies not failing on import

When using a minimal pip install hcrystalball installation and trying to run the following code with prophet models enabled, the expected behavior is an import error for the prophet wrapper:

import pandas as pd
from hcrystalball.model_selection import ModelSelector
from hcrystalball.utils import get_sales_data

df = get_sales_data(n_dates=100,
                    n_assortments=2,
                    n_states=2,
                    n_stores=2)

ms = ModelSelector(horizon=10,
                   frequency='D',
                   country_code_column='HolidayCode',
                   )

ms.create_gridsearch(sklearn_models=True,
                     prophet_models=True,
                     n_splits=2,
                     between_split_lag=None,
                     sklearn_models_optimize_for_horizon=False,
                     autosarimax_models=False,
                     tbats_models=False,
                     exp_smooth_models=False,
                     average_ensembles=False,
                     stacking_ensembles=False,
                     exog_cols=['Open', 'Promo', 'SchoolHoliday', 'Promo2'],
                     )

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-3c1b160302bf> in <module>
----> 1 ms.create_gridsearch(sklearn_models=True,
      2                     n_splits = 2,
      3                     between_split_lag=None,
      4                     sklearn_models_optimize_for_horizon=False,
      5                     autosarimax_models=False,

~/miniconda3/envs/hcb_conda/lib/python3.8/site-packages/hcrystalball/model_selection/_model_selector.py in create_gridsearch(self, n_splits, between_split_lag, scoring, country_code_column, country_code, sklearn_models, sklearn_models_optimize_for_horizon, autosarimax_models, autoarima_dict, prophet_models, tbats_models, exp_smooth_models, average_ensembles, stacking_ensembles, stacking_ensembles_train_horizon, stacking_ensembles_train_n_splits, clip_predictions_lower, clip_predictions_upper, exog_cols)
    234         """
    235         params = {k: v for k, v in locals().items() if k not in ["self"]}
--> 236         self.grid_search = get_gridsearch(frequency=self.frequency, horizon=self.horizon, **params)
    237 
    238     def add_model_to_gridsearch(self, model):

~/miniconda3/envs/hcb_conda/lib/python3.8/site-packages/hcrystalball/model_selection/_configuration.py in get_gridsearch(frequency, horizon, n_splits, between_split_lag, scoring, country_code_column, country_code, sklearn_models, sklearn_models_optimize_for_horizon, autosarimax_models, autoarima_dict, prophet_models, tbats_models, exp_smooth_models, average_ensembles, stacking_ensembles, stacking_ensembles_train_horizon, stacking_ensembles_train_n_splits, clip_predictions_lower, clip_predictions_upper, exog_cols)
    227             {
    228                 "model": [
--> 229                     ProphetWrapper(
    230                         clip_predictions_lower=clip_predictions_lower,
    231                         clip_predictions_upper=clip_predictions_upper,

TypeError: __init__() got an unexpected keyword argument 'clip_predictions_lower'
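
One common pattern for the expected behavior, sketched here as an assumption about how the wrapper module could be structured (not the library's current code): capture the failed optional import and re-raise it with a helpful message when the wrapper is actually instantiated.

try:
    from prophet import Prophet  # optional dependency (import name varies by version)
    _PROPHET_IMPORT_ERROR = None
except ImportError as exc:
    Prophet = None
    _PROPHET_IMPORT_ERROR = exc


class ProphetWrapperSketch:
    """Illustrative stand-in for an optional-dependency-aware ProphetWrapper."""

    def __init__(self, clip_predictions_lower=None, clip_predictions_upper=None):
        if Prophet is None:
            raise ImportError(
                "ProphetWrapper requires the optional dependency 'prophet'; "
                "install it to use prophet_models in the grid search."
            ) from _PROPHET_IMPORT_ERROR
        self.clip_predictions_lower = clip_predictions_lower
        self.clip_predictions_upper = clip_predictions_upper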

Python version and folder excludes from setup.cfg are not taken into account

For some reason, the following section is not taken into account when creating the distribution for PyPI with python setup.py sdist bdist_wheel.

The command creates hcrystalball-0.1.4.post0.dev13+gfcf6b6d.dirty-py2.py3-none-any.whl, although support for Python 2 should not be included (python_requires = >=3.7).

Also, the dev and tests folders are included in the tar.gz even though they are listed under [options.packages.find] exclude.

(hcb_conda) [hcrystalball-0.1.4.post0.dev13+gfcf6b6d.dirty] ls -l
total 80
-rw-r--r--   1 michalchromcak  staff   370 Jul  2 10:09 AUTHORS.rst
-rw-r--r--   1 michalchromcak  staff   553 Jul  2 15:11 CHANGELOG.rst
-rw-r--r--   1 michalchromcak  staff  1083 Jun 29 13:35 LICENSE.txt
-rw-r--r--   1 michalchromcak  staff  7427 Jul  3 10:28 PKG-INFO
-rw-r--r--   1 michalchromcak  staff  5340 Jul  3 00:07 README.md
drwxr-xr-x   8 michalchromcak  staff   256 Jul  3 10:28 dev
drwxr-xr-x  16 michalchromcak  staff   512 Jul  3 10:28 docs
-rw-r--r--   1 michalchromcak  staff  1189 Jul  3 08:44 environment.yml
-rw-r--r--   1 michalchromcak  staff  1736 Jul  3 10:28 setup.cfg
-rw-r--r--   1 michalchromcak  staff   550 Jul  3 10:18 setup.py
drwxr-xr-x   4 michalchromcak  staff   128 Jul  3 10:28 src
drwxr-xr-x   5 michalchromcak  staff   160 Jul  3 10:28 tests

hcrystalball/setup.cfg, lines 49 to 54 in 1b42c12:

python_requires = >=3.7
[options.packages.find]
where = src
exclude =
    tests
    dev

[FEATURE] Improve logging/warnings from third party libraries

Is your feature request related to a problem? Please describe.
When running the model selection (whether interactively or via code embedded in other functions), the logging output is overwhelming.

Describe the solution you'd like
Either solution 1 or solution 2 should be implemented.

  1. Add an option, which would introduce filters for warnings

    • Show everything just once
    • Filter particular libraries
    • Filter based on log level and logger
  2. Document best practices with HCrystalBall to make reasonable cuts in the logging

Describe alternatives you've considered
Leave it as is -> too much repeated logging hides valuable output (even though it can be filtered later on).

Additional context
An option to keep all third-party output should still be available to enable easy debugging of issues.
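
A sketch of what option 1 could look like for a user today, using only standard-library tooling (this is plain Python, not a new hcrystalball API; the logger names are examples):

import logging
import warnings

# show each distinct warning only once
warnings.filterwarnings("once")

# silence particularly chatty third-party loggers, keep hcrystalball's own output
logging.getLogger("cmdstanpy").setLevel(logging.WARNING)
logging.getLogger("hcrystalball").setLevel(logging.INFO)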
