antoinecarme / pyaf
PyAF is an Open Source Python library for Automatic Time Series Forecasting built on top of popular pydata modules.
License: BSD 3-Clause "New" or "Revised" License
antoine@z600:~/dev/python/packages/pyaf$ ipython3 tests/bench/test_yahoo.py
ACQUIRED_YAHOO_LINKS 4818
YAHOO_DATA_LINK AAPL https://raw.githubusercontent.com/antoinecarme/TimeSeriesData/master/YahooFinance/nasdaq/yahoo_AAPL.csv
YAHOO_DATA_LINK GOOG https://raw.githubusercontent.com/antoinecarme/TimeSeriesData/master/YahooFinance/nasdaq/yahoo_GOOG.csv
load_yahoo_stock_prices my_test 2
BENCH_TYPE YAHOO_my_test OneDataFramePerSignal
BENCH_DATA YAHOO_my_test <pyaf.Bench.TS_datasets.cTimeSeriesDatasetSpec object at 0x7fdc9c5f7fd0>
TIME : Date N= 1246 H= 12 HEAD= ['2011-07-28T00:00:00.000000000' '2011-07-29T00:00:00.000000000'
'2011-08-01T00:00:00.000000000' '2011-08-02T00:00:00.000000000'
'2011-08-03T00:00:00.000000000'] TAIL= ['2016-07-05T00:00:00.000000000' '2016-07-06T00:00:00.000000000'
'2016-07-07T00:00:00.000000000' '2016-07-08T00:00:00.000000000'
'2016-07-11T00:00:00.000000000']
SIGNAL : GOOG N= 1246 H= 12 HEAD= [ 610.941019 603.691033 606.771021 592.40099 601.171059] TAIL= [ 694.950012 697.77002 695.359985 705.630005 715.090027]
GOOG Date
0 610.941019 2011-07-28
1 603.691033 2011-07-29
2 606.771021 2011-08-01
3 592.400990 2011-08-02
4 601.171059 2011-08-03
https://en.wikipedia.org/wiki/Variance-stabilizing_transformation
In applied statistics, a variance-stabilizing transformation is a data transformation that is specifically chosen either to simplify considerations in graphical exploratory data analysis or to allow the application of simple regression-based or analysis of variance techniques.
The aim behind the choice of a variance-stabilizing transformation is to find a simple function ƒ to apply to values x in a data set to create new values y = ƒ(x) such that the variability of the values y is not related to their mean value.
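For instance (plain numpy, not pyaf code), a log transform removes the dependence of the spread on the level for a signal with multiplicative noise:

```python
import numpy as np

rng = np.random.default_rng(0)
# two segments of a signal with multiplicative noise: the raw standard
# deviation grows with the mean level
lo = 10.0 * rng.lognormal(sigma=0.2, size=2000)
hi = 1000.0 * rng.lognormal(sigma=0.2, size=2000)

print(lo.std(), hi.std())                   # spread scales with the level
print(np.log(lo).std(), np.log(hi).std())   # roughly equal after the transform
```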
The README file says that it is necessary to clone the github repository. This is not necessary.
A simple pip install should be OK, like:
pip install git+git://github.com/antoinecarme/pyaf.git
A lot of imports should be updated accordingly (the fully qualified name of a class should start with pyaf., etc.).
Need to find a dataset reporting the day of each duck (these are numbered) served at "La Tour d'Argent".
https://en.wikipedia.org/wiki/RAML_%28software%29
RESTful API Modeling Language (RAML) is a YAML-based language for describing RESTful APIs.[2] It provides all the information necessary to describe RESTful or practically-RESTful APIs. Although designed with RESTful APIs in mind, RAML is capable of describing APIs that do not obey all constraints of REST (hence the description "practically-RESTful"). It encourages reuse, enables discovery and pattern-sharing, and aims for merit-based emergence of best practices.
There is already a log for 'fit/train' with
INFO:pyaf.std:END_TRAINING_TIME_IN_SECONDS 'Ozone' 3.291870594024658
Add the same kind of logging for 'predict/forecast' and 'plot'.
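A sketch of how the same timing log could wrap the forecast and plot calls (the helper name and label strings are assumptions, only the END_TRAINING_TIME_IN_SECONDS line exists today):

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pyaf.std")

def timed_call(label, signal_name, func, *args, **kwargs):
    # hypothetical helper mirroring the existing training log line,
    # reusable for forecast() and plot()
    start = time.time()
    result = func(*args, **kwargs)
    elapsed = time.time() - start
    logger.info("END_%s_TIME_IN_SECONDS '%s' %s", label, signal_name, elapsed)
    return result

# usage sketch (lEngine.forecast as in the pyaf examples):
# forecast_df = timed_call("FORECASTING", "Ozone", lEngine.forecast, df, 12)
```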
On the Heroku platform, there are some very tiny differences in the dates generated by pyaf. This may impact the extraction of exogenous data.
According to https://github.com/ripienaar/free-for-dev
travis-ci.org — Free for public GitHub repositories
Goal:
Put in place a continuous integration process for pyaf.
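A minimal .travis.yml sketch for a public GitHub repository (the Python version and test script are illustrative assumptions):

```yaml
language: python
python:
  - "3.6"
install:
  - pip install git+git://github.com/antoinecarme/pyaf.git
script:
  - python tests/bench/test_yahoo.py
```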
Need a similar doc with detailed hierarchical and grouped time series examples.
The following test fails:
tests/artificial/transf_/trend_poly/cycle_7/ar_/test_artificial_1024__poly_7__20.py
It fails with the exception:
ValueError: shapes (1012,1030) and (1000,) not aligned: 1030 (dim 1) != 1000 (dim 0)
This seems to be a scikit-learn usage error.
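The mismatch can be reproduced with plain numpy; one plausible reading of the shapes is a design matrix built at prediction time with more columns (1030) than the coefficient vector fitted at training time (1000):

```python
import numpy as np

X = np.zeros((1012, 1030))  # feature matrix as built at prediction time
coef = np.zeros(1000)       # coefficients fitted on a 1000-feature training matrix
try:
    X.dot(coef)
except ValueError as e:
    print(e)  # shapes (1012,1030) and (1000,) not aligned
```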
Typing of columns is not optimized for the moment.
Avoid using diff and cat.
Compare files by parsing all the lines and allowing a 1e-10 error in the numerical columns of the pandas JSON output, for example.
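A sketch of such a tolerant comparison (the helper name is hypothetical):

```python
import io

import numpy as np
import pandas as pd

def same_outputs(json_a, json_b, tol=1e-10):
    # parse both JSON outputs and compare them column by column,
    # allowing a small absolute error on the numerical columns
    df_a = pd.read_json(io.StringIO(json_a))
    df_b = pd.read_json(io.StringIO(json_b))
    if list(df_a.columns) != list(df_b.columns) or df_a.shape != df_b.shape:
        return False
    for col in df_a.columns:
        if np.issubdtype(df_a[col].dtype, np.number):
            if not np.allclose(df_a[col], df_b[col], rtol=0.0, atol=tol):
                return False
        elif not df_a[col].equals(df_b[col]):
            return False
    return True
```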
In order to analyze issues accurately, it is necessary to have the exact version of each component used.
These versions should be stored in the model (along with the model training date, etc.).
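A sketch of the metadata that could be stored alongside a trained model (field names are assumptions):

```python
import datetime
import platform

import numpy as np
import pandas as pd

# hypothetical metadata block to be stored inside a trained model
model_metadata = {
    "training_date": datetime.datetime.now().isoformat(),
    "python_version": platform.python_version(),
    "numpy_version": np.__version__,
    "pandas_version": pd.__version__,
}
```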
This is not an issue with PyAF.
However, in case anyone else encounters read_csv() failures on macOS when running the examples, refer to
https://bugs.python.org/issue28150
or
In the API-generated plots, the forecast line color is not always the same. The same applies for other lines colors.
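One way to make the colors deterministic is a fixed component-to-color mapping (the component names and colors here are illustrative, not the current pyaf palette):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

# hypothetical fixed palette: one stable color per plotted component
PALETTE = {"Signal": "black", "Forecast": "blue", "Residue": "red"}

fig, ax = plt.subplots()
for name, values in [("Signal", [1, 2, 3]), ("Forecast", [1.1, 1.9, 3.2])]:
    ax.plot(values, color=PALETTE[name], label=name)
ax.legend()
```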
At least check the possibility of using pyaf in this context.
pyaf is not aware of the data source type (time series database or web service, etc) as long as the dataset is stored in a pandas dataframe.
Is there a link with hierarchical models ?
A jupyter notebook is welcome with a real example.
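For example, a payload fetched from any source ends up as an ordinary dataframe (the sample values are taken from the GOOG log above):

```python
import io

import pandas as pd

# a CSV payload as it might come from a web service or a time series
# database; pyaf only ever sees the resulting dataframe
payload = "Date,Signal\n2016-07-08,705.63\n2016-07-11,715.09\n"
df = pd.read_csv(io.StringIO(payload), parse_dates=["Date"])
```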
instead of printing text messages on the standard output !!!
At least for debugging purposes.
Time should be a time (np.dtype is a date/time or numeric).
Signal should be numeric.
Issue error messages if the time/signal column is not found in the training dataset or does not have the correct type.
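A sketch of such a validation helper (the function name and messages are hypothetical):

```python
import numpy as np
import pandas as pd

def check_training_dataset(df, time_col, signal_col):
    # raise explicit errors early instead of failing later in training
    for col in (time_col, signal_col):
        if col not in df.columns:
            raise ValueError("Column '%s' not found in the training dataset" % col)
    lTimeType = df[time_col].dtype
    if not (np.issubdtype(lTimeType, np.datetime64)
            or np.issubdtype(lTimeType, np.number)):
        raise ValueError("Time column '%s' must be a date or numeric" % time_col)
    if not np.issubdtype(df[signal_col].dtype, np.number):
        raise ValueError("Signal column '%s' must be numeric" % signal_col)
```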
It may be interesting to test some neural network models among the competing models.
Use scikit-learn SVR models (the same way we already use Ridge linear regression for AR/ARX models).
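A sketch of what plugging SVR into an AR-style setup could look like (the lag construction and names are illustrative, not pyaf internals):

```python
import numpy as np
from sklearn.svm import SVR

# predict x_t from the p previous values, with SVR in place of Ridge
p = 3
x = np.sin(np.arange(100) / 5.0)
X = np.column_stack([x[i:len(x) - p + i] for i in range(p)])  # lagged features
y = x[p:]
lModel = SVR(kernel="rbf").fit(X, y)
lForecast = lModel.predict(X[-1:])  # one-step-ahead prediction
```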
Need to test PyAF on all the datasets given in this package.
package:
https://cran.r-project.org/web/packages/expsmooth/index.html
datasets :
ausgdp dji gasprice partx utility bonds enplanements hospital ukcars vehicles canadagas fmsales jewelry unemp.cci visitors carparts freight mcopper usgdp xrates djiclose frexport msales usnetelec
To make error handling easier, PyAF API calls should all raise the same kind of exception (PyaF_Error) or an inherited form.
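A minimal sketch of such an exception hierarchy (only PyaF_Error is named in the issue; the subclass is hypothetical):

```python
class PyaF_Error(Exception):
    """Base class for all exceptions raised by PyAF API calls."""

class PyaF_TrainingError(PyaF_Error):
    """Hypothetical subclass for failures during train()."""
```

Callers can then catch PyaF_Error once instead of a mix of ValueError, KeyError, etc.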
This approach is still missing. Generate new columns with MO prefix based on base forecasts for all hierarchy levels.
According to
https://www.otexts.org/fpp/2/5
MAPE/SMAPE are not reliable performance measures for model selection.
The authors recommend MASE (mean absolute scaled error):
MASE = mean(|q_j|), where q_j is the scaled error.
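A minimal numpy sketch of MASE, using the textbook scaling by the in-sample error of the one-step naive forecast (the scaling choice follows the reference above, not pyaf code):

```python
import numpy as np

def mase(y_train, y_true, y_pred):
    # scale = in-sample MAE of the one-step naive forecast
    scale = np.mean(np.abs(np.diff(y_train)))
    q = (np.asarray(y_true) - np.asarray(y_pred)) / scale  # scaled errors q_j
    return np.mean(np.abs(q))
```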
Need to run a benchmarking process to review the current state of PyAF.
As a first step, we will see this as a sanity check (correct some bugs here and there ;).
As a second step, a report will be generated with performance figures.
Following #12, only Keras with the Theano backend was tested on a 24-core machine (HP Z600).
Try to configure TensorFlow (with or without an NVidia GPU) on this machine and perform the same tests.
After updating #6, the following test fails:
python3 tests/artificial/transf_log/trend_constant/cycle_12/ar_12/test_artificial_1024_log_constant_12_12_100.py
The code, as it is today, sometimes uses RMSE and sometimes MAPE.
There is an option,
self.mModelSelection_Criterion = "L2";
that should be used for model selection everywhere. By the way, it should be set to "MAPE" by default.
This is a very serious bug.
Some signal transformations need nothing; others need the signal to be positive (log = boxcox(0)).
Some perform a scale/translation to achieve invariance; this needs to be applicable to all transformations (new option).
Following #12, MLP and LSTM Keras models with the Theano backend were tested on a 24-core machine (HP Z600).
These models may need some tuning (RNN architecture improvement).
Some validation on artificial datasets is also needed.
Sometimes the signal is too small/easy to forecast. PyAF fails when the signal has only one row !!!
The goal here is to make pyaf as robust as possible against very small/bad datasets
PyAF should automatically produce reasonable/naive/trivial models in these cases.
It should not fail in any case (normal behavior expected, useful for M2M context)
PyAF has an API call, lEngine.standardPlots(). It gives some classical plots (signal against forecast, residues, trends, cycles, AR).
All the plots are generated with matplotlib
Document the plots generated.
The REST service (issue #20) also gives the same plots in png/base64 encoding; this is to be documented as well.
Plots that are generated for hierarchical models are too elementary.
Add more significant annotation for all the hierarchy nodes:
Need a document to describe the algorithmic aspects of time series forecasting in PyAF.
Need to review the software aspects of performance.
Heroku free dyno "only" has 512MB and low cpu.
A lot of pandas dataframes are created internally at each step of the signal decomposition process.
Try to get rid of unnecessary dataframes.
Some memory profiling is also welcome.
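For example, the stdlib tracemalloc module can bound the memory cost of one decomposition step (a sketch; the dataframe step shown is illustrative, not pyaf internals):

```python
import tracemalloc

import pandas as pd

tracemalloc.start()
# simulate one intermediate step of the decomposition pipeline
df = pd.DataFrame({"Signal": range(100000)})
df["Signal_diff"] = df["Signal"].diff()  # each step allocates new columns/frames
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
print("current=%d bytes, peak=%d bytes" % (current, peak))
```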
download_all_stock_prices.py:
import pyaf.Bench.TS_datasets as tsds
tsds.download_yahoo_stock_prices()
PyAF protects these indicators against zero values in the signal :
def protect_small_values(self, signal, estimator):
    eps = 1.0e-13
    keepThis = (np.abs(signal) > eps)
    signal1 = signal[keepThis]
    estimator1 = estimator[keepThis]
    # self.dump_perf_data(signal, signal1)
    return (signal1, estimator1)
This is not necessary.
An approximation is better :
rel_error = abs(estimator - signal) / (abs(signal) + eps)
MAPE = mean(rel_error)
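A sketch of the suggested approximation (the helper name is hypothetical):

```python
import numpy as np

def mape_with_eps(signal, estimator, eps=1.0e-13):
    # epsilon in the denominator replaces the row-dropping logic
    signal = np.asarray(signal, dtype=float)
    estimator = np.asarray(estimator, dtype=float)
    rel_error = np.abs(estimator - signal) / (np.abs(signal) + eps)
    return np.mean(rel_error)
```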
We copy the dataset and use a loop over the whole dataset which is too baaaaad :
'''
def specific_invert(self, df):
    df_orig = df.copy()
    df_orig.iloc[0] = df.iloc[0]
    for i in range(1, df.shape[0]):
        df_orig.iloc[i] = df.iloc[i] - df.iloc[i - 1]
    return df_orig
'''
There is room for improvement.
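For example, the loop can be replaced by a vectorized first difference (a sketch assuming the loop semantics above: first row kept, every other row differenced):

```python
import pandas as pd

def specific_invert_vectorized(df):
    # same result as the loop, without copying row by row
    df_orig = df.diff()
    df_orig.iloc[0] = df.iloc[0]
    return df_orig
```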
A second group of failures. The following warning is raised :
logs/test_artificial_1024_cumsum_constant_5__20.log:UserWarning: Singular matrix in solving dual problem. Using least-squares solution instead.
sample script :
tests/artificial/transf_cumsum/trend_constant/cycle_5/ar_/test_artificial_1024_cumsum_constant_5__20.py
At least a new feature to be exposed.
Following #12, only Keras with the Theano backend was tested on a 24-core machine (HP Z600).
Try to configure an NVidia gpu on this machine and perform the same tests.
According to :
https://en.wikipedia.org/wiki/Symmetric_mean_absolute_percentage_error
A limitation to SMAPE is that if the actual value or forecast value is 0, the value of error will boom up to the upper-limit of error. (200% for the first formula and 100% for the second formula).
Provided the data are strictly positive, a better measure of relative accuracy can be obtained based on the log of the accuracy ratio: log(Ft / At) This measure is easier to analyse statistically, and has valuable symmetry and unbiasedness properties. When used in constructing forecasting models the resulting prediction corresponds to the geometric mean (Tofallis, 2015).
A logistic transformation is more appropriate when the signal is a proportion.
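The log accuracy ratio measure quoted above can be sketched as follows (the function name is hypothetical; it requires strictly positive data):

```python
import numpy as np

def mean_abs_log_accuracy_ratio(actual, forecast):
    # q = log(F_t / A_t); defined only for strictly positive values
    q = np.log(np.asarray(forecast, dtype=float) / np.asarray(actual, dtype=float))
    return np.mean(np.abs(q))
```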
Need a RESTful API for PyAF and a Heroku demo.