vvoutilainen / mtsdatamodel

MTSDataModel is a Python program to store and manipulate economic multivariate time series data.

License: MIT License

Python 100.00%

MTSDataModel

MTSDataModel is a Python class that stores and manipulates economic multivariate time series data. Essentially, it is a wrapper around the pandas library: the core of an MTSDataModel object is a pandas data frame that stores the data. Several data manipulation operations can be performed on this data frame.

Installation

The conda environment for MTSDataModel is created according to the instructions in NoobQuant condaenv. The specific installation command for the mts environment is:

mamba create --name mts anaconda python=3.6.7 numpy=1.15.2 numpy-base=1.15.2 tzlocal=2.0.0 pandas=0.24.1 seaborn=0.11.0 rpy2==2.9.4 r=3.6.0 r-base=3.6.0 r-essentials=3.6.0 r-tidyverse=1.2.1 rtools=3.4.0 r-rjsdmx=2.1_0 r-seasonal=1.7.0 rstudio=1.1.456 r-wavelets=0.3_0.1 r-xlconnect=1.0.3

Loading data into data model

Data is read in from a .csv file. This file needs to be in long format and contain the following columns:

date column (string)
level 1 name column for variable/feature, e.g. GDP (string)
level 2 name column for entity, e.g. country (string)
value column (numeric).

The input data is assumed to satisfy the following:

each individual time series contains no breaks
NAs may be present at the start and end of an individual time series.

MTSDataModel initializes to a data frame with two-level multi-index columns. The first level represents variable names. The second level represents the entity (e.g. country, company etc.).
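As a rough illustration of the layout described above, the following sketch reads a small long-format table and reshapes it into a data frame with two-level columns. The column names date, variable, entity and value and the sample values are assumptions made for this example; this is not the loading code of MTSDataModel itself.

```python
import io

import pandas as pd

# Hypothetical long-format input; column names and values are made up.
csv_text = """date,variable,entity,value
2000-03-31,GDP,DEU,100.0
2000-06-30,GDP,DEU,101.5
2000-03-31,GDP,FRA,80.0
2000-06-30,GDP,FRA,80.9
2000-03-31,CPI,DEU,95.0
2000-06-30,CPI,DEU,95.4
"""

long_df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Move variable and entity from rows into a two-level column index.
# unstack raises if a (date, variable, entity) combination is duplicated.
wide_df = (
    long_df.set_index(["date", "variable", "entity"])["value"]
    .unstack(["variable", "entity"])
    .sort_index(axis=1)
)

# Level 0 holds variable names, level 1 holds entities.
gdp_deu = wide_df[("GDP", "DEU")]
```

Using unstack rather than an aggregating pivot means duplicated observations fail loudly instead of being averaged silently.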

Data manipulations

Several data manipulation operations can be performed on the initialized data frame within an MTSDataModel object. These operations can be performed on different variables and/or entities.

Variables and entities selection

Data manipulations are performed via class methods. To do this, level 1 names (variables) must be specified and passed to the methods via the list variables. For level 2 names (entities), the following rules apply:

Implicit entities selection: when the list entities is left unspecified, methods operate only on entities for which all input variables are present.
Explicit entities selection: when the list entities is explicitly specified, methods operate only on the specified variable/entity pairs (the cross-product of the two input lists). If some variable/entity pair is not present in the data, an error is thrown.
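The two selection rules above can be sketched as a small helper. The function name resolve_entities and the sample frame are hypothetical and only illustrate the rules; the actual methods of MTSDataModel may implement them differently.

```python
import pandas as pd


def resolve_entities(df, variables, entities=None):
    """Return the entities to operate on (hypothetical helper, not the library's API)."""
    pairs = set(df.columns)  # (variable, entity) tuples present in the data
    if entities is None:
        # Implicit selection: keep entities that have every requested variable.
        all_entities = df.columns.get_level_values(1).unique()
        return [e for e in all_entities if all((v, e) in pairs for v in variables)]
    # Explicit selection: every pair in the cross-product must exist.
    missing = [(v, e) for v in variables for e in entities if (v, e) not in pairs]
    if missing:
        raise KeyError(f"Missing variable/entity pairs: {missing}")
    return list(entities)


columns = pd.MultiIndex.from_tuples(
    [("GDP", "DEU"), ("GDP", "FRA"), ("CPI", "DEU")], names=["variable", "entity"]
)
df = pd.DataFrame([[1.0, 2.0, 3.0]], columns=columns)

implicit = resolve_entities(df, ["GDP", "CPI"])           # only DEU has both variables
explicit = resolve_entities(df, ["GDP"], ["DEU", "FRA"])  # both pairs exist
```

Requesting CPI for FRA explicitly would raise a KeyError here, because that pair is not in the data.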

Data pre-processing

Methods available for pre-processing of data are

DeflateVariables()
DetrendVariables()

Feature engineering

Methods available for feature engineering are

MRADecomposition()
SumVariables()
ReduceVariableDimension()

mtsdatamodel's Issues

Add HP filter

Statsmodels has a two-sided HP filter implemented: http://www.statsmodels.org/devel/generated/statsmodels.tsa.filters.hp_filter.hpfilter.html

Implement a method for this. A one-sided filter can be obtained by applying the filter to an expanding sample.
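A minimal sketch of the expanding-sample idea follows. To keep it self-contained it solves the two-sided HP problem directly with NumPy instead of calling statsmodels; the function names, the min_periods parameter and the test series are assumptions for illustration. In an actual implementation, the trend returned by the statsmodels hpfilter function could take the place of hp_trend.

```python
import numpy as np


def hp_trend(y, lamb=1600.0):
    """Two-sided HP trend: minimise sum((y - t)**2) + lamb * sum(second_diff(t)**2)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Second-difference operator with shape (n - 2, n).
    d = np.diff(np.eye(n), n=2, axis=0)
    # First-order condition of the minimisation: (I + lamb * D'D) t = y.
    return np.linalg.solve(np.eye(n) + lamb * d.T @ d, y)


def one_sided_hp_trend(y, lamb=1600.0, min_periods=8):
    """At each date, filter only the data up to that date and keep the last trend value."""
    y = np.asarray(y, dtype=float)
    out = np.full(len(y), np.nan)
    for t in range(min_periods - 1, len(y)):
        out[t] = hp_trend(y[: t + 1], lamb)[-1]
    return out


# A linear series has zero second differences, so the HP trend reproduces it.
series = np.linspace(1.0, 20.0, 20)
trend = one_sided_hp_trend(series, lamb=1600.0, min_periods=8)
```

The first min_periods - 1 values are left as NaN because the expanding sample is too short there. The dense solve is fine for short macro series but would need a sparse or banded solver for long ones.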

Bug with default entities selection when variable not calculated for certain entity

Update 20190521
Possible fix in SumVariables(): added 5 lines that, when no entities are selected, get the entities for which all given variables exist. Seems to work.

This fix needs to be applied to the other methods that have the same problem! A similar approach can be used for them if we construct a separate loop over variables at the start. The entity selection should be wrapped in a separate static method and applied to all of them.

Init post
Default entity selection

if entities == None:
    entities = list(np.unique(self.df.columns.get_level_values(1).values))

is not working for (at least) the following methods:

SumVariables()
DetrendVariables()
MRADecomposition()
ReduceVariableDimension()
Example case: MRADecomposition() is performed on entity DEU:

variables = ['Credit_def_ld1','ResidentialPrices_ld1','StockPrices_def_ld1']
do.MRADecomposition(variables,entities=['DEU'],levels = 6,expanding=False)

Later we try to perform SumVariables() on default entities:
variables = {'L45':['StockPrices_def_ld1_wl4','StockPrices_def_ld1_wl5','Credit_def_ld1_wl4','Credit_def_ld1_wl5','ResidentialPrices_ld1_wl4','ResidentialPrices_ld1_wl5']}
do.SumVariables(variables)

For entities that were not included in MRADecomposition(), SumVariables() results in a variable with value 0 throughout, although it should throw an error.
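The issue does not say where the zeros come from. One pandas behaviour that would produce exactly this symptom, shown here purely as an assumption about the cause, is that summing columns made up entirely of NaN returns 0.0 by default. The variable and entity names below are made up.

```python
import pandas as pd

columns = pd.MultiIndex.from_tuples([("x_wl4", "DEU"), ("x_wl5", "DEU")])
df = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], columns=columns)

# FRA was never decomposed, so its columns do not exist; reindex fills them with NaN.
wanted = pd.MultiIndex.from_tuples([("x_wl4", "FRA"), ("x_wl5", "FRA")])
fra = df.reindex(columns=wanted)

silent_zeros = fra.sum(axis=1)          # NaNs are skipped, so every row sums to 0.0
flagged = fra.sum(axis=1, min_count=1)  # NaN unless at least one value is present
```

Passing min_count=1 only makes the problem visible as NaN. Raising an explicit error for missing variable/entity pairs, as the issue asks, is still the stricter option.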

Merging to data frame with no columns

In ExpandingSampleCalc(), in the first iteration of the counter loop, the line

resultframefull = pd.merge(resultframefull, crtresultframe, left_index = True, right_index = True, how = 'left')

causes a warning because the columns in resultframefull are not a multi-index. Functionality seems to be correct despite the warning.

Before the merge, resultframefull is an empty frame with just an index and no columns. After the merge we currently force the columns to a multi-index, so the warning disappears in the remaining iterations, but this should be corrected so that there is no warning in the first iteration either.

Use this example to reproduce: how to get multi-index columns into gg when there is no data in it?
df = do.ReturnDf()
gg = pd.DataFrame(index=df.index)
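One possible answer, sketched under assumptions, is to build the empty frame with an empty two-level MultiIndex as its columns, so that both sides of the merge have the same number of column levels from the first iteration on. The index, level names and values below are made up, and do.ReturnDf() is replaced by a stand-in index.

```python
import pandas as pd

# Stand-in for df.index from do.ReturnDf().
index = pd.date_range("2000-01-01", periods=3, freq="D")

# Empty frame whose columns are already a two-level MultiIndex.
empty_columns = pd.MultiIndex.from_arrays([[], []], names=["variable", "entity"])
gg = pd.DataFrame(index=index, columns=empty_columns)

# A result frame shaped like the ones produced inside the loop (made-up values).
result_columns = pd.MultiIndex.from_tuples(
    [("GDP", "DEU")], names=["variable", "entity"]
)
crtresultframe = pd.DataFrame(
    [[1.0], [2.0], [3.0]], index=index, columns=result_columns
)

# Both frames now have two column levels, so the level counts match.
merged = pd.merge(gg, crtresultframe, left_index=True, right_index=True, how="left")
```

An alternative design would be to collect the per-iteration frames in a list and combine them once with pd.concat(axis=1), which avoids the empty starting frame altogether.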

Add checks to all methods

Add checks similar to those in SumVariables() to the other methods as well.
Related to #1.

It is better to throw an explicit error when something does not work the way the user inputs it.

Crisis dummies data type should be fixed

It seems that endogenous dummy variables are treated as integers. However, the vulnerability horizon treatment needs to introduce NaNs, which turns these columns into floats. We could probably use the new nullable integer type (https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html), but in this context plain floats are probably enough. This should nevertheless be defined properly: all numerical columns in MTSDataModel should be defined as floats.
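The two options mentioned above can be compared on a toy dummy series. The series, the mask and the variable names are hypothetical and only illustrate how each dtype behaves once missing values are introduced.

```python
import pandas as pd

crisis = pd.Series([0, 0, 1, 1, 0], name="crisis_dummy")   # stored as int64
# Hypothetical mask marking observations that the horizon treatment blanks out.
blank_out = pd.Series([False, True, False, False, True])

# Option 1: cast to float up front; masked rows become NaN.
as_float = crisis.astype("float64").mask(blank_out)

# Option 2: nullable integer type; masked rows become <NA> and the dtype stays Int64.
as_nullable = crisis.astype("Int64").mask(blank_out)
```

Casting every numerical column to float64 when the data is loaded would keep the dtype stable regardless of whether NaNs are introduced later, which matches the preference stated in the issue.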

SumVariables() differs from other methods

It does not really make sense that SumVariables() differs from the other methods in how it takes in variables. The idea was to let the dict key designate the name of the resulting new variable, with the variables themselves passed in as a list in the dict value. To match the other methods, it makes more sense to pass in the two lists variables and entities, plus a mandatory third input that designates the name, similar to ReduceVariableDimension.

This prohibits summing multiple variable combinations in one go, but the other methods work this way as well.
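The proposed call shape could look roughly like the sketch below. It is written as a free function on a plain data frame; the function name, parameter names and sample data are assumptions, not the existing SumVariables() implementation.

```python
import pandas as pd


def sum_variables(df, variables, entities, name):
    """Sum `variables` per entity into a new level 1 variable `name` (hypothetical signature)."""
    out = df.copy()
    for entity in entities:
        cols = [(v, entity) for v in variables]
        # Fail loudly instead of producing a misleading result for absent pairs.
        missing = [c for c in cols if c not in df.columns]
        if missing:
            raise KeyError(f"Missing variable/entity pairs: {missing}")
        out[(name, entity)] = df[cols].sum(axis=1)
    return out


columns = pd.MultiIndex.from_tuples(
    [("a", "DEU"), ("b", "DEU"), ("a", "FRA")], names=["variable", "entity"]
)
df = pd.DataFrame([[1.0, 2.0, 10.0], [3.0, 4.0, 20.0]], columns=columns)

# One call produces one new variable; several combinations need several calls.
result = sum_variables(df, ["a", "b"], ["DEU"], name="ab")
```

Calling it with entities=["FRA"] would raise a KeyError here, since variable b does not exist for FRA.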

Once this is done, do #1

Unit tests for HPFiltering

Need to check how well HP filtering works!

Vulnerability horizon treatment

The data model needs a method to reduce the sample to a given vulnerability horizon.

Questions:

Where should this method go? Under class MTSModel? It does not really belong to that particular class; the predictive modelling parts are not just data handling anymore.