loelschlaeger / fhmm

Hidden Markov models for finance

Home Page: https://loelschlaeger.de/fHMM/

License: GNU General Public License v3.0

C++ 1.39% R 98.61%

hidden-markov-models finance rstats

fhmm's Introduction

HMMs for Finance

The {fHMM} R package allows for the detection and characterization of financial market regimes in time series data by applying hidden Markov Models (HMMs). The vignettes outline the package functionality and the model formulation. For a reference on the method, see

Oelschläger, L., and T. Adam. 2021. “Detecting Bearish and Bullish Markets in Financial Time Series Using Hierarchical Hidden Markov Models.” Statistical Modelling. https://doi.org/10.1177/1471082X211034048

Below, we illustrate an application to the German stock index DAX. We also show how to use the package to simulate HMM data, compute the model likelihood, and decode the hidden states using the Viterbi algorithm.

Installation

You can install the released package version from CRAN with:

install.packages("fHMM")

Contributing

We are open to contributions and would appreciate your input:

If you encounter any issues, please submit bug reports as issues.
If you have any ideas for new features, please submit them as feature requests.
If you would like to add extensions to the package, please fork the master branch and submit a merge request.

Example: Fitting an HMM to the DAX

We fit a 3-state HMM with state-dependent t-distributions to the DAX log-returns from 2000 to 2022. The states can be interpreted as proxies for bearish (green below) and bullish markets (red) and an “in-between” market state (yellow).

library("fHMM")

The package has a built-in function to download financial data from Yahoo Finance:

dax <- download_data(symbol = "^GDAXI")

We first need to define the model:

controls <- set_controls(
  states      = 3,
  sdds        = "t",
  file        = dax,
  date_column = "Date",
  data_column = "Close",
  logreturns  = TRUE,
  from        = "2000-01-01",
  to          = "2022-12-31"
)

The function prepare_data() then prepares the data for estimation:

data <- prepare_data(controls)
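For intuition, the log-return transformation requested via logreturns = TRUE can be sketched in base R. This is a minimal illustration on made-up closing prices, not the package's internal code; the vector `close` and its values are hypothetical.

```r
# Hypothetical closing prices (illustrative values, not DAX data)
close <- c(100, 102, 101, 105)

# Log-return at time t: log(close[t]) - log(close[t - 1])
log_returns <- diff(log(close))

# Differencing loses one observation
length(log_returns)
```

Log-returns are additive over time, which is one reason they are commonly preferred over simple returns when modeling price series.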

The summary() method gives an overview:

summary(data)
#> Summary of fHMM empirical data
#> * number of observations: 5882 
#> * data source: data.frame 
#> * date column: Date 
#> * log returns: TRUE

We fit the model and subsequently decode the hidden states and compute (pseudo-) residuals:

model <- fit_model(data)
model <- decode_states(model)
model <- compute_residuals(model)
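State decoding relies on the Viterbi algorithm, which finds the most likely state sequence given the observations. The following base R sketch shows the standard log-space recursion; it is not the code used by decode_states(). The function name `viterbi_sketch`, the normal state-dependent densities, and all parameter values are assumptions made for illustration.

```r
# log_dens: T x N matrix of log state-dependent densities
# Gamma:    N x N transition probability matrix
# delta:    initial state distribution
viterbi_sketch <- function(log_dens, Gamma, delta) {
  n_obs <- nrow(log_dens)
  n_states <- ncol(log_dens)
  xi <- matrix(-Inf, n_obs, n_states)   # best log-probability ending in each state
  back <- matrix(0L, n_obs, n_states)   # back-pointers to the best predecessor
  xi[1, ] <- log(delta) + log_dens[1, ]
  for (i in seq_len(n_obs)[-1]) {
    for (j in seq_len(n_states)) {
      cand <- xi[i - 1, ] + log(Gamma[, j])
      back[i, j] <- which.max(cand)
      xi[i, j] <- max(cand) + log_dens[i, j]
    }
  }
  # Backtrack from the most likely final state
  states <- integer(n_obs)
  states[n_obs] <- which.max(xi[n_obs, ])
  for (i in rev(seq_len(n_obs - 1))) states[i] <- back[i + 1, states[i + 1]]
  states
}

# Toy example: two well-separated normal states (hypothetical values)
x <- c(-2, -2.1, -1.9, 2, 2.2, 1.8)
log_dens <- cbind(dnorm(x, -2, 1, log = TRUE), dnorm(x, 2, 1, log = TRUE))
Gamma <- matrix(c(0.9, 0.1, 0.1, 0.9), 2, 2)
decoded <- viterbi_sketch(log_dens, Gamma, delta = c(0.5, 0.5))
```

Working with log-probabilities avoids numerical underflow, which would otherwise occur quickly for series with thousands of observations.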

The summary() method gives an overview of the model fit:

summary(model)
#> Summary of fHMM model
#> 
#>   simulated hierarchy       LL       AIC       BIC
#> 1     FALSE     FALSE 17650.02 -35270.05 -35169.85
#> 
#> State-dependent distributions:
#> t() 
#> 
#> Estimates:
#>                   lb   estimate        ub
#> Gamma_2.1  2.747e-03  5.024e-03 9.133e-03
#> Gamma_3.1  2.080e-13  2.060e-13 2.029e-13
#> Gamma_1.2  1.006e-02  1.839e-02 3.337e-02
#> Gamma_3.2  1.516e-02  2.446e-02 3.924e-02
#> Gamma_1.3  2.250e-11  2.232e-11 2.198e-11
#> Gamma_2.3  1.195e-02  1.898e-02 2.995e-02
#> mu_1      -3.862e-03 -1.793e-03 2.754e-04
#> mu_2      -7.994e-04 -2.649e-04 2.696e-04
#> mu_3       9.642e-04  1.272e-03 1.579e-03
#> sigma_1    2.354e-02  2.586e-02 2.840e-02
#> sigma_2    1.226e-02  1.300e-02 1.380e-02
#> sigma_3    5.390e-03  5.833e-03 6.312e-03
#> df_1       5.551e+00  1.084e+01 2.115e+01
#> df_2       6.814e+00  4.866e+01 3.475e+02
#> df_3       3.973e+00  5.248e+00 6.934e+00
#> 
#> States:
#> decoded
#>    1    2    3 
#>  704 2926 2252 
#> 
#> Residuals:
#>      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
#> -3.517897 -0.664017  0.012171 -0.003261  0.673180  3.693577
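The information criteria in the summary can be reproduced from the printed log-likelihood. The sketch below assumes the usual definitions AIC = -2 LL + 2 k and BIC = -2 LL + k log(n), with k = 15 estimated parameters (the 15 rows of the estimates table) and n = 5882 observations. Because the printed LL is rounded, the results agree with the summary only up to rounding.

```r
ll    <- 17650.02   # log-likelihood as printed in the summary (rounded)
n_obs <- 5882       # number of observations
n_par <- 15         # 6 transition probabilities, 3 mu, 3 sigma, 3 df

aic <- -2 * ll + 2 * n_par
bic <- -2 * ll + log(n_obs) * n_par
round(c(AIC = aic, BIC = bic), 2)
```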

Having estimated the model, we can visualize the state-dependent distributions and the decoded time series:

events <- fHMM_events(
  list(dates = c("2001-09-11", "2008-09-15", "2020-01-27"),
       labels = c("9/11 terrorist attack", "Bankruptcy Lehman Brothers", "First COVID-19 case Germany"))
)
plot(model, plot_type = c("sdds","ts"), events = events)

The (pseudo-) residuals help to evaluate the model fit:

plot(model, plot_type = "pr")

Simulating HMM data

The {fHMM} package supports data simulation from an HMM and access to the model likelihood function for model fitting and the Viterbi algorithm for state decoding.

As an example, we consider a 2-state HMM with state-dependent Gamma distributions and a time horizon of 1000 data points.

controls <- set_controls(
  states  = 2,
  sdds    = "gamma",
  horizon = 1000
)

Define the model parameters via the fHMM_parameters() function (unspecified parameters would be set at random).

par <- fHMM_parameters(
  controls = controls,
  Gamma    = matrix(c(0.95, 0.05, 0.05, 0.95), 2, 2), 
  mu       = c(1, 3), 
  sigma    = c(1, 3)
)

Simulate data points from this model via the simulate_hmm() function.

sim <- simulate_hmm(
  controls        = controls,
  true_parameters = par
)
plot(sim$data, col = sim$markov_chain, type = "b")
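To make the data-generating process concrete, here is a base R sketch of simulating from a 2-state HMM with the same parameter values. It is not what simulate_hmm() runs internally. Two assumptions are made: the chain starts from the stationary distribution (here c(0.5, 0.5)), and the Gamma distribution is parameterized by its mean mu and standard deviation sigma, which corresponds to shape = mu^2 / sigma^2 and rate = mu / sigma^2.

```r
set.seed(1)
n <- 1000
Gamma <- matrix(c(0.95, 0.05, 0.05, 0.95), 2, 2)
mu <- c(1, 3)
sigma <- c(1, 3)

# Simulate the hidden Markov chain
states <- integer(n)
states[1] <- sample(1:2, 1, prob = c(0.5, 0.5))   # assumed stationary start
for (i in 2:n) states[i] <- sample(1:2, 1, prob = Gamma[states[i - 1], ])

# Draw Gamma observations with state-dependent mean and sd (assumed parameterization)
obs <- rgamma(n,
              shape = mu[states]^2 / sigma[states]^2,
              rate  = mu[states] / sigma[states]^2)
```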

The log-likelihood function ll_hmm() is evaluated at the identified and unconstrained parameter values, which can be derived via the par2parUncon() function.

(parUncon <- par2parUncon(par, controls))
#> gammasUncon_21 gammasUncon_12      muUncon_1      muUncon_2   sigmaUncon_1 
#>      -2.944439      -2.944439       0.000000       1.098612       0.000000 
#>   sigmaUncon_2 
#>       1.098612 
#> attr(,"class")
#> [1] "parUncon" "numeric"

Note that this transformation takes care of the restrictions that Gamma must be a transition probability matrix (which we can ensure via the logit link) and that mu and sigma must be positive (an assumption for the Gamma distribution, which we can ensure via the exponential link).
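The printed values of parUncon can be reproduced by hand in base R. This sketch covers only the 2-state case, where each row of Gamma has a single free off-diagonal entry and a plain logit suffices; it does not replicate par2parUncon() in general. The name `par_uncon` is illustrative.

```r
Gamma <- matrix(c(0.95, 0.05, 0.05, 0.95), 2, 2)
mu <- c(1, 3)
sigma <- c(1, 3)

par_uncon <- c(
  qlogis(Gamma[2, 1]),   # logit of the transition probability 2 -> 1
  qlogis(Gamma[1, 2]),   # logit of the transition probability 1 -> 2
  log(mu),               # inverse of the exponential link
  log(sigma)
)
round(par_uncon, 6)
```

Optimizing over these unconstrained values lets a general-purpose optimizer search freely while every candidate still maps back to valid probabilities and positive parameters.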

ll_hmm(parUncon, sim$data, controls)
#> [1] -1620.515
ll_hmm(parUncon, sim$data, controls, negative = TRUE)
#> [1] 1620.515
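HMM likelihoods are typically computed with the forward algorithm. The following base R sketch implements the scaled forward recursion for Gamma state-dependent distributions. It is not the implementation behind ll_hmm(); it assumes a stationary initial distribution and the mean/sd parameterization of the Gamma distribution, and the function name `forward_loglik` is hypothetical.

```r
forward_loglik <- function(x, Gamma, mu, sigma) {
  n_states <- nrow(Gamma)
  # Stationary distribution: solves delta %*% Gamma = delta with sum(delta) = 1
  delta <- solve(t(diag(n_states) - Gamma + 1), rep(1, n_states))
  # State-dependent Gamma densities with mean mu and sd sigma (assumed)
  dens <- function(xi) dgamma(xi, shape = mu^2 / sigma^2, rate = mu / sigma^2)

  alpha <- delta * dens(x[1])
  ll <- log(sum(alpha))
  alpha <- alpha / sum(alpha)            # rescale to avoid underflow
  for (i in seq_along(x)[-1]) {
    alpha <- as.vector(alpha %*% Gamma) * dens(x[i])
    ll <- ll + log(sum(alpha))
    alpha <- alpha / sum(alpha)
  }
  ll
}

Gamma <- matrix(c(0.95, 0.05, 0.05, 0.95), 2, 2)
forward_loglik(c(0.8, 2.5, 4.1), Gamma, mu = c(1, 3), sigma = c(1, 3))
```

The recursion costs on the order of T times N squared operations, compared with N to the power T for a naive sum over all state sequences.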

For maximum likelihood estimation of the model parameters, we can numerically optimize ll_hmm() over parUncon (or rather minimize the negative log-likelihood).

optimization <- nlm(
  f = ll_hmm, p = parUncon, observations = sim$data, controls = controls, negative = TRUE
)

(estimate <- optimization$estimate)
#> [1] -3.4633918 -3.4406564  0.0599985  1.0645290  0.1151781  1.0794625

To interpret the estimate, it needs to be back-transformed to the constrained parameter space via the parUncon2par() function. Note that the state labeling is not identified, so the estimated states may appear in a different order than the true ones.

class(estimate) <- "parUncon"
estimate <- parUncon2par(estimate, controls)

par$Gamma
#>         state_1 state_2
#> state_1    0.95    0.05
#> state_2    0.05    0.95
estimate$Gamma
#>            state_1    state_2
#> state_1 0.96895127 0.03104873
#> state_2 0.03037199 0.96962801

par$mu
#> muCon_1 muCon_2 
#>       1       3
estimate$mu
#>  muCon_1  muCon_2 
#> 1.061835 2.899473

par$sigma
#> sigmaCon_1 sigmaCon_2 
#>          1          3
estimate$sigma
#> sigmaCon_1 sigmaCon_2 
#>   1.122073   2.943097
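The back-transformation can also be checked by hand for this 2-state example: the inverse logit recovers the off-diagonal transition probabilities and the exponential recovers mu and sigma. This is a sketch for the 2-state case only, and the names `Gamma_hat`, `mu_hat`, and `sigma_hat` are illustrative.

```r
estimate_uncon <- c(-3.4633918, -3.4406564, 0.0599985,
                    1.0645290, 0.1151781, 1.0794625)

# Inverse logit gives c(Gamma[2, 1], Gamma[1, 2])
off_diag <- plogis(estimate_uncon[1:2])

# Fill column-wise; diagonal entries follow from rows summing to one
Gamma_hat <- matrix(c(1 - off_diag[2], off_diag[1],
                      off_diag[2],     1 - off_diag[1]), 2, 2)
mu_hat    <- exp(estimate_uncon[3:4])
sigma_hat <- exp(estimate_uncon[5:6])
```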

fhmm's People

Forkers

minghao2016 fratelino timoadam rouvenm mbalcilar albertlii manisahni

fhmm's Issues

Examples in .Rd-files

Add small executable examples to the main .Rd-files to illustrate the use of the exported functions and to enable automatic testing.

Document data structures

Document the following data structures in the README:

thetaUncon
thetaCon
thetaList
*Ordered

Make sim_par a thetaList object.

Incorporate covariates

Incorporate covariates into the state process(es) to determine which factors affect the probabilities of switching to bearish and bullish markets, respectively (just an idea, perhaps something for later versions of the package!).

Documentation issues

Set_controls() should come second, since it is important for prepare_data(). Clarify: An object of class RprobitB_controls or fHMM_controls?
prepare_data() should come third, because this is where the class fHMM_data is first introduced, which is needed for the plot that currently comes second. This could be confusing.
Decode_state: I would write "the most likely" state sequence. Rprobit_B is unclear.
fHMM_events(): it is not quite clear to me what exactly this function does. Does it check that the given events are suitable for being read in?
fHMM_parameters: A tpm of dimension controls$states[1]. I would spell out tpm. The difference between mu and mus_star is not clear.

Compute confidence intervals of estimates

Hessian checks (Timo)
integration into code (Lennart)
results to output file (Lennart)

Implement wrapper for full functionality of fHMM

Simulated values

I'm raising the issue again - I can't find the simulated data in the provided model files. Just the graphical representation. I've looked through the output files and I can't find it.

Data on coarse scale

Problem

Log-return averages on the coarse scale do not seem to be the best choice. It is hard for the code to detect different states and state switches in this type of data.

Idea

Include parameter in controls to select type of coarse scale data (e.g. sum of absolute values, mean, average of absolute values). Plot coarse-scale data in ts.pdf to see if this yields better data.
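The aggregation types mentioned in this idea could be sketched as follows. The helper `coarse_scale()`, its arguments, and the option name "sum_abs" are hypothetical and only illustrate the idea of chunk-wise aggregation of fine-scale log-returns.

```r
# Aggregate fine-scale observations x into coarse-scale values,
# using consecutive chunks of length chunk_size (hypothetical helper)
coarse_scale <- function(x, chunk_size, type = c("mean", "mean_abs", "sum_abs")) {
  type <- match.arg(type)
  chunk <- ceiling(seq_along(x) / chunk_size)   # chunk index of each observation
  agg <- switch(type,
    mean     = function(v) mean(v),
    mean_abs = function(v) mean(abs(v)),
    sum_abs  = function(v) sum(abs(v))
  )
  as.vector(tapply(x, chunk, agg))
}

x <- c(0.01, -0.01, 0.02, -0.02)
coarse_scale(x, chunk_size = 2, type = "mean_abs")
```

In this toy example the plain mean is zero in both chunks, whereas the absolute-value variants still distinguish the calm chunk from the volatile one, which is the motivation given in the issue.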

Fix `sprintf` on Linux

Gives output %18-s 3 / 2 instead of number of states: 3 / 2
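The literal output suggests that the flag and the field width are in the wrong order in the format string. Assuming the intent was a left-justified label in a field of width 18, the conversion specification would be "%-18s". The label and value below are taken from the issue; the rest is a sketch.

```r
# "%-18s" left-justifies the label in a field of width 18;
# the flag "-" has to come before the width
line <- sprintf("%-18s %s", "number of states:", "3 / 2")
cat(line, "\n")
```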

No writing to user's home filespace by default

Ensure that functions do not write by default in the user's home filespace (including the package directory and getwd()). This is not allowed by CRAN policies.

Calculation of Hessian

Use option hessian=FALSE in nlm and only hessian=TRUE in final estimation run. Should give speed improvement.
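A sketch of this two-stage strategy with stats::nlm(), using a toy normal likelihood in place of the HMM likelihood. The data, the objective `nll`, and the number of starting values are assumptions made for illustration.

```r
set.seed(1)
obs <- rnorm(200, mean = 1, sd = 2)

# Negative log-likelihood of a normal sample; p = c(mean, log(sd))
nll <- function(p, obs) -sum(dnorm(obs, mean = p[1], sd = exp(p[2]), log = TRUE))

# Stage 1: several runs from random starting values, without the Hessian
starts <- replicate(5, rnorm(2), simplify = FALSE)
runs <- lapply(starts, function(p0) {
  suppressWarnings(nlm(nll, p0, obs = obs, hessian = FALSE))
})

# Stage 2: one final run from the best estimate, now requesting the Hessian
best <- runs[[which.min(sapply(runs, function(r) r$minimum))]]
final <- suppressWarnings(nlm(nll, best$estimate, obs = obs, hessian = TRUE))
final$hessian
```

The Hessian is only needed once, at the final estimate, for example to derive standard errors, so skipping it in the exploratory runs saves function evaluations.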

Error in fit_hmm(controls, events) : Id invalid. (S.1)

Run github instance, error reported

update on readData

process two different sources of data
truncation: find nearest date if truncation point does not exist, possibility to include NA for no truncation
print out which data is used and how
check in check_control if correct emp. data is supplied for HMM and HHMM
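The truncation rule from the list above (use the nearest available date, and treat NA as "no truncation") could look roughly like this. The helper `nearest_date()` and the example dates are hypothetical.

```r
# Trading days available in the data (hypothetical; a weekend is missing)
dates <- as.Date(c("2020-01-02", "2020-01-03", "2020-01-06", "2020-01-07"))

# Return the available date closest to target; NA means no truncation
nearest_date <- function(dates, target) {
  if (is.na(target)) return(as.Date(NA))
  gap <- abs(as.numeric(dates - as.Date(target)))   # distance in days
  dates[which.min(gap)]                             # first date wins on ties
}

nearest_date(dates, "2020-01-04")
```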

control "data"

Give controls parameter "data" which is a list containing all parameters related to data processing/simulation. Update documentation.

check_controls

Write function "check_controls" that checks control parameters and gives output about model formulation.

Odd behaviour for fixed dfs

Try the fixed-dfs model

controls = list(
  id = "test",
  sdds = c("t(Inf)",NA),
  states = c(2,0),
  time_horizon = c(100,NA),
  seed = 1
)

and see that two states cannot be identified. However, the dfs-flexible model

controls = list(
  id = "test",
  sdds = c("t",NA),
  states = c(2,0),
  time_horizon = c(100,NA),
  seed = 1
)

works.

update README

mini example
descrp of controls
saved outputs

Function for LaTeX output

Implement function that creates LaTeX output of model formulation and results.

Consistent function / file / parameter names with two parts

E.g. check_estimation or checkEstimation or check.estimation?

Flexible FS time horizon

For empirical data, allow the fine-scale horizon to be monthly or quarterly. This leads to different fine-scale chunk sizes. In this case, give a warning if controls[["data_cs_type"]] is not one of c("mean","mean_abs").

Improve numerical optimization

Early stopping of non-promising optimization runs.
Parallelise numerical optimization runs.
- Set number of cores in controls via ncores.
- In check_controls, read out available number of cores, give warning if not (all-1) and error if too many (>=all) cores are used.
- Divide all runs into ncores batches. Last one has ceiling(runs/ncores) runs, all others floor(runs/ncores) runs. Implement progress bar for last batch. ncores must not exceed runs.
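One possible way to assign runs to batches is sketched below. Note that it uses a round-robin split rather than the exact floor/ceiling scheme described above; the round-robin variant is chosen here because it assigns every run exactly once and keeps batch sizes within one of each other. The values of `runs` and `ncores` are illustrative.

```r
runs <- 10
ncores <- 4
stopifnot(ncores <= runs)   # ncores must not exceed runs

# Assign run indices to cores in round-robin order
batch_id <- rep_len(seq_len(ncores), runs)
batches <- split(seq_len(runs), batch_id)
lengths(batches)
```

Each element of `batches` could then be passed to one worker, for example via the parallel package.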

Check `set_controls` and `prepare_data`.

Check the new functions set_controls and prepare_data for clear documentation and expected behavior.

sim

simulate HMM and HHMM (depending on N=0 or N!=0)

Viterbi

fit Viterbi to new code

@return for each .Rd file

Add @return to roxygen tags and explain the functions results in the documentation.
Write about the structure of the output (class) and also what the output means. (If a function does not return a value, document that too).
Missing @return tag in
- apply_viterbi.Rd
- check_decoding.Rd
- create_visuals.Rd
- download_data.Rd
- fit_hmm.Rd
- plot_ll.Rd
- plot_sdd.Rd
- plot_ts.Rd

Pseudo-residuals normality

Include test for normality of pseudo-residuals, see package tseries, function jarque.bera.test.
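For reference, the Jarque-Bera statistic itself is simple to compute in base R. The sketch below uses the standard definition based on sample skewness and kurtosis with a chi-squared reference distribution with 2 degrees of freedom; the function name `jarque_bera` is hypothetical and this is not the tseries implementation.

```r
jarque_bera <- function(x) {
  n <- length(x)
  m <- mean(x)
  s2 <- mean((x - m)^2)                  # biased variance, as in the JB statistic
  skew <- mean((x - m)^3) / s2^1.5
  kurt <- mean((x - m)^4) / s2^2
  stat <- n / 6 * (skew^2 + (kurt - 3)^2 / 4)
  c(statistic = stat,
    p_value = pchisq(stat, df = 2, lower.tail = FALSE))
}

set.seed(1)
jarque_bera(rexp(500))   # clearly non-normal sample
```

Applied to pseudo-residuals, a small p-value would indicate a departure from normality and hence a questionable model fit.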

Inconsistent reference style in the vignettes

Make sure that the reference style is consistent in ref.bib (to discuss).

Error in check_controls(controls): File './data/x.csv' not found.

Hi,

I am trying to fit an hhmm on copper data, however, after

fit_hmm(controls, events)

the following error pops up:

Error in check_controls(controls): File './data/x.csv' not found.

Thank you in advance for the support.

visualization

make visualization more flexible for any number of states
add. parameter: vector with dates and labels to highlight in the plot (e.g. Lehman bankruptcy)
check that plots don't get overwritten

Access the predicted values in a list/array

I'm doing an analysis of our predictions and I would like to access the output in a list/array so I could run MAPE, MSE and some other indicators. I can't seem to find it - where is it?

Implement progress bar

Write own code for progress + ETA output. Remove dependence on "progress" package.

Odd graphics for sim. HHMM

See residuals and ts plot.

Flowchart with brackets

Add brackets for functions in flowchart.

Setup JSS paper template.

Set JSS paper structure and metadata.

Plot of SDDs for simulated HHMM gives odd x-scale

Try

controls = list(
  id            = "test", 
  sdds          = c("normal","normal"),
  states        = c(3,2),
  time_horizon  = c(100,30),
  at_true       = TRUE,
  overwrite     = TRUE,
  seed          = 4
)

and see that this gives an odd x-scale in sdds.pdf. Set them based on distribution limits.

Unexported functions

Add #' @export as last Roxygen tag for user-level functions.

PRS of simulated HHMM with normal SDDs on FS look odd

Try

controls = list(
  id            = "test", 
  sdds          = c("normal","normal"),
  states        = c(3,2),
  time_horizon  = c(100,30),
  at_true       = TRUE,
  overwrite     = TRUE,
  seed          = 4
)

and see that the pseudo-residuals of the fine scale are not normal.

Warnings when using other datasets

download_data("dax", "^GDAXI", path=".")

download_data("hk", "HEN3.DE", path=".")

horizon: 2020-01-02 to 2021-03-01

Warnings:

possibly unidentified states (C.6)
events ignored (V.2)

Implement predict function

Use model results to forecast the market.

control "nlm"

Give controls parameter "nlm" which is a list containing all parameters that can be passed to nlm. Update documentation.

Make graphics as ggplot2

Transform graphics from base R to ggplot2.

Ideas

A collection of ideas on how to further extend the code:

Functionality to loop over different numbers of states.
How to deal with NA values in empirical data ("Close" may not exist for every time point, two data sets may not share all close days)?
Include comparison between true states and predicted states for simulated data in contingency table.
Possibility to extract any column from dataset, not only "Close".
Extend for fix of degrees of freedom on one scale only.
Show progress bar before first iteration
Give error if any state = 1.
Download new data automatically from https://finance.yahoo.com/. Write function download_data in 'data.R'.
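The contingency-table idea from this list can be sketched with base R's table(). The two state vectors below are made up for illustration.

```r
# Hypothetical true and decoded state sequences of a simulated model
true_states    <- c(1, 1, 2, 2, 2, 3, 3, 1)
decoded_states <- c(1, 1, 2, 2, 3, 3, 3, 1)

# Rows: true states, columns: decoded states
tab <- table(true = true_states, decoded = decoded_states)
tab

# Share of time points decoded correctly (meaningful only if labels are aligned)
accuracy <- sum(diag(tab)) / sum(tab)
```

Because state labels are not identified, the decoded labels may need to be permuted before the diagonal of the table can be read as agreement.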

Create package overview.

Create (tikz?) graphic of package functions.

Wrong state-dependent distribution with 3 states

When running this code:

# simulated HMM -----------------------------------------------------------
library("fHMM")
library("magrittr")  # provides %>% and %<>%
seed = 1
controls = list(
  states = 3,
  sdds = "gamma",
  horizon = 500,
  fit = list("runs" = 100)
)
controls %<>% set_controls
data = prepare_data(controls, seed = seed)
data %>% summary
data %>% plot
model = fit_model(data, ncluster = 1, seed = seed) %>%
  decode_states %>%
  compute_residuals
summary(model)
model %<>% reorder_states(state_order = 1:3)
compare(model)
model %>% plot("ll")
model %>% plot("sdds")

the 3rd state is unfortunately not recognized correctly. I also ran the same code with 1000 runs, but the result did not change.

Is there a built-in function to graph the simulated data?

First of all - I love the package! I struggled with some not-so-userfriendly packages in the past, but this is really something else!

To my question - is there an existing function in the package to visualize/plot only the simulated data but with the same structure, visualizing the state scales underneath?

If not - is the simulated data bundled together with the data I fitted the model on in the data.rds file?

In the picture above, logReturns and dataRaw with 2714 elements. I only modelled a year, so this has got to be all of the data, right? So I only need to fetch the last 365 elements and then I'm fine?

I do hope I was clear enough. If there is any confusion, just leave a quick comment and I will try to explain further.

Thanks in advance,

Carlos

Create R-package

Add a folder called "R" that contains all .R files and a folder called "src" that contains all .cpp files (this is the folder structure that is required for an R package). Here's a cheat sheet on creating packages that could be useful for the development: https://github.com/rstudio/cheatsheets/raw/master/package-development.pdf.
Create a separate .R file for each function.
Write documentation for each function using roxygen tags, see comment below.
Where functions from other packages are used, use packageName::functionName() instead of functionName() to avoid conflicts.
Choose name for the package: fHMM
R package hex sticker
description file

likelihood computation

check that nLL_hmm works
check that nLL_hhmm works
transformation of thetaUncon to thetaCon with function
nLL_hmm can also be called from nlm

`coef` function for model coefficients

Implement coef method to extract estimated model coefficients.

estimation output

sort: state with highest mu at the front, descending
design estimation result output in txt-file (names, elements, order)
Hessian computation
AIC and BIC computation
check if iterlim was exceeded, if so, increase
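The sorting step from this list (state with the highest mu first) amounts to applying one permutation consistently to all state-specific parameters. A sketch with made-up values:

```r
# Hypothetical estimates for a 3-state model
mu <- c(0.5, 3, 1)
Gamma <- matrix(c(0.8, 0.1, 0.1,
                  0.2, 0.7, 0.1,
                  0.3, 0.3, 0.4), 3, 3, byrow = TRUE)

# Permutation that orders states by descending mu
ord <- order(mu, decreasing = TRUE)

mu_sorted <- mu[ord]
Gamma_sorted <- Gamma[ord, ord]   # permute rows and columns together
```

Permuting both rows and columns of Gamma keeps it a valid transition probability matrix for the relabeled states.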

Extend for other SDDs

Extend code for other state-dependent distributions:

t: t-distribution
t(x): t-distribution with x fixed degrees of freedom (which replaces fix_dfs in controls)
norm: normal distribution, i.e. t(Inf)
gamma: gamma-distribution

Include control sdd (character vector of length two).
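The remark that the normal distribution corresponds to t(Inf) can be checked numerically. The sketch below assumes a location-scale t density of the form dt((x - mu) / sigma, df) / sigma; the helper name `dt_scaled` and the parameter values are hypothetical.

```r
# Location-scale t density (assumed form of the state-dependent t-distribution)
dt_scaled <- function(x, mu, sigma, df) dt((x - mu) / sigma, df = df) / sigma

x <- seq(-0.05, 0.05, by = 0.01)

# With infinite degrees of freedom the t density coincides with the normal density
all.equal(dt_scaled(x, mu = 0.001, sigma = 0.01, df = Inf),
          dnorm(x, mean = 0.001, sd = 0.01))
```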