fhmm's Introduction

HMMs for Finance

The {fHMM} R package allows for the detection and characterization of financial market regimes in time series data by applying hidden Markov Models (HMMs). The vignettes outline the package functionality and the model formulation. For a reference on the method, see

Oelschläger, L., and T. Adam. 2021. “Detecting Bearish and Bullish Markets in Financial Time Series Using Hierarchical Hidden Markov Models.” Statistical Modelling. https://doi.org/10.1177/1471082X211034048

Below, we illustrate an application to the German stock index DAX. We also show how to use the package to simulate HMM data, compute the model likelihood, and decode the hidden states using the Viterbi algorithm.

Installation

You can install the released package version from CRAN with:

install.packages("fHMM")

Contributing

We are open to contributions and would appreciate your input:

  • If you encounter any issues, please submit bug reports as issues.

  • If you have any ideas for new features, please submit them as feature requests.

  • If you would like to add extensions to the package, please fork the master branch and submit a merge request.

Example: Fitting an HMM to the DAX

We fit a 3-state HMM with state-dependent t-distributions to the DAX log-returns from 2000 to 2022. The states can be interpreted as proxies for bearish (green below) and bullish markets (red) and an “in-between” market state (yellow).

library("fHMM")

The package has a built-in function to download financial data from Yahoo Finance:

dax <- download_data(symbol = "^GDAXI")

We first need to define the model:

controls <- set_controls(
  states      = 3,
  sdds        = "t",
  file        = dax,
  date_column = "Date",
  data_column = "Close",
  logreturns  = TRUE,
  from        = "2000-01-01",
  to          = "2022-12-31"
)

The function prepare_data() then prepares the data for estimation:

data <- prepare_data(controls)

The summary() method gives an overview:

summary(data)
#> Summary of fHMM empirical data
#> * number of observations: 5882 
#> * data source: data.frame 
#> * date column: Date 
#> * log returns: TRUE

We fit the model and subsequently decode the hidden states and compute (pseudo-) residuals:

model <- fit_model(data)
model <- decode_states(model)
model <- compute_residuals(model)
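
For intuition on what the decoding step does, the following is a minimal base R sketch of the Viterbi algorithm in log space. It is not the package's implementation: the function name viterbi() and its arguments are made up for this illustration, and the toy data uses two well-separated normal states.

```r
# Viterbi decoding in log space for a generic HMM (base R only).
# Hypothetical helper, not part of fHMM.
# log_dens: T x N matrix of log state-dependent densities
# Gamma:    N x N transition probability matrix
# delta:    initial state distribution of length N
viterbi <- function(log_dens, Gamma, delta) {
  n_obs <- nrow(log_dens)
  n_states <- ncol(log_dens)
  log_Gamma <- log(Gamma)
  xi <- matrix(-Inf, n_obs, n_states)  # best log-probability of a path ending in each state
  back <- matrix(0L, n_obs, n_states)  # back pointers to the best predecessor state
  xi[1, ] <- log(delta) + log_dens[1, ]
  for (t in seq_len(n_obs)[-1]) {
    for (j in seq_len(n_states)) {
      cand <- xi[t - 1, ] + log_Gamma[, j]
      back[t, j] <- which.max(cand)
      xi[t, j] <- max(cand) + log_dens[t, j]
    }
  }
  # trace the most likely state sequence backwards
  states <- integer(n_obs)
  states[n_obs] <- which.max(xi[n_obs, ])
  for (t in rev(seq_len(n_obs - 1))) {
    states[t] <- back[t + 1, states[t + 1]]
  }
  states
}

# Toy data: 50 observations from state 1, then 50 from state 2
set.seed(1)
Gamma <- matrix(c(0.9, 0.1, 0.1, 0.9), 2, 2, byrow = TRUE)
truth <- rep(c(1L, 2L), each = 50)
x <- rnorm(100, mean = c(-5, 5)[truth], sd = 1)
log_dens <- cbind(dnorm(x, -5, 1, log = TRUE), dnorm(x, 5, 1, log = TRUE))
decoded <- viterbi(log_dens, Gamma, delta = c(0.5, 0.5))
table(decoded, truth)
```

Working with log-probabilities avoids numerical underflow for long time series, which is why the sketch adds log terms instead of multiplying probabilities.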

The summary() method gives an overview of the model fit:

summary(model)
#> Summary of fHMM model
#> 
#>   simulated hierarchy       LL       AIC       BIC
#> 1     FALSE     FALSE 17650.02 -35270.05 -35169.85
#> 
#> State-dependent distributions:
#> t() 
#> 
#> Estimates:
#>                   lb   estimate        ub
#> Gamma_2.1  2.747e-03  5.024e-03 9.133e-03
#> Gamma_3.1  2.080e-13  2.060e-13 2.029e-13
#> Gamma_1.2  1.006e-02  1.839e-02 3.337e-02
#> Gamma_3.2  1.516e-02  2.446e-02 3.924e-02
#> Gamma_1.3  2.250e-11  2.232e-11 2.198e-11
#> Gamma_2.3  1.195e-02  1.898e-02 2.995e-02
#> mu_1      -3.862e-03 -1.793e-03 2.754e-04
#> mu_2      -7.994e-04 -2.649e-04 2.696e-04
#> mu_3       9.642e-04  1.272e-03 1.579e-03
#> sigma_1    2.354e-02  2.586e-02 2.840e-02
#> sigma_2    1.226e-02  1.300e-02 1.380e-02
#> sigma_3    5.390e-03  5.833e-03 6.312e-03
#> df_1       5.551e+00  1.084e+01 2.115e+01
#> df_2       6.814e+00  4.866e+01 3.475e+02
#> df_3       3.973e+00  5.248e+00 6.934e+00
#> 
#> States:
#> decoded
#>    1    2    3 
#>  704 2926 2252 
#> 
#> Residuals:
#>      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
#> -3.517897 -0.664017  0.012171 -0.003261  0.673180  3.693577
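
The information criteria in the summary can be reproduced by hand from the log-likelihood. The sketch below assumes the standard definitions AIC = -2 LL + 2 k and BIC = -2 LL + k log(n), with k = 15 estimated parameters (the rows of the estimates table above) and n = 5882 observations. Small deviations in the last digit are expected because the printed log-likelihood is rounded.

```r
# Values copied from the summary output above
ll    <- 17650.02         # log-likelihood at the estimate
n_obs <- 5882             # number of observations
n_par <- 6 + 3 + 3 + 3    # transition probabilities, mu, sigma, df

# Standard definitions (assumed here, not taken from the package source)
aic <- -2 * ll + 2 * n_par
bic <- -2 * ll + log(n_obs) * n_par

round(c(AIC = aic, BIC = bic), 2)
```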

Having estimated the model, we can visualize the state-dependent distributions and the decoded time series:

events <- fHMM_events(
  list(dates = c("2001-09-11", "2008-09-15", "2020-01-27"),
       labels = c("9/11 terrorist attack", "Bankruptcy Lehman Brothers", "First COVID-19 case Germany"))
)
plot(model, plot_type = c("sdds","ts"), events = events)

The (pseudo-) residuals help to evaluate the model fit:

plot(model, plot_type = "pr")

Simulating HMM data

The {fHMM} package supports data simulation from an HMM and access to the model likelihood function for model fitting and the Viterbi algorithm for state decoding.
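
To make the two ingredients concrete before turning to the package functions, here is a base R sketch that simulates a 2-state HMM and evaluates its log-likelihood with the scaled forward algorithm. It is independent of fHMM: the helper hmm_loglik() is hypothetical, and the sketch assumes that the Gamma distribution is parameterized by its mean mu and standard deviation sigma and that the chain starts in its stationary distribution.

```r
# Log-likelihood of an HMM with Gamma state-dependent distributions,
# computed with the scaled forward algorithm (hypothetical helper).
hmm_loglik <- function(x, Gamma, mu, sigma) {
  n_states <- nrow(Gamma)
  shape <- mu^2 / sigma^2   # assumption: mean / sd parameterization
  rate  <- mu / sigma^2
  # stationary distribution: solves delta %*% (I - Gamma + U) = 1
  delta <- solve(t(diag(n_states) - Gamma + 1), rep(1, n_states))
  # densities: rows = time points, columns = states
  dens <- matrix(
    dgamma(rep(x, times = n_states),
           shape = rep(shape, each = length(x)),
           rate  = rep(rate,  each = length(x))),
    nrow = length(x)
  )
  phi <- delta * dens[1, ]
  ll <- 0
  for (t in seq_along(x)) {
    if (t > 1) phi <- as.vector(phi %*% Gamma) * dens[t, ]
    ll <- ll + log(sum(phi))   # accumulate the log of the scaling factor
    phi <- phi / sum(phi)      # rescale to avoid underflow
  }
  ll
}

# Simulate the Markov chain and the state-dependent observations
set.seed(1)
Gamma <- matrix(c(0.95, 0.05, 0.05, 0.95), 2, 2)
mu <- c(1, 3)
sigma <- c(1, 3)
n <- 1000
states <- integer(n)
states[1] <- sample(1:2, 1)
for (t in 2:n) states[t] <- sample(1:2, 1, prob = Gamma[states[t - 1], ])
x <- rgamma(n, shape = (mu^2 / sigma^2)[states], rate = (mu / sigma^2)[states])

ll_true <- hmm_loglik(x, Gamma, mu, sigma)
```

A quick sanity check of such a function is to give both states identical parameters: the HMM then collapses to independent draws, and the result must equal the sum of the log-densities.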

  1. As an example, we consider a 2-state HMM with state-dependent Gamma distributions and a time horizon of 1000 data points.
controls <- set_controls(
  states  = 2,
  sdds    = "gamma",
  horizon = 1000
)
  2. Define the model parameters via the fHMM_parameters() function (unspecified parameters are set at random).
par <- fHMM_parameters(
  controls = controls,
  Gamma    = matrix(c(0.95, 0.05, 0.05, 0.95), 2, 2), 
  mu       = c(1, 3), 
  sigma    = c(1, 3)
)
  3. Simulate data points from this model via the simulate_hmm() function.
sim <- simulate_hmm(
  controls        = controls,
  true_parameters = par
)
plot(sim$data, col = sim$markov_chain, type = "b")

  4. The log-likelihood function ll_hmm() is evaluated at the identified and unconstrained parameter values, which can be derived via the par2parUncon() function.
(parUncon <- par2parUncon(par, controls))
#> gammasUncon_21 gammasUncon_12      muUncon_1      muUncon_2   sigmaUncon_1 
#>      -2.944439      -2.944439       0.000000       1.098612       0.000000 
#>   sigmaUncon_2 
#>       1.098612 
#> attr(,"class")
#> [1] "parUncon" "numeric"

Note that this transformation takes care of the restrictions that Gamma must be a transition probability matrix (which we can ensure via the logit link) and that mu and sigma must be positive (an assumption for the Gamma distribution, which we can ensure via the exponential link).

ll_hmm(parUncon, sim$data, controls)
#> [1] -1620.515
ll_hmm(parUncon, sim$data, controls, negative = TRUE)
#> [1] 1620.515
  5. For maximum likelihood estimation of the model parameters, we can numerically optimize ll_hmm() over parUncon (or rather minimize the negative log-likelihood).
optimization <- nlm(
  f = ll_hmm, p = parUncon, observations = sim$data, controls = controls, negative = TRUE
)

(estimate <- optimization$estimate)
#> [1] -3.4633918 -3.4406564  0.0599985  1.0645290  0.1151781  1.0794625
  6. To interpret the estimate, it needs to be back-transformed to the constrained parameter space via the parUncon2par() function. Note that the state labeling is not identified.
class(estimate) <- "parUncon"
estimate <- parUncon2par(estimate, controls)

par$Gamma
#>         state_1 state_2
#> state_1    0.95    0.05
#> state_2    0.05    0.95
estimate$Gamma
#>            state_1    state_2
#> state_1 0.96895127 0.03104873
#> state_2 0.03037199 0.96962801

par$mu
#> muCon_1 muCon_2 
#>       1       3
estimate$mu
#>  muCon_1  muCon_2 
#> 1.061835 2.899473

par$sigma
#> sigmaCon_1 sigmaCon_2 
#>          1          3
estimate$sigma
#> sigmaCon_1 sigmaCon_2 
#>   1.122073   2.943097
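
The printed values can be cross-checked with base R alone. The sketch below assumes that, in the 2-state case, each off-diagonal transition probability is mapped with the plain logit and that mu and sigma are mapped with the log, which is consistent with the numbers shown above. For more than two states a row-wise (multinomial) logit would be needed, which this sketch does not cover.

```r
# Forward transformation of the true parameters (2-state case only)
gamma_offdiag <- c(0.05, 0.05)   # Gamma[2, 1] and Gamma[1, 2]
mu    <- c(1, 3)
sigma <- c(1, 3)
uncon <- c(qlogis(gamma_offdiag), log(mu), log(sigma))
round(uncon, 6)

# Back-transformation of the nlm() estimate printed above
estimate_uncon <- c(-3.4633918, -3.4406564, 0.0599985,
                    1.0645290, 0.1151781, 1.0794625)
gamma_hat <- plogis(estimate_uncon[1:2])  # off-diagonal transition probabilities
mu_hat    <- exp(estimate_uncon[3:4])
sigma_hat <- exp(estimate_uncon[5:6])
round(c(gamma_hat, mu_hat, sigma_hat), 6)
```

Optimizing over the unconstrained values lets nlm() search the whole real line while every candidate still maps back to a valid parameter set.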

fhmm's People

Contributors

loelschlaeger, rouvenm, timoadam


fhmm's Issues

Examples in .Rd-files

Add small executable examples to the main .Rd files to illustrate the use of the exported functions and to enable automatic testing.

Incorporate covariates

Incorporate covariates into the state process(es) to determine which factors affect the probabilities of switching to bearish and bullish markets, respectively (just an idea, perhaps something for later versions of the package!).

Documentation issues

  • set_controls() should be listed second because prepare_data() depends on it. Clarify: is it an object of class RprobitB_controls or fHMM_controls?
  • prepare_data() should be listed third, because it introduces the class fHMM_data, which is needed for the plot method that is currently listed second. This could be confusing.
  • decode_states(): I would write "the most likely" state sequence. The reference to RprobitB is unclear.
  • fHMM_events(): it is not quite clear to me what exactly this function does. Does it check that the given events are in a suitable format to be read in?
  • fHMM_parameters: "A tpm of dimension controls$states[1]." I would spell out tpm. The difference between mu and mus_star does not become clear.

Simulated values

I'm raising the issue again - I can't find the simulated data in the provided model files. Just the graphical representation. I've looked through the output files and I can't find it.

Data on coarse scale

Problem

Averages of log-returns do not seem to be the best choice for the coarse scale. It is hard for the code to detect different states and state switches in this type of data.

Idea

Include parameter in controls to select type of coarse scale data (e.g. sum of absolute values, mean, average of absolute values). Plot coarse-scale data in ts.pdf to see if this yields better data.
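
A rough sketch of the candidate aggregations, using base R only. The data is made up, and the chunk size of 20 as well as the column name sum_abs are placeholders for illustration, not existing control values.

```r
# Hypothetical fine-scale log-returns, grouped into three coarse-scale periods
set.seed(1)
fs_returns <- rnorm(60, sd = 0.01)
chunk <- rep(seq_len(3), each = 20)
chunks <- split(fs_returns, chunk)

# Candidate coarse-scale observations, one row per period
coarse <- data.frame(
  mean     = sapply(chunks, mean),
  mean_abs = sapply(chunks, function(r) mean(abs(r))),
  sum_abs  = sapply(chunks, function(r) sum(abs(r)))
)
coarse
```

The plain mean lets positive and negative returns cancel within a period, whereas the two absolute-value variants keep information about volatility, which may be what separates the states.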

Calculation of Hessian

Use option hessian=FALSE in nlm and only hessian=TRUE in final estimation run. Should give speed improvement.
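
A minimal sketch of this idea on a toy likelihood (normal data, not an HMM). The function neg_ll() and the starting values are made up for illustration; only nlm() and its hessian argument are relied upon.

```r
# Toy negative log-likelihood: normal data with mean theta[1] and sd exp(theta[2])
neg_ll <- function(theta, x) {
  -sum(dnorm(x, mean = theta[1], sd = exp(theta[2]), log = TRUE))
}

set.seed(1)
x <- rnorm(200, mean = 2, sd = 0.5)
starts <- list(c(0, 0), c(1, -1), c(3, 0.5))

# Exploratory runs: skip the Hessian to save time
runs <- lapply(starts, function(p) nlm(neg_ll, p, x = x, hessian = FALSE))
best <- runs[[which.min(sapply(runs, function(r) r$minimum))]]

# Final run from the best estimate: compute the Hessian once
final <- nlm(neg_ll, best$estimate, x = x, hessian = TRUE)
final$hessian
```

Because the final run starts at the best estimate, it should need few iterations, so the Hessian is effectively computed only once.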

update on readData

  • process two different sources of data
  • truncation: find nearest date if truncation point does not exist, possibility to include NA for no truncation
  • print out which data is used and how
  • check in check_control if correct emp. data is supplied for HMM and HHMM

control "data"

Give controls parameter "data" which is a list containing all parameters related to data processing/simulation. Update documentation.

check_controls

Write a function "check_controls" that checks the control parameters and prints information about the model formulation.

Odd behaviour for fixed dfs

Try the fixed-dfs model

controls = list(
  id = "test",
  sdds = c("t(Inf)",NA),
  states = c(2,0),
  time_horizon = c(100,NA),
  seed = 1
)

and see that two states cannot be identified. However, the dfs-flexible model

controls = list(
  id = "test",
  sdds = c("t",NA),
  states = c(2,0),
  time_horizon = c(100,NA),
  seed = 1
)

works.

Flexible FS time horizon

For empirical data, allow the fine-scale horizon to be monthly or quarterly. This leads to fine-scale chunks of different sizes. In this case, give a warning if controls[["data_cs_type"]] is not one of c("mean","mean_abs").

Improve numerical optimization

  • Early stopping of non-promising optimization runs.
  • Parallelise numerical optimization runs.
    • Set number of cores in controls via ncores.
    • In check_controls, read out available number of cores, give warning if not (all-1) and error if too many (>=all) cores are used.
    • Divide all runs into ncores batches. Last one has ceiling(runs/ncores) runs, all others floor(runs/ncores) runs. Implement progress bar for last batch. ncores must not exceed runs.
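
One possible way to form the batches, sketched in base R with made-up values for runs and ncores. Note that it deviates slightly from the rule above: it lets the batch sizes differ by at most one, so that they always add up to runs (with 10 runs and 4 cores, three batches of floor(10/4) plus one of ceiling(10/4) would cover only 9 runs).

```r
runs   <- 10   # hypothetical number of optimization runs
ncores <- 4    # hypothetical number of cores

# ncores must not exceed runs
stopifnot(ncores <= runs)

# Assign run indices to batches in round-robin fashion
batch_id <- rep_len(seq_len(ncores), runs)
batches  <- split(seq_len(runs), batch_id)

lengths(batches)
```

Each element of batches holds the run indices for one core and could be passed to a parallel apply function.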

sim

simulate HMM and HHMM (depending on N=0 or N!=0)

@return for each .Rd file

  • Add @return to the roxygen tags and explain the function's results in the documentation.
  • Write about the structure of the output (class) and also what the output means. (If a function does not return a value, document that too).
  • Missing @return tag in
    • apply_viterbi.Rd
    • check_decoding.Rd
    • create_visuals.Rd
    • download_data.Rd
    • fit_hmm.Rd
    • plot_ll.Rd
    • plot_sdd.Rd
    • plot_ts.Rd

visualization

  • make visualization more flexible for any number of states
  • additional parameter: vector with dates and labels to highlight in the plot (e.g. Lehman bankruptcy)
  • check that plots don't get overwritten

Access the predicted values in a list/array

I'm doing an analysis of our predictions and I would like to access the output in a list/array so I could run MAPE, MSE and some other indicators. I can't seem to find it - where is it?

Plot of SDDs for simulated HHMM gives odd x-scale

Try

controls = list(
  id            = "test", 
  sdds          = c("normal","normal"),
  states        = c(3,2),
  time_horizon  = c(100,30),
  at_true       = TRUE,
  overwrite     = TRUE,
  seed          = 4
)

and see that this gives an odd x-scale in sdds.pdf. Set them based on distribution limits.

PRS of simulated HHMM with normal SDDs on FS look odd

Try

controls = list(
  id            = "test", 
  sdds          = c("normal","normal"),
  states        = c(3,2),
  time_horizon  = c(100,30),
  at_true       = TRUE,
  overwrite     = TRUE,
  seed          = 4
)

and see that the pseudo-residuals of the fine scale are not normal.

Warnings when using other datasets

download_data("dax", "^GDAXI", path=".")

download_data("hk", "HEN3.DE", path=".")

horizon: 2020-01-02 to 2021-03-01

Warnings:

  1. possibly unidentified states (C.6)
  2. events ignored (V.2)

control "nlm"

Give controls parameter "nlm" which is a list containing all parameters that can be passed to nlm. Update documentation.

Ideas

A collection of ideas on how to further extend the code:

  • Functionality to loop over different numbers of states.
  • How to deal with NA values in empirical data ("Close" may not exist for every time point, two data sets may not share all close days)?
  • Include comparison between true states and predicted states for simulated data in contingency table.
  • Possibility to extract any column from dataset, not only "Close".
  • Extend for fix of degrees of freedom on one scale only.
  • Show progress bar before first iteration
  • Give error if any state = 1.
  • Download new data automatically from https://finance.yahoo.com/. Write function download_data in 'data.R'.

Wrong state-dependent distribution with 3 states

When running this code:

# simulated HMM -----------------------------------------------------------
library("fHMM")
library("magrittr")  # provides the %>% and %<>% operators used below
seed = 1
controls = list(
  states = 3,
  sdds = "gamma",
  horizon = 500,
  fit = list("runs" = 100)
)
controls %<>% set_controls
data = prepare_data(controls, seed = seed)
data %>% summary
data %>% plot
model = fit_model(data, ncluster = 1, seed = seed) %>%
  decode_states %>%
  compute_residuals
summary(model)
model %<>% reorder_states(state_order = 1:3)
compare(model)
model %>% plot("ll")
model %>% plot("sdds")

[image]

the third state is unfortunately not identified correctly. I also ran the same code once with 1000 runs, but the result did not change.

Is there a built-in function to graph the simulated data?

First of all - I love the package! I struggled with some not-so-userfriendly packages in the past, but this is really something else!

To my question - is there an existing function in the package to visualize/plot only the simulated data but with the same structure, visualizing the state scales underneath?

If not - is the simulated data bundled together with the data I fitted the model on in the data.rds file?

[image: lenn]

In the picture above, logReturns and dataRaw each have 2714 elements. I only modelled a year, so this has got to be all of the data, right? So I only need to fetch the last 365 elements and then I'm fine?

I do hope I was clear enough. If there is any confusion, just leave a quick comment and I will try to explain further.

Thanks in advance,

Carlos

Create R-package

  • Add a folder called "R" that contains all .R files and a folder called "src" that contains all .cpp files (this is the folder structure that is required for an R package). Here's a cheat sheet on creating packages that could be useful for the development: https://github.com/rstudio/cheatsheets/raw/master/package-development.pdf.
  • Create a separate .R file for each function.
  • Write documentation for each function using roxygen tags, see comment below.
  • Where functions from other packages are used, use packageName::functionName() instead of functionName() to avoid conflicts.
  • Choose name for the package: fHMM
  • R package hex sticker
  • description file

likelihood computation

  • check that nLL_hmm works
  • check that nLL_hhmm works
  • transformation of thetaUncon to thetaCon with function
  • nLL_hmm can also be called from nlm

estimation output

  • sort: state with highest mu at the front, descending
  • design estimation result output in txt-file (names, elements, order)
  • Hessian computation
  • AIC and BIC computation
  • check if iterlim was exceeded, if so, increase

Extend for other SDDs

Extend code for other state-dependent distributions:

  • t: t-distribution
  • t(x): t-distribution with x fixed degrees of freedom (which replaces fix_dfs in controls)
  • norm: normal distribution, i.e. t(Inf)
  • gamma: gamma-distribution

Include control sdd (character vector of length two).
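
The relation between the t and norm options can be checked in base R, where dt() accepts df = Inf. The location-scale helper dt_ls() below is hypothetical and only illustrates how a t-distribution with mean parameter mu and scale sigma could be evaluated. It is not the package's code.

```r
x <- seq(-3, 3, by = 0.5)

# A t-distribution with infinite degrees of freedom is the standard normal
all.equal(dt(x, df = Inf), dnorm(x))

# Hypothetical location-scale t density: density of mu + sigma * T
dt_ls <- function(x, mu, sigma, df) dt((x - mu) / sigma, df = df) / sigma

# With df = Inf this reduces to the normal density with mean mu and sd sigma
all.equal(dt_ls(x, mu = 1, sigma = 2, df = Inf), dnorm(x, mean = 1, sd = 2))
```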
