andybega / spduration Goto Github PK

Split-population duration model in R

Home Page: https://andybeger.com/spduration/

R 79.55% C++ 8.37% C 0.91% Makefile 0.87% TeX 10.30%

r split-population survival-analysis regression mixture-model

spduration's Introduction

spduration

spduration implements a split-population duration model for duration data with time-varying covariates where a significant subset of the population or spells will not experience failure.

library("spduration")

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

# Prepare data
data(coups)
dur.coups <- add_duration(coups, "succ.coup", unitID="gwcode", tID="year",
                          freq="year")

# Estimate model
model.coups <- spdur(duration ~ polity2, atrisk ~ polity2, data = dur.coups,
                     silent = TRUE)
summary(model.coups)

## Call:
## spdur(duration = duration ~ polity2, atrisk = atrisk ~ polity2, 
##     data = dur.coups, silent = TRUE)
## 
## Duration equation: 
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.00150    0.23762  16.840  < 2e-16 ***
## polity2      0.20588    0.03037   6.779 1.21e-11 ***
## 
## Risk equation: 
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   6.5278     3.2556   2.005   0.0449 *
## polity2       0.8966     0.4084   2.196   0.0281 *
## 
##            Estimate Std. Error t value Pr(>|t|)
## log(alpha) -0.03204    0.11899  -0.269    0.788
## ---
## Signif. codes: *** = 0.001, ** = 0.01, * = 0.05, . = 0.1

plot(model.coups, type = "hazard")

Install

the latest released version from CRAN:

install.packages("spduration")

the latest development version:

library(devtools)
install_github("andybega/spduration")

Contact

submit suggestions, bugs, issues, questions at: https://github.com/andybega/spduration/issues
email: [email protected]

spduration's People

Contributors

Stargazers

Watchers

spduration's Issues

missing value error in `plot_hazard2()`

plot(m108_grid1_mdl, type="hazard2"):

 Error in quantile.default(pr, probs = 0.05) : 
  missing values and NaN's not allowed if 'na.rm' is FALSE 
6 stop("missing values and NaN's not allowed if 'na.rm' is FALSE") 
5 quantile.default(pr, probs = 0.05) 
4 quantile(pr, probs = 0.05) at plot_hazard.r#58
3 plot_hazard2(x, ...) at plot.spdur.R#29
2 plot.spdur(m108_grid1_mdl, type = "hazard2") 
1 plot(m108_grid1_mdl, type = "hazard2")

Uncouple forecast from forecast::forecast

No reason to require the whole package as a dependency. Make it a standalone function rather than generic method.

spdur() formula fix for 1 (constant)

duration ~ 1 will produce duplicate constants, duration ~ does not run.

Check that methods for spdur work with spdurCrisp. There is/was an issue with unevaluated arguments passed to spdur() within the spdurCrisp() wrapper that would break compatibility with other methods.

Rewrote spdurCrisp() to replaced the spdur() function call with the actual spdurCrisp() function call. Hopefully this fixes it since the arguments now should evaluated in the parent frame that is used by the methods like predict.

Extent methods for na.action

Extent the spdur methods and spdur to accomodate various na.action options.

See napredict.

xtable won't work with duplicate parameter names

> xtable(model.coups)
% latex table generated in R 3.0.2 by xtable 1.7-4 package
% Tue Oct 21 16:31:52 2014
 \begin{table}[ht]
\centering
\begin{tabular}{rrrr}
  \hline
 & Estimate & StdErr & p \\ 
  \hline
1 & 4.00 & 0.24 & 0.00 \\ 
  2 & 0.21 & 0.03 & 0.00 \\ 
  3 & -0.03 & 0.12 & 0.79 \\ 
  4 & 6.53 & 3.26 & 0.04 \\ 
  5 & 0.90 & 0.41 & 0.03 \\ 
   \hline
\end{tabular}
\end{table}
Warning message:
In data.row.names(row.names, rowsi, i) :
  some row.names duplicated: 5 --> row.names NOT used

buildDuration() fix for using years

Fix so that years are converted to dates

Implement "weights" option

Like:

library(CBPS)
library(data.table)
library(spduration)
data("bscoup")
dt <- data.table(bscoup)
dt <- dt[complete.cases(dt)]

# get weights by matching on treatment : recentcoups
fit <- CBPS(recentcoups ~ milreg + instab,
            data = dt,method = "exact")

dt[,coup := as.numeric(coup)-1]

# estimate hazard with with weights
model<- glm(coup ~ instab + recentcoups + wealth,
            weights = fitted(fit),data=dt,family=quasibinomial(link="cloglog"))

summary.spdur for models without risk intercept

Currently summary.spdur uses the location of the risk intercept to split the data frame of coefficients into separate pieces for the duration and risk equation:

from summary.spdur.R:

start_split <- which(names(object$coefficients)=='(Risk Intercept)')
end_duration <- start_split - 1

So if there is no risk intercept, i.e. a model with atrisk ~ -1 + ..., these variables will be missing and will lead to an error.

Maybe ID split by attr(model$mf.dur, "terms") %>% attr(., "intercept") in combination with number of terms.

Add example for how to run spdur with mice

How to run spdur with MI data using the mice package?

Something like this should work, provided there is a tidy.spdur implementation:

library(spduration)
library(mice)

# imp.data = some data imputed using mice
mice.data <- as.mids(as.data.frame(imp.data))

# The mice::with.mids function basically exposes the analysis content in expr
# to each imputed version of the dataset. This works for functions like lm()
# and survreg() because they can just directly access the variables and don't
# need the data argument. But with spdur, we are first augmenting the data
# with add_duration(). So basically I'm below re-assembling a data frame that
# I can use with add_duration()
mice.fit <- with(
  data = mice.data,
  expr = {
    # at this point, the variables in imp.data are directly accessible, e.g. as
    # ARREST, without needing to write imp.data$ARREST
    # use that to re-create a data frame. This will need to include all terms
    # that are used by add_duration() and also in spdur() below
    df <- data.frame(ARREST = ARREST, PUBID_1997 = PUBID_1997,
                     CAL_SURVEY_DATE = CAL_SURVEY_DATE,
                     AGE_FARREST = AGE_FARREST)
    df <- add_duration(df, "ARREST", unitID = "PUBID_1997", tID = "CAL_SURVEY_DATE", freq = "month", ongoing = TRUE)
    spdur(duration ~ AGE_FARREST, atrisk ~ AGE_FARREST, distr = "weibull", data = df, silent = TRUE)
  }
)

# mice::pool needs a broom::tidy method for class "spdur" to work;
# add this is a quick fix; I'll add this to the package at some point
tidy.spdur <- function(x, ...) {
  xx <- as.data.frame(x, row.names = FALSE)
  colnames(xx) <- c("term", "estimate", "std.error", "statistic", "p.value")
  xx$group <- NA
  xx$group[grepl("Dur_", xx$term)] <- "duration"
  xx$group[grepl("Risk_", xx$term)] <- "risk"
  xx$group[grepl("alpha", xx$term)] <- "scale"
  xx$term <- gsub("Dur_|Risk_", "", xx$term)
  xx
}

mice.res <- summary(pool(mice.fit), method = "rubin")
mice.res

bscoup country ID column is not G&W codes

Documentation claims it is, but they are not.

Submit to CRAN with R 3.4.2

A new R version, 3.4.2, was released last Thursday, but the package for OS X is not up yet, and Travis CI also still uses 3.4.1. Win-builder already had 3.4.2, but need to re-install R when it is ready and check local and Travis again, then submit to CRAN...

Add extract.spdur for texreg::extract?

This does not work but could be starting point:

extract.spdur<-function(model, beside = FALSE, include.duration = TRUE, include.risk = TRUE, 
    include.aic = TRUE, include.loglik = TRUE, include.nobs = TRUE, 
    include.bic=TRUE, include.dist = TRUE,...) 
{
    s <- summary(model, ...)
    dist <- model$distr
    gof <- numeric()
    gof.names <- character()
    gof.decimal <- logical()
    if (include.nobs == TRUE) {
        n <- model$obs
        gof <- c(gof, n)
        gof.names <- c(gof.names, "Num. obs.")
        gof.decimal <- c(gof.decimal, FALSE)
    }
    if (include.loglik == TRUE) {
        lik <- logLik(model)[1]
        gof <- c(gof, lik)
        gof.names <- c(gof.names, "Log Likelihood")
        gof.decimal <- c(gof.decimal, TRUE)
    }
    if (include.aic == TRUE) {
        aic <- AIC(model)
        gof <- c(gof, aic)
        gof.names <- c(gof.names, "AIC")
        gof.decimal <- c(gof.decimal, TRUE)
    }
    if (include.bic == TRUE) {
        bic <- BIC(model)
        gof <- c(gof, bic)
        gof.names <- c(gof.names, "BIC")
        gof.decimal <- c(gof.decimal, TRUE)
    }
    duration <- s$duration
    alpha <- s$alpha
    risk <- s$split
    if (beside == FALSE) {
        if (include.duration == TRUE && include.risk == TRUE) {
            names(duration) <- paste("Duration model:", rownames(duration))
            names(risk) <- paste("Risk model:", rownames(risk))
            coef.block <- rbind(duration, alpha,risk)
        }
        else if (include.duration == TRUE) {
            coef.block <- rbind(duration, alpha)
        }
        else if (include.risk == TRUE) {
            coef.block <- risk
        }
        else {
            stop(paste("Either the include.duration or the include.risk argument", 
                "must be TRUE."))
        }
        names <- rownames(coef.block)
        co <- coef.block[, 1]
        se <- coef.block[, 2]
        pval <- coef.block[, 4]
        tr <- createTexreg(coef.names = names, coef = co, se = se, 
            pvalues = pval, gof.names = gof.names, gof = gof, 
            gof.decimal = gof.decimal)
        return(tr)
    }
    else {
        trList <- list()
        d.names <- c(names(duration), "log(alpha)")
        d.co <- c(duration[, 1], alpha[, 1])
        d.se <- c(duration[, 2], alpha[, 2])
        d.pval <- c(duration[, 4],alpha[, 4])
        r.names <- rownames(risk)
        r.co <- risk[, 1]
        r.se <- risk[, 2]
        r.pval <- risk[, 4]
        if (include.duration == TRUE) {
            tr <- createTexreg(coef.names = d.names, coef = d.co, 
                se = d.se, pvalues = d.pval, gof.names = gof.names, 
                gof = gof, gof.decimal = gof.decimal, 
                model.name = paste("Duration model (", dist,")", sep=""))
            trList[[length(trList) + 1]] <- tr
        }
        if (include.risk == TRUE) {
            tr <- createTexreg(coef.names = r.names, coef = r.co, 
                se = r.se, pvalues = r.pval, gof.names = gof.names, 
                gof = gof, gof.decimal = gof.decimal, model.name = "Risk model")
            trList[[length(trList) + 1]] <- tr
        }
        if (length(trList) == 0) {
            stop(paste("Either the include.duration or the include.risk argument", 
                "must be TRUE."))
        }
        return(trList)
    }
}

setMethod("extract", signature = className("spdur", "spduration"),
    definition = extract.spdur)

countryplot error

in coups demo, fails R check

print.summary.spdur with constants only

Error in printCoefmat(x$duration, P.values = T, has.Pvalue = T, digits = 4, :
'x' must be coefficient matrix/data frame

Check paper equation 5 numerator

Yuji Nakayama writes:

Thank you for providing a fascinating R package, spduration. I have a question about your paper published in The R Journal https://journal.r-project.org/archive/2017/RJ-2017-056/RJ-2017-056.pdf

It seems to me, intuitively, that the numerator of equation (5) is not 1-pai, but pai i.e., unconditional risk rate, since pai(t) is conditional risk rate.

predict weibull >1

Happens around lines 76-78:

if (stat == "conditional hazard")
res <- atrisk.t * ft/pmax(1e-10, (cure.t + atrisk.t * st))

from spduration-paper example debugging:

Browse[2]> s0[2014]
2562
0.007962334
Browse[2]> st[2014]
2562
0.001401938
Browse[2]> ft[2014]/st[2014]
2562
1.752906

Integrate Surv() instead of add_duration()?

Integrate the package with the Surv() objects instead of the explicit add_duration function?

This would help match how other survival analysis is done in R.

How can one compare models or the appropriateness of a model?

Add information criteria? Basically, how can we tell that "spdur" is adding something over a simpler model.

separationplot does not take heading argument

?separationplot
?separationplot.spdur
data(model.coups)
plot(model.coups, heading="foo")

In separationplot.spdur heading is hard set to "".

Remove call from spdur object, summary.spdur, etc.

Discouraged, and not really needed. Also remove extract.call()

https://tidymodels.github.io/model-implementation-principles/the-model-object.html

spdur() and missing value options

spdur() needs an option for how to deal with missing values, similar to glm's na.action option maybe. Right now any missing values in the design matrix will stop the function.

Forecast failure prob x months ahead

Instead of individual monthly forecast probabilities, an option to calculate the cummulative probability of failure in the next x time periods.

Could be as simple as 1 - prod( 1 - h(t) ) over 1 through desired time periods in the future.

clarify how `forecast` works

Documentation should be more explicit about the assumptions, i.e. carry forward extrapolation of x and updating under assumption of no event for duration.

More fundamentally, the way forecasting should be done to best take advantage of existing information depends on the lag structure of the model. The best way to forecast, e.g., if the DV y is lead and covariates are not lagged versus a model in which x is lagged and the DV y is not lead is different. There is no easy way to incorporate knowledge of these modeling assumptions without adding attributes to the data/model frames, so maybe better to change forecast so that failure information (y) and extrapolated covariates (newx?) can be passed explicitly.

forecast stat functions missing

for both weibull and loglog, forecast stat functions for conditional/unconditional harzard are missing.

Change `spdur` object to record coefs in separate lists

Currently coefficients for duration and risk equations, along with estimate for alpha, are returned by optim in one data frame, and saved in the spdur object in one data frame. Later in summary.spdur they are split based on location of (Risk Intercept). This doesn't work for models with risk ~ -1..., and finding a workaround that also works when the risk equation include factor variables is tricky, because for factor variables a term could expand to level-1 terms.

See how pscl:zeroinfl handles this, i.e. coefficients = list(count = coefc, zero = coefz).

Will need to change summary.spdur and methods like coef.

buildDuration wrapper for CRISP

look at CRISP build.duration function for arguments. Should return list with training, test, and validation data.

Release spduration 0.17.2

There was on issue on CRAN with notes for r-devel:

Documented arguments not in \usage in Rd file 'hazard.Rd':
‘x’

Functions with \usage entries need to have the appropriate \alias
entries, and all their arguments documented.
The \usage entries must correspond to syntactically valid R code.
See chapter ‘Writing R documentation files’ in the ‘Writing R
Extensions’ manual.

Prepare for release:

Submit to CRAN:

usethis::use_version('patch')
devtools::submit_cran()
Approve email

Wait for CRAN...

Accepted 🎉
usethis::use_github_release()
usethis::use_dev_version(push = TRUE)

forecast.spdur manual

Migrate to testthat 3rd edition

https://testthat.r-lib.org/articles/third-edition.html

Switch plot and whatever it depends on to use average of observation-based predictions, not hypothetical mean case

data(model.coups)
plot(model.coups, type = "hazard")

Add tidy.spdur method

This is so that the package can work with MI data using mice, see #30, which basically already also has an implementation.

Here's a guide: https://www.tidymodels.org/learn/develop/broom/

spdur loglog version

library(spduration)

data(insurgency)
duration.ins <- buildDuration(insurgency, 'insurgency', unitID='ccode', tID='date')

model_loglog <- spdur(
duration ~ low_intensity + high_neighbors + exclpop.l1,
atrisk ~ excl_groups_count.l1 + high_neighborhood + high_intensity + exclpop.l1 + lgdppc.l1,
last='end.spell', data=duration.ins, distr="loglog", max.iter=300)

Fitting base loglog...
Error in optim(base.inits, loglog.lik, method = "BFGS", control = list(maxit = max.iter), :
initial value in 'vmmin' is not finite