Boosting models for fitting generalized additive models for location, shape and scale (GAMLSS) to potentially high-dimensional data. The current release version can be found on CRAN (https://cran.r-project.org/package=gamboostLSS).


gamboostlss's Introduction

gamboostLSS


gamboostLSS implements boosting algorithms for fitting generalized linear, additive and interaction models to potentially high-dimensional data. Instead of modeling only the mean, gamboostLSS enables the user to model various distribution parameters such as location, scale and shape at the same time (hence the name GAMLSS: generalized additive models for location, scale and shape).
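
As a quick illustration (a minimal sketch with simulated data), both the mean and the standard deviation of a Gaussian response can be boosted at the same time:

    library("gamboostLSS")
    set.seed(1)
    dat <- data.frame(x = runif(500, -1, 1))
    dat$y <- rnorm(500, mean = 2 * dat$x, sd = exp(0.5 * dat$x))
    ## boost mu and sigma simultaneously with linear base-learners
    mod <- glmboostLSS(y ~ x, data = dat, families = GaussianLSS(),
                       control = boost_control(mstop = 100))
    coef(mod)  # separate coefficient sets for mu and sigma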

Using gamboostLSS

  • For installation instructions see below.

  • Instructions on how to use gamboostLSS can be found in the gamboostLSS tutorial.

  • Details on the noncyclical fitting method can be found in

    Thomas, J., Mayr, A., Bischl, B., Schmid, M., Smith, A., and Hofner, B. (2018), Gradient boosting for distributional regression - faster tuning and improved variable selection via noncyclical updates. Statistics and Computing. 28: 673-687. DOI 10.1007/s11222-017-9754-6. (Preliminary version: ArXiv 1611.10171).

Issues & Feature Requests

For issues, bugs, feature requests, etc., please use the GitHub issue tracker.

Installation

  • Current version (from CRAN):

    install.packages("gamboostLSS")
    
  • Latest patch version (patched version of CRAN package; under development) from GitHub:

    library("devtools")
    install_github("boost-R/gamboostLSS")
    library("gamboostLSS")
    
  • Latest development version (version with new features; under development) from GitHub:

    library("devtools")
    install_github("boost-R/gamboostLSS", ref = "devel")
    library("gamboostLSS")
    

    To be able to use the install_github() command, one needs to install devtools first:

    install.packages("devtools")
    


gamboostlss's Issues

as.families("BEINF") not working

The following code does not work:

library("gamboostLSS")
library("gamlss.dist")
data <- data.frame(x1 = runif(400), x2 = runif(400))
data$y <- with(data, rBEINF(400, mu = range((x1 + x2 - min(x1 + x2) + 0.001) / (diff(range(x1 + x2)) + 0.002)), 
                          sigma = sqrt(x1), nu = 0.1, tau = 0.1))
mod <- gamboostLSS(y ~ x1 + x2, data = data, families = as.families("BEINF"))
# Error in FAM$dldm(y = y, mu = FAM$mu.linkinv(f), sigma = sigma, nu = nu,  : 
#   unused arguments (nu = nu, tau = tau)

The reason can be found in

gamlss.dist::as.gamlss.family("BEINF")$dldm
#  function (y, mu, sigma) 
#  {
#      a <- mu * (1 - sigma^2)/(sigma^2)
#      b <- a * (1 - mu)/mu
#      dldm <- ifelse(((y == 0) | (y == 1)), 0, ((1 - sigma^2)/(sigma^2)) * 
#          (-digamma(a) + digamma(b) + log(y) - log(1 - y)))
#      dldm
#  }

which is a function of mu and sigma only.

How can we fix this for this family (and potentially others)? Is a change in gamlss.dist needed or can we fix this ourselves? Are there other families leading to the same or a similar issue?
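
A possible workaround, just as a sketch (call_with_declared_args is a hypothetical helper, not part of the package): only pass the arguments that the gamlss.dist derivative function actually declares.

library("gamlss.dist")
## hypothetical helper: drop all arguments the target function does not declare
call_with_declared_args <- function(fun, args) {
    keep <- intersect(names(formals(fun)), names(args))
    do.call(fun, args[keep])
}
FAM <- as.gamlss.family("BEINF")
## nu and tau are silently dropped, avoiding the "unused arguments" error
call_with_declared_args(FAM$dldm,
                        list(y = 0.5, mu = 0.4, sigma = 0.3, nu = 0.1, tau = 0.1))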

@mayrandy, can you have a look at this?

Error when predicting with large N

I'm getting an error when trying to predict with gamboostLSS, but the error only appears for me when n becomes large.

E.g., running this code with 10000 rows of data is fine.

## try with 10000 obs

p <- 10
n <- 10000

x <- matrix(runif(n * p), nrow = n, ncol = p)
coefs <- ifelse(runif(p) < 0.5, rnorm(p), 0)
y <- x %*% coefs + rnorm(n)
x[, 10] <- ifelse(runif(n) < 0.5, x[, 10], NA)

xy <- data.frame(x,y)

forms <- list(mu = as.formula(X10 ~ .), phi = as.formula(X10 ~ .))

mod <- gamboostLSS(forms, data = xy[!is.na(xy[,10]),], families = BetaLSS())

preds <- predict(mod, newdata = xy[is.na(xy$X10),], type = "response")

## no error

## try with 50000 obs: error

p <- 10
n <- 50000

x <- matrix(runif(n * p), nrow = n, ncol = p)
coefs <- ifelse(runif(p) < 0.5, rnorm(p), 0)
y <- x %*% coefs + rnorm(n)
x[, 10] <- ifelse(runif(n) < 0.5, x[, 10], NA)

xy <- data.frame(x,y)

forms <- list(mu = as.formula(X10 ~ .), phi = as.formula(X10 ~ .))

mod <- gamboostLSS(forms, data = xy[!is.na(xy[,10]),], families = BetaLSS())

preds <- predict(mod, newdata = xy[is.na(xy$X10),], type = "response")

glmboost with method="outer" is broken

This bug was actually already fixed in the past (I just can't remember where exactly it occurred), but currently glmboostLSS with method = "outer" will always select mu.
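
A quick way to check which parameter is updated (a sketch, assuming the simulated dat with count response y from the other issues): risk(, merge = TRUE) is named by the updated parameter, so tabulating the names shows whether anything besides mu is ever selected.

model <- glmboostLSS(y ~ ., families = NBinomialLSS(), data = dat,
                     control = boost_control(mstop = 50), method = "outer")
## the first two entries are the initial risks; drop them before counting
table(names(risk(model, merge = TRUE)[-(1:2)]))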

Fix trace for non-cyclic models

## cyclic fitting: OK
> model <- glmboostLSS(y ~ ., families = NBinomialLSS(), data = dat,
+ control = boost_control(mstop = 100, trace = TRUE), method = "cycling")
[   1] ..................................................................... -- risk: 4625.79 
[  72] ...........................
Final risk: 4613.177 

## non-cyclic fitting: Not OK
> model <- glmboostLSS(y ~ ., families = NBinomialLSS(), data = dat,
+ control = boost_control(mstop = 100, trace = TRUE), method = "inner")
[   1]  -- risk: 4746.613 
[  4] ..................................................................... -- risk: 4654.223 
[ 75] .......................
Final risk: 4640.354 
.

> model <- glmboostLSS(y ~ ., families = NBinomialLSS(), data = dat,
+ control = boost_control(mstop = 100, trace = TRUE), method = "outer")
[   1]  -- risk: 4746.613 
[  4] ..................................................................... -- risk: 4654.22 
[ 75] .......................
Final risk: 4640.355 
.

The first risk is printed wrongly, the blanks in [  4] and [ 75] are wrong, and the trailing dot on its own line is wrong.

selected.mboostLSS is broken

set.seed(1907)
x1 <- rnorm(500)
x2 <- rnorm(500)
x3 <- rnorm(500)
x4 <- rnorm(500)
x5 <- rnorm(500)
x6 <- rnorm(500)
mu    <- exp(1.5 +1 * x1 +0.5 * x2 -0.5 * x3 -1 * x4)
sigma <- exp(-0.4 * x3 -0.2 * x4 +0.2 * x5 +0.4 * x6)
y <- numeric(500)
for( i in 1:500)
    y[i] <- rnbinom(1, size = sigma[i], mu = mu[i])
dat <- data.frame(x1, x2, x3, x4, x5, x6, y)

model <- glmboostLSS(y ~ ., families = NBinomialLSS(), data = dat,
                     control = boost_control(mstop = 10),
                     center = TRUE, method = "cyclic")
selected(model) # ok (at least in principle)
selected(model, merge = TRUE) # ok

model <- glmboostLSS(y ~ ., families = NBinomialLSS(), data = dat,
                     control = boost_control(mstop = 10),
                     center = TRUE, method = "noncyclic")
selected(model) # ok (at least in principle)
selected(model, merge = TRUE) ## BROKEN

risk(, merge = TRUE) broken for method = cyclic

It accidentally drops the names of the last two entries.

Non-cyclic (ok)

> set.seed(1907)
> x1 <- rnorm(1000)
> x2 <- rnorm(1000)
> x3 <- rnorm(1000)
> x4 <- rnorm(1000)
> x5 <- rnorm(1000)
> x6 <- rnorm(1000)
> mu    <- exp(1.5 +1 * x1 +0.5 * x2 -0.5 * x3 -1 * x4)
> sigma <- exp(-0.4 * x3 -0.2 * x4 +0.2 * x5 +0.4 * x6)
> y <- numeric(1000)
> for( i in 1:1000)
+     y[i] <- rnbinom(1, size = sigma[i], mu = mu[i])
> dat <- data.frame(x1, x2, x3, x4, x5, x6, y)
> model <- glmboostLSS(y ~ ., families = NBinomialLSS(), data = dat,
+                      control = boost_control(mstop = 10), 
+                      method = "noncyclic")
> 
> risk(model, merge = TRUE)
      mu    sigma       mu       mu       mu       mu       mu       mu       mu       mu       mu       mu 
3282.119 3282.119 3274.432 3267.559 3261.376 3255.652 3250.326 3245.387 3240.745 3236.436 3232.350 3228.557 

Cyclic:

> set.seed(1907)
> x1 <- rnorm(1000)
> x2 <- rnorm(1000)
> x3 <- rnorm(1000)
> x4 <- rnorm(1000)
> x5 <- rnorm(1000)
> x6 <- rnorm(1000)
> mu    <- exp(1.5 +1 * x1 +0.5 * x2 -0.5 * x3 -1 * x4)
> sigma <- exp(-0.4 * x3 -0.2 * x4 +0.2 * x5 +0.4 * x6)
> y <- numeric(1000)
> for( i in 1:1000)
+     y[i] <- rnbinom(1, size = sigma[i], mu = mu[i])
> dat <- data.frame(x1, x2, x3, x4, x5, x6, y)
> model <- glmboostLSS(y ~ ., families = NBinomialLSS(), data = dat,
+                      control = boost_control(mstop = 10), 
+                      method = "cyclic")
> 
> risk(model, merge = TRUE)
      mu    sigma       mu    sigma       mu    sigma       mu    sigma       mu    sigma       mu    sigma       mu    sigma 
3282.119 3282.119 3274.432 3272.107 3265.181 3263.019 3256.740 3254.728 3248.846 3246.962 3241.473 3239.714 3234.554 3232.905 
      mu    sigma       mu    sigma       mu    sigma     <NA>     <NA> 
3228.051 3226.507 3221.936 3220.486 3216.152 3214.691 3210.562 3209.164 
attr(,"class")
[1] "inbag"

Please fix and add a test to check if all names are set correctly.

mstop = 1 should be possible for non-cyclical fitting

glmboostLSS(y ~ ., families = NBinomialLSS(), data = dat,
                     control = boost_control(mstop = 1), method = "outer")

currently gives the error

Error in mboostLSS_fit(formula = formula, data = data, families = families, :
  mstop has to be an integer larger than 2

and if one fits a larger model and reduces it to 1, one gets

Warning message:
In attr(x, "subset")(i) :
  Minimal number of iterations: 2 (at least one iteration for each distribution parameter)

Does this really make sense for non-cyclical models?

Package not available via CRAN

Currently it seems impossible to obtain the package via CRAN:

> install.packages("gamboostLSS")
Installing package into **********************
(as ‘lib’ is unspecified)
Warning message:
package ‘gamboostLSS’ is not available for this version of R

A version of this package for your version of R might be available elsewhere,
see the ideas at
https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages

and also https://cran.r-project.org/package=gamboostLSS shows that the package is currently unavailable.

Make new dependency checks happy

To make the new dependency checks on CRAN happy, follow this recipe (as suggested by Kurt Hornik):

1.) Copy the functions listed after "Undefined global functions or variables:" in the R CMD check output into a variable txt:

txt <- "axis box coef dlogis dnorm fitted grey lines median optimize plogis
     plot pnorm points predict qlnorm qlogis qnorm qt qweibull rgb
     rmultinom tail update var weighted.mean"

2.) With the function

imports_for_undefined_globals <- function(txt, lst, selective = TRUE) {
    if(!missing(txt))
        lst <- scan(what = character(), text = txt, quiet = TRUE)
    nms <- lapply(lst, find)
    ind <- sapply(nms, length) > 0L
    imp <- split(lst[ind], substring(unlist(nms[ind]), 9L))
    if(selective) {
        sprintf("importFrom(%s)",
                vapply(Map(c, names(imp), imp),
                       function(e)
                           paste0("\"", e, "\"", collapse = ", "),
                       ""))
    } else {
        sprintf("import(\"%s\")", names(imp))
    }
}

one can obtain the imports via

writeLines(imports_for_undefined_globals(txt))

3.) Copy and paste the output to NAMESPACE.
4.) Add the packages to Imports in DESCRIPTION.

Try to duplicate less code

@ja-thomas, we should try to use better functional units to avoid duplication of code. Currently, we have a lot of spaghetti code instead of proper functional programming.

Can we aim for that in the next release, i.e. after the initial release of the non-cyclic fitting?

stabsel is broken

### Data generating process:
set.seed(1907)
x1 <- rnorm(500)
x2 <- rnorm(500)
x3 <- rnorm(500)
x4 <- rnorm(500)
x5 <- rnorm(500)
x6 <- rnorm(500)
mu    <- exp(1.5 +1 * x1 +0.5 * x2 -0.5 * x3 -1 * x4)
sigma <- exp(-0.4 * x3 -0.2 * x4 +0.2 * x5 +0.4 * x6)
y <- numeric(500)
for( i in 1:500)
    y[i] <- rnbinom(1, size = sigma[i], mu = mu[i])
dat <- data.frame(x1, x2, x3, x4, x5, x6, y)

model <- glmboostLSS(y ~ ., families = NBinomialLSS(), data = dat,
                     control = boost_control(mstop = 10),
                     center = TRUE, method = "cyclic")
s1 <- stabsel(model, q = 5, PFER = 1, B = 10) ## warning is expected

model <- glmboostLSS(y ~ ., families = NBinomialLSS(), data = dat,
                     control = boost_control(mstop = 10),
                     center = TRUE, method = "noncyclic")
s2 <- stabsel(model, q = 5, PFER = 1, B = 10) ## BROKEN

Perhaps the error vanishes once selected() is fixed (see #39). However, there are more FIXMEs in the code. Currently the code is not ready to go to CRAN.

Prepare new release version

  • Merge devel to master
  • Check that all relevant issues are closed
  • Check that reference to novel gamboostLSS algorithm + stabsel paper is present in all relevant places
  • Check vignette against published paper. Results should be (more or less) identical.
  • Check that all citations are up to date
  • Make sure NEWS.Rd reflects all recent changes.

Model reduction broken

Model reduction yields different coefficients if the model is set back to the earlier value.

MWE:

require("gamboostLSS")

###negbin dist, linear###

set.seed(2611)
x1 <- rnorm(1000)
x2 <- rnorm(1000)
x3 <- rnorm(1000)
x4 <- rnorm(1000)
x5 <- rnorm(1000)
x6 <- rnorm(1000)
mu    <- exp(1.5 + x1^2 +0.5 * x2 - 3 * sin(x3) -1 * x4)
sigma <- exp(-0.2 * x4 +0.2 * x5 +0.4 * x6)
y <- numeric(1000)
for (i in 1:1000)
  y[i] <- rnbinom(1, size = sigma[i], mu = mu[i])
dat <- data.frame(x1, x2, x3, x4, x5, x6, y)

for (i in 40:100) { 
  mod <- gamboostLSS(y ~ . , data = dat, families = NBinomialLSS(),
                     control = boost_control(mstop = i, trace  = FALSE), 
                     method = "outer", baselearner = "bols")

  cm <- coef(mod)

  mstop(mod) <- 5
  mstop(mod) <- i

  cat(i, all.equal.list(coef(mod), cm), "\n")
}

example code in families.Rd fails

For me, devtools::run_examples() fails for families.Rd, with

## families object for new distribution
newStudentT <- Families(mu= newStudentTMu(mu=mu, df=df),
                        df=newStudentTDf(mu=mu, df=df))

#  Error in body(x@response) : 
#   trying to get slot "response" from an object of a basic class ("function") with no slots

To be honest, I am not into Families. So can someone else please check?

PS: I am running on:
R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
gamboostLSS_2.0-1

gamlss.dist::BB, gamlss.dist::BI don't work

These are supposed to be called with a 2-column response giving successes & failures, but that doesn't seem to have been implemented yet here.

I have an ugly fix for cases where # of trials is constant for all the data (see below), but that's unsatisfactory for most practical applications, probably.

> options(error=traceback, deparse.max.lines = 10)
> library("gamboostLSS")
> library("gamlss")
> 
> # y ~ Binomial(p = invlogit(x + z), n= 60)
> n <- 100
> x <- rnorm(n)
> z <- rnorm(n)
> data <- data.frame(y = rbinom(n, p = plogis(x + z), size = 60), x = x, z= z)
> data$ymat <- with(data, cbind(success = data$y, fail = 60 - data$y))
> (m_gamlss <- gamlss(ymat ~ x + z, data = data, family = BB))
GAMLSS-RS iteration 1: Global Deviance = 700.266 
GAMLSS-RS iteration 2: Global Deviance = 553.8168 
GAMLSS-RS iteration 3: Global Deviance = 532.8859 
GAMLSS-RS iteration 4: Global Deviance = 532.8656 
GAMLSS-RS iteration 5: Global Deviance = 532.8656 

Family:  c("BB", "Beta Binomial") 
Fitting method: RS() 

Call:  gamlss(formula = ymat ~ x + z, family = BB, data = data) 

Mu Coefficients:
(Intercept)            x            z  
  -0.005859     1.028064     0.994731  
Sigma Coefficients:
(Intercept)  
     -5.265  

 Degrees of Freedom for the fit: 4 Residual Deg. of Freedom   96 
Global Deviance:     532.866 
            AIC:     540.866 
            SBC:     551.286 
> # gamlss(ymat ~ x + z, data = data, family = BI) # samesame
> 
> Betabin <- as.families(fname="BB")
> gamboostLSS(ymat ~ x + z, data = data, families = Betabin)
Error in eval(expr, envir, enclos) : object 'bd' not found
No traceback available 
> # bd is the "binomial denominator" (i.e., # of trials), not being handed over!
> 
> #-------------------------------------------------------------------------------
> # ugly, hacky fix: set bd-defaults to 60 wherever it's needed
> Betabin60 <- Betabin
> add_bd_60 <- function(f){
+   frmls <- formals(f)
+   frmls$bd <- 60
+   formals(f) <- frmls
+   f
+ }
> with(environment(Betabin60[[1]]@ngradient), {
+   #all slots share same env, so enough to do this for one slot
+   FAM$dldm <- add_bd_60(FAM$dldm)
+   FAM$d2ldm2  <- add_bd_60(FAM$d2ldm2)
+   FAM$d2ldm2  <- add_bd_60(FAM$d2ldm2)
+   FAM$dldd  <- add_bd_60(FAM$dldd)
+   FAM$d2ldd2  <- add_bd_60(FAM$d2ldd2)
+   FAM$G.dev.incr <- add_bd_60(FAM$G.dev.incr)
+   FAM$mu.initial <- quote(mu <- (y + 0.5)/(60 + 1))
+   pdf <- add_bd_60(pdf)
+ })
> with(environment(Betabin60[[2]]@ngradient), {
+   FAM$dldm <- add_bd_60(FAM$dldm)
+   FAM$d2ldm2  <- add_bd_60(FAM$d2ldm2)
+   FAM$d2ldm2  <- add_bd_60(FAM$d2ldm2)
+   FAM$dldd  <- add_bd_60(FAM$dldd)
+   FAM$d2ldd2  <- add_bd_60(FAM$d2ldd2)
+   FAM$G.dev.incr <- add_bd_60(FAM$G.dev.incr)
+   FAM$mu.initial <- quote(mu <- (y + 0.5)/(60 + 1))
+   pdf <- add_bd_60(pdf)
+ })
> #-------------------------------------------------------------------------------
> data$ones <- rep(1, n)
> # a matrix response like in gamlss produces a multivariate model for success and failure
> # probabilities (I guess?!?), not a model for success probabilities:
> coef(gamboostLSS(list(
+   mu = ymat ~ bols(x) + bols(z), 
+   sigma = ymat ~ bols(ones, intercept=FALSE)),
+   data = data, families = Betabin60))
$mu
$mu$`bols(x)`
(Intercept)           x        <NA>        <NA> 
-0.20896247  0.97583503  0.02081331 -0.97472539 

$mu$`bols(z)`
  (Intercept)             z          <NA>          <NA> 
 0.0001546734  0.9402122029 -0.2000348943 -0.9390756321 

attr(,"offset")
[1] 0.1940493

$sigma
$sigma$`bols(ones, intercept = FALSE)`
     ones      <NA> 
-3.221820 -3.210209 

attr(,"offset")
[1] 0
>
> coef(gamboostLSS(list(
+   mu = y ~ bols(x) + bols(z),
+   sigma = y ~ bols(ones, intercept=FALSE)),
+   data = data, families = Betabin60))
$mu
$mu$`bols(x)`
(Intercept)           x 
 -0.2089625   0.9758350 

$mu$`bols(z)`
 (Intercept)            z 
0.0001546734 0.9402122029 

attr(,"offset")
[1] 0.1940493

$sigma
$sigma$`bols(ones, intercept = FALSE)`
    ones 
-3.22182 

attr(,"offset")
[1] 0

codefile is here: https://gist.github.com/fabian-s/b952063213f22c31ab9ea99f681a56d7

predict for (some) noncyclic models broken

Spotted by Lisa Schlosser:

library("gamboostLSS")
set.seed(7)
data <- data.frame(x1 = runif(400, -1, 1),
                   x2 = runif(400, -1, 1),
                   x3 = runif(400, -1, 1),
                   x4 = runif(400, -1, 1))
data$y <- rnorm(400, 4 * data$x1 + 2)

gb_oob <- gamboostLSS(y ~ x1, data = data,
                      control = boost_control(risk = "oobag"), method = "noncyclic")
## works:
predict(gb_oob, type = "response", parameter = c("mu", "sigma"))

## error:
predict(gb_oob, type = "response", parameter = c("mu", "sigma"), newdata = data)

## works but seems wrong as sigma has 5 elements instead of 4:
predict(gb_oob, type = "response", parameter = c("mu", "sigma"), newdata = data[c(1:4),])

This only happens with risk = "oobag" and method = "noncyclic".

Problem calculating the risk in gamboostLSS with the noncyclical approach

Consider the following Gaussian distribution example:

library(gamboostLSS)

# g_muvsi is a function that allows to see which distribution
# parameter is updated in each iteration.
# If mu is updated, then returns 1, if sigma is updated, then it returns 2.
# The first value in the risk vector is removed as it is the initial value.
g_muvsi <- function(gamboostLSS_object) {
  df1 <- data.frame(risk = risk(gamboostLSS_object)$mu[-1], muvsi = 1)
  df2 <- data.frame(risk = risk(gamboostLSS_object)$sigma[-1], muvsi = 2)
  df <- rbind(df1, df2)
  df <- df[order(df$risk, decreasing = TRUE), ]
  return(df$muvsi)
}

set.seed(1)
n <- 3
x1 <- c(1,2,3)

mu <- x1
sigma <- exp(.1 * x1)

data <- cbind(x1)
y <- rnorm(3, mu, sigma)

# center x1
cx1 <- scale(x1, center = T, scale = F)

b_g <- gamboostLSS(y ~ bols(cx1),
                   control = boost_control(mstop = 10),
                   method = "noncyclic")

By applying the g_muvsi() function to the gamboostLSS object we can find that sigma is updated in the third boosting iteration.

> g_muvsi(b_g)
 [1] 1 2 2 1 2 2 1 2 2 1

However, if the negative gradients for mu and sigma after the second boosting iteration are calculated manually

mstop(b_g) <- 2
ngr_mu <- (1/exp(predict(b_g)$sigma)^2) * (y - predict(b_g)$mu)
ngr_si <- -1 + exp(-2 * predict(b_g)$sigma) * (y - predict(b_g)$mu)^2

fit them with linear regression

lm_mu <- lm(ngr_mu ~ cx1)
lm_si <- lm(ngr_si ~ cx1)

and finally calculate the empirical risk with step length 0.1

risk_mu <- -sum(dnorm(y, predict(b_g)$mu + .1 * lm_mu$fitted.values, exp(predict(b_g)$sigma), log = T))
risk_si <- -sum(dnorm(y, predict(b_g)$mu, exp(predict(b_g)$sigma + .1 * lm_si$fitted.values), log = T))

we'll get

> risk_mu
[1] 3.515798
> risk_si
[1] 3.517315

which means that mu, not sigma, should be updated in the third iteration.

According to my debug analysis, I've found that the empirical risk for one distribution parameter in iteration [m] depends on which distribution parameter was updated in the previous iteration [m-1]. If, like this example, sigma is updated in the second iteration, then the risk for mu in the third iteration is calculated with the following formula

> but_risk_mu <- -sum(dnorm(y, predict(b_g)$mu, exp(predict(b_g)$sigma), log = T))
> but_risk_mu
[1] 3.609269

Comparing but_risk_mu and risk_si, the winner is sigma, just the same as the gamboostLSS output.
Similarly, if mu is updated in iteration [m-1], then the risk for sigma in iteration [m] is not calculated like risk_si, but as a value without the step-length update:

but_risk_si <- -sum(dnorm(y, predict(b_g)$mu, exp(predict(b_g)$sigma), log = T))

I think the problem appears somewhere between rows 243 and 287, inside

if (method == "noncyclic") {

Update References

"Gradient boosting for distributional regression - faster tuning and improved variable selection via noncyclical updates", has been accepted for publication in Statistics and Computing.

Replace

Thomas, J., Mayr, A., Bischl, B., Schmid, M., Smith, A., and Hofner, B. (2016), 
Stability selection for component-wise gradient boosting in multiple dimensions. 
Preliminary version: \url{http://arxiv.org/abs/1611.10171}.

with

Thomas, J., Mayr, A., Bischl, B., Schmid, M., Smith, A., and Hofner, B. (2017), 
Gradient boosting for distributional regression - faster tuning and improved 
variable selection via noncyclical updates. 
Accepted for publication in \emph{Statistics and Computing}.\cr
Preliminary version: \url{http://arxiv.org/abs/1611.10171}.

bug in cvrisk.mboostLSS() if families at default

If the families argument is not specified explicitly in mboostLSS, one gets an error in cvrisk.mboostLSS() (spotted by Almond Stöcker).
See the following MWE:

library(gamboostLSS)

### Data generating process:
set.seed(1907)
x1 <- rnorm(1000)
mu    <- 2*x1
sigma <- rep(1, 1000)
y <- numeric(1000)
for( i in 1:1000)
       y[i] <- rnorm(1, mean = mu[i], sd=sigma[i])

dat <- data.frame(x1=x1, y=y)

## model with default families 
model <- mboostLSS(y ~ bbs(x1), data = dat)

## cvrisk.mboostLSS() does not work as families was not specified in model call
cvr <- cvrisk(model, folds = cv(model.weights(model), B=3), trace=FALSE)


## model with families = GaussianLSS()
model2 <- mboostLSS(y ~ bbs(x1), families = GaussianLSS())

## works as families was specified in model call
cvr2 <- cvrisk(model2, folds = cv(model.weights(model2), B=3), trace=FALSE)

I think that a possible fix is to define families in cl in mboostLSS after line
https://github.com/hofnerb/gamboostLSS/blob/master/pkg/R/mboostLSS.R#L8
by adding

if(is.null(cl$families)) cl$families <- GaussianLSS()

Best

predint() not working when multiple variables are included within a single base-learner

library("gamboostLSS")
set.seed(7)
data <- data.frame(x1 = runif(400, -1, 1),
                   x2 = runif(400, -1, 1),
                   x3 = runif(400, -1, 1),
                   x4 = runif(400, -1, 1))
data$y <- rnorm(400, 4 * data$x1 + 2 * data$x3)

gb <- gamboostLSS(y ~ btree(x1, x2, x3, x4), data = data, method = "noncyclic")

## error:
predint(gb, which = "x1")

Possible fix: in the predint() function, replace

pred_vars <- unique(unlist(pred_vars))

with

pred_vars <- unique(trimws(unlist(strsplit(unlist(pred_vars), ","))))
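
To see why the fix works (an illustration with a made-up value; the actual string comes from the base-learner's variable names): for btree(x1, x2, x3, x4) the extracted names form a single comma-separated string, which the current code never splits.

pred_vars <- list("x1, x2, x3, x4")  # hypothetical extracted names
unique(unlist(pred_vars))                                 # "x1, x2, x3, x4" -- one string
unique(trimws(unlist(strsplit(unlist(pred_vars), ","))))  # "x1" "x2" "x3" "x4"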

response(fitted()) and fitted(, type = response) in mboostLSS.R

This is just a question of understanding some of your code:

In the file mboostLSS.R, lines 159ff. say:

## update value of nuisance parameters
## use response(fitted()) as this is much quicker than fitted(, type = response)
 for (k in mods[-j])
     assign(names(fit)[k], families[[k]]@response(fitted(fit[[k]])),
                environment(get("ngradient", environment(fit[[j]]$subset))))

And consequently response(fitted()) is used.

I just wondered whether there is a reason that in line 130 fitted(, type = "response") is used instead of response(fitted())?
The corresponding code is:

for (k in mods[-j]){
    if (!is.null(fit[[k]]))
          assign(names(fit)[k], fitted(fit[[k]], type = "response"),
               environment(families[[j]]@ngradient))
}

Best

summary() fails if no base-learner was selected

For noncyclic fitting, it can happen that for a distribution parameter no base-learner is selected. In this case the summary() function fails.
See also boost-R/mboost#75

library(gamboostLSS)

### Data generating process:
set.seed(1907)
x1 <- rnorm(1000)
x2 <- rnorm(1000)
x3 <- rnorm(1000)
x4 <- rnorm(1000)
x5 <- rnorm(1000)
x6 <- rnorm(1000)
mu    <- exp(1.5 +1 * x1 +0.5 * x2 -0.5 * x3 -1 * x4)
sigma <- exp(-0.4 * x3 -0.2 * x4 +0.2 * x5 +0.4 * x6)
y <- numeric(1000)
for( i in 1:1000)
  y[i] <- rnbinom(1, size = sigma[i], mu = mu[i])
dat <- data.frame(x1, x2, x3, x4, x5, x6, y)

### linear model with y ~ . for both components: 400 boosting iterations
model <- glmboostLSS(y ~ ., families = NBinomialLSS(), data = dat,
                     control = boost_control(mstop = 400),
                     center = TRUE, method = "noncyclic")

## works fine
summary(model)

## summary() breaks if for one component no base-learner was selected
summary(model[10])

## breaks as well 
summary(model[0])

For mstop = 0, the summary() function also fails for cyclic fitting.

### linear model with y ~ . for both components: 400 boosting iterations
model2 <- glmboostLSS(y ~ ., families = NBinomialLSS(), data = dat,
                     control = boost_control(mstop = 400),
                     center = TRUE, method = "cyclic")

## works fine
summary(model2[0])

Delete stco_paper branch

The branch was actually in a wrong state... and still contained refactored stuff.

The correct branch is now stco_paper_outer_version (I can't force-push stco_paper since it's protected)

stco_paper branch should be deleted and stco_paper_outer_version protected

Scoping problem with "combined_risk" in nc_mboostLSS

attr(model1, "combined_risk") seems to always relate to the last fitted model instead of model1.
And perhaps related to this issue, the combined_risk cannot be returned when fitting multiple models with mclapply:

library(gamboostLSS)
library(parallel)  # for mclapply
###~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~###
### Scoping problem with "combined_risk" in noncyclical mboostLSS 
### Particularly, prohibiting the access to "combined_risk" after fitting multiple models via mclapply 
###~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~###

### Use first example of ?mboostLSS but with mclapply over the methods cyclic and noncyclic:

# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
### Data generating process:
set.seed(1907)
x1 <- rnorm(1000)
x2 <- rnorm(1000)
x3 <- rnorm(1000)
x4 <- rnorm(1000)
x5 <- rnorm(1000)
x6 <- rnorm(1000)
mu    <- exp(1.5 +1 * x1 +0.5 * x2 -0.5 * x3 -1 * x4)
sigma <- exp(-0.4 * x3 -0.2 * x4 +0.2 * x5 +0.4 * x6)
y <- numeric(1000)
for( i in 1:1000)
 y[i] <- rnbinom(1, size = sigma[i], mu = mu[i])
dat <- data.frame(x1, x2, x3, x4, x5, x6, y)
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


### linear models with y ~ . for both components: 20 boosting iterations fitted via mclapply
model <- mclapply(c("cyclic", "noncyclic"), function(method)
  glmboostLSS(y ~ ., families = NBinomialLSS(), data = dat,
              control = boost_control(mstop = 20), method = method, center = TRUE))
# only consider the noncyclic model[[2]] (as the cyclic does not have "combined_risk") 
model_nc <- model[[2]]

# Cannot access combined_risk
try(attr(model_nc, "combined_risk")())

### If there was another model previously fitted, the combined_risk of this other model is returned instead:
# e.g. a model with x1 as response instead, and mstop = 30
model_0 <- mboostLSS(x1 ~ ., families = GaussianLSS(), data = dat, 
                    control = boost_control(mstop = 30),method = "noncyclic")

try(length(attr(model_nc, "combined_risk")()))
# => combined_risk of model_0 (of length 30+2) is returned instead of the one from model_nc (length 20+2)


### And if we now fit model_nc separately, the combined_risk of model_0 is replaced !!!
model_1 <- glmboostLSS(y ~ ., families = NBinomialLSS(), data = dat, 
                      control = boost_control(mstop = 20),method = "noncyclic", center = TRUE)

length(attr(model_0, "combined_risk")())
# => now, length is 22 !

Specifying intercept models is not documented & a PITA

I think having to do

data$ones <- rep(1, n)
gamboostLSS(list( .... ~ ...,  bla = response ~ bols(ones, intercept =FALSE)), 
  data=data, families=SomethingCraycray())

instead of

 bla = response ~ 1

sucks major d*** in terms of usability. At least it should be documented somewhere that this is the way we want users to specify an intercept model / formula.....
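
In the meantime, a tiny helper could at least hide the boilerplate (a sketch; add_ones and intercept_only are hypothetical names, not package functions):

## add the constant column that the "ones" trick needs
add_ones <- function(data) {
    data$ones <- rep(1, nrow(data))
    data
}
## build an intercept-only formula for a given response
intercept_only <- function(response)
    reformulate("bols(ones, intercept = FALSE)", response = response)

data <- add_ones(data)
intercept_only("y")  # y ~ bols(ones, intercept = FALSE)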

Bug in mstop for non-cyclical fitting

require("gamboostLSS")

###negbin dist, linear###

set.seed(2611)
x1 <- rnorm(1000)
x2 <- rnorm(1000)
x3 <- rnorm(1000)
x4 <- rnorm(1000)
x5 <- rnorm(1000)
x6 <- rnorm(1000)
mu    <- exp(1.5 + x1^2 +0.5 * x2 - 3 * sin(x3) -1 * x4)
sigma <- exp(-0.2 * x4 +0.2 * x5 +0.4 * x6)
y <- numeric(1000)
for (i in 1:1000)
  y[i] <- rnbinom(1, size = sigma[i], mu = mu[i])
dat <- data.frame(x1, x2, x3, x4, x5, x6, y)

model <- glmboostLSS(y ~ ., families = NBinomialLSS(), data = dat,
                     control = boost_control(mstop = 3), method = "inner")
cvr2 <- cvrisk(model, grid = 1:100)

mstop(cvr2)  #  a scalar
mstop(model) <- mstop(cvr2) #works
## but
mstop(model) # a vector
## and
mstop(model) <- c(mu = 10, sigma = 20) # breaks
mstop(model) ## now mstop = 10...

Does the current behavior make sense? I don't think so.

I do see why it is interesting to know how many iterations were fitted for mu and sigma but at the same time, we cannot reuse this information. We only care about the total number of steps.

  • So it would be preferable to primarily show the single mstop value.
  • Furthermore, we need to check all documentation where we state that mstop can be assigned as a named vector etc. That must be consistent and correct.

risk: Problem when models are decreased and increased afterwards

## Generate some data
set.seed(1907)
x1 <- rnorm(1000)
x2 <- rnorm(1000)
x3 <- rnorm(1000)
x4 <- rnorm(1000)
x5 <- rnorm(1000)
x6 <- rnorm(1000)
mu    <- exp(1.5 +1 * x1 +0.5 * x2 -0.5 * x3 -1 * x4)
sigma <- exp(-0.4 * x3 -0.2 * x4 +0.2 * x5 +0.4 * x6)
y <- numeric(1000)
for( i in 1:1000)
    y[i] <- rnbinom(1, size = sigma[i], mu = mu[i])
dat <- data.frame(x1, x2, x3, x4, x5, x6, y)

## Fit model
model <- glmboostLSS(y ~ ., families = NBinomialLSS(), data = dat,
                      control = boost_control(mstop = list(mu = 10, sigma = 20), trace = TRUE),
                      center = TRUE)
risk(model)  ## OK
lapply(risk(model), length)  ## OK

## Increase mstop
mstop(model) <- c(20, 30)
risk(model) ## Missing value introduced
lapply(risk(model), length)  ## in principle OK


## No issue when the model does not need to be reduced first (above, it is internally reduced to mstop = c(10, 10))
model <- glmboostLSS(y ~ ., families = NBinomialLSS(), data = dat,
                      control = boost_control(mstop = 10, trace = TRUE),
                      center = TRUE)
mstop(model) <- 20
risk(model)

@ja-thomas The above error exists at least in the current implementation in the master branch. Please also check this in the devel branch (or your new branch). Altogether, we need to check all occurrences of "risk" and make sure that it works with the new mboost risk. I think it's best if we only do this in the devel branch, as the old master branch most likely will not make it to CRAN another time...

cycling variable should be in boost_control

It would be better if the variable that selects the fitting algorithm were part of the boost_control list.

Maybe the boost_control function from mboost can be modified after importing it into gamboostLSS.
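
A minimal sketch of how that could look (boost_control_lss is a hypothetical wrapper; mboost's boost_control() itself stays untouched):

boost_control_lss <- function(..., method = c("cyclic", "noncyclic")) {
    method <- match.arg(method)
    ctrl <- mboost::boost_control(...)  # reuse the mboost control list
    ctrl$method <- method               # carry the fitting method along with it
    ctrl
}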

Adapt confint.mboost

to work with gamboostLSS, ideally for both the cyclic and the non-cyclic variant.

Issue in gamboostLSS tutorial

Error: processing vignette 'gamboostLSS_Tutorial.Rnw' failed with diagnostics:
 chunk 23 

Error in rep(NA, mmo - length(rsk)) : invalid 'times' argument
Execution halted

This is most likely related to the latest changes in mboost regarding risk() (see boost-R/mboost#64 and boost-R/mboost#66).

cvrisk.nc_gamboostLSS exportation and default

cvrisk.nc_gamboostLSS is currently not exported, but obviously provides a different default grid than the one for cvrisk.gamboostLSS. The second default is also documented in ?cvrisk.gamboostLSS / ?cvrisk.nc_gamboostLSS.

I was a little confused by that; maybe it would already help to export cvrisk.nc_gamboostLSS to make the differences more visible.

Best,
Almond

Offset values are written in all distribution parameter models

In mboostLSS_fit() the offset values of all distribution parameters (mu, sigma, ...) are written into each distribution parameter model.

E.g., the fit object of the mu parameter contains a variable "mu" holding the initial offset value, although it is only needed in the models of the other distribution parameters. The value is not changed during the iterations of the algorithm.
The same problem applies to the other parameters.

summary method

In my opinion it would be nice to have a summary function for mboostLSS objects. Below is code for a summary function (model taken from the help page).

library(gamboostLSS)

### Data generating process
set.seed(1907)
x1 <- rnorm(1000)
x2 <- rnorm(1000)
x3 <- rnorm(1000)
x4 <- rnorm(1000)
x5 <- rnorm(1000)
x6 <- rnorm(1000)
mu    <- exp(1.5 +1 * x1 +0.5 * x2 -0.5 * x3 -1 * x4)
sigma <- exp(-0.4 * x3 -0.2 * x4 +0.2 * x5 +0.4 * x6)
y <- numeric(1000)
for( i in 1:1000)
  y[i] <- rnbinom(1, size = sigma[i], mu = mu[i])
dat <- data.frame(x1, x2, x3, x4, x5, x6, y)

### linear model with y ~ . for both components: 400 boosting iterations
model <- glmboostLSS(y ~ ., families = NBinomialLSS(), data = dat,
                     control = boost_control(mstop = 400),
                     center = TRUE)


### summary just tells you that the model is a list
summary(model)


### summary function based on print.mboostLSS() and summary.mboost()
summary.mboostLSS <- function(object, ...) {

  ####cl <- match.call()
  cat("\n")
  cat("\t LSS Models fitted via Model-based Boosting\n")
  cat("\n")
  if (!is.null(attr(object, "call"))) 
    cat("Call:\n", deparse(attr(object, "call")), "\n\n", sep = "")
  cat("Number of boosting iterations: mstop =", mstop(object), "\n")
  cat("Step size: ", sapply(1:length(object), function(i) object[[i]]$control$nu), "\n\n")

  cat("Families:\n")
  lapply(object, function(xi){

    show(xi$family)

    nm <- variable.names(xi)
    selprob <- tabulate(selected(xi), nbins = length(nm)) / length(selected(xi))
    names(selprob) <- names(nm)
    selprob <- sort(selprob, decreasing = TRUE)
    selprob <- selprob[selprob > 0] 
    show(selprob) 

  } )
  invisible(object)
}


### summary of the model 
summary(model)

Fix `cvrisk.nc_mboostLSS`

  • Fix trace = TRUE: Currently doesn't work. Use output function from cvrisk.mboost if possible
  • cvrisk.nc_mboostLSS needs to be documented as grid is defined differently.
  • The plot function needs to be documented: but how? We currently use plot.cvrisk from mboost. Perhaps use something à la:
plot.nc_cvrisk <- function(...) {
  plot.cvrisk(...)
} 

Change links in as.families()

It should be possible to use different link functions in families that we import from gamlss via as.families(), currently we are just using the default.

The link function is crucial for ngradient: gamlss uses the derivative d l / d mu (in the case of the first distribution parameter), but we need d l / d eta_mu. By the chain rule we therefore also need d mu / d eta_mu, which depends on the link.
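
For a log link, for example, the chain rule gives d l / d eta_mu = (d l / d mu) * (d mu / d eta_mu), and since mu = exp(eta_mu) we have d mu / d eta_mu = mu. A sketch (ngradient_log_link is hypothetical, and dldm may need further arguments depending on the family):

ngradient_log_link <- function(y, f, FAM) {
    mu <- exp(f)                    # inverse link: mu = exp(eta_mu)
    FAM$dldm(y = y, mu = mu) * mu   # dl/dmu times dmu/deta_mu
}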

Problem fitting a blackboost(LSS) model to toy data

I'm currently playing around with a toy example to get the blackboostLSS function working. That is, I set up a blackboost model (without LSS) which works fast and produces the expected results. However, when I use blackboostLSS the results are vastly different although I tried to keep all parameters equal.

Here is my code:

library(gamboostLSS)
library(partykit)

# Init boost control params
boost_ctrl <- boost_control(trace = TRUE)

# Init tree control params
tree_ctrl <- ctree_control()

# Fit blackboost model
bb <- blackboost(dist ~ speed, 
                      data = cars,
                      control = boost_ctrl,
                      tree_controls = tree_ctrl)

### plot fit
plot(dist ~ speed, data = cars, main = "Blackboost Fit")
lines(cars$speed, predict(bb), col = "red")

# Fit blackboostLSS model
bb_lss <- blackboostLSS(formula = list(mu = dist ~ speed,
                                           sigma = dist ~ speed),
                            data = cars,
                            method = "cyclic",
                            control = boost_ctrl,
                            tree_controls = tree_ctrl)

# plot fit
plot(dist ~ speed, data = cars, main = "BlackboostLSS fit")
lines(cars$speed, predict(bb_lss)$mu, col = "red")

[Figure: blackboost fit]

[Figure: blackboostLSS fit]

Essentially, I would expect the first and second plots to look very similar, which is clearly not the case.

Any hint or advice would be highly appreciated.
