hofnerb / stabs Goto Github PK

View Code? Open in Web Editor NEW

25.0 5.0 9.0 840 KB

Stability Selection with Error Control

Home Page: https://cran.r-project.org/package=stabs

R 99.07% Shell 0.93%

stability-selection cran r-package r-language variable-selection variable-importance machine-learning resampling

stabs's People

Contributors

Stargazers

Watchers

Forkers

richardbeare springdance andreyto gokceneraslan helshi nguyentr17 guhjy iscoul marastadler

stabs's Issues

Randomized Logistic Regression

Hi there,

Can I use the function 'stabsel' for multinomial classification after factorizing the class?

If so, how does this compare to 'RandomizedLogisticRegression' from sklearn?

Thank you!

Disallow specification of penalty parameters via args.fitfun

see #21 for an example.

Use `mc.preschedule = FALSE` per default with `mclapply`

This reduces the impact of errors in single folds and might speed up run time.

glmnet.lasso_maxCoef(): incorrect coefficient sorting? [bug]

Hi,
on line 129 of fitfuns.R, it looks like glmnet.lasso_maxCoef() is not sorting the coefficients in the right order. The code is:
selected <- order(coef(fit)[-1])[1:q]

Here is a reproducible example that would capture lines 126 and 129:

library(glmnet)
library(stabs)
set.seed(36)

data(QuickStartExample)
x <- QuickStartExample$x
y <- QuickStartExample$y

lambda.1se <- cv.glmnet(x, y)$lambda.1se # 0.1322761

fit <- glmnet(x, y, lambda = lambda.1se)
fit$df # 8

## coefficients minus the intercept
coef_vec <- coef(fit)[-1] 
coef_vec
# 1.3019591  0.0000000  0.6422423  0.0000000 -0.7892393  0.4944803  0.0000000  0.2943201  0.0000000  0.0000000  0.1058430
# 0.0000000  0.0000000 -1.0402308  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000 -0.9791165

##  line 129 :: order(coef(fit)[-1])[1:q]
i1 <- order(coef_vec); i1
# [1] 14 20  5  2  4  7  9 10 12 13 15 16 17 18 19 11  8  6  3  1
q <- 9; coef_vec[i1[1:q]]
# -1.0402308 -0.9791165 -0.7892393  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000

I would have expected the sorting to be done on the absolute value of the coefficients. Assuming this is the intent, two possible solutions are:

i2 <- order(abs(coef_vec), decreasing = TRUE); i2
# [1]  1 14 20  5  3  6  8 11  2  4  7  9 10 12 13 15 16 17 18 19
coef_vec[i2[1:q]]
# 1.3019591 -1.0402308 -0.9791165 -0.7892393  0.6422423  0.4944803  0.2943201  0.1058430  0.0000000

selected <- predict(fit, type = "nonzero")
selected <- selected[[length(selected)]] ## length is 1 when lambda is a scalar
selected
#  1  3  5  6  8 11 14 20

I believe the later solution is used in glmnet.lasso() on lines 31 and 32 of fitfuns.R

Thanks
Peter

## verify that the same coefficients are chosen 
all(selected %in% i2[1:fit$df])
# TRUE

No variables selected, on high dimensional data.

I used stab package for high dimensional gene type of data, which, regardless how I tried, (lars.lasso would not work on input, but glmnet.lasso works just fine), it shows that ,everytime I run, 'no variables selected '. for example,

stab_lasso<-stabsel(x =train_dat, y =tr_target, fitfun =glmnet.lasso, cutoff =0.75,PFER =0.2)

my data had extremely high dimensionality 6248 columns and low observations with 224 rows.

If I use glmnet.lasso_maxCoef, the algorithm would just return first 45 genes as being selected, this is clearly not right.

What am I doing wrong? has anyone, except the low dimensional data, tried on real world high dimensional data?

Improve error handling (2)

library("stabs")
library("lars")
x <- matrix(rnorm(100*20),100,20)
beta <- c(rep(1, 5), rep(0, 15))
y <- x %*% beta + rnorm(100)
stabs <- stabsel(x, y, PFER = 1, q = 19, fitfun = lars.lasso, assumption = "unimodal")
## Error in if (cutoff <= 3/4) { : missing value where TRUE/FALSE needed
stabs <- stabsel(x, y, PFER = 1, q = 19, fitfun = lars.lasso, assumption = "r-concave")
## Error in minD(q, p, cutoff[i], B) : ‘q’ must be <= 9

In the latter case just set call. = FALSE.

no variables are selected

Hello hofnerb,
If I do not fix my lambda, the algorithm would not select any of the variables in the high dimensional data that I have. here is the code and the outcome:

stab_lasso<-stabsel(x =train_dat, y =tr_target, fitfun =glmnet.lasso, cutoff =0.75,PFER =1)
Stability Selection with unimodality assumption

No variables selected

Stablity selection [R] stabs::stabsel "Error in res[[1]] : subscript out of bounds"

Hello,
I'am analysing sequences data for my master thesis. Doing so I'am intenting to use feature selection to find the most important features of my sequences to predict my outcome (salary at 30 years old).
When I run the following code in Rstudio I get an error:

Reglin_control_features_sala <- lm(t9salafteall_comp ~ Agglo + Sex + Migr_gen + Migr_reg + Langue_maison_dummy + Reg_ling, data=alldata_formation_salaires_NoNAs, weights = basewt)

set.seed(999)
y <- residuals(Reglin_control_features, type = "deviance")
x <- Features_12mois_SD

prop_stabsel <- stabs::stabsel( x= x , y = y, fitfun = glmnet.lasso, PFER=0.5, cutoff= 0.6,
B=1000)
=> "Error in res[[1]] : subscript out of bounds"

I tried different values for the parmeters PFER, cutoff and B, always with the same result, my professor, Matthias Studer, does not know where this error come from, that is why I am asking you directly.

Thanks a lot for your help,
Regards,

Leonhard.

Simulation-based FDR estimation

Especially for genomics applications, it would be nice to also have FDR procedure in the implementation.

Selection paths / selections

Is a variable considered to be selected if it is selected at some point and removed again later on? See #5

Fix issue when variables are dropped from active set in lars.lasso

library("stabs")
library("lars")
x <- matrix(rnorm(100*20),100,20)
beta <- c(rep(1, 5), rep(0, 15))
y <- x %*% beta + rnorm(100)
stabs <- stabsel(x, y, PFER = 1, q = 19, fitfun = lars.lasso, assumption = "none")
## Error in ret[selected] <- TRUE :
##    only 0's may be mixed with negative subscripts

(error reported by Markus Scheibengraber)

q = 1 -> more than one variable selected

I have a quick question:
I used stabsel in combination with glmnet.lasso and set q to be 1.
However, the results show that more than one variable has been selected - how is that possible?

stabsel(x = data[predict], y = data$freq_calls, fitfun=glmnet.lasso, args.fitfun=list(family="poisson"), PFER = 0.2, q=1, sampling.type="SS")

Stability Selection with unimodality assumption

Selected variables:
             A                E 
               2                6 

Selection probabilities:
  O                        N.                          Ag.                            Ed                 I                            G   
0.00                    0.01                    0.01                    0.02                    0.09                    0.10
C                          E                           A
0.19                    0.66                    0.67
                    

---
Cutoff: 0.65; q: 1; PFER (*):  0.192 
   (*) or expected number of low selection probability variables
PFER (specified upper bound):  0.2 
PFER corresponds to signif. level 0.0213 (without multiplicity adjustment)

Thank you already.

"base-learner" needs to be "variable" or "effects"

as stabsel is not solely applicable for boosting

Fix plot

If a matrix is used instead of a data frame, y-labs are wrong:

library("stabs")
library("lars")
x <- matrix(rnorm(100*20),100,20)
beta <- c(rep(1, 5), rep(0, 15))
y <- x %*% beta + rnorm(100)
stabs <- stabsel(x, y, PFER = 1, q = 5, fitfun = lars.lasso, assumption = "none")
plot(stabs)

## with data frame everything is fine:
x <- as.data.frame(x)
plot(stabsel(x, y,PFER=1,q = 5,fitfun=lars.lasso,assumption="none"))

citation("stabs")

using citation("stabs") gives

Benjamin Hofner and Torsten Hothorn (2015). stabs: Stability Selection with Error Control, R package version R package version 0.5-1, http://CRAN.R-project.org/package=stabs.

Benjamin Hofner, Luigi Boccuto and Markus Goeker (2014). Controlling false discoveries in high-dimensional situations: Boosting with stability selection. arXiv:1411.1285. http://arxiv.org/abs/1411.1285.

which doubly contains "R package version" in the reference to the manual and does not mention the more recent version of the paper at e.g. http://www.ncbi.nlm.nih.gov/pubmed/25943565

Can I use stabsel on non-zero matrix after lasso?

Dear Hofner,

I have many large matrices (1000 obs * 15000 vars) for lasso and variable selection. To speed up, I think it would be much faster to run stabsel() on data with variables whose lasso coefficients >0 (columns of x matrix with zero coefficients are removed).

Is this reasonable? Running stabsel() on full x matrix will return pvalue for all variables in x, while running it on reduced x matrix is much faster. The order of resulted pvalue for variables seem to be consistent using x or reduced x, but values are different.

Here is my code:

library(glmnet)
library(stabs)

#data
set.seed(1234)
data("bodyfat", package = "TH.data")
x = as.matrix(bodyfat[, -2])
y = bodyfat[,2]

#use cv.glmnet to get best lambda
cv.fit=cv.glmnet(x,y, alpha = 1)
cv.fit$lambda.min
res = glmnet( x = x, y = y, family = "gaussian", alpha = 1, lambda =cv.fit$lambda.min )


#xuse: x with lasso coefficient>0 
nonZero = which(res$beta != 0)
xuse = as.matrix(x[, nonZero]) 

#stabsel on xuse (zero coefficient columns removed)
set.seed(1234)
stab.lasso1 <- stabsel(x = xuse, y = y, fitfun = glmnet.lasso,args.fitfun=list('family'="gaussian",'lambda'=cv.fit$lambda.min), cutoff = 0.75, PFER = 1) 

#stabsel on x
set.seed(1234)
stab.lasso2 <- stabsel(x = x, y = y, fitfun = glmnet.lasso,args.fitfun=list('family'="gaussian",'lambda'=cv.fit$lambda.min), cutoff = 0.75, PFER = 1) 

stab.lasso1$phat
stab.lasso2$phat
stab.lasso2$phat[nonZero]

output:

stab.lasso1$phat
s0
waistcirc 1.00
hipcirc 1.00
kneebreadth 0.97
anthro3a 0.68
anthro3b 0.92
anthro3c 0.57

stab.lasso2$phat
s0
age 0.52
waistcirc 0.99
hipcirc 1.00
elbowbreadth 0.45
kneebreadth 0.94
anthro3a 0.57
anthro3b 0.85
anthro3c 0.56
anthro4 0.17

stab.lasso1$selected
waistcirc hipcirc kneebreadth anthro3b
1 2 3 5
stab.lasso2$selected
waistcirc hipcirc kneebreadth anthro3b
2 3 5 7

Running stabsel() on x or reduced x (xuse) have the same selected variables. But is there any potential problem to run stabsel() on reduced x?

Thank you!

Documentation for parLapply or sequential runs

Thanks for developing and maintaining the package. I was wondering if you have any recommendations for using parLapply as a parallel backend. I realized sometimes mclapply raises error. For example, this error:

Error in `.check_ncores(cores)`: 3 simultaneous processes spawned

The error is hard to reproduce. Sometimes it happens sometimes it does not.

CRAN packages not using Suggests: packages conditionally

As §1.1.3.1 of the manual told you, packages in Suggests: should be used conditionally. The latest version of igraph will not install on Solaris (hence packages strictly depending on it), and as you can see from its CRAN checks page, your package now fails its check.

Please correct ASAP and definitely within a month.

glmnet.lasso_maxCoef() is missing a lambda argument? [bug]

Hi,
before I get into this issue report, I wanted to thank you for writing such a useful package :)

I have been using stabsel() with glmnet() and noticed the following on line 126 of glmnet.lasso_maxCoef:
fit <- glmnet::glmnet(x, y, ...)

glmnet.lasso_maxCoef() does not seem to be using the penalty parameter lambda= that is passed to it.

I was expecting line 126 to look like:

fit <- glmnet::glmnet(x, y, lambda=lambda, ...)

Assuming the lambda value passes the check on line 120, is it being captured by the ...? If not this looks like a bug.

Thanks
Peter

PS: ?glmnet suggests using predict.glmet() for single lambda values.

pelase add an example that use glmnet insead of lars

I have a script that use glmnet to fit lasso logistic models. I would like to maintain the script as is and implement ot top of that the stabsel, I do not want to change the script to use lars instead of gmlnet.
Can you please add an example of stabsel applyed to glmnet?

coefficients from each run

Is it possible to fetch the coefficients of the variables from each run? It is great to know the stable associations and would be even better to know the direction and degree of relationship between response and variables.

Make vignette

Add vignette based on README.md
Add vignette showing how to use stabsel with mboost and gamboostLSS (the latter perhaps based on the paper?)

Graphical models query

Hi,
Not sure if this is the forum you'd like to use for queries - let me know if it isn't.

I'm exploring approaches using the JGL package, specifically the fused group lasso. I'm likely to be working with two groups. I have the mechanisms in place to compute the two lambda values. The difference in partial correlation coefficient for corresponding graph edges is of interest. I have explored bootstrapping approaches to characterising this, but a stability selection approach looks interesting.

I'm unsure of how to use the q parameter in this setting. Do you have examples for glasso-like cases? I also need to be careful about how the resampling occurs within groups.

Thanks

Continue stability selection after errors

If an error occures in a single fold, continue execution of stabsel.
For the evaluation discard this fold and issue a warning.

Provide functions to compute max(q) and/or range(cutoff)

For assumption = "unimodal" it must hold:

cutoff >= 1/2 + min(theta^2, 1 / (2*B) + 3/4 * theta^2)

For assumption = "r-concave":
Export an interface to maxQ

Make new dependency checks happy

To make the new dependency checks on CRAN happy go along the following recipe (as suggested by Kurt Hornik).

1.) Copy the functions listed after Undefined global functions or variables: in the output from R CMD check in a variable txt:

txt <- "abline axis hcl matplot par plot points tail"

2.) With the function

imports_for_undefined_globals <- function(txt, lst, selective = TRUE) {
    if(!missing(txt))
        lst <- scan(what = character(), text = txt, quiet = TRUE)
    nms <- lapply(lst, find)
    ind <- sapply(nms, length) > 0L
    imp <- split(lst[ind], substring(unlist(nms[ind]), 9L))
    if(selective) {
        sprintf("importFrom(%s)",
                vapply(Map(c, names(imp), imp),
                       function(e)
                           paste0("\"", e, "\"", collapse = ", "),
                       ""))
    } else {
        sprintf("import(\"%s\")", names(imp))
    }
}

one can obtain the imports via

writeLines(imports_for_undefined_globals(txt))

3.) Copy and paste the output to NAMESPACE.
4.) Add the packages to Imports in DESCRIPTION.