hofnerb / stabs Goto Github PK
View Code? Open in Web Editor NEWStability Selection with Error Control
Home Page: https://cran.r-project.org/package=stabs
Stability Selection with Error Control
Home Page: https://cran.r-project.org/package=stabs
Hi there,
Can I use the function 'stabsel' for multinomial classification after factorizing the class?
If so, how does this compare to 'RandomizedLogisticRegression' from sklearn?
Thank you!
see #21 for an example.
This reduces the impact of errors in single folds and might speed up run time.
Hi,
on line 129 of fitfuns.R, it looks like glmnet.lasso_maxCoef() is not sorting the coefficients in the right order. The code is:
selected <- order(coef(fit)[-1])[1:q]
Here is a reproducible example that would capture lines 126 and 129:
library(glmnet)
library(stabs)
set.seed(36)
data(QuickStartExample)
x <- QuickStartExample$x
y <- QuickStartExample$y
lambda.1se <- cv.glmnet(x, y)$lambda.1se # 0.1322761
fit <- glmnet(x, y, lambda = lambda.1se)
fit$df # 8
## coefficients minus the intercept
coef_vec <- coef(fit)[-1]
coef_vec
# 1.3019591 0.0000000 0.6422423 0.0000000 -0.7892393 0.4944803 0.0000000 0.2943201 0.0000000 0.0000000 0.1058430
# 0.0000000 0.0000000 -1.0402308 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 -0.9791165
## line 129 :: order(coef(fit)[-1])[1:q]
i1 <- order(coef_vec); i1
# [1] 14 20 5 2 4 7 9 10 12 13 15 16 17 18 19 11 8 6 3 1
q <- 9; coef_vec[i1[1:q]]
# -1.0402308 -0.9791165 -0.7892393 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
I would have expected the sorting to be done on the absolute value of the coefficients. Assuming this is the intent, two possible solutions are:
i2 <- order(abs(coef_vec), decreasing = TRUE); i2
# [1] 1 14 20 5 3 6 8 11 2 4 7 9 10 12 13 15 16 17 18 19
coef_vec[i2[1:q]]
# 1.3019591 -1.0402308 -0.9791165 -0.7892393 0.6422423 0.4944803 0.2943201 0.1058430 0.0000000
or
selected <- predict(fit, type = "nonzero")
selected <- selected[[length(selected)]] ## length is 1 when lambda is a scalar
selected
# 1 3 5 6 8 11 14 20
I believe the later solution is used in glmnet.lasso()
on lines 31 and 32 of fitfuns.R
Thanks
Peter
## verify that the same coefficients are chosen
all(selected %in% i2[1:fit$df])
# TRUE
I used stab package for high dimensional gene type of data, which, regardless how I tried, (lars.lasso would not work on input, but glmnet.lasso works just fine), it shows that ,everytime I run, 'no variables selected '. for example,
stab_lasso<-stabsel(x =train_dat, y =tr_target, fitfun =glmnet.lasso, cutoff =0.75,PFER =0.2)
my data had extremely high dimensionality 6248 columns and low observations with 224 rows.
If I use glmnet.lasso_maxCoef, the algorithm would just return first 45 genes as being selected, this is clearly not right.
What am I doing wrong? has anyone, except the low dimensional data, tried on real world high dimensional data?
library("stabs")
library("lars")
x <- matrix(rnorm(100*20),100,20)
beta <- c(rep(1, 5), rep(0, 15))
y <- x %*% beta + rnorm(100)
stabs <- stabsel(x, y, PFER = 1, q = 19, fitfun = lars.lasso, assumption = "unimodal")
## Error in if (cutoff <= 3/4) { : missing value where TRUE/FALSE needed
stabs <- stabsel(x, y, PFER = 1, q = 19, fitfun = lars.lasso, assumption = "r-concave")
## Error in minD(q, p, cutoff[i], B) : ‘q’ must be <= 9
In the latter case just set call. = FALSE
.
Hello hofnerb,
If I do not fix my lambda, the algorithm would not select any of the variables in the high dimensional data that I have. here is the code and the outcome:
stab_lasso<-stabsel(x =train_dat, y =tr_target, fitfun =glmnet.lasso, cutoff =0.75,PFER =1)
Stability Selection with unimodality assumption
No variables selected
Hello,
I'am analysing sequences data for my master thesis. Doing so I'am intenting to use feature selection to find the most important features of my sequences to predict my outcome (salary at 30 years old).
When I run the following code in Rstudio I get an error:
Reglin_control_features_sala <- lm(t9salafteall_comp ~ Agglo + Sex + Migr_gen + Migr_reg + Langue_maison_dummy + Reg_ling, data=alldata_formation_salaires_NoNAs, weights = basewt)
set.seed(999)
y <- residuals(Reglin_control_features, type = "deviance")
x <- Features_12mois_SD
prop_stabsel <- stabs::stabsel( x= x , y = y, fitfun = glmnet.lasso, PFER=0.5, cutoff= 0.6,
B=1000)
=> "Error in res[[1]] : subscript out of bounds"
I tried different values for the parmeters PFER, cutoff and B, always with the same result, my professor, Matthias Studer, does not know where this error come from, that is why I am asking you directly.
Thanks a lot for your help,
Regards,
Leonhard.
Especially for genomics applications, it would be nice to also have FDR procedure in the implementation.
Is a variable considered to be selected if it is selected at some point and removed again later on? See #5
library("stabs")
library("lars")
x <- matrix(rnorm(100*20),100,20)
beta <- c(rep(1, 5), rep(0, 15))
y <- x %*% beta + rnorm(100)
stabs <- stabsel(x, y, PFER = 1, q = 19, fitfun = lars.lasso, assumption = "none")
## Error in ret[selected] <- TRUE :
## only 0's may be mixed with negative subscripts
(error reported by Markus Scheibengraber)
I have a quick question:
I used stabsel in combination with glmnet.lasso and set q to be 1.
However, the results show that more than one variable has been selected - how is that possible?
stabsel(x = data[predict], y = data$freq_calls, fitfun=glmnet.lasso, args.fitfun=list(family="poisson"), PFER = 0.2, q=1, sampling.type="SS")
Stability Selection with unimodality assumption
Selected variables:
A E
2 6
Selection probabilities:
O N. Ag. Ed I G
0.00 0.01 0.01 0.02 0.09 0.10
C E A
0.19 0.66 0.67
---
Cutoff: 0.65; q: 1; PFER (*): 0.192
(*) or expected number of low selection probability variables
PFER (specified upper bound): 0.2
PFER corresponds to signif. level 0.0213 (without multiplicity adjustment)
Thank you already.
as stabsel is not solely applicable for boosting
If a matrix is used instead of a data frame, y-labs are wrong:
library("stabs")
library("lars")
x <- matrix(rnorm(100*20),100,20)
beta <- c(rep(1, 5), rep(0, 15))
y <- x %*% beta + rnorm(100)
stabs <- stabsel(x, y, PFER = 1, q = 5, fitfun = lars.lasso, assumption = "none")
plot(stabs)
## with data frame everything is fine:
x <- as.data.frame(x)
plot(stabsel(x, y,PFER=1,q = 5,fitfun=lars.lasso,assumption="none"))
using citation("stabs") gives
Benjamin Hofner and Torsten Hothorn (2015). stabs: Stability Selection with Error Control, R package version R package version 0.5-1, http://CRAN.R-project.org/package=stabs.
Benjamin Hofner, Luigi Boccuto and Markus Goeker (2014). Controlling false discoveries in high-dimensional situations: Boosting with stability selection. arXiv:1411.1285. http://arxiv.org/abs/1411.1285.
which doubly contains "R package version" in the reference to the manual and does not mention the more recent version of the paper at e.g. http://www.ncbi.nlm.nih.gov/pubmed/25943565
Dear Hofner,
I have many large matrices (1000 obs * 15000 vars) for lasso and variable selection. To speed up, I think it would be much faster to run stabsel() on data with variables whose lasso coefficients >0 (columns of x matrix with zero coefficients are removed).
Is this reasonable? Running stabsel() on full x matrix will return pvalue for all variables in x, while running it on reduced x matrix is much faster. The order of resulted pvalue for variables seem to be consistent using x or reduced x, but values are different.
Here is my code:
library(glmnet)
library(stabs)
#data
set.seed(1234)
data("bodyfat", package = "TH.data")
x = as.matrix(bodyfat[, -2])
y = bodyfat[,2]
#use cv.glmnet to get best lambda
cv.fit=cv.glmnet(x,y, alpha = 1)
cv.fit$lambda.min
res = glmnet( x = x, y = y, family = "gaussian", alpha = 1, lambda =cv.fit$lambda.min )
#xuse: x with lasso coefficient>0
nonZero = which(res$beta != 0)
xuse = as.matrix(x[, nonZero])
#stabsel on xuse (zero coefficient columns removed)
set.seed(1234)
stab.lasso1 <- stabsel(x = xuse, y = y, fitfun = glmnet.lasso,args.fitfun=list('family'="gaussian",'lambda'=cv.fit$lambda.min), cutoff = 0.75, PFER = 1)
#stabsel on x
set.seed(1234)
stab.lasso2 <- stabsel(x = x, y = y, fitfun = glmnet.lasso,args.fitfun=list('family'="gaussian",'lambda'=cv.fit$lambda.min), cutoff = 0.75, PFER = 1)
stab.lasso1$phat
stab.lasso2$phat
stab.lasso2$phat[nonZero]
output:
stab.lasso1$phat
s0
waistcirc 1.00
hipcirc 1.00
kneebreadth 0.97
anthro3a 0.68
anthro3b 0.92
anthro3c 0.57
stab.lasso2$phat
s0
age 0.52
waistcirc 0.99
hipcirc 1.00
elbowbreadth 0.45
kneebreadth 0.94
anthro3a 0.57
anthro3b 0.85
anthro3c 0.56
anthro4 0.17
stab.lasso1$selected
waistcirc hipcirc kneebreadth anthro3b
1 2 3 5
stab.lasso2$selected
waistcirc hipcirc kneebreadth anthro3b
2 3 5 7
Running stabsel() on x or reduced x (xuse) have the same selected variables. But is there any potential problem to run stabsel() on reduced x?
Thank you!
Thanks for developing and maintaining the package. I was wondering if you have any recommendations for using parLapply
as a parallel backend. I realized sometimes mclapply raises error. For example, this error:
Error in `.check_ncores(cores)`: 3 simultaneous processes spawned
The error is hard to reproduce. Sometimes it happens sometimes it does not.
As §1.1.3.1 of the manual told you, packages in Suggests: should be used conditionally. The latest version of igraph will not install on Solaris (hence packages strictly depending on it), and as you can see from its CRAN checks page, your package now fails its check.
Please correct ASAP and definitely within a month.
Hi,
before I get into this issue report, I wanted to thank you for writing such a useful package :)
I have been using stabsel() with glmnet() and noticed the following on line 126 of glmnet.lasso_maxCoef:
fit <- glmnet::glmnet(x, y, ...)
glmnet.lasso_maxCoef()
does not seem to be using the penalty parameter lambda=
that is passed to it.
I was expecting line 126 to look like:
fit <- glmnet::glmnet(x, y, lambda=lambda, ...)
Assuming the lambda value passes the check on line 120, is it being captured by the ...
? If not this looks like a bug.
Thanks
Peter
PS: ?glmnet
suggests using predict.glmet()
for single lambda values.
I have a script that use glmnet to fit lasso logistic models. I would like to maintain the script as is and implement ot top of that the stabsel, I do not want to change the script to use lars instead of gmlnet.
Can you please add an example of stabsel applyed to glmnet?
See also
http://stackoverflow.com/questions/34063292/example-of-stabsel-with-glmnet
Is it possible to fetch the coefficients of the variables from each run? It is great to know the stable associations and would be even better to know the direction and degree of relationship between response and variables.
Hi,
Not sure if this is the forum you'd like to use for queries - let me know if it isn't.
I'm exploring approaches using the JGL package, specifically the fused group lasso. I'm likely to be working with two groups. I have the mechanisms in place to compute the two lambda values. The difference in partial correlation coefficient for corresponding graph edges is of interest. I have explored bootstrapping approaches to characterising this, but a stability selection approach looks interesting.
I'm unsure of how to use the q parameter in this setting. Do you have examples for glasso-like cases? I also need to be careful about how the resampling occurs within groups.
Thanks
If an error occures in a single fold, continue execution of stabsel
.
For the evaluation discard this fold and issue a warning
.
assumption = "unimodal"
it must hold:cutoff >= 1/2 + min(theta^2, 1 / (2*B) + 3/4 * theta^2)
assumption = "r-concave"
:maxQ
To make the new dependency checks on CRAN happy go along the following recipe (as suggested by Kurt Hornik).
1.) Copy the functions listed after Undefined global functions or variables:
in the output from R CMD check
in a variable txt
:
txt <- "abline axis hcl matplot par plot points tail"
2.) With the function
imports_for_undefined_globals <- function(txt, lst, selective = TRUE) {
if(!missing(txt))
lst <- scan(what = character(), text = txt, quiet = TRUE)
nms <- lapply(lst, find)
ind <- sapply(nms, length) > 0L
imp <- split(lst[ind], substring(unlist(nms[ind]), 9L))
if(selective) {
sprintf("importFrom(%s)",
vapply(Map(c, names(imp), imp),
function(e)
paste0("\"", e, "\"", collapse = ", "),
""))
} else {
sprintf("import(\"%s\")", names(imp))
}
}
one can obtain the imports via
writeLines(imports_for_undefined_globals(txt))
3.) Copy and paste the output to NAMESPACE
.
4.) Add the packages to Imports
in DESCRIPTION
.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.