
stabs's People

Contributors

gokceneraslan, hofnerb, richardbeare


stabs's Issues

Randomized Logistic Regression

Hi there,

Can I use the function 'stabsel' for multinomial classification after encoding the class labels as a factor?

If so, how does this compare to 'RandomizedLogisticRegression' from sklearn?

Thank you!
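A minimal sketch for the two-class case (my example, not from the package authors): the glmnet family is forwarded through args.fitfun, as in the poisson example further down this page. Whether glmnet.lasso also handles the multi-class coefficient structure of family = "multinomial" would need separate testing.

library("stabs")
library("glmnet")
set.seed(1)
x <- matrix(rnorm(100 * 20), 100, 20)
colnames(x) <- paste0("V", 1:20)
y <- factor(rbinom(100, 1, plogis(x[, 1] - x[, 2])))
## family is forwarded to glmnet() via args.fitfun
stabsel(x = x, y = y, fitfun = glmnet.lasso,
        args.fitfun = list(family = "binomial"),
        cutoff = 0.75, PFER = 1)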

glmnet.lasso_maxCoef(): incorrect coefficient sorting? [bug]

Hi,
on line 129 of fitfuns.R, it looks like glmnet.lasso_maxCoef() is not sorting the coefficients in the right order. The code is:
selected <- order(coef(fit)[-1])[1:q]

Here is a reproducible example that would capture lines 126 and 129:

library(glmnet)
library(stabs)
set.seed(36)

data(QuickStartExample)
x <- QuickStartExample$x
y <- QuickStartExample$y

lambda.1se <- cv.glmnet(x, y)$lambda.1se # 0.1322761

fit <- glmnet(x, y, lambda = lambda.1se)
fit$df # 8

## coefficients minus the intercept
coef_vec <- coef(fit)[-1] 
coef_vec
# 1.3019591  0.0000000  0.6422423  0.0000000 -0.7892393  0.4944803  0.0000000  0.2943201  0.0000000  0.0000000  0.1058430
# 0.0000000  0.0000000 -1.0402308  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000 -0.9791165

##  line 129 :: order(coef(fit)[-1])[1:q]
i1 <- order(coef_vec); i1
# [1] 14 20  5  2  4  7  9 10 12 13 15 16 17 18 19 11  8  6  3  1
q <- 9; coef_vec[i1[1:q]]
# -1.0402308 -0.9791165 -0.7892393  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000  0.0000000

I would have expected the sorting to be done on the absolute value of the coefficients. Assuming this is the intent, two possible solutions are:

i2 <- order(abs(coef_vec), decreasing = TRUE); i2
# [1]  1 14 20  5  3  6  8 11  2  4  7  9 10 12 13 15 16 17 18 19
coef_vec[i2[1:q]]
# 1.3019591 -1.0402308 -0.9791165 -0.7892393  0.6422423  0.4944803  0.2943201  0.1058430  0.0000000

or

selected <- predict(fit, type = "nonzero")
selected <- selected[[length(selected)]] ## length is 1 when lambda is a scalar
selected
#  1  3  5  6  8 11 14 20

I believe the latter solution is used in glmnet.lasso() on lines 31 and 32 of fitfuns.R.

## verify that the same coefficients are chosen 
all(selected %in% i2[1:fit$df])
# TRUE

Thanks
Peter

No variables selected on high-dimensional data

I used the stabs package on high-dimensional gene expression data. Regardless of what I tried (lars.lasso would not work on my input, but glmnet.lasso works just fine), every run reports 'no variables selected'. For example:

stab_lasso <- stabsel(x = train_dat, y = tr_target, fitfun = glmnet.lasso, cutoff = 0.75, PFER = 0.2)

My data is extremely high-dimensional: 6248 columns with only 224 rows of observations.

If I use glmnet.lasso_maxCoef, the algorithm simply returns the first 45 genes as selected, which is clearly not right.

What am I doing wrong? Has anyone tried this on real-world high-dimensional data, rather than only on low-dimensional examples?
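One way to see what these settings imply (a sketch using stabsel_parameters() from stabs): with cutoff and PFER fixed, stabsel() derives q, the number of variables selected on each subsample, and with p = 6248 and a strict PFER the derived q can be so small that no variable accumulates enough selections to clear the cutoff.

library("stabs")
## cutoff and PFER fixed => q is derived; inspect it before running stabsel()
stabsel_parameters(p = 6248, cutoff = 0.75, PFER = 0.2, sampling.type = "SS")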

Improve error handling (2)

library("stabs")
library("lars")
x <- matrix(rnorm(100*20),100,20)
beta <- c(rep(1, 5), rep(0, 15))
y <- x %*% beta + rnorm(100)
stabs <- stabsel(x, y, PFER = 1, q = 19, fitfun = lars.lasso, assumption = "unimodal")
## Error in if (cutoff <= 3/4) { : missing value where TRUE/FALSE needed
stabs <- stabsel(x, y, PFER = 1, q = 19, fitfun = lars.lasso, assumption = "r-concave")
## Error in minD(q, p, cutoff[i], B) : ‘q’ must be <= 9

In the latter case, just set call. = FALSE in the corresponding stop() call.
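A sketch of the suggested change (hypothetical surroundings; the real check sits in the minD()/stabsel internals):

q <- 19  # from the example above
if (q > 9)
    stop(sQuote("q"), " must be <= 9", call. = FALSE)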

no variables are selected

Hello hofnerb,
If I do not fix my lambda, the algorithm does not select any variables in my high-dimensional data. Here are the code and the outcome:

stab_lasso <- stabsel(x = train_dat, y = tr_target, fitfun = glmnet.lasso, cutoff = 0.75, PFER = 1)
Stability Selection with unimodality assumption

No variables selected
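A sketch of fixing lambda through args.fitfun (mirroring the cv.glmnet-based workflow in the "Can I use stabsel on non-zero matrix after lasso?" issue below; train_dat and tr_target as above):

library("glmnet")
cv.fit <- cv.glmnet(train_dat, tr_target)
## pass a fixed penalty through to glmnet() inside the fitfun
stab_lasso <- stabsel(x = train_dat, y = tr_target, fitfun = glmnet.lasso,
                      args.fitfun = list(lambda = cv.fit$lambda.min),
                      cutoff = 0.75, PFER = 1)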

Stability selection [R] stabs::stabsel "Error in res[[1]] : subscript out of bounds"

Hello,
I'm analysing sequence data for my master's thesis. I intend to use feature selection to find the features of my sequences that best predict my outcome (salary at age 30).
When I run the following code in RStudio, I get an error:

Reglin_control_features_sala <- lm(t9salafteall_comp ~ Agglo + Sex + Migr_gen + Migr_reg + Langue_maison_dummy + Reg_ling, data = alldata_formation_salaires_NoNAs, weights = basewt)

set.seed(999)
y <- residuals(Reglin_control_features_sala, type = "deviance")
x <- Features_12mois_SD

prop_stabsel <- stabs::stabsel(x = x, y = y, fitfun = glmnet.lasso, PFER = 0.5, cutoff = 0.6, B = 1000)
=> "Error in res[[1]] : subscript out of bounds"

I tried different values for the parameters PFER, cutoff and B, always with the same result. My professor, Matthias Studer, does not know where this error comes from, which is why I am asking you directly.

Thanks a lot for your help,
Regards,

Leonhard.

Fix issue when variables are dropped from active set in lars.lasso

library("stabs")
library("lars")
x <- matrix(rnorm(100*20),100,20)
beta <- c(rep(1, 5), rep(0, 15))
y <- x %*% beta + rnorm(100)
stabs <- stabsel(x, y, PFER = 1, q = 19, fitfun = lars.lasso, assumption = "none")
## Error in ret[selected] <- TRUE :
##    only 0's may be mixed with negative subscripts

(error reported by Markus Scheibengraber)

q = 1 -> more than one variable selected

I have a quick question:
I used stabsel in combination with glmnet.lasso and set q to be 1.
However, the results show that more than one variable has been selected - how is that possible?

stabsel(x = data[predict], y = data$freq_calls, fitfun = glmnet.lasso, args.fitfun = list(family = "poisson"), PFER = 0.2, q = 1, sampling.type = "SS")

Stability Selection with unimodality assumption

Selected variables:
   A    E 
   2    6 

Selection probabilities:
   O   N.  Ag.   Ed    I    G    C    E    A 
0.00 0.01 0.01 0.02 0.09 0.10 0.19 0.66 0.67 

---
Cutoff: 0.65; q: 1; PFER (*): 0.192 
   (*) or expected number of low selection probability variables
PFER (specified upper bound): 0.2 
PFER corresponds to signif. level 0.0213 (without multiplicity adjustment)

Thank you already.
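A sketch of why this can happen (my reading of the output above, not an authoritative answer): q = 1 only bounds how many variables each subsample may pick, while the final selection is every variable whose selection probability reaches the cutoff; here both E (0.66) and A (0.67) clear the cutoff of 0.65.

## recompute the selection from the printed probabilities
phat <- c(O = 0.00, N. = 0.01, Ag. = 0.01, Ed = 0.02, I = 0.09,
          G = 0.10, C = 0.19, E = 0.66, A = 0.67)
names(phat)[phat >= 0.65]   # "E" "A": two variables despite q = 1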

Fix plot

If a matrix is used instead of a data frame, the y-axis labels are wrong:

library("stabs")
library("lars")
x <- matrix(rnorm(100*20),100,20)
beta <- c(rep(1, 5), rep(0, 15))
y <- x %*% beta + rnorm(100)
stabs <- stabsel(x, y, PFER = 1, q = 5, fitfun = lars.lasso, assumption = "none")
plot(stabs)

## with data frame everything is fine:
x <- as.data.frame(x)
plot(stabsel(x, y, PFER = 1, q = 5, fitfun = lars.lasso, assumption = "none"))
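A hedged workaround, assuming the y-axis labels are taken from column names (which matrix(rnorm(...)) lacks, while as.data.frame() adds V1, V2, ...): name the matrix columns before plotting.

x2 <- matrix(rnorm(100 * 20), 100, 20)
colnames(x2) <- paste0("V", seq_len(ncol(x2)))  # mimic the data.frame naming
y2 <- x2 %*% beta + rnorm(100)
plot(stabsel(x2, y2, PFER = 1, q = 5, fitfun = lars.lasso, assumption = "none"))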

citation("stabs")

using citation("stabs") gives

Benjamin Hofner and Torsten Hothorn (2015). stabs: Stability Selection with Error Control, R package version R package version 0.5-1, http://CRAN.R-project.org/package=stabs.

Benjamin Hofner, Luigi Boccuto and Markus Goeker (2014). Controlling false discoveries in high-dimensional situations: Boosting with stability selection. arXiv:1411.1285. http://arxiv.org/abs/1411.1285.

which contains "R package version" twice in the reference to the manual and does not mention the more recent version of the paper, e.g., at http://www.ncbi.nlm.nih.gov/pubmed/25943565
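A sketch of a cleaned inst/CITATION entry (hypothetical; the duplication presumably comes from a note field that already contains the "R package version" prefix):

bibentry(bibtype = "Manual",
         title  = "stabs: Stability Selection with Error Control",
         author = c(person("Benjamin", "Hofner"), person("Torsten", "Hothorn")),
         year   = "2015",
         note   = "R package version 0.5-1",
         url    = "http://CRAN.R-project.org/package=stabs")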

Can I use stabsel on non-zero matrix after lasso?

Dear Hofner,

I have many large matrices (1000 obs * 15000 vars) for lasso and variable selection. To speed things up, I think it would be much faster to run stabsel() only on the variables whose lasso coefficients are nonzero (columns of the x matrix with zero coefficients removed).

Is this reasonable? Running stabsel() on the full x matrix returns selection probabilities for all variables in x, while running it on the reduced x matrix is much faster. The ranking of the resulting probabilities seems consistent between x and the reduced x, but the values differ.

Here is my code:

library(glmnet)
library(stabs)

## data
set.seed(1234)
data("bodyfat", package = "TH.data")
x <- as.matrix(bodyfat[, -2])
y <- bodyfat[, 2]

## use cv.glmnet to get the best lambda
cv.fit <- cv.glmnet(x, y, alpha = 1)
cv.fit$lambda.min
res <- glmnet(x = x, y = y, family = "gaussian", alpha = 1, lambda = cv.fit$lambda.min)

## xuse: x restricted to variables with nonzero lasso coefficients
nonZero <- which(res$beta != 0)
xuse <- as.matrix(x[, nonZero])

## stabsel on xuse (zero-coefficient columns removed)
set.seed(1234)
stab.lasso1 <- stabsel(x = xuse, y = y, fitfun = glmnet.lasso,
                       args.fitfun = list(family = "gaussian", lambda = cv.fit$lambda.min),
                       cutoff = 0.75, PFER = 1)

## stabsel on the full x
set.seed(1234)
stab.lasso2 <- stabsel(x = x, y = y, fitfun = glmnet.lasso,
                       args.fitfun = list(family = "gaussian", lambda = cv.fit$lambda.min),
                       cutoff = 0.75, PFER = 1)

stab.lasso1$phat
stab.lasso2$phat
stab.lasso2$phat[nonZero]


output:

stab.lasso1$phat
              s0
waistcirc   1.00
hipcirc     1.00
kneebreadth 0.97
anthro3a    0.68
anthro3b    0.92
anthro3c    0.57

stab.lasso2$phat
               s0
age          0.52
waistcirc    0.99
hipcirc      1.00
elbowbreadth 0.45
kneebreadth  0.94
anthro3a     0.57
anthro3b     0.85
anthro3c     0.56
anthro4      0.17

stab.lasso1$selected
  waistcirc     hipcirc kneebreadth    anthro3b 
          1           2           3           5 

stab.lasso2$selected
  waistcirc     hipcirc kneebreadth    anthro3b 
          2           3           5           7 

Running stabsel() on x and on the reduced x (xuse) gives the same selected variables. But is there any potential problem with running stabsel() on the reduced x?

Thank you!

Documentation for parLapply or sequential runs

Thanks for developing and maintaining the package. I was wondering if you have any recommendations for using parLapply as a parallel backend. I have noticed that mclapply sometimes raises an error, for example:

Error in `.check_ncores(cores)`: 3 simultaneous processes spawned

The error is hard to reproduce: sometimes it happens, sometimes it does not.
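A sketch of plugging parLapply into stabsel()'s papply argument (my wrapper and cluster setup, not from the package documentation; papply expects an lapply-like function, and the wrapper deliberately drops mclapply-specific arguments):

library("parallel")
library("stabs")
cl <- makeCluster(2)
clusterEvalQ(cl, library("glmnet"))   # make fitfun dependencies available on workers
par_lapply <- function(X, FUN, ...) parLapply(cl = cl, X = X, fun = FUN)
res <- stabsel(x = x, y = y, fitfun = glmnet.lasso,
               cutoff = 0.75, PFER = 1, papply = par_lapply)
stopCluster(cl)
## for a strictly sequential run, papply = lapply avoids forking altogether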

CRAN packages not using Suggests: packages conditionally

As §1.1.3.1 of the manual told you, packages in Suggests: should be used conditionally. The latest version of igraph will not install on Solaris (and hence neither will packages strictly depending on it), and as you can see from its CRAN checks page, your package now fails its checks.

Please correct ASAP and definitely within a month.
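For reference, a minimal sketch of the conditional-use pattern the manual asks for (hypothetical placement; the idea is to guard every igraph call at run time):

if (requireNamespace("igraph", quietly = TRUE)) {
    g <- igraph::graph_from_adjacency_matrix(diag(3))
    print(g)
} else {
    warning("Package 'igraph' not available; skipping graph output.")
}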

glmnet.lasso_maxCoef() is missing a lambda argument? [bug]

Hi,
before I get into this issue report, I wanted to thank you for writing such a useful package :)

I have been using stabsel() with glmnet() and noticed the following on line 126 of glmnet.lasso_maxCoef:
fit <- glmnet::glmnet(x, y, ...)

glmnet.lasso_maxCoef() does not seem to be using the penalty parameter lambda= that is passed to it.

I was expecting line 126 to look like:

fit <- glmnet::glmnet(x, y, lambda=lambda, ...)

Assuming the lambda value passes the check on line 120, is it being captured by the ...? If not, this looks like a bug.

Thanks
Peter

PS: ?glmnet suggests using predict.glmnet() for single lambda values.

coefficients from each run

Is it possible to fetch the coefficients of the variables from each run? It is great to know the stable associations, and it would be even better to know the direction and degree of the relationship between the response and the variables.
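A sketch of one way to capture this (my wrapper, not a package feature; it refits the model once more per subsample, and it needs a sequential run such as papply = lapply because side effects do not survive forked mclapply workers):

library("stabs")
library("glmnet")
coef_log <- new.env()
coef_log$runs <- list()
glmnet.lasso_logged <- function(x, y, q, ...) {
    res <- glmnet.lasso(x, y, q = q, ...)
    fit <- glmnet(x, y, pmax = q, ...)   # refit to extract the coefficients
    coef_log$runs[[length(coef_log$runs) + 1L]] <- coef(fit, s = min(fit$lambda))
    res
}
## usage: stabsel(x, y, fitfun = glmnet.lasso_logged, cutoff = 0.75,
##                PFER = 1, papply = lapply), then inspect coef_log$runs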

Make vignette

  • Add vignette based on README.md
  • Add vignette showing how to use stabsel with mboost and gamboostLSS (the latter perhaps based on the paper?)

Graphical models query

Hi,
Not sure if this is the forum you'd like to use for queries - let me know if it isn't.

I'm exploring approaches using the JGL package, specifically the fused group lasso. I'm likely to be working with two groups. I have the mechanisms in place to compute the two lambda values. The difference in partial correlation coefficient for corresponding graph edges is of interest. I have explored bootstrapping approaches to characterising this, but a stability selection approach looks interesting.

I'm unsure of how to use the q parameter in this setting. Do you have examples for glasso-like cases? I also need to be careful about how the resampling occurs within groups.

Thanks

Make new dependency checks happy

To make the new dependency checks on CRAN happy, follow the recipe below (as suggested by Kurt Hornik).

1.) Copy the functions listed after Undefined global functions or variables: in the output of R CMD check into a variable txt:

txt <- "abline axis hcl matplot par plot points tail"

2.) With the function

imports_for_undefined_globals <- function(txt, lst, selective = TRUE) {
    if(!missing(txt))
        lst <- scan(what = character(), text = txt, quiet = TRUE)
    ## locate the package providing each symbol on the search path
    nms <- lapply(lst, find)
    ind <- sapply(nms, length) > 0L
    ## group the symbols by providing package (dropping the "package:" prefix)
    imp <- split(lst[ind], substring(unlist(nms[ind]), 9L))
    if(selective) {
        ## one importFrom() directive per package, listing each symbol
        sprintf("importFrom(%s)",
                vapply(Map(c, names(imp), imp),
                       function(e)
                           paste0("\"", e, "\"", collapse = ", "),
                       ""))
    } else {
        sprintf("import(\"%s\")", names(imp))
    }
}

one can obtain the imports via

writeLines(imports_for_undefined_globals(txt))

3.) Copy and paste the output to NAMESPACE.
4.) Add the packages to Imports in DESCRIPTION.
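For the txt above, the generated directives should come out roughly as follows (my expectation of where these functions live; the exact grouping depends on the attached packages):

importFrom("grDevices", "hcl")
importFrom("graphics", "abline", "axis", "matplot", "par", "plot", "points")
importFrom("utils", "tail")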
