pbreheny / grpreg
Regularization paths for regression models with grouped covariates
Home Page: http://pbreheny.github.io/grpreg/
@YaohuiZeng: I'm seeing a strange bug in the BEDPP rules in this example:
DF <- read.csv(text=RCurl::getURL("https://raw.githubusercontent.com/jonhersh/datasets/master/myDF2.csv"), header=TRUE)
X <- DF[,1:74]
Y <- DF[,75]
groups <- c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,5,5,5,5,5,5,5,5,5,6,6,6,7,7,7,7,7,8,8,8,8,8,8,8,8,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9)
groups.factor <- factor(groups, labels = c("head","hh demographics","dwelling","public services","durables","education","region/sector","head LF","hh LF"))
fit <- grpreg(X, Y, group=groups.factor)
I don't understand why... the data set looks ordinary, yet we're getting -nans, BEDPP is rejecting everything, and we get all 0s for all values of λ. I have turned BEDPP screening off (2aeacd4) as of grpreg 3.1-4, but if you can fix it, I would like to turn it back on for 3.1-5.
I'm unsure if it applies to grouped regression, but is there a way to get a measure of variable importance at the group level (or some proxy for such a measure)? I'm thinking of something similar to what caret uses when reporting importance for its ML methods.
Brought up by Vanda Lourenço over e-mail.
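One possible proxy (a sketch, not an official grpreg feature): summarize each group by the L2 norm of its fitted coefficients at a chosen value of lambda. The example below uses the package's bundled Birthwt data; the choice of lambda = 0.01 is arbitrary and for illustration only.

```r
library(grpreg)
data(Birthwt)

# Fit a group lasso on the package's example data
fit <- grpreg(Birthwt$X, Birthwt$bwt, Birthwt$group, penalty = "grLasso")

# A crude group-level "importance": the L2 norm of each group's
# coefficients at a chosen value of lambda (intercept dropped)
b <- coef(fit, lambda = 0.01)[-1]
imp <- tapply(b, Birthwt$group, function(bj) sqrt(sum(bj^2)))
sort(imp, decreasing = TRUE)
```

Note that these norms are on the scale of the fitted coefficients, so they are only comparable across groups to the extent that the columns of X are on comparable scales (grpreg standardizes internally, which helps here).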
expand_spline(): allow the user to pass a vector argument specifying which columns should be expanded (default: all numeric columns)
expand_factor(): a function to do the same for factors
This is not well documented now, and is confusing/contradictory if a person only reads Breheny and Huang (2009).
In response to pkgdown 1.4 changes
Hello,
Is there a way to set intercept = FALSE
?
Thank you
Yaohui Zeng writes:
For overlapping group selection, you mentioned that we can specify gmax = 5 in the grpreg function. I checked the documentation; gmax specifies the maximum number of groups allowed to be selected. Through some investigation, I found that when gmax = 5 is specified, most of the time only 3 or 4 groups are selected. The question I wish to consult you on is: is there any way to select exactly 5 groups with grpreg?
To illustrate,
set.seed(1)
X <- matrix(rnorm(100*100), 100, 100)
y <- rnorm(100)
fit <- grpreg(X, y, rep(1:50, each=2), gmax=20)
predict(fit, "ngroups")
...
0.0668 0.0648 0.0629 0.061 0.0592 0.0574
17 17 18 18 19 21
We specified gmax=20, but we never get a solution with exactly 20 nonzero groups.
The issue here is that there are simply no λ values for which the number of groups is equal to 20. This can be fixed by increasing the resolution of the λ grid:
fit <- grpreg(X, y, rep(1:50, each=2), gmax=20, nlambda=500)
tail(predict(fit, "ngroups"))
0.0592 0.0589 0.0585 0.0582 0.0578 0.0575
18 20 20 20 20 21
Hi there,
I'm really excited about the package. Thanks for all your hard work on it.
I'm getting a pesky subscript out of bounds error when I run the program. I don't think I'm making an error but I could be wrong. Can you advise whether this is a bug on my end or yours?
require(RCurl)
DF <- read.csv(text=getURL("https://raw.githubusercontent.com/jonhersh/datasets/master/myDF2.csv"), header=TRUE)
Xtrain <- DF[,1:74]
Ytrain <- DF[,75]
groups <- c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,5,5,5,5,5,5,5,5,5,6,6,6,7,7,7,7,7,8,8,8,8,8,8,8,8,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9)
groups.factor <- factor(groups, labels = c("head","hh demographics","dwelling","public services","durables","education","region/sector","head LF","hh LF"))
# group lasso using grpreg
grpreg.fit <- cv.grpreg(Xtrain,Ytrain,group = groups.factor)
Hello,
Are you also planning on adding other distributions from the exponential family, like the Gamma distribution?
Kind regards,
Olivier
I have the following example:
xx <- seq(0.01,1,0.01)
zz <- cbind(sin(xx*2*pi), cos(xx*2*pi))
y <- rep(1, 100)
grpreg::grpreg(zz, y, group = c(1,2), lambda = 10^seq(-3,20,1))
#> Error: Algorithm failed to converge for any values of lambda. This indicates a combination of (a) an ill-conditioned feature matrix X and (b) insufficient penalization. You must fix one or the other for your model to be identifiable.
Created on 2021-08-11 by the reprex package (v2.0.0)
grpreg throws an error rather than fitting every coefficient to zero. What might cause this problem?
Combine Poisson and binomial C code into a common GLM framework
Hello, another possibly useful feature in the grpsurv and grpreg functions would be allowing for weighted observations.
Thanks!
Best regards,
Irmak
It would be nice to allow the user to specify penalty as a vector so that different penalties could be used for different groups. For example, some groups may be splines, while for others we wish to carry out bi-level selection.
Is it possible to perform your grpreg SCAD analysis on data sets with missing values? Judging from the error message I received about the number of observations not matching (which went away when I excluded X variables with missing values), it seems to be a problem. Would you recommend substituting column averages in place of the missing values? The same question applies to an alternate outcome (y) variable I'd like to use, which also has missing values for some subjects. Thanks!
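A minimal sketch of the column-average substitution mentioned above, on a toy matrix. Whether mean imputation is statistically appropriate here is a separate question; complete-case analysis or a principled multiple-imputation method may well be preferable.

```r
set.seed(1)
# Toy matrix with a few missing entries (illustration only)
X <- matrix(rnorm(50), 10, 5)
X[sample(length(X), 5)] <- NA

# Replace each NA with its column mean
X_imp <- apply(X, 2, function(col) {
  col[is.na(col)] <- mean(col, na.rm = TRUE)
  col
})
```

After this step, X_imp contains no missing values and can be passed to grpreg; the same recipe could be applied to a y vector with missing values, though dropping those observations is usually safer for an outcome.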
grpmat()
plot.sp()
plot_spline(): ns, and add tests for both types of splines
residuals() method
plot_spline(cvfit): needs documentation; should use lambda.min as default if unspecified
I provided the exact same type of group, x, y as in the example but kept getting this error. So depressing.
Is there any way to modify the code to accept weights? This would really help in making the package compatible with glmnet.
There is a fit$loss variable if fit is a model returned by a grpreg(...) call. fit$loss is equal to the RSS, but according to the Models vignette (Gaussian (linear regression) section) it should be equal to 1/2 of the RSS, i.e. the negative log-likelihood. See the reproducible example below:
> library(glmnet)
> library(grpreg)
> data(QuickStartExample)
> x <- QuickStartExample$x
> y <- QuickStartExample$y
> fit<-grpreg(stats::model.matrix(y~., data = data.frame(y=y, x, check.names = TRUE))[,-1, drop=FALSE], y, group = 1:20, penalty="grLasso")
> sum((cbind(rep(1,100),x) %*% fit$beta[,10] - y)^2) #pure RSS - no factor of 1/2 here !!!
[1] 363.5578
> fit$loss[10]
[1] 363.5578
The correct value would be 1/2 of the RSS.
Hi,
I believe I have found a possible bug in the group coefficient constraint for penalty="grLasso" in grpreg. The penalty vignette states that "the coefficients within a group will either all equal zero or none will equal zero". This constraint seems to be broken in a data example I came across.
To reproduce this potential bug please execute:
#load data example from github
library(Rfssa)
load_github_data("https://github.com/SzymonNowakowski/DMRnet/blob/testing_branch/data/promoX.RData")
load_github_data("https://github.com/SzymonNowakowski/DMRnet/blob/testing_branch/data/promoy.RData")
#prepare data in grpreg-accepted format
y <- ifelse(y == levels(y)[2], 1, 0)
X <- stats::model.matrix(y~., data = data.frame(y=y, X, check.names = TRUE))[, -1, drop=FALSE]
group <- rep(1:57, each = 3)
#run grpreg
library(grpreg)
fit<-grpreg(X, y, group = group, penalty="grLasso", family="binomial")
#examine coefficients
coef(fit)[44:46,1:9]
# 0.2199 0.2133 0.2069 0.2008 0.1948 0.189 0.1834 0.1779 0.1726
# X15c 0 -2.139903e-17 8.559611e-17 1.069951e-16 2.139903e-16 0.0000000 5.991728e-16 3.423845e-16 0.0000000
# X15g 0 3.998654e-02 7.887908e-02 1.167650e-01 1.537286e-01 0.1898477 2.251939e-01 2.598331e-01 0.2938264
# X15t 0 9.878343e-02 1.948222e-01 2.882890e-01 3.793487e-01 0.4681499 5.548270e-01 6.395019e-01 0.7222856
It is probably a numerical problem: you'll notice some coefficients very close to 0 in the first row (X15c) of the output; however, for the lambdas 0.189 and 0.1726 the coefficient is not close to 0 but exactly 0, breaking the above-mentioned constraint.
Thank you in advance for having a look into it,
Szymon
PS. The data I used is a subset of Promoter dataset isolated for the purpose of reproducing this behavior.
I found that when standardizing the design matrix, the code uses degrees of freedom = n for the standard deviation. I am wondering why not use n-1, which is commonly used in other standardization functions and gives an unbiased estimator of the variance.
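For concreteness, the two scalings differ only by a constant factor sqrt((n-1)/n) per column, so the penalized solution paths are equivalent up to a rescaling of lambda; a quick sketch:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

sd_n  <- sqrt(sum((x - mean(x))^2) / n)  # divisor n (the scaling described above); exactly 2 here
sd_n1 <- sd(x)                           # divisor n - 1 (R's default, unbiased for the variance)

sd_n / sd_n1  # equals sqrt((n - 1) / n)
```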
Would be nice to add this, as in ncvreg. Ryan Miller has done some work on this.
This function should exist and return something useful; see #34
Hello, I've just discovered the package and it looks great. However, the data I have at the moment uses time-varying covariates so I require compatibility with the [start, stop) representation of survival times. Would it be possible to implement this? Perhaps similar to how glmnet does (see version update 4.1)?
Maria Kamenetsky (University of Wisconsin) writes:
With grpreg, I see that you have the option to use the Poisson distribution for the family, but I am unable to find an option to include an offset with this. I was wondering if there is a way to do so using the package.
When trying to install grpreg after installing devtools like so:
install.packages("devtools")
devtools::install_github("pbreheny/grpreg")
I get this error message:
Downloading GitHub repo pbreheny/grpreg@master
from URL https://api.github.com/repos/pbreheny/grpreg/zipball/master
Error: Could not find build tools necessary to build grpreg
Any ideas on what I'm missing here or what I could try to fix it? Thanks in advance.
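This error usually means the compiler toolchain R needs to build packages from source is missing (Rtools on Windows, the Xcode command-line tools on macOS). A diagnostic sketch using the pkgbuild package, which devtools relies on for this check:

```r
# Check whether R can find a working compiler toolchain;
# debug = TRUE prints what it looked for and where
install.packages("pkgbuild")
pkgbuild::has_build_tools(debug = TRUE)
```

If this returns FALSE, installing the platform's build tools and restarting R should make devtools::install_github work.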
Hello,
Example attached using R version 3.6.0 and grpreg version 3.2-1 (did not have this problem with R version 3.5.3 and grpreg version 3.1-3). Data is publicly available (https://www.cdc.gov/nchs/nhanes/index.htm).
Resulting $cve of cv.grpreg() is sometimes a vector (correct) and sometimes a scalar (wrong), depending on the user-defined grouping and the user-set seed.
The plotted output gives the expected cross-validation curve when cve is a vector, but breaks when cve is a scalar.
Why is this happening? Many thanks!
Hi, thank you for this very useful package!
I am working with cv.grpsurv, and I would like to use the 1-standard-error rule to perform variable selection.
However, unlike cv.grpreg, cv.grpsurv returns only the mean cross-validation error (cve) for each lambda, without the estimated standard error associated with each value of cve.
Is it possible to return this extra vector as an output value in cv.grpsurv?
Thanks :)
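For reference, with cv.grpreg (which does return a cvse field) the 1-SE rule can be sketched as below; the request above is for cv.grpsurv to return the same field so this works there too. The Birthwt data is just the package's example data.

```r
library(grpreg)
data(Birthwt)
cvfit <- cv.grpreg(Birthwt$X, Birthwt$bwt, Birthwt$group, seed = 1)

# 1-SE rule: the largest lambda whose CV error is within one
# standard error of the minimum CV error
i_min <- which.min(cvfit$cve)
lambda_1se <- max(cvfit$lambda[cvfit$cve <= cvfit$cve[i_min] + cvfit$cvse[i_min]])
```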
Need to get in touch with Dan Kessler to see where this stands
library(grpreg)
data("Birthwt")
X <- Birthwt$X
group <- Birthwt$group
y <- Birthwt$bwt
fit <- grpreg(X, y, group, penalty="grLasso")
Lambda <- fit$lambda
check <- grpreg(X,y,group,lambda=Lambda)
check$beta
Using the above code, I was trying to check the results by specifying lambda myself. However, when I do this, all the coefficients come out as 0.
When I used plot() to draw the fit after running fit <- grpreg(X, y, group, penalty="grLasso"), plot(fit) gave this error. Does anybody know the reason, and how to fix it?
One should be able to specify unpenalized groups through group.multiplier, as in:
data(Birthwt)
X <- Birthwt$X
y <- Birthwt$bwt
group <- Birthwt$group
fit <- grpreg(X, y, group, group.multiplier=0:7)
But this doesn't work; they have to be given a group number of 0.
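For reference, the working alternative mentioned above: a group label of 0 marks variables as unpenalized. A minimal sketch on simulated data (names and grouping here are illustrative):

```r
library(grpreg)
set.seed(1)
X <- matrix(rnorm(100 * 6), 100, 6)
y <- X[, 1] + rnorm(100)

# Columns 1-2 unpenalized (group 0); remaining columns in two penalized groups
group <- c(0, 0, 1, 1, 2, 2)
fit <- grpreg(X, y, group)
```

Unpenalized variables stay in the model along the whole path, including at the largest lambda, where all penalized groups are zero.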
Similar to what ncvfit() does in ncvreg.
Hi, thanks for developing such a good package on group regularization. It is extremely useful as it not only integrates a bunch of penalties but also has the option for several probability distributions.
I am currently working on an algorithm in which a group lasso regression is performed without an intercept. However, there is no intercept=FALSE option. If you could add this feature it would be very helpful.
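A possible workaround sketch for the Gaussian case, assuming centering is acceptable for the application: if y and the columns of X are centered, the fitted intercept is (numerically) zero along the entire path, so the penalized coefficients are those of a no-intercept model on the centered data. Birthwt is just the package's example data.

```r
library(grpreg)
data(Birthwt)
X <- Birthwt$X
y <- Birthwt$bwt

# Center the response and each column of X
yc <- y - mean(y)
Xc <- scale(X, center = TRUE, scale = FALSE)

fit <- grpreg(Xc, yc, Birthwt$group, penalty = "grLasso")

# Intercepts are ~0 along the whole path
max(abs(fit$beta["(Intercept)", ]))
```

This does not help for non-Gaussian families, where the intercept enters the model nonlinearly.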
Hi Patrick Breheny,
Thanks for the excellent package!! I'm using the function cv.grpreg but I'd like to use the "auc" as the loss function to use for cross-validation (as in package cv.glmnet). Do you think is there a way to implement this?
Cheers,
Pedro
Guo Pi (Sun Yat-Sen University) writes:
I recently used the group Lasso model in your R package grpreg to integrate data from different sources. However, I faced some problems in coding the model.
Kind of similar to what glinternet does, but if part of grpreg, could use, say, group MCP as the penalty.
There is an inconsistency between the documentation and the vignettes regarding the grpreg objective function:
I have tested the results of grpreg against glmnet on some examples and they are the same, so looking at the glmnet objective function for Gaussian families I conclude that 1. and 2. are correct. Could you please update the Penalties vignette?
Hi, thank you so much for creating this package!
I have a question about parameter settings. Suppose I have 3 groups and each group has 2 members; when I use group = c(1,1,2,2,3,3), I can get group lasso / group MCP / group SCAD. I wonder: if I use group = 1:6, is this equivalent to lasso/MCP/SCAD?
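For what it's worth: with singleton groups, the group lasso penalty ||β_j||₂ on each group reduces to |β_j|, i.e. the ordinary lasso (and grMCP/grSCAD likewise reduce to MCP/SCAD). A quick sketch of the singleton-group setup; the fit should coincide, up to convergence tolerance, with an ordinary lasso (e.g. ncvreg with penalty="lasso") at the same lambdas.

```r
library(grpreg)
set.seed(1)
X <- matrix(rnorm(100 * 6), 100, 6)
y <- X[, 1] - X[, 2] + rnorm(100)

# Singleton groups: each variable is its own group, so the penalty
# sum_j ||beta_j||_2 is just sum_j |beta_j| -- the ordinary lasso
fit <- grpreg(X, y, group = 1:6, penalty = "grLasso")
```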
Hello,
While doing a grouped lasso for a time-to-event outcome in a stack of multiply imputed datasets, with folds predefined to be the same across the multiple imputations of individual patients, I realized that the "fold" argument fails to work. I think the issue is that the loss is calculated not in cvf.surv (which takes the subsetting according to folds into account) but outside of it, where the fold structure is not taken into account.
Thanks!
Irmak
Models vignette defines loss function
I have compared the output of grpreg and glmnet and they are the same, so looking at the glmnet objective function for Gaussian families I believe the second definition is correct and the factor of 2 should be removed from the opening paragraph: instead of "the deviance (−2 times the log-likelihood)" it should read "the deviance (minus the log-likelihood)".
Hi,
I am fitting a cv group lasso model on matrices derived from single-cell RNA-seq data.
I won’t get into details of the model, but I have gene expression values of 7 transcription factors and 290 differentially expressed genes (DEGs) in 18 cells, and am using the group lasso to find a set of transcription factors that “explain” the gene expression values of differentially expressed genes.
All entries of my matrices are normalized gene expression values. The dimensions of my matrices are as follows:
X (terminal_ddN_my_X): 5220 × 2030
Y (terminal_ddN_my_Y): 5220 × 1
This is my code snippet:
terminal_ddN_seed <- 15
terminal_grplasso2_ddN_cv_grepreg_fit <- cv.grpreg(terminal_ddN_my_X, terminal_ddN_my_Y, terminal_ddN_group_id, penalty="grLasso", seed=terminal_ddN_seed)
terminal_ddN_group_id is a vector that contains the group of each variable.
When I try to run this code, I get the following error:
Error in La.svd(x, nu, nv) : error code 1 from Lapack routine 'dgesdd'
The traceback is:
7. La.svd(x, nu, nv)
6. svd(X[, ind, drop = FALSE], nu = 0)
5. orthogonalize(XX, G)
4. newXG(X, group, group.multiplier, attr(yy, "m"), bilevel)
3. grpreg(penalty = "grLasso", X = structure(c(2.97925825263004,
2.53039506574974, 3.59726888875713, 2.95151714857877, 3.21434876363906,
3.47295745106238, 2.92167213023129, 2.98361981870551, 3.3768239126559,
3.73921703297616, 3.35209278391207, 2.92910464088794, 3.08964352564125, ...
2. do.call("grpreg", fit.args)
1. cv.grpreg(terminal_ddN_my_X, terminal_ddN_my_Y, terminal_ddN_group_id,
penalty = "grLasso", seed = terminal_ddN_seed)
Unfortunately, I have no intuition as to how to debug this issue even after reading many stack overflow blogs about the same issue (albeit from using other packages).
Any help would be greatly appreciated!
PC
inherits() instead of class() in testing
double() or integer() instead of numeric()
call.=FALSE in stop()
I am using your R package grpreg (grpregOverlap) to perform survival analysis.
First I chose penalty = "grSCAD" and ran the following code; it looks okay.
f.over <- grpsurv(x1m, y , penalty = "grSCAD" , group = groupvector, eps=0.005, max.iter=10000)
cv.over <- cv.grpsurv(x1m, y, nfolds = 10, trace=FALSE, seed=1, penalty = "grSCAD" , group = groupvector, lambda=f.over$lambda, eps=0.005, max.iter=10000)
There is no error, and the output is reasonable.
However, if I set penalty = "gel" or penalty = "cMCP", the model-fitting code still runs:
f.over <- grpsurv(x1m, y , penalty = "gel" , group = groupvector, eps=0.005, max.iter=10000)
f.over <- grpsurv(x1m, y , penalty = "grSCAD" , group = groupvector, eps=0.005, max.iter=10000)
but I cannot run the cross-validation function, as follows:
cv.over <- cv.grpsurv(x1m, y, nfolds = 10, trace=FALSE, seed=1, penalty = "gel", group = groupvector, lambda=f.over$lambda, eps=0.005, max.iter=10000)
cv.over <- cv.grpsurv(x1m, y, nfolds = 10, trace=FALSE, seed=1, penalty = "cMCP", group = groupvector, lambda=f.over$lambda, eps=0.005, max.iter=10000)
the error is:
Error in reorderGroups(group, group.multiplier, strtrim(penalty, 2) == :
Length of group.multiplier must equal number of penalized groups
If I remove penalty = "gel" or penalty = "cMCP" from the above cv.grpsurv call, there is no error, but the output is unreasonable.
Please help me with this problem.
Thanks very much.
Zaixiang Tang
[email protected]
Hi, many thanks to the excellent package.
I'm wondering if the package is available for multinomial logistic regression (where the response variable has more than 2 levels)? The family argument only provides "gaussian" and "binomial" options.
Best
Rui.
Hi,
Thanks for the amazing package! It's super fast and deals with a wide range of data.
I was wondering if it's possible to add an offset option to the Poisson family, so that the package is not limited to simple Poisson data but can also be used for time-homogeneous Poisson processes.
Thank you very much!
Hi there,
I'm having an issue with numerical instability in the package. I'm trying to fit a group lasso in some cases where p approaches n. The groups are over basis functions for splines set up with ns(intercept=TRUE) from the splines package. I seem to be getting coefficients, most often the intercept, that are implausibly large or small, i.e. many orders of magnitude different from the rest of the coefficients. The issue isn't deterministic, but doesn't seem to happen when I run models with larger n. Changing the eps argument in grpreg doesn't seem to stabilize things. Any ideas how to fix this, besides running the model and checking for bizarrely huge/small coefficients?