pbreheny / grpreg
Regularization paths for regression models with grouped covariates
Home Page: http://pbreheny.github.io/grpreg/
@YaohuiZeng: I'm seeing a strange bug in the BEDPP rules in this example:
DF <- read.csv(text=RCurl::getURL("https://raw.githubusercontent.com/jonhersh/datasets/master/myDF2.csv"), header=TRUE)
X <- DF[,1:74]
Y <- DF[,75]
groups <- c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,5,5,5,5,5,5,5,5,5,6,6,6,7,7,7,7,7,8,8,8,8,8,8,8,8,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9)
groups.factor <- factor(groups, labels = c("head","hh demographics","dwelling","public services","durables","education","region/sector","head LF","hh LF"))
fit <- grpreg(X, Y, group=groups.factor)
I don't understand why... the data set looks ordinary, yet we're getting -nans, BEDPP is rejecting everything, and we get all 0s for all values of λ. I have turned BEDPP screening off (2aeacd4) as of grpreg 3.1-4, but if you can fix it, I would like to turn it back on for 3.1-5.
I'm unsure if it applies to grouped regression, but is there a way to get a measure of variable importance at the group level (or some proxy for such a measure)? I'm thinking of something similar to what caret uses when reporting importance for its ML methods.
Brought up by Vanda Lourenço over e-mail.
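One possible proxy (a sketch, not an official grpreg feature): summarize each group by the L2 norm of its fitted coefficients at a chosen value of lambda. The example below uses the package's bundled Birthwt data; the choice of lambda = 0.01 is arbitrary and for illustration only.

```r
library(grpreg)
data(Birthwt)

# Fit a group lasso on the package's example data
fit <- grpreg(Birthwt$X, Birthwt$bwt, Birthwt$group, penalty = "grLasso")

# A crude group-level "importance": the L2 norm of each group's
# coefficients at a chosen value of lambda (intercept dropped)
b <- coef(fit, lambda = 0.01)[-1]
imp <- tapply(b, Birthwt$group, function(bj) sqrt(sum(bj^2)))
sort(imp, decreasing = TRUE)
```

Note that these norms are on the scale of the fitted coefficients, so they are only comparable across groups to the extent that the columns of X are on comparable scales (grpreg standardizes internally, which helps here).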
expand_spline(): allow the user to pass a vector argument specifying which columns should be expanded (default: all numeric columns)
expand_factor(): a function to do the same for factors
This is not well documented now, and is confusing/contradictory if a person only reads Breheny and Huang (2009).
In response to pkgdown 1.4 changes
Hello,
Is there a way to set intercept = FALSE
?
Thank you
Yaohui Zeng writes:
For overlapping group selection, you mentioned that we can specify gmax = 5 in the grpreg function. I checked the documentation; gmax specifies the maximum number of groups allowed to be selected. Through some investigation, I found that when gmax = 5 is specified, most of the time only 3 or 4 groups are selected. The question I wish to consult you on is: is there any way to select exactly 5 groups with grpreg?
To illustrate,
set.seed(1)
X <- matrix(rnorm(100*100), 100, 100)
y <- rnorm(100)
fit <- grpreg(X, y, rep(1:50, each=2), gmax=20)
predict(fit, "ngroups")
...
0.0668 0.0648 0.0629 0.061 0.0592 0.0574
17 17 18 18 19 21
We specified gmax=20, but we never get a solution with exactly 20 nonzero groups.
The issue here is that there are simply no λ values for which the number of groups is equal to 20. This can be fixed by increasing the resolution of the λ grid:
fit <- grpreg(X, y, rep(1:50, each=2), gmax=20, nlambda=500)
tail(predict(fit, "ngroups"))
0.0592 0.0589 0.0585 0.0582 0.0578 0.0575
18 20 20 20 20 21
Hi there,
I'm really excited about the package. Thanks for all your hard work on it.
I'm getting a pesky subscript out of bounds error when I run the program. I don't think I'm making an error but I could be wrong. Can you advise whether this is a bug on my end or yours?
require(RCurl)
DF <- read.csv(text=getURL("https://raw.githubusercontent.com/jonhersh/datasets/master/myDF2.csv"), header=TRUE)
Xtrain <- DF[,1:74]
Ytrain <- DF[,75]
groups <- c(1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5,5,5,5,5,5,5,5,5,5,5,6,6,6,7,7,7,7,7,8,8,8,8,8,8,8,8,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9)
groups.factor <- factor(groups, labels = c("head","hh demographics","dwelling","public services","durables","education","region/sector","head LF","hh LF"))
# group lasso using grpreg
grpreg.fit <- cv.grpreg(Xtrain,Ytrain,group = groups.factor)
Hello,
Are you also planning on adding other distributions from the exponential family, like the Gamma distribution?
Kind regards,
Olivier
I have the following example:
xx <- seq(0.01,1,0.01)
zz <- cbind(sin(xx*2*pi), cos(xx*2*pi))
y <- rep(1, 100)
grpreg::grpreg(zz, y, group = c(1,2), lambda = 10^seq(-3,20,1))
#> Error: Algorithm failed to converge for any values of lambda. This indicates a combination of (a) an ill-conditioned feature matrix X and (b) insufficient penalization. You must fix one or the other for your model to be identifiable.
Created on 2021-08-11 by the reprex package (v2.0.0)
grpreg throws an error rather than fitting every coefficient to zero. What might cause this problem?
Combine Poisson and binomial C code into a common GLM framework
Hello, another possibly useful feature in the grpsurv and grpreg functions would be allowing for weighted observations.
Thanks!
Best regards,
Irmak
It would be nice to allow the user to specify penalty as a vector so that different penalties could be used for different groups. For example, some groups may be splines, while for others we wish to carry out bi-level selection.
Is it possible to perform your grpreg SCAD analysis on data sets with missing values? Judging from the error message I received about the number of observations not matching (which went away when I excluded X variables with missing values), it seems to be a problem. Would you recommend substituting column averages in place of the missing values? The same question applies to an alternate outcome (y) variable I'd like to use, which also has missing values for some subjects. Thanks!
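A minimal sketch of the column-average substitution mentioned above, on a toy matrix. Whether mean imputation is statistically appropriate here is a separate question; complete-case analysis or a principled multiple-imputation method may well be preferable.

```r
set.seed(1)
# Toy matrix with a few missing entries (illustration only)
X <- matrix(rnorm(50), 10, 5)
X[sample(length(X), 5)] <- NA

# Replace each NA with its column mean
X_imp <- apply(X, 2, function(col) {
  col[is.na(col)] <- mean(col, na.rm = TRUE)
  col
})
```

After this step, X_imp contains no missing values and can be passed to grpreg; the same recipe could be applied to a y vector with missing values, though dropping those observations is usually safer for an outcome.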
grpmat()
plot.sp()
plot_spline(): ns, and add tests for both types of splines
residuals() method
plot_spline(cvfit): needs documentation; should use lambda.min as default if unspecified
I provided the exact same type of group, x, y as in the example but kept getting this error. So depressing.
Is there any way to modify the code to accept weights? This would really help in making the package compatible with glmnet.
There is a fit$loss variable if fit is a model returned by a grpreg(...) call. fit$loss is equal to the RSS, but according to the Models vignette (Gaussian (linear regression) section) it should be equal to 1/2 of the RSS, i.e. the negative log-likelihood. See the reproducible example below:
> library(glmnet)
> library(grpreg)
> data(QuickStartExample)
> x <- QuickStartExample$x
> y <- QuickStartExample$y
> fit<-grpreg(stats::model.matrix(y~., data = data.frame(y=y, x, check.names = TRUE))[,-1, drop=FALSE], y, group = 1:20, penalty="grLasso")
> sum((cbind(rep(1,100),x) %*% fit$beta[,10] - y)^2) #pure RSS - no factor of 1/2 here !!!
[1] 363.5578
> fit$loss[10]
[1] 363.5578
The correct value would be 1/2 of the RSS.
Hi,
I believe I have found a possible bug in the group coefficient constraint for penalty="grLasso" in grpreg. The penalty vignette states that "the coefficients within a group will either all equal zero or none will equal zero". This constraint seems to be broken in a data example I came across.
To reproduce this potential bug please execute:
#load data example from github
library(Rfssa)
load_github_data("https://github.com/SzymonNowakowski/DMRnet/blob/testing_branch/data/promoX.RData")
load_github_data("https://github.com/SzymonNowakowski/DMRnet/blob/testing_branch/data/promoy.RData")
#prepare data in grpreg-accepted format
y <- ifelse(y == levels(y)[2], 1, 0)
X <- stats::model.matrix(y~., data = data.frame(y=y, X, check.names = TRUE))[, -1, drop=FALSE]
group <- rep(1:57, each = 3)
#run grpreg
library(grpreg)
fit<-grpreg(X, y, group = group, penalty="grLasso", family="binomial")
#examine coefficients
coef(fit)[44:46,1:9]
# 0.2199 0.2133 0.2069 0.2008 0.1948 0.189 0.1834 0.1779 0.1726
# X15c 0 -2.139903e-17 8.559611e-17 1.069951e-16 2.139903e-16 0.0000000 5.991728e-16 3.423845e-16 0.0000000
# X15g 0 3.998654e-02 7.887908e-02 1.167650e-01 1.537286e-01 0.1898477 2.251939e-01 2.598331e-01 0.2938264
# X15t 0 9.878343e-02 1.948222e-01 2.882890e-01 3.793487e-01 0.4681499 5.548270e-01 6.395019e-01 0.7222856
It is probably a numerical problem: you'll notice some coefficients very close to 0 in the first row (X15c) of the output; however, for the lambdas 0.189 and 0.1726 the coefficient is not close to 0 but exactly 0, breaking the above-mentioned constraint.
Thank you in advance for having a look into it,
Szymon
PS. The data I used is a subset of Promoter dataset isolated for the purpose of reproducing this behavior.
I found that when standardizing the design matrix, the code uses degrees of freedom = n for the standard deviation. I am wondering why not use n-1, which is commonly used in other standardization functions and gives an unbiased estimator of the variance.
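For concreteness, the two scalings differ only by a constant factor sqrt((n-1)/n) per column, so the penalized solution paths are equivalent up to a rescaling of lambda; a quick sketch:

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(x)

sd_n  <- sqrt(sum((x - mean(x))^2) / n)  # divisor n (the scaling described above); exactly 2 here
sd_n1 <- sd(x)                           # divisor n - 1 (R's default, unbiased for the variance)

sd_n / sd_n1  # equals sqrt((n - 1) / n)
```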
Would be nice to add this, as in ncvreg. Ryan Miller has done some work on this.
This function should exist and return something useful; see #34
Hello, I've just discovered the package and it looks great. However, the data I have at the moment uses time-varying covariates so I require compatibility with the [start, stop) representation of survival times. Would it be possible to implement this? Perhaps similar to how glmnet does (see version update 4.1)?
Maria Kamenetsky (University of Wisconsin) writes:
With grpreg, I see that you have the option to use the Poisson distribution for the family, but I am unable to find an option to include an offset with this. I was wondering if there is a way to do so using the package.
When trying to install grpreg after installing devtools like so:
install.packages("devtools")
devtools::install_github("pbreheny/grpreg")
I get this error message:
Downloading GitHub repo pbreheny/grpreg@master
from URL https://api.github.com/repos/pbreheny/grpreg/zipball/master
Error: Could not find build tools necessary to build grpreg
Any ideas on what I'm missing here or what I could try to fix it? Thanks in advance.
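This error usually means the compiler toolchain R needs to build packages from source is missing (Rtools on Windows, the Xcode command-line tools on macOS). A diagnostic sketch using the pkgbuild package, which devtools relies on for this check:

```r
# Check whether R can find a working compiler toolchain;
# debug = TRUE prints what it looked for and where
install.packages("pkgbuild")
pkgbuild::has_build_tools(debug = TRUE)
```

If this returns FALSE, installing the platform's build tools and restarting R should make devtools::install_github work.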
Hello,
Example attached using R version 3.6.0 and grpreg version 3.2-1 (did not have this problem with R version 3.5.3 and grpreg version 3.1-3). Data is publicly available (https://www.cdc.gov/nchs/nhanes/index.htm).
Resulting $cve of cv.grpreg() is sometimes a vector (correct) and sometimes a scalar (wrong), depending on the user-defined grouping and the user-set seed.
The plotted output gives the expected cross-validation curve when cve is a vector, but breaks when cve is a scalar.
Why is this happening? Many thanks!
Hi, thank you for this very useful package!
I am working with cv.grpsurv, and I would like to use the 1-standard-error rule to perform variable selection.
However, unlike cv.grpreg, cv.grpsurv returns only the mean cross-validation error (cve) for each lambda, without the estimated standard error associated with each value of cve.
Is it possible to return this extra vector as an output value in cv.grpsurv?
Thanks :)
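For reference, with cv.grpreg (which does return a cvse field) the 1-SE rule can be sketched as below; the request above is for cv.grpsurv to return the same field so this works there too. The Birthwt data is just the package's example data.

```r
library(grpreg)
data(Birthwt)
cvfit <- cv.grpreg(Birthwt$X, Birthwt$bwt, Birthwt$group, seed = 1)

# 1-SE rule: the largest lambda whose CV error is within one
# standard error of the minimum CV error
i_min <- which.min(cvfit$cve)
lambda_1se <- max(cvfit$lambda[cvfit$cve <= cvfit$cve[i_min] + cvfit$cvse[i_min]])
```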
Need to get in touch with Dan Kessler to see where this stands
library(grpreg)
data("Birthwt")
X <- Birthwt$X
group <- Birthwt$group
y <- Birthwt$bwt
fit <- grpreg(X, y, group, penalty="grLasso")
Lambda <- fit$lambda
check <- grpreg(X,y,group,lambda=Lambda)
check$beta
Using the above code, I was trying to check the results by specifying lambda myself. However, when I do this, all the coefficients come out as 0.
When I used plot() to draw the fit after running fit <- grpreg(X, y, group, penalty="grLasso"), plot(fit) gave this error. Does anybody know the reason, and how to fix it?
One should be able to specify unpenalized groups through group.multiplier, as in:
data(Birthwt)
X <- Birthwt$X
y <- Birthwt$bwt
group <- Birthwt$group
fit <- grpreg(X, y, group, group.multiplier=0:7)
But this doesn't work; they have to be given a group number of 0.
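For reference, the working alternative mentioned above: a group label of 0 marks variables as unpenalized. A minimal sketch on simulated data (names and grouping here are illustrative):

```r
library(grpreg)
set.seed(1)
X <- matrix(rnorm(100 * 6), 100, 6)
y <- X[, 1] + rnorm(100)

# Columns 1-2 unpenalized (group 0); remaining columns in two penalized groups
group <- c(0, 0, 1, 1, 2, 2)
fit <- grpreg(X, y, group)
```

Unpenalized variables stay in the model along the whole path, including at the largest lambda, where all penalized groups are zero.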
Similar to what ncvfit() does in ncvreg.
Hi, thanks for developing such a good package on group regularization. It is extremely useful as it not only integrates a bunch of penalties but also has the option for several probability distributions.
I am currently working on an algorithm in which a group lasso regression is performed without an intercept. However, there is no intercept=FALSE option. If you could add this feature it would be very helpful.
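A possible workaround sketch for the Gaussian case, assuming centering is acceptable for the application: if y and the columns of X are centered, the fitted intercept is (numerically) zero along the entire path, so the penalized coefficients are those of a no-intercept model on the centered data. Birthwt is just the package's example data.

```r
library(grpreg)
data(Birthwt)
X <- Birthwt$X
y <- Birthwt$bwt

# Center the response and each column of X
yc <- y - mean(y)
Xc <- scale(X, center = TRUE, scale = FALSE)

fit <- grpreg(Xc, yc, Birthwt$group, penalty = "grLasso")

# Intercepts are ~0 along the whole path
max(abs(fit$beta["(Intercept)", ]))
```

This does not help for non-Gaussian families, where the intercept enters the model nonlinearly.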
Hi Patrick Breheny,
Thanks for the excellent package!! I'm using the function cv.grpreg but I'd like to use the "auc" as the loss function to use for cross-validation (as in package cv.glmnet). Do you think is there a way to implement this?
Cheers,
Pedro
Guo Pi (Sun Yat-Sen University) writes:
I recently used the group Lasso model in your R package grpreg to integrate data from different sources. However, I faced some problems in coding the model.
Kind of similar to what glinternet does, but if part of grpreg, could use, say, group MCP as the penalty.
There is an inconsistency between the documentation and the vignettes regarding the grpreg objective function:
I have tested the results of grpreg against glmnet on some examples and they are the same, so looking at the glmnet objective function for Gaussian families I conclude that 1. and 2. are correct. Could you please update the Penalties vignette?
Hi, thank you so much for creating this package!
I have a question about parameter settings. Suppose I have 3 groups and each group has 2 members; when I use group = c(1,1,2,2,3,3), I can get group lasso / group MCP / group SCAD. I wonder: if I use group = 1:6, is this equivalent to lasso/MCP/SCAD?
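For what it's worth: with singleton groups, the group lasso penalty ||β_j||₂ on each group reduces to |β_j|, i.e. the ordinary lasso (and grMCP/grSCAD likewise reduce to MCP/SCAD). A quick sketch of the singleton-group setup; the fit should coincide, up to convergence tolerance, with an ordinary lasso (e.g. ncvreg with penalty="lasso") at the same lambdas.

```r
library(grpreg)
set.seed(1)
X <- matrix(rnorm(100 * 6), 100, 6)
y <- X[, 1] - X[, 2] + rnorm(100)

# Singleton groups: each variable is its own group, so the penalty
# sum_j ||beta_j||_2 is just sum_j |beta_j| -- the ordinary lasso
fit <- grpreg(X, y, group = 1:6, penalty = "grLasso")
```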
Hello,
While doing a grouped lasso for a time-to-event outcome in a stack of multiply imputed datasets, with folds predefined to be the same across the multiple imputations of individual patients, I realized that the "fold" argument fails to work. I think the issue is that the loss is calculated not in cvf.surv (which takes the subsetting according to folds into account) but outside of it, where the fold structure is not taken into account.
Thanks!
Irmak
Models vignette defines loss function
I have compared the output of grpreg and glmnet and they are the same, so looking at the glmnet objective function for Gaussian families I believe the second definition is correct and the factor of 2 should be removed from the opening paragraph: instead of "the deviance (−2 times the log-likelihood)" it should read "the deviance (minus the log-likelihood)".
Hi,
I am fitting a cv group lasso model on matrices derived from single-cell RNA-seq data.
I won’t get into details of the model, but I have gene expression values of 7 transcription factors and 290 differentially expressed genes (DEGs) in 18 cells, and am using the group lasso to find a set of transcription factors that “explain” the gene expression values of differentially expressed genes.
All entries of my matrices are normalized gene expression values. The dimensions of my matrices are as follows:
X (terminal_ddN_my_X): 5220 × 2030
Y (terminal_ddN_my_Y): 5220 × 1
This is my code snippet:
terminal_ddN_seed <- 15
terminal_grplasso2_ddN_cv_grepreg_fit <- cv.grpreg(terminal_ddN_my_X, terminal_ddN_my_Y, terminal_ddN_group_id, penalty="grLasso", seed=terminal_ddN_seed)
terminal_ddN_group_id is a vector that contains the group of each variable.
When I try to run this code, I get the following error:
Error in La.svd(x, nu, nv) : error code 1 from Lapack routine 'dgesdd'
The traceback is:
7. La.svd(x, nu, nv)
6. svd(X[, ind, drop = FALSE], nu = 0)
5. orthogonalize(XX, G)
4. newXG(X, group, group.multiplier, attr(yy, "m"), bilevel)
3. grpreg(penalty = "grLasso", X = structure(c(2.97925825263004,
2.53039506574974, 3.59726888875713, 2.95151714857877, 3.21434876363906,
3.47295745106238, 2.92167213023129, 2.98361981870551, 3.3768239126559,
3.73921703297616, 3.35209278391207, 2.92910464088794, 3.08964352564125, ...
2. do.call("grpreg", fit.args)
1. cv.grpreg(terminal_ddN_my_X, terminal_ddN_my_Y, terminal_ddN_group_id,
penalty = "grLasso", seed = terminal_ddN_seed)
Unfortunately, I have no intuition as to how to debug this issue even after reading many stack overflow blogs about the same issue (albeit from using other packages).
Any help would be greatly appreciated!
PC
inherits() instead of class() in testing
double() or integer() instead of numeric()
call.=FALSE in stop()
I am using your R package grpreg (grpregOverlap) to perform survival analysis.
First I chose penalty = "grSCAD" and ran the following code; it looks okay.
f.over <- grpsurv(x1m, y , penalty = "grSCAD" , group = groupvector, eps=0.005, max.iter=10000)
cv.over <- cv.grpsurv(x1m, y, nfolds = 10, trace=FALSE, seed=1, penalty = "grSCAD" , group = groupvector, lambda=f.over$lambda, eps=0.005, max.iter=10000)
There is no error, and the output is reasonable.
However, if I set penalty = "gel" or penalty = "cMCP", the model-fitting code still runs:
f.over <- grpsurv(x1m, y , penalty = "gel" , group = groupvector, eps=0.005, max.iter=10000)
f.over <- grpsurv(x1m, y , penalty = "grSCAD" , group = groupvector, eps=0.005, max.iter=10000)
but I cannot run the cross-validation function, as follows:
cv.over <- cv.grpsurv(x1m, y, nfolds = 10, trace=FALSE, seed=1, penalty = "gel", group = groupvector, lambda=f.over$lambda, eps=0.005, max.iter=10000)
cv.over <- cv.grpsurv(x1m, y, nfolds = 10, trace=FALSE, seed=1, penalty = "cMCP", group = groupvector, lambda=f.over$lambda, eps=0.005, max.iter=10000)
the error is:
Error in reorderGroups(group, group.multiplier, strtrim(penalty, 2) == :
Length of group.multiplier must equal number of penalized groups
If I remove penalty = "gel" or penalty = "cMCP" from the above cv.grpsurv call, there is no error, but the output is unreasonable.
Please help me with this problem.
Thanks very much.
Zaixiang Tang
[email protected]
Hi, many thanks to the excellent package.
I'm wondering if the package is available for multinomial logistic regression (where the response variable has more than 2 levels)? The family argument only provides "gaussian" and "binomial" options.
Best
Rui.
Hi,
Thanks for the amazing package! It's super fast and deals with a wide range of data.
I was wondering if it's possible to add an offset option to the Poisson family, so that the package is not limited to simple Poisson data but can also be used for time-homogeneous Poisson processes.
Thank you very much!
Hi there,
I'm having an issue with numerical instability in the package. I'm trying to fit a group lasso in some cases where p approaches n. The groups are over basis functions for splines set up with ns(intercept=TRUE) from the splines package. I seem to be getting coefficients, most often the intercept, that are implausibly large or small, i.e. many orders of magnitude different from the rest of the coefficients. The issue isn't deterministic, but doesn't seem to happen when I run models with larger n. Changing the eps argument in grpreg doesn't seem to stabilize things. Any ideas how to fix this, besides running the model and checking for bizarrely huge/small coefficients?