abess-team / abess

Fast Best-Subset Selection Library

Home Page: https://abess.readthedocs.io/

License: Other

polynomial-algorithm high-dimensional-data best-subset-selection machine-learning python r scikit-learn principal-component-analysis linear-regression classification-algorithm

abess's Introduction

abess: Fast Best-Subset Selection in Python and R

Overview

The abess (Adaptive BEst Subset Selection) library aims to solve the general best subset selection problem, i.e., to find a small subset of predictors such that the resulting model is expected to have the highest accuracy. Best subset selection is of great value in scientific research and practical applications. For example, clinicians want to know whether a patient is healthy based on the expression levels of a few important genes.

This library implements a generic algorithmic framework that finds the optimal solution quickly. The framework currently supports best subset detection under linear regression, classification (binary or multi-class), counting-response modeling, censored-response modeling, multi-response modeling (multi-task learning), etc. It also supports variants of best subset selection such as group best subset selection and nuisance penalized regression. In particular, the time complexity of (group) best subset selection for linear regression is certifiably polynomial.
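
To make the problem statement concrete, here is a minimal brute-force sketch of best subset selection for linear regression in plain Python. It is not the algorithm abess uses (the polynomial-time method is described in the references); it simply enumerates every subset of size k, which is only feasible for tiny problems. The helper names `solve`, `rss`, and `best_subset` and the toy data are illustrative assumptions.

```python
from itertools import combinations


def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def rss(X, y, subset):
    """Residual sum of squares of the least-squares fit (no intercept) on the chosen columns."""
    cols = list(subset)
    gram = [[sum(row[a] * row[b] for row in X) for b in cols] for a in cols]
    rhs = [sum(row[a] * yi for row, yi in zip(X, y)) for a in cols]
    beta = solve(gram, rhs)  # normal equations: (X_S' X_S) beta = X_S' y
    return sum(
        (yi - sum(bj * row[c] for bj, c in zip(beta, cols))) ** 2
        for row, yi in zip(X, y)
    )


def best_subset(X, y, k):
    """Exhaustive search: the size-k column subset with the smallest RSS."""
    p = len(X[0])
    return min(combinations(range(p), k), key=lambda s: rss(X, y, s))


# Toy data: y depends only on columns 0 and 2 (y = 2*x0 - 3*x2).
X = [
    [1, 0, 2, 1],
    [0, 1, 1, 3],
    [2, 1, 0, 1],
    [1, 3, 1, 0],
    [3, 0, 1, 2],
    [0, 2, 3, 1],
]
y = [2 * row[0] - 3 * row[2] for row in X]

print(best_subset(X, y, 2))  # (0, 2)
```

The search visits every one of the "p choose k" subsets, which grows combinatorially with p; that cost is what motivates a polynomial-time method.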

Quick start

The abess software has both Python and R interfaces. A quick start is given here; for more details, please see Installation.

Python package

Install the stable version of the Python package from PyPI:

$ pip install abess

or conda-forge:

$ conda install abess

Best subset selection for linear regression on a simulated dataset in Python:

from abess.linear import LinearRegression
from abess.datasets import make_glm_data
sim_dat = make_glm_data(n = 300, p = 1000, k = 10, family = "gaussian")
model = LinearRegression()
model.fit(sim_dat.x, sim_dat.y)

See more examples analyzed with Python in the Python tutorials.

R package

Install the stable version of R-package from CRAN with:

install.packages("abess")

Best subset selection for linear regression on a simulated dataset in R:

library(abess)
sim_dat <- generate.data(n = 300, p = 1000)
abess(x = sim_dat[["x"]], y = sim_dat[["y"]])

See more examples analyzed with R in the R tutorials.

Runtime Performance

To show the computational power of abess, we measure its CPU execution time (in seconds) on synthetic datasets and compare it with state-of-the-art variable selection methods. The variable selection and estimation results are deferred to Python performance and R performance. All computations are conducted on an Ubuntu platform with an Intel(R) Core(TM) i9-9940X CPU @ 3.30GHz and 48 RAM.

Python package

We compare the abess Python package with scikit-learn on linear regression and logistic regression. The results are presented in the figure below:

It can be seen that abess uses the least runtime to find the solution. These results can be reproduced by running the following command in a shell:

$ python abess/docs/simulation/Python/timings.py

R package

We compare abess R package with three widely used R packages: glmnet, ncvreg, and L0Learn. We get the runtime comparison results:

Compared with the other packages, abess shows competitive computational efficiency, and it is the most efficient when the variables are highly correlated.

Running the following command in a shell reproduces the above results in R:

$ Rscript abess/docs/simulation/R/timings.R

Open source software

abess is free software and its source code is publicly available on GitHub. The core framework is programmed in C++, and user-friendly R and Python interfaces are offered. You can redistribute it and/or modify it under the terms of the GPL-v3 License. We welcome contributions to abess, especially those extending abess to other best subset selection problems.

What's new

New features in version 0.4.7:

Support restricting beta to a range via clipping. One application is non-negative fitting.
Support no-intercept model for most regressors in abess.linear with argument fit_intercept=False. We assume that the data has been centered for these models.
Support AUC criterion for Logistic and Multinomial Regression.

New features in version 0.4.6:

Support no-intercept model for most regressors in abess.linear with argument fit_intercept=False. We assume that the data has been centered for these models. (Python)
abess can be used via mlr3extralearners as learners regr.abess and classif.abess. (R)
Use CMake for compilation to improve scalability.
Support score functions for all GLM models. (Python)
Rearrange some arguments in Python package to improve legibility. Please check the latest API document. (Python)

Citation

If you use abess or reference our tutorials in a presentation or publication, we would appreciate citations of our library.

Jin Zhu, Xueqin Wang, Liyuan Hu, Junhao Huang, Kangkang Jiang, Yanhang Zhang, Shiyun Lin, and Junxian Zhu. "abess: A Fast Best-Subset Selection Library in Python and R." Journal of Machine Learning Research 23, no. 202 (2022): 1-7.

The corresponding BibTeX entry:

@article{JMLR:v23:21-1060,
  author  = {Jin Zhu and Xueqin Wang and Liyuan Hu and Junhao Huang and Kangkang Jiang and Yanhang Zhang and Shiyun Lin and Junxian Zhu},
  title   = {abess: A Fast Best-Subset Selection Library in Python and R},
  journal = {Journal of Machine Learning Research},
  year    = {2022},
  volume  = {23},
  number  = {202},
  pages   = {1--7},
  url     = {http://jmlr.org/papers/v23/21-1060.html}
}

References

Junxian Zhu, Canhong Wen, Jin Zhu, Heping Zhang, and Xueqin Wang (2020). A polynomial algorithm for best-subset selection problem. Proceedings of the National Academy of Sciences, 117(52):33117-33123.
Pölsterl, S (2020). scikit-survival: A Library for Time-to-Event Analysis Built on Top of scikit-learn. J. Mach. Learn. Res., 21(212), 1-6.
Yanhang Zhang, Junxian Zhu, Jin Zhu, and Xueqin Wang. A splicing approach to best subset of groups selection. INFORMS Journal on Computing, 35(1):104–119, 2023. doi: 10.1287/ijoc.2022.1241.
Qiang Sun and Heping Zhang (2020). Targeted Inference Involving High-Dimensional Data Using Nuisance Penalized Regression, Journal of the American Statistical Association, DOI: 10.1080/01621459.2020.1737079.
Jin Zhu, Xueqin Wang, Liyuan Hu, Junhao Huang, Kangkang Jiang, Yanhang Zhang, Shiyun Lin, and Junxian Zhu. "abess: A Fast Best-Subset Selection Library in Python and R." Journal of Machine Learning Research 23, no. 202 (2022): 1-7.

abess's Issues

Gamma model gives wrong results

Describe the bug

The result of the Gamma model is wrong. With the approximate Newton method, the estimate is always the all-zero vector; with the exact Newton method, the result is completely wrong.

Code for Reproduction

Here is the R code:

  n <- 10000
  p <- 5
  support.size <- 3
  dataset <- generate.data(n, p, support.size, family = "gamma", seed = 1)
  
  approx_fit <- abess(
    dataset[["x"]],
    dataset[["y"]],
    family = "gamma",
    newton = "approx",
  )
  exact_fit <- abess(
    dataset[["x"]],
    dataset[["y"]],
    family = "gamma",
    newton = "exact",
  )
  print("true_coef: ")
  print(dataset$beta)
  print("approx newton est_coef: ")
  print(approx_fit$beta[,support.size]) 
  print("exact newton est_coef: ")
  print(exact_fit$beta[,support.size])

Result:

[1] "true_coef: "
[1] 0.000000 0.000000 3.069073 6.725235 7.974553
[1] "approx newton est_coef: "
x1 x2 x3 x4 x5 
 0  0  0  0  0 
[1] "exact newton est_coef: "
          x1           x2           x3           x4           x5 
2.159370e+01 0.000000e+00 5.188777e-13 0.000000e+00 0.000000e+00

Here is the Python code:

import abess
import numpy as np

np.random.seed(1)
data = abess.make_glm_data(n=10000, p=5, k=3, family="gamma")

model1 = abess.GammaRegression(support_size = 3, approximate_Newton = False)
model1.fit(data.x, data.y)

model2 = abess.GammaRegression(support_size = 3, approximate_Newton = True)
model2.fit(data.x, data.y)

print("true_coef: ",data.coef_)
print("approx newton est_coef: ",model2.coef_)
print("exact newton est_coef: ",model1.coef_)

Results:

true_coef:  [ 1.47594114  6.66687502 -2.85407881  0.          0.        ]
approx newton est_coef:  [0. 0. 0. 0. 0.]
exact newton est_coef:  [ 0.00000000e+00  0.00000000e+00 -2.10497802e-34 -1.93607800e-35 1.82703568e-34]

Desktop (please complete the following information):

OS: Platform Version: Linux-4.15.0-189-generic-x86_64-with-glibc2.27, 64bit
Python Version: 3.10.4
Package Version: 0.4.6

Negative coefficients in data generation

Is your feature request related to a problem? Please describe.
I find that the data generators always produce a positive coef_, as described on this page. Could you support random negative coefficients too?

Describe the solution you'd like
coef_ contains both positive and negative values.
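
As an illustration of the requested behaviour, here is a minimal sketch in plain Python that draws a sparse coefficient vector with random signs. The function name `make_signed_coef` and its parameters are hypothetical; this is not part of the abess data generators.

```python
import random


def make_signed_coef(p, k, low=1.0, high=5.0, seed=None):
    """Return a length-p coefficient list with k nonzero entries of random sign.

    Hypothetical helper for illustration only; not an abess function.
    """
    rng = random.Random(seed)
    coef = [0.0] * p
    for idx in rng.sample(range(p), k):      # choose the support at random
        magnitude = rng.uniform(low, high)   # magnitude bounded away from zero
        coef[idx] = rng.choice([-1.0, 1.0]) * magnitude
    return coef


coef = make_signed_coef(p=20, k=5, seed=0)
print([round(c, 3) for c in coef if c != 0.0])
```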

Using abess.linear.LogisticRegression in cross_validate returns a negative test_score

Describe the bug

In my experiment, using abess.linear.LogisticRegression in cross_validate returns a negative test_score. The following code provides an example.

Code for Reproduction

Fit LogisticRegression to samples on Hypersphere(dim=9) in 10-D Euclidean space (without applying the logarithm map).

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from geomstats.geometry.hypersphere import Hypersphere
from abess import LogisticRegression

sphere = Hypersphere(dim=9)
labels = np.concatenate((np.zeros(1000),np.ones(1000)))
data0 = sphere.random_riemannian_normal(mean=np.array([1/3,0,2/3,0,2/3,0,0  ,0,0  ,0]), n_samples=1000, precision=5)
data1 = sphere.random_riemannian_normal(mean=np.array([0  ,0,0  ,0,2/3,0,2/3,0,1/3,0]), n_samples=1000, precision=5)
data = np.concatenate((data0,data1))
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, test_size=0.33, random_state=0)
result = cross_validate(LogisticRegression(support_size=range(0, 11)), train_data, train_labels)
print(result)

return:

{'fit_time': array([0.01018214, 0.0107348 , 0.00791812, 0.00899959, 0.00998664]), 'score_time': array([0.0010004, 0.       , 0.0010004, 0.       , 0.       ]), 'test_score': array([-96.61424923, -94.28383166, -91.52310614, -95.26857002,
       -87.02766473])}
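
One possible explanation, stated here as an assumption that the report itself does not confirm: when no `scoring` argument is given, scikit-learn's cross_validate uses the estimator's own `score` method, and that method may return a log-likelihood summed over the test samples instead of an accuracy. The plain-Python sketch below shows that such a quantity is negative even when every prediction is correct. The function names and toy numbers are illustrative.

```python
import math


def accuracy(y_true, prob):
    """Fraction of samples whose thresholded probability matches the label."""
    return sum((p >= 0.5) == bool(y) for y, p in zip(y_true, prob)) / len(y_true)


def bernoulli_log_likelihood(y_true, prob):
    """Sum over samples of y*log(p) + (1-y)*log(1-p); never positive."""
    return sum(
        y * math.log(p) + (1 - y) * math.log(1 - p) for y, p in zip(y_true, prob)
    )


y_true = [1, 0, 1, 1, 0]
prob = [0.9, 0.2, 0.7, 0.6, 0.4]  # predicted P(y = 1) for each sample

print(accuracy(y_true, prob))                  # 1.0
print(bernoulli_log_likelihood(y_true, prob))  # negative, grows in size with sample count
```

If an accuracy-like score is wanted, passing an explicit `scoring="accuracy"` to cross_validate avoids relying on the estimator's default.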

[Matrix package in R CRAN]: deprecations in the forthcoming Matrix version 1.4-2

Matrix 1.4-2 will formally deprecate 187 coercion methods. More precisely, coercions of the form

as(object, Class)

where

'object' inherits from the virtual class Matrix, is a traditional matrix, or is a logical or numeric vector
'Class' specifies a non-virtual subclass of Matrix, such as dgCMatrix, but really any subclass matching the pattern
```
^[dln]([gts][CRT]|di|ge|tr|sy|tp|sp)Matrix$
```

will continue to work as before but signal a deprecation message or warning (message in the widely used dg.Matrix and d.CMatrix cases).
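
To see which target class names the quoted pattern covers, it can be tried out directly. The sketch below uses Python's `re` module instead of R, which is fine here because the pattern uses only syntax common to both; the helper name `is_deprecated_target` and the list of sample names are illustrative.

```python
import re

# Pattern quoted in the deprecation notice above.
DEPRECATED_TARGET = re.compile(r"^[dln]([gts][CRT]|di|ge|tr|sy|tp|sp)Matrix$")


def is_deprecated_target(class_name):
    """True if the class name matches the quoted pattern."""
    return DEPRECATED_TARGET.match(class_name) is not None


for name in ["dgCMatrix", "dgeMatrix", "lsyMatrix", "CsparseMatrix", "Matrix"]:
    print(name, is_deprecated_target(name))
# dgCMatrix True
# dgeMatrix True
# lsyMatrix True
# CsparseMatrix False
# Matrix False
```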

To simplify the revision process, the development version of Matrix provides Matrix:::.as.via.virtual(), taking a pair of class names and returning as a call the correct nesting of coercion:

Matrix:::.as.via.virtual("matrix", "dgCMatrix")

[Bug] No termination within reasonable time for Poisson regression in a specific case

I've encountered a strange issue: abess() does not terminate in a specific situation. The following code is a reproducible example. It runs for at least 10 minutes without terminating. However, simply setting support.size = 0:13 or support.size = 14 makes it terminate immediately (perhaps within 1 second). Moreover, with tune.type = "gic" the issue does not occur either, which really confuses me.

The version of abess is 0.4.7 (installed from CRAN). I've tested the code on two different Linux systems. The same issue is encountered.

library(abess)
seed <- 1
n <- 100
p <- 1000
family <- "poisson"
snr <- Inf
beta <- rep(0, p)
nonzero <- sample(1:p, 10)
beta[nonzero] <- c(5, 5, 5, 5, 5, 5, 5, 5, 5, 5)
k <- 10

data <- generate.data(n, p, beta = beta, snr = snr, family = family, support.size = k, seed = seed)
x <- data$x
y <- data$y

abess(x, y, tune.type = "cv", family = "poisson", support.size = 0:14)

[Question] Cox model for ultra-high dimensional data

Hello, I am doing some real data analysis with a high-dimensional Cox model. My real dataset has a shape of about 240*7000. However, when I use abess.CoxPHSurvivalAnalysis() with cv, it cannot select any features, so I have to apply screening before running abess for the Cox model. I also ran a simulation test of the screening method alone in the abess package and found that screening does not retain all of the true features generated by make_glm_data. So I have doubts about the screening algorithm in this package and hope you can adjust it. Thank you!

Results seem to be meaningless when smax > rank(X).

No warning is issued when the rank of the design matrix <= support size. As the support size gets larger, the results seem to be meaningless. How about issuing a warning? The following code provides a demo.

library(abess)
data <- generate.data(n = 30, p = 100, support.size = 10)
x0 <- data$x
y0 <- data$y

idx <- c(1:10, 1:10)
x <- x0[idx, ]
y <- y0[idx]
abess(x, y, support.size = 0:15)

Call:
abess.default(x = x, y = y, support.size = 0:15)

   support.size          dev         GIC
1             0 9.934785e+05   276.17935
2             1 2.310636e+05   252.06170
3             2 8.155513e+04   236.28617
4             3 9.333066e+03   197.98460
5             4 2.467864e+03   176.43313
6             5 3.883858e+02   144.50369
7             6 5.223017e+02   155.48135
8             7 1.689733e+00    45.86059
9             8 2.741854e+01   106.64631
10            9 3.802369e-24 -1033.05370
11           10 1.188349e-25 -1097.31384
12           11 6.137819e-25 -1059.42301
13           12 1.259870e-25 -1086.03949
14           13 4.764910e-24 -1008.32964
15           14 1.402658e-25 -1073.78679
16           15 3.573510e-25 -1050.03047
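
The kind of check being requested can be sketched as follows, in plain Python and independent of abess: compute the rank of the design matrix and compare it with the largest support size. The names `matrix_rank` and `rank_warning` are hypothetical, and the elimination routine is only meant for small dense inputs. As in the demo above, the toy matrix duplicates its rows, so its rank is smaller than its row count.

```python
def matrix_rank(rows, tol=1e-10):
    """Rank via Gaussian elimination with partial pivoting (small dense inputs only)."""
    M = [[float(v) for v in r] for r in rows]
    n_rows, n_cols = len(M), len(M[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        pivot = max(range(rank, n_rows), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) <= tol:
            continue  # no usable pivot in this column
        M[rank], M[pivot] = M[pivot], M[rank]
        for r in range(rank + 1, n_rows):
            f = M[r][col] / M[rank][col]
            for c in range(col, n_cols):
                M[r][c] -= f * M[rank][c]
        rank += 1
    return rank


def rank_warning(X, max_support_size):
    """Return a warning message if the largest support size reaches rank(X), else None."""
    r = matrix_rank(X)
    if max_support_size >= r:
        return (
            f"largest support size {max_support_size} >= rank(X) = {r}; "
            "fits at or beyond the rank may be meaningless"
        )
    return None


X0 = [[1, 2, 0, 1, 3], [0, 1, 1, 2, 1], [2, 0, 1, 0, 1]]
X = X0 + X0  # 6 rows, but only 3 distinct ones

print(matrix_rank(X))      # 3
print(rank_warning(X, 4))  # a warning message
print(rank_warning(X, 2))  # None
```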

Could not run the example in arXiv:2110.09697

I tried to run the usage example provided in the article, but it failed:

from abess.linear import abessLogistic
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.metrics import make_scorer, roc_auc_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import GridSearchCV
# combine feature transform and model:
pipe = Pipeline([('poly', PolynomialFeatures(include_bias=False)), ('alogistic', abessLogistic())])
param_grid = {'poly_interaction_only': [True, False],'poly_degree': [1, 2, 3]}
# Use cross validation to tune parameters:
scorer = make_scorer(roc_auc_score, greater_is_better=True)
grid_search = GridSearchCV(pipe, param_grid, scoring=scorer, cv=5)
# load and fitting example dataset:
X, y = load_breast_cancer(return_X_y=True)
grid_search.fit(X, y)
# print the best tuning parameter and associated AUC score:
print([grid_search.best_params_, grid_search.best_score_])

It gives the following errors:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/pj/c4q3qfkx2119wyj2vbfr_yhr0000gn/T/ipykernel_28894/2314297278.py in <module>
     13 # load and fitting example dataset:
     14 X, y = load_breast_cancer(return_X_y=True)
---> 15 grid_search.fit(X, y)
     16 # print the best tuning parameter and associated AUC score:
     17 print([grid_search.best_params_, grid_search.best_score_])

~/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     61             extra_args = len(args) - len(all_args)
     62             if extra_args <= 0:
---> 63                 return f(*args, **kwargs)
     64 
     65             # extra_args > 0

~/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_search.py in fit(self, X, y, groups, **fit_params)
    839                 return results
    840 
--> 841             self._run_search(evaluate_candidates)
    842 
    843             # multimetric is determined here because in the case of a callable

~/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_search.py in _run_search(self, evaluate_candidates)
   1294     def _run_search(self, evaluate_candidates):
   1295         """Search all candidates in param_grid"""
-> 1296         evaluate_candidates(ParameterGrid(self.param_grid))
   1297 
   1298 

~/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_search.py in evaluate_candidates(candidate_params, cv, more_results)
    793                               n_splits, n_candidates, n_candidates * n_splits))
    794 
--> 795                 out = parallel(delayed(_fit_and_score)(clone(base_estimator),
    796                                                        X, y,
    797                                                        train=train, test=test,

~/opt/anaconda3/lib/python3.9/site-packages/joblib/parallel.py in __call__(self, iterable)
   1041             # remaining jobs.
   1042             self._iterating = False
-> 1043             if self.dispatch_one_batch(iterator):
   1044                 self._iterating = self._original_iterator is not None
   1045 

~/opt/anaconda3/lib/python3.9/site-packages/joblib/parallel.py in dispatch_one_batch(self, iterator)
    859                 return False
    860             else:
--> 861                 self._dispatch(tasks)
    862                 return True
    863 

~/opt/anaconda3/lib/python3.9/site-packages/joblib/parallel.py in _dispatch(self, batch)
    777         with self._lock:
    778             job_idx = len(self._jobs)
--> 779             job = self._backend.apply_async(batch, callback=cb)
    780             # A job can complete so quickly than its callback is
    781             # called before we get here, causing self._jobs to

~/opt/anaconda3/lib/python3.9/site-packages/joblib/_parallel_backends.py in apply_async(self, func, callback)
    206     def apply_async(self, func, callback=None):
    207         """Schedule a func to be run"""
--> 208         result = ImmediateResult(func)
    209         if callback:
    210             callback(result)

~/opt/anaconda3/lib/python3.9/site-packages/joblib/_parallel_backends.py in __init__(self, batch)
    570         # Don't delay the application, to avoid keeping the input
    571         # arguments in memory
--> 572         self.results = batch()
    573 
    574     def get(self):

~/opt/anaconda3/lib/python3.9/site-packages/joblib/parallel.py in __call__(self)
    260         # change the default number of processes to -1
    261         with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 262             return [func(*args, **kwargs)
    263                     for func, args, kwargs in self.items]
    264 

~/opt/anaconda3/lib/python3.9/site-packages/joblib/parallel.py in <listcomp>(.0)
    260         # change the default number of processes to -1
    261         with parallel_backend(self._backend, n_jobs=self._n_jobs):
--> 262             return [func(*args, **kwargs)
    263                     for func, args, kwargs in self.items]
    264 

~/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/fixes.py in __call__(self, *args, **kwargs)
    220     def __call__(self, *args, **kwargs):
    221         with config_context(**self.config):
--> 222             return self.function(*args, **kwargs)

~/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, return_estimator, split_progress, candidate_progress, error_score)
    584             cloned_parameters[k] = clone(v, safe=False)
    585 
--> 586         estimator = estimator.set_params(**cloned_parameters)
    587 
    588     start_time = time.time()

~/opt/anaconda3/lib/python3.9/site-packages/sklearn/pipeline.py in set_params(self, **kwargs)
    148         self
    149         """
--> 150         self._set_params('steps', **kwargs)
    151         return self
    152 

~/opt/anaconda3/lib/python3.9/site-packages/sklearn/utils/metaestimators.py in _set_params(self, attr, **params)
     52                 self._replace_estimator(attr, name, params.pop(name))
     53         # 3. Step parameters and other initialisation arguments
---> 54         super().set_params(**params)
     55         return self
     56 

~/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py in set_params(self, **params)
    228             key, delim, sub_key = key.partition('__')
    229             if key not in valid_params:
--> 230                 raise ValueError('Invalid parameter %s for estimator %s. '
    231                                  'Check the list of available parameters '
    232                                  'with `estimator.get_params().keys()`.' %

ValueError: Invalid parameter poly_degree for estimator Pipeline(steps=[('poly', PolynomialFeatures(include_bias=False)),
                ('alogistic', abessLogistic())]). Check the list of available parameters with `estimator.get_params().keys()`.
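
The traceback shows scikit-learn splitting each parameter name with `key.partition('__')`, so a pipeline parameter has to be written as the step name, a double underscore, and then the parameter name. The plain-Python sketch below reproduces only that splitting step to show why `poly_degree` is rejected while `poly__degree` resolves to the `poly` step; the helper `split_param` and the `steps` set are illustrative.

```python
def split_param(name):
    """Split a parameter name the way the traceback's key.partition('__') does."""
    step, delim, sub_key = name.partition("__")
    return step, delim, sub_key


steps = {"poly", "alogistic"}  # step names from the Pipeline in the report

for name in ["poly_degree", "poly__degree", "poly__interaction_only"]:
    step, delim, sub_key = split_param(name)
    print(name, "->", (step, sub_key), "valid step" if step in steps else "unknown key")
# poly_degree -> ('poly_degree', '') unknown key
# poly__degree -> ('poly', 'degree') valid step
# poly__interaction_only -> ('poly', 'interaction_only') valid step
```

Under this convention, the grid in the example would be keyed as `poly__interaction_only` and `poly__degree`.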

Thanks to the author for helping to respond to my question

Incorrect optimal support size in a multitask learning problem with nonnegativity constraints

Describe the bug
I was testing abess on the multitask learning problem below (deconvolution of a multichannel signal using a known point spread function / blur kernel by regressing the multichannel signal on shifted copies of the point spread function) but I am experiencing poor performance, where the best subset is inferred to be much larger than it actually is (support size of best subset = 157, while true support size of simulated dataset is 50). Is there a way to resolve this? Does this have to do with the fact that abess currently does not support nonnegativity constraints (all my simulated model coefficients are positive)? Is there any way to allow for nonnegativity or box constraints? Even if not taken into account during fitting it could be handy if the coefficients could be clipped within the allowed range, so that the information criteria would at least indicate the correct support size of the best allowed model. Taking into account the constraints during fitting would of course be even better. I was also interested in fitting a multitask identity link Poisson model - as currently only Poisson log link is incorporated I could potentially still get something close to that by using an observation weights matrix 1/(Y+0.1), which would be approx. 1/Poisson variance weights and then using family="mgaussian". Is that allowed by any chance? Can observation weights be a matrix with family="mgaussian", to allow for different observation weights per outcome channel/task? If not, could this be allowed for?

Code for Reproduction

Paste your code for reproducing the bug:

library(remotes)
remotes::install_github("tomwenseleers/L0glm/L0glm")
library(L0glm)
# simulate blurred multichannel spike train
set.seed(1)
s <- 0.1 # sparsity (% of timepoints where there is a peak)
p <- 500 # 500 variables
# simulate multichannel blurred spike train with Gaussian noise
sd_noise <- 1
sim <- simulate_spike_train(n=p, 
                            p=p, 
                            k=round(s*p), # true support size = 0.1*500 = 50
                            mean_beta = 10000,
                            sd_logbeta = 1,
                            family="gaussian",
                            sd_noise = sd_noise,
                            multichannel=TRUE, sparse=TRUE)
X <- sim$X # covariate matrix with shifted copies of point spread function, n x p matrix
Y <- sim$y # multichannel signal (blurred spike train), n x m matrix  
colnames(X) = paste0("x", 1:ncol(X)) # NOTE: if colnames of X and Y are not set abess gives an error message, maybe fix this?
colnames(Y) = paste0("y", 1:ncol(Y))
true_coefs <- sim$beta_true # true coefficients
m <- ncol(Y) # nr of tasks
n <- nrow(X) # nr of observations
p <- ncol(X) # nr of independent variables (shifted copies of point spread functions)
W <- 1/(Y+0.1) # approx 1/variance Poisson observation weights with family="poisson", n x m matrix

# best subset multitask learning using family="mgaussian" using abess
abess_fit <- abess(x = X, 
                               y = Y, 
                               # weights = 1/(Y+0.1),  # QUESTION: if I use family="poisson" above I would like to be able to use this matrix as approx. 1/Poisson variance observation weights to be able to fix identity link poisson using family="mgaussian" in abess - is this possible/allowed ?
                               family = "mgaussian",
                               tune.path = "sequence", 
                               support.size = c(1:200),
                               lambda=0, 
                               warm.start = TRUE,
                               tune.type = "gic") # or cv or ebic or bic or aic
plot(abess_fit, type="tune")  # optimal support size would come out at 150 when in fact it is 50 - I presume because my coefficients are constrained to be all positive, and abess currently does not allow for nonnegativity or box constraints?
beta_abess = as.matrix(extract(abess_fit)$beta) # coefficient matrix for best subset
library(qlcMatrix)
image(x=1:nrow(sim$y), y=1:ncol(sim$y), z=beta_abess^0.1, col = topo.colors(255),
      useRaster=TRUE,
      xlab="Time", ylab="Channel", main="abess multitask mgaussian (red=true peaks)")
abline(v=(1:nrow(sim$X))[as.vector(rowMax(sim$beta_true)!=0)], col="red")
abline(v=(1:nrow(sim$X))[as.vector(rowMin(beta_abess)!=0)], col="cyan")
sum(rowMax(sim$beta_true)!=0) # 50 true peaks
sum(rowMin(beta_abess)!=0) # 157 peaks detected

Expected behavior

Couple of questions here:

How could I improve performance for this problem? Is the problem mainly that abess currently does not allow for nonnegativity (or box) constraints on the coefficients? Would it be an option to allow two vectors, lower and upper, to be passed to abess to take possible box constraints on the coefficients into account? In the worst case these could just be enforced via clipping, in which case the correct support might still come out based on the provided information criteria or CV.
Even if I specify support.size = c(1:50), the resulting model with a support size of 50 still does not perform well. Does this also have to do with the absence of any option to enforce nonnegativity constraints?
Multitask identity-link Poisson models are currently not supported by abess. Would it be possible to pass a matrix 1/(Y+0.1) as observation weights (approximately 1/Poisson variance), which I could then combine with family="mgaussian" to get close to a multitask identity-link Poisson model?

Desktop (please complete the following information):

OS: Windows 11
R 4.2.3
Package Version : 0.4.7

Samples with zero weight can affect the result.

Describe the bug

I want to check whether setting a sample's weight to 0 is equivalent to removing that sample. However, in the following example, coef1 and coef2 are different. Does this mean that samples with zero weight can still affect the result?

Code for Reproduction

import numpy as np
from abess.linear import MultinomialRegression
from abess.datasets import make_multivariate_glm_data

n, p = 100, 50
np.random.seed(12345)
data = make_multivariate_glm_data(n=n, p=p, k=10, M=3, family='multinomial')

# construct dataset1 with 100 samples
X1, y1 = data.x, data.y
w1 = np.ones(n)
model1 = MultinomialRegression(support_size=10)
model1.fit(X1, y1, weight=w1)
coef1 = model1.coef_

# construct dataset2 by adding 100 different samples to dataset1
# simultaneously set weight to 0 for these additional 100 samples
X2 = np.vstack([data.x, -data.x])
y2 = np.vstack([data.y, data.y])
w2 = np.ones(n * 2)
w2[n:] = 0
model2 = MultinomialRegression(support_size=10)
model2.fit(X2, y2, weight=w2)
coef2 = model2.coef_

Thanks!
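The property being tested can be stated concretely for weighted least squares: a zero-weight sample contributes nothing to any weighted sum, so it cannot change the fit. The sketch below is a plain-Python illustration of that expectation only. It is not abess code, and the helper name weighted_line_fit is hypothetical.

```python
def weighted_line_fit(x, y, w):
    """Weighted least-squares fit of y = a + b * x (hypothetical helper)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return my - slope * mx, slope

# Dataset 1: four samples, all with weight 1.
x1 = [0.0, 1.0, 2.0, 3.0]
y1 = [1.0, 2.9, 5.1, 7.0]
fit1 = weighted_line_fit(x1, y1, [1.0] * 4)

# Dataset 2: append mirrored samples with weight 0, as in the report.
x2 = x1 + [-v for v in x1]
y2 = y1 + y1
w2 = [1.0] * 4 + [0.0] * 4
fit2 = weighted_line_fit(x2, y2, w2)
```

Here fit1 and fit2 agree because every statistic is weighted. An estimator that standardizes X or screens variables using unweighted statistics could still differ between the two datasets; whether that is what happens in abess is not established here.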

[bug] reported by CRAN

That is

BayesSUR KSgeneral L0Learn MrSGUIDE RPEGLMEN SCAT abess clustermq diseq
gaselect imager landsepi rayrender

which specify a C++ standard, usually C++11, but do not use it in their
configure script. See
https://cran.r-project.org/doc/manuals/r-devel/R-exts.html#Using-C_002b_002b-code

L0Learn already has a deadline for another issue: for the rest please
correct before 2023-02-18.

Note that in the cases we have looked at no C++ specification is
actually needed: the current default of C++14, soon to be C++17, works.
(And C++11 has been the default since R 3.6.2 in 2019.) So please try
removing it.

--
Brian D. Ripley, [email protected]
Emeritus Professor of Applied Statistics, University of Oxford

zsh: illegal hardware instruction python (mac m1 silicon)

When I run the following code in the terminal, I get the error "zsh: illegal hardware instruction python":

Python 3.9.7 (default, Sep 16 2021, 08:50:36) 
[Clang 10.0.0 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from abess.linear import abessLogistic
>>> from sklearn.datasets import load_breast_cancer
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.metrics import make_scorer, roc_auc_score
>>> from sklearn.preprocessing import PolynomialFeatures
>>> from sklearn.model_selection import GridSearchCV
>>> pipe = Pipeline([('poly', PolynomialFeatures(include_bias=False)), ('alogistic', abessLogistic())])
>>> param_grid = {'poly__interaction_only': [True, False],'poly__degree': [1, 2, 3]}
>>> scorer = make_scorer(roc_auc_score, greater_is_better=True)
>>> grid_search = GridSearchCV(pipe, param_grid, scoring=scorer, cv=5)
>>> X, y = load_breast_cancer(return_X_y=True)
>>> grid_search.fit(X, y)

Problem of computing GIC

I want to compute the GIC to select the true model, but I get different results from the abess package and from manual calculation.

    set.seed(2)
    p = 250
    N = 2500
    X = matrix(rnorm(N * p), ncol = p)
    A = sort(sample(p, 10))
    beta = rep(0, p)
    beta = replace(beta, A, rnorm(10, mean = 6))
    xbeta <- X %*% beta
    Y <- xbeta + rnorm(N)

Compute the estimator with the abess package.


    C = abess(X, Y, family = "gaussian", tune.path="sequence",tune.type = "gic")
    k = C$best.size
    mid=coef(abess(X, Y, family = "gaussian",support.size =k))
    Central =mid[2:(p+1)]
    intercept=mid[1]
    #compute GIC[10]=131.3686
    GIC= N*log(1/(2*N)*t(Y-X%*%Central-intercept)%*%(Y-X%*%Central-intercept))+k*log(p)*(log(log(N)))
    #GIC=-1601.499
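To make each term of the manual formula explicit, here is a minimal Python sketch of the same computation. It assumes the form used in the R snippet, N*log(loss) + k*log(p)*log(log(N)), and exposes the one choice that is not confirmed by the snippet: whether the loss is RSS/(2N) or RSS/N. The function name manual_gic and the toy residuals are illustrative, not part of abess.

```python
import math

def manual_gic(residuals, k, p, halve_loss=True):
    """N*log(loss) + k*log(p)*log(log(N)), with loss = RSS/(2N) or RSS/N."""
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    loss = rss / (2 * n) if halve_loss else rss / n
    return n * math.log(loss) + k * math.log(p) * math.log(math.log(n))

# Toy residuals (made up for illustration); RSS = 7.5, N = 8.
residuals = [0.5, -1.0, 1.5, -0.5, 1.0, -1.5, 0.5, -0.5]
gic_half = manual_gic(residuals, k=2, p=10)                    # loss = RSS/(2N)
gic_full = manual_gic(residuals, k=2, p=10, halve_loss=False)  # loss = RSS/N

# The two conventions differ by a constant N*log(2), independent of k.
offset = gic_full - gic_half
```

With N = 2500 the constant N*log(2) is about 1732.87, which matches the gap between the two values quoted above (131.3686 and -1601.499). That is consistent with the package normalizing the loss as RSS/N, although this sketch does not confirm it.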

Change wording of readme.md

Instead of writing

"The abess software both Python and R's interfaces."

I suggest

"The abess software has both Python and R interfaces."

Thanks for your project.

`SelectFromModel` in `scikit-learn` enables the `abess` estimator

Make the estimators in abess (like LinearRegression and LogisticRegression) usable via SelectFromModel. For details about sklearn.feature_selection.SelectFromModel, see https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html#sklearn.feature_selection.SelectFromModel.
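SelectFromModel keeps the features whose importance, read from a fitted estimator's coef_ or feature_importances_, reaches a threshold. The sketch below applies that rule to a sparse coefficient vector in plain Python rather than through scikit-learn. The helper select_from_coef, the threshold value, and the toy coefficients are illustrative assumptions.

```python
def select_from_coef(coef, threshold=1e-5):
    """Indices of features whose absolute coefficient reaches the threshold."""
    return [j for j, c in enumerate(coef) if abs(c) >= threshold]

# A sparse coefficient vector such as a best-subset fit might produce.
coef = [0.0, 2.3, 0.0, -1.1, 0.0]
selected = select_from_coef(coef)

# Reduce a design matrix to the selected columns.
X = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
X_reduced = [[row[j] for j in selected] for row in X]
```

Because a best-subset estimator already sets the excluded coefficients to exactly zero, a small positive threshold should recover its support.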

Question about getting the result of each splicing iteration

Hello! I want to get the result of each step of the splicing iteration, but I don't know how. The C++ code is difficult for me to follow. Is there a simple way to get these results?

Could not install python package on m1 pro silicon

I ran pip install abess in the terminal:

Last login: Tue Jan 18 20:32:17 on ttys000
(base) jiaqihu@Mac-M1 ~ % pip install abess

Collecting abess
  Using cached abess-0.3.6.tar.gz (1.5 MB)
Requirement already satisfied: numpy in ./opt/anaconda3/lib/python3.9/site-packages (from abess) (1.20.3)
Requirement already satisfied: scipy in ./opt/anaconda3/lib/python3.9/site-packages (from abess) (1.7.1)
Requirement already satisfied: scikit-learn>=0.24 in ./opt/anaconda3/lib/python3.9/site-packages (from abess) (0.24.2)
Requirement already satisfied: joblib>=0.11 in ./opt/anaconda3/lib/python3.9/site-packages (from scikit-learn>=0.24->abess) (1.1.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in ./opt/anaconda3/lib/python3.9/site-packages (from scikit-learn>=0.24->abess) (2.2.0)
Building wheels for collected packages: abess
  Building wheel for abess (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /Users/jiaqihu/opt/anaconda3/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/pj/c4q3qfkx2119wyj2vbfr_yhr0000gn/T/pip-install-ofm5rwzp/abess_e1c5333de72248a2bdb93137c36fb890/setup.py'"'"'; __file__='"'"'/private/var/folders/pj/c4q3qfkx2119wyj2vbfr_yhr0000gn/T/pip-install-ofm5rwzp/abess_e1c5333de72248a2bdb93137c36fb890/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /private/var/folders/pj/c4q3qfkx2119wyj2vbfr_yhr0000gn/T/pip-wheel-j5ai4o5i
       cwd: /private/var/folders/pj/c4q3qfkx2119wyj2vbfr_yhr0000gn/T/pip-install-ofm5rwzp/abess_e1c5333de72248a2bdb93137c36fb890/
  Complete output (19 lines):
  bash: /private/var/folders/pj/c4q3qfkx2119wyj2vbfr_yhr0000gn/T/pip-install-ofm5rwzp/abess_e1c5333de72248a2bdb93137c36fb890/copy_src.sh: No such file or directory
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib.macosx-10.9-x86_64-3.9
  creating build/lib.macosx-10.9-x86_64-3.9/abess
  copying /private/var/folders/pj/c4q3qfkx2119wyj2vbfr_yhr0000gn/T/pip-install-ofm5rwzp/abess_e1c5333de72248a2bdb93137c36fb890/abess/metrics.py -> build/lib.macosx-10.9-x86_64-3.9/abess
  copying /private/var/folders/pj/c4q3qfkx2119wyj2vbfr_yhr0000gn/T/pip-install-ofm5rwzp/abess_e1c5333de72248a2bdb93137c36fb890/abess/linear.py -> build/lib.macosx-10.9-x86_64-3.9/abess
  copying /private/var/folders/pj/c4q3qfkx2119wyj2vbfr_yhr0000gn/T/pip-install-ofm5rwzp/abess_e1c5333de72248a2bdb93137c36fb890/abess/cabess.py -> build/lib.macosx-10.9-x86_64-3.9/abess
  copying /private/var/folders/pj/c4q3qfkx2119wyj2vbfr_yhr0000gn/T/pip-install-ofm5rwzp/abess_e1c5333de72248a2bdb93137c36fb890/abess/datasets.py -> build/lib.macosx-10.9-x86_64-3.9/abess
  copying /private/var/folders/pj/c4q3qfkx2119wyj2vbfr_yhr0000gn/T/pip-install-ofm5rwzp/abess_e1c5333de72248a2bdb93137c36fb890/abess/__init__.py -> build/lib.macosx-10.9-x86_64-3.9/abess
  copying /private/var/folders/pj/c4q3qfkx2119wyj2vbfr_yhr0000gn/T/pip-install-ofm5rwzp/abess_e1c5333de72248a2bdb93137c36fb890/abess/bess_base.py -> build/lib.macosx-10.9-x86_64-3.9/abess
  copying /private/var/folders/pj/c4q3qfkx2119wyj2vbfr_yhr0000gn/T/pip-install-ofm5rwzp/abess_e1c5333de72248a2bdb93137c36fb890/abess/pca.py -> build/lib.macosx-10.9-x86_64-3.9/abess
  running build_ext
  building 'abess._cabess' extension
  swigging /private/var/folders/pj/c4q3qfkx2119wyj2vbfr_yhr0000gn/T/pip-install-ofm5rwzp/abess_e1c5333de72248a2bdb93137c36fb890/src/pywrap.i to /private/var/folders/pj/c4q3qfkx2119wyj2vbfr_yhr0000gn/T/pip-install-ofm5rwzp/abess_e1c5333de72248a2bdb93137c36fb890/src/pywrap_wrap.cpp
  swig -python -c++ -o /private/var/folders/pj/c4q3qfkx2119wyj2vbfr_yhr0000gn/T/pip-install-ofm5rwzp/abess_e1c5333de72248a2bdb93137c36fb890/src/pywrap_wrap.cpp /private/var/folders/pj/c4q3qfkx2119wyj2vbfr_yhr0000gn/T/pip-install-ofm5rwzp/abess_e1c5333de72248a2bdb93137c36fb890/src/pywrap.i
  error: command 'swig' failed: No such file or directory
  ----------------------------------------
  ERROR: Failed building wheel for abess
  Running setup.py clean for abess
Failed to build abess
Installing collected packages: abess
    Running setup.py install for abess ... error
    ERROR: Command errored out with exit status 1:
     command: /Users/jiaqihu/opt/anaconda3/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/pj/c4q3qfkx2119wyj2vbfr_yhr0000gn/T/pip-install-ofm5rwzp/abess_e1c5333de72248a2bdb93137c36fb890/setup.py'"'"'; __file__='"'"'/private/var/folders/pj/c4q3qfkx2119wyj2vbfr_yhr0000gn/T/pip-install-ofm5rwzp/abess_e1c5333de72248a2bdb93137c36fb890/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/pj/c4q3qfkx2119wyj2vbfr_yhr0000gn/T/pip-record-6eee5eyp/install-record.txt --single-version-externally-managed --compile --install-headers /Users/jiaqihu/opt/anaconda3/include/python3.9/abess
         cwd: /private/var/folders/pj/c4q3qfkx2119wyj2vbfr_yhr0000gn/T/pip-install-ofm5rwzp/abess_e1c5333de72248a2bdb93137c36fb890/
    Complete output (19 lines):
    bash: /private/var/folders/pj/c4q3qfkx2119wyj2vbfr_yhr0000gn/T/pip-install-ofm5rwzp/abess_e1c5333de72248a2bdb93137c36fb890/copy_src.sh: No such file or directory
    running install
    running build
    running build_py
    creating build
    creating build/lib.macosx-10.9-x86_64-3.9
    creating build/lib.macosx-10.9-x86_64-3.9/abess
    copying /private/var/folders/pj/c4q3qfkx2119wyj2vbfr_yhr0000gn/T/pip-install-ofm5rwzp/abess_e1c5333de72248a2bdb93137c36fb890/abess/metrics.py -> build/lib.macosx-10.9-x86_64-3.9/abess
    copying /private/var/folders/pj/c4q3qfkx2119wyj2vbfr_yhr0000gn/T/pip-install-ofm5rwzp/abess_e1c5333de72248a2bdb93137c36fb890/abess/linear.py -> build/lib.macosx-10.9-x86_64-3.9/abess
    copying /private/var/folders/pj/c4q3qfkx2119wyj2vbfr_yhr0000gn/T/pip-install-ofm5rwzp/abess_e1c5333de72248a2bdb93137c36fb890/abess/cabess.py -> build/lib.macosx-10.9-x86_64-3.9/abess
    copying /private/var/folders/pj/c4q3qfkx2119wyj2vbfr_yhr0000gn/T/pip-install-ofm5rwzp/abess_e1c5333de72248a2bdb93137c36fb890/abess/datasets.py -> build/lib.macosx-10.9-x86_64-3.9/abess
    copying /private/var/folders/pj/c4q3qfkx2119wyj2vbfr_yhr0000gn/T/pip-install-ofm5rwzp/abess_e1c5333de72248a2bdb93137c36fb890/abess/__init__.py -> build/lib.macosx-10.9-x86_64-3.9/abess
    copying /private/var/folders/pj/c4q3qfkx2119wyj2vbfr_yhr0000gn/T/pip-install-ofm5rwzp/abess_e1c5333de72248a2bdb93137c36fb890/abess/bess_base.py -> build/lib.macosx-10.9-x86_64-3.9/abess
    copying /private/var/folders/pj/c4q3qfkx2119wyj2vbfr_yhr0000gn/T/pip-install-ofm5rwzp/abess_e1c5333de72248a2bdb93137c36fb890/abess/pca.py -> build/lib.macosx-10.9-x86_64-3.9/abess
    running build_ext
    building 'abess._cabess' extension
    swigging /private/var/folders/pj/c4q3qfkx2119wyj2vbfr_yhr0000gn/T/pip-install-ofm5rwzp/abess_e1c5333de72248a2bdb93137c36fb890/src/pywrap.i to /private/var/folders/pj/c4q3qfkx2119wyj2vbfr_yhr0000gn/T/pip-install-ofm5rwzp/abess_e1c5333de72248a2bdb93137c36fb890/src/pywrap_wrap.cpp
    swig -python -c++ -o /private/var/folders/pj/c4q3qfkx2119wyj2vbfr_yhr0000gn/T/pip-install-ofm5rwzp/abess_e1c5333de72248a2bdb93137c36fb890/src/pywrap_wrap.cpp /private/var/folders/pj/c4q3qfkx2119wyj2vbfr_yhr0000gn/T/pip-install-ofm5rwzp/abess_e1c5333de72248a2bdb93137c36fb890/src/pywrap.i
    error: command 'swig' failed: No such file or directory
    ----------------------------------------
ERROR: Command errored out with exit status 1: /Users/jiaqihu/opt/anaconda3/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/pj/c4q3qfkx2119wyj2vbfr_yhr0000gn/T/pip-install-ofm5rwzp/abess_e1c5333de72248a2bdb93137c36fb890/setup.py'"'"'; __file__='"'"'/private/var/folders/pj/c4q3qfkx2119wyj2vbfr_yhr0000gn/T/pip-install-ofm5rwzp/abess_e1c5333de72248a2bdb93137c36fb890/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/pj/c4q3qfkx2119wyj2vbfr_yhr0000gn/T/pip-record-6eee5eyp/install-record.txt --single-version-externally-managed --compile --install-headers /Users/jiaqihu/opt/anaconda3/include/python3.9/abess Check the logs for full command output.

Installation Error: Could not find a version that satisfies the requirement setuptools>=42

Installation Error

When I try to install abess from GitHub directly using the command

cd ./python
pip install .

some errors were reported as shown in the following picture

Is there anything wrong with my computer or the source code of abess?

Thanks!

support exponential correlation structure for the `abess` python library

abess would be even more helpful if make_glm_data and make_multivariate_glm_data supported simulating datasets with an exponential correlation structure. This structure is widely considered in the recent literature (e.g. https://arxiv.org/pdf/2104.12576.pdf).
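As a sketch of what is being requested, assume "exponential correlation structure" means a covariance matrix with entries rho^|i-j|. Rows with that correlation can be drawn with a first-order autoregressive recursion, without forming the matrix at all. The function names below are hypothetical and are not part of abess.datasets.

```python
import math
import random

def exponential_corr_matrix(p, rho):
    """Matrix with entries rho ** |i - j|."""
    return [[rho ** abs(i - j) for j in range(p)] for i in range(p)]

def sample_row(p, rho, rng):
    """One row with unit variances and Corr(x_i, x_j) = rho ** |i - j|."""
    x = [rng.gauss(0.0, 1.0)]
    scale = math.sqrt(1.0 - rho * rho)  # keeps each marginal variance at 1
    for _ in range(1, p):
        x.append(rho * x[-1] + scale * rng.gauss(0.0, 1.0))
    return x

rng = random.Random(0)
sigma = exponential_corr_matrix(4, 0.5)
X = [sample_row(4, 0.5, rng) for _ in range(5)]
```

The recursion costs O(p) per row, which matters when p is in the thousands.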

One-from-group or N-from-group

Best Subset of Group Selection allows for all members of non-overlapping groups to be selected, but would the converse be possible, i.e. choosing only one variable from each group? This scheme could be extended to N-from-group, or allow the user to set a maximum proportion of each group (e.g. 50%) which should be selected.

If any of this is possible with the current version of the software, I would be grateful to know how it can be implemented.

[R-package] abess fails with one feature

Describe the bug

abess::abess() from R package fails when there is only one feature.

Code for Reproduction

library(abess)
library(data.table)

x = data.table(x = sample(100, size = 100))
y = factor(sample(c("a", "b", "c"), size = 100, replace = TRUE))

abess(x = x, y = y, family = "multinomial")
#> Error in Matrix::Matrix(x[, -y_dim], sparse = TRUE, dimnames = list(vn, : length of 'dimnames' [1] not equal to array extent

Created on 2023-03-01 with reprex v2.0.2

Version: ‘0.4.7’

This did not happen in previous releases (you broke the CI of mlr3extralearners)

Cox model: c-index

Hello, I'm using the package to analyze real survival data. I find that cross-validation can only use the deviance on the test fold to determine the support size. Could you add the C-index as a criterion for cross-validation of the Cox model?
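For reference, the criterion being requested can be written down in a few lines. The sketch below computes Harrell's concordance index in plain Python, assuming a higher risk score means a shorter expected survival time. It illustrates the metric only and says nothing about how abess would integrate it; the function name and toy data are illustrative.

```python
def concordance_index(time, event, risk):
    """Harrell's C: share of comparable pairs that the risk score orders correctly."""
    concordant = 0.0
    comparable = 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue  # a pair is comparable only if the earlier time is an observed event
        for j in range(n):
            if time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # tied risk scores count as half
    return concordant / comparable

time = [2, 4, 6, 8]
event = [1, 1, 0, 1]          # 0 marks a censored observation
good_risk = [3.0, 2.0, 1.5, 0.5]
bad_risk = [0.5, 1.5, 2.0, 3.0]
c_good = concordance_index(time, event, good_risk)
c_bad = concordance_index(time, event, bad_risk)
```

The function divides by the number of comparable pairs, so a fold without any comparable pair (for example, all observations censored) would need separate handling.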

Why does the data generator function `make_glm_data` for gamma define n shape parameters for a data set?

Describe the bug
Why does the data generator function make_glm_data for gamma define n shape parameters for a data set?

elif family == "gamma":
            x = x / 16
            m = 5 * np.sqrt(2 * np.log(p) / n)
            if coef_ is None:
                Tbeta[nonzero] = np.random.uniform(m, 100 * m, k) * sign
            else:
                Tbeta = coef_
            # add noise
            eta = x @ Tbeta + np.random.normal(0, sigma, n)
            # set coef_0 to make eta<0
            eta = eta - np.abs(np.max(eta)) - 10
            eta = -1 / eta
            # set the shape para of gamma uniformly in [0.1,100.1]
            shape_para = 100 * np.random.uniform(0, 1, n) + 0.1
            y = np.random.gamma(
                shape=shape_para,
                scale=eta / shape_para,
                size=n)

Additional context

Would it be more sensible for all samples in a data set to share the same shape parameter?
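The alternative being suggested can be sketched with the standard library: keep one shape parameter for the whole data set and let only the scale vary with the mean, so that E[y_i] equals the mean of sample i. The helper gamma_response and the chosen values are illustrative assumptions, not a patch to make_glm_data.

```python
import random

def gamma_response(means, shape, rng):
    """Draw y_i ~ Gamma(shape, scale = mean_i / shape), so that E[y_i] = mean_i."""
    return [rng.gammavariate(shape, m / shape) for m in means]

rng = random.Random(0)
means = [0.5] * 20000          # per-sample means (eta in the excerpt above)
y = gamma_response(means, shape=5.0, rng=rng)
sample_mean = sum(y) / len(y)  # should be close to 0.5
```

With a shared shape the dispersion 1/shape is the same for every sample, which is the usual assumption in a gamma GLM.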

Incorporation of `fit_intercept` to `LinearRegression`?

Would it be possible to add a fit_intercept parameter to LinearRegression, similar to the sklearn linear regression implementation?

Request for the Generalization to Ordinal Logistic Regression

Your method performs quite well on high-dimensional linear models. I am curious whether it would also work for ordinal logistic regression, which seems to be a meaningful generalization. Will you implement this feature in the near future? Thanks!

All types of tune go wrong in PCA model

Describe the bug

All tune types go wrong in the PCA model, including "gic", "aic", "bic", "ebic" and "cv". Specifically, all information criteria return 0, and the result of the "cv" method decreases monotonically as support_size increases, so it is useless for selecting support_size.

Code for Reproduction

n <- 10000
p <- 5
support_size <- 3
dataset <- generate.spc.matrix(n, p, support_size, snr = 100, seed = 1)

for(ic_type in c("gic", "aic", "bic", "ebic")){
  spca_fit <- abesspca(dataset[["x"]], tune.type = ic_type)
  if(all(spca_fit[["tune.value"]] == 0)){
    print(sprintf("tune.value of %s is all zero!", ic_type))
  } 
}

spca_fit2 <- abesspca(dataset[["x"]], tune.type = "cv")
if(!is.unsorted(-spca_fit2[["tune.value"]])){
  print("tune.value of cv is sorted!")
}

Results:

[1] "tune.value of gic is all zero!"
[1] "tune.value of aic is all zero!"
[1] "tune.value of bic is all zero!"
[1] "tune.value of ebic is all zero!"
[1] "tune.value of cv is sorted!"

Desktop (please complete the following information):

OS: x86_64-w64-mingw32
R version 4.2.1
Package Version 0.4.6

the class `OrdinalRegression` can’t be imported from the module `abess.linear`

Hello, I'm trying to run make html on the command line to convert comments to .html files, but there are some warnings and errors when compiling the basic examples such as Cox regression. The warnings show that the class OrdinalRegression can't be imported from the module abess.linear. I then tried deleting my comments, but the warnings and errors remain.
Here are the warnings:

WARNING: autodoc: failed to import class 'OrdinalRegression' from module 'abess.linear'
AttributeError: module 'abess.linear' has no attribute 'OrdinalRegression'
checking consistency... D:\My app\GithubDesktop\abess\docs\Tutorial\1-glm\README.rst: WARNING: document isn't included in any toctree
D:\My app\GithubDesktop\abess\docs\Tutorial\2-pca\README.rst: WARNING: document isn't included in any toctree
app\GithubDesktop\abess\docs\Tutorial\3-advanced features\README.rst: WARNING: document isn't included in any toctree
D:\My app\GithubDesktop\abess\docs\Tutorial\4-computation-tips\README.rst: WARNING: document isn't included in any toctree
D:\My app\GithubDesktop\abess\docs\Tutorial\5-scikit-learn-connection\README.rst: WARNING: document isn't included in any toctree
D:\My app\GithubDesktop\abess\docs\Tutorial\README.rst: WARNING: document isn't included in any toctree

Problem with abessMultinomial

When I want to use abess for multi-class classification, I use abessMultinomial and run the example code:

from abess.linear import abessMultinomial
from abess.datasets import make_multivariate_glm_data
import numpy as np
np.random.seed(12345)
data = make_multivariate_glm_data(n = 100, p = 50, k = 10, M = 3, family = 'multinomial')
model = abessMultinomial(support_size = [10])
model.fit(data.x, data.y)
model.predict(data.x)

I get the int 47, but I think it should return an array of classes such as:

array([[1., 0., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [1., 0., 0.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.],

I'm sorry to bother you again, and thanks a lot for your help.
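When predictions are returned as one-hot rows like the array above, class indices can be recovered by taking the position of the largest entry in each row. This is a plain-Python sketch with a hypothetical helper name. It does not explain why predict returned an integer in the report.

```python
def one_hot_to_labels(rows):
    """Index of the largest entry in each row (argmax per row)."""
    return [max(range(len(r)), key=r.__getitem__) for r in rows]

pred = [[1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0],
        [0.0, 1.0, 0.0]]
labels = one_hot_to_labels(pred)
```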

Handling an input matrix X that contains a constant column is more complicated in `LinearRegression` than in `scikit-learn`

When the input matrix X contains a constant column, the LinearRegression() class in the abess package returns nan predictions instead of estimated values, whereas the scikit-learn class LassoCV() returns estimates. One way to avoid this is to set the parameter is_normal=False; however, this is not what users expect and not how scikit-learn works. Since I have encountered this many times, I wonder whether you could improve this API. The following code describes the case concisely:

cross validation in R package

Describe the bug
The cross-validation result from abess is not the same as the result computed manually in R.

The code to reproduce

library(abess)
n <- 100
p <- 200
support.size <- 3
dataset <- generate.data(n, p, support.size, seed = 1)
ss <- 0:10

nfolds <- 5
foldid <- rep(1:nfolds, ceiling(n / nfolds))[1:n]
abess_fit <- abess(dataset[["x"]], dataset[["y"]], 
                   tune.type = "cv", nfolds = nfolds, 
                   foldid = foldid, support.size = ss, num.threads = 1)

cv <- rep(0, length(ss))
for (k in 1:nfolds) {
  abess_fit_k <- abess(dataset[["x"]][foldid != k, ], 
                       dataset[["y"]][foldid != k], support.size = ss)
  y_hat_k <- predict(abess_fit_k, dataset[["x"]][foldid == k, ], 
                     support.size = ss)
  fold_cv <- apply(y_hat_k, 2, function(yh) {
    mean((dataset[["y"]][foldid == k] - yh)^2)
  })
  fold_cv <- round(fold_cv, digits = 2)
  print(fold_cv)
  cv <- cv + fold_cv
}
cv <- cv / nfolds
names(cv) <- NULL
all.equal(cv, abess_fit$tune.value, digits = 2)

Expected behavior
The output of all.equal(cv, abess_fit$tune.value, digits = 2) is TRUE. However, the output is "Mean relative difference: 0.0008444762".
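One detail of the reproduction is that each fold's error is rounded to two digits before averaging, while all.equal compares against a default tolerance of about 1.5e-8. The sketch below, in Python with made-up per-fold errors, shows that rounding alone can produce a relative difference of order 1e-3. It does not establish that rounding is the only source of the reported gap.

```python
def mean(values):
    return sum(values) / len(values)

# Hypothetical per-fold mean squared errors for one support size.
fold_mse = [1.2345, 0.9871, 1.1049, 1.0563, 0.9432]

exact = mean(fold_mse)
rounded = mean([round(v, 2) for v in fold_mse])  # mirrors round(fold_cv, digits = 2)

# Relative difference, in the spirit of the "Mean relative difference" message.
rel_diff = abs(rounded - exact) / abs(exact)
```

Dropping the rounding step in the manual loop, or passing an explicit tolerance to all.equal, would separate rounding error from any genuine discrepancy.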

System info

platform       x86_64-apple-darwin17.0     
arch           x86_64                      
os             darwin17.0                  
system         x86_64, darwin17.0          
status                                     
major          4                           
minor          1.0                         
year           2021                        
month          05                          
day            18                          
svn rev        80317                       
language       R                           
version.string R version 4.1.0 (2021-05-18)
nickname       Camp Pontanezen

Possible to include calculation of Bayes Factor according to Dunstan et al (2022)?

Dunstan et al., in "Easy computation of the Bayes factor to fully quantify Occam’s razor in least-squares fitting and to guide actions", present "an easy way of calculating [the Bayes Factor] so that it can be routinely used with all least-squares fitting to complement and augment other figures of merit".

Would it be possible to include these calculations as an alternative ic_type?

Error: invalid register for .seh_savexmm

I tried to install the latest version of the abess Python package, and I followed the guideline in https://abess.readthedocs.io/en/latest/Installation.html.
But the error Error: invalid register for .seh_savexmm occurs.

My system information:

Platform Version: Windows-10-10.0.22000-SP0, 64bit
Python Version: 3.7.4
'CPU info:  Intel64 Family 6 Model 85 Stepping 4, GenuineIntel'
Package Version: 0.4.0

Illegal instruction with conda-forge package

In an HPC environment (Intel, Linux), even a very simple script gives Illegal instruction errors. The following suffices to generate the error:

import numpy as np
import abess
model = abess.LinearRegression()
model.fit(np.random.rand(2,2), np.random.rand(2))

However, this occurs only with the latest abess binary from conda-forge. If I use pip to install the package, everything runs smoothly.

memory out when combining abess with auto-sklearn for classification.

Describe the bug
I'm running some experiments that combine abess with auto-sklearn. When using MultinomialRegression for classification, memory usage increases very quickly, so much that it cannot be displayed on a web page. With LinearRegression there is no similar out-of-memory problem.

Code for Reproduction

My code is given as follows:

from pprint import pprint

from ConfigSpace.configuration_space import ConfigurationSpace
from ConfigSpace.hyperparameters import CategoricalHyperparameter, \
    UniformIntegerHyperparameter, UniformFloatHyperparameter

import sklearn.metrics
import autosklearn.classification
import autosklearn.pipeline.components.classification
from autosklearn.pipeline.components.base \
    import AutoSklearnClassificationAlgorithm
from autosklearn.pipeline.constants import DENSE, SIGNED_DATA, UNSIGNED_DATA, \
    PREDICTIONS

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import openml
from abess import MultinomialRegression
from sklearn.ensemble import RandomForestClassifier
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn import preprocessing
from sklearn.tree import DecisionTreeClassifier

import time
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")

class AbessClassifier(AutoSklearnClassificationAlgorithm):

    def __init__(self, exchange_num, random_state=None):
        self.exchange_num = exchange_num
        self.random_state = random_state
        self.estimator = None

    def fit(self, X, y):
        from abess import MultinomialRegression
        self.estimator = MultinomialRegression()
        self.estimator.fit(X, y)
        return self

    def predict(self, X):
        if self.estimator is None:
            raise NotImplementedError
        return self.estimator.predict(X)
    
    def predict_proba(self, X):
        if self.estimator is None:
            raise NotImplementedError()
        return self.estimator.predict_proba(X)

    @staticmethod
    def get_properties(dataset_properties=None):
        return {
            'shortname': 'abess Classifier',
            'name': 'abess logistic Classifier',
            'handles_regression': False,
            'handles_classification': True,
            'handles_multiclass': True,
            'handles_multilabel': False,
            'handles_multioutput': False,
            'is_deterministic': False,
            # Both input and output must be tuple(iterable)
            'input': [DENSE, SIGNED_DATA, UNSIGNED_DATA],
            'output': [PREDICTIONS]
        }
    
    @staticmethod
    def get_hyperparameter_search_space(dataset_properties=None):
        cs = ConfigurationSpace() 
        exchange_num=UniformIntegerHyperparameter(
            name='exchange_num', lower=4, upper=6, default_value=5
        )
        cs.add_hyperparameters([exchange_num])
        return cs
    
# Add abess logistic classifier component to auto-sklearn.
autosklearn.pipeline.components.classification.add_classifier(AbessClassifier)
cs = AbessClassifier.get_hyperparameter_search_space()
print(cs)

dataset = fetch_openml(data_id = int(29),as_frame=True)#507,183,44136
X=dataset.data
y=dataset.target
X.replace([np.inf, -np.inf], np.nan, inplace=True)  # np.nan: the np.NaN alias was removed in NumPy 2.0
## Remove rows with NaN or Inf values
inx=X[X.isna().values==True].index.unique()
X.drop(inx,inplace=True)
y.drop(inx,inplace=True)
##use dummy variables to replace classification variables:
X = pd.get_dummies(X)
## Keep only numeric columns
X = X.select_dtypes(np.number)
## Remove columns with NaN or Inf values
nan = np.isnan(X).any()[np.isnan(X).any() == True]
inf = np.isinf(X).any()[np.isinf(X).any() == True]
X = X.drop(columns = list(nan.index))
X = X.drop(columns = list(inf.index))
## Encode target labels with values between 0 and n_classes-1
le = preprocessing.LabelEncoder()
y = le.fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X, y)
print(X_train.shape) #number of initial features
print(X_test.shape) #number of initial features

cls = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=60,
    per_run_time_limit=10,
    include={
            'classifier': ['AbessClassifier'],
            'feature_preprocessor': ['polynomial']
        },
    memory_limit=6144,
    ensemble_size=1,
)
cls.fit(X_train, y_train, X_test, y_test)
predictions = cls.predict(X_test)
print("Accuracy score", sklearn.metrics.accuracy_score(y_test, predictions))

After running this code, the memory reaches about 159MB, which makes the .ipynb hard for users to open. Again, regression does not run into this out-of-memory problem.

Can't find cpp function

In my fork, the development branch "bbayukari/abess/develop-ordinal" worked well before, but the new version in "bbayukari/abess/ordinal", which I want to submit as a PR to upstream, does not work (not only my part fails; none of the models can find the API). In the new version, the C++ code is located in abess/src, which is outside R-package; that is the only difference between the two versions.

After cloning the code in "bbayukari/abess/ordinal", I ran "install and restart" and tested with a simple example:

dataset <- generate.data(150,100,3)
abess(dataset[["x"]],dataset[["y"]])

This gives:

Error in abessGLM_API(x = x, y = y, n = nobs, p = nvars, normalize_type = normalize,  : 
  could not find function "abessGLM_API"
Called from: abess.default(dataset[["x"]], dataset[["y"]])

Cross validation is slower in version 0.4.5 than in 0.4.0

Describe the bug

In my experiments, after updating abess from 0.4.0 to 0.4.5, I found that the CV procedure became slower in some cases. The following code provides an example.

Code for Reproduction

library(microbenchmark)
library(abess)
n <- 3000
p <- 500
support.size <- 10

sim_once <- function(seed) {
  dataset <- generate.data(n, p, support.size, family = "binomial", seed = seed)

  time_cv <- microbenchmark(
    abess_fit <- abess(dataset[["x"]], dataset[["y"]], family = "binomial", tune.type = "cv", nfolds = 10),
    times = 1
  ) [["time"]] / 10^9

  time_cv
}

# average time
time <- sapply(1:5, sim_once)
mean(time)

OrdinalRegression() in cross_validate raises a TypeError saying no scoring is specified

Describe the bug

Hello, I tried to use OrdinalRegression() in cross_validate, but it raises a TypeError saying that no scoring is specified. Here is a demo to reproduce it.

Code for Reproduction

# %%
from sklearn.model_selection import cross_validate
from abess import OrdinalRegression, make_glm_data
import numpy as np

np.random.seed(0)
data = make_glm_data(n=1000, p=200, k=30, family='ordinal')
model = OrdinalRegression()
result = cross_validate(model, data.x, data.y)

Expected behavior

Return the CV score of the 'OrdinalRegression' model.

Desktop (please complete the following information):

OS: Windows 10
Python Version 3.9.7
abess Version 0.4.5
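
A possible workaround can be sketched as follows, assuming the TypeError comes from the estimator having no score() method (the report does not confirm the cause). scikit-learn's cross_validate accepts a `scoring` callable with the signature (estimator, X, y). The MeanPredictor class and the neg_mse scorer below are hypothetical stand-ins, not part of abess.

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.model_selection import cross_validate


class MeanPredictor(BaseEstimator):
    """Hypothetical stand-in estimator that deliberately has no score()."""

    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)


def neg_mse(estimator, X, y):
    """Scorer with the (estimator, X, y) signature that scikit-learn accepts."""
    residual = np.asarray(y) - estimator.predict(X)
    return -float(np.mean(residual ** 2))


rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = rng.integers(0, 3, size=60)

# Without `scoring`, scikit-learn falls back to estimator.score and fails.
try:
    cross_validate(MeanPredictor(), X, y, cv=3)
    raised = False
except TypeError:
    raised = True

# With an explicit scorer, cross-validation runs.
result = cross_validate(MeanPredictor(), X, y, cv=3, scoring=neg_mse)
print(raised, result["test_score"].shape)
```

The same `scoring=` argument could be passed with the real model in place of MeanPredictor; which metric is appropriate for ordinal responses is left open here.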

Import error

When I import abess, every Python version I tried (3.5, 3.6, 3.7) reports the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\ProgramData\Anaconda3\envs\cp3710\lib\site-packages\abess\__init__.py", line 9, in <module>
    from abess.linear import abessLogistic, abessLm, abessCox, abessPoisson, abessMultigaussian, abessMultinomial
  File "C:\ProgramData\Anaconda3\envs\cp3710\lib\site-packages\abess\linear.py", line 3, in <module>
    from .bess_base import bess_base
  File "C:\ProgramData\Anaconda3\envs\cp3710\lib\site-packages\abess\bess_base.py", line 7, in <module>
    from abess.cabess import pywrap_abess
  File "C:\ProgramData\Anaconda3\envs\cp3710\lib\site-packages\abess\cabess.py", line 13, in <module>
    from . import _cabess
ImportError: DLL load failed: The specified module could not be found.

I have tried everything I could find on the web, but nothing works. Could anyone help me?

There are some warnings when calling abess in DoubleML.

I use an abess learner in DoubleMLPLR, but there are some warnings. The following is my code:

import numpy as np
from doubleml import DoubleMLData, DoubleMLPLR
from sklearn.base import clone
from abess.linear import LinearRegression

# Simulate a partially linear model with treatment effect theta
n_obs = 500
n_vars = 100
theta = 3
X = np.random.normal(size=(n_obs, n_vars))
d = np.dot(X[:, :3], np.array([5, 5, 5])) + np.random.standard_normal(size=(n_obs,))
y = theta * d + np.dot(X[:, :3], np.array([5, 5, 5])) + np.random.standard_normal(size=(n_obs,))
dml_data_sim = DoubleMLData.from_arrays(X, y, d)

# Use abess as the learner for both nuisance functions
abess = LinearRegression(cv=5)
ml_g_abess = clone(abess)
ml_m_abess = clone(abess)
dml_plr_abess = DoubleMLPLR(dml_data_sim, ml_g_abess, ml_m_abess)
dml_plr_abess.fit()

After running the code, there will be warnings like "Learner provided for ml_g is probably invalid".

Add new Arxiv reference 2110.09697 to readme

Mention new reference https://arxiv.org/abs/2110.09697 in readme.

Apparently Inconsistent Behaviour in LinearRegression - Expected Behaviour or Not?

I am getting apparently inconsistent results from LinearRegression depending on how I specify the support_size.

Essentially, when using np.nonzero(model.coef_) to obtain the support set, I get inconsistent results between the following:

support_size = 1 : 'A' is chosen.
support_size = 2 : 'A' and 'B' are chosen.
support_size = [1,2] : 'A' and 'C' are chosen.

'A', 'B', and 'C' are all 'correct' to some extent, but one issue I am facing is that I cannot get 'C' to appear in a support set using a single value for support_size unless that value is much larger than it needs to be, in this case support_size = 18. All other arguments are their default values.

Before we delve into what might be happening, I guess I need to ask if this is expected behaviour or not?
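
To make the comparison concrete, here is a minimal sketch of how the three support sets can be extracted and compared. The coefficient vectors are made-up placeholders that mirror the pattern described above, not real abess output; columns 0, 1 and 2 play the roles of 'A', 'B' and 'C'.

```python
import numpy as np


def support_set(coef):
    """Return the indices of the non-zero coefficients as a frozenset."""
    return frozenset(np.flatnonzero(np.asarray(coef)).tolist())


# Placeholder coefficient vectors (illustrative only, not real results).
coefs = {
    "support_size=1": [1.5, 0.0, 0.0, 0.0],
    "support_size=2": [1.2, 0.7, 0.0, 0.0],
    "support_size=[1, 2]": [1.1, 0.0, 0.9, 0.0],
}

supports = {label: support_set(c) for label, c in coefs.items()}
for label, s in supports.items():
    print(label, sorted(s))

# In this illustration the range-based fit matches neither fixed-size fit.
mismatch = supports["support_size=[1, 2]"] not in (
    supports["support_size=1"],
    supports["support_size=2"],
)
print("range result differs from both fixed sizes:", mismatch)
```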

Segmentation fault using v0.4.6 on Linux

Describe the bug

After installing v0.4.6, any call to the abess library causes a segmentation fault.

The same simple script with v0.4.5 results in a normal execution.

Code for Reproduction

from abess.linear import LinearRegression
from abess.datasets import make_glm_data
import numpy as np
np.random.seed(12345)
data = make_glm_data(n = 100, p = 50, k = 10, family = 'gaussian')
model = LinearRegression(support_size = 10)
model.fit(data.x, data.y)

print(model.predict(data.x)[:4])

Using v0.4.6 results in the following error:

Expected behavior

I expected the predictions to be printed without segmentation fault.

Example run with v0.4.5:

Desktop (please complete the following information):

OS: Linux-5.4.0-1098-azure-x86_64-with-glibc2.17, 64bit
Python Version: 3.8.12
Package Version: 0.4.6

How can I avoid the intercept when I use abess in R?

I'm trying to use abess in R, and I didn't find any argument for dropping the intercept. I also tried the formula y ~ 0 + x to avoid the intercept, but that failed.

Is there a way to do that?

install R package from github via devtools::install_github()

I ran into trouble when installing the R package abess directly from GitHub.

I deleted the original abess and then ran "devtools::install_github("abess-team/abess/R-package")" in RStudio. This is the error output:

Downloading GitHub repo abess-team/abess@HEAD
These packages have more recent versions available.
It is recommended to update all of them.
Which would you like to update?

1: All
2: CRAN packages only
3: None
4: Rcpp (1.0.7 -> 1.0.8) [CRAN]

Enter one or more numbers, or an empty line to skip updates:
√ checking for file 'C:\Users\ustc\AppData\Local\Temp\RtmpKkyovB\remotes32f06dd94478\abess-team-abess-4d97f97\R-package/DESCRIPTION' ...

preparing 'abess':
√ checking DESCRIPTION meta-information ...
cleaning src
installing the package to process help pages
-----------------------------------
installing source package 'abess' ...
** using staged installation
cp: cannot stat '../include': No such file or directory
cp: cannot stat '../src/.cpp': No such file or directory
cp: cannot stat '../src/.h': No such file or directory
ERROR: configuration failed for package 'abess'
removing 'C:/Users/ustc/AppData/Local/Temp/RtmpEXjWyu/Rinst72056862b16/abess'
-----------------------------------
ERROR: package installation failed
Error: Failed to install 'abess' from GitHub:
System command 'Rcmd.exe' failed, exit status: 1, stdout + stderr (last 10 lines):
E> -----------------------------------
E> * installing source package 'abess' ...
E> ** using staged installation
E> cp: cannot stat '../include': No such file or directory
E> cp: cannot stat '../src/.cpp': No such file or directory
E> cp: cannot stat '../src/.h': No such file or directory
E> ERROR: configuration failed for package 'abess'
E> * removing 'C:/Users/ustc/AppData/Local/Temp/RtmpEXjWyu/Rinst72056862b16/abess'
E> -----------------------------------
E> ERROR: package installation failed

Provide any option for inference?

This is more of a feature request: do you think it would be possible to implement some form of statistical inference in abess, e.g. confidence intervals on the coefficients via bootstrapping? Alternatively, one could repeatedly bootstrap the dataset and refit an abess model on each bootstrapped dataset, take the union of the selected variables across all fits, repeat until the union no longer grows, and then refit a single regular GLM with base R's glm function on that union. I don't know whether something like this has been suggested in the literature; I thought I had seen a suggestion along these lines in one of the Tibshirani articles on selective inference, but I can't find it now.
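
The second idea (bootstrap, select, take the union, refit) could look roughly like the sketch below. It is only an illustration under stated assumptions: top_k_selector is a simple correlation-based stand-in and is NOT abess, the stopping rule (`patience` rounds without a new variable) is one possible reading of "until the union no longer grows", and ordinary least squares replaces the GLM refit.

```python
import numpy as np


def top_k_selector(X, y, k):
    """Stand-in selector (NOT abess): keep the k columns most correlated with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    score = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) + 1e-12)
    return set(np.argsort(score)[-k:].tolist())


def bootstrap_union(X, y, k, selector, patience=10, max_rounds=200, seed=0):
    """Grow the union of selected variables until it stops changing."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    union, stale = set(), 0
    for _ in range(max_rounds):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        selected = selector(X[idx], y[idx], k)
        if selected <= union:
            stale += 1  # this round added nothing new
            if stale >= patience:
                break
        else:
            union |= selected
            stale = 0
    return sorted(union)


# Simulated data: only the first three predictors carry signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 20))
beta = np.zeros(20)
beta[:3] = [3.0, -2.0, 1.5]
y = X @ beta + rng.normal(size=200)

selected = bootstrap_union(X, y, k=3, selector=top_k_selector)

# Refit ordinary least squares on the union (with an intercept column).
design = np.column_stack([np.ones(len(y)), X[:, selected]])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
print(selected, coef.round(2))
```

Any selector with the same (X, y, k) signature could be swapped in, for example a thin wrapper around an abess fit.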

Unexpected NaN produced when fitting logistic regression.

Describe the bug

Unexpected NaN values appear in the regression coefficients when fitting logistic regression on a real dataset.

Code for Reproduction

library(abess)
x_train <- as.matrix(read.table("arcene_train.txt"))
y_train <- (read.table("arcene_train_labels.txt")$V1 + 1) / 2
fit_abess <- abess(x_train, y_train, family = "binomial")
beta <- coef(fit_abess, fit_abess$best.size)
sum(is.nan(as.vector(beta)))

Issue
The coefficient vector contains NaNs, which is unexpected since there are no NA or NaN values in the data; this can be easily checked via

sum(is.na(x_train))
sum(is.na(y_train))
sum(is.nan(x_train))
sum(is.nan(y_train))

Desktop

OS: Windows 10
R Version: 4.0.5
Package Version: 0.3.0

Dataset

arcene_train.txt
arcene_train_labels.txt

Use same keyword arguments for sample weight as sklearn

First of all: Thank you for the great package, it has been very helpful. Now to my suggestion:

Abess uses weight
https://abess.readthedocs.io/en/latest/Python-package/linear/Logistic.html?highlight=score#abess.linear.LogisticRegression.fit
Sklearn uses sample_weight
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression.fit

I'm using both in a project and it would be helpful if abess followed the sklearn convention.
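
Until the keyword is unified, a thin adapter can bridge the two conventions. The sketch below assumes only what is stated above, namely that the wrapped model's fit() takes `weight`; WeightKeywordModel is a hypothetical stand-in for such a model and SampleWeightAdapter is an illustrative name, neither is part of abess or scikit-learn.

```python
class WeightKeywordModel:
    """Hypothetical stand-in for an estimator whose fit() takes `weight`."""

    def fit(self, X, y, weight=None):
        self.weight_seen_ = weight
        return self


class SampleWeightAdapter:
    """Expose a scikit-learn style fit(X, y, sample_weight=...) signature."""

    def __init__(self, model):
        self.model = model

    def fit(self, X, y, sample_weight=None):
        # Forward sample_weight under the keyword the wrapped model expects.
        self.model.fit(X, y, weight=sample_weight)
        return self

    def __getattr__(self, name):
        # Delegate everything else (predict, coef_, ...) to the wrapped model.
        if name == "model":  # avoid infinite recursion before __init__ has run
            raise AttributeError(name)
        return getattr(self.model, name)


adapter = SampleWeightAdapter(WeightKeywordModel())
adapter.fit([[0.0], [1.0]], [0, 1], sample_weight=[0.2, 0.8])
print(adapter.weight_seen_)
```

Note that this plain wrapper does not implement get_params/set_params, so it would need more work before being used inside scikit-learn tools that clone estimators.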

Is it possible to combine `group` and `always_select` to always select a whole group?

If we wish to use always_select to select an entire group of targets, how should this be done? I can imagine two possibilities:

Apply always_select to all members of the group, or
Apply always_select to any member of the group.

Which of these should work correctly?
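
For reference, the two options translate into index lists as in the sketch below. The `group` array is a made-up example with one label per predictor column, and the helper names are hypothetical; whether always_select expects column indices or group labels is exactly the open question and is not settled here.

```python
import numpy as np

# Hypothetical group labels: one entry per predictor column.
group = np.array([0, 0, 1, 1, 1, 2, 3, 3])


def members_of(group, label):
    """Column indices of every predictor in the given group."""
    return np.flatnonzero(np.asarray(group) == label).tolist()


def first_member_of(group, label):
    """Column index of a single representative of the group."""
    return members_of(group, label)[:1]


all_members = members_of(group, 1)      # option 1: all members of the group
one_member = first_member_of(group, 1)  # option 2: any one member
print(all_members, one_member)
```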

Import error #2

When I import abess in my python file, I see the following error:

Traceback (most recent call last):
  File "c:\Users\igork\code\best_subset.py", line 1, in <module>
    from abess.linear import abessLm
  File "C:\Users\igork\AppData\Roaming\Python\Python39\site-packages\abess\__init__.py", line 9, in <module>
    from abess.linear import abessLogistic, abessLm, abessCox, abessPoisson, abessMultigaussian, abessMultinomial, abessGamma
  File "C:\Users\igork\AppData\Roaming\Python\Python39\site-packages\abess\linear.py", line 3, in <module>
    from .bess_base import bess_base
  File "C:\Users\igork\AppData\Roaming\Python\Python39\site-packages\abess\bess_base.py", line 6, in <module>
    from .cabess import *
  File "C:\Users\igork\AppData\Roaming\Python\Python39\site-packages\abess\cabess.py", line 13, in <module>
    from . import _cabess
ImportError: DLL load failed while importing _cabess: The specified module could not be found.

I tried installing MinGW-w64 and adding it to PATH, as described in #259, but the problem persists.

abess was installed via pip install; the Python version is 3.9.6.
Can you help me?

Perhaps a typo in the online tutorial for multi-response linear regression

The loss function seems incorrect in the first section of the online tutorial. Should the 2-norm of matrices be changed to the Frobenius norm instead? This applies to both the R and Python tutorials.

R tutorial
Python tutorial
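
The difference between the two norms can be checked numerically. The sketch below assumes the intended loss is the sum of squared residuals over all entries of Y - XB, which is the usual multi-response least-squares loss; the data and the coefficient matrix B are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 50, 5, 3                      # samples, predictors, responses
X = rng.normal(size=(n, p))
B = rng.normal(size=(p, m))             # placeholder coefficient matrix
Y = X @ B + rng.normal(size=(n, m))

R = Y - X @ B                           # residual matrix, shape (n, m)
sum_sq = float(np.sum(R ** 2))          # squared error summed over all entries
fro_sq = float(np.linalg.norm(R, "fro") ** 2)
spec_sq = float(np.linalg.norm(R, 2) ** 2)   # squared largest singular value

print(np.isclose(sum_sq, fro_sq))   # Frobenius norm reproduces the summed squares
print(spec_sq < fro_sq)             # the matrix 2-norm gives a different, smaller value
```

For a matrix argument, np.linalg.norm with ord=2 is the spectral norm, so writing the loss with a plain matrix 2-norm would not equal the summed squared error unless the residual has rank one.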
