riken-aip / pyHSICLasso
Versatile Nonlinear Feature Selection Algorithm for High-dimensional Data
License: MIT License
A bug occurs when there are only a few explanatory variables. Please fix it!
When I used block Lasso to select 77 features (from 770), I got only 57. The block size was a divisor of the number of data instances. However, when I set the block size to zero, I got exactly 77 features. Is it normal for block Lasso to return fewer features? This happened when I used the permutation parameter M with the value one.
The other difference is that when I use vanilla Lasso, I get the following warning:
C:\Program Files\Python37\lib\site-packages\pyHSICLasso\nlars.py:77: RuntimeWarning: divide by zero encountered in true_divide gamma1 = (C - c[I]) / (XtXw[A[0]] - XtXw[I])
Block lasso had no warnings.
Then I tried block Lasso with M=2. I got 77 features, but also the following warnings:
C:\Program Files\Python37\lib\site-packages\pyHSICLasso\nlars.py:77: RuntimeWarning: invalid value encountered in true_divide gamma1 = (C - c[I]) / (XtXw[A[0]] - XtXw[I])
C:\Program Files\Python37\lib\site-packages\pyHSICLasso\nlars.py:83: RuntimeWarning: invalid value encountered in less_equal gamma[gamma <= 1e-9] = np.inf
C:\Program Files\Python37\lib\site-packages\pyHSICLasso\nlars.py:85: RuntimeWarning: invalid value encountered in less mu = min(gamma)
C:\Program Files\Python37\lib\site-packages\pyHSICLasso\nlars.py:77: RuntimeWarning: divide by zero encountered in true_divide gamma1 = (C - c[I]) / (XtXw[A[0]] - XtXw[I])
Finally, I tried M=3, again got 77 features, and saw the same warning as with vanilla Lasso.
I have two questions. Should I use M=1 with no warnings and fewer features, or M=3 with the same warning vanilla Lasso had? Are these warnings of any importance, or are they within normal, expected behavior?
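For what it's worth, warnings like these typically come from degenerate step sizes that the LARS loop discards anyway. A minimal numpy sketch of the mechanism (the arrays here are made up for illustration, not taken from nlars.py):

```python
import numpy as np
import warnings

# The step-size computation divides by a difference of correlations;
# when two entries coincide, the denominator is zero.
C = np.array([1.0, 1.0])
denom = np.array([0.0, 2.0])   # first entry reproduces the degenerate case

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    gamma = (C - 0.5) / denom  # RuntimeWarning: divide by zero
print(gamma)                   # [inf 0.25]

# Non-finite / non-positive step sizes are then mapped to inf before
# taking the minimum, so the degenerate entry is simply never chosen:
gamma[~np.isfinite(gamma)] = np.inf
print(min(gamma))              # 0.25
```

So a divide-by-zero warning on this line does not by itself mean the selected features are wrong; the inf entries are filtered out before the minimum step is taken.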
UPDATE
Now I tried to select 9200 features from 92000 with block Lasso (B=19, M=3), but I got even fewer features than before: only 33. Should I scale M with the number of features?
After HSIC Lasso (regression) has finished executing, we have the beta values for every feature in the training dataset. Is there a way to determine the predicted value for a given instance? I am trying to evaluate the model fit via mean squared error, as done in the original paper (High-Dimensional Feature Selection by Feature-Wise Kernelized Lasso, Section 4.3.2).
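The beta values HSIC Lasso returns are feature importances, not regression coefficients, so getting predictions usually means refitting a predictor on the selected features. A minimal sketch with a plain least-squares refit on synthetic data (`selected` is a stand-in for the indices that `hsic.get_index()` would return; the paper itself uses kernel regression for this step):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))                     # stand-in training matrix
y = 2.0 * X[:, 3] + rng.normal(scale=0.1, size=100)

selected = [3, 7]                                  # e.g. from hsic.get_index()
Xs = np.c_[X[:, selected], np.ones(len(X))]        # selected columns + intercept
coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)      # refit on selected features

y_pred = Xs @ coef                                 # predictions for instances
mse = np.mean((y - y_pred) ** 2)
print(mse)                                         # small: noise variance ~0.01
```

Any regressor (e.g. kernel ridge or SVR) can replace the least-squares fit here; the point is that prediction happens in a second model trained on the HSIC-selected columns.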
Currently, HSIC Lasso can only handle univariate output, so this issue proposes extending it to multivariate output.
pyHSICLasso/pyHSICLasso/hsic_lasso.py
Line 23 in 0617219
Hi, I'm wondering if some clarification could be provided on this difference.
In addition, is it necessary that y_kernel and x_kernel be the same? My intuition is that they should be, but from what I can see in the code, that is not enforced. What is the rationale for allowing y and X to be projected into different spaces?
Accuracy decreased.
dataset: https://www.kaggle.com/artyomsalnikov/dataset-3
code: https://yadi.sk/d/xAsaL-TPGZe09A
Hello, I just tried this tool on some metabolomics data I have. Interestingly, HSIC Lasso selects just 76 metabolites out of the 2035 available. The R-squared score when I use these selected metabolites is just 0.18, compared to about 0.60 for Lasso on the original 2035 metabolites. My assumption is that the number of selected features is too small. I used SVR (kernel='rbf') from sklearn after feature selection with HSIC Lasso.
Is there a way to increase the number of features HSIC Lasso selects?
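If the limit is just the requested count, the first argument of `regression()` / `classification()` is the number of features to select, so asking for more is a one-line change. A hedged sketch (the path and the count 200 are placeholders; the import is guarded in case pyHSICLasso is not installed in this environment):

```python
try:
    from pyHSICLasso import HSICLasso
except ImportError:            # library not installed in this environment
    HSICLasso = None

def select_more_features(csv_path, num_feat=200):
    """Request num_feat features; num_feat is the first positional argument."""
    hsic = HSICLasso()
    hsic.input(csv_path)       # or hsic.input(X, Y) with numpy arrays
    hsic.regression(num_feat)  # e.g. 200 instead of the small default
    return hsic.get_index()    # indices of the selected features
```

Note that, as the block-Lasso discussion above suggests, the solver can still return fewer features than requested when the regularization path terminates early.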
Hey, that's awesome, and I'm trying to use it in my thesis. May I ask how to use it as a classifier? I have looked through the whole code, but how do I fit it on different subsets and get an overall precision score?
What does Y represent when a numpy array is the input? How do I use it? I'm a little confused.
Hi,
When trying to install the package in Anaconda, either through pip or directly from setup.py, it throws the following error:
ImportError: cannot import name 'PackageFinder' from 'pip._internal.index'
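This particular ImportError usually points to a pip-version mismatch rather than anything in pyHSICLasso itself: an older pip's internal module layout (`pip._internal.index`) no longer matches what the build tooling expects. Upgrading the packaging stack first is the usual fix (a sketch, assuming a standard CPython environment):

```shell
# Refresh pip/setuptools before installing the package itself.
python -m pip install --upgrade pip setuptools
python -m pip install pyHSICLasso
```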
Hi,
I've been extending this HSIC Lasso implementation to use specific types of distance-based kernels for microbiome data, and I'd like to verify that my understanding of the implementation and purpose of the "block" HSIC variant is correct. First, my understanding is that the "block" part of block HSIC Lasso is an optimization to speed up kernel computation, correct? Second, in this code I have noticed that the block HSIC Lasso kernel computation constructs essentially "mini" kernels on subsets of samples for single features (over the range of d features). If I am reading this correctly, each kernel is constructed for a single dimension only, which misses modeling the combinatorial effects of multiple features, and that is hardly ideal when such combinatorial effects are present in the data. Perhaps I am missing something or not seeing the full picture. Could someone please elaborate on this? Thank you!
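For reference, per-feature (univariate) kernels are the defining design of HSIC Lasso itself (the "feature-wise kernelized" part of the paper's title); the block variant only changes how those kernels are assembled, trading one n×n Gram matrix per feature for a set of B×B matrices on disjoint sample blocks. A small numpy sketch of that layout (Gaussian kernel, made-up data, not the library's actual code):

```python
import numpy as np

def gauss_kernel_1d(x, sigma=1.0):
    """Gram matrix of a Gaussian kernel for a single feature (1-D vector)."""
    d = x[:, None] - x[None, :]
    return np.exp(-d ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
n, d, B = 12, 4, 3                 # n samples, d features, block size B
X = rng.normal(size=(d, n))
perm = rng.permutation(n)          # one random partition of the samples

# Block HSIC Lasso: for each feature, build B x B kernels on disjoint
# sample blocks instead of one n x n kernel (memory O(n*B) vs O(n^2)).
block_kernels = [
    [gauss_kernel_1d(X[k, perm[i:i + B]]) for i in range(0, n, B)]
    for k in range(d)
]
print(len(block_kernels), len(block_kernels[0]), block_kernels[0][0].shape)
# 4 4 (3, 3)  -- d features, n/B blocks each, one B x B Gram matrix per block
```

Interactions between features then enter only through the lasso objective over these univariate kernels, not through joint multi-feature kernels, in both the block and the vanilla formulation.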