FSelector R package
Running this code (from Bernd; see the link to the issue below)
library(parallel)
library(FSelector)  # provides chi.squared()
fun = function(foo) {
    chi.squared(Species ~ ., data = iris)
}
mclapply(1:2, fun, mc.cores = 2)
stalls. The same is true if the parallelMap package is used with the 'multicore' backend instead of parallel (parallelizing via 'socket' does not stall). Is it possible that FSelector cannot be used in 'multicore' fashion, perhaps because of the underlying Java/Weka packages?
The previous discussion is here: mlr-org/mlr#1802
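A likely explanation (speculation, not confirmed in the thread) is that the JVM started by rJava does not survive the fork() that the multicore backend relies on, whereas socket clusters spawn fresh R processes that each initialize their own JVM. A minimal sketch of the socket-based workaround:

```r
library(parallel)

# Socket workers are fresh R sessions; each one loads FSelector
# and thus starts its own JVM, avoiding the fork problem.
cl <- makePSOCKcluster(2)
clusterEvalQ(cl, library(FSelector))
res <- parLapply(cl, 1:2, function(i) chi.squared(Species ~ ., data = iris))
stopCluster(cl)
```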
This function:
get.data.frame.from.formula <- function(formula, data) {
    d = model.frame(formula, data, na.action = NULL)
    for (i in 1:dim(d)[2]) {
        if (is.factor(d[[i]]) || is.logical(d[[i]]) || is.character(d[[i]]))
            d[[i]] = factor(d[[i]])
    }
    return(d)
}
computes model.frame and then, in case there are some columns of type logical or character, converts them to factors. Note that even a factor is converted to a factor again (is.factor -> factor)...
But what if there are only numeric, factor, or integer columns? model.frame is computed every time, before the classes of the data.frame's columns are checked. Maybe there should first be a column-type check before that computation.
Moreover, this function is evaluated many times in many places. I don't see why it should be computed three times, for example, in
discretize.all <- function(formula, data) {
    new_data = get.data.frame.from.formula(formula, data)           # (1)
    dest_column_name = dimnames(new_data)[[2]][1]
    if (!is.factor(new_data[[1]])) {
        new_data[[1]] = equal.frequency.binning.discretization(new_data[[1]], 5)
    }
    new_data = supervised.discretization(formula, data = new_data)  # (2)
    # reorder attributes
    new_data = get.data.frame.from.formula(formula, new_data)       # (3)
    return(new_data)
}
I see the comment that this is done to reorder the attributes, but is it really necessary to do it that way? The loop in get.data.frame.from.formula can take long for huge datasets.
So it is called twice here, and supervised.discretization calls it one more time at the beginning:
supervised.discretization <- function(formula, data) {
    data = get.data.frame.from.formula(formula, data)
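One possible restructuring (a hypothetical sketch, not the package's code) would check column types first, so that purely numeric/factor data skips the conversion loop entirely:

```r
# Hypothetical variant of get.data.frame.from.formula: build the
# model frame once, then convert only logical/character columns.
# Note: the original also re-factors factor columns, which drops
# unused levels -- this sketch intentionally leaves them alone.
get.data.frame.from.formula2 <- function(formula, data) {
    d <- model.frame(formula, data, na.action = NULL)
    idx <- vapply(d, function(col) is.logical(col) || is.character(col),
                  logical(1))
    d[idx] <- lapply(d[idx], factor)
    d
}
```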
Moreover, what if the data argument isn't really a data.frame? Is there any stopifnot?
I've checked the information.gain function, and there is no stopifnot. No one anywhere checks whether arguments are of the proper types. The only hope is that maybe model.frame, called from get.data.frame.from.formula, checks this... But what if formula is not a formula and the type argument is just a random character?
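A minimal sketch of the kind of validation being asked for (a hypothetical wrapper, not part of FSelector):

```r
library(FSelector)

# Fail early with a clear message instead of hoping that
# model.frame rejects bad input somewhere downstream.
information.gain.checked <- function(formula, data) {
    stopifnot(inherits(formula, "formula"),
              is.data.frame(data))
    information.gain(formula, data)
}
```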
Dear Lars,
The FSelector CFS algorithm crashes when I apply it to my data. The crash is accompanied by an error message that suggests that the source of the problem is in RWeka/Weka. I've tried to debug the problem, but do not know enough about RWeka or Weka to do so. The problem is reproducible, but only happens with certain data.
I'm unable to determine what attributes of the data lead to this problem. There are no NaNs, NAs, or infinities in the data; it's numeric. I've tried to find a way to determine whether a specific data sample will result in a crash, so that I can subset out the specific columns in my data.frame that are responsible for triggering this behaviour, but have been unable to do so. The data that triggers this behaviour is extremely highly skewed and centred at zero, with a long tail out to relatively large values (basically, the histograms look at first glance like only one bin is filled).
A similar crash and error message occurs if I run RWeka's Discretize algorithm on my data. I'm guessing that RWeka/Weka is not able to find an appropriate binning for the data to discretize it. Centering and scaling the data (and perhaps also applying a Box-Cox transform) using caret's preProcess algorithm sometimes fixes the data.
Here's how I'm invoking CFS:
foo <- cfs(class ~ ., data = mydata)
Error message:
Error in ls(envir = envir, all.names = private) :
invalid 'envir' argument
Calls: ... Discretize -> RWeka_use_filter -> .jcall -> .jcheck -> .Call
Execution halted
Session info:
sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: i686-pc-linux-gnu (32-bit)
Running under: Ubuntu precise (12.04.5 LTS)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_GB.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] knitr_1.9
loaded via a namespace (and not attached):
[1] magrittr_1.5 formatR_1.1 htmltools_0.2.6 tools_3.2.0 yaml_2.1.13
[6] stringi_0.4-1 rmarkdown_0.5.1 stringr_1.0.0 digest_0.6.8 evaluate_0.6
Is there a way that FSelector could be made to exit gracefully instead of crashing? This would allow execution of code to continue. Any ideas what the problem could be? Googling the error message wasn't particularly informative, and I'm just guessing that there is something about the distribution of my data that leads to RWeka/Weka failing to discretize it, leading to the crash.
I can supply data for troubleshooting if required.
With many thanks,
Andrew.
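One way a caller can get the graceful behaviour asked for above is to wrap the call in tryCatch (a sketch; this can only catch R-level errors, not a hard crash of the JVM itself):

```r
library(FSelector)

# Return NULL with a warning instead of halting the script when
# cfs() raises an R-level error.
safe_cfs <- function(formula, data) {
    tryCatch(cfs(formula, data),
             error = function(e) {
                 warning("cfs failed: ", conditionMessage(e))
                 NULL
             })
}
```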
Can you elaborate on which function (Discretize(formula, data = new_data, na.action = na.pass)) is called in lines 57 and 60 of R/discretize.R? I am trying to understand how the discretization works.
Thanks in advance!
Have you considered supporting a sparse matrix format for the data?
It looks like, for e.g. the information.gain() function, only a data.frame is accepted for the data parameter. Sparse matrices have the huge advantage that they can hold the sparse ( :) ) information of a huge-dimensional regular data.frame. If sparsity is over 99%, there is a great need and demand to store the data in a sparse matrix. Sometimes there isn't even enough RAM to convert data already stored in sparse format to a data.frame. In my case, I have data that takes about 170 MB in sparse format but over 3.6 GB as a regular data.frame. And that's the point.
Any plans or ideas for future support?
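To illustrate the memory gap (a rough sketch using the Matrix package; exact sizes depend on the data):

```r
library(Matrix)

# A 10,000 x 5,000 matrix with 0.5% non-zero entries.
set.seed(1)
m <- rsparsematrix(10000, 5000, density = 0.005)
print(object.size(m), units = "MB")      # a few MB in sparse form
dense <- as.data.frame(as.matrix(m))     # densify
print(object.size(dense), units = "MB")  # hundreds of MB
```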
Hi,
This is more of a question about how the function handles a situation than a technical issue. I am using information.gain for feature selection. The data set contains 40 cases: 15 from class A, 10 from B, 3 from C, and the remaining are single cases in their own classes. When I label them like this, information gain gives me a list of attribute importances. However, when I use a single case against everything else, e.g. class D vs. non-class D, all attribute importances drop to zero.
That does make sense, since you can't calculate the entropy for that if you have only one case in one class. However, I would like to know how it handles those classes with a single case in the mix of some other "normal" classes. If the attribute score means nothing for those classes with a single case, perhaps adding a warning message would be nice. It would be very useful for data science beginners like myself.
Thanks a lot.
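As a rough illustration of why the scores collapse (the package's internals may differ): with a 1-vs-39 split, the class entropy, which is an upper bound on the information gain of any attribute, is already tiny:

```r
# Class entropy of a 1 vs 39 split -- the ceiling on information
# gain for any attribute under this labelling.
p <- c(1, 39) / 40
-sum(p * log2(p))  # about 0.17 bits, vs. 1 bit for a balanced split
```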
To my understanding, oneR assesses the error rate of the following prediction rule for each feature vec: predict the class that is most frequent among all instances where feature vec has level val.
For each level val, calculate the ratio of misclassified instances among all instances where feature vec has level val, and store the result in the vector errors.
To obtain an aggregated error rate for feature vec, why do you simply sum up the components of the vector errors?
Intuitively, I would say that this only makes sense if the frequencies of all levels are (almost) identical. Otherwise I would calculate a weighted mean.
I tried the discretization for some datasets, e.g. the iris dataset, and for example the variable Sepal.Length was split into 3 categories with frequencies 59, 36, and 55. I'm not sure a simple sum of the error rates really makes sense here.
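The weighted aggregation being proposed could look like this (a sketch with hypothetical per-level error rates, not FSelector's code):

```r
# Per-level misclassification rates and level frequencies, e.g.
# for the three Sepal.Length bins with frequencies 59, 36, 55.
errors <- c(0.25, 0.40, 0.10)  # hypothetical error rates
counts <- c(59, 36, 55)

sum(errors)                    # unweighted sum, as oneR does now
weighted.mean(errors, counts)  # frequency-weighted alternative
```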
Whenever I try to run library(FSelector), RStudio pops up a message box saying "Fatal Error: R Session Crashed".
System Information:
R version 3.5.1 (2018-07-02) -- "Feather Spray"
Platform: x86_64-apple-darwin15.6.0 (64-bit)