FSelector R package
Running this code (from Bernd; see the link to the issue below)
library(parallel)
library(FSelector)  # provides chi.squared()
fun = function(foo) {
    chi.squared(Species ~ ., data = iris)
}
mclapply(1:2, fun, mc.cores = 2)
stalls. The same is true if the parallelMap package is used with the 'multicore' backend instead of parallel (parallelizing via 'socket' does not stall). Is it possible that FSelector cannot be used in 'multicore' fashion, perhaps because of the underlying Java/Weka packages?
The previous discussion is here: mlr-org/mlr#1802
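A likely explanation (speculation, not confirmed in the thread) is that the JVM started by rJava does not survive the fork() that the multicore backend relies on, whereas socket clusters spawn fresh R processes that each initialize their own JVM. A minimal sketch of the socket-based workaround:

```r
library(parallel)

# Socket workers are fresh R sessions; each one loads FSelector
# and thus starts its own JVM, avoiding the fork problem.
cl <- makePSOCKcluster(2)
clusterEvalQ(cl, library(FSelector))
res <- parLapply(cl, 1:2, function(i) chi.squared(Species ~ ., data = iris))
stopCluster(cl)
```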
This function:
get.data.frame.from.formula <- function(formula, data) {
    d = model.frame(formula, data, na.action = NULL)
    for (i in 1:dim(d)[2]) {
        if (is.factor(d[[i]]) || is.logical(d[[i]]) || is.character(d[[i]]))
            d[[i]] = factor(d[[i]])
    }
    return(d)
}
computes model.frame and then, in case there are some columns of type logical or character, converts them to factors. Note that even a factor is converted to a factor again (is.factor -> factor)...
But what if there are only numeric, factor, or integer columns? model.frame is computed every time, before the classes of the data.frame's columns are checked. Maybe there should first be a column-type check before that computation.
Moreover, this function is evaluated many times in many places. I don't see why it should be computed three times, for example, in
discretize.all <- function(formula, data) {
    new_data = get.data.frame.from.formula(formula, data)           # (1)
    dest_column_name = dimnames(new_data)[[2]][1]
    if (!is.factor(new_data[[1]])) {
        new_data[[1]] = equal.frequency.binning.discretization(new_data[[1]], 5)
    }
    new_data = supervised.discretization(formula, data = new_data)  # (2)
    # reorder attributes
    new_data = get.data.frame.from.formula(formula, new_data)       # (3)
    return(new_data)
}
I see the comment that this is done to reorder the attributes, but is it really necessary to do it that way? The loop in get.data.frame.from.formula can take long for huge datasets.
So it is called twice here, and supervised.discretization calls it one more time at the beginning:
supervised.discretization <- function(formula, data) {
    data = get.data.frame.from.formula(formula, data)
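One possible restructuring (a hypothetical sketch, not the package's code) would check column types first, so that purely numeric/factor data skips the conversion loop entirely:

```r
# Hypothetical variant of get.data.frame.from.formula: build the
# model frame once, then convert only logical/character columns.
# Note: the original also re-factors factor columns, which drops
# unused levels -- this sketch intentionally leaves them alone.
get.data.frame.from.formula2 <- function(formula, data) {
    d <- model.frame(formula, data, na.action = NULL)
    idx <- vapply(d, function(col) is.logical(col) || is.character(col),
                  logical(1))
    d[idx] <- lapply(d[idx], factor)
    d
}
```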
Moreover, what if the data argument isn't really a data.frame? Is there any stopifnot?
I've checked the information.gain function, and there is no stopifnot. No one anywhere checks whether arguments are of the proper types. The only hope is that maybe model.frame, called from get.data.frame.from.formula, checks this... But what if formula is not a formula and the type argument is just a random character?
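A minimal sketch of the kind of validation being asked for (a hypothetical wrapper, not part of FSelector):

```r
library(FSelector)

# Fail early with a clear message instead of hoping that
# model.frame rejects bad input somewhere downstream.
information.gain.checked <- function(formula, data) {
    stopifnot(inherits(formula, "formula"),
              is.data.frame(data))
    information.gain(formula, data)
}
```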
Dear Lars,
The FSelector CFS algorithm crashes when I apply it to my data. The crash is accompanied by an error message that suggests that the source of the problem is in RWeka/Weka. I've tried to debug the problem, but do not know enough about RWeka or Weka to do so. The problem is reproducible, but only happens with certain data.
I'm unable to determine what attributes of the data lead to this problem. There are no NaNs, NAs, or infinities in the data; it's numeric. I've tried to find a way to determine whether a specific data sample will result in a crash, so that I can subset out the specific columns in my data.frame that are responsible for triggering this behaviour, but have been unable to do so. The data that triggers this behaviour is extremely highly skewed and centred at zero, with a long tail out to relatively large values (basically, the histograms look at first glance like only one bin is filled).
A similar crash and error message occurs if I run RWeka's Discretize algorithm on my data. I'm guessing that RWeka/Weka is not able to find an appropriate binning for the data to discretize it. Centering and scaling the data (and perhaps also applying a Box-Cox transform) using caret's preProcess algorithm sometimes fixes the data.
Here's how I'm invoking CFS:
foo <- cfs(class ~ ., data = mydata)
Error message:
Error in ls(envir = envir, all.names = private) :
invalid 'envir' argument
Calls: ... Discretize -> RWeka_use_filter -> .jcall -> .jcheck -> .Call
Execution halted
Session info:
sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: i686-pc-linux-gnu (32-bit)
Running under: Ubuntu precise (12.04.5 LTS)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_GB.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] knitr_1.9
loaded via a namespace (and not attached):
[1] magrittr_1.5 formatR_1.1 htmltools_0.2.6 tools_3.2.0 yaml_2.1.13
[6] stringi_0.4-1 rmarkdown_0.5.1 stringr_1.0.0 digest_0.6.8 evaluate_0.6
Is there a way that FSelector could be made to exit gracefully instead of crashing? This would allow execution of code to continue. Any ideas what the problem could be? Googling the error message wasn't particularly informative, and I'm just guessing that there is something about the distribution of my data that leads to RWeka/Weka failing to discretize it, leading to the crash.
I can supply data for troubleshooting if required.
With many thanks,
Andrew.
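One way a caller can get the graceful behaviour asked for above is to wrap the call in tryCatch (a sketch; this can only catch R-level errors, not a hard crash of the JVM itself):

```r
library(FSelector)

# Return NULL with a warning instead of halting the script when
# cfs() raises an R-level error.
safe_cfs <- function(formula, data) {
    tryCatch(cfs(formula, data),
             error = function(e) {
                 warning("cfs failed: ", conditionMessage(e))
                 NULL
             })
}
```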
Can you elaborate on which function (Discretize(formula, data = new_data, na.action = na.pass)) is called in lines 57 and 60 of R/discretize.R? I am trying to understand how the discretization works.
Thanks in advance!
Have you considered supporting a sparse matrix format for the data?
It looks like, for e.g. the information.gain() function, only a data.frame is accepted for the data parameter. Sparse matrices have the huge advantage that they can hold the sparse ( :) ) information of a huge-dimensional regular data.frame. If sparsity is over 99%, there is a great need and demand to store the data in a sparse matrix. Sometimes there isn't even enough RAM to convert data already stored in sparse format to a data.frame. In my case, I have data that takes about 170 MB in sparse format but over 3.6 GB as a regular data.frame. And that's the point.
Any plans or ideas for future support?
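To illustrate the memory gap (a rough sketch using the Matrix package; exact sizes depend on the data):

```r
library(Matrix)

# A 10,000 x 5,000 matrix with 0.5% non-zero entries.
set.seed(1)
m <- rsparsematrix(10000, 5000, density = 0.005)
print(object.size(m), units = "MB")      # a few MB in sparse form
dense <- as.data.frame(as.matrix(m))     # densify
print(object.size(dense), units = "MB")  # hundreds of MB
```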
Hi,
This is more of a question about how the function handles a situation than a technical issue. I am using information.gain for feature selection. The data set contains 40 cases: 15 from class A, 10 from B, 3 from C, and the remaining are single cases in their own classes. When I label them like this, information gain gives me a list of attribute importances. However, when I use a single case against everything else, e.g. class D vs. non-class D, all attribute importances drop to zero.
That does make sense, since you can't calculate the entropy for that if you have only one case in one class. However, I would like to know how it handles those classes with a single case in the mix of some other "normal" classes. If the attribute score means nothing for those classes with a single case, perhaps adding a warning message would be nice. It would be very useful for data science beginners like myself.
Thanks a lot.
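As a rough illustration of why the scores collapse (the package's internals may differ): with a 1-vs-39 split, the class entropy, which is an upper bound on the information gain of any attribute, is already tiny:

```r
# Class entropy of a 1 vs 39 split -- the ceiling on information
# gain for any attribute under this labelling.
p <- c(1, 39) / 40
-sum(p * log2(p))  # about 0.17 bits, vs. 1 bit for a balanced split
```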
To my understanding, oneR assesses the error rate of the following prediction rule for each feature vec: predict the class that is most frequent among all instances where feature vec has level val.
For each level val, calculate the ratio of misclassified instances among all instances where feature vec has level val, and store the result in the vector errors.
To obtain an aggregated error rate for feature vec, why do you simply sum up the components of the vector errors?
Intuitively, I would say that this only makes sense if the frequencies of all levels are (almost) identical. Otherwise I would calculate a weighted mean.
I tried the discretization for some datasets, e.g. the iris dataset, and for example the variable Sepal.Length was split into 3 categories with frequencies 59, 36, and 55. I'm not sure a simple sum of the error rates really makes sense here.
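The weighted aggregation being proposed could look like this (a sketch with hypothetical per-level error rates, not FSelector's code):

```r
# Per-level misclassification rates and level frequencies, e.g.
# for the three Sepal.Length bins with frequencies 59, 36, 55.
errors <- c(0.25, 0.40, 0.10)  # hypothetical error rates
counts <- c(59, 36, 55)

sum(errors)                    # unweighted sum, as oneR does now
weighted.mean(errors, counts)  # frequency-weighted alternative
```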
Whenever I try to run library(FSelector), RStudio pops up a message box saying "Fatal Error: R Session Crashed".
System Information:
R version 3.5.1 (2018-07-02) -- "Feather Spray"
Platform: x86_64-apple-darwin15.6.0 (64-bit)