fselector's Issues

fselector seems to stall when parallelized with multicore?

Running this code (from Bernd; see the link to the issue below)

library(parallel)
library(FSelector)

fun = function(foo) {
  chi.squared(Species ~ ., data = iris)
}

mclapply(1:2, fun, mc.cores = 2)

stalls. The same happens if, instead of parallel, the parallelMap package is
used in 'multicore' mode (parallelizing via 'socket' does not stall).
Is it possible that fselector cannot be used in 'multicore' fashion,
maybe because of the underlying Java/Weka packages?

The previous discussion is here: mlr-org/mlr#1802
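
Since the report says that socket-based parallelization does not stall, one possible workaround is to use a socket (PSOCK) cluster from the parallel package instead of mclapply. The following is only a sketch: the helper name run_on_sockets and its arguments are made up for illustration, and it does not establish why the multicore variant stalls.

```r
library(parallel)

# Hypothetical helper (not part of FSelector or parallel): apply `fun` to each
# element of `xs` on a socket cluster, which starts fresh R worker processes
# instead of forking the current one.
run_on_sockets <- function(xs, fun, n_workers = 2, packages = character()) {
  cl <- makeCluster(n_workers, type = "PSOCK")
  on.exit(stopCluster(cl), add = TRUE)  # always shut the workers down
  # Attach the requested packages in every worker process
  for (pkg in packages) {
    clusterCall(cl, library, pkg, character.only = TRUE)
  }
  parLapply(cl, xs, fun)
}

# Intended usage with the example from the report (requires FSelector and Java):
# run_on_sockets(1:2, function(foo) chi.squared(Species ~ ., data = iris),
#                packages = "FSelector")
```

Each worker loads the package in its own process, so no already-initialised state is shared with the parent session.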

get.data.frame.from.formula function disaster

This function:

get.data.frame.from.formula <- function(formula, data) {
    d = model.frame(formula, data, na.action = NULL)
    for(i in 1:dim(d)[2]) {
        if(is.factor(d[[i]]) || is.logical(d[[i]]) || is.character(d[[i]]))
            d[[i]] = factor(d[[i]])
    }
    return(d)
}

It computes model.frame just in case there are columns of type logical or character, and converts them to factors. Note that even factors are converted to factors (is.factor -> factor)...

But what if there are only numeric, factor, or integer data? model.frame is computed every time, before the classes of the data.frame columns are checked. Maybe there should be a column-type check before that computation.
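
A minimal sketch of such a column-type check is shown here. The function name to_factor_columns is hypothetical. It only covers the conversion loop, not the model.frame call, which is still needed to resolve the formula. It also leaves existing factors untouched, which is a behaviour change: calling factor() on a factor drops unused levels, so the original loop does that as a side effect.

```r
# Hypothetical replacement for the conversion loop: convert only logical and
# character columns to factors and leave every other column as it is.
to_factor_columns <- function(d) {
  needs_conversion <- vapply(
    d,
    function(col) is.logical(col) || is.character(col),
    logical(1)
  )
  # Skip the assignment entirely when nothing has to be converted
  if (any(needs_conversion)) {
    d[needs_conversion] <- lapply(d[needs_conversion], factor)
  }
  d
}
```

For data that is already numeric or factor, this does one type check per column and no copying.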

Moreover, this function is evaluated many times in many places. I have no idea why it should be computed 3 times, for example, in

discretize.all <- function(formula, data) {
    new_data = get.data.frame.from.formula(formula, data)  # (1)

    dest_column_name = dimnames(new_data)[[2]][1]
    if(!is.factor(new_data[[1]])) {
        new_data[[1]] = equal.frequency.binning.discretization(new_data[[1]], 5)
    }

    new_data = supervised.discretization(formula, data = new_data)  # (2)

    # reorder attributes
    new_data = get.data.frame.from.formula(formula, new_data)  # (3)
    return(new_data)
}

I see the comment saying this is done to reorder attributes, but is it really necessary to do it this way? The loop in get.data.frame.from.formula can take a long time for huge datasets.

So it's done twice, and supervised.discretization does it once more at the beginning:

supervised.discretization <- function(formula, data) {
    data = get.data.frame.from.formula(formula, data)

Moreover, what if the data argument isn't really a data.frame? Is there any stopifnot?
I've checked the information.gain function and there isn't any stopifnot. No one checks anywhere whether the arguments are of the proper types. The only hope is that model.frame in get.data.frame.from.formula checks this. But what if formula is not a formula and the type argument is just a random character string?
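
The kind of up-front validation asked for here could look like the following sketch. The helper check_selector_args is hypothetical and not part of FSelector. It only covers the formula and data arguments, using base R checks.

```r
# Hypothetical argument check that a selector function could call first.
check_selector_args <- function(formula, data) {
  if (!inherits(formula, "formula")) {
    stop("`formula` must be a formula, got an object of class ",
         class(formula)[1], call. = FALSE)
  }
  if (!is.data.frame(data)) {
    stop("`data` must be a data.frame, got an object of class ",
         class(data)[1], call. = FALSE)
  }
  # Every variable named in the formula (except the "." shorthand) must be a
  # column of `data`
  unknown <- setdiff(all.vars(formula), c(names(data), "."))
  if (length(unknown) > 0) {
    stop("variables not found in `data`: ",
         paste(unknown, collapse = ", "), call. = FALSE)
  }
  invisible(TRUE)
}
```

Failing early with a message that names the offending argument is easier to debug than an error raised somewhere inside model.frame.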

CFS execution failure

Dear Lars,

The FSelector CFS algorithm crashes when I apply it to my data. The crash is accompanied by an error message that suggests that the source of the problem is in RWeka/Weka. I've tried to debug the problem, but do not know enough about RWeka or Weka to do so. The problem is reproducible, but only happens with certain data.

I'm unable to determine what attributes of the data lead to this problem. The data is numeric, with no NaNs, NAs, or infinities. I've tried to find a way to determine whether a specific data sample will result in a crash, so that I can subset out the specific columns in my data.frame that trigger this behaviour, but have been unable to do so. The data that triggers this behaviour is extremely highly skewed and centred at zero, with a long tail out to relatively large values (at first glance, the histograms look as if only one bin is filled).

A similar crash and error message occur if I run RWeka's Discretize algorithm on my data. I'm guessing that RWeka/Weka is not able to find an appropriate binning to discretize the data. Centering and scaling the data (and perhaps also applying a Box-Cox transform) using caret's preProcess sometimes fixes the problem.

Here's how I'm invoking CFS:

foo <- cfs(class ~ ., data = mydata)

Error message:

Error in ls(envir = envir, all.names = private) :
invalid 'envir' argument
Calls: ... Discretize -> RWeka_use_filter -> .jcall -> .jcheck -> .Call
Execution halted

Session info:

sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: i686-pc-linux-gnu (32-bit)
Running under: Ubuntu precise (12.04.5 LTS)

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_GB.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] knitr_1.9

loaded via a namespace (and not attached):
[1] magrittr_1.5 formatR_1.1 htmltools_0.2.6 tools_3.2.0 yaml_2.1.13
[6] stringi_0.4-1 rmarkdown_0.5.1 stringr_1.0.0 digest_0.6.8 evaluate_0.6

Is there a way that FSelector could be made to exit gracefully instead of crashing? This would allow execution of code to continue. Any ideas what the problem could be? Googling the error message wasn't particularly informative, and I'm just guessing that there is something about the distribution of my data that leads to RWeka/Weka failing to discretize it, leading to the crash.
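
Until the cause is found, the calling code can keep running by catching the error itself. This is a sketch under one assumption: the failure is an ordinary R error (the "Error in ls(...)" message followed by "Execution halted" suggests an uncaught error in a non-interactive script) and not a crash of the whole process, which tryCatch cannot intercept. The wrapper name try_selector is hypothetical.

```r
# Hypothetical wrapper: call a feature selector and, if it raises an R error,
# turn the error into a warning and return a fallback value instead.
try_selector <- function(selector, ..., on_error = NULL) {
  tryCatch(
    selector(...),
    error = function(e) {
      warning("feature selection failed: ", conditionMessage(e), call. = FALSE)
      on_error
    }
  )
}

# Intended usage with the call from this report (requires FSelector):
# foo <- try_selector(cfs, class ~ ., data = mydata)
# if (is.null(foo)) message("cfs failed, continuing without it")
```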

I can supply data for troubleshooting if required.

With many thanks,

Andrew.

Discretization

Can you elaborate on which function is called as Discretize(formula, data = new_data, na.action = na.pass) in lines 57 and 60 of R/discretize.R? I am trying to understand how the discretization works.

Thanks in advance!

Sparse Matrixes Support

Have you considered supporting a sparse matrix format for the data?
It looks like functions such as information.gain() only accept a data.frame for the data parameter.
Sparse matrices have the huge advantage that they can hold the sparse ( :) ) information of a high-dimensional regular data.frame. If the sparsity is over 99%, there is a real need to store the data in a sparse matrix. Sometimes there isn't even enough RAM to convert data already stored in a sparse format to a data.frame. In my case, the data takes about 170 MB in a sparse format but over 3.6 GB as a regular data.frame. And that's the point.

Any plans or ideas for future support?
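
Without native support, one possible workaround is to densify only a block of columns at a time. The sketch below assumes that each attribute's score depends only on that attribute and the class, so that columns can be scored independently. That assumption is not verified here for any FSelector function. The helper score_in_chunks and its score_fun argument are hypothetical; the sparse matrix comes from the Matrix package.

```r
library(Matrix)

# Hypothetical helper: score the columns of a sparse matrix `x` against the
# class vector `y` in blocks, so only `chunk_size` columns are dense at once.
# `score_fun(dense_df, y)` must return one numeric score per column.
score_in_chunks <- function(x, y, score_fun, chunk_size = 1000) {
  if (ncol(x) == 0) return(numeric(0))
  starts <- seq(1, ncol(x), by = chunk_size)
  scores <- lapply(starts, function(s) {
    cols <- s:min(s + chunk_size - 1, ncol(x))
    dense <- as.data.frame(as.matrix(x[, cols, drop = FALSE]))
    # Keep column identity across chunks
    names(dense) <- if (is.null(colnames(x))) paste0("V", cols) else colnames(x)[cols]
    score_fun(dense, y)
  })
  unlist(scores)
}
```

Peak memory is then governed by chunk_size rather than by the total number of columns, at the cost of one call to score_fun per block.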

Handling class with single case

Hi,

This is more of a question about how the function handles a situation than a technical issue. I am using information.gain for feature selection. The data set contains 40 cases: 15 from class A, 10 from B, 3 from C, and the remaining ones are single cases in their own classes. When I label them like this, information.gain gives me a list of attribute importances. However, when I use a single case against everything else, e.g. Class D vs. non-Class D, all attribute importances drop to zero.

That does make sense, since you can't calculate the entropy if you have only one case in a class. However, I would like to know how it handles classes with a single case mixed in with other "normal" classes. If the attribute score means nothing for classes with a single case, a warning message would be nice. It would be very useful for data science beginners like myself.
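
To make the one-vs-rest case concrete, here is a small base-R sketch of entropy and information gain for discrete values. It is not FSelector's implementation. The function names are illustrative, natural logarithms are an arbitrary choice, and the inputs are assumed to contain no missing values. Information gain can never exceed the entropy of the class labels, which is small for a 1-vs-39 split, and it is exactly zero for an attribute that takes a single value (for example, if a discretization step puts all cases in one bin).

```r
# Shannon entropy (natural log) of a discrete vector without NAs
entropy <- function(x) {
  p <- table(x) / length(x)
  p <- p[p > 0]
  -sum(p * log(p))
}

# Information gain of a discrete attribute with respect to the class:
# H(class) + H(attr) - H(attr, class)
info_gain <- function(attr, class) {
  entropy(class) + entropy(attr) - entropy(interaction(attr, class, drop = TRUE))
}

class_d <- c(rep("other", 39), "D")        # one case vs. everything else
one_bin <- rep("all", 40)                  # attribute with a single value

upper_bound <- entropy(class_d)            # largest gain any attribute can reach
gain_one_bin <- info_gain(one_bin, class_d)
gain_perfect <- info_gain(class_d, class_d)
```

Here gain_perfect equals upper_bound, and gain_one_bin is zero.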

Thanks a lot.

Question concerning oneR

To my understanding, oneR assesses the error rate of the following prediction rule for each feature vec: predict the class that is most frequent among all instances where feature vec has level val.
For each level val, it calculates the ratio of misclassified instances among all instances where feature vec has level val, and stores the result in the vector errors.

To obtain an aggregated error rate for feature vec, why do you simply sum up the components of the vector errors?
Intuitively, I would say that this only makes sense if the frequencies of all levels are (almost) identical. Otherwise I would calculate a weighted mean.
I tried the discretization on some datasets, e.g. the iris dataset, where the variable Sepal.Length was split into 3 categories with frequencies 59, 36 and 55. I'm not sure a simple sum of the error rates really makes sense here.
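
The difference between the two aggregations can be sketched in base R. The code follows the rule as described in this question, not FSelector's source, and the function name one_r_errors is made up. The frequency-weighted mean of the per-level error rates equals the overall misclassification rate of the rule, while the plain sum counts every level equally regardless of how many instances it covers.

```r
# Per-level error rates of the "predict the majority class per level" rule,
# aggregated as a plain sum and as a frequency-weighted mean.
one_r_errors <- function(feature, class) {
  tab <- table(factor(feature), class)      # factor() drops unused levels
  level_n <- rowSums(tab)                   # instances per level
  level_err <- 1 - apply(tab, 1, max) / level_n
  c(sum = sum(level_err),
    weighted = sum(level_err * level_n) / sum(level_n))
}

# Two levels of unequal size: "a" has 4 instances (1 misclassified),
# "b" has 2 instances (1 misclassified).
feature <- c("a", "a", "a", "a", "b", "b")
class   <- c("x", "x", "x", "y", "x", "y")
res <- one_r_errors(feature, class)
```

In this toy data the per-level error rates are 0.25 and 0.5, so the plain sum is 0.75 while the weighted mean is 2/6, the share of misclassified instances.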

R Studio Crashing after loading FSelector

Whenever I try to run library(FSelector), R Studio pops up a message box saying Fatal Error R Session Crashed.

System Information:

R version 3.5.1 (2018-07-02) -- "Feather Spray"
Platform: x86_64-apple-darwin15.6.0 (64-bit)
