
Comments (6)

larskotthoff commented on May 21, 2024

Hmm, it looks like this is an RWeka bug. I would try to reproduce the bug using RWeka's functions and then file a bug report with them. There's not a lot that can be done about it in FSelector, especially if the particular characteristics of the data that cause the bug are unknown.

from fselector.

andrewjohnlowe commented on May 21, 2024

I have developed a workaround that allows me to apply CFS to my data. Upon further inspection, I found that some of the columns in my data contain a lot of extremely small numbers, and their inclusion results in the data in these columns spanning around 250 orders of magnitude! There's no physical meaning to numbers like 1.2345E-95, so I zero them. I experimented with zeroing all values in my data with absolute magnitude < .Machine$double.neg.eps, and also < .Machine$double.eps, but neither limit was sufficient to provide a fix. After a search, I found that zeroing all values with absolute magnitude < 1E-7 did the trick. I'm guessing this might be machine- or OS-dependent. I haven't checked.
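The zeroing step described above could be sketched in R roughly as follows. This is only an illustration, not the code actually used in the thread: the helper name `zero_tiny`, its `cutoff` argument, and the `toy` data are all hypothetical, and the 1E-7 default is simply the value reported to work here.

```r
# Hypothetical helper (not part of FSelector or RWeka): set every numeric
# entry whose absolute value is below `cutoff` to exactly zero.
zero_tiny <- function(df, cutoff = 1e-7) {
  is_num <- vapply(df, is.numeric, logical(1))
  df[is_num] <- lapply(df[is_num], function(x) {
    x[!is.na(x) & abs(x) < cutoff] <- 0  # leave NA entries untouched
    x
  })
  df
}

# Made-up data for illustration only
toy <- data.frame(
  I1 = c(1.2345e+2, 1.3345e-95, 1.2345, 1.2345e-16),
  class.label = factor(c("a", "b", "a", "b"))
)
cleaned <- zero_tiny(toy)
print(cleaned$I1)
```

Non-numeric columns such as the class label are left as they are, so the cleaned data.frame can be passed on unchanged to whatever selection step follows.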

The behaviour changes depending on what other columns of data are present in my data.frame; I found that removing columns that I had previously identified as OK would often trigger a crash. Also, if the column that I had identified as problematic was set to be identical to a column that I had identified as OK, I got the same crash -- which is very strange. So the crash seems to depend on the collective properties of the data.frame, and not just on the contents of single columns. For example, say I have five feature columns, I1, I2, I3, I4, I5, plus class.label in my data. CFS on I1, I2, I3, I4, class.label is OK. Adding I5 makes CFS crash. If I do I5 <- I1 so that the two are identical, I still get a crash. If I go back to I1, I2, I3, I4, class.label and remove, say, I2 or I3, I get a crash. Very strange, no?

I'm guessing that the underlying problem is that when RWeka attempts to Discretize the data and produce nominal values, it sees that the data spans many orders of magnitude and this makes it fail. Perhaps it tries to bin the data and can't create enough bins to span the range of values with the precision required of the data (consider: 1.2345E+2, 1.3345E-95, 1.2345E-95, 1.2345, 1.2345E-16, 2.2345E-16, 12.345, 1.2345E+3, ...; what size bins shall we pick, and how many?), and this is what makes it crash. But this is pure speculation.

If this is indeed the root of the problem, it would be possible for FSelector to do a pre-check to see if the data is likely to cause a crash in RWeka; but, as you implied, the responsibility for doing this really rests with the developers of RWeka. In any case, the aforementioned workaround fixes the problem -- for now.
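A pre-check of the kind suggested above might look something like the sketch below. It is not part of FSelector or RWeka: the function names `magnitude_span` and `warn_wide_columns`, the `max_span` threshold of 20, and the sample data are all assumptions made for illustration, and the thread does not establish which span (if any) actually triggers the failure.

```r
# Hypothetical check: number of decades covered by the non-zero, finite
# magnitudes in a numeric vector.
magnitude_span <- function(x) {
  m <- abs(x[is.finite(x) & x != 0])
  if (length(m) == 0) return(0)
  diff(range(log10(m)))
}

# Warn about numeric columns whose values span more than `max_span` decades.
# The default of 20 is an arbitrary placeholder, not a known safe limit.
warn_wide_columns <- function(df, max_span = 20) {
  is_num <- vapply(df, is.numeric, logical(1))
  spans <- vapply(df[is_num], magnitude_span, numeric(1))
  wide <- names(spans)[spans > max_span]
  if (length(wide) > 0) {
    warning("Columns spanning more than ", max_span,
            " orders of magnitude: ", paste(wide, collapse = ", "))
  }
  invisible(spans)
}

# Made-up data reusing some of the example values quoted in the thread
wide_toy <- data.frame(
  I5 = c(1.2345e+2, 1.3345e-95, 1.2345e-95, 1.2345, 1.2345e-16, 1.2345e+3),
  I1 = c(1, 2, 3, 4, 5, 6)
)
spans <- suppressWarnings(warn_wide_columns(wide_toy))
print(spans)  # I5 spans roughly 98 decades, I1 less than one
```

Zeros are excluded from the span on purpose, so data that has already been cleaned by zeroing tiny values would not be flagged again.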


andrewjohnlowe commented on May 21, 2024

P.S. How should I cite FSelector in the journal paper that I'm writing? How do you prefer to be credited? Do you have a paper I can cite?


larskotthoff commented on May 21, 2024

Have you tried using the same data with RWeka directly to see if it's actually something in RWeka and not in FSelector? It sounds like it's an issue with numeric precision, which may well be caused by the Java interface.
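One way to try the data against RWeka directly is to call its `Discretize` filter through the formula interface, bypassing FSelector altogether. The snippet below is a minimal sketch under several assumptions: RWeka and a working Java installation are available, the `repro` data.frame is made-up placeholder data (the real data from this thread is not available), and the failure surfaces as an ordinary R error that `tryCatch` can intercept, which may not hold if the JVM itself goes down.

```r
# Made-up stand-in for the problematic data; substitute the real data.frame.
repro <- data.frame(
  I1 = c(1.2345e+2, 1.3345e-95, 1.2345, 1.2345e-16, 12.345, 1.2345e+3),
  I2 = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6),
  class.label = factor(c("a", "b", "a", "b", "a", "b"))
)

# Only run if RWeka is installed (it also needs Java at run time).
if (requireNamespace("RWeka", quietly = TRUE)) {
  # Discretize() is RWeka's supervised discretization filter.
  result <- tryCatch(
    RWeka::Discretize(class.label ~ ., data = repro),
    error = function(e) e
  )
  print(result)
}
```

If the same failure appears here, with FSelector not loaded at all, that is a reasonable basis for a bug report to the RWeka maintainers.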

There's no paper to cite for FSelector, but you can cite the package manual (see citation("FSelector")).


andrewjohnlowe commented on May 21, 2024

I ran the Discretize function from RWeka on my data and saw exactly the same behaviour. So this is definitely an issue with RWeka, and not really an FSelector issue. I don't know exactly what is wrong. I had a hunch that the problem was due to my data spanning many orders of magnitude, and the step I took to ameliorate that seemed to fix the problem. If this is the source of the problem, FSelector could test for it and output an informative warning before the crash occurs, so that users are not left scratching their heads trying to figure out what happened. But really this is a problem for the RWeka developers to fix; it's not a problem in FSelector. Go ahead and close this issue if you wish.

Is there any other documentation explaining what FSelector is doing? I'm using "cfs" in my own analysis, which I assume is doing this: http://en.wikipedia.org/wiki/Feature_selection#Correlation_feature_selection
Is this correct? If so: is the CFS criterion the same as above, and what correlation metric is used?

What about the other methods provided by FSelector; how can I get a better understanding of what they are doing? Do these methods all come from Weka, so that I should refer to the Weka documentation?

By the way, FSelector rocks! I'm running on huge data.frames, and there's nothing else I've tried that comes close to the speed and performance of FSelector! Very nice tool! Thanks!


larskotthoff commented on May 21, 2024

Thanks, I'll close this issue here. I'm fairly certain that this is a precision issue to do with RWeka's Java interface, but I don't know enough detail about that to put in a check like you're suggesting -- patches welcome though.

The CFS implementation follows what's described in the tech report you can download from here, but unfortunately there's no "proper" documentation for the implemented methods. The way to learn more about how they work is to read the source code.
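For reference, the CFS criterion as it is commonly stated (for example in the Wikipedia article linked earlier in the thread) scores a subset $S$ of $k$ features by the merit below. Here $\overline{r_{cf}}$ is the mean feature-class correlation and $\overline{r_{ff}}$ the mean feature-feature correlation. Nothing in this thread confirms that FSelector matches it term for term or which correlation measure it uses, so treat this as background rather than a description of the implementation.

```latex
\mathrm{Merit}_{S_k} = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}}
```

The numerator rewards features that are individually predictive of the class, while the denominator penalizes subsets whose features are strongly correlated with one another.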

