
Comments (31)

zzawadz commented on September 13, 2024

We should think a little about the most useful parts of our new package and create them first.

It would be hard to rewrite everything at random and then try to put it all together ;)

So I think we should select the parts that are needed to create something working and useful. Based on that, we can design the interface and the internal skeleton of our package.

For me, the first version should:

  • work with data.frame
  • do all data preparing stuff
  • have one, best algorithm.

Sparse matrices should be postponed to version 0.1;)

It would be good if you added here everything that is necessary to create the simplest, but working, version of FSelectorRcpp.

from fselectorrcpp.

MarcinKosinski commented on September 13, 2024

I think you should mention all collaborators with @ to ensure everyone got an e-mail.
@ItAlja
@Quares

MarcinKosinski commented on September 13, 2024

I think everything included in information.gain/gain.ratio/symmetrical.uncertainty is a good start. So basically entropy.
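
For readers who want to see how the three measures reduce to entropy, here is a minimal sketch in plain C++ (standard library only, no Rcpp). It assumes both the attribute and the class are already discrete and integer-coded, and it uses the natural logarithm. The names `Entropies`, `entropies`, `information_gain`, `gain_ratio` and `symmetrical_uncertainty` are hypothetical helpers for illustration, not the package's API.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical container for the three entropies every measure needs.
struct Entropies {
  double hx;   // H(X): entropy of the attribute
  double hy;   // H(Y): entropy of the class
  double hxy;  // H(X, Y): joint entropy
};

// Shannon entropy (natural log) of a table of counts summing to n.
template <typename Key>
double entropy_from_counts(const std::map<Key, std::size_t>& counts,
                           std::size_t n) {
  double h = 0.0;
  for (const auto& kv : counts) {
    const double p = static_cast<double>(kv.second) / static_cast<double>(n);
    h -= p * std::log(p);
  }
  return h;
}

// Count marginal and joint frequencies in one pass, then take entropies.
Entropies entropies(const std::vector<int>& x, const std::vector<int>& y) {
  assert(!x.empty() && x.size() == y.size());
  std::map<int, std::size_t> cx, cy;
  std::map<std::pair<int, int>, std::size_t> cxy;
  for (std::size_t i = 0; i < x.size(); ++i) {
    ++cx[x[i]];
    ++cy[y[i]];
    ++cxy[std::make_pair(x[i], y[i])];
  }
  const std::size_t n = x.size();
  return Entropies{entropy_from_counts(cx, n), entropy_from_counts(cy, n),
                   entropy_from_counts(cxy, n)};
}

// IG = H(X) + H(Y) - H(X, Y)
double information_gain(const Entropies& e) { return e.hx + e.hy - e.hxy; }

// Gain ratio = IG / H(X); undefined for a constant attribute (H(X) = 0).
double gain_ratio(const Entropies& e) { return information_gain(e) / e.hx; }

// Symmetrical uncertainty = 2 * IG / (H(X) + H(Y))
double symmetrical_uncertainty(const Entropies& e) {
  return 2.0 * information_gain(e) / (e.hx + e.hy);
}
```

Since all three measures share the same three entropies, a single counting pass over the data is enough to compute any of them.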

zzawadz commented on September 13, 2024

OK, I have started some coding. The new entropy function is ready, and I'm now working on discretization in C++. It should be done by the end of this week (so we will have everything needed to port information.gain for data.frame).

When I finish the discretization I will try to write some docs for you ;) Because C++ is a statically typed language, we have to use a lot of templates to support the different R types.
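
To illustrate the point about templates, here is a small sketch in plain C++ (no Rcpp). One function template is written once and the compiler instantiates it for `int`, `double` and `std::string`, which stand in here for R's integer, numeric and character vectors. `count_values` and `most_frequent` are hypothetical names chosen for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Frequency table for any element type that can be ordered with '<'.
template <typename T>
std::map<T, std::size_t> count_values(const std::vector<T>& values) {
  std::map<T, std::size_t> counts;
  for (const T& v : values) {
    ++counts[v];
  }
  return counts;
}

// Most frequent value; ties are resolved in favour of the smallest value.
template <typename T>
T most_frequent(const std::vector<T>& values) {
  assert(!values.empty());
  const std::map<T, std::size_t> counts = count_values(values);
  auto best = counts.begin();
  for (auto it = counts.begin(); it != counts.end(); ++it) {
    if (it->second > best->second) {
      best = it;
    }
  }
  return best->first;
}
```

Calling `most_frequent` on a `std::vector<int>` and on a `std::vector<std::string>` uses the same source code, but two separately compiled functions. The element type still has to be known at compile time, so code that receives an R object of unknown type needs a runtime check of that type before it can pick the right instantiation.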

@ItAlja @Quares @MarcinKosinski have you ever heard about templates in c++?;)

Quares commented on September 13, 2024

I know what templates are.

About R types, I thought Rcpp is handling that when moving between R and C++.

I have coded discretization a couple of times in R, and I am not sure whether it needs to be ported to C++ for speed. Remember that calling a C++ function requires copying/moving data between R and C++, which sometimes adds overhead that might affect performance. I believe discretization can easily be coded with base R functions that are fast enough for our purposes. A simple version could look as follows:

buckets <- 5    # this could be provided by the user
quantiles <- seq(0, 1, 1/buckets)

# Determine the unique quantiles

# x is the input vector (corresponding to a variable in a data.frame)
# na.rm = TRUE drops any NA values that are still present at this point
# Rounding is necessary for unique() to work properly;
# otherwise numbers differing by machine epsilon are not treated as equal
q <- round(quantile(x, probs = quantiles, na.rm = TRUE), digits = 10)
breaks <- unique(q)
cuts <- cut(x, breaks = breaks, right = TRUE, include.lowest = TRUE, ordered_result = TRUE)

# If there were NAs in 'x', we can add them to the levels of 'cuts' using the code below
levels(cuts) <- c(levels(cuts), "NA")
cuts[is.na(cuts)] <- "NA"

MarcinKosinski commented on September 13, 2024

I don't think this is the current implementation in regular FSelector.

The current one is:

library(FSelector)
FSelector:::information.gain.body
FSelector:::discretize.all
FSelector:::equal.frequency.binning.discretization
function (data, bins) 
{
    bins = as.integer(bins)
    if (!is.numeric(data)) 
        stop("Data must be numeric")
    if (bins < 1) 
        stop("Number of bins too small")
    complete = complete.cases(data)
    ord = order(data)
    len = length(data[complete])
    blen = len/bins
    new_data = data
    p1 = p2 = 0
    for (i in 1:bins) {
        p1 = p2 + 1
        p2 = round(i * blen)
        new_data[ord[p1:min(p2, len)]] = i
    }
    return(factor(new_data))
}
<environment: namespace:FSelector>
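
For comparison, the same equal-frequency idea can be sketched in plain C++ (standard library only). This is not a port of the function above: it uses 0-based indexing, marks missing values (NaN) with bin 0 instead of leaving them as NA, and `std::llround` rounds halves away from zero while R's `round` may round them to even. `equal_frequency_bins` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Assign each non-missing value to one of `bins` groups of roughly equal
// size, based on its rank. Returns bin labels 1..bins; 0 marks a NaN input.
std::vector<int> equal_frequency_bins(const std::vector<double>& data,
                                      int bins) {
  assert(bins >= 1);

  // Indices of the non-missing values, ordered by value.
  std::vector<std::size_t> ord;
  for (std::size_t i = 0; i < data.size(); ++i) {
    if (!std::isnan(data[i])) {
      ord.push_back(i);
    }
  }
  std::stable_sort(ord.begin(), ord.end(),
                   [&data](std::size_t a, std::size_t b) {
                     return data[a] < data[b];
                   });

  const std::size_t len = ord.size();
  const double blen = static_cast<double>(len) / bins;  // target bin size
  std::vector<int> out(data.size(), 0);

  std::size_t p1 = 0;  // first rank of the current bin
  for (int i = 1; i <= bins; ++i) {
    // One past the last rank of bin i, clamped to the number of values.
    const std::size_t p2 =
        std::min(len, static_cast<std::size_t>(std::llround(i * blen)));
    for (std::size_t k = p1; k < p2; ++k) {
      out[ord[k]] = i;
    }
    p1 = p2;
  }
  return out;
}
```

Like the R version, this splits purely by rank, so equal values can end up in different bins when a boundary falls inside a run of ties.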

PS: @zzawadz you don't need to use @ every time. Once it has been done in the thread, people will always receive notifications.

Quares commented on September 13, 2024

Marcin, I just posted my suggestion on how discretization could be achieved in R with base functions. The point was that it might not need to be ported to C++.

PS. Is the implementation of equal.frequency.binning.discretization in the current release of FSelector?

Quares commented on September 13, 2024

PS2. OK guys, I think I am delusional here. The whole point of this package is to have C++ implementation of FSelector :D

MarcinKosinski commented on September 13, 2024

I've posted what is already implemented in FSelector :P
If your snippet of code does the same and is faster, then it's good!
If a C++ implementation would be faster than yours, that's even better! Remember to test performance on sparse matrices and regular data.frames/matrices that are above 1 GB in size - this might be the situation in which base R is not enough :)

zzawadz commented on September 13, 2024

By discretization I mean the MDL algorithm - this is the Weka (and Java) stuff. So we need to port it to C++.
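
As a rough sketch of what such a port has to decide, the snippet below implements only the stopping test for one candidate cut point. It assumes that "MDL algorithm" refers to the Fayyad-Irani MDLP criterion, which the thread does not confirm. The inputs are per-class counts on each side of the cut, and `ClassStats`, `class_stats` and `mdl_accepts_split` are hypothetical names. A full discretizer would also search for the best cut point and recurse on both halves.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Summary of a set of class counts.
struct ClassStats {
  double entropy;  // class entropy in bits
  std::size_t n;   // number of observations
  std::size_t k;   // number of classes actually present
};

ClassStats class_stats(const std::vector<std::size_t>& counts) {
  ClassStats s{0.0, 0, 0};
  for (std::size_t c : counts) {
    s.n += c;
  }
  for (std::size_t c : counts) {
    if (c == 0) continue;
    ++s.k;
    const double p = static_cast<double>(c) / static_cast<double>(s.n);
    s.entropy -= p * std::log2(p);
  }
  return s;
}

// Accept a split when
//   gain > (log2(N - 1) + delta) / N,
//   delta = log2(3^k - 2) - (k*Ent(S) - k1*Ent(S1) - k2*Ent(S2)).
bool mdl_accepts_split(const std::vector<std::size_t>& left,
                       const std::vector<std::size_t>& right) {
  assert(left.size() == right.size());
  std::vector<std::size_t> all(left.size());
  for (std::size_t i = 0; i < left.size(); ++i) {
    all[i] = left[i] + right[i];
  }
  const ClassStats s = class_stats(all);
  const ClassStats s1 = class_stats(left);
  const ClassStats s2 = class_stats(right);
  if (s1.n == 0 || s2.n == 0) {
    return false;  // not a real split
  }
  const double n = static_cast<double>(s.n);
  const double gain =
      s.entropy - (s1.n / n) * s1.entropy - (s2.n / n) * s2.entropy;
  const double delta =
      std::log2(std::pow(3.0, static_cast<double>(s.k)) - 2.0) -
      (s.k * s.entropy - s1.k * s1.entropy - s2.k * s2.entropy);
  return gain > (std::log2(n - 1.0) + delta) / n;
}
```

A cut that separates the classes cleanly passes the test, while a cut that leaves the class proportions unchanged on both sides has zero gain and is rejected.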

And also - moving data to c++ does not require copies. You are sending pointers to data structures - it is not Java:)
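
The difference between passing a pointer and copying can be sketched in plain C++ (no Rcpp). `NumericView`, `sum_view` and `scale_in_place` are hypothetical names: the view holds only an address and a length, so creating it costs the same regardless of how large the data is. As far as I know, Rcpp vectors wrap the R object in a similar way when the types match, while a type conversion (for example an integer vector passed where a numeric one is expected) does produce a copy.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Non-owning view of a block of doubles: a pointer and a length, nothing else.
struct NumericView {
  const double* data;
  std::size_t size;
};

// Reads through the pointer; no element is copied.
double sum_view(NumericView v) {
  assert(v.data != nullptr || v.size == 0);
  double total = 0.0;
  for (std::size_t i = 0; i < v.size; ++i) {
    total += v.data[i];
  }
  return total;
}

// Writes through the pointer, so the caller's data changes in place.
void scale_in_place(double* data, std::size_t size, double factor) {
  for (std::size_t i = 0; i < size; ++i) {
    data[i] *= factor;
  }
}

// By-value parameter: the whole vector is copied on every call.
double sum_copy(std::vector<double> values) {
  double total = 0.0;
  for (double v : values) {
    total += v;
  }
  return total;
}
```

Sharing memory has a flip side: a function that writes through the pointer, such as `scale_in_place`, changes the caller's data, so read-only access through `const` is the safer default.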

MarcinKosinski commented on September 13, 2024

Maybe we'll create a new hashtag on twitter #R_is_free_of_Java ?

ItAlja commented on September 13, 2024

@zzawadz unfortunately I've never heard about templates in c++ you mentioned. I don't have experience here. But I'll learn, for sure. (:

MarcinKosinski commented on September 13, 2024

Me neither [https://en.wikipedia.org/wiki/Template_%28C%2B%2B%29], but I thought Rcpp solves such problems (or are we not using Rcpp anywhere?).

zzawadz commented on September 13, 2024

I will prepare docs with everything that is needed to work with Rcpp, R and templates :)

Stay calm!

MarcinKosinski commented on September 13, 2024

So maybe it could also be a simple R + C++/Rcpp tutorial as a small vignette in our package?
Such a tutorial, with an example of how we have implemented one function, could use the rlp form of creating a package: http://yihui.name/rlp/

PS: Just in case, I am only a dreamer

zzawadz commented on September 13, 2024

Of course we do not need to rewrite the whole package in C++. We have to create some (a lot of...) benchmarks for FSelector and rebuild only the vital things :) I just love C++, and it might be hard for me to leave some functions in pure R ;)

For now I will port this discretization to C++, and then I will work on sparse matrices. Everyone should feel free to do some coding in R, because we have to find the most important bottlenecks in the old package ;)

zzawadz commented on September 13, 2024

OK. information_gain is now fully in C++ (without NA support, which I need some time to think about, but with discretisation, so we are Java-free :)):

Some code to test it:

library(FSelector)
library(FSelectorRcpp)

dt = lapply(1:50, function(xx)
{
  x = rnorm(100000, mean = 10 * xx)
  y = rnorm(100000, mean = 0.5 * xx)
  z = 10 * xx + 0.5 * sqrt(xx)
  data.frame(x,y,z)
})

dt = Reduce(rbind, dt)

dt$z = as.factor(as.integer(round(dt$z)))

system.time(information.gain(z ~ ., dt))
system.time(information_gain(z ~ ., dt))

Some results:

> system.time(information.gain(z ~ ., dt))
   user  system elapsed 
 137.45    1.74  103.44 
> system.time(information_gain(z ~ ., dt))
   user  system elapsed 
  39.46    0.03   39.58 

Now - I have to do some code cleaning, and create some docs;)

MarcinKosinski commented on September 13, 2024

Great! Looks nice.

Have you considered do.call(bind_rows, .) instead of Reduce(rbind, .)?

zzawadz commented on September 13, 2024

Wow! do.call(bind_rows,.) is awesome :)

MarcinKosinski commented on September 13, 2024

Talking about handling NAs, maybe we should remove such rows and tell the user that they should handle NAs themselves? Like in ggplot2?

zzawadz commented on September 13, 2024

Yeah, but NA handling should be quite easy - there is the nice std::isnan, and it is compatible with R's numeric NA:

#include <Rcpp.h>
#include <cmath>
using namespace Rcpp;

// [[Rcpp::export]]
bool is_na(NumericVector x) {
  return std::isnan(x[0]);
}

/*** R
is_na(42.0)
is_na(NA)
*/

MarcinKosinski commented on September 13, 2024

Talking about coercing to factors:
https://github.com/mi2-warsaw/FSelectorRcpp/blob/master/R/information_gain.R#L12-L15

did you know:

library(purrr)
bob %>% map_if(is.character, as.factor) -> bob

MarcinKosinski commented on September 13, 2024

@zzawadz do you think we have minimum viable product :P ?

MarcinKosinski commented on September 13, 2024

@zzawadz I think that handling NAs should be done as mentioned here (#9 (comment)). Talking about minimum viable product I think information_gain is enough for ver 0.1.0.

zzawadz commented on September 13, 2024

I think we only need to add better handling of the discretisation function and improve some documentation, and then we will have an MVP that is suitable for CRAN :)

MarcinKosinski commented on September 13, 2024

I can try with the documentation :) I'll be back from holidays in 2 days.

ItAlja commented on September 13, 2024

Guys, how can I help you in this project? What can I do?
regards,
Ola

MarcinKosinski commented on September 13, 2024

@ItAlja what would you say about that: #13 (comment) ?

ItAlja commented on September 13, 2024

@MarcinKosinski could you please describe more precisely what I should do? Using what code, and what kind of data? My own, random?
What did you actually mean by "provide some real test examples for greedy_search and exhaustive_search"?
I'm a little bit off the main project path and I'm not sure I understand you correctly.
Thanks (:

MarcinKosinski commented on September 13, 2024

@zzawadz created greedy_search and exhaustive_search as equivalents of FSelector::exhaustive.search, FSelector::backward.search and FSelector::forward.search.

You might try to check whether the original functions from FSelector and the new ones from FSelectorRcpp give the same results on a new example (a new dataset and a new feature), and whether FSelectorRcpp is faster, using microbenchmark::microbenchmark :)

MarcinKosinski commented on September 13, 2024

I am working on examples for exhaustive search in #23

Talking about minimum viable product

We have prepared information_gain - an entropy-based filter algorithm with data frame and sparse matrix support, built with Rcpp and a parallel backend. After I extend the examples, provide a vignette and create benchmarks (which will also enable me to check this function), we'll have a minimum viable product. I think this issue might be moved to #22.

To sum up: our minimum viable product would be

  • information_gain: Rcpp+parallel and sparse matrix support with no Java
  • cut_off_attrs
  • searches
