
fselectorrcpp's People

Contributors

aleksandradabrowska, krzyslom, marcinkosinski, pat-s, zzawadz

fselectorrcpp's Issues

discretization specified by a user

How is discretization done right now? See https://github.com/mi2-warsaw/FSelectorRcpp/blob/master/src/discretize.cpp. Do we cut a continuous variable into 5 intervals whose breaks are the 20%, 40%, 60% and 80% quantiles?

Could this be specified by the user, i.e. how many intervals should be used? It could be an additional parameter in information_gain: intervals = 5?
The R logic for this parameter would be cut(variable, breaks = quantile(variable, seq(0, 1, length.out = intervals + 1)), include.lowest = TRUE)? (intervals + 1 break points give intervals bins.)
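To make the proposal concrete, here is a minimal C++ sketch of quantile-based binning with a user-chosen number of intervals. It is not the code in discretize.cpp; the names quantile_breaks and bin_of are hypothetical, and the quantiles are assumed to use linear interpolation between order statistics (the default of R's quantile()).

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Interior break points at the i/intervals quantiles, i = 1..intervals-1.
// Assumption: linear interpolation between order statistics.
std::vector<double> quantile_breaks(std::vector<double> values, int intervals) {
    if (intervals < 1) throw std::invalid_argument("intervals must be >= 1");
    if (values.empty()) throw std::invalid_argument("values must not be empty");
    std::sort(values.begin(), values.end());
    const double n = static_cast<double>(values.size());
    std::vector<double> breaks;
    for (int i = 1; i < intervals; ++i) {
        const double pos = (n - 1.0) * i / intervals;  // fractional index
        const std::size_t lo = static_cast<std::size_t>(std::floor(pos));
        const std::size_t hi = std::min(lo + 1, values.size() - 1);
        const double frac = pos - static_cast<double>(lo);
        breaks.push_back(values[lo] + frac * (values[hi] - values[lo]));
    }
    return breaks;
}

// Zero-based bin of x: the number of breaks strictly below x,
// so intervals are closed on the right, as with cut().
int bin_of(double x, const std::vector<double>& breaks) {
    return static_cast<int>(
        std::lower_bound(breaks.begin(), breaks.end(), x) - breaks.begin());
}
```

With intervals = 5 this produces four interior breaks and bins numbered 0 to 4; the minimum and maximum never need explicit break points because everything below the first break falls into bin 0 and everything above the last into the final bin.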

Discretize - arguments

discretize(x = iris[[1]], y = iris[[5]], k = 2): k is an unused argument, but there is no warning.

discretize(x = iris[[1]], y = iris[[5]], control = list(equalsizeControl(k = -1)), keepAll = TRUE, call = NULL): for k smaller than zero or non-integer we should get an error; instead, R encounters a fatal error.
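One way to turn the crash into a regular error is to validate k before it reaches the binning code. The sketch below is only an illustration: checked_bin_count is a hypothetical helper, and inside the package the exception would presumably be reported through Rcpp (which converts C++ exceptions into R errors) rather than left as a plain std::invalid_argument.

```cpp
#include <cmath>
#include <limits>
#include <stdexcept>

// Accepts k only if it is a finite, positive whole number that fits in an int.
// Hypothetical helper; the real control object may hold k differently.
int checked_bin_count(double k) {
    const double max_int = static_cast<double>(std::numeric_limits<int>::max());
    if (!std::isfinite(k) || k < 1.0 || k > max_int || std::floor(k) != k) {
        throw std::invalid_argument("k must be a positive whole number");
    }
    return static_cast<int>(k);
}
```

Calling checked_bin_count(-1) or checked_bin_count(2.5) throws instead of passing a bad value further down.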

BUG: snapping into wrong generation

@zzawadz I am trying to write a few benchmarks (issue #25), but I get an Rcpp/C++ error that I cannot trace and resolve.

Code

#install.packages('microbenchmark')
library(microbenchmark)

library(FSelectorRcpp)
library(FSelector)


library(RTCGA.rnaseq)
BRCA.rnaseq$bcr_patient_barcode <- 
   substr(BRCA.rnaseq$bcr_patient_barcode, 14, 14)
microbenchmark(times = 1L,
  information_gain(y = BRCA.rnaseq[, 1],
                 x = BRCA.rnaseq[, 2:10000]),# -> FSelectorRcpp.weights},
  information_gain(y = BRCA.rnaseq[, 1], threads = 2,
                 x = BRCA.rnaseq[, 2:10000]),# -> FSelectorRcpp2.weights}#,
  information_gain(y = BRCA.rnaseq[, 1], threads = 4,
                  x = BRCA.rnaseq[, 2:10000]),# -> FSelectorRcpp4.weights},
  information_gain(y = BRCA.rnaseq[, 1], threads = 6,
                  x = BRCA.rnaseq[, 2:10000])# -> FSelectorRcpp6.weights},
  # {information.gain(bcr_patient_barcode~.,
  #                BRCA.rnaseq[, 1:10000]) -> FSelector.weights}
)

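Error (a screenshot named "snapping" in the original issue, not preserved here; see the issue title for the message).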

Build options for roxygen2 + man (documentation)

I have fixed the build options for our project:
Project -> Project Options -> Build Tools -> [Generate documentation with Roxygen] Configure (they can also be seen in plain text here: https://github.com/mi2-warsaw/FSelectorRcpp/blob/master/FSelectorRcpp.Rproj).

So now, after every build, we get a man folder with documentation :P Only then can I build a staticdocs web page, as I was trying to do in #15. I couldn't do this before because we did not have any documentation.

Compilation error

I got this error when trying to install the package. I am running R version 3.2.5, Platform: x86_64-pc-linux-gnu (64-bit).

The error message pasted below:

g++ -std=c++0x -I/usr/share/R/include -DNDEBUG -fopenmp -I../inst/include -I"/home/owca/R/x86_64-pc-linux-gnu-library/3.2/Rcpp/include" -I"/home/owca/R/x86_64-pc-linux-gnu-library/3.2/BH/include" -I"/home/owca/R/x86_64-pc-linux-gnu-library/3.2/RcppArmadillo/include" -I"/home/owca/R/x86_64-pc-linux-gnu-library/3.2/testthat/include" -fpic -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security -Werror=format-security -D_FORTIFY_SOURCE=2 -g -c support.cpp -o support.o
support.cpp: In function ‘Rcpp::IntegerVector fs_table1d(SEXPREC*&)’:
support.cpp:140:10: error: ‘strncmp’ is not a member of ‘std’
if(std::strncmp(xx.attr("class"), "factor", 6) == 0)
^
support.cpp:140:10: note: suggested alternative:
In file included from /usr/share/R/include/R_ext/RS.h:26:0,
from /usr/share/R/include/R.h:50,
from /home/owca/R/x86_64-pc-linux-gnu-library/3.2/Rcpp/include/Rcpp/r/headers.h:52,
from /home/owca/R/x86_64-pc-linux-gnu-library/3.2/Rcpp/include/RcppCommon.h:29,
from /home/owca/R/x86_64-pc-linux-gnu-library/3.2/Rcpp/include/Rcpp.h:27,
from support.cpp:1:
/usr/include/string.h:147:12: note: ‘strncmp’
extern int strncmp (const char *__s1, const char *__s2, size_t __n)
^
make: *** [support.o] Error 1
ERROR: compilation failed for package ‘FSelectorRcpp’

  • removing ‘/home/owca/R/x86_64-pc-linux-gnu-library/3.2/FSelectorRcpp’
    ERROR: Command failed (1)
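The log shows that only the C header <string.h> was pulled in (through R's headers), which declares ::strncmp in the global namespace; std::strncmp is only guaranteed after including <cstring>. A minimal sketch of the likely fix follows. The helper name has_factor_prefix is hypothetical and the actual change made in support.cpp may differ.

```cpp
#include <cstring>  // declares std::strncmp; <string.h> alone only guarantees ::strncmp

// True when class_name starts with "factor", mirroring the comparison
// quoted in the compiler error above.
bool has_factor_prefix(const char* class_name) {
    return class_name != nullptr && std::strncmp(class_name, "factor", 6) == 0;
}
```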

feature_search documentation

We should add to the arguments documentation that the default value for mode is "greedy" and for type is "forward".

Cut_attrs documentation

The description of argument k should say that for k >= 1 the value is floored, i.e. floor(k) is used.
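The rule could be illustrated as below. Only the k >= 1 branch comes from this issue; treating 0 < k < 1 as a fraction of the attributes, rounding that fraction down, and capping at the number of attributes are assumptions made for the sketch, and attrs_to_keep is a hypothetical name.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <stdexcept>

// Number of attributes kept for a given k.
// From the issue: for k >= 1 the result is floor(k).
// Assumption: 0 < k < 1 means a fraction of n_attrs, rounded down.
std::size_t attrs_to_keep(std::size_t n_attrs, double k) {
    if (!(k > 0.0)) throw std::invalid_argument("k must be positive");
    const double n = static_cast<double>(n_attrs);
    const double wanted = (k < 1.0) ? std::floor(k * n) : std::floor(k);
    return static_cast<std::size_t>(std::min(wanted, n));
}
```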

Error when column names are not syntactically valid variable names

E.g., this results in error:

> df <- data.frame(a=0:1, "b + c"=3:4, check.names=F)
> FSelectorRcpp::information_gain(a ~ ., df)
Error in `[.data.frame`(data, c(formula$x, formula$y)) : 
  undefined columns selected

The problem comes from attr(terms(a ~ ., data = df), "term.labels"), which returns such column names enclosed in backticks (` `), so they no longer match the names in the data frame.
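A possible remedy is to remove the enclosing backticks from each term label before using it to index the data. The sketch below shows the string handling only; strip_backticks is a hypothetical helper, and the package might equally well do this on the R side.

```cpp
#include <string>

// Removes one pair of enclosing backticks, if present; other labels pass through.
std::string strip_backticks(const std::string& label) {
    if (label.size() >= 2 && label.front() == '`' && label.back() == '`') {
        return label.substr(1, label.size() - 2);
    }
    return label;
}
```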

Overall vignette a'la rticle

One could start writing a vignette in which we could so far sum up our motivation for this project, and after version 0.0.1 is finished we could provide an example of performance progress in the information.gain function.

Handling NA's

I think we should add support for NA (and I am working on this).

It should add only about 10% time overhead, so I don't think that is a big deal :)

Authorship?

@MarcinKosinski
I need to copy one whole function from FSelector (we can't have FSelector in the Imports field, because FSelector depends on Java and we don't want any Java dependency problems).

How should we credit the original function's author in our package?

Create functions for entropy based filters

We have all the building blocks to create the following functions on the R side:

information.gain(formula, data) #(it's nearly ready)
gain.ratio(formula, data)
symmetrical.uncertainty(formula, data)

For now they should work only with data.frames without NAs (I'm working on this on the C++ side).

Any volunteer?
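For reference, the three measures listed above can all be derived from the same three entropies. The sketch below assumes already discretized integer codes, natural logarithms, and the usual definitions: information gain H(X) + H(Y) - H(X,Y), gain ratio IG / H(X), symmetrical uncertainty 2 IG / (H(X) + H(Y)). The names are hypothetical and this is not the package's implementation.

```cpp
#include <cmath>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <utility>
#include <vector>

// Shannon entropy (natural log) of a table of counts over n observations.
template <typename Key>
double entropy_of_counts(const std::map<Key, std::size_t>& counts, std::size_t n) {
    double h = 0.0;
    for (const auto& kv : counts) {
        const double p = static_cast<double>(kv.second) / static_cast<double>(n);
        h -= p * std::log(p);
    }
    return h;
}

struct EntropyScores {
    double information_gain;
    double gain_ratio;
    double symmetrical_uncertainty;
};

// x: attribute codes, y: class codes; both discretized, no missing values.
EntropyScores entropy_scores(const std::vector<int>& x, const std::vector<int>& y) {
    if (x.empty() || x.size() != y.size()) {
        throw std::invalid_argument("x and y must be non-empty and of equal length");
    }
    std::map<int, std::size_t> cx, cy;
    std::map<std::pair<int, int>, std::size_t> cxy;
    for (std::size_t i = 0; i < x.size(); ++i) {
        ++cx[x[i]];
        ++cy[y[i]];
        ++cxy[std::make_pair(x[i], y[i])];
    }
    const double hx = entropy_of_counts(cx, x.size());
    const double hy = entropy_of_counts(cy, x.size());
    const double hxy = entropy_of_counts(cxy, x.size());
    const double ig = hx + hy - hxy;
    EntropyScores s;
    s.information_gain = ig;
    s.gain_ratio = hx > 0.0 ? ig / hx : 0.0;  // constant attribute: define as 0
    s.symmetrical_uncertainty = (hx + hy) > 0.0 ? 2.0 * ig / (hx + hy) : 0.0;
    return s;
}
```

Returning 0 when the denominator is zero is a choice made for this sketch; a real implementation might prefer NaN or NA.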

Who rewrites which functions?

> library(FSelector)
> ls("package:FSelector", all.names = TRUE)
 [1] "as.simple.formula"        "backward.search"          "best.first.search"        "cfs"                     
 [5] "chi.squared"              "consistency"              "cutoff.biggest.diff"      "cutoff.k"                
 [9] "cutoff.k.percent"         "exhaustive.search"        "forward.search"           "gain.ratio"              
[13] "hill.climbing.search"     "information.gain"         "linear.correlation"       "oneR"                    
[17] "random.forest.importance" "rank.correlation"         "relief"                   "symmetrical.uncertainty"

Rewrite exhaustive.search function

As a first subtask, which could be useful in other tasks, I have rewritten R's combn() function, which returns a matrix of all k-subsets of a given set.
The C++ template class Subset does exactly the same job as R's combn() and can be found in inst/include/exhaustive.search/Subset.h.

@MarcinKosinski , @zzawadz
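For readers who have not opened Subset.h, the idea of enumerating all k-subsets can be sketched as follows. This is an independent illustration, not the Subset class itself: k_subsets is a hypothetical function that returns zero-based index sets in lexicographic order, one subset per element rather than a matrix.

```cpp
#include <cstddef>
#include <vector>

// All k-element subsets of {0, ..., n-1}, in lexicographic order.
std::vector<std::vector<std::size_t>> k_subsets(std::size_t n, std::size_t k) {
    std::vector<std::vector<std::size_t>> out;
    if (k > n) return out;
    std::vector<std::size_t> idx(k);
    for (std::size_t i = 0; i < k; ++i) idx[i] = i;  // first subset: 0..k-1
    while (true) {
        out.push_back(idx);
        // Find the rightmost position that has not reached its maximum value.
        std::size_t i = k;
        while (i > 0 && idx[i - 1] == n - k + (i - 1)) --i;
        if (i == 0) break;  // every position is at its maximum: done
        ++idx[i - 1];
        // Reset everything to the right to the smallest increasing run.
        for (std::size_t j = i; j < k; ++j) idx[j] = idx[j - 1] + 1;
    }
    return out;
}
```

For an exhaustive search the result grows as n choose k, so in practice it is usually better to visit subsets one at a time than to store them all.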

Incorrect output in information_gain function

When given a wrong combination of parameters, the function doesn't return any error or warning.
Example:

irisX = iris[-5]
y = as.vector(iris$Species)

information_gain(x = irisX)
information_gain(formula = Species ~ .)
information_gain(data = iris)
information_gain(x = irisX, data = iris)
information_gain(y = y)
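The calls above suggest that exactly one complete pair should be accepted: formula with data, or x with y. A sketch of such a check is shown below; resolve_interface and the Interface enum are hypothetical, and in the package this logic would more naturally live in the R wrapper and call stop().

```cpp
#include <stdexcept>

enum class Interface { Formula, XY };

// Accepts formula + data, or x + y, and nothing else.
// Assumption: mixing arguments from both interfaces is an error.
Interface resolve_interface(bool has_formula, bool has_data, bool has_x, bool has_y) {
    const bool formula_any = has_formula || has_data;
    const bool xy_any = has_x || has_y;
    if (has_formula && has_data && !xy_any) return Interface::Formula;
    if (has_x && has_y && !formula_any) return Interface::XY;
    throw std::invalid_argument("supply either formula and data, or x and y");
}
```

Each of the five calls in the example would be rejected by this check.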

No to std::cout

@DSkrzypiec
If we plan a CRAN release, we can't use iostream and std::cout in our code :( It causes this warning in R CMD check:

File ‘FSelectorRcpp/libs/FSelectorRcpp.so’:
  Found ‘__assert_fail’, possibly from ‘assert’ (C)
    Object: ‘information_gain.o’
Compiled code should not call entry points which might terminate R nor
write to stdout/stderr instead of to the console, nor the system RNG.

I've added a special macro to handle this: you need to #include "../FSelectorConfig.h" and then use FS_OUTPUT like this:

FS_OUTPUT << " " << std::endl;

For now I think it's a good solution to satisfy CRAN :) But feel free to use std::cout during development :D
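The contents of FSelectorConfig.h are not shown in this issue, so the following is only a guess at how such a macro could be set up: it sends output to Rcpp::Rcout in package builds and to std::cout otherwise. The flag FS_SKETCH_USE_RCPP and the helper format_progress are hypothetical.

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Assumption: a build flag (hypothetical name) marks builds inside the R package.
#ifdef FS_SKETCH_USE_RCPP
  #include <Rcpp.h>
  #define FS_OUTPUT Rcpp::Rcout   // goes to the R console, as CRAN expects
#else
  #define FS_OUTPUT std::cout     // plain standard output for standalone builds
#endif

// Builds the message separately so the stream is touched in one place only.
std::string format_progress(int done, int total) {
    std::ostringstream out;
    out << done << "/" << total;
    return out.str();
}

void report_progress(int done, int total) {
    FS_OUTPUT << format_progress(done, total) << std::endl;
}
```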

Installation bug: cutOff.o:cutOff.cpp:(.text.startup

@zzawadz I was trying to build the package on Windows and it failed with the warning below.
Moreover, a regular devtools installation also fails with it:

> library(devtools)
Warning message:
package ‘devtools’ was built under R version 3.2.5 
> install_github('mi2-warsaw/FSelectorRcpp')
Downloading GitHub repo mi2-warsaw/FSelectorRcpp@master
from URL https://api.github.com/repos/mi2-warsaw/FSelectorRcpp/zipball/master
Installing FSelectorRcpp
Installing 3 packages: BH, Rcpp, RcppArmadillo
Installing packages into ‘C:/Users/Marcin/Documents/R/win-library/3.2’
(as ‘lib’ is unspecified)
trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.2/BH_1.60.0-2.zip'
Content type 'application/zip' length 15529294 bytes (14.8 MB)
downloaded 14.8 MB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.2/Rcpp_0.12.5.zip'
Content type 'application/zip' length 3192046 bytes (3.0 MB)
downloaded 3.0 MB

trying URL 'https://cran.rstudio.com/bin/windows/contrib/3.2/RcppArmadillo_0.7.100.3.1.zip'
Content type 'application/zip' length 1729119 bytes (1.6 MB)
downloaded 1.6 MB

package ‘BH’ successfully unpacked and MD5 sums checked
package ‘Rcpp’ successfully unpacked and MD5 sums checked
package ‘RcppArmadillo’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    C:\Users\Marcin\AppData\Local\Temp\Rtmpaw3JLC\downloaded_packages
"C:/PROGRA~1/R/R-32~1.3/bin/x64/R" --no-site-file --no-environ  \
  --no-save --no-restore --quiet CMD INSTALL  \
  "C:/Users/Marcin/AppData/Local/Temp/Rtmpaw3JLC/devtools1ce84e0f40da/mi2-warsaw-FSelectorRcpp-1705d75"  \
  --library="C:/Users/Marcin/Documents/R/win-library/3.2"  \
  --install-tests 

* installing *source* package 'FSelectorRcpp' ...
** libs
g++ -m64 -std=c++0x -I"C:/PROGRA~1/R/R-32~1.3/include" -DNDEBUG -I../inst/include -fopenmp   -I"C:/Users/Marcin/Documents/R/win-library/3.2/Rcpp/include" -I"C:/Users/Marcin/Documents/R/win-library/3.2/BH/include" -I"C:/Users/Marcin/Documents/R/win-library/3.2/RcppArmadillo/include" -I"d:/RCompile/r-compiling/local/local323/include"     -O2 -Wall  -mtune=core2 -c RcppExports.cpp -o RcppExports.o
g++ -m64 -std=c++0x -I"C:/PROGRA~1/R/R-32~1.3/include" -DNDEBUG -I../inst/include -fopenmp   -I"C:/Users/Marcin/Documents/R/win-library/3.2/Rcpp/include" -I"C:/Users/Marcin/Documents/R/win-library/3.2/BH/include" -I"C:/Users/Marcin/Documents/R/win-library/3.2/RcppArmadillo/include" -I"d:/RCompile/r-compiling/local/local323/include"     -O2 -Wall  -mtune=core2 -c cutOff.cpp -o cutOff.o
In file included from cutOff.cpp:2:0:
../inst/include/cutoff/cutOff.h: In function 'std::vector<T> fselector::cutoff::cutOff_k(std::vector<T>&, std::vector<T2>&, double, bool) [with T1 = std::basic_string<char>, T2 = double]':
cutOff.cpp:26:52:   instantiated from here
../inst/include/cutoff/cutOff.h:70:17: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
g++ -m64 -std=c++0x -I"C:/PROGRA~1/R/R-32~1.3/include" -DNDEBUG -I../inst/include -fopenmp   -I"C:/Users/Marcin/Documents/R/win-library/3.2/Rcpp/include" -I"C:/Users/Marcin/Documents/R/win-library/3.2/BH/include" -I"C:/Users/Marcin/Documents/R/win-library/3.2/RcppArmadillo/include" -I"d:/RCompile/r-compiling/local/local323/include"     -O2 -Wall  -mtune=core2 -c discretize.cpp -o discretize.o
discretize.cpp: In function 'Rcpp::IntegerVector discretize_cpp(const NumericVector&, const IntegerVector&)':
discretize.cpp:62:30: error: 'to_string' is not a member of 'std'
discretize.cpp:70:37: error: 'to_string' is not a member of 'std'
make: *** [discretize.o] Error 1
Warning: running command 'make -f "Makevars.win" -f "C:/PROGRA~1/R/R-32~1.3/etc/x64/Makeconf" -f "C:/PROGRA~1/R/R-32~1.3/share/make/winshlib.mk" CXX='$(CXX1X) $(CXX1XSTD)' CXXFLAGS='$(CXX1XFLAGS)' CXXPICFLAGS='$(CXX1XPICFLAGS)' SHLIB_LDFLAGS='$(SHLIB_CXX1XLDFLAGS)' SHLIB_LD='$(SHLIB_CXX1XLD)' SHLIB="FSelectorRcpp.dll" WIN=64 TCLBIN=64 OBJECTS="RcppExports.o cutOff.o discretize.o entropy.o information_gain.o support.o table.o"' had status 2
ERROR: compilation failed for package 'FSelectorRcpp'
* removing 'C:/Users/Marcin/Documents/R/win-library/3.2/FSelectorRcpp'
Error: Command failed (1)
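The failing lines use std::to_string, which older MinGW toolchains did not provide even with -std=c++0x. A common workaround is a small stream-based helper, sketched below. The name to_string_compat is hypothetical and the fix actually applied in discretize.cpp may be different.

```cpp
#include <sstream>
#include <string>

// Portable stand-in for std::to_string built on a string stream.
template <typename T>
std::string to_string_compat(const T& value) {
    std::ostringstream out;
    out << value;
    return out.str();
}
```

Note that the formatting of floating-point values differs from std::to_string, which always prints six decimal places; for integer labels the two agree.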

CRAN submission

@MarcinKosinski what do you think about going to CRAN with the first release? I think it's quite stable and information_gain works pretty well, so maybe it's time?

include column with variable name in information_gain() output

information_gain() returns a single-column data.frame with importance scores:

infrm_> information_gain(formula = Species ~ ., data = iris, type = "symuncert")
             importance
Sepal.Length  0.4155563
Sepal.Width   0.2452743
Petal.Length  0.8571872
Petal.Width   0.8705214

This output is, however, not very friendly, since the variable names are provided as row names. They should instead be provided in an additional column, which would make them easier to access from other functions.

Note that using row names to carry additional information about the data is discouraged by many authors. Moreover, converting the information_gain() output to other objects, e.g. dplyr's tibble, could drop the row names.

Sparse matrix support

I have added simple sparse matrix support for information_gain. It assumes that we have a sparse matrix with factors (or integers), and that the dependent variable is also a factor (or integer).

For now this is just a simple hack: I convert the sparse matrix to dense format column by column (not the whole matrix, just one column at a time) and then apply our standard functions. This approach is not very memory-efficient (and probably not fast), but it's simple, and it works :D

In the future we should prepare a much more robust and elegant way of handling sparse matrices :)
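The column-at-a-time conversion described above can be sketched as follows, assuming a compressed sparse column layout (column pointers, row indices and values, as in the Matrix package's dgCMatrix). CscMatrix and dense_column are hypothetical names, and the package code works on R objects rather than on this struct.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Minimal compressed sparse column (CSC) container for integer codes.
struct CscMatrix {
    std::size_t n_rows;
    std::vector<std::size_t> col_ptr;  // size n_cols + 1; column j spans [col_ptr[j], col_ptr[j+1])
    std::vector<std::size_t> row_idx;  // row of each stored value
    std::vector<int> values;           // stored (non-zero) values
};

// Expands a single column to dense form; all other columns stay sparse.
std::vector<int> dense_column(const CscMatrix& m, std::size_t j) {
    if (j + 1 >= m.col_ptr.size()) throw std::out_of_range("column index out of range");
    std::vector<int> col(m.n_rows, 0);
    for (std::size_t p = m.col_ptr[j]; p < m.col_ptr[j + 1]; ++p) {
        col[m.row_idx[p]] = m.values[p];
    }
    return col;
}
```

Only one dense vector of length n_rows is alive at a time, which is what keeps the extra memory bounded even though the per-column allocation is repeated.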
