ranger's People

Contributors

0x7f, ben519, bfgray3, bgreenwell, brunaw, dependabot[bot], edoffagne, gregordecillia, ironholds, jemus42, jtibshirani, katrinleinweber, kirillseva, krlmlr, kysolvik, lnicola, lorentzenchr, michaelchirico, mnwright, olivroy, rcannood, rnowling, romanhornung, rvalavi, spineki, stanlazic, stephematician, svenvw, talegari


ranger's Issues

Allow for random splits in regression

Unfortunately, I cannot find a source for how this is done in practice or how it benchmarks, but two ways of implementing it would be:

Suppose that, for one tree, the optimal split for a feature X lies between the values a and b.

  1. Set the split point to a fixed but randomly drawn value between a and b
  2. Save both values a and b and, at each prediction, draw the split point randomly between a and b

For 1. and 2. the regression will be smoother than with the fixed split point (a+b)/2. This is the main benefit!
For a single prediction, 1. and 2. make no difference.
Multiple predictions from the same random forest model will differ under 2. but will always be identical under 1.
Version 1. is probably computationally faster and easier to implement.

I opt for 1.
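A minimal sketch of option 1 (a hypothetical illustration, not ranger code): after the split search has produced the two boundary values a and b, draw the split point uniformly between them once at training time and reuse it for every prediction.

# Sketch of option 1: split point drawn uniformly between the boundary values a and b.
# random_split_point() is a hypothetical helper for illustration only.
random_split_point <- function(a, b) {
  runif(1, min = a, max = b)
}

set.seed(1)
a <- 2.3
b <- 2.9
split_point <- random_split_point(a, b)  # fixed once, then reused for all predictions
split_point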

How to correctly work with unordered factors in ranger

I am trying to understand how I should enter unordered factors in ranger:

# packages used in this example
library(ranger)
library(dplyr)  # provides data_frame()

# create some sample data
df_foo = data_frame(
  factor1 = factor(sample(LETTERS[1:10], size = 1000, replace = TRUE)),
  numeric1 = rnorm(1000),
  target = as.factor(sample(c(0, 1), size = 1000, replace = TRUE))
)

# respect.unordered.factors = TRUE 
# factor
ranger_foo_1 = ranger(
  formula = target ~ factor1 + numeric1,
  data = df_foo,
  num.trees = 100, 
  mtry = 2, 
  min.node.size = 10,
  respect.unordered.factors = TRUE,
  seed = 1234
)

# respect.unordered.factors = TRUE 
# character
ranger_foo_2 = ranger(
  formula = target ~ as.character(factor1) + numeric1,
  data = df_foo,
  num.trees = 100, 
  mtry = 2, 
  min.node.size = 10,
  respect.unordered.factors = TRUE,
  seed = 1234
)

# respect.unordered.factors = FALSE 
# character
ranger_foo_3 = ranger(
  formula = target ~ as.character(factor1) + numeric1,
  data = df_foo,
  num.trees = 100, 
  mtry = 2, 
  min.node.size = 10,
  respect.unordered.factors = FALSE,
  seed = 1234
)

# respect.unordered.factors = FALSE
# factor
ranger_foo_4 = ranger(
  formula = target ~ factor1 + numeric1,
  data = df_foo,
  num.trees = 100, 
  mtry = 2, 
  min.node.size = 10,
  respect.unordered.factors = FALSE,
  seed = 1234
)

# check the differences among the results
ranger_foo_1
ranger_foo_2
ranger_foo_3
ranger_foo_4

The result of ranger_foo_1 differs from the rest. Assuming that, from a theoretical point of view, this is the right answer (essentially equivalent to one-hot encoding the variable), should I always explicitly convert character variables to factors before passing them to ranger? And what does the call with respect.unordered.factors = TRUE and as.character(factor1) (ranger_foo_2) actually do, computationally? A quick comparison of the four fits is sketched below.

Thanks.
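One quick way to see how much the four fits actually differ (a sketch, assuming the fitted objects expose the prediction.error field, i.e. the OOB error, as in current ranger versions):

# Compare OOB prediction error across the four fits (sketch).
c(foo_1 = ranger_foo_1$prediction.error,
  foo_2 = ranger_foo_2$prediction.error,
  foo_3 = ranger_foo_3$prediction.error,
  foo_4 = ranger_foo_4$prediction.error)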

Multithreading on Windows

The main page says:

Note that, for now, R-devel and the new RTools toolchain is required for multithreading on Windows platforms (or install a binary version).

I installed the current R-devel version (3.4.0) and the current RTools version (34), so the num.threads parameter of ranger should now work out of the box. But the function still runs on only one processor. Is anything else required? It would help if you could expand the instructions for getting multithreading to work on Windows, since many of your users are likely interested in this functionality.
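For reference, this is how the thread count is passed; num.threads is an existing ranger argument, and the timing comparison below is just a sketch for checking whether additional threads are actually used.

library(ranger)

# Compare wall-clock time with one thread vs. several threads (sketch).
system.time(ranger(Species ~ ., data = iris, num.trees = 5000, num.threads = 1))
system.time(ranger(Species ~ ., data = iris, num.trees = 5000, num.threads = 4))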

vote counts / probabilities for classification forest

As an alternative to returning only the majority vote when predicting from a classification forest, could the number of votes for each class be returned, optionally normalized by the number of trees?

I think this would be equivalent to type = "prob" and type = "vote" in randomForest::predict.randomForest.
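In newer ranger versions this appears to be covered by probability forests (an assumption about releases after this issue): with probability = TRUE the forest's predictions are per-class probabilities, which is essentially the normalized vote information asked for here. A minimal sketch:

library(ranger)

# Probability forest: predictions are class probabilities instead of a single majority vote.
fit <- ranger(Species ~ ., data = iris, probability = TRUE, num.trees = 500, write.forest = TRUE)
head(predict(fit, iris)$predictions)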

Handling samples with undefined feature values?

First, thanks for ranger. Keep up the good work!

My issue: I have trouble using ranger with sparse data, i.e. when samples lack certain continuous variables/features entirely. At the moment I set them to 0.0, but of course that produces wrong results. Checking the code, this is not really possible at the moment, right? At first I was confused by the sparse-data feature in the Data class, but that is a feature for the GenABEL library. I mean sparse in the sense of sparse feature matrices, as in e.g. scipy.

Is this feature planned? If not, could you maybe sketch the solution, so I could help out with a patch?

Unexpected "Missing values in data" issue

I train the following random survival forest:

rf <- ranger(Surv(start, end, Y, type = 'interval') ~ ., data = train_frame[1:10000, ], write.forest = TRUE)

Works great. When I try

test_frame <-  train_frame[1:10000, ]
test_frame$Y <- NULL
preds <- predict(rf, train_frame)

however, I get the error:

Error in predict.ranger.forest(forest, data, seed, num.threads, verbose) : 
  Missing values in data.

It's the exact same data set, and I've verified that there is no missingness in it: 983 features, all coded numerically, with missing values indicated by -9999.

What's going on?
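A quick diagnostic sketch for this situation (only the data-frame name comes from the snippet above; the checks themselves are generic), to see whether predict() is being handed NAs that training never saw, e.g. from type coercion:

# Count NAs per column in the data passed to predict (sketch).
na_per_col <- colSums(is.na(train_frame[1:10000, ]))
na_per_col[na_per_col > 0]

# Columns that are not numeric could be coerced and introduce NAs.
names(train_frame)[!sapply(train_frame, is.numeric)]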

min.node.size.

Hello Marvin (@mnwright),

I really appreciate your effort in writing this package; it is really fast!

I just have a quick question about min.node.size. The default min.node.size for regression is 5; however, in some RF models that we fit, I find the average terminal node size is smaller than 5 (I have 15000 observations and the average tree size is around 4000 terminal nodes, so about 3 observations per terminal node). Looking at the R code, it seems min.node.size is set to 0 if not specified. Could you please check the setting of min.node.size?

By the way, I was wondering whether there is a way to compute the tree size of a fitted model (a sketch follows below). Thank you!

Sincerely,
Mutian
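A sketch for inspecting tree size, assuming a ranger version recent enough to provide treeInfo() and a model fit with the forest stored (write.forest = TRUE):

library(ranger)

rf <- ranger(Sepal.Length ~ ., data = iris, num.trees = 100, write.forest = TRUE)

# Terminal nodes per tree and the implied average terminal node size (sketch).
n_terminal <- sapply(seq_len(rf$num.trees), function(i) sum(treeInfo(rf, tree = i)$terminal))
summary(n_terminal)
nrow(iris) / mean(n_terminal)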

oobError for single trees

Due to the nature of my data I'm generating individual trees using ranger. The reported OOB error is around 17-18%, but the confusion matrices look similar to the one below:

     predicted
true    1   2
   1   40  40
   2   32  44

Computing the OOB error manually (using the individual objects' predictions) gives an error of about 50%, very different from what ranger reports.
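For reference, the misclassification rate implied by the confusion matrix above already points to the manual figure rather than the reported one:

(40 + 32) / (40 + 40 + 32 + 44)  # ~0.46, i.e. close to the ~50% computed manually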

As an aside, is it possible to recombine the trees I'm growing back into a single ranger object?

Cheers,
Chris

Support for Survival Forests with time-varying covariates

The survival package in R supports time-varying covariates via the three-argument form Surv(start, end, status). The basic idea is to split an individual's follow-up at each time point where a covariate changes value and to mark the resulting intermediate records as censored. This ensures that no individual is counted multiple times (only individuals at risk at each time point enter the analysis), that the baseline hazard is based on the correct time scale, and that merely surviving long enough to experience the covariate change is not confused with the effect of the change itself (see https://cran.r-project.org/web/packages/survival/vignettes/timedep.pdf for a better explanation).

Ranger (and, as far as I know, every other survival forest implementation for R) only supports Surv objects with a single time variable, the death/censoring time.

Would it be possible to allow the use of the more flexible Surv(start, end, status) interface in Ranger?
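For illustration, the counting-process format the issue asks for looks like this in the survival package; coxph is shown only because it already accepts this form, and the request is for ranger to accept the same Surv(start, stop, event) response:

library(survival)

# Stanford heart transplant data, already in counting-process (start, stop, event) form:
# each subject's follow-up is split at the time of transplant (the time-varying covariate).
head(heart)

coxph(Surv(start, stop, event) ~ transplant + age, data = heart)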

Variable Species not found

Hello, when I run

./ranger --verbose --file /home/magic/software/ranger/source/src/letter_recognition.data --depvarname Species --treetype 1 --ntree 1000 --nthreads 4

on Ubuntu, it shows an error:

Starting Ranger.
Loading input file: /home/magic/software/ranger/source/src/letter_recognition.data.
Error: Variable Species not found. Ranger will EXIT now.

I want to know how to deal with this error and what "Species" refers to.
I would very much appreciate your reply.

Install ranger on server

Hi,

I want to install ranger on a server, but I get the following error:

I/home/hpc/ua341/di49ruw/R/lib64/R/include -DNDEBUG -DR_BUILD -I/usr/local/include -I"/home/hpc/ua341/di49ruw/R/lib64/R/library/Rcpp/include"      -c AAA_check_cpp11.cpp -o AAA_check_cpp11.o
/bin/sh: I/home/hpc/ua341/di49ruw/R/lib64/R/include: Datei oder Verzeichnis nicht gefunden (file or directory not found)

The same error appears many more times, complaining that this folder does not exist.
Can I solve this problem somehow?

Set write.forest = TRUE by default?

Should we set write.forest = TRUE by default?

The default is FALSE because the forest takes a lot of memory for very large datasets or huge forests, and in some cases, e.g. when you are only interested in variable importance, you don't need it.
On the other hand, for prediction it's annoying to always have to set the option.

What do you think?

Warning on class comparison

Hello Marvin,
I get the following warning message with the CRAN version 0.2.7:

Warning message:
In if (class(data) == "gwaa.data") { :
  the condition has length > 1 and only the first element will be used

You can reproduce it with the following MWE:

library(ranger)
library(data.table)
data("iris")
ranger(Species ~ ., data = as.data.table(iris))

Thanks for your work!
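For what it's worth, the usual way to make such a class check robust against objects with more than one class (like data.table) is inherits(); the snippet below is a sketch of the pattern, not a quote of ranger's code:

library(data.table)

dt <- as.data.table(iris)

# class() can return a vector of length > 1, which is what triggers the warning in if ().
class(dt)                   # "data.table" "data.frame"

# inherits() always returns a single TRUE/FALSE and is safe inside if ().
inherits(dt, "gwaa.data")   # FALSE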

Enhancement: num.trees for predict

If a num.trees parameter were implemented for predict, limiting the number of trees used for prediction, it would be very easy to choose the right number of trees for a model: one could grow a very large forest and then scale back the number of trees based on what the predictions show.
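Until such a parameter exists, this can be emulated for regression with predict.all = TRUE (an existing argument of predict.ranger) by averaging over the first k trees; a sketch:

library(ranger)

fit <- ranger(Sepal.Length ~ ., data = iris, num.trees = 1000, write.forest = TRUE)

# Matrix of per-tree predictions: one row per observation, one column per tree.
per_tree <- predict(fit, iris, predict.all = TRUE)$predictions

# Prediction as if only the first k trees had been grown (sketch).
k <- 100
pred_k <- rowMeans(per_tree[, seq_len(k)])
head(pred_k)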

Enhancement: faster getTerminalNodeIDs

Hi.
Thanks a lot for this brilliant package.
I would like to derive proximity matrices from forests built with ranger. As far as I know there is currently no direct built-in functionality to do so.
The getTerminalNodeIDs function can be used instead. Unfortunately, this function (due to its plain R implementation) is quite slow compared with growing or predicting the forest. I was wondering whether you could speed it up, or whether the training or prediction functions could also return the terminal node ID matrix, or even better a proximity matrix (see the sketch below)?

Thanks
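In newer ranger versions (an assumption about releases after this issue), terminal node IDs can be obtained much faster from predict() with type = "terminalNodes"; a sketch that also builds a proximity matrix from them:

library(ranger)

fit <- ranger(Species ~ ., data = iris, num.trees = 200, write.forest = TRUE)

# Terminal node IDs: one row per observation, one column per tree.
tn <- predict(fit, iris, type = "terminalNodes")$predictions

# Proximity of i and j = fraction of trees in which they fall into the same leaf.
n <- nrow(tn)
prox <- matrix(0, n, n)
for (t in seq_len(ncol(tn))) {
  prox <- prox + outer(tn[, t], tn[, t], "==")
}
prox <- prox / ncol(tn)
prox[1:5, 1:5]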

"case.weights" take very long

The recently added option of using case weights when drawing the bootstrap sample is very important in practice. However, I noticed a large increase in runtime when using it: in the example below, fitting with case weights takes about ten times as long as without. Is this expected?

library(ranger)

n <- 10000

set.seed(4)
y <- rnorm(n)
x <- rnorm(n)
w <- runif(n)
dat <- data.frame(y, x)  # ranger needs an explicit data argument

# No case weights: user 9.96 s, system 0.04 s on an 8 GB RAM Windows laptop
system.time(fit.1 <- ranger(y ~ x, data = dat))

# Uniform case weights: user 114.69 s, system 0.12 s
system.time(fit.2 <- ranger(y ~ x, data = dat, case.weights = w))

# Equal case weights: user 112.36 s, system 0.11 s
system.time(fit.3 <- ranger(y ~ x, data = dat, case.weights = rep(1, times = n)))

Install ranger on server

Hi,

I want to install ranger on a server, but I get the following error:

I/home/hpc/ua341/di49ruw/R/lib64/R/include -DNDEBUG -DR_BUILD -I/usr/local/include -I"/home/hpc/ua341/di49ruw/R/lib64/R/library/Rcpp/include"      -c AAA_check_cpp11.cpp -o AAA_check_cpp11.o
/bin/sh: I/home/hpc/ua341/di49ruw/R/lib64/R/include: Datei oder Verzeichnis nicht gefunden (file or directory not found)

The same error appears many more times, saying that the directory does not exist. In fact, only R/include exists, not R/lib64.
Can I solve this error somehow?

Predictions dependent on interface of model fit

IMHO the outcome should be the same whether I use dependent.variable.name or the formula interface.

library(ranger)

set.seed(1)
ind = 1:150 %in% sample(150, 100)

set.seed(2)
mod1 = ranger(Species ~ ., data = iris[ind, ], write.forest = TRUE)
pred1 = predict(mod1, data = iris[!ind, ])

set.seed(2)
mod2 = ranger(Species ~ ., data = iris[ind, ], write.forest = TRUE)
pred2 = predict(mod2, data = iris[!ind, ])

set.seed(2)
mod3 = ranger(dependent.variable.name = "Species", data = iris[ind, ], write.forest = TRUE)
pred3 = predict(mod3, data = iris[!ind, ])

all.equal(pred1$predictions, pred2$predictions)
all.equal(pred1$predictions, pred3$predictions)

Bootstrapping with class weights

Currently, each observation is equally likely to be picked when bootstrapping for a tree. Please add an option to bootstrap based on the number of observations in each class, weighting accordingly,

or an option where each observation gets its own weight.
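As a workaround with the existing interface (assuming a ranger version that provides case.weights), per-observation sampling weights can already be passed directly; a sketch that weights classes inversely to their frequency:

library(ranger)

y <- iris$Species
class_freq <- table(y)
w <- 1 / as.numeric(class_freq[as.character(y)])   # inverse class-frequency weight per observation

fit <- ranger(Species ~ ., data = iris, case.weights = w, num.trees = 500)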

Predict with fewer trees

Is it possible to use only a certain number of trees (or even to specify which trees) in the predict method?

predict with missing values

Why can't ranger make predictions when there are missing values in the data? In predict.R there are the following lines (160-162):

if (any(is.na(data.final))) {
  stop("Missing values in data.")
}

Ranger successfully trains a random forest on data that contains missing values, so why can't it make predictions as well?

[R-package] regression in 1D case does only work with formula interface

The following case errors:

library(ranger)
data = data.frame(x = 1:10, y = 1:10)
newdata = data.frame(x = runif(10,0,10))
m1 = ranger(formula = y~x, data = data, mtry = 1, write.forest = TRUE)
m2 = ranger(formula = NULL, dependent.variable.name = "y", data = data, mtry = 1, write.forest = TRUE)
p1 = predict(m1, data = newdata) #works
p2 = predict(m2, data = newdata) #doesn't
# Error: mtry can not be larger than number of variables in data. Ranger will EXIT now.
# Error in predict.ranger.forest(forest, data, seed, num.threads, verbose) : 
#   User interrupt or internal error.

non-formula interface

Any possibility of adding the ability to call ranger models like this?

ranger(x = xData, y = yData)

Most other ML packages provide this as an option (e.g. randomForest, Rborist, glmnet, xgboost, lm.fit, glm.fit, etc.). These interfaces are usually faster because the formula does not need to be parsed and the data does not need to be transformed into that form later anyway. Without looking through rangerCPP() I can't tell whether this would be the case here.
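For reference, newer ranger versions appear to accept a matrix/data-frame interface via x and y arguments (an assumption about releases after this issue); a minimal sketch:

library(ranger)

x_data <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]
y_data <- iris$Species

fit <- ranger(x = x_data, y = y_data, num.trees = 500)
fit$prediction.error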

protection stack overflow

ranger (R version) gives "Error: protect(): protection stack overflow" with a 141 x 17222 data frame.
I used mtry = 131 and 1000 trees; save.memory = TRUE does not help.

If needed, I can provide the data.

ranger won't compile with Intel 15 compilers

Compilation with the Intel 15.0.5 compiler (icpc -std=c++11) ends with the error:

AAA_check_cpp11.cpp(3): error: #error directive: Error: ranger requires a real C++11 compiler. You probably have to update gcc.
#error Error: ranger requires a real C++11 compiler. You probably have to update gcc.
^

Is this a real problem, or are all Intel compilers now effectively banned?

No error by importance() if importance = "none"

The importance function should throw an error if the ranger model was fit with importance = "none" (the default), as the check in importance.R intends:

if (is.null(x$variable.importance) | length(x$variable.importance) < 1) {
    stop("No variable importance found. Please use 'importance' option when growing the forest.")
}

Currently it returns a zero for every independent variable:

library(ranger)
mod <- ranger(Species ~ ., data = iris)
importance(mod)
# [1] 0 0 0 0

It happens in 0.4.0, 0.5.0 and 0.5.4. I'll try to look into it, but don't let that stop you from fixing this...

Importance calculation takes too much memory and time

Computing importance (of any kind) in ranger first builds a numTrees x numFeatures matrix (from the variable_importance vector of each tree object) and then averages it over trees. This has a substantial impact on memory use and speed for large numTrees; ranger should instead accumulate these values in place, using at most numFeatures x numThreads memory.

Can't install; multiple compilation error messages

install.packages("ranger")
Installing package into ‘/home/andy/R/i686-pc-linux-gnu-library/3.2’
(as ‘lib’ is unspecified)
trying URL 'http://cran.rstudio.com/src/contrib/ranger_0.2.7.tar.gz'

Content type 'application/x-gzip' length 50771 bytes (49 KB)

downloaded 49 KB

  • installing source package ‘ranger’ ...
    ** package ‘ranger’ successfully unpacked and MD5 sums checked
    ** libs
    g++ -std=c++0x -I/usr/share/R/include -DNDEBUG -I"/home/andy/R/i686-pc-linux-gnu-library/3.2/Rcpp/include" -fpic -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security -Werror=format-security -D_FORTIFY_SOURCE=2 -g -c Data.cpp -o Data.o
    g++ -std=c++0x -I/usr/share/R/include -DNDEBUG -I"/home/andy/R/i686-pc-linux-gnu-library/3.2/Rcpp/include" -fpic -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security -Werror=format-security -D_FORTIFY_SOURCE=2 -g -c DataChar.cpp -o DataChar.o
    g++ -std=c++0x -I/usr/share/R/include -DNDEBUG -I"/home/andy/R/i686-pc-linux-gnu-library/3.2/Rcpp/include" -fpic -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security -Werror=format-security -D_FORTIFY_SOURCE=2 -g -c DataDouble.cpp -o DataDouble.o
    g++ -std=c++0x -I/usr/share/R/include -DNDEBUG -I"/home/andy/R/i686-pc-linux-gnu-library/3.2/Rcpp/include" -fpic -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security -Werror=format-security -D_FORTIFY_SOURCE=2 -g -c DataFloat.cpp -o DataFloat.o
    g++ -std=c++0x -I/usr/share/R/include -DNDEBUG -I"/home/andy/R/i686-pc-linux-gnu-library/3.2/Rcpp/include" -fpic -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security -Werror=format-security -D_FORTIFY_SOURCE=2 -g -c Forest.cpp -o Forest.o
    Forest.cpp: In member function ‘void Forest::showProgress(std::string)’:
    Forest.cpp:657:22: error: ‘std::chrono::steady_clock’ has not been declared
    Forest.cpp:661:3: error: ‘steady_clock’ has not been declared
    Forest.cpp:661:28: error: expected ‘;’ before ‘start_time’
    Forest.cpp:662:3: error: ‘steady_clock’ has not been declared
    Forest.cpp:662:28: error: expected ‘;’ before ‘last_time’
    Forest.cpp:668:51: error: ‘steady_clock’ has not been declared
    Forest.cpp:668:73: error: ‘last_time’ was not declared in this scope
    Forest.cpp:672:56: error: ‘steady_clock’ has not been declared
    Forest.cpp:672:78: error: ‘start_time’ was not declared in this scope
    Forest.cpp:676:19: error: ‘steady_clock’ has not been declared
    make: *** [Forest.o] Error 1
    ERROR: compilation failed for package ‘ranger’
  • removing ‘/home/andy/R/i686-pc-linux-gnu-library/3.2/ranger’
    Warning in install.packages :
    installation of package ‘ranger’ had non-zero exit status

The downloaded source packages are in
‘/tmp/RtmpBv2B7n/downloaded_packages’

devtools::session_info()
Session info ---------------------------------------------------------------------------
setting value
version R version 3.2.2 (2015-08-14)
system i686, linux-gnu
ui RStudio (0.99.446)
language (EN)
collate en_US.UTF-8
tz
date 2015-09-23

Packages -------------------------------------------------------------------------------
package * version date source
Boruta * 4.0.0 2014-12-07 CRAN (R 3.2.2)
curl 0.9.3 2015-08-25 CRAN (R 3.2.1)
devtools * 1.9.1 2015-09-11 CRAN (R 3.2.2)
digest 0.6.8 2014-12-31 CRAN (R 3.2.0)
httr 1.0.0 2015-06-25 CRAN (R 3.2.1)
magrittr 1.5 2014-11-22 CRAN (R 3.2.0)
memoise 0.2.1 2014-04-22 CRAN (R 3.2.0)
R6 2.1.1 2015-08-19 CRAN (R 3.2.1)
randomForest * 4.6-10 2014-07-17 CRAN (R 3.2.0)
rFerns * 1.1.0 2014-11-30 CRAN (R 3.2.0)
stringi 0.5-5 2015-06-29 CRAN (R 3.2.1)
stringr 1.0.0 2015-04-30 CRAN (R 3.2.1)

sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: i686-pc-linux-gnu (32-bit)
Running under: Ubuntu precise (12.04.5 LTS)

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_GB.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] devtools_1.9.1 Boruta_4.0.0 rFerns_1.1.0 randomForest_4.6-10

loaded via a namespace (and not attached):
[1] httr_1.0.0 R6_2.1.1 magrittr_1.5 tools_3.2.2
[5] rstudioapi_0.3.1 curl_0.9.3 memoise_0.2.1 stringi_0.5-5
[9] stringr_1.0.0 digest_0.6.8

What does ranger do with new factor levels in prediction?

Hello rangers

I recently stumbled over the error message

Error in predict.ranger.forest(forest, data, predict.all, seed, num.threads, :
Missing values in data.

It was the "classic" problem of a new factor level appearing in a categorical predictor at prediction time, which seems to happen only with respect.unordered.factors = "order" (or TRUE). For "partition" and "ignore" (FALSE), there is no such message.

I think the behaviour in the "order" and "ignore" (FALSE) cases is clear, although the error message for "order" could be more specific, e.g. "new or unknown factor levels in predictor". But what does ranger do in the remaining case, respect.unordered.factors = "partition" (no error)?

Below is a small example for testing:

# All possible two-partitions
fit <- ranger(Sepal.Width ~ Species, data = iris, write.forest = TRUE, respect.unordered.factors = "partition")
predict(fit, data.frame(Species = ""))$predictions

# Ordered by proportion of second class (respect.unordered.factors = TRUE)
fit <- ranger(Sepal.Width ~ Species, data = iris, write.forest = TRUE, respect.unordered.factors = "order")
predict(fit, data.frame(Species = ""))$predictions

# Factors are considered ordered (respect.unordered.factors = FALSE)
fit <- ranger(Sepal.Width ~ Species, data = iris, write.forest = TRUE, respect.unordered.factors = "ignore")
predict(fit, data.frame(Species = ""))$predictions

Add option to return in-bag count

Ranger should return how many times each sample was included in the bootstrap sample of a given tree. This is required to support ranger in sorhawell/forestFloor.
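A sketch of how this looks with the keep.inbag argument (assuming a ranger version that provides it and exposes inbag.counts on the fitted object):

library(ranger)

fit <- ranger(Species ~ ., data = iris, num.trees = 10, keep.inbag = TRUE)

# One count vector per tree: how often each of the 150 observations was drawn.
inbag <- simplify2array(fit$inbag.counts)   # 150 x 10 matrix
inbag[1:5, 1:5]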

Question on how to pass "split.select.weights"

Hi rangers

I am unsure how the probabilities in split.select.weights are associated with the regressors. Is it based on the order in which they appear in the formula? Or is split.select.weights simply a named numeric vector that can be given in any order? (A sketch of one interpretation follows below.)

Thanks for clarification.
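A sketch under the assumption (not confirmed here) that the weights are matched positionally to the independent variables, in the order they appear in the formula:

library(ranger)

# Assumed positional matching: one weight per predictor, in formula order.
fit <- ranger(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
              data = iris,
              split.select.weights = c(0.1, 0.8, 0.1),
              num.trees = 200)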

Install error in utility.h

Hi, I get the install error below in utility.h. Any suggestions?

Environment - Red Hat Enterprise Linux Server release 6.5 (Santiago)
R --version
R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: x86_64-unknown-linux-gnu (64-bit)

sudo R CMD INSTALL ranger_0.3.0.tar.gz

  • installing to library ‘/usr/lib64/RRO-8.0.2/R-3.1.2/lib64/R/library’
  • installing source package ‘ranger’ ...
    ** package ‘ranger’ successfully unpacked and MD5 sums checked
    ** libs
    g++ -std=c++0x -I/usr/lib64/RRO-8.0.2/R-3.1.2/lib64/R/include -DNDEBUG -DR_BUILD -I/usr/local/include -I"/usr/lib64/RRO-8.0.2/R-3.1.2/lib64/R/library/Rcpp/include" -fpic -g -O2 -c Data.cpp -o Data.o
    In file included from Data.cpp:36:
    utility.h: In function ‘void saveVector2D(std::vector<std::vector<T, std::allocator<_CharT> >, std::allocator<std::vector<T, std::allocator<_CharT> > > >&, std::ofstream&)’:
    utility.h:111: error: expected initializer before ‘:’ token
    Data.cpp:216: error: expected primary-expression at end of input
    Data.cpp:216: error: expected ‘;’ at end of input
    Data.cpp:216: error: expected primary-expression at end of input
    Data.cpp:216: error: expected ‘)’ at end of input
    Data.cpp:216: error: expected statement at end of input
    Data.cpp:216: error: expected ‘}’ at end of input
    make: *** [Data.o] Error 1
    ERROR: compilation failed for package ‘ranger’
  • removing ‘/usr/lib64/RRO-8.0.2/R-3.1.2/lib64/R/library/ranger’

Thanks,
Manish

Predictions seem wrong when compared to randomForest

data(iris)
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
iris_spec <- as.factor(iris$Species)
iris_dat <- as.matrix(iris[, !(names(iris) %in% "Species")])
set.seed(1234)

test_index <- sample(nrow(iris), 10)
train_index <- seq(1, nrow(iris))[-test_index]
iris_train <- randomForest(x = iris_dat[train_index, ], y = iris_spec[train_index], keep.forest = TRUE)
iris_pred <- predict(iris_train, iris_dat[test_index, ])

iris_train$confusion
##            setosa versicolor virginica class.error
## setosa         47          0         0  0.00000000
## versicolor      0         42         3  0.06666667
## virginica       0          4        44  0.08333333
table(iris_pred, iris_spec[test_index])
##             
## iris_pred    setosa versicolor virginica
##   setosa          3          0         0
##   versicolor      0          5         0
##   virginica       0          0         2
library(ranger)
## 
## Attaching package: 'ranger'
## 
## The following object is masked from 'package:randomForest':
## 
##     importance
iris_train2 <- ranger(data = iris[train_index, ], dependent.variable.name = "Species", write.forest = TRUE)
iris_pred2 <- predict(iris_train2, dat = iris[test_index, ])

iris_train2$classification.table
##             true
## predicted    setosa versicolor virginica
##   setosa         47          0         0
##   versicolor      0         41         3
##   virginica       0          4        45
table(iris_pred2$predictions, iris_spec[test_index])
##             
##              setosa versicolor virginica
##   setosa          0          0         0
##   versicolor      3          0         0
##   virginica       0          5         2

Install Error

Hi,

I get the error below during install. Can you please help?

[xx@xyyyyR]$ sudo R CMD INSTALL ranger_0.3.0.tar.gz 
* installing to library ‘/usr/lib64/RRO-8.0.2/R-3.1.2/lib64/R/library’
* installing *source* package ‘ranger’ ...
** package ‘ranger’ successfully unpacked and MD5 sums checked
** libs
g++ -std=c++0x -I/usr/lib64/RRO-8.0.2/R-3.1.2/lib64/R/include -DNDEBUG -DR_BUILD -I/usr/local/include -I"/usr/lib64/RRO-8.0.2/R-3.1.2/lib64/R/library/Rcpp/include"   -fpic  -g -O2 -c Data.cpp -o Data.o
In file included from Data.cpp:36:
utility.h: In function ‘void saveVector2D(std::vector<std::vector<T, std::allocator<_CharT> >, std::allocator<std::vector<T, std::allocator<_CharT> > > >&, std::ofstream&)’:
utility.h:111: error: expected initializer before ‘:’ token
Data.cpp:216: error: expected primary-expression at end of input
Data.cpp:216: error: expected ‘;’ at end of input
Data.cpp:216: error: expected primary-expression at end of input
Data.cpp:216: error: expected ‘)’ at end of input
Data.cpp:216: error: expected statement at end of input
Data.cpp:216: error: expected ‘}’ at end of input
make: *** [Data.o] Error 1
ERROR: compilation failed for package ‘ranger’
* removing ‘/usr/lib64/RRO-8.0.2/R-3.1.2/lib64/R/library/ranger’

Thanks, Manish

Slow for classification on many classes

Hi, I'm trying to figure out why ranger is taking longer than the equivalent command in randomForest in R. Are there options that are slowing it down? Thanks for any help.

dim(data_simple[training,])
[1] 4104   95

length(unique(as.factor(k.row)))  
[1] 704

num.trees <- 100000

#completes in < 12 hours on 1 thread
rf.out <- randomForest(x=data_simple[training,], y=as.factor(k.row), importance=TRUE, proximity=TRUE, ntree=num.trees, keep.forest=T, do.trace=100)

#predicted completion in 39 hours on 2 threads
ranger.out <- ranger(data=data.frame("classes"=as.factor(k.row),data_simple[training,]),importance="impurity", num.trees=num.trees, num.threads=2, write.forest=T, verbose=T, dependent.variable.name="classes", classification=T)
