ranger's People

Contributors

0x7f, ben519, bfgray3, bgreenwell, brunaw, dependabot[bot], edoffagne, gregordecillia, ironholds, jemus42, jtibshirani, katrinleinweber, kirillseva, krlmlr, kysolvik, lnicola, lorentzenchr, michaelchirico, mnwright, olivroy, rcannood, rnowling, romanhornung, rvalavi, spineki, stanlazic, stephematician, svenvw, talegari


ranger's Issues

Allow for random splits in regression

Unfortunately, I cannot find a source for how this is done in practice or how it benchmarks, but two ways of implementing it would be:

Suppose that, for one tree, the optimal split for a feature X lies between the values a and b.

  1. Set the split point to a fixed but randomly drawn value between a and b
  2. Save both values a and b and, at each prediction, draw the split point randomly between a and b

For 1. and 2. the regression will be smoother than with the fixed split point (a+b)/2. This is the main benefit!
For a single prediction, 1. and 2. make no difference.
Multiple predictions from the same random forest model will differ under 2. but will always be identical under 1.
Version 1. is probably computationally faster and easier to implement.

I opt for 1.
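A minimal sketch of option 1 (a hypothetical illustration, not ranger code): after the split search has produced the two boundary values a and b, draw the split point uniformly between them once at training time and reuse it for every prediction.

# Sketch of option 1: split point drawn uniformly between the boundary values a and b.
# random_split_point() is a hypothetical helper for illustration only.
random_split_point <- function(a, b) {
  runif(1, min = a, max = b)
}

set.seed(1)
a <- 2.3
b <- 2.9
split_point <- random_split_point(a, b)  # fixed once, then reused for all predictions
split_point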

How to correctly work with unordered factors in ranger

I am trying to understand how I should enter unordered factors in ranger:

# packages used in this example
library(ranger)
library(dplyr)  # provides data_frame()

# create some sample data
df_foo = data_frame(
  factor1 = factor(sample(LETTERS[1:10], size = 1000, replace = TRUE)),
  numeric1 = rnorm(1000),
  target = as.factor(sample(c(0, 1), size = 1000, replace = TRUE))
)

# respect.unordered.factors = TRUE 
# factor
ranger_foo_1 = ranger(
  formula = target ~ factor1 + numeric1,
  data = df_foo,
  num.trees = 100, 
  mtry = 2, 
  min.node.size = 10,
  respect.unordered.factors = TRUE,
  seed = 1234
)

# respect.unordered.factors = TRUE 
# character
ranger_foo_2 = ranger(
  formula = target ~ as.character(factor1) + numeric1,
  data = df_foo,
  num.trees = 100, 
  mtry = 2, 
  min.node.size = 10,
  respect.unordered.factors = TRUE,
  seed = 1234
)

# respect.unordered.factors = FALSE 
# character
ranger_foo_3 = ranger(
  formula = target ~ as.character(factor1) + numeric1,
  data = df_foo,
  num.trees = 100, 
  mtry = 2, 
  min.node.size = 10,
  respect.unordered.factors = FALSE,
  seed = 1234
)

# respect.unordered.factors = FALSE
# factor
ranger_foo_4 = ranger(
  formula = target ~ factor1 + numeric1,
  data = df_foo,
  num.trees = 100, 
  mtry = 2, 
  min.node.size = 10,
  respect.unordered.factors = FALSE,
  seed = 1234
)

# check the differences among the results
ranger_foo_1
ranger_foo_2
ranger_foo_3
ranger_foo_4

The result of ranger_foo_1 differs from the rest. Assuming that, from a theoretical point of view, this is the right answer (essentially equivalent to one-hot encoding the variable), should I always explicitly convert character variables to factors before passing them to ranger? And what does the call with respect.unordered.factors = TRUE and as.character(factor1) (ranger_foo_2) actually do, computationally? A quick comparison of the four fits is sketched below.

Thanks.
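One quick way to see how much the four fits actually differ (a sketch, assuming the fitted objects expose the prediction.error field, i.e. the OOB error, as in current ranger versions):

# Compare OOB prediction error across the four fits (sketch).
c(foo_1 = ranger_foo_1$prediction.error,
  foo_2 = ranger_foo_2$prediction.error,
  foo_3 = ranger_foo_3$prediction.error,
  foo_4 = ranger_foo_4$prediction.error)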

Multithreading on Windows

The main page says:

Note that, for now, R-devel and the new RTools toolchain is required for multithreading on Windows platforms (or install a binary version).

I installed the current R-devel version (3.4.0) and the current RTools version (34), so the num.threads parameter of ranger should now work out of the box. But the function still runs on only one processor. Is anything else required? It would help if you could expand the instructions for getting multithreading to work on Windows, since many of your users are likely interested in this functionality.
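For reference, this is how the thread count is passed; num.threads is an existing ranger argument, and the timing comparison below is just a sketch for checking whether additional threads are actually used.

library(ranger)

# Compare wall-clock time with one thread vs. several threads (sketch).
system.time(ranger(Species ~ ., data = iris, num.trees = 5000, num.threads = 1))
system.time(ranger(Species ~ ., data = iris, num.trees = 5000, num.threads = 4))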

vote counts / probabilities for classification forest

As an alternative to returning only the majority vote when predicting from a classification forest, could the number of votes for each class be returned, optionally normalized by the number of trees?

I think this would be equivalent to type = "prob" and type = "vote" in randomForest::predict.randomForest.
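In newer ranger versions this appears to be covered by probability forests (an assumption about releases after this issue): with probability = TRUE the forest's predictions are per-class probabilities, which is essentially the normalized vote information asked for here. A minimal sketch:

library(ranger)

# Probability forest: predictions are class probabilities instead of a single majority vote.
fit <- ranger(Species ~ ., data = iris, probability = TRUE, num.trees = 500, write.forest = TRUE)
head(predict(fit, iris)$predictions)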

Handling samples with undefined feature values?

First, thanks for ranger. Keep up the good work!

My issue: I have trouble using ranger with sparse data, i.e. when samples lack certain continuous variables/features entirely. At the moment I set them to 0.0, but of course that produces wrong results. Checking the code, this is not really possible at the moment, right? At first I was confused by the sparse-data feature in the Data class, but that is a feature for the GenABEL library. I mean sparse in the sense of sparse feature matrices, as in e.g. scipy.

Is this feature planned? If not, could you maybe sketch the solution, so I could help out with a patch?

Unexpected "Missing values in data" issue

I train the following random survival forest:

rf <- ranger(Surv(start, end, Y, type = 'interval') ~ ., data = train_frame[1:10000, ], write.forest = TRUE)

Works great. When I try

test_frame <-  train_frame[1:10000, ]
test_frame$Y <- NULL
preds <- predict(rf, train_frame)

however, I get the error:

Error in predict.ranger.forest(forest, data, seed, num.threads, verbose) : 
  Missing values in data.

It's the exact same data set, and I've verified that there is no missingness in it: 983 features, all coded numerically, with missing values indicated by -9999.

What's going on?
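A quick diagnostic sketch for this situation (only the data-frame name comes from the snippet above; the checks themselves are generic), to see whether predict() is being handed NAs that training never saw, e.g. from type coercion:

# Count NAs per column in the data passed to predict (sketch).
na_per_col <- colSums(is.na(train_frame[1:10000, ]))
na_per_col[na_per_col > 0]

# Columns that are not numeric could be coerced and introduce NAs.
names(train_frame)[!sapply(train_frame, is.numeric)]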

min.node.size.

Hello Marvin (@mnwright),

I really appreciate your effort in writing this package; it is really fast!

I just have a quick question about min.node.size. The default min.node.size for regression is 5; however, in some RF models that we fit, I find the average terminal node size is smaller than 5 (I have 15000 observations and the average tree size is around 4000 terminal nodes, so about 3 observations per terminal node). Looking at the R code, it seems min.node.size is set to 0 if not specified. Could you please check the setting of min.node.size?

By the way, I was wondering whether there is a way to compute the tree size of a fitted model (a sketch follows below). Thank you!

Sincerely,
Mutian
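A sketch for inspecting tree size, assuming a ranger version recent enough to provide treeInfo() and a model fit with the forest stored (write.forest = TRUE):

library(ranger)

rf <- ranger(Sepal.Length ~ ., data = iris, num.trees = 100, write.forest = TRUE)

# Terminal nodes per tree and the implied average terminal node size (sketch).
n_terminal <- sapply(seq_len(rf$num.trees), function(i) sum(treeInfo(rf, tree = i)$terminal))
summary(n_terminal)
nrow(iris) / mean(n_terminal)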

oobError for single trees

Due to the nature of my data I'm generating individual trees using ranger. The reported OOB error is around 17-18%, but the confusion matrices look similar to the one below:

     predicted
true    1   2
   1   40  40
   2   32  44

Computing the OOB error manually (using the individual objects' predictions) gives an error of about 50%, very different from what ranger reports.
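For reference, the misclassification rate implied by the confusion matrix above already points to the manual figure rather than the reported one:

(40 + 32) / (40 + 40 + 32 + 44)  # ~0.46, i.e. close to the ~50% computed manually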

As an aside, is it possible to recombine the trees I'm growing back into a single ranger object?

Cheers,
Chris

Support for Survival Forests with time-varying covariates

The survival package in R supports time-varying covariates via the three-argument form Surv(start, end, status). The basic idea is to split an individual's follow-up at each time point where a covariate changes value and to mark the resulting intermediate records as censored. This ensures that no individual is counted multiple times (only individuals at risk at each time point enter the analysis), that the baseline hazard is based on the correct time scale, and that merely surviving long enough to experience the covariate change is not confused with the effect of the change itself (see https://cran.r-project.org/web/packages/survival/vignettes/timedep.pdf for a better explanation).

Ranger (and, as far as I know, every other survival forest implementation for R) only supports Surv objects with a single time variable, the death/censoring time.

Would it be possible to allow the use of the more flexible Surv(start, end, status) interface in Ranger?
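For illustration, the counting-process format the issue asks for looks like this in the survival package; coxph is shown only because it already accepts this form, and the request is for ranger to accept the same Surv(start, stop, event) response:

library(survival)

# Stanford heart transplant data, already in counting-process (start, stop, event) form:
# each subject's follow-up is split at the time of transplant (the time-varying covariate).
head(heart)

coxph(Surv(start, stop, event) ~ transplant + age, data = heart)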

Variable Species not found

Hello, when I run

./ranger --verbose --file /home/magic/software/ranger/source/src/letter_recognition.data --depvarname Species --treetype 1 --ntree 1000 --nthreads 4

on Ubuntu, it shows an error:

Starting Ranger.
Loading input file: /home/magic/software/ranger/source/src/letter_recognition.data.
Error: Variable Species not found. Ranger will EXIT now.

I want to know how to deal with this error and what "Species" refers to.
I would very much appreciate your reply.

Install ranger on server

Hi,

I want to install ranger on a server, but I get the following error:

I/home/hpc/ua341/di49ruw/R/lib64/R/include -DNDEBUG -DR_BUILD -I/usr/local/include -I"/home/hpc/ua341/di49ruw/R/lib64/R/library/Rcpp/include"      -c AAA_check_cpp11.cpp -o AAA_check_cpp11.o
/bin/sh: I/home/hpc/ua341/di49ruw/R/lib64/R/include: Datei oder Verzeichnis nicht gefunden (file or directory not found)

The same error appears many more times, complaining that this folder does not exist.
Can I solve this problem somehow?

Set write.forest = TRUE by default?

Should we set write.forest = TRUE by default?

The default is FALSE because the forest takes a lot of memory for very large datasets or huge forests, and in some cases, e.g. when you are only interested in variable importance, you don't need it.
On the other hand, for prediction it's annoying to always have to set the option.

What do you think?

Warning on class comparison

Hello Marvin,
I get the following warning message with the CRAN version 0.2.7:

Warning message:
In if (class(data) == "gwaa.data") { :
  the condition has length > 1 and only the first element will be used

You can reproduce it with the following MWE:

library(ranger)
library(data.table)
data("iris")
ranger(Species ~ ., data = as.data.table(iris))

Thanks for your work!
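For what it's worth, the usual way to make such a class check robust against objects with more than one class (like data.table) is inherits(); the snippet below is a sketch of the pattern, not a quote of ranger's code:

library(data.table)

dt <- as.data.table(iris)

# class() can return a vector of length > 1, which is what triggers the warning in if ().
class(dt)                   # "data.table" "data.frame"

# inherits() always returns a single TRUE/FALSE and is safe inside if ().
inherits(dt, "gwaa.data")   # FALSE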

Enhancement: num.trees for predict

If a num.trees parameter were implemented for predict, limiting the number of trees used for prediction, it would be very easy to choose the right number of trees for a model: one could grow a very large forest and then scale back the number of trees based on what the predictions show.
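Until such a parameter exists, this can be emulated for regression with predict.all = TRUE (an existing argument of predict.ranger) by averaging over the first k trees; a sketch:

library(ranger)

fit <- ranger(Sepal.Length ~ ., data = iris, num.trees = 1000, write.forest = TRUE)

# Matrix of per-tree predictions: one row per observation, one column per tree.
per_tree <- predict(fit, iris, predict.all = TRUE)$predictions

# Prediction as if only the first k trees had been grown (sketch).
k <- 100
pred_k <- rowMeans(per_tree[, seq_len(k)])
head(pred_k)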

Enhancement: faster getTerminalNodeIDs

Hi.
Thanks a lot for this brilliant package.
I would like to derive proximity matrices from forests built with ranger. As far as I know there is currently no direct built-in functionality to do so.
The getTerminalNodeIDs function can be used instead. Unfortunately, this function (due to its plain R implementation) is quite slow compared with growing or predicting the forest. I was wondering whether you could speed it up, or whether the training or prediction functions could also return the terminal node ID matrix, or even better a proximity matrix (see the sketch below)?

Thanks
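In newer ranger versions (an assumption about releases after this issue), terminal node IDs can be obtained much faster from predict() with type = "terminalNodes"; a sketch that also builds a proximity matrix from them:

library(ranger)

fit <- ranger(Species ~ ., data = iris, num.trees = 200, write.forest = TRUE)

# Terminal node IDs: one row per observation, one column per tree.
tn <- predict(fit, iris, type = "terminalNodes")$predictions

# Proximity of i and j = fraction of trees in which they fall into the same leaf.
n <- nrow(tn)
prox <- matrix(0, n, n)
for (t in seq_len(ncol(tn))) {
  prox <- prox + outer(tn[, t], tn[, t], "==")
}
prox <- prox / ncol(tn)
prox[1:5, 1:5]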

"case.weights" take very long

The recently added option of using case weights when drawing the bootstrap sample is very important in practice. However, I noticed a large increase in runtime when using it: in the example below, fitting with case weights takes about ten times as long as without. Is this expected?

library(ranger)

n <- 10000

set.seed(4)
y <- rnorm(n)
x <- rnorm(n)
w <- runif(n)
dat <- data.frame(y, x)  # ranger needs an explicit data argument

# No case weights: user 9.96 s, system 0.04 s on an 8 GB RAM Windows laptop
system.time(fit.1 <- ranger(y ~ x, data = dat))

# Uniform case weights: user 114.69 s, system 0.12 s
system.time(fit.2 <- ranger(y ~ x, data = dat, case.weights = w))

# Equal case weights: user 112.36 s, system 0.11 s
system.time(fit.3 <- ranger(y ~ x, data = dat, case.weights = rep(1, times = n)))

Install ranger on server

Hi,

I want to install ranger on a server, but I get the following error:

I/home/hpc/ua341/di49ruw/R/lib64/R/include -DNDEBUG -DR_BUILD -I/usr/local/include -I"/home/hpc/ua341/di49ruw/R/lib64/R/library/Rcpp/include"      -c AAA_check_cpp11.cpp -o AAA_check_cpp11.o
/bin/sh: I/home/hpc/ua341/di49ruw/R/lib64/R/include: Datei oder Verzeichnis nicht gefunden (file or directory not found)

The same error appears many more times, saying that the directory does not exist. In fact, only R/include exists, not R/lib64.
Can I solve this error somehow?

Predictions dependent on interface of model fit

IMHO the outcome should be the same whether I use dependent.variable.name or the formula interface.

library(ranger)

set.seed(1)
ind = 1:150 %in% sample(150, 100)

set.seed(2)
mod1 = ranger(Species ~ ., data = iris[ind, ], write.forest = TRUE)
pred1 = predict(mod1, data = iris[!ind, ])

set.seed(2)
mod2 = ranger(Species ~ ., data = iris[ind, ], write.forest = TRUE)
pred2 = predict(mod2, data = iris[!ind, ])

set.seed(2)
mod3 = ranger(dependent.variable.name = "Species", data = iris[ind, ], write.forest = TRUE)
pred3 = predict(mod3, data = iris[!ind, ])

all.equal(pred1$predictions, pred2$predictions)
all.equal(pred1$predictions, pred3$predictions)

Bootstrapping with class weights

Currently, each observation is equally likely to be picked when bootstrapping for a tree. Please add an option to bootstrap based on the number of observations in each class, weighting accordingly,

or an option where each observation gets its own weight.
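As a workaround with the existing interface (assuming a ranger version that provides case.weights), per-observation sampling weights can already be passed directly; a sketch that weights classes inversely to their frequency:

library(ranger)

y <- iris$Species
class_freq <- table(y)
w <- 1 / as.numeric(class_freq[as.character(y)])   # inverse class-frequency weight per observation

fit <- ranger(Species ~ ., data = iris, case.weights = w, num.trees = 500)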

Predict with fewer trees

Is it possible to use only a certain number of trees (or even to specify which trees) in the predict method?

predict with missing values

Why can't ranger make predictions when there are missing values in the data? In predict.R there are the following lines (160-162):

if (any(is.na(data.final))) {
  stop("Missing values in data.")
}

Ranger successfully trains a random forest on data that contains missing values, so why can't it make predictions as well?

[R-package] regression in 1D case does only work with formula interface

The following case errors:

library(ranger)
data = data.frame(x = 1:10, y = 1:10)
newdata = data.frame(x = runif(10,0,10))
m1 = ranger(formula = y~x, data = data, mtry = 1, write.forest = TRUE)
m2 = ranger(formula = NULL, dependent.variable.name = "y", data = data, mtry = 1, write.forest = TRUE)
p1 = predict(m1, data = newdata) #works
p2 = predict(m2, data = newdata) #doesn't
# Error: mtry can not be larger than number of variables in data. Ranger will EXIT now.
# Error in predict.ranger.forest(forest, data, seed, num.threads, verbose) : 
#   User interrupt or internal error.

non-formula interface

Any possibility of adding the ability to call ranger models like this?

ranger(x = xData, y = yData)

Most other ML packages provide this as an option (e.g. randomForest, Rborist, glmnet, xgboost, lm.fit, glm.fit, etc.). These interfaces are usually faster because the formula does not need to be parsed and the data does not need to be transformed into that form later anyway. Without looking through rangerCPP() I can't tell whether this would be the case here.
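For reference, newer ranger versions appear to accept a matrix/data-frame interface via x and y arguments (an assumption about releases after this issue); a minimal sketch:

library(ranger)

x_data <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]
y_data <- iris$Species

fit <- ranger(x = x_data, y = y_data, num.trees = 500)
fit$prediction.error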

protection stack overflow

ranger (R version) gives "Error: protect(): protection stack overflow" with a 141 x 17222 data frame.
I used mtry = 131 and 1000 trees; save.memory = TRUE does not help.

If needed, I can provide the data.

ranger won't compile with Intel 15 compilers

Compilation with the Intel 15.0.5 compiler (icpc -std=c++11) ends with the error:

AAA_check_cpp11.cpp(3): error: #error directive: Error: ranger requires a real C++11 compiler. You probably have to update gcc.
#error Error: ranger requires a real C++11 compiler. You probably have to update gcc.
^

Is this a real problem, or are all Intel compilers now effectively banned?

No error by importance() if importance = "none"

The importance function should throw an error if the ranger model was fit with importance = "none" (the default), as the check in importance.R intends:

if (is.null(x$variable.importance) | length(x$variable.importance) < 1) {
    stop("No variable importance found. Please use 'importance' option when growing the forest.")
}

Currently it returns a zero for every independent variable:

library(ranger)
mod <- ranger(Species ~ ., data = iris)
importance(mod)
# [1] 0 0 0 0

It happens in 0.4.0, 0.5.0 and 0.5.4. I'll try to look into it, but don't let that stop you from fixing this...

Importance calculation takes too much memory and time

Computing importance (of any kind) in ranger first builds a numTrees x numFeatures matrix (from the variable_importance vector of each tree object) and then averages it over trees. This has a substantial impact on memory use and speed for large numTrees; ranger should instead accumulate these values in place, using at most numFeatures x numThreads memory.

Can't install; multiple compilation error messages

install.packages("ranger")
Installing package into ‘/home/andy/R/i686-pc-linux-gnu-library/3.2’
(as ‘lib’ is unspecified)
trying URL 'http://cran.rstudio.com/src/contrib/ranger_0.2.7.tar.gz'

Content type 'application/x-gzip' length 50771 bytes (49 KB)

downloaded 49 KB

  • installing source package ‘ranger’ ...
    ** package ‘ranger’ successfully unpacked and MD5 sums checked
    ** libs
    g++ -std=c++0x -I/usr/share/R/include -DNDEBUG -I"/home/andy/R/i686-pc-linux-gnu-library/3.2/Rcpp/include" -fpic -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security -Werror=format-security -D_FORTIFY_SOURCE=2 -g -c Data.cpp -o Data.o
    g++ -std=c++0x -I/usr/share/R/include -DNDEBUG -I"/home/andy/R/i686-pc-linux-gnu-library/3.2/Rcpp/include" -fpic -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security -Werror=format-security -D_FORTIFY_SOURCE=2 -g -c DataChar.cpp -o DataChar.o
    g++ -std=c++0x -I/usr/share/R/include -DNDEBUG -I"/home/andy/R/i686-pc-linux-gnu-library/3.2/Rcpp/include" -fpic -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security -Werror=format-security -D_FORTIFY_SOURCE=2 -g -c DataDouble.cpp -o DataDouble.o
    g++ -std=c++0x -I/usr/share/R/include -DNDEBUG -I"/home/andy/R/i686-pc-linux-gnu-library/3.2/Rcpp/include" -fpic -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security -Werror=format-security -D_FORTIFY_SOURCE=2 -g -c DataFloat.cpp -o DataFloat.o
    g++ -std=c++0x -I/usr/share/R/include -DNDEBUG -I"/home/andy/R/i686-pc-linux-gnu-library/3.2/Rcpp/include" -fpic -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Wformat-security -Werror=format-security -D_FORTIFY_SOURCE=2 -g -c Forest.cpp -o Forest.o
    Forest.cpp: In member function ‘void Forest::showProgress(std::string)’:
    Forest.cpp:657:22: error: ‘std::chrono::steady_clock’ has not been declared
    Forest.cpp:661:3: error: ‘steady_clock’ has not been declared
    Forest.cpp:661:28: error: expected ‘;’ before ‘start_time’
    Forest.cpp:662:3: error: ‘steady_clock’ has not been declared
    Forest.cpp:662:28: error: expected ‘;’ before ‘last_time’
    Forest.cpp:668:51: error: ‘steady_clock’ has not been declared
    Forest.cpp:668:73: error: ‘last_time’ was not declared in this scope
    Forest.cpp:672:56: error: ‘steady_clock’ has not been declared
    Forest.cpp:672:78: error: ‘start_time’ was not declared in this scope
    Forest.cpp:676:19: error: ‘steady_clock’ has not been declared
    make: *** [Forest.o] Error 1
    ERROR: compilation failed for package ‘ranger’
  • removing ‘/home/andy/R/i686-pc-linux-gnu-library/3.2/ranger’
    Warning in install.packages :
    installation of package ‘ranger’ had non-zero exit status

The downloaded source packages are in
‘/tmp/RtmpBv2B7n/downloaded_packages’

devtools::session_info()
Session info ---------------------------------------------------------------------------
setting value
version R version 3.2.2 (2015-08-14)
system i686, linux-gnu
ui RStudio (0.99.446)
language (EN)
collate en_US.UTF-8
tz
date 2015-09-23

Packages -------------------------------------------------------------------------------
package * version date source
Boruta * 4.0.0 2014-12-07 CRAN (R 3.2.2)
curl 0.9.3 2015-08-25 CRAN (R 3.2.1)
devtools * 1.9.1 2015-09-11 CRAN (R 3.2.2)
digest 0.6.8 2014-12-31 CRAN (R 3.2.0)
httr 1.0.0 2015-06-25 CRAN (R 3.2.1)
magrittr 1.5 2014-11-22 CRAN (R 3.2.0)
memoise 0.2.1 2014-04-22 CRAN (R 3.2.0)
R6 2.1.1 2015-08-19 CRAN (R 3.2.1)
randomForest * 4.6-10 2014-07-17 CRAN (R 3.2.0)
rFerns * 1.1.0 2014-11-30 CRAN (R 3.2.0)
stringi 0.5-5 2015-06-29 CRAN (R 3.2.1)
stringr 1.0.0 2015-04-30 CRAN (R 3.2.1)

sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: i686-pc-linux-gnu (32-bit)
Running under: Ubuntu precise (12.04.5 LTS)

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_GB.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] devtools_1.9.1 Boruta_4.0.0 rFerns_1.1.0 randomForest_4.6-10

loaded via a namespace (and not attached):
[1] httr_1.0.0 R6_2.1.1 magrittr_1.5 tools_3.2.2
[5] rstudioapi_0.3.1 curl_0.9.3 memoise_0.2.1 stringi_0.5-5
[9] stringr_1.0.0 digest_0.6.8

What does ranger do with new factor levels in prediction?

Hello rangers

I recently stumbled over the error message

Error in predict.ranger.forest(forest, data, predict.all, seed, num.threads, :
Missing values in data.

It was the "classic" problem of a new factor level appearing in a categorical predictor at prediction time, which seems to happen only with respect.unordered.factors = "order" (or TRUE). For "partition" and "ignore" (FALSE), there is no such message.

I think the behaviour in the "order" and "ignore" (FALSE) cases is clear, although the error message for "order" could be more specific, e.g. "new or unknown factor levels in predictor". But what does ranger do in the remaining case, respect.unordered.factors = "partition" (no error)?

Below is a small example for testing:

# All possible two-partitions
fit <- ranger(Sepal.Width ~ Species, data = iris, write.forest = TRUE, respect.unordered.factors = "partition")
predict(fit, data.frame(Species = ""))$predictions

# Ordered by proportion of second class (respect.unordered.factors = TRUE)
fit <- ranger(Sepal.Width ~ Species, data = iris, write.forest = TRUE, respect.unordered.factors = "order")
predict(fit, data.frame(Species = ""))$predictions

# Factors are considered ordered (respect.unordered.factors = FALSE)
fit <- ranger(Sepal.Width ~ Species, data = iris, write.forest = TRUE, respect.unordered.factors = "ignore")
predict(fit, data.frame(Species = ""))$predictions

Add option to return in-bag count

Ranger should return how many times each sample was included in the bootstrap sample of a given tree. This is required to support ranger in sorhawell/forestFloor.
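A sketch of how this looks with the keep.inbag argument (assuming a ranger version that provides it and exposes inbag.counts on the fitted object):

library(ranger)

fit <- ranger(Species ~ ., data = iris, num.trees = 10, keep.inbag = TRUE)

# One count vector per tree: how often each of the 150 observations was drawn.
inbag <- simplify2array(fit$inbag.counts)   # 150 x 10 matrix
inbag[1:5, 1:5]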

Question on how to pass "split.select.weights"

Hi rangers

I am unsure how the probabilities in split.select.weights are associated with the regressors. Is it based on the order in which they appear in the formula? Or is split.select.weights simply a named numeric vector that can be given in any order? (A sketch of one interpretation follows below.)

Thanks for clarification.
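A sketch under the assumption (not confirmed here) that the weights are matched positionally to the independent variables, in the order they appear in the formula:

library(ranger)

# Assumed positional matching: one weight per predictor, in formula order.
fit <- ranger(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
              data = iris,
              split.select.weights = c(0.1, 0.8, 0.1),
              num.trees = 200)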

Install error in utility.h

Hi, I get the install error below in utility.h. Any suggestions?

Environment - Red Hat Enterprise Linux Server release 6.5 (Santiago)
R --version
R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: x86_64-unknown-linux-gnu (64-bit)

sudo R CMD INSTALL ranger_0.3.0.tar.gz

  • installing to library ‘/usr/lib64/RRO-8.0.2/R-3.1.2/lib64/R/library’
  • installing source package ‘ranger’ ...
    ** package ‘ranger’ successfully unpacked and MD5 sums checked
    ** libs
    g++ -std=c++0x -I/usr/lib64/RRO-8.0.2/R-3.1.2/lib64/R/include -DNDEBUG -DR_BUILD -I/usr/local/include -I"/usr/lib64/RRO-8.0.2/R-3.1.2/lib64/R/library/Rcpp/include" -fpic -g -O2 -c Data.cpp -o Data.o
    In file included from Data.cpp:36:
    utility.h: In function ‘void saveVector2D(std::vector<std::vector<T, std::allocator<_CharT> >, std::allocator<std::vector<T, std::allocator<_CharT> > > >&, std::ofstream&)’:
    utility.h:111: error: expected initializer before ‘:’ token
    Data.cpp:216: error: expected primary-expression at end of input
    Data.cpp:216: error: expected ‘;’ at end of input
    Data.cpp:216: error: expected primary-expression at end of input
    Data.cpp:216: error: expected ‘)’ at end of input
    Data.cpp:216: error: expected statement at end of input
    Data.cpp:216: error: expected ‘}’ at end of input
    make: *** [Data.o] Error 1
    ERROR: compilation failed for package ‘ranger’
  • removing ‘/usr/lib64/RRO-8.0.2/R-3.1.2/lib64/R/library/ranger’

Thanks,
Manish

Predictions seem wrong when compared to randomForest

data(iris)
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
iris_spec <- as.factor(iris$Species)
iris_dat <- as.matrix(iris[, !(names(iris) %in% "Species")])
set.seed(1234)

test_index <- sample(nrow(iris), 10)
train_index <- seq(1, nrow(iris))[-test_index]
iris_train <- randomForest(x = iris_dat[train_index, ], y = iris_spec[train_index], keep.forest = TRUE)
iris_pred <- predict(iris_train, iris_dat[test_index, ])

iris_train$confusion
##            setosa versicolor virginica class.error
## setosa         47          0         0  0.00000000
## versicolor      0         42         3  0.06666667
## virginica       0          4        44  0.08333333
table(iris_pred, iris_spec[test_index])
##             
## iris_pred    setosa versicolor virginica
##   setosa          3          0         0
##   versicolor      0          5         0
##   virginica       0          0         2
library(ranger)
## 
## Attaching package: 'ranger'
## 
## The following object is masked from 'package:randomForest':
## 
##     importance
iris_train2 <- ranger(data = iris[train_index, ], dependent.variable.name = "Species", write.forest = TRUE)
iris_pred2 <- predict(iris_train2, dat = iris[test_index, ])

iris_train2$classification.table
##             true
## predicted    setosa versicolor virginica
##   setosa         47          0         0
##   versicolor      0         41         3
##   virginica       0          4        45
table(iris_pred2$predictions, iris_spec[test_index])
##             
##              setosa versicolor virginica
##   setosa          0          0         0
##   versicolor      3          0         0
##   virginica       0          5         2

Install Error

Hi,

I get the error below during install. Can you please help?

[xx@xyyyyR]$ sudo R CMD INSTALL ranger_0.3.0.tar.gz 
* installing to library ‘/usr/lib64/RRO-8.0.2/R-3.1.2/lib64/R/library’
* installing *source* package ‘ranger’ ...
** package ‘ranger’ successfully unpacked and MD5 sums checked
** libs
g++ -std=c++0x -I/usr/lib64/RRO-8.0.2/R-3.1.2/lib64/R/include -DNDEBUG -DR_BUILD -I/usr/local/include -I"/usr/lib64/RRO-8.0.2/R-3.1.2/lib64/R/library/Rcpp/include"   -fpic  -g -O2 -c Data.cpp -o Data.o
In file included from Data.cpp:36:
utility.h: In function ‘void saveVector2D(std::vector<std::vector<T, std::allocator<_CharT> >, std::allocator<std::vector<T, std::allocator<_CharT> > > >&, std::ofstream&)’:
utility.h:111: error: expected initializer before ‘:’ token
Data.cpp:216: error: expected primary-expression at end of input
Data.cpp:216: error: expected ‘;’ at end of input
Data.cpp:216: error: expected primary-expression at end of input
Data.cpp:216: error: expected ‘)’ at end of input
Data.cpp:216: error: expected statement at end of input
Data.cpp:216: error: expected ‘}’ at end of input
make: *** [Data.o] Error 1
ERROR: compilation failed for package ‘ranger’
* removing ‘/usr/lib64/RRO-8.0.2/R-3.1.2/lib64/R/library/ranger’

Thanks, Manish

Slow for classification on many classes

Hi, I'm trying to figure out why ranger is taking longer than the equivalent command in randomForest in R. Are there options that are slowing it down? Thanks for any help.

dim(data_simple[training,])
[1] 4104   95

length(unique(as.factor(k.row)))  
[1] 704

num.trees <- 100000

#completes in < 12 hours on 1 thread
rf.out <- randomForest(x=data_simple[training,], y=as.factor(k.row), importance=TRUE, proximity=TRUE, ntree=num.trees, keep.forest=T, do.trace=100)

#predicted completion in 39 hours on 2 threads
ranger.out <- ranger(data=data.frame("classes"=as.factor(k.row),data_simple[training,]),importance="impurity", num.trees=num.trees, num.threads=2, write.forest=T, verbose=T, dependent.variable.name="classes", classification=T)
