carolssnz / gradientboostedmodels
Automatically exported from code.google.com/p/gradientboostedmodels
Instead of training with CV for a fixed number of trees, please consider adding
a feature to train until the best number of iterations is reached.
What sometimes happens is that I train a large model, say n.trees=5000, and the
best iteration is just beyond that, so I have to start over. Or the opposite
happens and I waste time building the first model (say, if best.iter=100 out
of 1000).
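A minimal sketch of one way to get this behaviour from the existing API (Y and dat are placeholders; the chunk size and the 0.9 safety margin are arbitrary): grow the model in chunks with gbm.more until the estimated best iteration falls comfortably inside the fitted range.
fit <- gbm(Y ~ ., data = dat, distribution = "gaussian", n.trees = 500)
repeat {
    best <- gbm.perf(fit, method = "OOB", plot.it = FALSE)
    if (best < 0.9 * fit$n.trees) break   # best iteration safely inside the range
    fit <- gbm.more(fit, 500)             # otherwise grow another 500 trees
}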
Original issue reported on code.google.com by [email protected]
on 20 Jan 2013 at 4:17
Reported by John Merrill via email.
The change from 1.6-3.2 to 2.0-8 breaks objects trained with older versions of
GBM: 2.0-8 GBM objects include a new field, num.classes, which was not present
in older objects, and the new versions of plot.gbm and predict.gbm don't check
for its absence.
It's not hard to fix predict.gbm, at least -- instead of using the value in the
object directly, check for the field's absence and set a local variable to 1 if
there's nothing there and to the value in the object if there's something there.
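A minimal sketch of that guard, assuming the field is named num.classes as described above (not the package's actual patch):
num.classes <- if (is.null(object$num.classes)) 1 else object$num.classes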
Thanks for your work on the package, and thanks for all the new features in GBM 2.0.
Original issue reported on code.google.com by harry.southworth
on 29 Jan 2013 at 9:59
The gbm function should be genericized:
gbm <- function(x, ...) UseMethod("gbm")
gbm.formula <- function(x, ...) {
    ## build the model frame from the formula, then dispatch
    NextMethod()
}
gbm.default <- function(x, ...) {
    ## main fitting code
}
Calling gbm.default directly ought then to be quicker than looping on the
current gbm.
Also, moving more stuff out of gbm.fit ought to make it quicker to call that
directly, but we need to be careful not to break dependencies.
Original issue reported on code.google.com by harry.southworth
on 29 Jan 2013 at 10:07
What steps will reproduce the problem?
1. Create a toy GBM solution
2. Use shrink.gbm with this solution
3. Use shrink.gbm.pred on the resulting system
What is the expected output? What do you see instead?
I'd expect to see a list of numbers. Instead, I start with a pair of R errors
(the wrappers for gbm_shrink_grad and gbm_shrink_pred both omit the cNumClasses
variable). Those are straightforward to fix -- one simply adds the expected
parameter to the .Call invocations.
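A hypothetical sketch of that kind of fix -- the argument list here is assumed, not taken from the package source; the point is only that the wrapper's .Call must match the arity of the C entry point:
ans <- .Call("gbm_shrink_pred",
             X, cRows, cCols, n.trees, initF, trees, c.splits,
             var.type, cNumClasses)   # cNumClasses was previously omitted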
Then, however, the predicted values coming out of shrink.gbm.pred are always
NaN. I spent a while groveling in the code, and it just seems wrong to me --
where is the handling of multinomial classes? Why are all node predictions set
to R_NaN at line 753 of gbmentry.cpp? I don't see any leaf node handling
further down, nor any root node handling, so I can't see how the shrinkages
could possibly propagate upwards or downwards.
What version of the product are you using? On what operating system?
GBM 2.1
Linux (Ubuntu Lucid Lynx)
R 2.15.3 and 3.0.1
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 4 Jun 2013 at 12:27
What steps will reproduce the problem?
I called gbm() for binary classification with 5-fold CV. Normally this works,
but sometimes it fails. I think it has mostly to do with the data set (most
data sets are fine) and somewhat with the GBM parameters.
What is the expected output?
No error, just a nice graph. If the graph cannot be made, handle errors better
and fall back as if called with plot.it=FALSE to return the optimal number of
trees.
What do you see instead?
Error in plot.window(...) : need finite 'ylim' values
What version of the product are you using? On what operating system?
Windows 7 64-bit
R 3.0.0 64-bit
gbm 2.0-8
Please provide any additional information below.
I stepped through gbm.perf() and I think the problem is here:
> ylim <- range(object$train.error, object$cv.error)
> ylim
[1] 0.0239321 Inf
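A minimal sketch of the kind of guard that would give the suggested fallback (assumed, not the package's actual fix): drop non-finite errors before computing ylim, and skip plotting when nothing finite remains.
err <- c(object$train.error, object$cv.error)
err <- err[is.finite(err)]
if (length(err) == 0) plot.it <- FALSE else ylim <- range(err)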
Original issue reported on code.google.com by [email protected]
on 7 May 2013 at 7:04
Pairwise regression is fine with the formula interface, but when I try the same
model with gbm.fit, the Rgui crashes.
This is my naive model:
gbm1 <- gbm.fit(x = train[, c(2, 4)],
                y = train[, 1],
                n.trees = 1000,
                distribution = list(name = "pairwise", group = "query", metric = "conc"),
                interaction.depth = 2,
                n.minobsinnode = 30,
                shrinkage = 0.01,
                bag.fraction = 0.95,
                verbose = TRUE)
Is this a known issue / restriction, or is it a bug?
Thank you for your work!
Jose A.
Original issue reported on code.google.com by harry.southworth
on 14 Jun 2013 at 8:39
What steps will reproduce the problem?
1. Build a gbm object.
2. Call the predict function, passing a zero-length integer as the number-of-trees
argument.
What is the expected output? What do you see instead?
An error message is expected, but instead it goes into an infinite loop and the
R session becomes unresponsive.
What version of the product are you using? On what operating system?
mac osx
Please provide any additional information below.
It seems like the LENGTH macro in Rinternals.h returns a garbage value for
zero-length vectors.
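A one-line input guard on the R side would avoid the hang; its placement inside predict.gbm is assumed:
if (length(n.trees) == 0) stop("n.trees must contain at least one value")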
Original issue reported on code.google.com by [email protected]
on 10 Oct 2013 at 12:19
What steps will reproduce the problem?
## 1. create a categorical predictor
iris$Sepal.LengthCat <- as.factor(trunc(iris$Sepal.Length))
## 2. fit the model
gbm.model <- gbm(Sepal.Length ~ Sepal.LengthCat + Sepal.Width + Petal.Length + Petal.Width,
                 data = iris,
                 distribution = "gaussian",
                 # distribution = "multinomial",
                 n.trees = 500,
                 shrinkage = 0.01,
                 interaction.depth = 5,
                 bag.fraction = 0.8,
                 train.fraction = 1,
                 n.minobsinnode = 4,
                 cv.folds = 5,
                 keep.data = TRUE,
                 n.cores = 4,
                 verbose = TRUE)
summary(gbm.model, n.trees = best.iter)
### works fine
plot.gbm(gbm.model, i.var = 2, n.trees = 400)
### 3. fails
plot.gbm(gbm.model, i.var = 1, n.trees = 400)
What is the expected output? What do you see instead?
Expected output would be predicted probabilities for each class
What version of the product are you using? On what operating system?
gbm_2.1-0.3 with R 3.0.2 and Ubuntu 12.04
Please provide any additional information below.
Plots work on quantitative predictors. However, the values on the Y-axis are
strange: shouldn't they be probabilities (hence in [0, 1])?
Original issue reported on code.google.com by [email protected]
on 25 Mar 2014 at 2:44
What steps will reproduce the problem?
1. s <- Surv(y,ysensor)
2. class(s) <- c("someclass","Surv")
3. gbm(Surv(y,ysensor)~x,data=dat,distribution="coxph")
What is the expected output? What do you see instead?
You expect to get a model. Instead you receive the error:
Error in gbm.fit(x, y, offset = offset, distribution = distribution, w = w, :
The number of rows in x does not equal the length of y.
In addition: Warning message:
In if (nrow(x) != ifelse(class(y) == "Surv", nrow(y), length(y))) { :
the condition has length > 1 and only the first element will be used
What version of the product are you using? On what operating system?
Windows 7, R-2.15.3
'gbm' version 2.1
Please provide any additional information below.
A simple fix would be to change the class test to:
ifelse("Surv" %in% class(y), nrow(y), length(y))
Original issue reported on code.google.com by [email protected]
on 21 Jun 2013 at 8:03
What steps will reproduce the problem?
I ran gbm() with
max.trees: 500
interaction.depth: 3
shrinkage: 0.1
bag fraction: 1
cv folds: 5
n.cores=3
distribution="bernoulli"
train.fraction = 1
n.minobsinnode = 15
What is the expected output?
Something like this
Iter TrainDeviance ValidDeviance StepSize Improve
1 1.6876 nan 0.1000 0.0981
2 1.4814 nan 0.1000 0.3318
What do you see instead?
Cross validating: 1 2 3
Error in cut.default(i, breaks) : 'breaks' are not unique
What version of the product are you using? On what operating system?
gbm 2.0-9.5 Windows binary from
https://code.google.com/p/gradientboostedmodels/downloads/detail?name=gbm_2.0-9.5.zip&can=2&q=
R 2.15.3 64-bit
Windows 7 64-bit
4 physical CPUs
8 logical CPUs
Please provide any additional information below.
Even after adjusting some settings, I cannot reproduce this error using the gbm
example code and data in the gbm documentation.
Original issue reported on code.google.com by [email protected]
on 9 May 2013 at 2:38
What steps will reproduce the problem?
1. Add these two lines to .Rprofile:
cat(paste("This session PID is ", system("echo $PPID", intern = TRUE), ":\n", sep = ""))
cat(paste("begun at ", as.POSIXct(date(), format = "%a %b %d %H:%M:%S %Y"), ":\n", sep = ""))
2. Source the attached file, which wraps the gbm example code in a function to
be run via mclapply with different seed numbers.
What is the expected output? What do you see instead?
Using gbm-1.6-3.1, I get this, as expected:
> testing(4)
2013-08-21 20:57:31 Begin using multicore method with phony data with 4 cores.
Core 1 uses 20442
Core 2 uses 20443
Core 3 uses 20445
Core 4 uses 20447
2013-08-21 20:57:36
....Completed testing multicore method with invented data.
$a
CV Test OOB
1 126 131 79
$b
CV Test OOB
1 117 121 85
$c
CV Test OOB
1 157 126 83
$d
CV Test OOB
1 123 140 81
Using the current gbm package, I get this:
system.time(bbb <- testing(4))
2013-08-21 16:18:03 Begin using multicore method with phony data with 4 cores.
Core 1 uses 22812
Core 2 uses 22814
Core 3 uses 22816
Core 4 uses 22819
This session PID is 22821:
begun at 2013-08-21 16:18:04:
This session PID is 22829:
begun at 2013-08-21 16:18:04:
This session PID is 22838:
begun at 2013-08-21 16:18:04:
This session PID is 22847:
begun at 2013-08-21 16:18:04:
2013-08-21 16:18:07
....Completed testing multicore method with invented data.
user system elapsed
0.460 1.760 3.926
Warning message:
In mclapply(subsets, FUN = test.gbm, mc.cores = nc, mc.cleanup = FALSE, :
3 function calls resulted in an error
bbb$b
[1] "Error in socketConnection(\"localhost\", port = port, server = TRUE,
blocking = TRUE, : \n cannot open the connection\n"
attr(,"class")
[1] "try-error"
attr(,"condition")
<simpleError in socketConnection("localhost", port = port, server = TRUE,
blocking = TRUE, open = "a+b", timeout = timeout): cannot open the
connection>
>
bbb$a works properly and the errors on bbb$c and bbb$d are identical
to the above.
Notice that, as well as starting a new R process for each parallel process,
another one is immediately started within each of them. That prevents mclapply
from collecting more than the first element of subsets.
What version of the product are you using? On what operating system?
To work, it needs gbm-1.6-3.1; gbm-2.x gives the result above. The old version
is on a Fedora 15 (32-bit) installation, the newer one on a Kubuntu 10.04
(64-bit).
Please provide any additional information below.
When more challenging functions are tried, instead of just one extra R process
being started, one for each core on the machine (irrespective of how many are
asked for) is started.
Setting mc.cores to 1 will work (i.e. return the complete list), but it still
starts those extra R processes and takes twice as long as on identical
hardware running Windows 7.
Original issue reported on code.google.com by [email protected]
on 23 Aug 2013 at 9:03
Attachments:
What steps will reproduce the problem?
1. Fit model using cv.folds > 1
2. Install older version from CRAN and refit
3. Look at gbm.perf
What is the expected output? What do you see instead?
The plots should be similar
I think cv.error is being wrongly scaled. If so, the 'best' number of trees
should still be usable, since a rescaling does not change where the minimum occurs.
Original issue reported on code.google.com by harry.southworth
on 1 Feb 2013 at 7:19
I plotted an NDCG versus iteration figure. Why does NDCG decrease as the
iterations increase? We want to maximize NDCG.
Thanks
Original issue reported on code.google.com by [email protected]
on 3 Jun 2013 at 4:20
Removed X= from the X=1:cv.folds argument in the parLapply function call. Here
is the function definition for parLapply:
parLapply <- function(cl, x, fun, ...)
    docall(c, clusterApply(cl, splitList(x, length(cl)), lapply, fun, ...))
I think there was confusion between 'X' and 'x': since we already pass 'x'
through to gbm.fit, I felt it was best to delete the 'X' assignment and let
positional matching bind the fold indices to the 'x' in parLapply.
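A sketch of the corrected call under that reasoning (fitFold stands in for the real per-fold worker, which is not shown here):
## pass the fold indices positionally so they bind to parLapply's 'x',
## leaving the model matrix 'x' free to be forwarded to gbm.fit via '...'
cv.results <- parLapply(cl, 1:cv.folds, fitFold)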
Original issue reported on code.google.com by [email protected]
on 29 May 2013 at 6:45
Attachments:
<[email protected]>
"When I call pretty.gbm.tree(...) on gbm object with recent version, the first
row of the table shows wrong weight and error reduction. For example, if my
data consists of 1000 elements, pretty.gbm.tree(...) shows 999 in the "Weight"
column in the first row."
Mindaugas also provided a potential fix that I modified.
I'm now waiting for a response
Original issue reported on code.google.com by harry.southworth
on 8 Apr 2013 at 4:12
What steps will reproduce the problem?
1. Have a big training set
2. Train a gbm model with 2k trees
3. Remove the previous model from memory using rm() and train again.
What is the expected output? What do you see instead?
The memory from the previous training should be released; instead it is not.
What version of the product are you using? On what operating system?
gbm 2.1-0.2 on ubuntu 12.04
Please provide any additional information below.
In the GBMRESULT CLaplace::InitF method, it seems that the adArr variable is
never released.
Original issue reported on code.google.com by [email protected]
on 15 Sep 2013 at 10:13
What steps will reproduce the problem?
1. Run gbm with cv.folds>1 using a dataset with factor variables, but restrict
the model formula to a subset of the dataset variables, leaving out the factor
variables.
What is the expected output? What do you see instead?
gbm should run without a problem; instead it gives "Error in
object$var.levels[[i]] : subscript out of bounds"
What version of the product are you using? On what operating system?
gbm 2.1
Windows 7 64bit
Please provide any additional information below.
The attached code shows the error using the sample code from the package.
It appears the error is coming from predict.gbm, from the step given below.
Note also that the index i below for the factor variable in dataset x differs
from the index into object$var.levels, since object$var.levels is limited to
the model variables. In the attached code, for instance,
length(object$var.levels) = 4 whereas cCols = 6:
for (i in 1:cCols) {
    if (is.factor(x[, i])) {
        if (length(levels(x[, i])) > length(object$var.levels[[i]])) {
            new.compare <- levels(x[, i])[1:length(object$var.levels[[i]])]
        }
        else {
            new.compare <- levels(x[, i])
        }
        if (!identical(object$var.levels[[i]], new.compare)) {
            x[, i] <- factor(x[, i], union(object$var.levels[[i]],
                levels(x[, i])))
        }
        x[, i] <- as.numeric(x[, i]) - 1
    }
}
Original issue reported on code.google.com by [email protected]
on 26 Jun 2013 at 6:18
Attachments:
What steps will reproduce the problem?
1. Run the attached script
What is the expected output? What do you see instead?
Instead of fitting a model, the Rgui window suddenly closes without any error
message.
What version of the product are you using? On what operating system?
R 2.15.3 and R 3.0.0
GBM 2.1
Please provide any additional information below.
I think this has to do with CV and n.cores=1
Original issue reported on code.google.com by [email protected]
on 18 May 2013 at 3:42
Attachments:
What steps will reproduce the problem?
1. Print a model fit to a classification problem: dist='multinomial',
'bernoulli', 'huberized'.
The confusion matrix is obtained using the optimal number of trees as decided
by cross-validation or whatever method was selected. However, think about the
cross-validation deviance and training-set deviance plotted by gbm.perf:
because the predictions are made on the training data itself, the confusion
matrix comes from a grossly overfit model, even though it uses the number of
trees selected by cross-validation.
We need to either get rid of the confusion matrix calculations, or capture the
cross-validation predictions somewhere within the cross-validation loop.
Original issue reported on code.google.com by harry.southworth
on 9 Jan 2013 at 10:07
What steps will reproduce the problem?
1. Create a dataset in which the 1st observation's offset = 0
2. Build two gbm models: one with offset and the other without.
3. The results are identical.
4. If you reshuffle the data so that the 1st observation's offset != 0, the two
models will be different.
What is the expected output? What do you see instead?
I expect the models with and without offset to be different; instead they are
the same.
What version of the product are you using? On what operating system?
gbm 1.6-3.1
Please provide any additional information below.
The problem is this line in gbm.fit:
if (is.null(offset) || (offset == 0)) {
    offset <- NA
}
Here offset is a vector, so offset == 0 evaluates to a vector of TRUE and FALSE
values. When the condition inside an if-clause has length > 1, only the first
element is used. So if offset[1] == 0, then the condition is TRUE, and
offset <- NA.
I am guessing Greg Ridgeway's intent was all(offset == 0), not (offset == 0).
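With that suggestion, the check becomes (a sketch of the one-line fix described above):
if (is.null(offset) || all(offset == 0)) {
    offset <- NA
}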
Original issue reported on code.google.com by [email protected]
on 25 Jan 2013 at 6:35
What steps will reproduce the problem?
1. Train a large data set with 5 CV on a machine with 2+ CPUs and adequate
What is the expected output? What do you see instead?
I expect all the CPUs to be used for CV, but instead it is slow. Some of my
models take many hours, and after trying different interaction levels and
maximum numbers of trees, it can take days.
What version of the product are you using? On what operating system?
gbm 1.6 (and probably 2.0)
Please provide any additional information below.
Please support %dopar% like the caret package does.
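For illustration, the requested pattern might look like this (this is not gbm's API; fitFold is a hypothetical worker that fits on all-but-one fold and returns the held-out deviance):
library(doParallel)
registerDoParallel(cores = 4)
cv.err <- foreach(fold = 1:5, .combine = c, .packages = "gbm") %dopar% {
    fitFold(fold)
}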
Original issue reported on code.google.com by [email protected]
on 20 Jan 2013 at 4:13
gbm() works great with multinomial outcomes, but gbm.fit() does not. Since
gbm() requires a formula, it is less efficient, and using gbm.fit() would be
preferred for me (it works for other distributions).
> test <- gbm.fit(iris[, 1:4], iris$Species, distribution = "multinomial")
Iter TrainDeviance ValidDeviance StepSize Improve
1 nan nan 0.0010 nan
2 nan nan 0.0010 nan
3 nan nan 0.0010 nan
4 nan nan 0.0010 nan
5 nan nan 0.0010 nan
6 nan nan 0.0010 nan
7 nan nan 0.0010 nan
8 nan nan 0.0010 nan
9 nan nan 0.0010 nan
10 nan nan 0.0010 nan
20 nan nan 0.0010 nan
40 nan nan 0.0010 nan
60 nan nan 0.0010 nan
80 nan nan 0.0010 nan
100 nan nan 0.0010 nan
> test
NULL
A gradient boosted model with multinomial loss function.
100 iterations were performed.
There were 4 predictors of which 0 had non-zero influence.
> predict(test, head(iris[, 1:4]), n.trees = 50)
, , 50
setosa versicolor virginica
[1,] NA NA NA
[2,] NA NA NA
[3,] NA NA NA
[4,] NA NA NA
[5,] NA NA NA
[6,] NA NA NA
> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel splines stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] gbm_2.0-9.3 lattice_0.20-10 survival_2.36-14
loaded via a namespace (and not attached):
[1] grid_2.15.2
Original issue reported on code.google.com by [email protected]
on 6 Feb 2013 at 2:41
There are two issues:
1) setting the seed does not ensure reproducibility of the model
2) when no predictors are used in any splits, the predicted values are somewhat
inconsistent; sometimes NA values are produced.
> library(gbm)
> library(caret)
> data(mdrr)
>
> set.seed(1)
> gbm1 <- gbm.fit(mdrrDescr[, 1:20], ifelse(mdrrClass == "Active", 1, 0),
+ distribution = "bernoulli")
Iter TrainDeviance ValidDeviance StepSize Improve
1 inf nan 0.0010 nan
2 inf nan 0.0010 nan
3 inf nan 0.0010 nan
4 inf nan 0.0010 nan
5 inf nan 0.0010 nan
6 inf nan 0.0010 nan
7 inf nan 0.0010 nan
8 inf nan 0.0010 nan
9 inf nan 0.0010 nan
10 inf nan 0.0010 nan
20 inf nan 0.0010 nan
40 inf nan 0.0010 nan
60 inf nan 0.0010 nan
80 inf nan 0.0010 nan
100 inf nan 0.0010 nan
> gbm1
NULL
A gradient boosted model with bernoulli loss function.
100 iterations were performed.
There were 20 predictors of which 0 had non-zero influence.
>
> predict(gbm1, head(mdrrDescr), n.trees = 100, type = "response")
[1] 0.485376 0.485376 0.485376 0.485376 0.485376 0.485376
> set.seed(1)
> gbm1 <- gbm.fit(mdrrDescr[, 1:20], ifelse(mdrrClass == "Active", 1, 0),
+ distribution = "bernoulli")
Iter TrainDeviance ValidDeviance StepSize Improve
1 nan nan 0.0010 nan
2 nan nan 0.0010 nan
3 nan nan 0.0010 nan
4 nan nan 0.0010 nan
5 nan nan 0.0010 nan
6 nan nan 0.0010 nan
7 nan nan 0.0010 nan
8 nan nan 0.0010 nan
9 nan nan 0.0010 nan
10 nan nan 0.0010 nan
20 nan nan 0.0010 nan
40 nan nan 0.0010 nan
60 nan nan 0.0010 nan
80 nan nan 0.0010 nan
100 nan nan 0.0010 nan
> gbm1
NULL
A gradient boosted model with bernoulli loss function.
100 iterations were performed.
There were 20 predictors of which 0 had non-zero influence.
>
> predict(gbm1, head(mdrrDescr), n.trees = 100, type = "response")
[1] NaN NaN NaN NaN NaN NaN
It looks like older versions would produce a non-NA value for all samples (as
in the top example).
I'm also not sure why no splits would occur in the model. This seems to be
occurring with a higher frequency than before.
Thanks,
Max
> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel splines stats graphics grDevices utils
[7] datasets methods base
other attached packages:
[1] caret_5.15-60 reshape2_1.2.1 plyr_1.8
[4] foreach_1.4.0 cluster_1.14.3 gbm_2.0-9.3
[7] lattice_0.20-10 survival_2.36-14
loaded via a namespace (and not attached):
[1] codetools_0.2-8 grid_2.15.2 iterators_1.0.6 stringr_0.6.1
[5] tools_2.15.2
Original issue reported on code.google.com by [email protected]
on 6 Feb 2013 at 3:54
What steps will reproduce the problem?
I'm not yet sure how to reproduce this with a simple, public data set, but the
error has something to do with cross validation.
What is the expected output? What do you see instead?
Very quickly after starting GBM (before it does any training), I get this error:
Error in x[i] : object of type 'closure' is not subsettable
What version of the product are you using? On what operating system?
GBM 2.1
R 2.15.3
Please provide any additional information below.
I will attach the traceback, which includes the gbm() invocation.
If you accept this bug, I am glad to continue trying to get the error to
reproduce on your system.
As a workaround, is an old version of GBM available? GBM 1.6 and GBM 2.0-8
seemed OK, but the recent CRAN update is causing me problems.
Original issue reported on code.google.com by [email protected]
on 18 May 2013 at 4:15
Attachments:
Fix formatting from a code flag to an item flag for the new cv.fitted entry.
Original issue reported on code.google.com by Shea.Parkes
on 18 Apr 2013 at 3:02
Attachments:
shrGBM <- shrink.gbm(GBM_model, 10)
Error in shrink.gbm(GBM_model, 10) :
type 11 is unimplemented in 'type2char'
Reported by email, 2013-05-30 (or 31)
Original issue reported on code.google.com by harry.southworth
on 31 May 2013 at 7:17
Instead of a constant shrinkage (learning rate), how about one that starts out
high (say, 0.1) and gradually decreases to a small value (say, 0.001)? In
neural network training this is known as learning-rate decay or annealing.
When used properly it can speed up learning while still allowing small
adjustments at the end. I assume a decaying rate would help with gradient
boosting too, and for GBM it could produce smaller, simpler models (fewer trees).
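For illustration only (this is not a gbm feature), a geometric schedule interpolating between those two endpoints over the run could look like:
## shrinkage for iteration t of n.trees, decaying from 0.1 to 0.001
shrink <- function(t, n.trees) 0.1 * (0.001 / 0.1)^((t - 1) / (n.trees - 1))
shrink(1, 1000)     # 0.1
shrink(1000, 1000)  # 0.001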
Original issue reported on code.google.com by [email protected]
on 2 Apr 2013 at 3:49
> set.seed(1)
> test1 <- gbm(Species ~ ., data = iris, n.trees = 1000, verbose = FALSE)
Distribution not specified, assuming multinomial ...
>
> predict(test1, head(iris), n.trees = 1000, type = "response")
setosa versicolor virginica
[1,] 0.8351586 0.1132224 0.05161896
[2,] 0.8327497 0.1157803 0.05147007
[3,] 0.8351586 0.1132224 0.05161896
[4,] 0.8344369 0.1139887 0.05157435
[5,] 0.8351586 0.1132224 0.05161896
[6,] 0.8348343 0.1135668 0.05159892
> predict(test1, head(iris), n.trees = c(500, 1000), type = "response")
500 1000
[1,] 0.6692731 0.11322242
[2,] 0.6663942 0.11578028
[3,] 0.6692731 0.11322242
[4,] 0.6685032 0.11398871
[5,] 0.6692731 0.11322242
[6,] 0.6692731 0.11356676
[7,] 0.8351586 0.13365581
[8,] 0.8327497 0.13308088
[9,] 0.8351586 0.13365581
[10,] 0.8344369 0.13350206
[11,] 0.8351586 0.13365581
[12,] 0.8348343 0.13365581
[13,] 0.1970711 0.05161896
[14,] 0.2005250 0.05147007
[15,] 0.1970711 0.05161896
[16,] 0.1979947 0.05157435
[17,] 0.1970711 0.05161896
[18,] 0.1970711 0.05159892
Looking at predict.gbm, the object predF is converted to an array of
matrices, but then:
if (length(n.trees) > 1) {
    ## the problem is here:
    predF <- matrix(predF, ncol = length(n.trees), byrow = FALSE)
    colnames(predF) <- n.trees
    predF[, i.ntree.order] <- predF
}
This worked with bernoulli since the results for the event were returned
(and had one dimension). It would be helpful to keep the array structure.
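A sketch of what keeping the array structure might look like (num.classes and class.names are assumed to be available inside predict.gbm; this is not the actual patch):
predF <- array(predF,
               dim = c(nrow(newdata), num.classes, length(n.trees)),
               dimnames = list(NULL, class.names, n.trees))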
Thanks,
Max
> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] splines stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] gbm_2.0-8 lattice_0.20-10 survival_2.36-14
loaded via a namespace (and not attached):
[1] grid_2.15.2
Original issue reported on code.google.com by harry.southworth
on 5 Feb 2013 at 5:13
What steps will reproduce the problem?
1. Run a model with cv.folds > 1
If verbose = TRUE (the default), an awful lot of updating gets spewed to the
screen. This is a hangover from when the computations ran much more slowly. I
think verbose should have options TRUE (current behaviour), FALSE (no printing
to screen) and CV (print to screen when starting a new CV fold), with the
default being CV.
Original issue reported on code.google.com by harry.southworth
on 25 Jan 2013 at 2:28
What steps will reproduce the problem?
1. Run the attached script with cv.folds=1
What is the expected output? What do you see instead?
Error in gbm(mydata[TRUE, "survived"] ~ ., data = mydata2, distribution =
"bernoulli", :
  object 'p' not found
There should be a sanity check like stopifnot(cv.folds >= 2 | cv.folds == 0)
What version of the product are you using? On what operating system?
R 2.15.3 and R 3.0.0
gbm 2.1
Please provide any additional information below.
Minor issue
Original issue reported on code.google.com by [email protected]
on 18 May 2013 at 4:03
Attachments:
What steps will reproduce the problem?
1. Build a gbm model with a factor variable that has a missing (unused) level.
2. Use the gbm model to predict an outcome with the missing level.
3. The prediction comes out wrong.
What is the expected output? What do you see instead?
I expect the prediction of the missing level would come from the missing node,
but it picks up a value from another node instead.
What version of the product are you using? On what operating system?
2.0-8
Please provide any additional information below.
Please refer to the script after **** and take a look at the output of
prettyTree.
"c" is a missing level.
I expect that when "c" is supplied at prediction time, gbm will go to node #6,
but it seems to have gone to node #5 instead.
Also, in the old version of gbm, weight at node #1 is 300. In the current
version, the weight is 299 (300-1). Could you explain?
********************
y1 = rep(1, 100)
y2 = rep(3, 100)
y3 = rep(11, 100)
y = c(y1, y2, y3)
x1 = rep("a", 100)
x2 = rep("b", 100)
x3 = rep("d", 100)
x = c(x1, x2, x3)
x = factor(x, c("a", "b", "c", "d"))
df = data.frame(y, x)
gbm1 = gbm(y ~ x, data = df, train.fraction = 1, n.trees = 1,
           interaction.depth = 2, shrinkage = 1, cv.folds = 0,
           bag.fraction = 1, distribution = "gaussian")
pretty.gbm.tree(gbm1, i.tree = 1)
gbm1$c.splits
gbm1$initF
df1 = data.frame(x = c("a", "b", "c", "d"))
p = predict(gbm1, df1, n.trees = 1, type = "response")
p
Original issue reported on code.google.com by [email protected]
on 26 Jan 2013 at 6:23
Attachments:
What steps will reproduce the problem?
1. run a gbm model with cv.folds=0
2. call predict.gbm(model)
The error you will see is from this line in the source for predict.gbm:
cat(paste("Using", n.trees, "trees...\n"))
and is caused by this line immediately above:
best <- length(object$train.error)
The variable 'best' is never referenced again. I believe this is a simple bug:
n.trees should be set to the length of object$train.error when n.trees is not
specified in the arguments.
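A sketch of that fix as described (its exact placement within predict.gbm is assumed):
if (missing(n.trees)) {
    n.trees <- length(object$train.error)
}
cat(paste("Using", n.trees, "trees...\n"))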
This bug arises in version 2.1 of gbm.
Original issue reported on code.google.com by [email protected]
on 29 Jan 2015 at 8:30
It seems that interact.gbm does not work for family=multinomial
> data(iris)
> set.seed(10)
> f<-gbm(Species~.,data=iris,n.trees=1000,interaction.depth=2)
Distribution not specified, assuming multinomial ...
> interact.gbm(f,data=iris,i.var=c(1,2))
Error in weighted.mean.default(f, n) :
'x' and 'w' must have the same length
Works fine with the bernoulli family
> set.seed(10)
> f1<-gbm(I(Species=="setosa")~.,data=iris,n.trees=1000,interaction.depth=2)
Distribution not specified, assuming bernoulli ...
> interact.gbm(f1,data=iris,i.var=c(1,2))
[1] 0.8861943
> sessionInfo()
R version 2.15.3 (2013-03-01)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=sv_SE.UTF-8 LC_NUMERIC=C LC_TIME=sv_SE.UTF-8 LC_COLLATE=sv_SE.UTF-8
[5] LC_MONETARY=sv_SE.UTF-8 LC_MESSAGES=sv_SE.UTF-8 LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=sv_SE.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel splines stats graphics grDevices utils datasets
methods base
other attached packages:
[1] Hmisc_3.10-1 gbm_2.0-9.5 lattice_0.20-6 survival_2.36-14
loaded via a namespace (and not attached):
[1] cluster_1.14.3 grid_2.15.3 tools_2.15.3
Original issue reported on code.google.com by [email protected]
on 25 Mar 2013 at 4:46
The issue is to do with gbm.fit not returning train.fraction. A related issue
has been described by Elisabeth Freeman:
Thank you for the Windows binary.
I have started testing ModelMap with the new gbm package, and I think at least
some of the issues are related to the change of the argument name in gbm.fit()
from 'train.fraction' to 'nTrain'. ModelMap not only used the old argument name
(which just generates a warning), but for prediction it extracted the value of
'train.fraction' from the model object by name, and the new model objects
created by gbm.fit() do not have a component by that name, resulting in an
error.
I can update ModelMap to use nTrain, but I also ran into an issue with the
gbm.more() function. This function seems to still require that the model object
have a component named ‘train.fraction’. For example, in these lines of
code from the gbm.more function:
num.groups.train <- max(1, round(object$train.fraction * nlevels(group)))
Model objects created by the new gbm() still have a train.fraction component,
but objects created by the new gbm.fit() only have the ‘nTrain’ component.
Since ModelMap is often used on large data sets, it uses the gbm.fit() function
for model building. When I use gbm.more() on these models, I get the following
result:
model.obj
gbm.more(object = SGB, n.new.trees = 100)
A gradient boosted model with gaussian loss function.
1200 iterations were performed.
Error in if (x$train.fraction < 1) { : argument is of length zero
Here is some sample code adapted from the gbm help files that shows the issue I
am running into with using the gbm.more function on models fitted with
gbm.fit():
################################################################################
N <- 1000
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- ordered(sample(letters[1:4],N,replace=TRUE),levels=letters[4:1])
X4 <- factor(sample(letters[1:6],N,replace=TRUE))
X5 <- factor(sample(letters[1:3],N,replace=TRUE))
X6 <- 3*runif(N)
mu <- c(-1,0,1,2)[as.numeric(X3)]
SNR <- 10 # signal-to-noise ratio
Y <- X1**1.5 + 2 * (X2**.5) + mu
sigma <- sqrt(var(Y)/SNR)
Y <- Y + rnorm(N,0,sigma)
# introduce some missing values
X1[sample(1:N,size=500)] <- NA
X4[sample(1:N,size=300)] <- NA
X<-data.frame(X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)
data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)
# fit initial model with gbm function
gbm1 <-
gbm(Y~X1+X2+X3+X4+X5+X6, # formula
data=data, # dataset
var.monotone=c(0,0,0,0,0,0), # -1: monotone decrease,
# +1: monotone increase,
# 0: no monotone restrictions
distribution="gaussian", # see the help for other choices
n.trees=1000, # number of trees
shrinkage=0.05, # shrinkage or learning rate,
#0.001 to 0.1 usually work
interaction.depth=3, # 1: additive model, 2: two-way interactions, etc.
bag.fraction = 0.5, # subsampling fraction, 0.5 is probably best
train.fraction = 0.5, # fraction of data for training,
# first train.fraction*N used for training
n.minobsinnode = 10, # minimum total weight needed in each node
keep.data=TRUE, # keep a copy of the dataset with the object
verbose=FALSE) # don't print out progress
# fit initial model with gbm.fit function
gbm1fit <-
gbm.fit(x=X,y=Y,
var.monotone=c(0,0,0,0,0,0), # -1: monotone decrease,
# +1: monotone increase,
# 0: no monotone restrictions
distribution="gaussian", # see the help for other choices
n.trees=1000, # number of trees
shrinkage=0.05, # shrinkage or learning rate,
#0.001 to 0.1 usually work
interaction.depth=3, # 1: additive model, 2: two-way interactions, etc.
bag.fraction = 0.5, # subsampling fraction, 0.5 is probably best
nTrain = nrow(X)*0.5, # number of rows used for training
# (the first nTrain rows; replaces train.fraction)
n.minobsinnode = 10, # minimum total weight needed in each node
keep.data=TRUE, # keep a copy of the dataset with the object
verbose=FALSE) # don't print out progress
names(gbm1)
names(gbm1fit)
# do another 100 iterations
gbm2 <- gbm.more(gbm1,100,verbose=FALSE) # stop printing detailed progress
gbm2
# do another 100 iterations
gbm2fit <- gbm.more(gbm1fit,100,verbose=FALSE) # stop printing detailed progress
gbm2fit
# add train.fraction to the gbm1fit model object
gbm1fit$train.fraction <- 0.5
# do another 100 iterations
gbm2fit <- gbm.more(gbm1fit,100,verbose=FALSE) # stop printing detailed progress
gbm2fit
Original issue reported on code.google.com by harry.southworth
on 9 Jan 2013 at 10:03
What steps will reproduce the problem?
1. run gbm with cv.folds>1 and n.cores = 1
I have a setting (Linux, but it doesn't matter) where I am running gbm with
cross validation on many data sets in parallel. In this setting, I want n.cores
set to 1.
However, when examining the running processes and the source, it seems that
"makeCluster" is still called and the code is still run in a subprocess, which
creates large overheads for larger training datasets.
My understanding of the source leads me to think that avoiding this behavior
would be fairly simple by conditioning on "n.cores==1" in just 1 or 2 places.
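A hedged sketch of that conditioning (the structure is assumed, not the actual package code): only spin up a cluster when more than one core is requested, and fall back to a plain lapply otherwise.
if (n.cores > 1) {
    cl <- parallel::makeCluster(n.cores)
    on.exit(parallel::stopCluster(cl))
    cv.results <- parallel::parLapply(cl, seq_len(cv.folds), fitFold)  # fitFold is hypothetical
} else {
    cv.results <- lapply(seq_len(cv.folds), fitFold)
}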
Please let me know what you think.
Original issue reported on code.google.com by [email protected]
on 28 May 2013 at 10:42