
gradientboostedmodels's People

Contributors

harrysouthworth
gradientboostedmodels's Issues

stop training when best iteration is reached

Instead of training with CV for a fixed number of trees, please consider adding 
a feature to train until the best number of iterations is reached.

What sometimes happens is that I train a large model, say n.trees=5000, and the best 
iteration is just beyond that, so I have to start over.  Or the opposite 
happens and I waste time building the first model (say, if best.iter=100 out 
of 1000).
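
A user-side workaround, sketched below as one way gbm.more() and gbm.perf() might be combined (not a built-in feature), is to grow the model in blocks and stop once the estimated best iteration is comfortably inside the trees already built:

library(gbm)

## Illustrative early-stopping loop on simulated data; OOB is used rather than
## CV so that the model can be extended with gbm.more().
set.seed(1)
N  <- 1000
x1 <- runif(N); x2 <- runif(N)
d  <- data.frame(y = x1 + 2 * x2 + rnorm(N, 0, 0.3), x1 = x1, x2 = x2)

fit <- gbm(y ~ x1 + x2, data = d, distribution = "gaussian",
           n.trees = 500, shrinkage = 0.01, bag.fraction = 0.5,
           keep.data = TRUE, verbose = FALSE)

repeat {
  best <- suppressWarnings(gbm.perf(fit, method = "OOB", plot.it = FALSE))
  if (best < 0.9 * fit$n.trees) break                       # best iteration reached
  fit <- gbm.more(fit, n.new.trees = 500, verbose = FALSE)  # otherwise keep growing
}
best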

Original issue reported on code.google.com by [email protected] on 20 Jan 2013 at 4:17

Incompatibility with 1.6

Reported by John Merrill via email.


The change from 1.6-3.2 to 2.0-8 breaks objects trained with older versions of 
GBM: 2.0-8 GBM objects include a new field, num.classes, which was not 
present in older objects, and the new versions of plot.gbm and predict.gbm don't 
check for its absence.

It's not hard to fix predict.gbm, at least -- instead of using the value in the 
object unconditionally, check for its absence and set a local variable to 1 if there's 
nothing there, or to the value in the object if there's something there.
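
A minimal sketch of that check, written as a hypothetical helper rather than the package's actual code:

## Fall back to a single class when an object fitted with gbm 1.6 lacks the field.
getNumClasses <- function(object) {
  if (is.null(object$num.classes)) 1L else object$num.classes
}
## e.g. inside predict.gbm: cNumClasses <- getNumClasses(object)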

Thanks for your work on the package, and thanks for all the new features in GBM 2.0.

Original issue reported on code.google.com by harry.southworth on 29 Jan 2013 at 9:59

Make gbm generic

The gbm function should be genericized:

gbm <- function(x, ...) UseMethod("gbm")

gbm.formula <- function(formula, data, ...) {
    ## build the model frame from the formula, then hand off
    NextMethod("gbm")   # or call gbm.default directly with x and y
}

gbm.default <- function(x, y, ...) {
    ## the current gbm.fit-style workhorse
}

Calling gbm.default directly ought then to be quicker than looping on the 
current gbm.

Also, moving more of the setup out of gbm.fit ought to make it quicker to call that 
directly, but care is needed not to break dependencies.

Original issue reported on code.google.com by harry.southworth on 29 Jan 2013 at 10:07

shrink.gbm.pred simply doesn't work

What steps will reproduce the problem?
1. Create a toy GBM solution
2. Use shrink.gbm with this solution
3. Use shrink.gbm.pred on the resulting system

What is the expected output? What do you see instead?
I'd expect to see a list of numbers.  Instead, I start with a pair of R errors 
(the wrappers for gbm_shrink_grad and gbm_shrink_pred both omit the cNumClasses 
variable).  Those are straightforward to fix -- one simply adds the expected 
parameters to the .Call items.

Then, however, the predicted values coming out of shrink.gbm.pred are always 
NaN.  I spent a while groveling in the code, and it just seems wrong to me -- 
where is the handling of multinomial classes?  Why are all node predictions set 
to R_NaN at line 753 of gbmentry.cpp?  I don't see any leaf node handling 
further down, nor any root node handling, so I can't see how the shrinkages 
could possibly propagate upwards or downwards.

What version of the product are you using? On what operating system?
GBM 2.1
Linux (Ubuntu Lucid Lynx)
R 2.15.3 and 3.0.1

Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 4 Jun 2013 at 12:27

gbm.perf() gives Error in plot.window(...) : need finite 'ylim' values

What steps will reproduce the problem?
I called gbm() for binary classification with 5-fold CV.  Normally this works, 
but sometimes it fails.  I think it has mostly to do with the data set (most 
data sets are fine) and somewhat with the GBM parameters.

What is the expected output? 
No error; just a nice graph.  If the graph cannot be made, handle the error and 
fall back as if called with plot.it=FALSE, returning the optimal number of 
trees.

What do you see instead?
Error in plot.window(...) : need finite 'ylim' values



What version of the product are you using? On what operating system?
Windows 7 64-bit
R 3.0.0 64-bit
gbm 2.0-8

Please provide any additional information below.
I stepped through gbm.perf() and I think the problem is here:

> ylim <- range(object$train.error, object$cv.error)
> ylim
[1] 0.0239321       Inf
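
A possible guard, sketched here as a hypothetical helper under the assumption that the offending values are non-finite entries in cv.error:

## Restrict the range to finite error values so plot.window() always gets a usable ylim.
finiteYlim <- function(object) {        # 'object' is a fitted gbm model
  errs <- c(object$train.error, object$cv.error)
  range(errs[is.finite(errs)])
}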


Original issue reported on code.google.com by [email protected] on 7 May 2013 at 7:04

Pairwise distribution in gbm crash under gbm.fit interface

Pairwise regression is fine with the formula interface, but when
I try the same model with gbm.fit, the Rgui crashes.

This is my naive model:

gbm1 <- gbm.fit(x = train[, c(2, 4)],
                y = train[, 1],
                n.trees = 1000,
                distribution = list(name = "pairwise", group = "query", metric = "conc"),
                interaction.depth = 2,
                n.minobsinnode = 30,
                shrinkage = 0.01,
                bag.fraction = 0.95,
                verbose = TRUE)


Is this a known issue/restriction, or is it a bug?

Thank you for your work!

Jose A.

Original issue reported on code.google.com by harry.southworth on 14 Jun 2013 at 8:39

Error checking is desired for the "n.trees" argument in predict.gbm

What steps will reproduce the problem?
1. Build a gbm object.
2. Call the predict function, passing a zero-length integer as the n.trees argument.

What is the expected output? What do you see instead?
An error message is expected, but it goes into an infinite loop and the R 
session becomes unresponsive. 


What version of the product are you using? On what operating system?
mac osx

Please provide any additional information below.
It seems like the LENGTH macro in Rinternals.h returns a garbage value for 
zero-length vectors.
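
A sketch of the kind of validation requested, as a hypothetical helper rather than the package's actual code:

checkNTrees <- function(n.trees) {
  ## reject zero-length, non-finite or non-positive inputs before the C code is reached
  if (length(n.trees) == 0 || any(!is.finite(n.trees)) || any(n.trees < 1)) {
    stop("n.trees must contain one or more positive integers")
  }
  as.integer(n.trees)
}

try(checkNTrees(integer(0)))   # errors cleanly instead of hanging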


Original issue reported on code.google.com by [email protected] on 10 Oct 2013 at 12:19

plot.gbm fails with multinomial distribution and categorical predictor

What steps will reproduce the problem?

## 1. create a categorical predictor
iris$Sepal.LengthCat <- as.factor(trunc(iris$Sepal.Length))

## 2. fit the model
gbm.model <- gbm(Sepal.Length ~ Sepal.LengthCat + Sepal.Width + Petal.Length + Petal.Width,
                 data = iris,
                 distribution = "gaussian",
#                distribution = "multinomial",
                 n.trees = 500,
                 shrinkage = 0.01,
                 interaction.depth = 5,
                 bag.fraction = 0.8,
                 train.fraction = 1,
                 n.minobsinnode = 4,
                 cv.folds = 5,
                 keep.data = TRUE,
                 n.cores = 4,
                 verbose = TRUE)
best.iter <- gbm.perf(gbm.model, method = "cv")  # assumed definition; best.iter was not shown in the report
summary(gbm.model, n.trees = best.iter)
### works fine
plot.gbm(gbm.model, i.var = 2, n.trees = 400)

### 3. fails
plot.gbm(gbm.model, i.var = 1, n.trees = 400)

What is the expected output? What do you see instead?
Expected output would be predicted probabilities for each class

What version of the product are you using? On what operating system?
gbm_2.1-0.3 with R 3.0.2 and Ubuntu 12.04

Please provide any additional information below.
Plots work on quantitative predictors. However, the values on the Y-axis are 
strange: shouldn't they be probabilities (hence in [0,1])?

Original issue reported on code.google.com by [email protected] on 25 Mar 2014 at 2:44

Improper class detection for Surv objects

What steps will reproduce the problem?
1. s <- Surv(y,ysensor)
2. class(s) <- c("someclass","Surv")
3. gbm(Surv(y,ysensor)~x,data=dat,distribution="coxph")

What is the expected output? What do you see instead?
You expect to get a model. Instead you receive the error:
Error in gbm.fit(x, y, offset = offset, distribution = distribution, w = w,  : 
  The number of rows in x does not equal the length of y.
In addition: Warning message:
In if (nrow(x) != ifelse(class(y) == "Surv", nrow(y), length(y))) { :
  the condition has length > 1 and only the first element will be used

What version of the product are you using? On what operating system?
Windows 7, R-2.15.3
'gbm' version 2.1

Please provide any additional information below.
A simple fix would be to change the class test to:
ifelse("Surv" %in% class(y), nrow(y), length(y))

Original issue reported on code.google.com by [email protected] on 21 Jun 2013 at 8:03

Error in cut.default(i, breaks) : 'breaks' are not unique

What steps will reproduce the problem?
I ran gbm() with 
max.trees: 500
interaction.depth: 3
shrinkage: 0.1
bag fraction: 1
cv folds: 5
n.cores=3
distribution="bernoulli"
train.fraction = 1
n.minobsinnode = 15

What is the expected output? 
Something like this
Iter   TrainDeviance   ValidDeviance   StepSize   Improve
     1        1.6876             nan     0.1000    0.0981
     2        1.4814             nan     0.1000    0.3318


What do you see instead?
Cross validating:  1 2 3 
Error in cut.default(i, breaks) : 'breaks' are not unique


What version of the product are you using? On what operating system?
gbm 2.0-9.5 Windows binary from
https://code.google.com/p/gradientboostedmodels/downloads/detail?name=gbm_2.0-9.5.zip&can=2&q=
R 2.15.3 64-bit
Windows 7 64-bit
4 physical CPUs
8 logical CPUs

Please provide any additional information below.
Even after adjusting some settings, I cannot reproduce this error using the gbm 
example code and data in the gbm documentation.

Original issue reported on code.google.com by [email protected] on 9 May 2013 at 2:38

gbm 2.x clashes with mclapply

What steps will reproduce the problem?
1.

Add these two lines to .Rprofile

cat(paste("This session PID is ", system("echo $PPID",intern=TRUE), ":\n",sep = 
""))
cat(paste("begun at ", as.POSIXct(date(), format = "%a %b %d %H:%M:%S %Y"), 
":\n",sep = ""))

2.
Source the attached file, which wraps the gbm function's example code in a 
function to run with mclapply using different seed numbers.

3.

What is the expected output? What do you see instead?

Using gbm-1.6-3.1, I get this as expected;

> testing(4)
  2013-08-21 20:57:31  Begin using multicore method with phony data with 4 cores.
Core 1 uses 20442 

Core 2 uses 20443 

Core 3 uses 20445 

Core 4 uses 20447 

 2013-08-21 20:57:36 
....Completed testing multicore method with invented data.
$a
   CV Test OOB
1 126  131  79

$b
   CV Test OOB
1 117  121  85

$c
   CV Test OOB
1 157  126  83

$d
   CV Test OOB
1 123  140  81


Using the current gbm package, I get this:


system.time(bbb <- testing(4))
  2013-08-21 16:18:03  Begin using multicore method with phony data with 4 cores.

Core 1 uses 22812 

Core 2 uses 22814 

Core 3 uses 22816 

Core 4 uses 22819 
This session PID is 22821:
begun at 2013-08-21 16:18:04:
This session PID is 22829:
begun at 2013-08-21 16:18:04:
This session PID is 22838:
begun at 2013-08-21 16:18:04:
This session PID is 22847:
begun at 2013-08-21 16:18:04:
 2013-08-21 16:18:07 
....Completed testing multicore method with invented data.
   user  system elapsed 
  0.460   1.760   3.926 
Warning message:
In mclapply(subsets, FUN = test.gbm, mc.cores = nc, mc.cleanup = FALSE,  :
  3 function calls resulted in an error


bbb$b
[1] "Error in socketConnection(\"localhost\", port = port, server = TRUE, 
blocking = TRUE,  : \n  cannot open the connection\n"
attr(,"class")
[1] "try-error"
attr(,"condition")
<simpleError in socketConnection("localhost", port = port, server = TRUE, 
blocking = TRUE,     open = "a+b", timeout = timeout): cannot open the 
connection>
> 

bbb$a works properly and the errors on bbb$c and bbb$d are identical
to the above.

Notice that, as well as starting a new R process for each parallel process, another 
one is immediately started within each of them.  That prevents mclapply from 
collecting more than the first element of subsets.



What version of the product are you using? On what operating system?

To work, it needs gbm-1.6-3.1; otherwise gbm-2.x gives the result above.  The 
old version is on a Fedora 15 (32-bit) installation, the newer one on a Kubuntu 
10.04 (64-bit) installation.



Please provide any additional information below.

When more challenging functions are tried, instead of just one extra R process 
being started, one for each core on the machine (irrespective of how many are 
asked for) is started.

Setting mc.cores to 1 will work (i.e. return the complete list), but it still 
starts those extra R processes and takes twice as long as on identical 
hardware running Windows 7.




Original issue reported on code.google.com by [email protected] on 23 Aug 2013 at 9:03

Attachments:

cv.error must be wrong

What steps will reproduce the problem?
1. Fit model using cv.folds > 1
2. Install older version from CRAN and refit
3. Look at gbm.perf

What is the expected output? What do you see instead?
The plots should be similar

I think cv.error is being wrongly scaled. If so, the 'best' number of trees is 
still being used.


Original issue reported on code.google.com by harry.southworth on 1 Feb 2013 at 7:19

NDCG plot

I plotted an NDCG versus iteration figure. Why does NDCG decrease as the 
iterations increase? We want to maximize NDCG. 
Thanks

Original issue reported on code.google.com by [email protected] on 3 Jun 2013 at 4:20

Patch for /gbm/R/gbmCrossVal.R

Removed X= from the X=1:cv.folds in the parLapply function call.  Here is the 
function definition for parLapply:
parLapply <- function(cl, x, fun, ...)
    docall(c, clusterApply(cl, splitList(x, length(cl)), lapply, fun, ...))

I think there was confusion between "X" and 'x'.  Since we already pass 'x' 
through to gbm.fit, I felt it was best to delete the 'X' assignment and let 
positional matching bind the fold indices to the 'x' argument of parLapply.
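
A minimal, self-contained illustration of the patch's intent (not gbmCrossVal.R itself): the fold indices are passed positionally, so they bind to parLapply's vector argument whether that formal is spelled 'x' or 'X' in the installed version of R.

library(parallel)

cl <- makeCluster(2)
folds <- parLapply(cl, 1:5, function(i) paste("fitting fold", i))  # positional, always safe
stopCluster(cl)
str(folds)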

Original issue reported on code.google.com by [email protected] on 29 May 2013 at 6:45

Attachments:

pretty.gbm.tree bug

<[email protected]>
"When I call pretty.gbm.tree(...) on gbm object with recent version, the first 
row of the table shows wrong weight and error reduction. For example, if my 
data consists of 1000 elements, pretty.gbm.tree(...) shows 999 in the "Weight" 
column in the first row."


Mindaugas also provided a potential fix that I modified.

I'm now waiting for a response.

Original issue reported on code.google.com by harry.southworth on 8 Apr 2013 at 4:12

Memory leak with laplace distribution

What steps will reproduce the problem?
1. Have a big training set
2. Train a gbm model with 2k trees
3. remove the previous model from memory using rm and train again

What is the expected output? What do you see instead?
The memory from the previous training should be released; instead it is not.

What version of the product are you using? On what operating system?
gbm 2.1-0.2 on ubuntu 12.04

Please provide any additional information below.
In the GBMRESULT CLaplace::InitF method, it seems that the adArr variable is 
never released.

Original issue reported on code.google.com by [email protected] on 15 Sep 2013 at 10:13

predict.gbm error when cv.folds>1

What steps will reproduce the problem?
1. Run gbm with cv.folds>1 using a dataset with factor variables but restrict 
the model formula to a subset of the dataset variables leaving out the factor 
variables   
2.
3.

What is the expected output? What do you see instead?

gbm should run without a problem; instead it gives "Error in 
object$var.levels[[i]] : subscript out of bounds"

What version of the product are you using? On what operating system?
gbm 2.1 
Windows 7 64bit

Please provide any additional information below.
The attached code shows the error using the sample code from the package.

It appears the error is coming from predict.gbm, from the step given below.  Please 
also note that the index i below for a factor variable in dataset x does not 
correspond to object$var.levels[[i]], since object$var.levels is limited to the 
model variables.  In the attached code, for instance, length(object$var.levels) = 4 
whereas cCols = 6.



for (i in 1:cCols) {
    if (is.factor(x[, i])) {
        if (length(levels(x[, i])) > length(object$var.levels[[i]])) {
            new.compare <- levels(x[, i])[1:length(object$var.levels[[i]])]
        }
        else {
            new.compare <- levels(x[, i])
        }
        if (!identical(object$var.levels[[i]], new.compare)) {
            x[, i] <- factor(x[, i], union(object$var.levels[[i]], 
                levels(x[, i])))
        }
        x[, i] <- as.numeric(x[, i]) - 1
    }
} 


Original issue reported on code.google.com by [email protected] on 26 Jun 2013 at 6:18

Attachments:

gbm 2.1 reliably crashes R

What steps will reproduce the problem?
1. Run the attached script

What is the expected output? What do you see instead?
Instead of fitting a model, the Rgui window suddenly closes without any error 
message.


What version of the product are you using? On what operating system?
R 2.15.3 and R 3.0.0
GBM 2.1

Please provide any additional information below.
I think this has to do with CV and n.cores=1

Original issue reported on code.google.com by [email protected] on 18 May 2013 at 3:42

Attachments:

Add CV fitted values to output object

What steps will reproduce the problem?
1. Print a model fit to a classification problem: dist='multinomial', 
'bernoulli', 'huberized'.

The confusion matrix is obtained using the optimal number of trees as decided 
by cross-validation or whatever method was selected. However, think about the 
cross-validation deviance and training set deviance plotted by gbm.perf. The 
confusion matrix is for predictions from a grossly overfit model, even though 
it uses the number of trees selected by cross-validation.

Need to either just get rid of the confusion matrix calculations, or get the 
cross-validation predictions somewhere within the cross-validation loop.


Original issue reported on code.google.com by harry.southworth on 9 Jan 2013 at 10:07

if the first observation's offset = 0, then the offset will be ignored.

What steps will reproduce the problem?
1. Create a dataset where the 1st observation's offset = 0.
2. Build two gbm models, one with the offset and the other without.
3. The results are identical.
4. If you reshuffle the data so that the 1st observation's offset != 0, the two 
models will be different.

What is the expected output? What do you see instead?
I expect the models with and without the offset to be different; instead they are 
the same.

What version of the product are you using? On what operating system?
gbm 1.6-3.1 

Please provide any additional information below.

The problem is this line in gbm.fit:

if (is.null(offset) || (offset == 0))
{
  offset <- NA
}

Here offset is a vector, so offset == 0 evaluates to a vector of TRUE and FALSE 
values. When the condition inside an if-clause has length > 1, only the first 
element is used. So if offset[1] == 0, the condition is TRUE, and 
offset <- NA.

I am guessing Greg Ridgeway’s intent was all(offset == 0), not (offset == 0).
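
A sketch of the suggested correction, written as a hypothetical helper rather than gbm.fit's actual code:

## Treat the offset as absent only when it is NULL or identically zero.
normalizeOffset <- function(offset) {
  if (is.null(offset) || all(offset == 0)) NA else offset
}

normalizeOffset(c(0, 0, 0))   # NA: genuinely no offset
normalizeOffset(c(0, 1, 2))   # kept, even though the first element is 0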



Original issue reported on code.google.com by [email protected] on 25 Jan 2013 at 6:35

support parallel processing

What steps will reproduce the problem?
1. Train a large data set with 5 CV on a machine with 2+ CPUs and adequate

What is the expected output? What do you see instead?
I expect all the CPUs to be used for CV, but instead it is slow.  Some of my 
models take many hours, and after trying different interaction levels and 
maximum numbers of trees, it can take days.

What version of the product are you using? On what operating system?
gbm 1.6 (and probably 2.0)

Please provide any additional information below.
Please support %dopar%, like the caret package does.
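
For reference, the 2.x series exposes an n.cores argument used for the cross-validation folds (it appears in several other reports on this page); a minimal sketch with simulated bernoulli data:

library(gbm)

set.seed(1)
N  <- 2000
x1 <- runif(N); x2 <- runif(N)
d  <- data.frame(y = rbinom(N, 1, plogis(2 * x1 - x2)), x1 = x1, x2 = x2)

fit <- gbm(y ~ x1 + x2, data = d, distribution = "bernoulli",
           n.trees = 1000, shrinkage = 0.01, interaction.depth = 2,
           cv.folds = 5, n.cores = 2, verbose = FALSE)   # folds run on 2 cores
best.iter <- gbm.perf(fit, method = "cv", plot.it = FALSE)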

Original issue reported on code.google.com by [email protected] on 20 Jan 2013 at 4:13

gbm.fit with multinomial outcomes

gbm() works great with multinomial outcomes, but gbm.fit() does not. Since 
gbm() requires a formula, it is less efficient, and using gbm.fit() would be 
preferable for me (it works fine for the other distributions).


> test <- gbm.fit(iris[, 1:4], iris$Species, distribution = "multinomial")
Iter   TrainDeviance   ValidDeviance   StepSize   Improve
     1           nan             nan     0.0010       nan
     2           nan             nan     0.0010       nan
     3           nan             nan     0.0010       nan
     4           nan             nan     0.0010       nan
     5           nan             nan     0.0010       nan
     6           nan             nan     0.0010       nan
     7           nan             nan     0.0010       nan
     8           nan             nan     0.0010       nan
     9           nan             nan     0.0010       nan
    10           nan             nan     0.0010       nan
    20           nan             nan     0.0010       nan
    40           nan             nan     0.0010       nan
    60           nan             nan     0.0010       nan
    80           nan             nan     0.0010       nan
   100           nan             nan     0.0010       nan

> test
NULL
A gradient boosted model with multinomial loss function.
100 iterations were performed.
There were 4 predictors of which 0 had non-zero influence.
> predict(test, head(iris[, 1:4]), n.trees = 50)
, , 50

     setosa versicolor virginica
[1,]     NA         NA        NA
[2,]     NA         NA        NA
[3,]     NA         NA        NA
[4,]     NA         NA        NA
[5,]     NA         NA        NA
[6,]     NA         NA        NA

> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  splines   stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] gbm_2.0-9.3      lattice_0.20-10  survival_2.36-14

loaded via a namespace (and not attached):
[1] grid_2.15.2

Original issue reported on code.google.com by [email protected] on 6 Feb 2013 at 2:41

inconsistent predictions when 0 predictors had non-zero influence

There are two issues:

1) setting the seed does not ensure reproducibility of the model
2) when no predictors are used in any splits, the predicted values are somewhat 
inconsistent; sometimes NA values are produced.

> library(gbm)
> library(caret)
> data(mdrr)
> 
> set.seed(1)
> gbm1 <- gbm.fit(mdrrDescr[, 1:20], ifelse(mdrrClass == "Active", 1, 0),
+                 distribution = "bernoulli")
Iter   TrainDeviance   ValidDeviance   StepSize   Improve
     1           inf             nan     0.0010       nan
     2           inf             nan     0.0010       nan
     3           inf             nan     0.0010       nan
     4           inf             nan     0.0010       nan
     5           inf             nan     0.0010       nan
     6           inf             nan     0.0010       nan
     7           inf             nan     0.0010       nan
     8           inf             nan     0.0010       nan
     9           inf             nan     0.0010       nan
    10           inf             nan     0.0010       nan
    20           inf             nan     0.0010       nan
    40           inf             nan     0.0010       nan
    60           inf             nan     0.0010       nan
    80           inf             nan     0.0010       nan
   100           inf             nan     0.0010       nan

> gbm1
NULL
A gradient boosted model with bernoulli loss function.
100 iterations were performed.
There were 20 predictors of which 0 had non-zero influence.
> 
> predict(gbm1, head(mdrrDescr), n.trees = 100, type = "response")
[1] 0.485376 0.485376 0.485376 0.485376 0.485376 0.485376
> set.seed(1)
> gbm1 <- gbm.fit(mdrrDescr[, 1:20], ifelse(mdrrClass == "Active", 1, 0),
+                 distribution = "bernoulli")
Iter   TrainDeviance   ValidDeviance   StepSize   Improve
     1           nan             nan     0.0010       nan
     2           nan             nan     0.0010       nan
     3           nan             nan     0.0010       nan
     4           nan             nan     0.0010       nan
     5           nan             nan     0.0010       nan
     6           nan             nan     0.0010       nan
     7           nan             nan     0.0010       nan
     8           nan             nan     0.0010       nan
     9           nan             nan     0.0010       nan
    10           nan             nan     0.0010       nan
    20           nan             nan     0.0010       nan
    40           nan             nan     0.0010       nan
    60           nan             nan     0.0010       nan
    80           nan             nan     0.0010       nan
   100           nan             nan     0.0010       nan

> gbm1
NULL
A gradient boosted model with bernoulli loss function.
100 iterations were performed.
There were 20 predictors of which 0 had non-zero influence.
> 
> predict(gbm1, head(mdrrDescr), n.trees = 100, type = "response")
[1] NaN NaN NaN NaN NaN NaN

It looks like older versions would produce a non-NA value for all samples (as 
in the top example).

I'm also not sure why no splits would occur in the model. This seems to be 
occurring with a higher frequency than before.

Thanks,

Max

> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  splines   stats     graphics  grDevices utils    
[7] datasets  methods   base     

other attached packages:
[1] caret_5.15-60    reshape2_1.2.1   plyr_1.8        
[4] foreach_1.4.0    cluster_1.14.3   gbm_2.0-9.3     
[7] lattice_0.20-10  survival_2.36-14

loaded via a namespace (and not attached):
[1] codetools_0.2-8 grid_2.15.2     iterators_1.0.6 stringr_0.6.1  
[5] tools_2.15.2  

Original issue reported on code.google.com by [email protected] on 6 Feb 2013 at 3:54

Error in x[i] : object of type 'closure' is not subsettable

What steps will reproduce the problem?
I'm not yet sure how to reproduce this with a simple, public data set, but the error 
has something to do with cross validation.

What is the expected output? What do you see instead?
Very quickly after starting GBM (before it does any training), I get this error:

Error in x[i] : object of type 'closure' is not subsettable


What version of the product are you using? On what operating system?
GBM 2.1
R 2.15.3

Please provide any additional information below.
I will attach the traceback, which includes the gbm() invocation.

If you accept this bug, I am glad to keep trying to get the error to 
reproduce on your system.

As a workaround, is an old version of GBM available?  GBM 1.6 and GBM 2.0-8 
seemed OK, but the recent CRAN update is causing me problems.

Original issue reported on code.google.com by [email protected] on 18 May 2013 at 4:15

Attachments:

shrink.gbm error

shrGBM <- shrink.gbm(GBM_model, 10)
Error in shrink.gbm(GBM_model, 10) : 
  type 11 is unimplemented in 'type2char'

Reported by email, 2013-05-30 (or 31)

Original issue reported on code.google.com by harry.southworth on 31 May 2013 at 7:17

add momentum option (like with neural networks)

Instead of a constant shrinkage (learning rate), how about one that starts out 
high (say, 0.1) and gradually decreases to a small value (say, 0.001)? In 
neural networks this is called momentum.  When used properly it can speed up 
learning while still allowing small adjustments at the end.  I assume momentum 
would help with gradient boosting, and for GBM, it could create smaller, 
simpler models (fewer trees).

Original issue reported on code.google.com by [email protected] on 2 Apr 2013 at 3:49

bug in multinomial model predictions

> set.seed(1)
> test1 <- gbm(Species ~ ., data = iris, n.trees = 1000, verbose = FALSE)
Distribution not specified, assuming multinomial ...
>
> predict(test1, head(iris), n.trees = 1000, type = "response")
        setosa versicolor  virginica
[1,] 0.8351586  0.1132224 0.05161896
[2,] 0.8327497  0.1157803 0.05147007
[3,] 0.8351586  0.1132224 0.05161896
[4,] 0.8344369  0.1139887 0.05157435
[5,] 0.8351586  0.1132224 0.05161896
[6,] 0.8348343  0.1135668 0.05159892
> predict(test1, head(iris), n.trees = c(500, 1000), type = "response")
            500       1000
 [1,] 0.6692731 0.11322242
 [2,] 0.6663942 0.11578028
 [3,] 0.6692731 0.11322242
 [4,] 0.6685032 0.11398871
 [5,] 0.6692731 0.11322242
 [6,] 0.6692731 0.11356676
 [7,] 0.8351586 0.13365581
 [8,] 0.8327497 0.13308088
 [9,] 0.8351586 0.13365581
[10,] 0.8344369 0.13350206
[11,] 0.8351586 0.13365581
[12,] 0.8348343 0.13365581
[13,] 0.1970711 0.05161896
[14,] 0.2005250 0.05147007
[15,] 0.1970711 0.05161896
[16,] 0.1979947 0.05157435
[17,] 0.1970711 0.05161896
[18,] 0.1970711 0.05159892



Looking at predict.gbm, the object predF is converted to an array of
matrices, but then:

  if (length(n.trees) > 1) {
      ## the problem is here:
      predF <- matrix(predF, ncol = length(n.trees), byrow = FALSE)
      colnames(predF) <- n.trees
      predF[, i.ntree.order] <- predF
  }


This worked with bernoulli since the results for the event were returned
(and had one dimension). It would be helpful to keep the array structure.

Thanks,

Max

> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] splines   stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] gbm_2.0-8        lattice_0.20-10  survival_2.36-14

loaded via a namespace (and not attached):
[1] grid_2.15.2

Original issue reported on code.google.com by harry.southworth on 5 Feb 2013 at 5:13

Better control over verbosity

What steps will reproduce the problem?
1. Run a model with cv.folds > 1

If verbose = TRUE (the default), an awful lot of updating gets spewed to the 
screen. This is a hangover from when the computations ran much more slowly. I 
think verbose should have options TRUE (the current behaviour), FALSE (no printing 
to the screen) and CV (print to the screen when starting a new CV fold), with the 
default being CV.



Original issue reported on code.google.com by harry.southworth on 25 Jan 2013 at 2:28

object 'p' not found when cv.folds=1

What steps will reproduce the problem?
1. Run the attached script with cv.folds=1

What is the expected output? What do you see instead?
Error in gbm(mydata[TRUE, "survived"] ~ ., data = mydata2, distribution = 
"bernoulli",  : 
  object 'p' not found

There should be a sanity check like stopifnot(cv.folds>=2 |cv.folds==0)


What version of the product are you using? On what operating system?
R 2.15.3 and R 3.0.0
gbm 2.1

Please provide any additional information below.
Minor issue

Original issue reported on code.google.com by [email protected] on 18 May 2013 at 4:03

Attachments:

gbm gives wrong prediction on a missing level.

What steps will reproduce the problem?
1. Build a gbm model with a factor variable, but with a level missing.
2. use the gbm model to predict an outcome with the missing level.
3. the prediction comes out wrong

What is the expected output? What do you see instead?
I expect the prediction for the missing level to come from the missing node, 
but it picks up a value from another node instead.

What version of the product are you using? On what operating system?
2.0-8

Please provide any additional information below.

Please refer to the script after ****. Please take a look at the output of 
prettyTree.

"c" is a missing level.
I expect when "c" is supplied at prediction, gbm will go to node #6, but it 
seems to have gone to node #5 instead.
Also, in the old version of gbm, weight at node #1 is 300. In the current 
version, the weight is 299 (300-1). Could you explain?


********************
y1 = rep(1, 100)
y2 = rep(3, 100)
y3 = rep(11, 100)
y = c(y1, y2, y3)

x1 = rep("a", 100)
x2 = rep("b", 100)
x3 = rep("d", 100)
x = c(x1, x2, x3)

x = factor(x, c("a", "b", "c", "d"))

df = data.frame(y,x)

gbm1 = gbm(y~x, data=df, train.fraction=1, n.trees=1, interaction.depth=2,
           shrinkage=1, cv.folds=0, bag.fraction=1, distribution="gaussian")

pretty.gbm.tree(gbm1, i.tree=1)
gbm1$c.splits
gbm1$initF

df1 = data.frame(x=c("a", "b", "c", "d"))

p = predict(gbm1, df1, n.trees=1, type="response")
p



Original issue reported on code.google.com by [email protected] on 26 Jan 2013 at 6:23

Attachments:

predict.gbm fails if n.trees not set

What steps will reproduce the problem?
1.  run a gbm model with cv.folds=0
2.  call predict.gbm(model)

The error you will see is from this line in the source for predict.gbm:

    cat(paste("Using", n.trees, "trees...\n"))

and is caused by this line immediately above:

   best <- length(object$train.error)

The variable 'best' is never referenced again. I believe this is a simple bug: 
n.trees should be set to the length of object$train.error when n.trees is not 
specified in the arguments.

This bug arises in version 2.1 of gbm.
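
A sketch of the suggested default, as a hypothetical helper rather than the package's actual code:

## Default n.trees to the number of trees actually fitted.
defaultNTrees <- function(object, n.trees = NULL) {
  if (is.null(n.trees)) length(object$train.error) else n.trees
}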

Original issue reported on code.google.com by [email protected] on 29 Jan 2015 at 8:30

Error in interact.gbm with multinomial family

It seems that interact.gbm does not work for family=multinomial

> data(iris)

> set.seed(10)

> f<-gbm(Species~.,data=iris,n.trees=1000,interaction.depth=2)

Distribution not specified, assuming multinomial ...

> interact.gbm(f,data=iris,i.var=c(1,2))

Error in weighted.mean.default(f, n) : 
  'x' and 'w' must have the same length


Works fine with the bernoulli family

> set.seed(10)
> f1<-gbm(I(Species=="setosa")~.,data=iris,n.trees=1000,interaction.depth=2)

Distribution not specified, assuming bernoulli ...

> interact.gbm(f1,data=iris,i.var=c(1,2))

[1] 0.8861943


> sessionInfo()
R version 2.15.3 (2013-03-01)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=sv_SE.UTF-8       LC_NUMERIC=C               LC_TIME=sv_SE.UTF-8        LC_COLLATE=sv_SE.UTF-8    
 [5] LC_MONETARY=sv_SE.UTF-8    LC_MESSAGES=sv_SE.UTF-8    LC_PAPER=C                 LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=sv_SE.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  splines   stats     graphics  grDevices utils     datasets  
methods   base     

other attached packages:
[1] Hmisc_3.10-1     gbm_2.0-9.5      lattice_0.20-6   survival_2.36-14

loaded via a namespace (and not attached):
[1] cluster_1.14.3 grid_2.15.3    tools_2.15.3  


Original issue reported on code.google.com by [email protected] on 25 Mar 2013 at 4:46

ModelMap fails CRAN checks

The issue is to do with gbm.fit not returning train.fraction. A related issue has 
been described by Elisabeth Freeman:

Thank you for the windows binary.

I have started testing ModelMap with the new gbm package, and I think at least 
some of the issues are related to the change in function argument names in 
gbm.fit() from ‘train.fraction’ to ‘nTrain’. ModelMap not only used the 
old argument name (which just generates a warning), but for prediction it also 
extracted the value of ‘train.fraction’ from the model object by name, and 
the new model objects created by gbm.fit() do not have a component of that 
name, resulting in an error.

I can update ModelMap to use nTrain, but  I also ran into an issue with the 
gbm.more() function. This function seems to still require that the model object 
have a component named ‘train.fraction’. For example, in these lines of 
code from the gbm.more function:

          num.groups.train <- max(1, round(object$train.fraction *nlevels(group)))

Model objects created by the new gbm() still have a train.fraction component, 
but objects created by the new gbm.fit() only have the ‘nTrain’ component. 
Since ModelMap is often used on large data sets, it uses the gbm.fit() function 
for model building. When I use gbm.more() on these models, I get the following 
result:

model.obj
gbm.more(object = SGB, n.new.trees = 100)
A gradient boosted model with gaussian loss function.
1200 iterations were performed.
Error in if (x$train.fraction < 1) { : argument is of length zero

Here is some sample code adapted from the gbm help files that shows the issue I 
am running into with using the gbm.more function on models fitted with 
gbm.fit():

################################################################################

N <- 1000
X1 <- runif(N)
X2 <- 2*runif(N)
X3 <- ordered(sample(letters[1:4],N,replace=TRUE),levels=letters[4:1])
X4 <- factor(sample(letters[1:6],N,replace=TRUE))
X5 <- factor(sample(letters[1:3],N,replace=TRUE))
X6 <- 3*runif(N)
mu <- c(-1,0,1,2)[as.numeric(X3)]

SNR <- 10 # signal-to-noise ratio
Y <- X1**1.5 + 2 * (X2**.5) + mu
sigma <- sqrt(var(Y)/SNR)
Y <- Y + rnorm(N,0,sigma)

# introduce some missing values
X1[sample(1:N,size=500)] <- NA
X4[sample(1:N,size=300)] <- NA

X<-data.frame(X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)

data <- data.frame(Y=Y,X1=X1,X2=X2,X3=X3,X4=X4,X5=X5,X6=X6)

# fit initial model with gbm function
gbm1 <-
gbm(Y~X1+X2+X3+X4+X5+X6,         # formula
    data=data,                   # dataset

    var.monotone=c(0,0,0,0,0,0), # -1: monotone decrease,
                                 # +1: monotone increase,
                                 #  0: no monotone restrictions
    distribution="gaussian",     # see the help for other choices
    n.trees=1000,                # number of trees
    shrinkage=0.05,              # shrinkage or learning rate,
                                 #0.001 to 0.1 usually work
    interaction.depth=3,         # 1: additive model, 2: two-way interactions, etc.
    bag.fraction = 0.5,          # subsampling fraction, 0.5 is probably best
    train.fraction = 0.5,        # fraction of data for training,
                                 # first train.fraction*N used for training
    n.minobsinnode = 10,         # minimum total weight needed in each node
    keep.data=TRUE,              # keep a copy of the dataset with the object
    verbose=FALSE)               # don't print out progress


# fit initial model with gbm.fit function
gbm1fit <-
gbm.fit(x=X,y=Y,
    var.monotone=c(0,0,0,0,0,0), # -1: monotone decrease,
                                 # +1: monotone increase,
                                 #  0: no monotone restrictions
    distribution="gaussian",     # see the help for other choices
    n.trees=1000,                # number of trees
    shrinkage=0.05,              # shrinkage or learning rate,
                                 #0.001 to 0.1 usually work
    interaction.depth=3,         # 1: additive model, 2: two-way interactions, etc.
    bag.fraction = 0.5,          # subsampling fraction, 0.5 is probably best
    nTrain = nrow(X)*0.5,                # fraction of data for training,
                                 # first train.fraction*N used for training
    n.minobsinnode = 10,         # minimum total weight needed in each node
    keep.data=TRUE,              # keep a copy of the dataset with the object
    verbose=FALSE)               # don't print out progress

names(gbm1)
names(gbm1fit)

# do another 100 iterations
gbm2 <- gbm.more(gbm1,100,verbose=FALSE) # stop printing detailed progress
gbm2

# do another 100 iterations
gbm2fit <- gbm.more(gbm1fit,100,verbose=FALSE) # stop printing detailed progress
gbm2fit

# add train.fraction to the gbm1fit model object
gbm1fit<-gbm1fit
gbm1fit$train.fraction<-0.5
# do another 100 iterations
gbm2fit <- gbm.more(gbm1fit,100,verbose=FALSE) # stop printing detailed progress
gbm2fit

Original issue reported on code.google.com by harry.southworth on 9 Jan 2013 at 10:03

n.cores = 1 should not create a subprocess in CV

What steps will reproduce the problem?
1. run gbm with cv.folds>1 and n.cores = 1

I have a setting (linux, but it doesn't matter) where I am running gbm with 
cross validation on many data sets in parallel. In this setting, I want n.cores 
set to 1.

However, when examining the running processes and the source, it seems that 
"makeCluster" is still called and the code is still run in a subprocess, which 
creates large overheads for larger training datasets.

My understanding of the source leads me to think that avoiding this behavior 
would be fairly simple by conditioning on "n.cores==1" in just 1 or 2 places.

Please let me know what you think. 
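
A sketch of the requested behaviour, using a hypothetical runFolds() helper and fit.fold function rather than gbm's actual internals:

runFolds <- function(fit.fold, cv.folds, n.cores) {
  if (n.cores == 1) {
    lapply(seq_len(cv.folds), fit.fold)            # stay in the current process
  } else {
    cl <- parallel::makeCluster(n.cores)
    on.exit(parallel::stopCluster(cl))
    parallel::parLapply(cl, seq_len(cv.folds), fit.fold)
  }
}

runFolds(function(i) sqrt(i), cv.folds = 5, n.cores = 1)   # no cluster is started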

Original issue reported on code.google.com by [email protected] on 28 May 2013 at 10:42
