
nestedcv's Introduction

nestedcv


Nested cross-validation (CV) for the glmnet and caret packages. With glmnet, this includes cross-validation of the elastic net alpha parameter. A number of filter functions (t-test, Wilcoxon test, ANOVA, Pearson/Spearman correlation, random forest, ReliefF) are provided for feature selection and can be embedded within the outer loop of the nested CV. Nested CV can also be performed with the caret package, giving access to the large number of prediction methods available in caret.

Installation

Install from CRAN

install.packages("nestedcv")

Install from GitHub

devtools::install_github("myles-lewis/nestedcv")

Example

In this example using the iris dataset (multinomial, 3 classes), we fit a glmnet model, tuning both lambda and alpha with 10 x 10-fold nested CV.

library(nestedcv)
data(iris)
y <- iris$Species
x <- as.matrix(iris[, -5])

cores <- parallel::detectCores(logical = FALSE)  # detect physical cores

res <- nestcv.glmnet(y, x, family = "multinomial", cv.cores = cores)
summary(res)

Use summary() to see the full information from the nested model fitting. coef() shows the coefficients of the final fitted model. Filters can be applied by setting the filterFUN argument; options for the filter function are passed as a list via filter_options.
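As a sketch of how a filter is embedded in the outer CV loop, the following assumes nestedcv's built-in anova_filter (suitable for the 3-class iris outcome) and its nfilter option; check ?nestcv.glmnet for the filters shipped with your installed version.

```r
# Embed an ANOVA filter in the outer CV loop, keeping the top 3 features
# per outer fold (anova_filter and nfilter assumed from nestedcv's docs)
res_filt <- nestcv.glmnet(y, x, family = "multinomial",
                          filterFUN = anova_filter,
                          filter_options = list(nfilter = 3),
                          cv.cores = cores)
summary(res_filt)
coef(res_filt)  # coefficients of the final fitted model
```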

Output from the nested CV with glmnet can be plotted to show how deviance is affected by alpha and lambda.

plot_alphas(res)
plot_lambdas(res)

The tuning of lambda and alpha for each outer CV fold can be plotted. Here we inspect outer CV fold 1.

plot(res$outer_result[[1]]$cvafit)

ROC curves from left-out folds from both outer and inner CV can be plotted for binary comparisons (see vignette).

Nested CV can also be performed using the caret package framework. Here we use caret for tuning random forest using the ranger package.

res <- nestcv.train(y, x, method = "ranger", cv.cores = cores)
summary(res)

nestedcv's People

Contributors

aspiliop, elisabettasciacca, myles-lewis, rlau0


nestedcv's Issues

nestcv.glmnet with custom filter fails

Error:
Performing 2-fold outer CV, using 1 core
Error in if (any(penalty.factor == Inf)) { :
missing value where TRUE/FALSE needed

I identified the bug and resolved it by replacing the else statement in nest_filter_balance.R, line 30 with:

else {
    args <- list(y = ytrain, x = xtrain)
    args <- append(args, filter_options)
    fset <- do.call(filterFUN, args)
    filt_xtrain <- xtrain[, fset, drop = FALSE]
    filt_xtest <- xtest[, fset, drop = FALSE]

    # Match filtered feature names back to their indices in the original x
    fset_indices <- match(fset, colnames(x))
    filt_pen.factor <- penalty.factor[fset_indices]
}

I'm pretty sure the issue was that you were subsetting penalty.factor using the gene names output by the custom filter, rather than their indices.
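To illustrate the names-versus-indices point, a minimal custom filter might look like this (the variance ranking, function name and nfilter argument are purely illustrative, not part of nestedcv):

```r
# Illustrative custom filter: rank columns of x by variance and return
# the top nfilter features. Returning column *indices* (integers) keeps
# penalty.factor subsetting valid; returning colnames(x)[...] instead
# is what triggered the bug described above.
var_filter <- function(y, x, nfilter = 2, ...) {
  v <- apply(x, 2, stats::var)
  order(v, decreasing = TRUE)[seq_len(nfilter)]
}

# Quick check on a toy matrix
m <- cbind(a = c(1, 1, 1), b = c(0, 5, 10), c = c(0, 1, 2))
var_filter(NULL, m, nfilter = 2)  # → 2 3 (the two most variable columns)
```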

Cheers,

Parallelisation of nestcv.train() hangs indefinitely with xgboost on Linux/Windows

Parallelisation of the outer loops in nestcv.train() by setting cv.cores = 2 or more fails with method = "xgbTree" or method = "xgbLinear" on Linux/Windows. This appears to be due to automatic multithreading invoked by xgboost using OpenMP, which is not available on macOS. Calling nestcv.train() with cv.cores = 1 works fine, but it hangs indefinitely with cv.cores = 2 or more.

A workaround is to disable OpenMP multithreading using the following command before calling nestcv.train():

RhpcBLASctl::omp_set_num_threads(1L)

The aim is to incorporate this fix into nestcv.train() in a future release.

Related to this issue:
topepo/caret#1106

Problem with caret model xgbTree and mclapply() with tuneLength = 5

Fitting a caret model with nestcv.train() and method = "xgbTree" causes a crash in caret::train() when fitting the model if parallelisation with mclapply() is used and tuneLength is 5 or more. The cause is unknown. It can be fixed by switching to parLapply() by setting multicore_fork = FALSE.

Error message:

Error in `[.data.frame`(data, , "pred") : undefined columns selected

Reprex below:

## xgbTree tuneLength=5 mclapply bug reprex

library(nestedcv)
library(caret)
library(parallel)
library(mlbench)  # Boston housing dataset

data(BostonHousing2)
dat <- BostonHousing2
y <- dat$cmedv
x <- subset(dat, select = -c(cmedv, medv, town, chas))

# no error with tuneLength = 3 or 4
fit <- nestcv.train(y, x, method = "xgbTree",
                    tuneLength = 3, verbose = TRUE,
                    cv.cores = 8)

# produces an error in caret::train()
fit <- nestcv.train(y, x, method = "xgbTree",
                    tuneLength = 5, verbose = TRUE,
                    cv.cores = 8)

# runs fine
fit <- nestcv.train(y, x, method = "xgbTree",
                    tuneLength = 5, verbose = TRUE,
                    multicore_fork = FALSE,
                    cv.cores = 8)

The problem occurs in caret::train(..., method = "xgbTree") when wrapped in mclapply():

# single caret fit - ok
fit1 <- caret::train(x, y, method = "xgbTree", tuneLength = 5)

# source of problem is caret::train(method="xgbTree") when wrapped in mclapply()
# no error if tuneLength = 3
out <- mclapply(1:2, function(i) {
  fit1 <- caret::train(x, y, method = "xgbTree", tuneLength = 3)
  fit1
}, mc.cores = 2)

# multicore - causes an error if tuneLength = 5
out <- mclapply(1:2, function(i) {
  fit1 <- caret::train(x, y, method = "xgbTree", tuneLength = 5)
  fit1
}, mc.cores = 2)

I think this might relate to this issue in caret and multithreading:
topepo/caret#1106

For the moment the simplest workaround is to disable multicore forking by setting the multicore_fork argument to FALSE when using nestcv.train(). This switches to using parLapply() which works fine.

Breaking change(s) in new version of fastshap

Hi @myles-lewis,

I am preparing a new release of fastshap for CRAN, and it seems like it will break some functionality in your package. You can see a list of changes here. I suspect the biggest changes affecting your package are that 1) fastshap >= 0.1.0 will no longer return a tibble, but rather a matrix, and 2) the autoplot() function has been removed in favour of the much better shapviz package.

Just wanted to give you a heads up before I plan to submit in two weeks.
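For downstream users, adapting to these two changes might look roughly like the following sketch. It assumes the matrix return of fastshap >= 0.1.0, the shapviz/sv_importance API, and that `fit` and `pred_nestcv_glmnet` are the fitted model and prediction wrapper from the nestedcv examples.

```r
library(fastshap)
library(shapviz)

# fastshap >= 0.1.0 returns a matrix of SHAP values rather than a tibble
sh <- explain(fit, X = x, pred_wrapper = pred_nestcv_glmnet, nsim = 10)

# autoplot() no longer works on the result; wrap it in a shapviz object
sv <- shapviz(sh, X = x)
sv_importance(sv)  # variable importance plot via shapviz
```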

nestedcv

Run revdepcheck::revdep_details(, "nestedcv") for more info

Newly broken

  • checking examples ... ERROR
    Running examples in ‘nestedcv-Ex.R’ failed
    The error most likely occurred in:
    
    > ### Name: pred_nestcv_glmnet
    > ### Title: Prediction wrappers to use fastshap with nestedcv
    > ### Aliases: pred_nestcv_glmnet pred_nestcv_glmnet_class1
    > ###   pred_nestcv_glmnet_class2 pred_nestcv_glmnet_class3 pred_train
    > ###   pred_train_class1 pred_train_class2 pred_train_class3
    > 
    > ### ** Examples
    ...
    Error in `autoplot()`:
    ! Objects of class <explain> are not supported by autoplot.
    ℹ have you loaded the required package?
    Backtrace:
        ▆
     1. ├─ggplot2::autoplot(sh)
     2. └─ggplot2:::autoplot.default(sh)
     3.   └─cli::cli_abort(...)
     4.     └─rlang::abort(...)
    Execution halted
    
