
nestedcv's Introduction

nestedcv


Nested cross-validation (CV) for the glmnet and caret packages. With glmnet, this includes cross-validation of the elastic net alpha parameter. A number of filter functions (t-test, Wilcoxon test, ANOVA, Pearson/Spearman correlation, random forest, ReliefF) are provided for feature selection and can be embedded within the outer loop of the nested CV. Nested CV can also be performed with the caret package, giving access to the large number of prediction methods available in caret.

Installation

Install from CRAN

install.packages("nestedcv")

Install from GitHub

devtools::install_github("myles-lewis/nestedcv")

Example

In this example using the iris dataset (multinomial, 3 classes), we fit a glmnet model, tuning both lambda and alpha with 10 x 10-fold nested CV.

library(nestedcv)
data(iris)
y <- iris$Species
x <- as.matrix(iris[, -5])

cores <- parallel::detectCores(logical = FALSE)  # detect physical cores

res <- nestcv.glmnet(y, x, family = "multinomial", cv.cores = cores)
summary(res)

Use summary() to see the full information from the nested model fitting. coef() shows the coefficients of the final fitted model. Filters can be applied by setting the filterFUN argument; options for the filter function are passed as a list via filter_options.
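As a sketch of how a filter is embedded in the outer CV loop, the following assumes nestedcv's built-in anova_filter (suitable for the 3-class iris outcome) and its nfilter option; check ?nestcv.glmnet for the filters shipped with your installed version.

```r
# Embed an ANOVA filter in the outer CV loop, keeping the top 3 features
# per outer fold (anova_filter and nfilter assumed from nestedcv's docs)
res_filt <- nestcv.glmnet(y, x, family = "multinomial",
                          filterFUN = anova_filter,
                          filter_options = list(nfilter = 3),
                          cv.cores = cores)
summary(res_filt)
coef(res_filt)  # coefficients of the final fitted model
```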

Output from the nested CV with glmnet can be plotted to show how deviance is affected by alpha and lambda.

plot_alphas(res)
plot_lambdas(res)

The tuning of lambda and alpha for each outer CV fold can be plotted. Here we inspect outer CV fold 1.

plot(res$outer_result[[1]]$cvafit)

ROC curves from left-out folds from both outer and inner CV can be plotted for binary comparisons (see vignette).

Nested CV can also be performed using the caret package framework. Here we use caret for tuning random forest using the ranger package.

res <- nestcv.train(y, x, method = "ranger", cv.cores = cores)
summary(res)

nestedcv's People

Contributors

aspiliop, elisabettasciacca, myles-lewis, rlau0


nestedcv's Issues

nestcv.glmnet with custom filter fails

Error:
Performing 2-fold outer CV, using 1 core
Error in if (any(penalty.factor == Inf)) { :
missing value where TRUE/FALSE needed

I identified the bug and resolved it by replacing the else statement in nest_filter_balance.R, line 30 with:

else {
    args <- list(y = ytrain, x = xtrain)
    args <- append(args, filter_options)
    fset <- do.call(filterFUN, args)
    filt_xtrain <- xtrain[, fset, drop = FALSE]
    filt_xtest <- xtest[, fset, drop = FALSE]

    # Match filtered feature names back to their indices in the original x
    fset_indices <- match(fset, colnames(x))
    filt_pen.factor <- penalty.factor[fset_indices]
}

I'm pretty sure the issue was that you were subsetting penalty.factor using the gene names output by the custom filter, rather than their indices.
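To illustrate the names-versus-indices point, a minimal custom filter might look like this (the variance ranking, function name and nfilter argument are purely illustrative, not part of nestedcv):

```r
# Illustrative custom filter: rank columns of x by variance and return
# the top nfilter features. Returning column *indices* (integers) keeps
# penalty.factor subsetting valid; returning colnames(x)[...] instead
# is what triggered the bug described above.
var_filter <- function(y, x, nfilter = 2, ...) {
  v <- apply(x, 2, stats::var)
  order(v, decreasing = TRUE)[seq_len(nfilter)]
}

# Quick check on a toy matrix
m <- cbind(a = c(1, 1, 1), b = c(0, 5, 10), c = c(0, 1, 2))
var_filter(NULL, m, nfilter = 2)  # → 2 3 (the two most variable columns)
```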

Cheers,

Parallelisation of nestcv.train() hangs indefinitely with xgboost on Linux/Windows

Parallelisation of the outer loops in nestcv.train() by setting cv.cores = 2 or more fails with method = "xgbTree" or method = "xgbLinear" on Linux/Windows. This appears to be due to automatic multithreading invoked by xgboost using OpenMP, which is not available on macOS. Calling nestcv.train() with cv.cores = 1 works fine, but it hangs indefinitely with cv.cores = 2 or more.

A workaround is to disable OpenMP multithreading using the following command before calling nestcv.train():

RhpcBLASctl::omp_set_num_threads(1L)

The aim is to incorporate this fix into nestcv.train() in a future release.

Related to this issue:
topepo/caret#1106

Problem with caret model xgbTree and mclapply() with tuneLength = 5

Fitting a caret model with nestcv.train() and method = "xgbTree" causes a crash in caret::train() when fitting the model if parallelisation with mclapply() is used and tuneLength is 5 or more. The cause is unknown. It can be fixed by switching to parLapply() by setting multicore_fork = FALSE.

Error message:

Error in `[.data.frame`(data, , "pred") : undefined columns selected

Reprex below:

## xgbTree tuneLength=5 mclapply bug reprex

library(nestedcv)
library(caret)
library(parallel)
library(mlbench)  # Boston housing dataset

data(BostonHousing2)
dat <- BostonHousing2
y <- dat$cmedv
x <- subset(dat, select = -c(cmedv, medv, town, chas))

# no error with tuneLength = 3 or 4
fit <- nestcv.train(y, x, method = "xgbTree",
                    tuneLength = 3, verbose = TRUE,
                    cv.cores = 8)

# produces an error in caret::train()
fit <- nestcv.train(y, x, method = "xgbTree",
                    tuneLength = 5, verbose = TRUE,
                    cv.cores = 8)

# runs fine
fit <- nestcv.train(y, x, method = "xgbTree",
                    tuneLength = 5, verbose = TRUE,
                    multicore_fork = FALSE,
                    cv.cores = 8)

The problem occurs in caret::train(..., method = "xgbTree") when wrapped in mclapply():

# single caret fit - ok
fit1 <- caret::train(x, y, method = "xgbTree", tuneLength = 5)

# source of problem is caret::train(method="xgbTree") when wrapped in mclapply()
# no error if tuneLength = 3
out <- mclapply(1:2, function(i) {
  fit1 <- caret::train(x, y, method = "xgbTree", tuneLength = 3)
  fit1
}, mc.cores = 2)

# multicore - causes an error if tuneLength = 5
out <- mclapply(1:2, function(i) {
  fit1 <- caret::train(x, y, method = "xgbTree", tuneLength = 5)
  fit1
}, mc.cores = 2)

I think this might relate to this issue in caret and multithreading:
topepo/caret#1106

For the moment the simplest workaround is to disable multicore forking by setting the multicore_fork argument to FALSE when using nestcv.train(). This switches to using parLapply() which works fine.

Breaking change(s) in new version of fastshap

Hi @myles-lewis,

I am preparing a new release of fastshap for CRAN, and it seems like it will break some functionality in your package. You can see a list of changes here. I suspect the biggest changes affecting your package are that 1) fastshap >= 0.1.0 will no longer return a tibble, but rather a matrix, and 2) the autoplot() function has been removed in favour of the much better shapviz package.

Just wanted to give you a heads up before I plan to submit in two weeks.
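For downstream users, adapting to these two changes might look roughly like the following sketch. It assumes the matrix return of fastshap >= 0.1.0, the shapviz/sv_importance API, and that `fit` and `pred_nestcv_glmnet` are the fitted model and prediction wrapper from the nestedcv examples.

```r
library(fastshap)
library(shapviz)

# fastshap >= 0.1.0 returns a matrix of SHAP values rather than a tibble
sh <- explain(fit, X = x, pred_wrapper = pred_nestcv_glmnet, nsim = 10)

# autoplot() no longer works on the result; wrap it in a shapviz object
sv <- shapviz(sh, X = x)
sv_importance(sv)  # variable importance plot via shapviz
```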

nestedcv

Run revdepcheck::revdep_details(, "nestedcv") for more info

Newly broken

  • checking examples ... ERROR
    Running examples in ‘nestedcv-Ex.R’ failed
    The error most likely occurred in:
    
    > ### Name: pred_nestcv_glmnet
    > ### Title: Prediction wrappers to use fastshap with nestedcv
    > ### Aliases: pred_nestcv_glmnet pred_nestcv_glmnet_class1
    > ###   pred_nestcv_glmnet_class2 pred_nestcv_glmnet_class3 pred_train
    > ###   pred_train_class1 pred_train_class2 pred_train_class3
    > 
    > ### ** Examples
    ...
    Error in `autoplot()`:
    ! Objects of class <explain> are not supported by autoplot.
    ℹ have you loaded the required package?
    Backtrace:
        ▆
     1. ├─ggplot2::autoplot(sh)
     2. └─ggplot2:::autoplot.default(sh)
     3.   └─cli::cli_abort(...)
     4.     └─rlang::abort(...)
    Execution halted
    
