
punisheR's Introduction

PunisheR


punisheR is a package for feature and model selection in R. Specifically, this package implements tools for forward and backward model selection. To measure model quality during the selection procedures, we have also implemented the Akaike and Bayesian Information Criteria (see below), both of which punish complex models -- hence this package's name.

As examined below, we recognize that well-designed versions of these tools already exist in R. This is acceptable to us because the impetus for this project is primarily pedagogical: to improve our understanding of model selection techniques and of collaborative software development.

Installation

devtools::install_github("UBC-MDS/punisheR")

If you would like comprehensive documentation of punisheR, we recommend setting build_vignettes = TRUE when you install the package.
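For example (assuming a recent version of devtools):

devtools::install_github("UBC-MDS/punisheR", build_vignettes = TRUE)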

Functions included:

punisheR has two stepwise feature selection techniques:

  • forward(): a feature selection method in which you start with a null model and iteratively add useful features
  • backward(): a feature selection method in which you start with a full model and iteratively remove the least useful feature at each step

This package also has three metrics to evaluate model performance:

  • aic(): the Akaike Information Criterion
  • bic(): the Bayesian Information Criterion
  • r_squared(): the coefficient of determination

These three criteria are used to measure the relative quality of models within forward() and backward(). In general, adding features to a model improves its fit to the training data but also makes it more prone to overfitting. AIC and BIC therefore add a penalty for the number of features in a model, with BIC imposing the larger penalty. The lower the AIC or BIC score, the better the model.
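As a rough illustration of how the penalties behave (a sketch using base R's AIC() and BIC() on the built-in mtcars data, not punisheR's own scoring functions):

small <- lm(mpg ~ wt, data = mtcars)
large <- lm(mpg ~ wt + cyl + disp + hp + drat + qsec, data = mtcars)

# AIC = 2*k - 2*logLik and BIC = log(n)*k - 2*logLik, so BIC penalises each
# extra parameter more heavily than AIC whenever n >= 8 (i.e., log(n) > 2).
AIC(small); AIC(large)
BIC(small); BIC(large)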

How does the package fit into the existing R ecosystem?

In the R ecosystem, forward and backward selection is implemented in both the olsrr and MASS packages. The former provides ols_step_forward() and ols_step_backward() for forward and backward stepwise selection, respectively; both use p-values as the metric for feature selection. The latter, MASS, contains stepAIC(), which supports three modes: forward, backward, or both. The selection procedure it uses is based on an information criterion (AIC), as we intend ours to be. Other packages that provide subset selection for regression models are leaps and bestglm.

In punisheR, users can select between metrics such as aic, bic and r-squared for both forward and backward selection. The number of features returned by these selection algorithms can be controlled either with n_features or with min_change, which specifies the minimum change in the criterion score required for an additional feature to be selected.

Usage examples

Load data

library(punisheR)

# mtcars_data() returns the demo data as a list: X_train, y_train, X_val, y_val
data <- mtcars_data()
X_train <- data[[1]]
y_train <- data[[2]]
X_val <- data[[3]]
y_val <- data[[4]]

Forward selection

forward(X_train, y_train, X_val, y_val, min_change=0.5,
    n_features=NULL, criterion='r-squared', verbose=FALSE)
    
#> [1] 10

When forward selection is run on the demo data, it returns the features of the best model as a vector of column indices. In this example, we use r-squared to determine the "best" model; here the function correctly returns a single feature (column 10).

Backward selection

backward(X_train, y_train, X_val, y_val,
    n_features=1, min_change=NULL, criterion='r-squared',
    verbose=FALSE)
    
#> [1] 10

When backward selection is run on the demo data, it likewise returns the features of the best model as a vector of column indices; starting from the full model, it correctly narrows the selection down to a single feature (column 10).
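The same searches can be driven by AIC or BIC instead of r-squared (a sketch reusing the demo data loaded above; output not shown):

# Keep adding features while each one improves the AIC score by at least 0.5
forward(X_train, y_train, X_val, y_val, min_change=0.5,
    n_features=NULL, criterion='aic', verbose=FALSE)

# Prune the full model down to the best three features, judged by BIC
backward(X_train, y_train, X_val, y_val,
    n_features=3, min_change=NULL, criterion='bic', verbose=FALSE)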

Scoring a model with AIC, BIC, and r-squared

model <- lm(y_train ~ mpg + cyl + disp, data = X_train)

aic(model)
#> [1] 252.6288

bic(model)
#> [1] 258.5191

When scoring the model with AIC and BIC, we can see that BIC applies the larger penalty, which is why its score is higher than the AIC score.

r_squared(model, X_val, y_val)
#> [1] 0.7838625

The value returned by r_squared() will typically fall between 0 and 1, with values closer to 1 indicating a better fit on the validation data.
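For reference, this score corresponds to the usual coefficient of determination, which can be reproduced with base R (a sketch; it assumes X_val can be coerced to a data frame for predict() and that the package uses the validation mean in the denominator):

preds <- predict(model, newdata = as.data.frame(X_val))
1 - sum((y_val - preds)^2) / sum((y_val - mean(y_val))^2)
# should agree with r_squared(model, X_val, y_val) under those assumptions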

Vignette

For a more comprehensive guide to punisheR, you can read the package vignette (an HTML version is also available).

Contributors:

Instructions and guidelines on how to contribute can be found in CONTRIBUTING.md. To contribute to this project, you must adhere to the terms outlined in our Contributor Code of Conduct (CONDUCT.md).

punisheR's People

Contributors

avinashkz, tariqahassan, topspinj


punisheR's Issues

notes to address from running devtools::check()

  1. You should properly document these using Roxygen if you think users of your package might ever use them. If not, I might add "helper" to the filename and/or function names to make it clear users will not be using them:
checking for missing documentation entries ... WARNING
Undocumented code objects:
  ‘fit_and_score’ ‘fitter’ ‘forward_break_criteria’ ‘input_checks’
  ‘input_data_checks’ ‘parse_n_features’
All user-level objects in a package should have documentation entries.
See chapter ‘Writing R documentation files’ in the ‘Writing R
Extensions’ manual.
  2. You can ignore the note below:
checking top-level files ... NOTE
Non-standard files/directories found at top level:
  ‘CONDUCT.md’ ‘CONTRIBUTING.md’
  3. Take a look at these and fix any you think might be harmful:
checking R code for possible problems ... NOTE
fit_and_score: warning in fitter(X = X_train_to_use, y = y_train):
  partial argument match of 'X' to 'X0'
fit_and_score: warning in fitter(X = X_train_to_use, y = y_train):
  partial argument match of 'y' to 'y0'
r_squared: no visible global function definition for ‘predict’
Undefined global functions or variables:
  predict
Consider adding
  importFrom("stats", "predict")
to your NAMESPACE file.
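If the package uses roxygen2, one way to resolve the predict note is an @importFrom tag above the function that calls it, followed by devtools::document() to regenerate NAMESPACE (a sketch; the file name and function body below are illustrative):

# In R/r_squared.R (hypothetical location)
#' @importFrom stats predict
r_squared <- function(model, X_val, y_val) {
    # ... existing body that calls predict() ...
}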

Suggested issues from the `goodpractice` package

I ran goodpractice::gp() to identify likely sources of errors and style issues. I paste the output below as a png (the only way I could easily share it with you). I suggest installing the goodpractice package and running it yourself to address these issues:

[Screenshots of the goodpractice::gp() output omitted.]

`covr` suggests you are not at 100% branch coverage

When running covr::report() it suggests you are not at 100% branch coverage:

[Screenshot of the covr::report() coverage table omitted.]

Install the covr package and do this yourself; when you get this table, click on the filenames in the File column for more information about where you could be missing coverage.

Potential spelling errors identified by `devtools::spell_check()`

Potential spelling errors identified by devtools::spell_check(). Some are not errors (e.g., aic, bic), but others clearly are (e.g., compelx, flaot).

> devtools::spell_check()
  WORD          FOUND IN
aic           backward.Rd:33, forward.Rd:33
bic           backward.Rd:34, forward.Rd:34
compelx       aic.Rd:17, bic.Rd:17
flaot         bic.Rd:13
http          r_squared.Rd:20
https         aic.Rd:24, bic.Rd:24
iteratively   backward.Rd:40, forward.Rd:40
punisheR      description:1
scikit        r_squared.Rd:20
wikipedia     aic.Rd:24, bic.Rd:24

Better test for values from AIC & BIC

The current test of the values returned by your AIC & BIC functions is this:

test_that("test_metric_output_value", {
    # Test that the actual AIC and BIC values computed by
    # our functions match those computed by base R.
    expect_equal(aic(model), AIC(model))
    expect_equal(bic(model), BIC(model))
})

This is not ideal because what if the Base R functions are wrong or have bugs? This method of testing will just propagate them into your code base. You should test simpler cases for your AIC & BIC functions where the math is simple enough you can do it by hand (lm with 2 or 3 observations maybe?).
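For example (a sketch, assuming testthat and punisheR are loaded; the three-point data set is made up purely so the arithmetic is easy to verify):

test_that("aic and bic match hand-computed values for a tiny lm", {
    # Three observations, simple regression: small enough to check by hand.
    x   <- c(1, 2, 3)
    y   <- c(1, 3, 2)
    fit <- lm(y ~ x)

    # For a Gaussian linear model the maximised log-likelihood is
    #   -n/2 * (log(2*pi) + log(RSS/n) + 1),
    # and k counts the estimated parameters (intercept, slope, sigma).
    n   <- length(y)
    rss <- sum(residuals(fit)^2)
    k   <- length(coef(fit)) + 1
    ll  <- -n / 2 * (log(2 * pi) + log(rss / n) + 1)

    expect_equal(aic(fit), 2 * k - 2 * ll)       # AIC = 2k - 2*logLik
    expect_equal(bic(fit), log(n) * k - 2 * ll)  # BIC = k*log(n) - 2*logLik
})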

Test Complete

devtools::check()
devtools::test()
covr::report()
goodpractice::gp()
devtools::spell_check()

Complete

Additions to how this package fits into the existing R ecosystem

Other packages that do forward and backward (and best subset) selection:

Also, can you clarify what you mean by "p-value-based methods of feature selection"? P-values of what?

Finally, I think you should highlight how your package does more (e.g., BIC) or different things than these other packages. Think of selling users on why to use your package.

Why require matrix and vectors as input for `forward` and `backward` functions?

Why require a matrix and vectors (you should use these terms in the R docs, not "array") as input for the forward and backward functions? Users often load datasets into R using read_* or read.* functions, and the object returned is a data frame. Requiring your users to convert to a matrix and vectors each time will create a lot of redundant code for them. I suggest you do one of the following:

  • allow users to also pass a data frame for X_train, y_train, X_val, y_val, and perform the conversion to matrix and vector inside your functions when a data frame is supplied, OR
  • write a function that takes a data frame and, based on arguments provided by the user, returns X_train, y_train, X_val, y_val in the types your forward and backward functions currently require (see the sketch below).
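A minimal sketch of the second option (split_data() and its arguments are hypothetical, not part of punisheR):

# Split a data frame into the matrix/vector pieces that forward() and
# backward() currently expect, given a response column and a train fraction.
split_data <- function(df, response, train_frac = 0.8) {
    n         <- nrow(df)
    train_idx <- sample(seq_len(n), size = floor(train_frac * n))

    X <- as.matrix(df[, setdiff(names(df), response), drop = FALSE])
    y <- df[[response]]

    list(X_train = X[train_idx, , drop = FALSE],
         y_train = y[train_idx],
         X_val   = X[-train_idx, , drop = FALSE],
         y_val   = y[-train_idx])
}

# Hypothetical usage:
# splits <- split_data(mtcars, response = "hp")
# forward(splits$X_train, splits$y_train, splits$X_val, splits$y_val,
#         n_features = 3, min_change = NULL, criterion = "aic", verbose = FALSE)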

Function documentation incomplete

In both the function docs in the .R files and in the README, it is not clear what your functions return. Is it a model object from forward and backward? Is it a single value of type double from AIC and BIC? Something else? Please add to the docs to clarify.
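One way to make this explicit is an @return tag in each roxygen block (a sketch; the wording and argument defaults here are illustrative, though the README output above suggests forward() returns a vector of column indices):

#' @return A numeric vector of column indices identifying the features
#'   selected for the best model.
forward <- function(X_train, y_train, X_val, y_val,
                    min_change = NULL, n_features = NULL,
                    criterion = "r-squared", verbose = FALSE) {
    # ... existing implementation ...
}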

Add example of functions in README

Yes, you have a vignette, but often when someone is trying to decide whether or not to use your package they just look at the README. Thus, it's very helpful to have one or two short examples of use cases and output from those use cases. You could directly copy an example or two from the rendered vignette.
