
punisheR's Introduction

PunisheR


punisheR is a package for feature and model selection in R. Specifically, this package implements tools for forward and backward model selection. To measure model quality during the selection procedures, we have also implemented the Akaike and Bayesian Information Criteria (see below), both of which punish complex models -- hence this package's name.

As examined below, we recognize that well-designed versions of these tools already exist in R. This is acceptable to us because the impetus for this project is primarily pedagogical: to improve our understanding of model selection techniques and of collaborative software development.

Installation

devtools::install_github("UBC-MDS/punisheR")

If you would like comprehensive documentation of punisheR, we recommend setting build_vignettes = TRUE when you install the package.
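For example (assuming a recent version of devtools):

devtools::install_github("UBC-MDS/punisheR", build_vignettes = TRUE)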

Functions included:

punisheR has two stepwise feature selection techniques:

  • forward(): a feature selection method in which you start with a null model and iteratively add useful features
  • backward(): a feature selection method in which you start with a full model and iteratively remove the least useful feature at each step

This package also has three metrics to evaluate model performance:

  • aic(): the Akaike Information Criterion
  • bic(): the Bayesian Information Criterion
  • r_squared(): the coefficient of determination

These three criteria are used to measure the relative quality of models within forward() and backward(). In general, adding features to a model improves its fit to the training data but also makes it more prone to overfitting. AIC and BIC therefore add a penalty for the number of features in a model, with BIC imposing the larger penalty. The lower the AIC or BIC score, the better the model.
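As a rough illustration of how the penalties behave (a sketch using base R's AIC() and BIC() on the built-in mtcars data, not punisheR's own scoring functions):

small <- lm(mpg ~ wt, data = mtcars)
large <- lm(mpg ~ wt + cyl + disp + hp + drat + qsec, data = mtcars)

# AIC = 2*k - 2*logLik and BIC = log(n)*k - 2*logLik, so BIC penalises each
# extra parameter more heavily than AIC whenever n >= 8 (i.e., log(n) > 2).
AIC(small); AIC(large)
BIC(small); BIC(large)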

How does the package fit into the existing R ecosystem?

In the R ecosystem, forward and backward selection is implemented in both the olsrr and MASS packages. The former provides ols_step_forward() and ols_step_backward() for forward and backward stepwise selection, respectively; both use p-values as the metric for feature selection. The latter, MASS, contains stepAIC(), which supports three modes: forward, backward, or both. The selection procedure it uses is based on an information criterion (AIC), as we intend ours to be. Other packages that provide subset selection for regression models are leaps and bestglm.

In punisheR, users can select between metrics such as aic, bic and r-squared for both forward and backward selection. The number of features returned by these selection algorithms can be controlled either with n_features or with min_change, which specifies the minimum change in the criterion score required for an additional feature to be selected.

Usage examples

Load data

library(punisheR)

# mtcars_data() returns the demo data as a list: X_train, y_train, X_val, y_val
data <- mtcars_data()
X_train <- data[[1]]
y_train <- data[[2]]
X_val <- data[[3]]
y_val <- data[[4]]

Forward selection

forward(X_train, y_train, X_val, y_val, min_change=0.5,
    n_features=NULL, criterion='r-squared', verbose=FALSE)
    
#> [1] 10

When forward selection is run on the demo data, it returns the features of the best model as a vector of column indices. In this example, we use r-squared to determine the "best" model; here the function correctly returns a single feature (column 10).

Backward selection

backward(X_train, y_train, X_val, y_val,
    n_features=1, min_change=NULL, criterion='r-squared',
    verbose=FALSE)
    
#> [1] 10

When backward selection is run on the demo data, it likewise returns the features of the best model as a vector of column indices; starting from the full model, it correctly narrows the selection down to a single feature (column 10).
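The same searches can be driven by AIC or BIC instead of r-squared (a sketch reusing the demo data loaded above; output not shown):

# Keep adding features while each one improves the AIC score by at least 0.5
forward(X_train, y_train, X_val, y_val, min_change=0.5,
    n_features=NULL, criterion='aic', verbose=FALSE)

# Prune the full model down to the best three features, judged by BIC
backward(X_train, y_train, X_val, y_val,
    n_features=3, min_change=NULL, criterion='bic', verbose=FALSE)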

Scoring a model with AIC, BIC, and r-squared

model <- lm(y_train ~ mpg + cyl + disp, data = X_train)

aic(model)
#> [1] 252.6288

bic(model)
#> [1] 258.5191

When scoring the model with AIC and BIC, we can see that BIC applies the larger penalty, which is why its score is higher than the AIC score.

r_squared(model, X_val, y_val)
#> [1] 0.7838625

The value returned by r_squared() will typically fall between 0 and 1, with values closer to 1 indicating a better fit on the validation data.
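For reference, this score corresponds to the usual coefficient of determination, which can be reproduced with base R (a sketch; it assumes X_val can be coerced to a data frame for predict() and that the package uses the validation mean in the denominator):

preds <- predict(model, newdata = as.data.frame(X_val))
1 - sum((y_val - preds)^2) / sum((y_val - mean(y_val))^2)
# should agree with r_squared(model, X_val, y_val) under those assumptions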

Vignette

For a more comprehensive guide to punisheR, you can read the package vignette (an HTML version is also available).

Contributors:

Instructions and guidelines on how to contribute can be found in CONTRIBUTING.md. To contribute to this project, you must adhere to the terms outlined in our Contributor Code of Conduct (CONDUCT.md).

punisheR's People

Contributors

avinashkz, tariqahassan, topspinj


punisheR's Issues

notes to address from running devtools::check()

  1. You should properly document these using Roxygen if you think users of your package might ever use them. If not, I might add "helper" to the filename and/or function names to make it clear users will not be using them:
checking for missing documentation entries ... WARNING
Undocumented code objects:
  ‘fit_and_score’ ‘fitter’ ‘forward_break_criteria’ ‘input_checks’
  ‘input_data_checks’ ‘parse_n_features’
All user-level objects in a package should have documentation entries.
See chapter ‘Writing R documentation files’ in the ‘Writing R
Extensions’ manual.
  2. You can ignore the note below:
checking top-level files ... NOTE
Non-standard files/directories found at top level:
  ‘CONDUCT.md’ ‘CONTRIBUTING.md’
  3. Take a look at these and fix any you think might be harmful:
checking R code for possible problems ... NOTE
fit_and_score: warning in fitter(X = X_train_to_use, y = y_train):
  partial argument match of 'X' to 'X0'
fit_and_score: warning in fitter(X = X_train_to_use, y = y_train):
  partial argument match of 'y' to 'y0'
r_squared: no visible global function definition for ‘predict’
Undefined global functions or variables:
  predict
Consider adding
  importFrom("stats", "predict")
to your NAMESPACE file.
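If the package uses roxygen2, one way to resolve the predict note is an @importFrom tag above the function that calls it, followed by devtools::document() to regenerate NAMESPACE (a sketch; the file name and function body below are illustrative):

# In R/r_squared.R (hypothetical location)
#' @importFrom stats predict
r_squared <- function(model, X_val, y_val) {
    # ... existing body that calls predict() ...
}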

Suggested issues from the `goodpractice` package

I ran goodpractice::gp() to identify likely sources of errors and style issues. I paste the output below as a png (the only way I could easily share it with you). I suggest installing the goodpractice package and running it yourself to address these issues:

[Screenshots of the goodpractice::gp() output omitted.]

`covr` suggests you are not at 100% branch coverage

When running covr::report() it suggests you are not at 100% branch coverage:

[Screenshot of the covr::report() coverage table omitted.]

Install the covr package and do this yourself; when you get this table, click on the filenames in the File column for more information about where you could be missing coverage.

Potential spelling errors identified by `devtools::spell_check()`

Potential spelling errors identified by devtools::spell_check(). Some are not errors (e.g., aic, bic), but others clearly are (e.g., compelx, flaot).

> devtools::spell_check()
  WORD          FOUND IN
aic           backward.Rd:33, forward.Rd:33
bic           backward.Rd:34, forward.Rd:34
compelx       aic.Rd:17, bic.Rd:17
flaot         bic.Rd:13
http          r_squared.Rd:20
https         aic.Rd:24, bic.Rd:24
iteratively   backward.Rd:40, forward.Rd:40
punisheR      description:1
scikit        r_squared.Rd:20
wikipedia     aic.Rd:24, bic.Rd:24

Better test for values from AIC & BIC

The current test of the values returned by your AIC & BIC functions is this:

test_that("test_metric_output_value", {
    # Test that the actual AIC and BIC values computed by
    # our functions match those computed by base R.
    expect_equal(aic(model), AIC(model))
    expect_equal(bic(model), BIC(model))
})

This is not ideal because what if the Base R functions are wrong or have bugs? This method of testing will just propagate them into your code base. You should test simpler cases for your AIC & BIC functions where the math is simple enough you can do it by hand (lm with 2 or 3 observations maybe?).
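For example (a sketch, assuming testthat and punisheR are loaded; the three-point data set is made up purely so the arithmetic is easy to verify):

test_that("aic and bic match hand-computed values for a tiny lm", {
    # Three observations, simple regression: small enough to check by hand.
    x   <- c(1, 2, 3)
    y   <- c(1, 3, 2)
    fit <- lm(y ~ x)

    # For a Gaussian linear model the maximised log-likelihood is
    #   -n/2 * (log(2*pi) + log(RSS/n) + 1),
    # and k counts the estimated parameters (intercept, slope, sigma).
    n   <- length(y)
    rss <- sum(residuals(fit)^2)
    k   <- length(coef(fit)) + 1
    ll  <- -n / 2 * (log(2 * pi) + log(rss / n) + 1)

    expect_equal(aic(fit), 2 * k - 2 * ll)       # AIC = 2k - 2*logLik
    expect_equal(bic(fit), log(n) * k - 2 * ll)  # BIC = k*log(n) - 2*logLik
})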

Test Complete

devtools::check()
devtools::test()
covr::report()
goodpractice::gp()
devtools::spell_check()

Complete

Additions to how this package fits into the existing R ecosystem

Other packages that do forward and backward (and best subset) selection:

Also, can you clarify what you mean by "p-value-based methods of feature selection"? P-values of what?

Finally, I think you should highlight how your package does more (e.g., BIC) or different things than these other packages. Think of selling users on why to use your package.

Why require matrix and vectors as input for `forward` and `backward` functions?

Why require a matrix and vectors (you should use these terms in the R docs, not "array") as input for the forward and backward functions? Users often load datasets into R using read_* or read.* functions, and the object returned is a data frame. Requiring your users to convert to a matrix and vectors each time will create a lot of redundant code for them. I suggest you do one of the following:

  • allow users to also pass a data frame for X_train, y_train, X_val, y_val, and perform the conversion to matrix and vector inside your functions when a data frame is supplied, OR
  • write a function that takes a data frame and, based on arguments provided by the user, returns X_train, y_train, X_val, y_val in the types your forward and backward functions currently require (see the sketch below).
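A minimal sketch of the second option (split_data() and its arguments are hypothetical, not part of punisheR):

# Split a data frame into the matrix/vector pieces that forward() and
# backward() currently expect, given a response column and a train fraction.
split_data <- function(df, response, train_frac = 0.8) {
    n         <- nrow(df)
    train_idx <- sample(seq_len(n), size = floor(train_frac * n))

    X <- as.matrix(df[, setdiff(names(df), response), drop = FALSE])
    y <- df[[response]]

    list(X_train = X[train_idx, , drop = FALSE],
         y_train = y[train_idx],
         X_val   = X[-train_idx, , drop = FALSE],
         y_val   = y[-train_idx])
}

# Hypothetical usage:
# splits <- split_data(mtcars, response = "hp")
# forward(splits$X_train, splits$y_train, splits$X_val, splits$y_val,
#         n_features = 3, min_change = NULL, criterion = "aic", verbose = FALSE)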

Function documentation incomplete

In both the function docs in the .R files and in the README, it is not clear what your functions return. Is it a model object from forward and backward? Is it a single value of type double from AIC and BIC? Something else? Please add to the docs to clarify.
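One way to make this explicit is an @return tag in each roxygen block (a sketch; the wording and argument defaults here are illustrative, though the README output above suggests forward() returns a vector of column indices):

#' @return A numeric vector of column indices identifying the features
#'   selected for the best model.
forward <- function(X_train, y_train, X_val, y_val,
                    min_change = NULL, n_features = NULL,
                    criterion = "r-squared", verbose = FALSE) {
    # ... existing implementation ...
}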

Add example of functions in README

Yes, you have a vignette, but often when someone is trying to decide whether or not to use your package they just look at the README. Thus, it's very helpful to have one or two short examples of use cases and output from those use cases. You could directly copy an example or two from the rendered vignette.
