ubc-mds / pypunisher

A Python package that performs stepwise forward and backward feature selection

Home Page: https://ubc-mds.github.io/PyPunisher/index.html

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%

python feature-selection aic bic

pypunisher's Introduction

PyPunisher

PyPunisher is a Python implementation of forward and backward feature selection. Feature selection, or stepwise regression, is a key step in the data science pipeline that reduces model complexity by selecting the most relevant features from the original dataset. This package implements two stepwise feature selection methods:

forward_selection(): starts with a null model and iteratively adds useful features
backward_elimination(): starts with a full model and iteratively removes the least useful feature at each step

These methods are greedy search algorithms that yield a nested subset of features. The size of the final feature subset depends on your "stopping criterion", which can be either a threshold that you define or a pre-defined number of features to include in your model. For example, if you set a threshold (min_change), the feature selection process stops when a step no longer improves the AIC or BIC score by at least that amount. Alternatively, if you want a specific number of features in your model, the process stops once it reaches n_features.
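The greedy loop and the two stopping criteria described above can be sketched as follows. This is a minimal illustration, not the package's implementation: `forward_select`, `score_fn`, and `toy_score` are hypothetical names, and the score is assumed to behave like AIC or BIC (lower is better).

```python
from typing import Callable, List, Optional, Sequence


def forward_select(
    n_total: int,
    score_fn: Callable[[Sequence[int]], float],
    min_change: Optional[float] = None,
    n_features: Optional[int] = None,
) -> List[int]:
    """Greedy forward selection over feature indices 0..n_total-1.

    score_fn(subset) is assumed to return an AIC-like score (lower is better).
    Exactly one of min_change / n_features must be given.
    """
    if (min_change is None) == (n_features is None):
        raise ValueError("pass exactly one of min_change or n_features")

    selected: List[int] = []
    best_score = score_fn(selected)  # score of the null model
    remaining = list(range(n_total))

    while remaining:
        if n_features is not None and len(selected) >= n_features:
            break  # reached the requested subset size
        # Try every unused feature and keep the one giving the lowest score.
        candidate = min(remaining, key=lambda j: score_fn(selected + [j]))
        candidate_score = score_fn(selected + [candidate])
        if min_change is not None and best_score - candidate_score < min_change:
            break  # the improvement is smaller than the threshold
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected


# Toy criterion (made up for illustration): features 0 and 2 reduce the
# score, and every included feature costs a penalty of 2.
GAINS = {0: 5.0, 2: 9.0}


def toy_score(subset: Sequence[int]) -> float:
    return 100.0 - sum(GAINS.get(j, 0.0) for j in subset) + 2.0 * len(subset)


print(forward_select(4, toy_score, min_change=1.0))  # [2, 0]
print(forward_select(4, toy_score, n_features=3))    # [2, 0, 1]
```

With a threshold, the search stops as soon as the best remaining feature fails to pay for its penalty. With a fixed count, it keeps adding features even when the score gets worse.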

To measure model quality during the selection procedures, we have also implemented the Akaike and Bayesian Information Criteria, both of which punish complex models:

aic(): computes the Akaike Information Criterion (AIC)
bic(): computes the Bayesian Information Criterion (BIC)

In general, adding parameters to a model improves its fit to the training data but makes it more susceptible to overfitting. AIC and BIC add a penalty for the number of features in a model. This penalty term is larger in BIC than in AIC whenever the sample has at least 8 observations, because BIC charges ln(n) per parameter while AIC charges 2. A lower AIC or BIC score indicates a better fit for the data, relative to competing models.
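As a rough illustration of the penalty terms, the textbook definitions are AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), where k is the number of parameters, n the number of observations, and L the maximized likelihood. The sketch below uses those definitions directly. `aic_score` and `bic_score` are hypothetical helpers that take a log-likelihood as input; they are not the package's `aic()` and `bic()`, whose exact computation is not shown here, and the numbers are invented.

```python
import math


def aic_score(log_likelihood: float, k: int) -> float:
    """Textbook AIC: 2k - 2 ln(L)."""
    return 2 * k - 2 * log_likelihood


def bic_score(log_likelihood: float, k: int, n: int) -> float:
    """Textbook BIC: k ln(n) - 2 ln(L)."""
    return k * math.log(n) - 2 * log_likelihood


n = 100
# Model B fits slightly better (higher log-likelihood) but uses twice
# as many parameters as model A.
model_a = {"log_likelihood": -120.0, "k": 3}
model_b = {"log_likelihood": -118.0, "k": 6}

print(aic_score(model_a["log_likelihood"], model_a["k"]))  # 246.0
print(aic_score(model_b["log_likelihood"], model_b["k"]))  # 248.0
print(bic_score(model_a["log_likelihood"], model_a["k"], n))
print(bic_score(model_b["log_likelihood"], model_b["k"], n))
```

Both criteria prefer model A here, and the gap is wider under BIC because ln(100) is larger than 2.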

Installation

pip3 install git+https://github.com/UBC-MDS/PyPunisher@master

Requires Python 3.6+.

Documentation

The documentation for PyPunisher can be viewed at https://ubc-mds.github.io/PyPunisher/index.html.

How to run unit tests

From the root directory, run all test files in the terminal:

python -m pytest

You can also run an individual test file by referencing its path. For example:

python -m pytest tests/test_forward_selection.py

Contributions

Instructions and guidelines on how to contribute can be found here.

pypunisher's Issues

Write unit tests for feature selection functions

Create vignette for punisheR package

punisheR, the R version of PyPunisher, can be found here

Write up package proposal as readme

Proposal includes:

package description
function descriptions
how the package fits in the Python ecosystem

Milestone 3 objectives

General

set up Travis ✅
add Software License ✅
show branch coverage ✅

PyPunisher

exception handling ✅
integration tests ✅
render docs ✅
add example to README ✅

punisheR

exception handling ✅
integration tests ✅
make package name consistent (punisheR vs punishR) ✅
~~explore formatR (time permitting)~~

Issues in punisheR

Test whether forward and backward work for returning multiple features ✅
Better test for values from AIC & BIC ✅
Potential spelling errors identified by devtools::spell_check() ✅
Suggested issues from the goodpractice package ✅ [goodpractice() now passing with 0 errors/warnings]
covr suggests you are not at 100% branch coverage ✅ [100% coverage achieved]
notes to address from running devtools::check() ✅
no explanation of what the test_data function does in vignette ✅
Add example of functions in README ✅
Vignette not rendered in repo ✅
require matrix and vectors as input for forward and backward functions? ✅
Function documentation incomplete ✅
Additions to how this package fits into the existing R ecosystem ✅

Bonus

implement linter in our CI pipeline ✅

Travis Testing Complete

Where is integration test

Write forward & backward selection functions in Python

Code Snippet Run Error

[10]
10
[10]
Traceback (most recent call last):
  File "a.py", line 15, in <module>
    print("AIC", aic(model, X_train=X_train, y_train=y_train))
  File "/usr/local/lib/python3.6/site-packages/pypunisher/metrics/criterion.py", line 76, in aic
    n, k, llf = _get_coeffs(model, X_train=X_train, y_train=y_train)
  File "/usr/local/lib/python3.6/site-packages/pypunisher/metrics/criterion.py", line 36, in _get_coeffs
    y_pred = model.predict(X_train)
  File "/usr/local/lib/python3.6/site-packages/sklearn/linear_model/base.py", line 256, in predict
    return self._decision_function(X)
  File "/usr/local/lib/python3.6/site-packages/sklearn/linear_model/base.py", line 241, in _decision_function
    dense_output=True) + self.intercept_
  File "/usr/local/lib/python3.6/site-packages/sklearn/utils/extmath.py", line 140, in safe_sparse_dot
    return np.dot(a, b)
ValueError: shapes (375,20) and (1,) not aligned: 20 (dim 1) != 1 (dim 0)
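The last line of the traceback says the data has 20 columns while the model holds a single coefficient. One plausible reading, which is an assumption since the failing script is not shown, is that the model was fit on only the selected column (the `[10]` printed above) and then scored against the full `X_train`. The sketch below reproduces that kind of mismatch in plain Python. `linear_predict` and `take_columns` are hypothetical stand-ins, not scikit-learn or PyPunisher functions.

```python
from typing import List, Sequence


def linear_predict(
    rows: Sequence[Sequence[float]],
    coef: Sequence[float],
    intercept: float = 0.0,
) -> List[float]:
    """Stand-in for a fitted linear model's predict(): one coefficient per column."""
    for row in rows:
        if len(row) != len(coef):
            raise ValueError(
                f"row has {len(row)} columns but model has {len(coef)} coefficients"
            )
    return [sum(x * c for x, c in zip(row, coef)) + intercept for row in rows]


def take_columns(
    rows: Sequence[Sequence[float]], columns: Sequence[int]
) -> List[List[float]]:
    """Keep only the given column indices, in order."""
    return [[row[j] for j in columns] for row in rows]


X_train = [[float(i + j) for j in range(20)] for i in range(3)]  # 3 rows, 20 columns
selected = [10]  # indices returned by the selection step (assumed)
coef = [0.5]     # a model fit on the single selected column (made-up value)

try:
    linear_predict(X_train, coef)  # full matrix against a 1-feature model
except ValueError as err:
    print("mismatch:", err)

# Restricting the data to the selected columns makes the shapes agree.
print(linear_predict(take_columns(X_train, selected), coef))  # [5.0, 5.5, 6.0]
```

If that reading is right, passing the same column subset to both the fit and the scoring call would avoid the error.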

Sunday March 4th

@topspinj
@avinashkz

Tariq:

Update R code to match Python.
Replicate metrics tests in R (as they are in Python).

Avinash:

Replicate forwards test in R.

Jill:

Replicate backwards test in R.

Create a 2.0 Release

Due before 3pm on March 4th.

Write unit tests for aic and bic functions

Set up code scaffold

Milestone 2 objectives

PyPunisher

Write forward and backward selection functions - Tariq ✅
Write aic and bic functions - Jill ✅

punisheR

Write forward selection - Tariq ✅
Write backward selection function - Tariq ✅
Write aic and bic functions - Avinash ✅
Write up vignette - Avinash ✅
Enhance forward selection test documentation ✅
Add Installation instructions to the README ✅
Add coverage report to the README ✅
Enhance backward selection test documentation ✅

Feedback on Milestone 2

Hi All,

Nice work on milestone 2. I like your comprehensive design for the entire package. Here are my comments:

It is good practice to state that installation requires Python 3.6.
I like the coverage section detailing your test coverage, excellent.
For __init__.py, lines 4 and 5, why not just use list.append() to add the version number?
For selection_engines/__init__.py, I like your comments on the issue in scipy.
Please improve your Python style; you can refer to https://google.github.io/styleguide/pyguide.html. For example, selection.py line 4 is not professional, and the spacing between lines is inconsistent.
For the first line of your Python files, I suggest including #!/usr/bin/env python in case the user is running your code on Linux (like me).
For _fit_and_score(self, S, feature, algorithm), what if the algorithm argument is given an invalid value? Have you thought about that?
For the function backward(), S is not a good name for a list.

Regards
Jason
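The question above about invalid `algorithm` values can be handled with an explicit check at the top of the function. The sketch below is one way to do it, not the package's code: `validate_algorithm` is a hypothetical helper, and the accepted names "forward" and "backward" are an assumption.

```python
VALID_ALGORITHMS = ("forward", "backward")  # assumed names, not confirmed by the source


def validate_algorithm(algorithm):
    """Return the algorithm name unchanged, or raise a descriptive error."""
    if not isinstance(algorithm, str):
        raise TypeError(
            f"algorithm must be a str, got {type(algorithm).__name__}"
        )
    if algorithm not in VALID_ALGORITHMS:
        raise ValueError(
            f"algorithm must be one of {VALID_ALGORITHMS}, got {algorithm!r}"
        )
    return algorithm


print(validate_algorithm("forward"))  # forward

for bad_value in ("sideways", 3):
    try:
        validate_algorithm(bad_value)
    except (TypeError, ValueError) as err:
        print(type(err).__name__, "-", err)
```

Failing early with a message that lists the accepted values is usually easier to debug than an error raised deep inside the fitting loop.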

Feedback from Jason

Hi All,

Congrats on your work; you did a good job finding a series of functions to develop. I have some suggestions regarding your current proposal:

As discussed in today's office hour, it would be better to think in advance about, and specify, the following for each function: the data types of the input parameters, the data type of the return value, any edge cases your unit tests need to cover, and the big-O computational complexity. These are all important aspects of designing a thoughtful function.
Another issue is function complexity. As I discussed with Jill today, multiple nested for loops may cause long delays when dealing with large datasets. You can think of ways to improve or optimize this.
More detail about aic() and bic() should be included in the README.MD.
For the contributor guidelines, I suggest specifying detailed steps on how to make a pull request, how to do GitHub push/commit operations, etc.

Regards
Jason

Create 3.0 Releases for PyPunisher and PunisheR

Due Sunday March 11, before 3pm.

More feedback on milestone 2

Overall, very well done.

Each of your test cases should be in its own test function that is named in a descriptive way. Then you can easily tell which test case has failed when you get failures.
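The "one test case per descriptively named function" advice above might look like this. It is a sketch with a hypothetical helper, `bic_penalty`, standing in for real package code. The test functions use plain `assert` so the block runs without extra dependencies, and pytest collects any function whose name starts with `test_`.

```python
import math


def bic_penalty(k: int, n: int) -> float:
    """Hypothetical helper under test: the k * ln(n) term of BIC."""
    if n <= 0:
        raise ValueError("n must be positive")
    return k * math.log(n)


def test_bic_penalty_is_zero_for_empty_model():
    assert bic_penalty(0, 50) == 0.0


def test_bic_penalty_grows_with_number_of_features():
    assert bic_penalty(3, 50) > bic_penalty(2, 50)


def test_bic_penalty_rejects_non_positive_sample_size():
    try:
        bic_penalty(2, 0)
    except ValueError:
        return  # the expected error was raised
    raise AssertionError("expected ValueError for n=0")
```

When one of these fails, the test name in the report already says which behavior broke.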

The requirement was to show that your tests provide 100% branch coverage but you have provided the statement coverage. See https://github.ubc.ca/ubc-mds-2017/DSCI_524_collab-sw-dev_students/issues/9 for more info.

Please fix both of these issues before the next milestone.
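The difference between statement and branch coverage mentioned above shows up in even a tiny function. In this sketch, `clip_to_zero` is a hypothetical example and not part of the package.

```python
def clip_to_zero(x: float) -> float:
    """Hypothetical function with an `if` that has no `else` clause."""
    if x < 0:
        x = 0.0
    return x


def test_negative_input_is_clipped():
    # This test alone executes every statement, so statement coverage
    # reports 100%. The path where the `if` is skipped is never taken.
    assert clip_to_zero(-1.5) == 0.0


def test_non_negative_input_is_unchanged():
    # This second test takes the skipped-`if` path, which branch
    # coverage requires.
    assert clip_to_zero(2.0) == 2.0
```

If the pytest-cov plugin is installed, its `--cov-branch` option reports branch coverage rather than statement coverage alone.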
