pypunisher's Introduction

PyPunisher

PyPunisher is a Python implementation of forward and backward feature selection. Feature selection is a key step in the data science pipeline: it reduces model complexity by keeping only the most relevant features from the original dataset. Stepwise regression is the classic example of this approach. This package implements two stepwise feature selection methods:

  • forward_selection(): starts with a null model and iteratively adds the most useful feature at each step
  • backward_elimination(): starts with a full model and iteratively removes the least useful feature at each step

These methods are greedy search algorithms that yield nested subsets of features. The size of the final subset depends on your stopping criterion, which can be either a threshold or a fixed number of features to include in the model. If you set a threshold (min_change), selection stops when adding or removing a feature no longer improves the AIC or BIC score by at least that amount. If you instead want a specific number of features, selection stops once the subset reaches n_features.
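The two stopping rules can be sketched in a few lines of Python. This is an illustrative sketch only, not PyPunisher's implementation: the function name forward_select, the score_fn callback, and the rule that exactly one of n_features or min_change must be set are all assumptions made for the example. Lower scores are treated as better, as with AIC or BIC.

```python
from typing import Callable, List, Optional, Sequence


def forward_select(
    candidates: Sequence[int],
    score_fn: Callable[[List[int]], float],
    n_features: Optional[int] = None,
    min_change: Optional[float] = None,
) -> List[int]:
    """Greedy forward selection (hypothetical helper); lower scores are better."""
    # Assumption for this sketch: exactly one stopping criterion is given.
    if (n_features is None) == (min_change is None):
        raise ValueError("Set exactly one of n_features or min_change.")

    selected: List[int] = []
    remaining = list(candidates)
    best_score = score_fn(selected)  # score of the null model

    while remaining:
        # Count rule: stop once the subset has the requested size.
        if n_features is not None and len(selected) >= n_features:
            break
        # Score every one-feature extension of the current subset.
        new_score, best_feature = min(
            (score_fn(selected + [f]), f) for f in remaining
        )
        # Threshold rule: stop when the best addition improves the
        # score by less than min_change.
        if min_change is not None and best_score - new_score < min_change:
            break
        selected.append(best_feature)
        remaining.remove(best_feature)
        best_score = new_score

    return selected


# Toy score standing in for AIC/BIC: each feature lowers the score by its
# gain, and every feature in the subset costs a fixed penalty of 2.
GAIN = {0: 10.0, 1: 0.5, 2: 25.0, 3: 0.1}


def toy_score(subset: List[int]) -> float:
    return 100.0 - sum(GAIN[f] for f in subset) + 2.0 * len(subset)


print(forward_select(range(4), toy_score, n_features=2))    # [2, 0]
print(forward_select(range(4), toy_score, min_change=1.0))  # [2, 0]
```

Because each step only extends the previous subset, the subsets are nested: the two-feature result always contains the one-feature result. Backward elimination follows the same pattern, starting from all features and removing one per step.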

To measure model quality during selection, we have also implemented the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), both of which punish complex models.

In general, adding parameters to a model improves its fit to the training data but makes it more susceptible to overfitting. AIC and BIC counter this by adding a penalty for the number of features in a model. AIC charges 2 per parameter and BIC charges ln(n), where n is the number of observations, so the BIC penalty is the larger one whenever there are at least 8 observations. A lower AIC or BIC score indicates a better fit for the data, relative to competing models.
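The standard definitions are AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), where k is the number of parameters, n the number of observations, and L the maximised likelihood. The sketch below computes both from a log-likelihood. The function names are chosen for this example, and the Gaussian log-likelihood helper assumes a least-squares fit with normally distributed residuals; how PyPunisher obtains the log-likelihood internally is not shown in this README.

```python
import math
from typing import Sequence


def gaussian_log_likelihood(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Maximised Gaussian log-likelihood of a least-squares fit.

    Assumes normally distributed residuals and a non-zero residual sum
    of squares (a perfect fit would make the logarithm undefined).
    """
    n = len(y_true)
    rss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)


def aic(log_likelihood: float, k: int) -> float:
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood: float, k: int, n: int) -> float:
    """Bayesian Information Criterion: k ln(n) - 2 ln(L)."""
    return math.log(n) * k - 2 * log_likelihood


# Made-up observations and predictions, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y_pred = [1.1, 1.9, 3.2, 3.8, 5.1, 6.2, 6.9, 7.8]

llf = gaussian_log_likelihood(y_true, y_pred)
print("AIC:", aic(llf, k=2))
print("BIC:", bic(llf, k=2, n=len(y_true)))
```

With 8 observations, ln(8) is roughly 2.08, so the BIC here comes out slightly higher than the AIC for the same fit.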

Installation

pip3 install git+https://github.com/UBC-MDS/PyPunisher@master

Requires Python 3.6+.

Documentation

The documentation for PyPunisher can be viewed here.

How to run unit tests

From the root directory, run all tests in a terminal:

python -m pytest

You can also run an individual test file by referencing its path. For example:

python -m pytest tests/test_forward_selection.py

Contributions

Instructions and guidelines on how to contribute can be found here.

pypunisher's People

Contributors

avinashkz, tariqahassan, topspinj

pypunisher's Issues

Milestone 3 objectives

General

  • set up Travis ✅
  • add Software License ✅
  • show branch coverage ✅

PyPunisher

  • exception handling ✅
  • integration tests ✅
  • render docs ✅
  • add example to README ✅

punisheR

  • exception handling ✅
  • integration tests ✅
  • make package name consistent (punisheR vs punishR) ✅
  • explore formatR (time permitting)

Issues in punisheR

  • Test whether forward and backward work for returning multiple features ✅
  • Better test for values from AIC & BIC ✅
  • Potential spelling errors identified by devtools::spell_check()
  • Suggested issues from the goodpractice package ✅ [goodpractice() now passing with 0 errors/warnings]
  • covr suggests you are not at 100% branch coverage ✅ [100% coverage achieved]
  • notes to address from running devtools::check() ✅
  • no explanation of what the test_data function does in vignette ✅
  • Add example of functions in README ✅
  • Vignette not rendered in repo ✅
  • require matrix and vectors as input for forward and backward functions? ✅
  • Function documentation incomplete ✅
  • Additions to how this package fits into the existing R ecosystem ✅

Bonus

  • implement linter in our CI pipeline ✅

Code Snippet Run Error

[10]
10
[10]
Traceback (most recent call last):
  File "a.py", line 15, in <module>
    print("AIC", aic(model, X_train=X_train, y_train=y_train))
  File "/usr/local/lib/python3.6/site-packages/pypunisher/metrics/criterion.py", line 76, in aic
    n, k, llf = _get_coeffs(model, X_train=X_train, y_train=y_train)
  File "/usr/local/lib/python3.6/site-packages/pypunisher/metrics/criterion.py", line 36, in _get_coeffs
    y_pred = model.predict(X_train)
  File "/usr/local/lib/python3.6/site-packages/sklearn/linear_model/base.py", line 256, in predict
    return self._decision_function(X)
  File "/usr/local/lib/python3.6/site-packages/sklearn/linear_model/base.py", line 241, in _decision_function
    dense_output=True) + self.intercept_
  File "/usr/local/lib/python3.6/site-packages/sklearn/utils/extmath.py", line 140, in safe_sparse_dot
    return np.dot(a, b)
ValueError: shapes (375,20) and (1,) not aligned: 20 (dim 1) != 1 (dim 0)
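The shapes in the final message suggest that the model held a single coefficient while X_train still had all 20 columns. The script a.py is not shown in this issue, so the following is only a guess at the cause: a model fitted on one selected column (index 10 here, mirroring the [10] printed above) and then scored against the full matrix. A minimal NumPy sketch of that mismatch and one possible fix, with all variable names hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(375, 20))  # same shape as in the traceback
coef = np.array([0.5])                # hypothetical: model fitted on one feature

# Reproduce the failure: (375, 20) dot (1,) cannot be aligned.
try:
    np.dot(X_train, coef)
    mismatch = None
except ValueError as err:
    mismatch = str(err)
print(mismatch)

# Possible fix: pass only the columns the model was fitted on.
selected = [10]                       # hypothetical selected feature index
y_pred = np.dot(X_train[:, selected], coef)
print(y_pred.shape)
```

If this guess is right, calling aic() with X_train restricted to the selected columns would avoid the error.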

Sunday March 4th

@topspinj
@avinashkz

Tariq:

  • Update R code to match Python.
  • Replicate metrics tests in R (as they are in Python).

Avinash:

  • Replicate forwards test in R.

Jill:

  • Replicate backwards test in R.

Milestone 2 objectives

PyPunisher

  • Write forward and backward selection functions - Tariq ✅
  • Write aic and bic functions - Jill ✅

punisheR

  • Write forward selection - Tariq ✅
  • Write backward selection function - Tariq ✅
  • Write aic and bic functions - Avinash ✅
  • Write up vignette - Avinash ✅
  • Enhance forward selection test documentation ✅
  • Add Installation instructions to the README ✅
  • Add coverage report to the README ✅
  • Enhance backward selection test documentation ✅

Feedback on Milestone 2

Hi All,

Nice work on milestone 2. I like your comprehensive design for the entire package. Here are my comments:

  1. Good practice to state that installation requires Python 3.6.

  2. I like that your coverage section details your test coverage. Excellent.

  3. For __init__.py, lines 4 and 5, why not just use list.append() to add the version number?

  4. For selection_engines/__init__.py, I like your comments about the issue in scipy.

  5. Please improve your Python style; you can refer to https://google.github.io/styleguide/pyguide.html. For example, selection.py line 4 is not professional, and the spacing between lines is inconsistent.

  6. For the first line of each Python file, I suggest including #!/usr/bin/env python in case the user is running your code on Linux (like me).

  7. For _fit_and_score(self, S, feature, algorithm), what happens if the algorithm argument is invalid? Have you thought about that?

  8. For the function backward(), S is not a good name for a list.

Regards
Jason

Feedback from Jason

Hi All,

Congrats on your work; you did a good job finding a set of functions to develop. I have some suggestions regarding your current proposal:

  1. As discussed in today's office hour, you are encouraged to think in advance about the following for each function: the data types of the input parameters, the data type of the return value, any corner cases to consider or exclude in your unit tests, and the function's computational complexity in big-O notation. These are all important aspects of designing a thoughtful function.

  2. Another issue is function complexity. As I discussed with Jill today, multiple for loops may cause long delays on large datasets. Think about ways to improve or optimize this.

  3. aic() and bic() should be described in more detail in the README.MD.

  4. For the contributor guidelines, I suggest specifying detailed steps for how to make a pull request, how to push and commit on GitHub, etc.

Regards
Jason
