
RegscorePy


A Python package that performs model comparison between different regression models.

Installation

pip install git+https://github.com/UBC-MDS/RegscorePy.git

# or

pip install RegscorePy

Function Description And Usage

AIC

AIC stands for Akaike's Information Criterion. It estimates the quality of a model relative to other candidate models: the lower the AIC score, the better the model. Therefore, the model with the lowest AIC among the candidates is chosen.

AIC = n*log(residual sum of squares/n) + 2K

where:

  • n: number of observations
  • K: number of parameters (including intercept)
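
The formula above can be sketched directly in NumPy. This is a hypothetical re-implementation for illustration, not the package's own code; note that applying the 2*K penalty with K equal to the user-supplied p reproduces the example in the Usage section below.

```python
import numpy as np

# Hypothetical sketch of the AIC formula, not the package's own code.
def aic_sketch(y, y_pred, p):
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y.shape[0]
    rss = np.sum((y - y_pred) ** 2)     # residual sum of squares
    return n * np.log(rss / n) + 2 * p  # n*log(RSS/n) + 2K
```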

Function

aic(y, y_pred, p)

Parameters:

  • y: array-like of shape = (n_samples) or (n_samples, n_outputs)

    • True target variable(s)
  • y_pred: array-like of shape = (n_samples) or (n_samples, n_outputs)

    • Fitted target variable(s) obtained from your regression model
  • p: int

    • Number of predictive variable(s) used in the model

Return:

  • aic_score: float
    • AIC score of the model

BIC

BIC stands for Bayesian Information Criterion. Like AIC, it also estimates the quality of a model. When fitting models, it is possible to increase model fitness by adding more parameters. Doing this may result in model overfit. Both AIC and BIC help to resolve this problem by using a penalty term for the number of parameters in the model. This term is bigger in BIC than in AIC.

BIC = n*log(residual sum of squares/n) + K*log(n)

where:

  • n: number of observations
  • K: number of parameters (including intercept)
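
As with AIC, the BIC formula can be sketched in a few lines of NumPy. This is a hypothetical re-implementation for illustration, not the package's own code.

```python
import numpy as np

# Hypothetical sketch of the BIC formula, not the package's own code.
def bic_sketch(y, y_pred, p):
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y.shape[0]
    rss = np.sum((y - y_pred) ** 2)             # residual sum of squares
    return n * np.log(rss / n) + p * np.log(n)  # n*log(RSS/n) + K*log(n)
```

Because the penalty term grows with log(n) instead of a constant 2, BIC penalizes extra parameters more heavily than AIC for n > 7 or so.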

Function

bic(y, y_pred, p)

Parameters:

  • y: array-like of shape = (n_samples) or (n_samples, n_outputs)

    • True target variable(s)
  • y_pred: array-like of shape = (n_samples) or (n_samples, n_outputs)

    • Fitted target variable(s) obtained from your regression model
  • p: int

    • Number of predictive variable(s) used in the model

Return:

  • bic_score: float
    • BIC score of the model

Mallow's C_p

Introduction

Mallow's C_p is named for Colin Lingwood Mallows. It is used to assess the fit of a regression model, finding the best model that uses a subset of the predictive variables available for predicting some outcome.

C_p = (SSE_p/MSE) - (n - 2p)

where:

  • SSE_p: residual sum of squares for the subset model containing p explanatory variables, counting the intercept.
  • MSE: mean squared error for the full model (model containing all k explanatory variables of interest)
  • n: number of observations
  • p: number of subset explanatory variables
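
A sketch of the formula above, as a hypothetical re-implementation rather than the package's own code. It assumes the full-model MSE is computed as SSE_full / (n - k); under that assumption it reproduces the example in the Usage section below.

```python
import numpy as np

# Hypothetical sketch of Mallow's C_p, not the package's own code.
# Assumes MSE = SSE_full / (n - k) for the full model.
def mallow_sketch(y, y_pred, y_sub, k, p):
    y, y_pred, y_sub = (np.asarray(a, dtype=float) for a in (y, y_pred, y_sub))
    n = y.shape[0]
    sse_p = np.sum((y - y_sub) ** 2)           # subset-model residual SS
    mse = np.sum((y - y_pred) ** 2) / (n - k)  # full-model mean squared error
    return sse_p / mse - (n - 2 * p)
```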

Function

mallow(y, y_pred, y_sub, k, p)

Parameters:

  • y: array-like of shape = (n_samples) or (n_samples, n_outputs)

    • True target variable(s)
  • y_pred: array-like of shape = (n_samples) or (n_samples, n_outputs)

    • Fitted target variable(s) obtained from your regression model
  • y_sub: array-like of shape = (n_samples) or (n_samples, n_outputs)

    • Fitted target variable(s) obtained from your subset regression model
  • k: int

    • Number of predictive variable(s) used in the model
  • p: int

    • Number of predictive variable(s) used in the subset model

Return:

  • mallow_score: float
    • Mallow's C_p score of the subset model

Usage

>>> from RegscorePy import *
>>> y = [1,2,3,4]
>>> y_pred = [5,6,7,8]
>>> p = 3
>>> aic.aic(y, y_pred, p)
17.090354888959126
>>> bic.bic(y, y_pred, p)
15.249237972318795
>>> y_sub = [1,2,3,5]
>>> k = 3
>>> p = 2
>>> mallow.mallow(y, y_pred, y_sub, k, p)
0.015625

  • This usage applies to Python 3. If you use Python 2, run from __future__ import division before calling the functions.

How to run tests

From root directory, run all test files in terminal:

python -m pytest

You can also run individual test files by referencing their paths. For example, to test the aic function, use the command below:

python -m pytest RegscorePy/test/test_aic.py

Dependencies

The following versions were used to develop this package:

  • Python (v>=3.6)
  • NumPy (v>=1.13.3)
  • Pandas (v>=0.20.3)

License

MIT

Contributing

This is an open source project. Please follow the guidelines below for contribution.

  • Open an issue for any feedback and suggestions.
  • For contributing to the project, please refer to Contributing for details.

regscorepy's People

Contributors

hadinh1306, rq1995, simrnsethi

regscorepy's Issues

User input

If it's possible to reduce the number of parameters the user has to enter, always do so. For example, it's unnecessary to have the user enter "n", the number of data points, as that can be derived from the dimensions of X. If you think there might be edge cases where n != X.shape[0], consider setting n as a named parameter with a sensible default and allowing the user to overwrite that default when necessary.
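
One way to act on this suggestion is sketched below. This is a hypothetical signature, not the package's current one: n is inferred from the data by default and exposed only as an optional override.

```python
import numpy as np

# Hypothetical sketch of the suggestion, not the package's current API:
# infer n from the data by default, but allow an explicit override for
# edge cases where n != len(y).
def aic(y, y_pred, p, n=None):
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if n is None:
        n = y.shape[0]                  # sensible default derived from the data
    rss = np.sum((y - y_pred) ** 2)
    return n * np.log(rss / n) + 2 * p
```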

Feedback: Python2/3 bug

In python3:

>>> from RegscorePy import *
>>> y = [1,2,3,4]
>>> y_pred = [5,6,7,8]
>>> y_sub = [1,2,3,5]
>>> k = 3
>>> p = 2
>>> mallow.mallow(y, y_pred, y_sub, k, p)
0.015625
>>>

However, in python2:

>>> from RegscorePy import *
>>> y = [1,2,3,4]
>>> y_pred = [5,6,7,8]
>>> y_sub = [1,2,3,5]
>>> k = 3
>>> p = 2
>>> mallow.mallow(y, y_pred, y_sub, k, p)
0
>>>

This is because of the different ways python2 and python3 handle integer division. Be sure to cast one of your integers to float() to fix this bug. Also specify in the readme which python version you have tested your package on. On the plus side, your tests failed in the appropriate places when I ran them in python2 so well done :)
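
A minimal sketch of the fix (the helper name is illustrative): cast one operand to float before dividing, or enable true division module-wide, so integer inputs behave identically under Python 2 and 3.

```python
# Sketch of the fix. In Python 2, 1 / 64 evaluates to 0 (floor division);
# casting one operand to float restores the intended result on both
# interpreters. The helper name is hypothetical.
def safe_ratio(sse_p, mse):
    return float(sse_p) / mse

# Alternatively, put this at the top of the module so that / always means
# true division, as in Python 3:
# from __future__ import division
```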

Small typo

In CONTRIBUTING.md, it's "Fork", not "Folk" :)

Wrapper functions

If you have additional time you can also write wrapper functions which take the raw data as originally proposed and return both the metric and the trained model.
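
A hypothetical wrapper along these lines is sketched below. Names are illustrative, and it uses an ordinary least-squares fit via NumPy rather than the package's internals; the point is that it returns both the metric and the fitted model.

```python
import numpy as np

# Hypothetical wrapper (names are illustrative): fit ordinary least squares
# on the raw data, then return both the AIC score and the fitted
# coefficients so the user keeps access to the trained model.
def aic_from_data(X, y, p):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])     # design matrix with intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # trained model parameters
    rss = np.sum((y - A @ coef) ** 2)
    n = len(y)
    return n * np.log(rss / n) + 2 * p, coef
```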

Feedback

Nice! You have a well-defined project that will be manageable within the scope of the next three labs. There are a few modifications to the proposed functions you should make for the next lab.

See:

Main changes for next time:

  • Allow users to evaluate their own models
  • Make interfacing easy with existing ecosystem(s)

Feedback: Variable name

Naming a variable 'mallow' within a function named 'mallow' is bad practice. You'll run into weird bugs during development due to choosing a variable name that is the same as a function name.

Feedback (bonus)

This one isn't necessary, but if you have some extra time you might want to demo your functions on real data. It will impress future employers who may look through your repo :)

Match scikit-learn input structure

You should structure your functions to match the input of scikit-learn (i.e. metric(y_true, y_pred, k)). This way they will work with pretrained models and will be easily dropped into existing code. The same goes for R: design your functions so they are flexible and interface easily with the existing ecosystem.

Overall feedback

Great job! Everything works perfectly, I downloaded both packages in R with no issues, code is clean and well commented, you have a passing build stamp, the demo code works, all tests pass, exceptions are handled well, readme is well written and informative. I've left a few very minor suggestions in other issues which you should fix because the code is public.

Feedback: imports

Move imports outside of the functions in aic.py, bic.py, and mallow.py.

User input models

Your proposed functions take in the data, train a model, and return a metric. However, they don't allow users to input their own models, which is functionality most, if not all, users will want. Your functions also don't return the underlying trained model, so if a user found a model that was well fit to the data, they would have no way to access it.
