
bevel's Introduction


Ordinal regression refers to a number of techniques that are designed to classify inputs into ordered (or ordinal) categories. This type of data is common in social science research settings where the dependent variable often comes from opinion polls or evaluations. For example, ordinal regression can be used to predict the letter grades of students based on the time they spend studying, or Likert scale responses to a survey based on the annual income of the respondent.

In People Analytics at Shopify, we use ordinal regression to empower Shopify employees. Our annual engagement survey contains dozens of scale questions about wellness, team health, leadership and alignment. To better dig into this data we built bevel, a repository that contains simple, easy-to-use Python implementations of standard ordinal regression techniques.

Using bevel

Fitting

bevel's API closely follows scikit-learn's: you instantiate a class and call its fit method with the design matrix (the independent variables) and the outcome array (the dependent variable). For bevel, the outcome array contains values from a totally orderable set (for example {0, 1, 2, ...}, {'A', 'B', 'C', ...}, or {'01', '02', '03', ...}) representing your ordinal data. This may require a pre-processing step, such as encoding survey responses as integers.

The design matrix can be a numpy array or a pandas DataFrame. The benefit of the latter is that the DataFrame's column names are displayed in the inference output later.
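As an illustration of this pre-processing (the column names and Likert categories below are invented for the example, not part of bevel):

```python
import pandas as pd

# Hypothetical survey data: map Likert responses to ordered integers
# before fitting. The mapping defines the ordering bevel will use.
likert = {"strongly disagree": 0, "disagree": 1, "neutral": 2,
          "agree": 3, "strongly agree": 4}

df = pd.DataFrame({
    "annual_income": [40_000, 85_000, 62_000],
    "response": ["disagree", "strongly agree", "neutral"],
})
X = df[["annual_income"]]       # design matrix keeps its column names
y = df["response"].map(likert)  # ordinal outcome encoded as integers
```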

Below is an example of fitting with the OrderedLogit model.

from bevel.linear_ordinal_regression import OrderedLogit

ol = OrderedLogit()
ol.fit(X, y)  # X: design matrix, y: array of ordinal outcomes
Inference and prediction

After bevel fits the model to the data, additional methods become available. To see the coefficients of the fitted linear model, along with their standard errors and confidence intervals, use the print_summary method. Below is the output for the UCLA dataset.

ol.print_summary()
"""
                   beta  se(beta)      p  lower 0.95  upper 0.95
attribute names
pared            1.0477    0.2658 0.0001      0.5267      1.5686  ***
public          -0.0587    0.2979 0.8439     -0.6425      0.5251
gpa              0.6157    0.2606 0.0182      0.1049      1.1266    *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Somers' D = 0.158
"""

These values are available as a pandas DataFrame via the summary property of the fitted class. The Somers' D value measures the goodness-of-fit of the model, analogous to the R² value in ordinary linear regression. Unlike R², however, it ranges from -1 (totally discordant) to 1 (totally concordant).
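For intuition, Somers' D can be computed by counting concordant and discordant pairs of observations; the sketch below is an illustration of the statistic, not bevel's internal implementation:

```python
import numpy as np

def somers_d(pred, y):
    """Somers' D of predictions with respect to an ordinal outcome y.

    Only pairs whose outcomes differ are considered; pairs tied on the
    prediction count in the denominator but are neither concordant nor
    discordant."""
    pred = np.asarray(pred, dtype=float)
    y = np.asarray(y)
    conc = disc = total = 0
    n = len(y)
    for i in range(n):
        for j in range(i + 1, n):
            if y[i] == y[j]:
                continue  # pairs tied on the outcome are ignored
            total += 1
            sign = 1 if y[i] > y[j] else -1
            d = (pred[i] - pred[j]) * sign
            if d > 0:
                conc += 1
            elif d < 0:
                disc += 1
    return (conc - disc) / total
```

Perfectly concordant predictions give 1, perfectly discordant give -1, and ties in the prediction pull the statistic toward 0.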

Another goal of fitting is predicting outcomes from new datasets. For this, bevel has three prediction methods, depending on your goal.

ol.predict_probabilities(X)   # returns an array with the probability of each class
ol.predict_class(X)           # returns the class with the highest probability
ol.predict_linear_product(X)  # returns the dot product of X and the fitted coefficients
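Under the hood, an ordered-logit (proportional-odds) model derives per-class probabilities by differencing cumulative logistic probabilities evaluated at the fitted cutpoints. The sketch below shows that arithmetic with generic beta (coefficients) and alpha (sorted cutpoints) arrays; it is an illustration of the model, not bevel's actual code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ordered_logit_probabilities(X, beta, alpha):
    """P(y <= k | x) = sigmoid(alpha_k - x.beta); differencing the
    cumulative probabilities yields one probability per class."""
    eta = X @ beta                                  # linear predictor, shape (n,)
    cum = sigmoid(alpha[None, :] - eta[:, None])    # (n, K-1) cumulative probs
    cum = np.hstack([np.zeros((len(eta), 1)), cum, np.ones((len(eta), 1))])
    return np.diff(cum, axis=1)                     # (n, K), rows sum to 1
```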

bevel's People

Contributors

camdavidsonpilon, cursedcoder, rossdiener


bevel's Issues

Test indicator functions

There should be a test of the indicator matrices in TestLinearOrdinalRegression. Important to check that arbitrary ordered y values are correctly handled (e.g. [0, 1, 4, 5, 7])
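One way such a test could pin down the expected mapping (illustrative only): np.unique already produces the dense ranks an indicator matrix would be built from, so arbitrary ordered values like [0, 1, 4, 5, 7] reduce to contiguous class indices.

```python
import numpy as np

# Arbitrary ordered outcome values map to dense ranks; np.unique returns
# the sorted classes and the inverse mapping in one call.
y = np.array([0, 1, 4, 5, 7, 4, 1])
classes, y_rank = np.unique(y, return_inverse=True)
# classes -> [0, 1, 4, 5, 7]; y_rank -> [0, 1, 2, 3, 4, 2, 1]
```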

`predict_linear_product` throws ValueError due to deprecation

In linear_ordinal_regression.py, predict_linear_product throws the following ValueError.

152 if X.ndim == 1:
153 X = X[None, :]
--> 154 return X.dot(self.beta_)[:, None]

File c:\ProgramData\anaconda3\envs\intel\lib\site-packages\pandas\core\series.py:1072, in Series.getitem(self, key)
1069 key = np.asarray(key, dtype=bool)
1070 return self._get_rows_with_mask(key)
-> 1072 return self._get_with(key)

File c:\ProgramData\anaconda3\envs\intel\lib\site-packages\pandas\core\series.py:1082, in Series._get_with(self, key)
1077 raise TypeError(
1078 "Indexing a Series with DataFrame is not "
1079 "supported, use the appropriate DataFrame column"
1080 )
1081 elif isinstance(key, tuple):
-> 1082 return self._get_values_tuple(key)
1084 elif not is_list_like(key):
1085 # e.g. scalars that aren't recognized by lib.is_scalar, GH#32684
1086 return self.loc[key]

File c:\ProgramData\anaconda3\envs\intel\lib\site-packages\pandas\core\series.py:1122, in Series._get_values_tuple(self, key)
1117 if com.any_none(*key):
1118 # mpl compat if we look up e.g. ser[:, np.newaxis];
1119 # see tests.series.timeseries.test_mpl_compat_hack
1120 # the asarray is needed to avoid returning a 2D DatetimeArray
1121 result = np.asarray(self._values[key])
-> 1122 disallow_ndim_indexing(result)
1123 return result
1125 if not isinstance(self.index, MultiIndex):

File c:\ProgramData\anaconda3\envs\intel\lib\site-packages\pandas\core\indexers\utils.py:341, in disallow_ndim_indexing(result)
333 """
334 Helper function to disallow multi-dimensional indexing on 1D Series/Index.
335
(...)
338 in GH#30588.
339 """
340 if np.ndim(result) > 1:
--> 341 raise ValueError(
342 "Multi-dimensional indexing (e.g. obj[:, None]) is no longer "
343 "supported. Convert to a numpy array before indexing instead."
344 )

ValueError: Multi-dimensional indexing (e.g. obj[:, None]) is no longer supported. Convert to a numpy array before indexing instead.

My environment.yml is below.

  - numdifftools=0.9.41=pyhd8ed1ab_0
  - numpy=1.24.3=py310h2d48daa_5
  - numpy-base=1.24.3=py310h0a02333_5
  - pandas=2.1.4=py310hecd3228_0
  - python=3.10.13=h9bd319b_0
  - scipy=1.10.1=py310hec14841_8
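A minimal fix sketch, assuming the goal is simply to coerce pandas inputs to numpy arrays before the [:, None] indexing (the standalone function form here is for illustration; in bevel this would be the method body):

```python
import numpy as np

def linear_product(X, beta):
    # Coercing pandas objects to numpy arrays up front avoids the removed
    # multi-dimensional indexing (obj[:, None]) on a pandas Series.
    X = np.asarray(X)
    beta = np.asarray(beta)
    if X.ndim == 1:
        X = X[None, :]
    return X.dot(beta)[:, None]
```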

Compatibility Issue: Deprecated 'np.int' Alias in bevel Library with NumPy 1.20

Issue Summary

I encountered an issue while using the bevel library in conjunction with the latest version of NumPy. The library appears to have a dependency on the deprecated np.int alias, which leads to an AttributeError due to the changes introduced in NumPy 1.20.

Steps to Reproduce

  1. Installed bevel library.
  2. Attempted to use OrderedLogit class with the latest NumPy version.
  3. Encountered AttributeError: module 'numpy' has no attribute 'int'.

Expected Behavior

I expected the library to handle the latest NumPy version without issues.

Workaround

As a temporary workaround, I downgraded to NumPy 1.19.5, where np.int is still available. However, this is not a sustainable solution.

Additional Information

  • Python Version: 3.10.12
  • bevel Library Version: 0.1.0
  • NumPy Version: 1.25.2

Suggestions

It would be great if the library were updated to replace the deprecated np.int alias and ensure compatibility with the latest versions of NumPy.
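A sketch of the drop-in replacement: np.int was deprecated in NumPy 1.20 and removed in 1.24, and either the builtin int or an explicit sized dtype takes its place.

```python
import numpy as np

# The builtin int is the drop-in replacement for the removed np.int
# alias; use a sized dtype such as np.int64 when the width matters.
indicator = np.zeros(4, dtype=int)      # instead of dtype=np.int
counts = np.arange(4).astype(np.int64)  # explicit fixed-width integer
```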

Thank you!

Handle boolean input types

Remove the burden/surprise of converting the type from the user. scikit-learn converts it automatically - I think bevel should too
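A minimal sketch of the coercion, assuming a helper along these lines (the function name is hypothetical, and scikit-learn's own check_array handles many more cases):

```python
import numpy as np

def coerce_design_matrix(X):
    """Cast boolean inputs to floats so downstream linear algebra sees
    0.0/1.0 columns, mirroring scikit-learn's automatic conversion."""
    X = np.asarray(X)
    if X.dtype == bool:
        X = X.astype(np.float64)
    return X
```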

Regularization

It should be straightforward to implement L2 regularization for linear ordinal regression. Adding L1 regularization, and writing tests for either, will be more challenging.
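A sketch of how an L2 penalty could wrap an existing negative log-likelihood (the parameter layout - slope coefficients first, cutpoints last - is an assumption for illustration, not bevel's actual internals):

```python
import numpy as np

def l2_penalized(nll, lam, n_coef):
    """Wrap a negative log-likelihood with an L2 penalty on the slope
    coefficients only; the cutpoints (tail of params) stay unpenalized."""
    def objective(params):
        beta = params[:n_coef]
        return nll(params) + lam * np.dot(beta, beta)
    return objective
```

The wrapped objective can then be handed to the same optimizer as the unpenalized one.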

adding weights to the regression

ex:

import numpy as np
from bevel.linear_ordinal_regression import OrderedLogit

w = np.array([1.0, 2.0, 3.4, 1.1, 0.5])

ol = OrderedLogit()
ol.fit(X, y, weights=w)

The weights would default to an array of ones if not provided.
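For reference, observation weights usually enter the likelihood as per-observation multipliers on the log-probabilities; a minimal sketch of that objective (not bevel's API):

```python
import numpy as np

def weighted_nll(probs, weights):
    """Negative log-likelihood with per-observation weights; an array of
    ones recovers the ordinary unweighted likelihood."""
    return -np.sum(weights * np.log(probs))
```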

Use sklearn nomenclature

Variables like X_scale have a different meaning in sklearn. It would be nice to be consistent with the industry standard where possible (or at least not inconsistent).

Binary classification for ordinal regression

This paper talks about the benefit of using binary classification for ordinal regression: https://papers.nips.cc/paper/3125-ordinal-regression-by-extended-binary-classification.pdf
The major benefits: 1) no need to invent a new loss function or optimization routine; 2) any improvements to logistic regression are inherited for free.
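A sketch of the data transformation at the heart of that reduction (illustrative; classifier fitting is left to any off-the-shelf logistic regression):

```python
import numpy as np

def extended_binary_targets(y):
    """One binary problem per threshold k, with target 1[y > k], following
    the extended-binary-classification reduction. Column k of the result
    answers "is y greater than classes[k]?"; class probabilities are then
    recovered as P(y = k) = P(y > k-1) - P(y > k)."""
    classes = np.unique(y)
    targets = np.stack([(y > k).astype(int) for k in classes[:-1]], axis=1)
    return classes, targets
```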
