
bevel's Introduction


Ordinal regression refers to a number of techniques that are designed to classify inputs into ordered (or ordinal) categories. This type of data is common in social science research settings where the dependent variable often comes from opinion polls or evaluations. For example, ordinal regression can be used to predict the letter grades of students based on the time they spend studying, or Likert scale responses to a survey based on the annual income of the respondent.

In People Analytics at Shopify, we use ordinal regression to empower Shopify employees. Our annual engagement survey contains dozens of scale questions about wellness, team health, leadership and alignment. To better dig into this data we built bevel, a repository that contains simple, easy-to-use Python implementations of standard ordinal regression techniques.

Using bevel

Fitting

bevel's API closely follows scikit-learn's: you instantiate a class and call its fit method with the design matrix (the independent variables) and the outcome array (the dependent variable). For bevel, the outcome array contains values from a totally orderable set (for example {0, 1, 2, ...}, {'A', 'B', 'C', ...}, or {'01', '02', '03', ...}) representing your ordinal data. This may require a pre-processing step, such as encoding survey responses as integers.

The design matrix can be a numpy array or a pandas DataFrame. The benefit of the latter is that the DataFrame's column names are displayed in the inference output later.
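As an illustration of this pre-processing (the column names and Likert categories below are invented for the example, not part of bevel):

```python
import pandas as pd

# Hypothetical survey data: map Likert responses to ordered integers
# before fitting. The mapping defines the ordering bevel will use.
likert = {"strongly disagree": 0, "disagree": 1, "neutral": 2,
          "agree": 3, "strongly agree": 4}

df = pd.DataFrame({
    "annual_income": [40_000, 85_000, 62_000],
    "response": ["disagree", "strongly agree", "neutral"],
})
X = df[["annual_income"]]       # design matrix keeps its column names
y = df["response"].map(likert)  # ordinal outcome encoded as integers
```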

Below is an example of fitting with the OrderedLogit model.

from bevel.linear_ordinal_regression import OrderedLogit

ol = OrderedLogit()
ol.fit(X, y)  # X: design matrix, y: array of ordinal outcomes
Inference and prediction

After bevel fits the model to the data, additional methods become available. To see the coefficients of the fitted linear model, along with their standard errors and confidence intervals, use the print_summary method. Below is the output for the UCLA dataset.

ol.print_summary()
"""
                   beta  se(beta)      p  lower 0.95  upper 0.95
attribute names
pared            1.0477    0.2658 0.0001      0.5267      1.5686  ***
public          -0.0587    0.2979 0.8439     -0.6425      0.5251
gpa              0.6157    0.2606 0.0182      0.1049      1.1266    *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Somers' D = 0.158
"""

These values are available as a pandas DataFrame via the summary property of the fitted class. The Somers' D value measures the goodness-of-fit of the model, analogous to the R² value in ordinary linear regression. Unlike R², however, it ranges from -1 (totally discordant) to 1 (totally concordant).
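For intuition, Somers' D can be computed by counting concordant and discordant pairs of observations; the sketch below is an illustration of the statistic, not bevel's internal implementation:

```python
import numpy as np

def somers_d(pred, y):
    """Somers' D of predictions with respect to an ordinal outcome y.

    Only pairs whose outcomes differ are considered; pairs tied on the
    prediction count in the denominator but are neither concordant nor
    discordant."""
    pred = np.asarray(pred, dtype=float)
    y = np.asarray(y)
    conc = disc = total = 0
    n = len(y)
    for i in range(n):
        for j in range(i + 1, n):
            if y[i] == y[j]:
                continue  # pairs tied on the outcome are ignored
            total += 1
            sign = 1 if y[i] > y[j] else -1
            d = (pred[i] - pred[j]) * sign
            if d > 0:
                conc += 1
            elif d < 0:
                disc += 1
    return (conc - disc) / total
```

Perfectly concordant predictions give 1, perfectly discordant give -1, and ties in the prediction pull the statistic toward 0.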

Another goal of fitting is predicting outcomes from new datasets. For this, bevel has three prediction methods, depending on your goal.

ol.predict_probabilities(X)   # returns an array with the probability of each class
ol.predict_class(X)           # returns the class with the highest probability
ol.predict_linear_product(X)  # returns the dot product of X and the fitted coefficients
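Under the hood, an ordered-logit (proportional-odds) model derives per-class probabilities by differencing cumulative logistic probabilities evaluated at the fitted cutpoints. The sketch below shows that arithmetic with generic beta (coefficients) and alpha (sorted cutpoints) arrays; it is an illustration of the model, not bevel's actual code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ordered_logit_probabilities(X, beta, alpha):
    """P(y <= k | x) = sigmoid(alpha_k - x.beta); differencing the
    cumulative probabilities yields one probability per class."""
    eta = X @ beta                                  # linear predictor, shape (n,)
    cum = sigmoid(alpha[None, :] - eta[:, None])    # (n, K-1) cumulative probs
    cum = np.hstack([np.zeros((len(eta), 1)), cum, np.ones((len(eta), 1))])
    return np.diff(cum, axis=1)                     # (n, K), rows sum to 1
```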

bevel's People

Contributors

camdavidsonpilon, cursedcoder, rossdiener


bevel's Issues

Test indicator functions

There should be a test of the indicator matrices in TestLinearOrdinalRegression. Important to check that arbitrary ordered y values are correctly handled (e.g. [0, 1, 4, 5, 7])
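One way such a test could pin down the expected mapping (illustrative only): np.unique already produces the dense ranks an indicator matrix would be built from, so arbitrary ordered values like [0, 1, 4, 5, 7] reduce to contiguous class indices.

```python
import numpy as np

# Arbitrary ordered outcome values map to dense ranks; np.unique returns
# the sorted classes and the inverse mapping in one call.
y = np.array([0, 1, 4, 5, 7, 4, 1])
classes, y_rank = np.unique(y, return_inverse=True)
# classes -> [0, 1, 4, 5, 7]; y_rank -> [0, 1, 2, 3, 4, 2, 1]
```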

`predict_linear_product` throws ValueError due to deprecation

In linear_ordinal_regression.py, predict_linear_product throws the following ValueError.

152 if X.ndim == 1:
153 X = X[None, :]
--> 154 return X.dot(self.beta_)[:, None]

File c:\ProgramData\anaconda3\envs\intel\lib\site-packages\pandas\core\series.py:1072, in Series.getitem(self, key)
1069 key = np.asarray(key, dtype=bool)
1070 return self._get_rows_with_mask(key)
-> 1072 return self._get_with(key)

File c:\ProgramData\anaconda3\envs\intel\lib\site-packages\pandas\core\series.py:1082, in Series._get_with(self, key)
1077 raise TypeError(
1078 "Indexing a Series with DataFrame is not "
1079 "supported, use the appropriate DataFrame column"
1080 )
1081 elif isinstance(key, tuple):
-> 1082 return self._get_values_tuple(key)
1084 elif not is_list_like(key):
1085 # e.g. scalars that aren't recognized by lib.is_scalar, GH#32684
1086 return self.loc[key]

File c:\ProgramData\anaconda3\envs\intel\lib\site-packages\pandas\core\series.py:1122, in Series._get_values_tuple(self, key)
1117 if com.any_none(*key):
1118 # mpl compat if we look up e.g. ser[:, np.newaxis];
1119 # see tests.series.timeseries.test_mpl_compat_hack
1120 # the asarray is needed to avoid returning a 2D DatetimeArray
1121 result = np.asarray(self._values[key])
-> 1122 disallow_ndim_indexing(result)
1123 return result
1125 if not isinstance(self.index, MultiIndex):

File c:\ProgramData\anaconda3\envs\intel\lib\site-packages\pandas\core\indexers\utils.py:341, in disallow_ndim_indexing(result)
333 """
334 Helper function to disallow multi-dimensional indexing on 1D Series/Index.
335
(...)
338 in GH#30588.
339 """
340 if np.ndim(result) > 1:
--> 341 raise ValueError(
342 "Multi-dimensional indexing (e.g. obj[:, None]) is no longer "
343 "supported. Convert to a numpy array before indexing instead."
344 )

ValueError: Multi-dimensional indexing (e.g. obj[:, None]) is no longer supported. Convert to a numpy array before indexing instead.

My environment.yml is below.

  - numdifftools=0.9.41=pyhd8ed1ab_0
  - numpy=1.24.3=py310h2d48daa_5
  - numpy-base=1.24.3=py310h0a02333_5
  - pandas=2.1.4=py310hecd3228_0
  - python=3.10.13=h9bd319b_0
  - scipy=1.10.1=py310hec14841_8
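A minimal fix sketch, assuming the goal is simply to coerce pandas inputs to numpy arrays before the [:, None] indexing (the standalone function form here is for illustration; in bevel this would be the method body):

```python
import numpy as np

def linear_product(X, beta):
    # Coercing pandas objects to numpy arrays up front avoids the removed
    # multi-dimensional indexing (obj[:, None]) on a pandas Series.
    X = np.asarray(X)
    beta = np.asarray(beta)
    if X.ndim == 1:
        X = X[None, :]
    return X.dot(beta)[:, None]
```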

Compatibility Issue: Deprecated 'np.int' Alias in bevel Library with NumPy 1.20

Issue Summary

I encountered an issue while using the bevel library in conjunction with the latest version of NumPy. The library appears to have a dependency on the deprecated np.int alias, which leads to an AttributeError due to the changes introduced in NumPy 1.20.

Steps to Reproduce

  1. Installed bevel library.
  2. Attempted to use OrderedLogit class with the latest NumPy version.
  3. Encountered AttributeError: module 'numpy' has no attribute 'int'.

Expected Behavior

I expected the library to handle the latest NumPy version without issues.

Workaround

As a temporary workaround, I downgraded to NumPy 1.19.5, where np.int is still available. However, this is not a sustainable solution.

Additional Information

  • Python Version: 3.10.12
  • bevel Library Version: 0.1.0
  • NumPy Version: 1.25.2

Suggestions

It would be great if the library were updated to replace the deprecated np.int alias and ensure compatibility with the latest versions of NumPy.
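A sketch of the drop-in replacement: np.int was deprecated in NumPy 1.20 and removed in 1.24, and either the builtin int or an explicit sized dtype takes its place.

```python
import numpy as np

# The builtin int is the drop-in replacement for the removed np.int
# alias; use a sized dtype such as np.int64 when the width matters.
indicator = np.zeros(4, dtype=int)      # instead of dtype=np.int
counts = np.arange(4).astype(np.int64)  # explicit fixed-width integer
```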

Thank you!

Handle boolean input types

Remove the burden/surprise of converting the type from the user. scikit-learn converts it automatically - I think bevel should too
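A minimal sketch of the coercion, assuming a helper along these lines (the function name is hypothetical, and scikit-learn's own check_array handles many more cases):

```python
import numpy as np

def coerce_design_matrix(X):
    """Cast boolean inputs to floats so downstream linear algebra sees
    0.0/1.0 columns, mirroring scikit-learn's automatic conversion."""
    X = np.asarray(X)
    if X.dtype == bool:
        X = X.astype(np.float64)
    return X
```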

Regularization

It should be straightforward to implement L2 regularization for linear ordinal regression. Adding L1 regularization, and writing tests for either, will be more challenging.
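A sketch of how an L2 penalty could wrap an existing negative log-likelihood (the parameter layout - slope coefficients first, cutpoints last - is an assumption for illustration, not bevel's actual internals):

```python
import numpy as np

def l2_penalized(nll, lam, n_coef):
    """Wrap a negative log-likelihood with an L2 penalty on the slope
    coefficients only; the cutpoints (tail of params) stay unpenalized."""
    def objective(params):
        beta = params[:n_coef]
        return nll(params) + lam * np.dot(beta, beta)
    return objective
```

The wrapped objective can then be handed to the same optimizer as the unpenalized one.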

adding weights to the regression

ex:

import numpy as np
from bevel.linear_ordinal_regression import OrderedLogit

w = np.array([1.0, 2.0, 3.4, 1.1, 0.5])

ol = OrderedLogit()
ol.fit(X, y, weights=w)

The weights would default to an array of ones if not provided.
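For reference, observation weights usually enter the likelihood as per-observation multipliers on the log-probabilities; a minimal sketch of that objective (not bevel's API):

```python
import numpy as np

def weighted_nll(probs, weights):
    """Negative log-likelihood with per-observation weights; an array of
    ones recovers the ordinary unweighted likelihood."""
    return -np.sum(weights * np.log(probs))
```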

Use sklearn nomenclature

Variables like X_scale have a different meaning in sklearn. It would be nice to be consistent with the industry standard where possible (or at least not inconsistent).

Binary classification for ordinal regression

This paper talks about the benefit of using binary classification for ordinal regression: https://papers.nips.cc/paper/3125-ordinal-regression-by-extended-binary-classification.pdf
The major benefits: 1) no need to invent a new loss function or optimization routine; 2) any improvements to logistic regression are inherited for free.
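A sketch of the data transformation at the heart of that reduction (illustrative; classifier fitting is left to any off-the-shelf logistic regression):

```python
import numpy as np

def extended_binary_targets(y):
    """One binary problem per threshold k, with target 1[y > k], following
    the extended-binary-classification reduction. Column k of the result
    answers "is y greater than classes[k]?"; class probabilities are then
    recovered as P(y = k) = P(y > k-1) - P(y > k)."""
    classes = np.unique(y)
    targets = np.stack([(y > k).astype(int) for k in classes[:-1]], axis=1)
    return classes, targets
```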
