madrury / py-glm

Generalized Linear Models in Sklearn Style

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%

python statistical-learning machine-learning parametric regularization regression

py-glm's Introduction

py-glm: Generalized Linear Models in Python

py-glm is a library for fitting, inspecting, and evaluating Generalized Linear Models in Python.

Installation

The py-glm library can be installed directly from GitHub.

pip install git+https://github.com/madrury/py-glm.git

Features

Model Fitting

py-glm supports models from various exponential families:

from glm.glm import GLM
from glm.families import Gaussian, Bernoulli, Poisson, Exponential

linear_model = GLM(family=Gaussian())
logistic_model = GLM(family=Bernoulli())
poisson_model = GLM(family=Poisson())
exponential_model = GLM(family=Exponential())

Models with dispersion parameters are also supported. The dispersion parameters in these models are estimated using the deviance.

from glm.families import QuasiPoisson, Gamma

quasi_poisson_model = GLM(family=QuasiPoisson())
gamma_model = GLM(family=Gamma())
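
As a rough illustration of what "estimated using the deviance" can mean, one common estimator divides the model deviance by the residual degrees of freedom. The sketch below does this for a Poisson-type mean using plain NumPy. The helper names poisson_deviance and estimate_dispersion are hypothetical and are not part of py-glm, whose exact estimator may differ.

```python
import numpy as np

def poisson_deviance(y, mu):
    """Poisson deviance, using the convention 0 * log(0) = 0."""
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    # Where y == 0 the term y * log(y / mu) is defined to be 0,
    # so substitute a ratio of 1 there (log(1) == 0).
    ratio = np.where(y > 0, y / mu, 1.0)
    return 2.0 * np.sum(y * np.log(ratio) - (y - mu))

def estimate_dispersion(y, mu, n_params):
    """Deviance divided by the residual degrees of freedom (n - p)."""
    return poisson_deviance(y, mu) / (len(y) - n_params)

# Illustrative observations and fitted means (not from a real model).
y = np.array([0.0, 2.0, 3.0, 7.0, 4.0])
mu = np.array([1.0, 2.0, 2.5, 5.0, 4.0])
phi = estimate_dispersion(y, mu, n_params=2)
```

A value of phi well above 1 would suggest overdispersion relative to a plain Poisson model, which is the situation the quasi-Poisson family is meant for.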

Fitting a model proceeds in sklearn style, and uses the Fisher scoring algorithm:

logistic_model.fit(X, y_logistic)
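
For readers unfamiliar with Fisher scoring, here is a minimal sketch of the idea for logistic regression: each iteration solves a linear system built from the score (gradient of the log-likelihood) and the expected Fisher information. This is a standalone NumPy illustration under simplifying assumptions (no offsets, weights, or penalty), not py-glm's actual implementation. The function name fisher_scoring_logistic is hypothetical.

```python
import numpy as np

def fisher_scoring_logistic(X, y, max_iter=25, tol=1e-8):
    """Fit logistic regression coefficients by Fisher scoring."""
    coef = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ coef))         # fitted probabilities
        score = X.T @ (y - mu)                        # gradient of the log-likelihood
        weights = mu * (1.0 - mu)                     # Bernoulli variance function
        information = X.T @ (X * weights[:, None])    # expected Fisher information
        step = np.linalg.solve(information, score)
        coef = coef + step
        if np.max(np.abs(step)) < tol:                # stop when the update is tiny
            break
    return coef

# Simulated data with an intercept column and one feature.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
true_coef = np.array([0.5, -1.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_coef)))
coef = fisher_scoring_logistic(X, y)
```

With the canonical logit link, Fisher scoring coincides with Newton's method, which is why convergence usually takes only a handful of iterations.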

If your data resides in a pandas.DataFrame, you can pass this to fit along with a model formula.

logistic_model.fit(X, formula="y ~ Moshi + SwimSwim")

Offsets and sample weights are supported when fitting:

linear_model.fit(X, y_linear, sample_weights=sample_weights)
poisson_model.fit(X, y_poisson, offset=np.log(expos))
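
The offset in the Poisson example is the log of the exposure because, under a log link, adding log(exposure) to the linear predictor is the same as multiplying the per-unit rate by the exposure. The sketch below checks that identity numerically. The coefficient and exposure values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
coef = np.array([0.3, -0.2])                       # hypothetical fitted coefficients
exposure = np.array([1.0, 2.0, 0.5, 10.0, 3.0])    # e.g. time at risk per row

# With a log link, the offset is added to the linear predictor...
mu_with_offset = np.exp(X @ coef + np.log(exposure))
# ...which equals the per-unit rate scaled by the exposure.
mu_scaled_rate = exposure * np.exp(X @ coef)
```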

Predictions are also made in sklearn style:

logistic_model.predict(X)

Note: There is one major place we deviate from the sklearn interface. The predict method on a GLM object always returns an estimate of the conditional expectation E[y | X]. This is in contrast to sklearn behavior for classification models, where predict returns a class assignment. We make this choice so that predict has the same meaning for every model family in py-glm. If you would like class assignments from a model, you will need to threshold the probability returned by predict manually.
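
A minimal sketch of that manual thresholding step follows. The probabilities array stands in for the output of logistic_model.predict(X), and the 0.5 cutoff is an assumption to adjust for your application.

```python
import numpy as np

# Stand-in for logistic_model.predict(X): estimates of P(y = 1 | X).
probabilities = np.array([0.05, 0.40, 0.50, 0.73, 0.99])

threshold = 0.5  # assumed cutoff; pick one that suits the costs of each error type
class_assignments = (probabilities >= threshold).astype(int)
```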

Inference

Once the model is fit, parameter estimates, parameter covariance estimates, and p-values from a standard z-test are available:

logistic_model.coef_
logistic_model.coef_covariance_matrix_
logistic_model.coef_standard_error_
logistic_model.p_values_
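
To make the relationship between these attributes concrete, the sketch below shows how standard errors and two-sided z-test p-values are conventionally derived from a coefficient vector and its covariance matrix. It is a textbook calculation under a normal approximation, not a copy of py-glm's code. The function name z_test_p_values and the input numbers are illustrative.

```python
import math
import numpy as np

def z_test_p_values(coef, covariance):
    """Two-sided p-values for H0: coefficient == 0 under a normal approximation."""
    standard_errors = np.sqrt(np.diag(covariance))
    z = np.asarray(coef) / standard_errors
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_values = np.array([math.erfc(abs(zi) / math.sqrt(2.0)) for zi in z])
    return standard_errors, p_values

coef = np.array([1.02, -2.00, 0.0])             # illustrative values only
covariance = np.diag([0.01, 0.02, 0.02]) ** 2   # diagonal covariance for simplicity
standard_errors, p_values = z_test_p_values(coef, covariance)
```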

To get a quick summary, use the summary method:

logistic_model.summary()

Binomial GLM Model Summary.
===============================================
Name         Parameter Estimate  Standard Error
-----------------------------------------------
Intercept                  1.02            0.01
Moshi                     -2.00            0.02
SwimSwim                   1.00            0.02

Re-sampling methods, namely the parametric and non-parametric bootstraps, are also supported in the simulation subpackage:

from glm.simulation import Simulation

sim = Simulation(logistic_model)
sim.parametric_bootstrap(X, n_sim=1000)
sim.non_parametric_bootstrap(X, n_sim=1000)
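
For intuition, a non-parametric bootstrap resamples the rows of the data with replacement and refits the model on each resample. The sketch below shows that idea with ordinary least squares standing in for the model. It does not use the glm.simulation API, and the function names are hypothetical.

```python
import numpy as np

def non_parametric_bootstrap(X, y, fit, n_sim=200, seed=0):
    """Refit `fit` on rows resampled with replacement; one coefficient row per refit."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    coefs = []
    for _ in range(n_sim):
        idx = rng.integers(0, n, size=n)   # row indices sampled with replacement
        coefs.append(fit(X[idx], y[idx]))
    return np.array(coefs)

def least_squares(X, y):
    """Stand-in model: ordinary least squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=200)

boot_coefs = non_parametric_bootstrap(X, y, least_squares)
boot_standard_errors = boot_coefs.std(axis=0)   # spread across refits
```

The spread of the refitted coefficients gives an empirical estimate of their sampling variability, which can be compared with the analytic standard errors.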

Regularization

Ridge regression is supported for each model. Note that the regularization parameter is called alpha rather than lambda, because lambda is a reserved word in Python:

logistic_model.fit(X, y_logistic, alpha=1.0)
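
To show the effect of the penalty, here is a minimal sketch of ridge regression for the Gaussian case, where the penalized solution has a closed form. How py-glm scales the penalty, and whether it penalizes the intercept, is not stated above, so the objective in the docstring is an assumption. The function name ridge_least_squares is hypothetical.

```python
import numpy as np

def ridge_least_squares(X, y, alpha):
    """Minimize 0.5 * ||y - X b||^2 + 0.5 * alpha * ||b||^2 in closed form."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

unpenalized = ridge_least_squares(X, y, alpha=0.0)   # plain least squares
penalized = ridge_least_squares(X, y, alpha=50.0)    # coefficients shrunk toward zero
```

Larger values of alpha shrink the coefficient vector more strongly toward zero.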

References

Marlene Müller (2004). Generalized Linear Models.

Warning

The glmnet code included in glm.glmnet is experimental. Please use at your own risk.

py-glm's Issues

Add negative binomial family

I wouldn't normally feign the authority to make requests of you, but you asked me to do this, so here we go. Please consider adding the negative binomial family.

Thank you! This repo is badass.

Deviance Bernoulli distribution

Isn't there a mistake in the deviance of the Bernoulli distribution?

The deviance should be ... log(y_i/mu_i) ... log((1-y_i) / (1 - mu_i))
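
For reference, the textbook form of the Bernoulli deviance is written out below, using the convention 0 log 0 = 0. This is the standard definition, not a transcription of the library's code or of the elided expression above. For binary responses it reduces to minus twice the log-likelihood.

```latex
D = 2 \sum_{i} \left[ y_i \log\frac{y_i}{\mu_i} + (1 - y_i) \log\frac{1 - y_i}{1 - \mu_i} \right]
  = -2 \sum_{i} \left[ y_i \log \mu_i + (1 - y_i) \log(1 - \mu_i) \right]
  \quad \text{for } y_i \in \{0, 1\}, \text{ with } 0 \log 0 = 0.
```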

Support for Tweedie distribution

If/when I have time I'll fork the repo and add it.

Add formula support.

Import Error - from glm.glm import GLM

Traceback (most recent call last):
File "C:\Python27\lib\site-packages\IPython\core\interactiveshell.py", line 2878, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "", line 1, in
from glm.glm import GLM
File "C:\Users\Ben.p2\pool\plugins\org.python.pydev.core_7.0.3.201811082356\pysrc_pydev_bundle\pydev_import_hook.py", line 21, in do_import
module = self._system_import(name, *args, **kwargs)
File "C:\Python27\lib\site-packages\py_glm-0.0.1-py2.7.egg\glm\glm.py", line 104
def fit(self, X, y=None, formula=None, *,

NaN coefficients

Hi,

I ran many models using your library, and in one run I got back NaN coefficients.
I worked around that by checking in the glm.py file whether the returned values are NaN:

# line 219 py-glm/glm/glm.py
diff = coef - np.linalg.solve(ddbeta, dbeta)
if np.isnan(diff).any():
    break

I was wondering whether this is a valid solution, or whether we could use a different check based on a tolerance.

I also checked the Julia package, which performs a finite check.

What are your thoughts on that?
Could we integrate that change in the package?
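
One possible shape for such a guard is sketched below: it rejects a Newton update when any entry is NaN or infinite, and pairs that with a relative-change convergence test. The names coef, ddbeta and dbeta follow the snippet above. The functions is_finite_step, guarded_update and has_converged are hypothetical and are not part of py-glm.

```python
import numpy as np

def is_finite_step(step):
    """True when every entry of the proposed update is finite (no NaN or inf)."""
    return bool(np.all(np.isfinite(step)))

def guarded_update(coef, ddbeta, dbeta):
    """Return the updated coefficients, or None if the Newton step is not finite."""
    step = np.linalg.solve(ddbeta, dbeta)
    if not is_finite_step(step):
        return None          # caller can stop, damp the step, or raise
    return coef - step

def has_converged(old_coef, new_coef, tol=1e-8):
    """Relative-change convergence test on the coefficient vector."""
    change = np.linalg.norm(new_coef - old_coef)
    return bool(change <= tol * (np.linalg.norm(old_coef) + tol))

new_coef = guarded_update(np.array([1.0, 1.0]), 2.0 * np.eye(2), np.array([2.0, 4.0]))
```

Using np.isfinite rather than np.isnan also catches infinite updates, which tend to appear an iteration before NaN values do.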

failing comparison of elastic net with scikit-learn

I tried to compare pyglmnet.GLM with ElasticNet from scikit-learn and could not get it to work. Code to reproduce:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from pyglmnet import GLM

def rmse(a, b):
    return np.sqrt(np.mean((a - b) ** 2))

X, Y, coef_ = make_regression(
    n_samples=1000, n_features=1000,
    noise=0.1, n_informative=10, coef=True,
    random_state=42)

alpha = 0.1
l1_ratio=0.5

sk = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, tol = 1e-5).fit(X, Y)
pg = GLM(distr='gaussian', alpha=l1_ratio, reg_lambda=alpha, solver='cdfast', tol = 1e-5).fit(X, Y)
print('in-sample rmse sklearn = {}, pyglmnet = {}'.format(rmse(Y, sk.predict(X)), rmse(Y, pg.predict(X))))

Result:
in-sample rmse sklearn = 12.756054997216442, pyglmnet = 161.03496460055504

Support ordinal regression

No real opinion on how you do it, but it seems like an interesting challenge.