jzluo / firthlogist Goto Github PK

View Code? Open in Web Editor NEW

5.0 1.0 3.0 51 KB

A Python implementation of Firth Logistic Regression

License: MIT License

Python 100.00%

firthlogist's Introduction

firthlogist

A Python implementation of Logistic Regression with Firth's bias reduction.

Installation

pip install firthlogist

Usage

firthlogist is sklearn compatible and follows the sklearn API.

>>> from firthlogist import FirthLogisticRegression, load_sex2
>>> fl = FirthLogisticRegression()
>>> X, y, feature_names = load_sex2()
>>> fl.fit(X, y)
FirthLogisticRegression()
>>> fl.summary(xname=feature_names)
                 coef    std err     [0.025      0.975]      p-value
---------  ----------  ---------  ---------  ----------  -----------
age        -1.10598     0.42366   -1.97379   -0.307427   0.00611139
oc         -0.0688167   0.443793  -0.941436   0.789202   0.826365
vic         2.26887     0.548416   1.27304    3.43543    1.67219e-06
vicl       -2.11141     0.543082  -3.26086   -1.11774    1.23618e-05
vis        -0.788317    0.417368  -1.60809    0.0151846  0.0534899
dia         3.09601     1.67501    0.774568   8.03028    0.00484687
Intercept   0.120254    0.485542  -0.818559   1.07315    0.766584

Log-Likelihood: -132.5394
Newton-Raphson iterations: 8

Parameters

max_iter: int, default=25

The maximum number of Newton-Raphson iterations.

max_halfstep: int, default=25

The maximum number of step-halvings in one Newton-Raphson iteration.

max_stepsize: int, default=5

The maximum step size - for each coefficient, the step size is forced to be less than max_stepsize.

pl_max_iter: int, default=100

The maximum number of Newton-Raphson iterations for finding profile likelihood confidence intervals.

pl_max_halfstep: int, default=25

The maximum number of step-halvings in one iteration for finding profile likelihood confidence intervals.

pl_max_stepsize: int, default=5

The maximum step size while finding PL confidence intervals - for each coefficient, the step size is forced to be less than max_stepsize.

tol: float, default=0.0001

Convergence tolerance for stopping.

fit_intercept: bool, default=True

Specifies if intercept should be added.

skip_pvals: bool, default=False

If True, p-values will not be calculated. Calculating the p-values can be expensive if wald=False since the fitting procedure is repeated for each coefficient.

skip_ci: bool, default=False

If True, confidence intervals will not be calculated. Calculating the confidence intervals via profile likelihoood is time-consuming.

alpha: float, default=0.05

Significance level (confidence interval = 1-alpha). 0.05 as default for 95% CI.

wald: bool, default=False

If True, uses Wald method to calculate p-values and confidence intervals.

test_vars: Union[int, List[int]], default=None

Index or list of indices of the variables for which to calculate confidence intervals and p-values. If None, calculate for all variables. This option has no effect if wald=True.

Attributes

bse_

Standard errors of the coefficients.

classes_

A list of the class labels.

ci_

The fitted profile likelihood confidence intervals.

coef_

The coefficients of the features.

intercept_

Fitted intercept. If fit_intercept = False, the intercept is set to zero.

loglik_

Fitted penalized log-likelihood.

n_iter_

Number of Newton-Raphson iterations performed.

pvals_

p-values calculated by penalized likelihood ratio tests.

References

Firth, D (1993). Bias reduction of maximum likelihood estimates. Biometrika 80, 27–38.

Heinze G, Schemper M (2002). A solution to the problem of separation in logistic regression. Statistics in Medicine 21: 2409-2419.

firthlogist's People

Contributors

Stargazers

Watchers

Forkers

xies hoeze maxkloucek

firthlogist's Issues

add a result summary method

A summary method to present in a prettier format would be nice. I don't think it needs to be as detailed as the summary in statsmodels. This is what summary in logistf looks like:

                   coef  se(coef) lower 0.95  upper 0.95       Chisq            p method
(Intercept)  0.12025405 0.4763429 -0.8185591  1.07315122  0.06286298 8.020268e-01      2
age         -1.10598131 0.4149021 -1.9737884 -0.30742514  7.50773092 6.143472e-03      2
oc          -0.06881673 0.4344026 -0.9414363  0.78920202  0.02467044 8.751911e-01      2
vic          2.26887464 0.5384872  1.2730216  3.43543273 22.93139022 1.678877e-06      2
vicl        -2.11140817 0.5320395 -3.2608611 -1.11773495 19.10407252 1.237805e-05      2
vis         -0.78831694 0.4089620 -1.6080879  0.01518468  3.69740975 5.449701e-02      2
dia          3.09601166 1.5052197  0.7745682  8.03029352  7.89693139 4.951873e-03      2

Method: 1-Wald, 2-Profile penalized log-likelihood, 3-None

Likelihood ratio test=49.09064 on 6 df, p=7.15089e-09, n=239
Wald test = 31.96835 on 6 df, p = 1.654713e-05

Should be called like:

firth = FirthLogisticRegression()
firth.fit(X, y)
firth.summary()

Maybe an optional var_names parameter in .summary() so that the feature names can be printed in the summary as well, otherwise x0, x1, etc.

Probably the most straightforward way is to just stick everything in a Pandas dataframe and print (and perhaps return?) that. Pretty printing with tabulate can be considered, maybe with tabulate as just an optional dependency?

Python 3.11 support

Hi @jzluo ! I tried to install it through pip but I am already on 3.11, since fedora and a few other distros are already on 3.11, what is missing to get 3.11 support? A fast scan over your code seems everything is ok at least syntax wise.

Wald test and confidence intervals

Add a mode to use Wald method for tests and confidence intervals instead of LRT and profile likelihood

implement profile penalized likelihood confidence intervals

Implement penalized profile likelihood to calculate confidence intervals for the coefficients

Implement FLIC and FLAC modifications

Paper for FLIC and FLAC: https://onlinelibrary.wiley.com/doi/10.1002/sim.7273

Whereas it reduces the bias in maximum likelihood estimates of coefficients, bias towards one-half is introduced
in the predicted probabilities. The stronger the imbalance of the outcome, the more severe is the bias in the
predicted probabilities. We propose two simple modifications of Firth’s logistic regression resulting in unbiased
predicted probabilities.

#! unzip sample.zip
with open(r"X.pickle", "rb") as fd:
    X = pickle.load(fd)
with open(r"y.pickle", "rb") as fd:
    y = pickle.load(fd)

from firthlogist import FirthLogisticRegression
FirthLogisticRegression().fit(X, y)

This fails after the second round of optimization because all coefficients get NaN.

Detailed logs + stack trace:

/opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/sklearn/utils/validation.py:993: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
/opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:322: RuntimeWarning: invalid value encountered in log
/opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:322: RuntimeWarning: invalid value encountered in log
/opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:322: RuntimeWarning: invalid value encountered in log
/opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:322: RuntimeWarning: invalid value encountered in log
/opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:322: RuntimeWarning: invalid value encountered in log
/opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:322: RuntimeWarning: invalid value encountered in log
/opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:322: RuntimeWarning: invalid value encountered in log
/opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:322: RuntimeWarning: invalid value encountered in log
/opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:322: RuntimeWarning: invalid value encountered in log
/opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:322: RuntimeWarning: invalid value encountered in log
/opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:322: RuntimeWarning: invalid value encountered in log
/opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:322: RuntimeWarning: invalid value encountered in log
/opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:322: RuntimeWarning: invalid value encountered in log
/opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:322: RuntimeWarning: invalid value encountered in log
/opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:322: RuntimeWarning: invalid value encountered in log
/opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:322: RuntimeWarning: invalid value encountered in log
/opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:322: RuntimeWarning: invalid value encountered in log
/opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:322: RuntimeWarning: invalid value encountered in log
/opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:153: ConvergenceWarning: Firth logistic regression failed to converge. Try increasing max_iter.
/opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:322: RuntimeWarning: invalid value encountered in log
/opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/numpy/linalg/linalg.py:2154: RuntimeWarning: invalid value encountered in det

---------------------------------------------------------------------------
LinAlgError                               Traceback (most recent call last)
Cell In[164], line 2
      1 from firthlogist import FirthLogisticRegression
----> 2 FirthLogisticRegression().fit(X, y)

File /opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:161, in FirthLogisticRegression.fit(self, X, y)
    159 if not self.skip_ci:
    160     if not self.wald:
--> 161         self.ci_ = _profile_likelihood_ci(
    162             X=X,
    163             y=y,
    164             fitted_coef=self.coef_,
    165             full_loglik=self.loglik_,
    166             max_iter=self.pl_max_iter,
    167             max_stepsize=self.pl_max_stepsize,
    168             max_halfstep=self.pl_max_halfstep,
    169             tol=self.tol,
    170             alpha=self.alpha,
    171             test_vars=self.test_vars,
    172         )
    173     else:
    174         self.ci_ = _wald_ci(self.coef_, self.bse_, self.alpha)

File /opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/firthlogist/firthlogist.py:440, in _profile_likelihood_ci(X, y, fitted_coef, full_loglik, max_iter, max_stepsize, max_halfstep, tol, alpha, test_vars)
    438 U_star = np.matmul(X.T, y - preds + np.multiply(hat, 0.5 - preds))
    439 # https://github.com/georgheinze/logistf/blob/master/src/logistf.c#L780-L781
--> 440 inv_fisher = np.linalg.pinv(fisher_info_mtx)
    441 tmp1x1 = U_star @ np.negative(inv_fisher) @ U_star
    442 underRoot = (
    443     -2
    444     * ((LL0 - loglike) + 0.5 * tmp1x1)
    445     / (inv_fisher[coef_idx, coef_idx])
    446 )

File <__array_function__ internals>:180, in pinv(*args, **kwargs)

File /opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/numpy/linalg/linalg.py:1998, in pinv(a, rcond, hermitian)
   1996     return wrap(res)
   1997 a = a.conjugate()
-> 1998 u, s, vt = svd(a, full_matrices=False, hermitian=hermitian)
   2000 # discard small singular values
   2001 cutoff = rcond[..., newaxis] * amax(s, axis=-1, keepdims=True)

File <__array_function__ internals>:180, in svd(*args, **kwargs)

File /opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/numpy/linalg/linalg.py:1657, in svd(a, full_matrices, compute_uv, hermitian)
   1654         gufunc = _umath_linalg.svd_n_s
   1656 signature = 'D->DdD' if isComplexType(t) else 'd->ddd'
-> 1657 u, s, vh = gufunc(a, signature=signature, extobj=extobj)
   1658 u = u.astype(result_t, copy=False)
   1659 s = s.astype(_realType(result_t), copy=False)

File /opt/anaconda/envs/firthlogist/lib/python3.9/site-packages/numpy/linalg/linalg.py:98, in _raise_linalgerror_svd_nonconvergence(err, flag)
     97 def _raise_linalgerror_svd_nonconvergence(err, flag):
---> 98     raise LinAlgError("SVD did not converge")

LinAlgError: SVD did not converge