Giter VIP home page Giter VIP logo

pyfixest's Introduction

PyFixest: Fast High-Dimensional Fixed Effects Regression in Python

License PyPI -Version PyPI - Python Version Downloads Downloads image Ruff All Contributors

PyFixest is a Python implementation of the formidable fixest package for fast high-dimensional fixed effects regression.

The package aims to mimic fixest syntax and functionality as closely as Python allows: if you know fixest well, the goal is that you won't have to read the docs to get started! In particular, this means that all of fixest's defaults are mirrored by PyFixest - currently with only one small exception.

Nevertheless, for a quick introduction, you can take a look at the documentation or the regression chapter of Arthur Turrell's book on Coding for Economists.

Features

  • OLS, WLS and IV Regression
  • Poisson Regression following the pplmhdfe algorithm
  • Multiple Estimation Syntax
  • Several Robust and Cluster Robust Variance-Covariance Estimators
  • Wild Cluster Bootstrap Inference (via wildboottest)
  • Difference-in-Differences Estimators:
  • Multiple Hypothesis Corrections following the Procedure by Romano and Wolf and Simultaneous Confidence Intervals using a Multiplier Bootstrap
  • Fast Randomization Inference as in the ritest Stata package
  • The Causal Cluster Variance Estimator (CCV) following Abadie et al.

Installation

You can install the release version from PyPI by running

# inside an active virtual environment
python -m pip install pyfixest

or the development version from github by running

python -m pip install git+https://github.com/py-econometrics/pyfixest

Benchmarks

All benchmarks follow the fixest benchmarks. All non-pyfixest timings are taken from the fixest benchmarks.

Quickstart

import pyfixest as pf

data = pf.get_data()
pf.feols("Y ~ X1 | f1 + f2", data=data).summary()
###

Estimation:  OLS
Dep. var.: Y, Fixed effects: f1+f2
Inference:  CRV1
Observations:  997

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| X1            |     -0.919 |        0.065 |   -14.057 |      0.000 | -1.053 |  -0.786 |
---
RMSE: 1.441   R2: 0.609   R2 Within: 0.2

Multiple Estimation

You can estimate multiple models at once by using multiple estimation syntax:

# OLS Estimation: estimate multiple models at once
fit = pf.feols("Y + Y2 ~X1 | csw0(f1, f2)", data = data, vcov = {'CRV1':'group_id'})
# Print the results
fit.etable()
                           est1               est2               est3               est4               est5               est6
------------  -----------------  -----------------  -----------------  -----------------  -----------------  -----------------
depvar                        Y                 Y2                  Y                 Y2                  Y                 Y2
------------------------------------------------------------------------------------------------------------------------------
Intercept      0.919*** (0.121)   1.064*** (0.232)
X1            -1.000*** (0.117)  -1.322*** (0.211)  -0.949*** (0.087)  -1.266*** (0.212)  -0.919*** (0.069)  -1.228*** (0.194)
------------------------------------------------------------------------------------------------------------------------------
f2                            -                  -                  -                  -                  x                  x
f1                            -                  -                  x                  x                  x                  x
------------------------------------------------------------------------------------------------------------------------------
R2                        0.123              0.037              0.437              0.115              0.609              0.168
S.E. type          by: group_id       by: group_id       by: group_id       by: group_id       by: group_id       by: group_id
Observations                998                999                997                998                997                998
------------------------------------------------------------------------------------------------------------------------------
Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001
Format of coefficient cell:
Coefficient (Std. Error)

Adjust Standard Errors "on-the-fly"

Standard Errors can be adjusted after estimation, "on-the-fly":

fit1 = fit.fetch_model(0)
fit1.vcov("hetero").summary()
Model:  Y~X1
###

Estimation:  OLS
Dep. var.: Y
Inference:  hetero
Observations:  998

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept     |      0.919 |        0.112 |     8.223 |      0.000 |  0.699 |   1.138 |
| X1            |     -1.000 |        0.082 |   -12.134 |      0.000 | -1.162 |  -0.838 |
---
RMSE: 2.158   R2: 0.123

Poisson Regression via fepois()

You can estimate Poisson Regressions via the fepois() function:

poisson_data = pf.get_data(model = "Fepois")
pf.fepois("Y ~ X1 + X2 | f1 + f2", data = poisson_data).summary()
###

Estimation:  Poisson
Dep. var.: Y, Fixed effects: f1+f2
Inference:  CRV1
Observations:  997

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| X1            |     -0.007 |        0.035 |    -0.190 |      0.850 | -0.075 |   0.062 |
| X2            |     -0.015 |        0.010 |    -1.449 |      0.147 | -0.035 |   0.005 |
---
Deviance: 1068.169

IV Estimation via three-part formulas

Last, PyFixest also supports IV estimation via three part formula syntax:

fit_iv = pf.feols("Y ~ 1 | f1 | X1 ~ Z1", data = data)
fit_iv.summary()
###

Estimation:  IV
Dep. var.: Y, Fixed effects: f1
Inference:  CRV1
Observations:  997

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| X1            |     -1.025 |        0.115 |    -8.930 |      0.000 | -1.259 |  -0.790 |
---

Call for Contributions

Thanks for showing interest in contributing to pyfixest! We appreciate all contributions and constructive feedback, whether that be reporting bugs, requesting new features, or suggesting improvements to documentation.

If you'd like to get involved, but are not yet sure how, please feel free to send us an email. Some familiarity with either Python or econometrics will help, but you really don't need to be a numpy core developer or have published in Econometrica =) We'd be more than happy to invest time to help you get started!

Contributors โœจ

Thanks goes to these wonderful people:

styfenschaer
styfenschaer

๐Ÿ’ป
Niall Keleher
Niall Keleher

๐Ÿš‡ ๐Ÿ’ป
Wenzhi Ding
Wenzhi Ding

๐Ÿ’ป
Apoorva Lal
Apoorva Lal

๐Ÿ’ป ๐Ÿ›
Juan Orduz
Juan Orduz

๐Ÿš‡ ๐Ÿ’ป
Alexander Fischer
Alexander Fischer

๐Ÿ’ป ๐Ÿš‡
aeturrell
aeturrell

โœ… ๐Ÿ“– ๐Ÿ“ฃ
leostimpfle
leostimpfle

๐Ÿ’ป ๐Ÿ›
baggiponte
baggiponte

๐Ÿ“–
Sanskriti
Sanskriti

๐Ÿš‡
Jaehyung
Jaehyung

๐Ÿ’ป
Alex
Alex

๐Ÿ“–
Hayden Freedman
Hayden Freedman

๐Ÿ’ป ๐Ÿ“–
Aziz Mamatov
Aziz Mamatov

๐Ÿ’ป
rafimikail
rafimikail

๐Ÿ’ป
Benjamin Knight
Benjamin Knight

๐Ÿ’ป
Dirk Sliwka
Dirk Sliwka

๐Ÿ’ป ๐Ÿ“–

This project follows the all-contributors specification. Contributions of any kind welcome!

pyfixest's People

Contributors

allcontributors[bot] avatar apoorvalal avatar asteves avatar b-knight avatar baggiponte avatar dependabot[bot] avatar dsliwka avatar greenguy33 avatar jayhyung avatar juanitorduz avatar leostimpfle avatar nkeleher avatar pre-commit-ci[bot] avatar rafimikail avatar s3alfisc avatar saidamir avatar sanskriti2005 avatar styfenschaer avatar wenzhi-ding avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

pyfixest's Issues

vcov() method does not update summary info

Example:

import pyfixest as pf
import numpy as np
from pyfixest.utils import get_data

fixest = pf.Fixest(data = data)
fixest.feols("Y~X1 | csw0(X2, X3)", vcov = {'CRV1':'id'})
fixest.summary()
# ###
# 
# ---
# ###
# 
# Dep. var.:  Y 
# Inference:  {'CRV1': 'id'}
# Observations:  998
# 
#            Estimate  Std. Error   t value  Pr(>|t|)
# Intercept  6.648203    0.220649 30.130262   0.00000
#        X1 -0.141200    0.211081 -0.668937   0.50369
# ---
# ###
# 
# Fixed effects:  X2
# Dep. var.:  Y 
# Inference:  {'CRV1': 'id'}
# Observations:  998
# 
#     Estimate  Std. Error   t value  Pr(>|t|)
# X1 -0.142274    0.210556 -0.675707  0.499383
# ---
# ###
# 
# Fixed effects:  X2+X3
# Dep. var.:  Y 
# Inference:  {'CRV1': 'id'}
# Observations:  998
# 
#     Estimate  Std. Error   t value  Pr(>|t|)
# X1 -0.096317    0.204801 -0.470296  0.638247
fixest.vcov({'CRV3':'group_id'}).summary()
>>> fixest.vcov({'CRV3':'group_id'}).summary()
# ###
# 
# ---
# ###
# 
# Dep. var.:  Y 
# Inference:  {'CRV1': 'id'}
# Observations:  998
# 
#            Estimate  Std. Error   t value  Pr(>|t|)
# Intercept  6.648203    0.229614 28.953831  0.000000
#        X1 -0.141200    0.207516 -0.680428  0.502745
# ---
# ###
# 
# Fixed effects:  X2
# Dep. var.:  Y 
# Inference:  {'CRV1': 'id'}
# Observations:  998
# 
#     Estimate  Std. Error   t value  Pr(>|t|)
# X1 -0.142274     0.20774 -0.684867   0.49999
# ---
# ###
# 
# Fixed effects:  X2+X3
# Dep. var.:  Y 
# Inference:  {'CRV1': 'id'}
# Observations:  998
# 
#     Estimate  Std. Error  t value  Pr(>|t|)
# X1 -0.096317    0.206282 -0.46692  0.644768
# 

IV support

Required Steps:

  • allow three-part formulas
  • implement IV estimation & inference. Start with just-identified models.
  • update summary() and tidy() methods. For tidy(), simply add index "stage1" and "stage2". Split summary() into two parts. Check out what fixest does.

Nice to have's (for now):

  • common IV diagnostics
  • AR confidence intervals

In Practice:

  • implement IV estimator (Z'X)^(-1) Z'Y.
  • always create X and Z. For OLS, set X = Z
  • pass everything through, done (at least for models with only one endog variable)
  • after this is implemented, allow for over identified models. This requires implementation of the 2SLS estimator + updates to inference procedures

Enable cluster jackknife inference

Fairly straightforward:

  • simply add clustering variant CRV3-jackknife, e.g. as function argument {'CRV3-jackknife':'group_id'}
  • uncomment code here
  • add tests, done!

Add requirements.txt

Currently there is no indication of what dependencies are required for the package. At the moment I'm repeatedly trying to load Fixest and installing the missing packages it notices one by one.

CRV3 inference

  • currently blocked by default -> unblock
  • update with code from statsmodels PR

Bug for interactions

KeyError: "['X1:X2'] not in index". Happens because variable "X1:X2" is not contained in the input data. Likely requires even more reshuffling of the formula parsing - demeaning - model matrix pipeline.

Unit Tests

  • test feols front end against fixest::feols
  • test parsing of formula syntax
  • add continuous integration
  • add even more tests

Handling of Categorical Variables

  • Build around formulaic's C(..., levels = ...) option.
  • Need to pass levels information to formula.
  • Rename levels to "ref" so that users can only provide a reference str instead of a full list.
  • Is there a need for a dedicated i() option to interact with a categorical variable? The key advantage of the i() method seems custom tooling around it, e.g. iplot. But that could be handled with plotting method with decent string regex for coef names (i.e. for variables included in i()). Maybe not 100% error prone?

Mimic fixest vcov defaults

  • drop requirement to set vcov argument in .feols()
  • for no specified vcov without fixed effects: iid inference
  • for no specified vcov with fixed effects: CRV1 clustered at the first fixed effects level

See here.

Bug in i() method

For some reason, signs of integers are swapped, i.e. for event studies with time-to-treatment (ttt) -27, ...., 0, ..., 21, the estimated coefs are labeled as -21, ..., 0, ..., 27.

Custom fixed effects demeaning function

  • Write a custom demean() function that allows for NA values in either X or Y
  • further, allow for weights (for WLS)
  • implement dropping of singletons
  • note that there are a few errors in the current version of demean() (i.e. the handling of missings + multiple fixed effects)

Support for datetime variables

Was inspired to try using a datetime in a regression, given that statsmodels handles these incorrectly. Currently these do not appear to be supported in pyfixest

from causaldata import restaurant_inspections
import pandas as pd
import pyfixest.pyfixest as pf

res = restaurant_inspections.load_pandas().data

res['DT'] = [pd.to_datetime(str(y)+'-01-01 00:00:00') for y in res['Year']]

fixest = pf.Fixest(data = res)
fixest.feols('inspection_score ~ DT', vcov = 'iid')
fixest.summary()
# UFuncTypeError: ufunc 'multiply' cannot use operands with types dtype('int32') and dtype('<M8[ns]')

Note that this also occurs without the 00:00:00, or if using a datetime.date() instead of a Pandas datetime. Compare to R:

library(fixest)
library(causaldata)

data(restaurant_inspections)

restaurant_inspections$Time = lubridate::ymd_hms(paste0(restaurant_inspections$Year, '-01-01 00:00:00'))

feols(Year~Time, data = restaurant_inspections)
# OLS estimation, Dep. Var.: Year
# Observations: 27,178 
# Standard-errors: IID 
# Estimate   Std. Error  t value  Pr(>|t|)    
# (Intercept) 1.970001e+03 3.323279e-05 59278828 < 2.2e-16 ***
#   Time        3.170000e-08 2.580000e-14  1226887 < 2.2e-16 ***
#   ---
#   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# RMSE: 7.994e-4   Adj. R2: 1

That's probably enough tinkering for me today! Sorry to load up the Issue page.

Improve performance

  • See here for faster solving of the least squares problem.
  • faster demeaning algo via numba / c++
  • clean up code, drop redundancies, etc
  • for a given column in the design matrix, only project out fe's when not already done (for multiple estimation)

Implement CRV3 for arbitrary fixed effects

Should be fairly straightforward: for no clustering, run MNW's summclust algo, else do what sandwich does. E.g. do along these lines:

                if self.has_fixef == False:
                    # inverse hessian precomputed?
                    tXX = np.transpose(self.X) @ self.X
                    tXy = np.transpose(self.X) @ self.Y

                    # compute leave-one-out regression coefficients (aka clusterjacks')
                    for ixg, g in enumerate(clusters):

                        Xg = self.X[np.equal(ixg, group)]
                        Yg = self.Y[np.equal(ixg, group)]
                        tXgXg = np.transpose(Xg) @ Xg
                        # jackknife regression coefficient
                        beta_jack[ixg,:] = (
                            np.linalg.pinv(tXX - tXgXg) @ (tXy - np.transpose(Xg) @ Yg)
                        ).flatten()

                else:

                    for ixg, g in enumerate(clusters):
                        data = self.data[np.equal(ixg, group)]
                        model = Fixest(data)
                        model.feols(self.formula, vcov = "iid")
                        beta_jack[ixg,:] = model.beta_hat

Issues with specifying vcov

  1. Currently, vcov is a required positional argument in feols. Note this also means you can't specify the vcov using the .vcov method without also specifying it in the feols call as a positional argument. Suggest matching the defaults from the R package of vcov = 'iid' with no fixed effects specified, or clustered at the level of the first fixed effect if fixed effects are specified., or overwriting that with whatever was set in .vcov.
  2. help(fixest.feols) suggests using dict("CRV1":"clustervar") to specify a cluster variable but this is improper syntax.
  3. In my example data I'm having trouble setting any sort of clustering variable alongside a set of fixed effects (pip install causaldata; note there are no missing values in this data):
from causaldata import restaurant_inspections
import pandas as pd
import pyfixest as pf
import numpy as np

res = restaurant_inspections.load_pandas().data
fixest = pf.Fixest(data = res)

fixest.feols('inspection_score ~ Weekend | business_name', vcov = 'iid')
# runs fine

fixest.feols('inspection_score ~ Weekend | business_name', vcov = dict({'CRV1':'business_name'}))
# TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

# maybe it prefers numbers?
fixest.feols('inspection_score ~ Weekend | business_name', vcov = dict({'CRV1':'Year'}))
# IndexError: index 27134 is out of bounds for axis 0 with size 27076

# Or integers?
res['random_category'] = np.random.randint(0, 10, res.shape[0])
fixest = pf.Fixest(data = res)
fixest.feols('inspection_score ~ Weekend | business_name', vcov = dict({'CRV1':'random_category'}))
# IndexError: index 27087 is out of bounds for axis 0 with size 27076

# or categoricals?
res['categorical'] = pd.Categorical(res['random_category'])
fixest = pf.Fixest(data = res)
fixest.feols('inspection_score ~ Weekend | business_name', vcov = dict({'CRV1':'categorical'}))
# IndexError: index 27087 is out of bounds for axis 0 with size 27076

# Works fine without the FEs
fixest.feols('inspection_score ~ Weekend', vcov = dict({'CRV1':'categorical'}))

# although still not for business_name
fixest.feols('inspection_score ~ Weekend', vcov = dict({'CRV1':'business_name'}))
# TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''


# Maybe it can't handle single-row FEs alongside clusters?
res['counts'] = (res
		   .groupby('business_name')['business_name']
		   .transform('size'))
fixest = pf.Fixest(data = res.query('counts > 1'))
fixest.feols('inspection_score ~ Weekend | business_name', vcov = dict({'CRV1':'business_name'}))
# ValueError: matrix should have the same number of rows as fixed effect IDs.

p-value and some t-stat calculations differ from R fixest

Some more toying around, and I'm getting different p-value calculations from the original fixest, even in cases where estimates and t-stats match:

With clustering, the t-stats match but p-values are different:

from causaldata import restaurant_inspections
import pandas as pd
import pyfixest.pyfixest as pf

res = restaurant_inspections.load_pandas().data

fixest = pf.Fixest(data = res)
fixest.feols('inspection_score ~ Weekend', vcov = dict({'CRV1':'Year'}))
fixest.summary()
# ### Fixed-effects: 0
# Dep. var.: inspection_score 
# 
#            Estimate  Std. Error    t value  Pr(>|t|)
# Intercept 93.627262    0.321288 291.412047  0.000000
#   Weekend  2.096548    0.862544   2.430656  0.015072
library(fixest)
library(causaldata)

data(restaurant_inspections)

feols(inspection_score ~ Weekend , data = restaurant_inspections, vcov = ~Year)

# OLS estimation, Dep. Var.: inspection_score
# Observations: 27,178 
# Standard-errors: Clustered (Year) 
# Estimate Std. Error   t value  Pr(>|t|)    
# (Intercept) 93.62726   0.321288 291.41205 < 2.2e-16 ***
#   WeekendTRUE  2.09655   0.862544   2.43066  0.029106 *  
#   ---
#   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# RMSE: 6.25408   Adj. R2: 8.241e-4

With IID and HC1, the t-stats and p-values differ by very small amounts

fixest = pf.Fixest(data = res)
fixest.feols('Year ~ Weekend', vcov = 'iid')
fixest.summary()
# ### Fixed-effects: 0
# Dep. var.: Year 
# 
# Estimate  Std. Error      t value  Pr(>|t|)
# Intercept 2010.343815    0.036224 55497.059216  0.000000
# Weekend   -0.862863    0.412097    -2.093833  0.036275
feols(Year~Weekend, data = restaurant_inspections)
OLS estimation, Dep. Var.: Year
Observations: 27,178 
Standard-errors: IID 
               Estimate Std. Error     t value  Pr(>|t|)    
(Intercept) 2010.343815   0.036226 55495.01719 < 2.2e-16 ***
WeekendTRUE   -0.862863   0.412112    -2.09376  0.036291 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 5.94874   Adj. R2: 1.245e-4
fixest = pf.Fixest(data = res)
fixest.feols('Year ~ Weekend', vcov = 'HC1')
fixest.summary()
# ### Fixed-effects: 0
# Dep. var.: Year 
# 
# Estimate  Std. Error      t value  Pr(>|t|)
# Intercept 2010.343815    0.036181 55562.889684  0.000000
# Weekend   -0.862863    0.471032    -1.831856  0.066973
feols(Year~Weekend, data = restaurant_inspections, vcov = 'hc1')
# OLS estimation, Dep. Var.: Year
# Observations: 27,178 
# Standard-errors: Heteroskedasticity-robust 
# Estimate Std. Error     t value  Pr(>|t|)    
# (Intercept) 2010.343815   0.036182 55561.86747 < 2.2e-16 ***
#   WeekendTRUE   -0.862863   0.471041    -1.83182  0.066989 .  

Tests

  • compare against fixest
  • internal benchmarks

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.