
REndo - An R package to control for endogeneity by using internal instrumental variable models


rendo's Introduction

See our publication in JSS for more details on the methods and usage: doi:10.18637/jss.v107.i03

The REndo Package

Endogeneity arises when the independence assumption between an explanatory variable and the error in a statistical model is violated. Among its most common causes are omitted variable bias (e.g. ability in the estimation of returns to education), measurement error (e.g. survey response bias), and simultaneity (e.g. advertising and sales).
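To make the omitted-variable case concrete, here is a small simulation in base R. It is only an illustration: the variable names and all coefficients are made up, and the true effect of education on wage is set to 1.

```r
# Simulated returns-to-education example (all numbers are made up).
set.seed(1)
n <- 5000
ability   <- rnorm(n)                 # unobserved by the analyst
education <- ability + rnorm(n)       # regressor correlated with ability
wage      <- 1 + 1 * education + ability + rnorm(n)

# Omitting ability pushes it into the error term, which is then
# correlated with education.
ols_short <- coef(lm(wage ~ education))["education"]
ols_full  <- coef(lm(wage ~ education + ability))["education"]

print(c(omitted = unname(ols_short), controlled = unname(ols_full)))
```

In this setup the estimate that omits ability is pulled away from the true value of 1, while the regression that controls for ability recovers it.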

Instrumental variable estimation is a common treatment when endogeneity is of concern. However, valid and strong external instruments are difficult to find. Consequently, statistical methods that correct for endogeneity without external instruments have been advanced. They are called internal instrumental variable (IIV) models.

REndo implements the following instrument-free methods:

latent instrumental variables approach (Ebbes, Wedel, Boeckenholt, and Steerneman 2005)
higher moments estimation (Lewbel 1997)
heteroskedastic error approach (Lewbel 2012)
joint estimation using copula (Park and Gupta 2012)
multilevel GMM (Kim and Frees 2007)

The new version - REndo 2.0.0

REndo 2.0.0 comes with many code optimizations and a new syntax for all functions.

Walk-Through

Below, we present the syntax for each of the 5 implemented instrument-free methods:

Latent Instrumental Variables

latentIV(y ~ P, data, start.params=c())

The first argument is the formula of the model to be estimated, y ~ P, where y is the response and P is the endogenous regressor. The second argument is the name of the dataset used and the last one, start.params=c(), which is optional, is a vector with the initial parameter values. When not indicated, the initial parameter values are taken to be the coefficients returned by the OLS estimator of y on P.
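As a rough sketch of the default described above, the OLS coefficients of y on P can be computed in base R and then passed on explicitly. The data below are simulated and hypothetical, and the commented call assumes that start.params accepts a named vector with the names of the model coefficients.

```r
set.seed(42)
n <- 500
# Hypothetical data: a binary latent instrument drives P, and P shares
# an unobserved component (u) with the error of y.
z <- rbinom(n, 1, 0.5)
u <- rnorm(n)
dat <- data.frame(P = 2 * z + u + rnorm(n))
dat$y <- 3 - 1 * dat$P + u + rnorm(n)

# OLS coefficients of y on P, the default starting point described above.
start <- coef(lm(y ~ P, data = dat))
print(start)

# Passing them explicitly (requires REndo; shown for illustration only):
# fit <- REndo::latentIV(y ~ P, data = dat, start.params = start)
```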

Copula Correction

copulaCorrection(y ~ X1 + X2 + P1 + P2 | continuous(P1) + discrete(P2), data, start.params=c(), num.boots)

The first argument is a two-part formula of the model to be estimated; the second part of the RHS defines the endogenous regressors, here continuous(P1) + discrete(P2). The second argument is the name of the data. The third argument, start.params, is optional and holds the initial parameter values supplied by the user (when missing, the OLS estimates are used). The fourth argument, num.boots, is also optional and sets the number of bootstraps to be performed (the default is 1000). How the endogenous regressors are specified depends on how many there are and on their assumed distribution. Transformations of the explanatory variables, such as I(X) or log(X), are supported.

Higher Moments

higherMomentsIV(y ~ X1 + X2 + P | P | IIV(iiv = gp, g = x2, X1, X2) + IIV(iiv = yp) | Z1, data)

Here, y is the response; the first RHS of the formula, X1 + X2 + P, is the model to be estimated; the second part, P, specifies the endogenous regressors; the third part, IIV(), specifies the format of the internal instruments; the fourth part, Z1, is optional, allowing the user to add any external instruments available.

Regarding the third part of the formula, IIV(), it has a set of three arguments:

iiv - specifies the form of the instrument,
g - specifies the transformation to be done on the exogenous regressors,
the set of exogenous variables from which the internal instruments should be built (it can be one or all of the exogenous variables).

A set of six instruments can be constructed, which should be specified in the iiv argument of IIV():

g - for $(G_{t} - \bar{G})$ ,
gp - for $(G_{t} - \bar{G})(P_{t}-\bar{P})$ ,
gy - for $(G_{t} - \bar{G})(Y_{t}-\bar{Y})$ ,
yp - for $(Y_{t} - \bar{Y})(P_{t}-\bar{P})$ ,
p2 - for $(P_{t} - \bar{P})^2$ ,
y2 - for $(Y_{t} - \bar{Y})^2$ .

where $G=G(X_{t})$ can be either $x^2$, $x^3$, $\ln(x)$ or $\frac{1}{x}$ and should be specified in the g argument of the third RHS of the formula, as x2, x3, lnx or 1/x. For internal instruments built only from the endogenous regressor (e.g. p2), or from the response and the endogenous regressor (e.g. yp), there is no need to specify g or the set of exogenous regressors in the IIV() part of the formula. The function returns a set of tests for checking the validity of the instruments and the endogeneity assumption.
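The six instrument forms listed above can be built by hand in base R, which may help when checking what a given iiv/g combination means. This is only a sketch on simulated data; the helper center() and the data frame iiv are names introduced here, and this is not REndo's internal code.

```r
set.seed(7)
n  <- 200
X1 <- rnorm(n, mean = 2)
P  <- 0.5 * X1 + rexp(n)        # skewed endogenous regressor
y  <- 1 + X1 - 2 * P + rnorm(n)

# Deviation from the sample mean, e.g. (P_t - P-bar)
center <- function(v) v - mean(v)

G <- X1^2                        # corresponds to g = x2

iiv <- data.frame(
  g  = center(G),
  gp = center(G) * center(P),
  gy = center(G) * center(y),
  yp = center(y) * center(P),
  p2 = center(P)^2,
  y2 = center(y)^2
)
head(iiv, 3)
```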

Heteroskedastic Errors

hetErrorsIV(y ~ X1 + X2 + X3 + P | P | IIV(X1,X2) | Z1, data)

Here, y is the response variable and X1 + X2 + X3 + P represents the model to be estimated. The second part, P, specifies the endogenous regressors; the third part, IIV(X1, X2), specifies the exogenous heteroskedastic variables from which the instruments are derived; the final part, Z1, is optional and allows the user to include additional external instrumental variables. As in the higher moments approach, allowing additional external variables is a convenient feature of the function, since it increases the efficiency of the estimates. Transformations of the explanatory variables, such as I(X) or log(X), are possible both in the model specification and in the IIV() specification.
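For intuition, the textbook construction behind the heteroskedastic error approach (Lewbel 2012) can be sketched in base R: the instrument is a centered heteroskedastic variable multiplied by the first-stage residual. The simulated data and all names below are assumptions made for illustration, and this is not REndo's implementation.

```r
set.seed(3)
n  <- 2000
X1 <- rnorm(n)
X2 <- rnorm(n)
u  <- rnorm(n)                               # common unobserved factor
# First-stage error variance depends on X1 (heteroskedasticity)
P  <- 1 + X1 + X2 + u + exp(0.5 * X1) * rnorm(n)
y  <- 2 + X1 + X2 - 1 * P + u + rnorm(n)

# Step 1: residuals of the endogenous regressor on the exogenous ones
e_hat <- resid(lm(P ~ X1 + X2))

# Step 2: instrument built from the heteroskedastic variable X1
iv <- (X1 - mean(X1)) * e_hat

# Step 3: manual two-stage least squares (point estimates only;
# second-stage standard errors from lm() are not valid)
P_hat <- fitted(lm(P ~ X1 + X2 + iv))
coef(lm(y ~ X1 + X2 + P_hat))
```

hetErrorsIV() takes care of these steps and of valid inference; the manual version is only meant to show where the instrument comes from.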

Multilevel GMM

multilevelIV(y ~ X11 + X12 + X21 + X22 + X23 + X31 + X33 + X34 + (1|CID) + (1|SID) | endo(X12), data)

The function call takes a two-part formula and a data argument. In the formula, the first part is the model specification, with fixed and random parameters, and the second part specifies the regressors assumed endogenous, here endo(X12). The function returns the parameter estimates obtained with fixed effects, random effects and the GMM estimator proposed by Kim and Frees (2007), so that the models can be compared.

Installation Instructions

Install the stable version from CRAN:

install.packages("REndo")

Install the development version from GitHub:

devtools::install_github("mmeierer/REndo", ref = "development")

rendo's Issues

Questions regarding copula correction method

First off, I wanted to thank you for taking the time to write and maintain the REndo package on CRAN. This was my first introduction to instrument-free methods, and I am interested/excited to learn more! Specifically, I took an interest in your implementation of the copula correction method and had a couple of questions that I was hoping you could answer:

1) Are there accepted guidelines for applying copula correction methods to panel data? As a workaround, I have been mean-differencing/first-differencing the data myself before running the copula correction, but I was curious whether there are better methods. Should the generated regressor used to deal with endogeneity be generated in level (undifferenced) space, or in first- or mean-differenced space? Or does it not matter?
2) I noticed the ML implementation vastly improves the efficiency of the estimates (e.g. the standard errors shrink) compared to the OLS implementation. Is this normal? Do you know of a way to improve the efficiency of the OLS estimate? Further, I took a peek at the Park and Gupta paper that formulates the copula correction method, and I was curious whether you know of a log-likelihood written out for multiple endogenous covariates.

Any help you would be able to provide would be greatly appreciated. Thanks once again!

Add vignettes to better illustrate how to use REndo

liv call

A call of the liv function of the type:
liv(y ~ P)
should work as well (meaning without param and data). When param = NULL, default values are given, as already programmed in liv.R.
At the moment the call liv(y ~ P) gives the following error:

Error in model.frame.default(formula = y ~ P, drop.unused.levels = TRUE) :
invalid type (list) for variable 'y'.

Patrick, could you please have a look at this. Thank you

NOTES when compiling with Windows

After I fixed the Windows compiling error regarding the "formula" definition, I get now the following two notes:

checking R code for possible problems ... NOTE
liv: no visible global function definition for 'get_all_vars'
liv: no visible global function definition for 'model.frame'
liv: no visible global function definition for 'coefficients'
liv: no visible global function definition for 'lm'
liv: no visible global function definition for 'sd'
liv: no visible global function definition for 'new'
liv: no visible global function definition for 'slot<-'
print.liv: no visible global function definition for 'str'
summary.liv: no visible global function definition for 'printCoefmat'
Undefined global functions or variables:
coefficients get_all_vars lm model.frame new printCoefmat sd slot<-
str
Consider adding
importFrom("methods", "new", "slot<-")
importFrom("stats", "coefficients", "get_all_vars", "lm",
"model.frame", "printCoefmat", "sd")
importFrom("utils", "str")
to your NAMESPACE (and ensure that your DESCRIPTION Imports field
contains 'methods').

and

checking top-level files ... NOTE
File
LICENSE
is not mentioned in the DESCRIPTION file.

I attach the file that produced these notes...

REndo_1.0.tar.gz
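One possible way to address the first NOTE, assuming the package generates its NAMESPACE with roxygen2 via devtools::document(), is a small documentation block like the following. The file name is hypothetical; the imported functions are exactly those listed in the check output.

```r
# R/REndo-package.R (hypothetical file name)
# Running devtools::document() turns these tags into the importFrom()
# directives that R CMD check suggests for the NAMESPACE.

#' @importFrom methods new slot<-
#' @importFrom stats coefficients get_all_vars lm model.frame printCoefmat sd
#' @importFrom utils str
NULL
```

As the check message says, methods also has to be listed in the Imports field of DESCRIPTION. The second NOTE usually goes away once the License field of DESCRIPTION refers to the file, e.g. by ending in `file LICENSE`.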

Add Rcpp implementation of ecdf

Currently, bootstrapping for copulaCorrection can be very slow. Profiling reveals that the stats::ecdf function makes up about half the required computing time. To speed up computation, it could be implemented in Rcpp.

Various implementations of the ecdf exist, such as https://github.com/dmbates/ecdfExample.

However, this requires setting up the package for Rcpp accordingly. Currently, correct functionality has higher priority, so this is left for later.
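Independently of an Rcpp port, the ECDF evaluated at the sample points themselves can also be obtained from ranks in base R. The sketch below only demonstrates that both routes give the same values; whether it is actually faster in the bootstrap would have to be benchmarked.

```r
set.seed(11)
x <- round(rnorm(1000), 1)          # rounding creates ties on purpose

# Reference: build the step function, then evaluate it at the sample
ecdf_base <- ecdf(x)(x)

# Rank-based shortcut: the share of observations <= x[i]
ecdf_rank <- rank(x, ties.method = "max") / length(x)

all.equal(ecdf_base, ecdf_rank)
```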

Implement copula approach by Haschka (2021)

Implement panel data extension of Park/Gupta (2012)'s copula method proposed by Haschka (2021)

Ask for replication materials for the paper
Define required data input format and command structure (e.g., possible adaptations of the formula interface)

See here:
https://journals.sagepub.com/doi/abs/10.1177/00222437211070820?casa_token=okQH-wYLXWsAAAAA:GMTY59OBQFtIpwe0SBIRIevzRD_sOKWgxhhe8hAMaBYdEAezHEl653mPMHx2YnGWxEtW_92vs2BqQw

hmlewbel of class ivreg

hmlewbel.R should return an object of class ivreg
Thanks

Problems with intercept estimation with the Copula correction method

Does the copula correction method work for models that are estimated with an intercept? After skimming the Park and Gupta paper proposing the copula correction method, I noticed that their data generating process does not include an intercept. I tried imitating their data generating process without an intercept and with an intercept and got less than satisfying results (see the code below), which makes me think the data has to be mean-centered and estimated without an intercept.

Any feedback you all have would be appreciated!
Thanks!

## create an endogenous regressor
Sigma <- matrix(c(1.8,1.65,1.65,2),2,2)
x1 <-mvtnorm::rmvnorm(n = 100, c(2, 0), Sigma)
cor(x1)## check correlation
error <- x1[,2]
x1<- x1[,1]
## create an exogenous regressor
x2 <- rnorm(100, 2, 3)
## generate some data
y <-3*x1 + 4*x2 + error

## mle method
model <- REndo::copulaCorrection(y ~0+ x1+x2| continuous(x1),
                                 data=data.frame(y, x1, x2))
summary(model)
## ols method
p_star <- REndo:::copulaCorrectionContinuous_pstar(x1)
summary(lm(y ~ 0+x1+x2))
summary(lm(y ~ 0+x1+x2+p_star))

### allow the models to estimate an intercept
## mle method
model <- REndo::copulaCorrection(y ~ x1+x2| continuous(x1),
                                 data=data.frame(y, x1, x2))
summary(model)
## ols method
p_star <- REndo:::copulaCorrectionContinuous_pstar(x1)
summary(lm(y ~ x1+x2))
summary(lm(y ~ x1+x2+p_star))

### add intercept to the data
y <- y + 3

## mle method
model <- REndo::copulaCorrection(y ~ x1+x2| continuous(x1),
                                 data=data.frame(y, x1, x2))
summary(model)
## ols method
p_star <- REndo:::copulaCorrectionContinuous_pstar(x1)
summary(lm(y ~ x1+x2))
summary(lm(y ~ x1+x2+p_star))

add multilevelIV back

Add again support for the multilevelIV function which is currently defunct.

Implement the paper by Kim & Frees (2007) and generalize their example code.

Possibly use data.table for the grouping and sparse matrices from Matrix package to deal with the large cov matrices.

Non-matching matrix dimensions

Discussed in #62

Originally posted by pinson06, August 18, 2021
Hi everyone,

I'm trying to use this very nice package, but I get an error message that I have no clue how to solve. Here's what I obtain in R when using multilevelIV with my dataset and this code:
formula <- y ~ e + a + f + w + l + w2 + w_l + f_l + e_l + e_w + a_w + a_l + a_e + e_w_l + f_w_l + f_a_l + f_e_l + a_w_l + f_a_e_l + v + pr + t + (1 | album_id) | endo(e,a,f)
multilevel <- multilevelIV(formula = formula, data = data, verbose=TRUE)

Message obtained via R's console:
"Detected multilevel model with 2 levels.
For album_id (Level 2), 181 groups were found.
Error: Matrices must have same dimensions in diag2Tsmart(e1, e2, "d") - e2"

Any ideas?

Warning message when adding more than one endogenous variable in copulaCorrection()

When I include more than one endogenous variable in copulaCorrection(), it shows the following message and runs much faster: "Warning: Additional parameters given in the ... argument are ignored because they are not needed". What does this message mean?

generic function for S3 class "copulaEndo"

I have incorporated your comments regarding the S3 class "copulaEndo", but I do not know how to create a generic function for this class. Currently I get the following warning message when I run devtools::document():

Warning message:
S3 methods ‘coef.copulaEndo’, ‘print.copulaEndo’, ‘summary.copulaEndo’ were declared in NAMESPACE but not found

Thanks
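A minimal sketch of S3 methods for such a class follows, assuming the fitted object is a list with a coefficients element. The constructor new_copulaEndo() is hypothetical. Since coef(), print() and summary() are already generics in R, no new generic has to be created; the functions only need to exist under exactly the names declared in NAMESPACE.

```r
# Hypothetical constructor: a plain list tagged with the S3 class.
new_copulaEndo <- function(coefficients) {
  structure(list(coefficients = coefficients), class = "copulaEndo")
}

#' @export
coef.copulaEndo <- function(object, ...) object$coefficients

#' @export
print.copulaEndo <- function(x, ...) {
  cat("Copula correction fit\n")
  print(coef(x))
  invisible(x)
}

#' @export
summary.copulaEndo <- function(object, ...) {
  structure(list(coefficients = coef(object)), class = "summary.copulaEndo")
}

fit <- new_copulaEndo(c(P = 0.5, X1 = 1.2))
coef(fit)
```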

variable names

In hmlewbel.R the names of the variables returned are not the ones given in "formula". See the example below (the formula was y1 ~ M + P1, where M has two columns X1 and X2):

AER::ivreg(formula = y~ X + P | X + IV, x = TRUE)

Coefficients:
(Intercept) XX1 XX2 P
3.4331 0.9484 3.1407 0.5571

Could you please indicate how I could change that?

Thanks!
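The naming can be reproduced with lm() alone, which suggests it comes from passing a matrix as a single formula term rather than from ivreg itself. A small sketch on simulated data, with variable names chosen to mirror the output above:

```r
set.seed(5)
n <- 100
X <- cbind(X1 = rnorm(n), X2 = rnorm(n))   # matrix with two named columns
P <- rnorm(n)
y <- as.numeric(3 + X %*% c(1, 3) + 0.5 * P + rnorm(n))

# Matrix term: names are "<term><column>", hence XX1 and XX2
names(coef(lm(y ~ X + P)))

# Columns of a data frame: names match the formula
dat <- data.frame(y = y, X, P)
names(coef(lm(y ~ X1 + X2 + P, data = dat)))
```

Building the formula from the individual columns, or renaming the coefficients after the fit, would therefore be two ways to get the original names back.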

Compatibility with R 4.0.2

Recently upgraded from R 3.6.3 and the multilevelIV function has stopped working. The only message returned is "Error: The above errors were encountered!" Are you aware of any compatibility issues?

latentIV: support for multiple endogenous instruments

As part of a research project, I've extended the latentIV log-likelihood function to support multiple endogenous regressors with multiple levels. It's quite fast.

Is this something you'd be potentially interested in integrating?

I'm a bit wary of throwing out yet another package with incremental functionality, while you guys have already made tackling endogeneity more accessible.

Let me know about your interest, and then I can check how the function may fit in the existing structure of the package.

Cheers,
Hannes

Update REndo homepage

Set up REndo homepage based on GitHub Actions instead of Travis.

Rd files need updates for HTML5

The current development version of R has recently switched to use HTML5
for documentation pages (PR#18149). Now validation using HTML tidy
finds the following problems in your Rd files (problems only shown once
per Rd file):

REndo::copulaCorrection.Rd: Warning: element removed from HTML5
REndo::hetErrorsIV.Rd: Warning: element removed from HTML5
REndo::higherMomentsIV.Rd: Warning: element removed from HTML5
REndo::higherMomentsIV.Rd: Warning: attribute "align" not allowed for HTML5
REndo::latentIV.Rd: Warning: element removed from HTML5
REndo::multilevelIV.Rd: Warning: element removed from HTML5
REndo::vcov.rendo.boots.Rd: Warning: element removed from HTML5
REndo::vcov.rendo.boots.Rd: Warning: unescaped & or unknown entity "̅-"

Can you please fix as necessary?

The problems reported are one of

Warning: element removed from HTML5
Warning: element removed from HTML5
Warning: element removed from HTML5
Warning: attribute "align" not allowed for HTML5
Warning: attribute "align" not allowed for HTML5

See https://html.spec.whatwg.org/#obsolete-but-conforming-features for
info on these: in principle, all can be fixed by using style attributes,
e.g.

style='text-align: right;'

instead of align='right' etc., which will work for both the new and old
ways of converting Rd to HTML.

Please fix before 2022-03-15 to safely retain your package on CRAN.

copulaEndo class

I want to create a "copulaEndo" class for the copulaEndo function. I already tried something, as you can see in the commented lines in the copulaEndo.R file.

The problem is that copulaEndo function calls 3 other functions, each of them returning a different type of result, as follows:

when function copulaCont1 is called, the returned slots should be:

coefEndovar
coefExoVar
value
convCode

when function copulaMethod2 is called, the returned slots should be:

coefEndovar
coefExoVar
coefPStar
seCoefficients

when copulaDiscrete is called, the object returned should be of "lm" class, since copulaDiscrete returns the result of a "lm" call.

Would something like that be possible to implement? Or what solution would you suggest?
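One option, sketched here under the assumption that S3 is acceptable, is to give every return value the same outer class plus a subclass naming the estimation path, while keeping any existing class such as "lm". The helper as_copulaEndo(), the subclass names and the numbers below are hypothetical.

```r
# Hypothetical helper: tag any result with a method-specific subclass and
# the common "copulaEndo" class, keeping existing classes (e.g. "lm").
as_copulaEndo <- function(fit, method) {
  stopifnot(is.list(fit))
  class(fit) <- c(paste0("copulaEndo_", method), "copulaEndo", oldClass(fit))
  fit
}

# Discrete case: wrap the lm object, "lm" stays in the class vector
d <- data.frame(y = c(1.1, 2.3, 2.9, 4.2, 5.1), P = 1:5)
fit_discrete <- as_copulaEndo(lm(y ~ P, data = d), method = "discrete")

# Continuous case: a plain list with the slots listed above (values made up)
fit_cont <- as_copulaEndo(
  list(coefEndovar = 0.5, coefExoVar = 1.2, value = -10.3, convCode = 0L),
  method = "continuous1"
)

class(fit_discrete)
inherits(fit_discrete, "lm")   # lm methods such as coef() still dispatch
```

Methods like print() or summary() can then be written once for "copulaEndo" and overridden per subclass only where the contents differ.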

Verify model assumptions

For any method, it should be checked whether the assumptions underlying the theoretical model are met for the given data.

If the assumptions are violated, the user should be warned about it.

Still NOTES when compiling on Windows

checking R code for possible problems ... NOTE

liv: no visible global function definition for ‘get_all_vars’
liv: no visible global function definition for ‘model.frame’
liv: no visible global function definition for ‘coefficients’
liv: no visible global function definition for ‘lm’
liv: no visible global function definition for ‘sd’
liv: no visible global function definition for ‘new’
liv: no visible global function definition for ‘slot<-’
print.liv: no visible global function definition for ‘str’
summary.liv: no visible global function definition for ‘printCoefmat’
Undefined global functions or variables:
coefficients get_all_vars lm model.frame new printCoefmat sd slot<-
str
Consider adding
importFrom("methods", "new", "slot<-")
importFrom("stats", "coefficients", "get_all_vars", "lm",
"model.frame", "printCoefmat", "sd")
importFrom("utils", "str")
to your NAMESPACE (and ensure that your DESCRIPTION Imports field
contains 'methods').

I will try to fix it, but if you have time, I would appreciate it if you had a look too.

Best

Add support for interactions terms in Copula method

If an endogenous variable is included in an interaction with an exogenous variable, an additional instrument has to be included for the interaction term. The procedure is analogous to regular IV estimation. This could be handled automatically by copulaCorrection().

summary.copulaEndo

Hi Patrik,

It has been a while... Could you please have a look at the syntax of the copulaEndo class? Does it make sense that the class declaration is in S4 format while the summary and coef methods are S3? Will that be accepted by CRAN?

Also, at the moment the summary.copulaEndo method does not work well. Could you have a look? Thank you very much.

Sample weights

I want to use sample weights for the observations. Does the package allow incorporating sampling weights?

Allow users access to the Pstar vectors in copulaCorrection so users can use panel data methods with them

(1) Please export the functions copulaCorrectionDiscrete_pstar and copulaCorrectionContinuous_pstar so users can call them to generate Pstar vectors.

(2) Please return the pstar vectors with the return object of the copulaCorrection() function (summary function does not need to show them by default)

(3) Also, please return the generated moments (g, gp, gy, etc.) for the higherMomentsIV() function with the return object, or provide access to the functions that generate them.

This will allow users to use pstar vectors with further panel data analysis using the plm package, or conduct further tests that may require the pstar vector that was used in the copula regression.

Thank you!
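Until such functions are exported, the generated regressor for a continuous endogenous variable can be approximated in base R as the standard normal quantile of the empirical CDF. This is only a sketch: pstar_continuous() is a hypothetical helper, and the boundary adjustment it uses is an assumption that may differ from what REndo does internally.

```r
# Generated regressor for a continuous endogenous variable:
# P* = qnorm(H(P)), with H the empirical CDF of P.
pstar_continuous <- function(p) {
  h <- ecdf(p)(p)
  # Assumption of this sketch: pull H(P) = 1 slightly inside (0, 1) so that
  # qnorm() stays finite. REndo's own adjustment may differ.
  eps <- 1e-7
  h <- pmin(pmax(h, eps), 1 - eps)
  qnorm(h)
}

set.seed(9)
n  <- 300
P  <- rexp(n)                       # non-normal endogenous regressor
X1 <- rnorm(n)
y  <- 1 + X1 - P + rnorm(n)

p_star <- pstar_continuous(P)
coef(lm(y ~ X1 + P + p_star))       # augmented OLS with the extra regressor
```

The resulting vector can be added as a column to a data frame and used with other estimators, for example from the plm package.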

failed R CMD check for Windows

I have just submitted to CRAN a first version of the package with only two functions: liv and hmlewbel.

However, when I am asked to check if the package can be installed on Windows as well, the check gives me the following error :

installing source package 'REndo' ...
** R
** data
** preparing package for lazy loading
Error in makePrototypeFromClassDef(properties, ClassDef, immediate, where) :
in making the prototype for class "liv" elements of the prototype failed to match the corresponding slot class: formula (class "formula" )
Error : unable to load R code in package 'REndo'
ERROR: lazy loading failed for package 'REndo'
removing 'd:/RCompile/CRANguest/R-devel/lib/REndo'

Do you have any idea what could be wrong? I see it has something to do with the liv function... but what can be changed? The package passes the CRAN check on Mac...

Enhanced OLS for copulaCorrection single continuous endogenous regressor

Currently, the case of a single continuous endogenous regressor in copulaCorrection can only be estimated using LL. Because bootstrapping with LL takes much longer, it would be convenient to also offer enhanced OLS. According to Raluca, this would be theoretically correct, as shown in the paper by Park and Gupta.

Suggested syntax: use continuousLL() and continuousOLS() to mark the regressors in favour of continuous(), which should be deprecated.

This is not expected to be implemented in the near future.

problems summary function class ivreg

When I am applying hmlewbel and then trying to run a summary on it, it gives me the error below:

h1 <- hmlewbel(y ~ X1 + X2 + P, IIV = "yp")
summary(h1)
Error in UseMethod("coeftest") :
no applicable method for 'coeftest' applied to an object of class "ivreg"

h1 is of class "ivreg", so summary(h1) should work. Can you point me to where the error is?