declaredesign / estimatr Goto Github PK

estimatr: Fast Estimators for Design-Based Inference

Home Page: https://declaredesign.org/r/estimatr

License: Other

R 91.14% C++ 6.46% Stata 1.21% TeX 1.19%

estimatr's Introduction

DeclareDesign: Declare and Diagnose Research Designs

DeclareDesign is a system for describing research designs in code and simulating them in order to understand their properties. Because DeclareDesign employs a consistent grammar of designs, you can focus on the intellectually challenging part – designing good research studies – without having to code up simulations from scratch. For more, see declaredesign.org.

Installation

To install the latest stable release of DeclareDesign, please ensure that you are running version 3.5 or later of R and run the following code:

install.packages("DeclareDesign")

Usage

Designs are declared by adding together design elements. Here’s a minimal example that describes a 100 unit randomized controlled trial with a binary outcome. Half the units are assigned to treatment and the remainder to control. The true value of the average treatment effect is 0.05 and it will be estimated with the difference-in-means estimator. The diagnosis shows that the study is unbiased but underpowered.

library(DeclareDesign)

design <-
  declare_model(
    N = 100, 
    potential_outcomes(Y ~ rbinom(N, size = 1, prob = 0.5 + 0.05 * Z))
  ) +
  declare_inquiry(ATE = 0.05) +
  declare_assignment(Z = complete_ra(N, m = 50)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) + 
  declare_estimator(Y ~ Z, .method = lm_robust, inquiry = "ATE")

diagnosands <-
  declare_diagnosands(bias = mean(estimate - estimand),
                      power = mean(p.value <= 0.05))

diagnosis <- diagnose_design(design, diagnosands = diagnosands)

diagnosis

Inquiry	Estimator	Outcome	Bias	SE(Bias)	Power	SE(Power)	n sims
ATE	estimator	Y	-0.004	0.004	0.076	0.01	500

Companion software

The core DeclareDesign package relies on four companion packages, each of which is useful in its own right.

randomizr: Easy to use tools for common forms of random assignment and sampling.
fabricatr: Imagine your data before you collect it.
estimatr: Fast estimators for social scientists.
DesignLibrary: Templates to quickly adopt and adapt common research designs.

Learning DeclareDesign

To get started, have a look at this vignette on the idea behind DeclareDesign, which covers the main functionality of the software.
For an explanation of the philosophy behind DeclareDesign, examples in code and words of declaring and diagnosing common research designs in the social sciences, as well as examples of how to incorporate DeclareDesign into your own research, see the book Research Design in the Social Sciences (Blair, Coppock, Humphreys, 2023).

Package structure

Each of these declare_*() functions returns a function.

declare_model() (describes dimensions and distributions over the variables, including potential outcomes)
declare_inquiry() (takes variables in the model and calculates estimand value)
declare_sampling() (takes a population and selects a sample)
declare_assignment() (takes a population or sample and adds treatment assignments)
declare_measurement() (takes data and adds measured values)
declare_estimator() (takes data produced by sampling, assignment, and measurement and returns estimates linked to inquiries)
declare_test() (takes data produced by sampling, assignment, and measurement and returns the result of a test)

To declare a design, connect the components of your design with the + operator.

Once you have declared your design, there are four core post-design-declaration commands used to modify or diagnose your design:

diagnose_design() (takes a design and returns simulations and diagnosis)
draw_data() (takes a design and returns a single draw of the data)
draw_estimates() (takes a design and returns a single simulation of estimates)
draw_estimands() (takes a design and returns a single simulation of estimands)

A few other features:

A designer is a function that takes parameters (e.g., N) and returns a design. expand_design() is a function of a designer and parameters that return a design.
You can change the diagnosands with declare_diagnosands().

This project was generously supported by a grant from the Laura and John Arnold Foundation and seed funding from EGAP.

estimatr's People

Contributors

Stargazers

Watchers

Forkers

medewitt benlistyg bowers-illinois-edu lucyxiaoluwang kuriwaki nfultz rvlenth apoorvalal jaeyk yangzhao98 anhnguyendepocen rossellhayes gedevan-aleksizde potterzot waynelockon galexandros royalts statwu etiennebacher

estimatr's Issues

Create mean wrapper for lm_robust_fit

To estimate means and their variances for various sampling designs.

confusing clean_model_data error

test <- structure(list(catchment_area = c("01", "02", "03", "04", "05", 
                                          "06", "07", "08", "09", "10", "11", "12", "13", "14", "15", "16", 
                                          "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", 
                                          "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", "38", 
                                          "39", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49", 
                                          "50"), infant_alive = c(0.709677419354839, 1, 0.821428571428571, 
                                                                  0.925925925925926, 0.923076923076923, 0.714285714285714, 0.84375, 
                                                                  0.791666666666667, 0.714285714285714, 0.958333333333333, 0.709677419354839, 
                                                                  0.774193548387097, 0.961538461538462, 0.85, 0.888888888888889, 
                                                                  0.869565217391304, 0.846153846153846, 0.875, 0.818181818181818, 
                                                                  0.88, 0.869565217391304, 0.733333333333333, 0.821428571428571, 
                                                                  0.970588235294118, 0.769230769230769, 0.8, 0.958333333333333, 
                                                                  0.695652173913043, 0.88, 0.80952380952381, 0.952380952380952, 
                                                                  0.931034482758621, 0.8, 0.92, 0.807692307692308, 1, 1, 0.892857142857143, 
                                                                  0.848484848484849, 0.875, 0.896551724137931, 0.863636363636364, 
                                                                  0.8, 0.838709677419355, 0.84, 0.903225806451613, 0.736842105263158, 
                                                                  0.925925925925926, 0.761904761904762, 0.96), Z = c(0, 1, 1, 1, 
                                                                                                                     1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 
                                                                                                                     0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 
                                                                                                                     0, 1, 0, 1)), .Names = c("catchment_area", "infant_alive", "Z"
                                                                                                                     ), row.names = c(NA, -50L), class = "data.frame")

lm_robust(infant_alive ~ Z, data = test)

Error in clean_model_data(formula = infant_alive ~ Z, data = test, subset = , :
unused argument (block = NULL)

Default coefficient names

Shouldn't the default be to return all coefficients rather than just "Z"?

Explore getting lm_robust working with default S3 methods from margins

See PR #97

Rstudio yellow squigglies

When I'm using lm_robust, everything works great!

But rstudio is upset. It has a yellow triangle and a yellow squiggly underline.

When I hover over it says

argument 'weights' is missing, with no default
argument 'subset' is missing, with no default
argument 'clusters' is missing, with no default

maybe some NSE thing?

lm_lin

A new function which implements the estimator proposed in Lin (2013). It is basically a thin wrapper that pre-processes the formula to do the transformation suggested in the paper and then sends to lm_robust. The options should be:

lm_lin(formula = Y ~ Z, covariates = ~ x1 + x2 + x3)

where formula is like in difference_in_means, just including the outcome and treatment indicator.

Make output stargazable

While possibly keeping the output as a dataframe, try to make it so that it can be easily passed to stargazer.

horvitz_thompson

We'd like to implement horvitz_thompson as a second estimator, mimicking the functionality of difference_in_means.

Weighted estimators in difference_in_means

None of our weighted DIM estimators have the same standard errors as lm_robust(), except for the clustered and block-clustered cases which match because they farm all the work out to lm_robust (there may still be degrees of freedom discrepancies anyways).

Currently, I am able to replicate weights::wtd.t.test() and fits with the estimates I've been able to find online, but this doesn't match lm_robust(..., se_type = "HC2")

devtools::install_github("DeclareDesign/estimatr", ref = "weightdim")
n <- 8
dat <- data.frame(y = rnorm(n), z = c(0, 1), w = runif(8))
lm_robust(y ~ z, data = dat, weights = w)
difference_in_means(y ~ z, data = dat, weights = w)

with(dat, weights::wtd.t.test(y[z==1], y[z==0], w[z==1], w[z==0])

What underlies weights::wtd.t.test() are the same weighted variance and means we get from SDMtools that @acoppock linked in Slack.

We can:

a) figure this out
b) farm all weighted estimation to lm_robust, erroring if there are matched pairs and weights as lm_robust can't accomadate this case (Imai et al talk about weights, but for specific estimands)
c) basically error whenever someone tries to do weighted d-i-m for now and (i) do nothing or (ii) transparently tell them they should use lm_robust

Build vignette for estimatr

Should show different SEs, different choices about how many coefficients to return, perhaps show speed gains from not calculating CIs with BM SEs so that users who only want point estimates are aware of these gains.

warning in windows on package load

Macartan is getting this warning when he loads estimatr on windows even after a full, clean reinstall of all the packages via github:

Warning in serialize(data, node$con): 'package:randomizr' may not be

available when loading

This may be windows specific; I haven't seen it on mac.

Implement sharp null and constant effects estimators for Horvitz-Thompson

Along with simulations and discussions of their properties.

"outcome" missing when calling lm_robust_fit directly

happy new year!

MWE:

X <- matrix(c(1, 1, 1, 1, 2, 3), ncol = 2)
y <- c(3,  4, 3)
test <- lm_robust_fit(y = y, X = X, 
              cluster = NULL, 
              se_type = "none", 
              coefficient_name = "X2", 
              weights = NULL, ci = FALSE,
              alpha = 0.05, 
              return_vcov = FALSE, trychol = FALSE)

test$est
test$outcome

this means that tidy(test) errors.

I didn't know where to fix because of lm_return

I think that the "outcome" should just be whatever variable is supplied to y

THANKS

benchmarking

Testing/benchmarking: for a 10-unit experiment, can you show in a test that various designs (as many as you can think of) are unbiased for the true SATE? You could start with one or two to get us started.

Since there are only 252 ways to assign 5 of 10 units to treatment (you can get them all from randomizr::obtain_permutation_matrix) this shouldn't be too expensive in terms of computation time. Make really heterogeneous and weird potential outcomes, then loop or apply through a function that reveals outcomes, runs the estimators and records the output.

implement cluster-robust SEs

It would be great if lm_robust could please implement the SEs described in this document: https://github.com/acoppock/Green-Lab-SOP.

Right now, the BMlmSE function described there is pretty slow, but I bet that we could Rcpp it up real nice!

Print and summary methods for lm_robust

I think they should probably be the same as
tidy(fit)
for right now.

am I right in thinking that confint() and coef() methods are in the works too?

weights not working in difference_in_means

I've added a new branch with a broken test that should work once this is fixed. Likely has to do with quoted name of weights.

Re commit 2b97eb6

Error in appveyor when checking PRs

There's an error that only exists in a PR, and not the branch itself or the resulting merged branch (which I suppose the pr check is supposed to be doing). Seems to be a problem with the drat repo?

See it around like 400 here: https://ci.appveyor.com/project/DeclareDesign/estimatr/build/1.0.201/job/cw717ysnrufj9pn2

I'm not sure of the nitty gritty here but I'll get to this around when prepping for CRAN submission if no one takes it on before then, or feel free to close this if this is part of a larger issue that people are aware of.

dataset too large

I'm using lm_robust on a dataset with 391,112 observations. My model includes 3 variables. I am receiving the following error:

Error in lm_robust_helper(y = outcome, X = design_matrix, cluster = cluster,  : 
  SpMat::init(): requested size is too large; suggest to compile in C++11 mode or enable ARMA_64BIT_WORD

I have no issues when I use a dataset of reduced size (it runs with 78,000 observations but does not with 100,000).

Allow input to be a model fit

Argument for clusters should take a vector of the same length as the data in the model matrix.

We should split the lm_robust function in to:

Two functions, both called lm_robust. A generic function that prepares the model matrix and such and a function that recognizes a model fit.
A function that takes the result of the above functions, does all fitting + inference as necessary and returns our usual output

Better argument error handling

Check that arguments being passed to lm_robust are correct, and return informative errors if not:

se_type that is not one of the allowed
coefficient name that isn't in the model
weights or cluster variable that aren't in the data frame
weights that are non-numeric
weights must be non-negative (?)

Mimic check_randomizr_arguments

error installing from GitHub

Hello,
I'm having trouble installing the latest version of estimatr.

This is what I ran to install the latest version (I have an older version installed):
devtools::install_github('DeclareDesign/estimatr')

This is the output:


Downloading GitHub repo DeclareDesign/estimatr@master
from URL https://api.github.com/repos/DeclareDesign/estimatr/zipball/master
Installing estimatr
'/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file --no-environ --no-save --no-restore  \
  --quiet CMD INSTALL  \
  '/private/var/folders/kp/29_s7k2n4pj3tl2fllng_ncc0000gp/T/Rtmp3ZLeQa/devtools387b48ca81e7/DeclareDesign-estimatr-c7dc222'  \
  --library='/Library/Frameworks/R.framework/Versions/3.4/Resources/library' --install-tests 

* installing *source* package ‘estimatr’ ...
Warning in as.POSIXlt.POSIXct(x, tz) :
  unknown timezone 'zone/tz/2017c.1.0/zoneinfo/America/Los_Angeles'
** libs
/usr/local/clang4/bin/clang++ -std=gnu++11 -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -DARMA_64BIT_WORD=1 -I"/Library/Frameworks/R.framework/Versions/3.4/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.4/Resources/library/RcppArmadillo/include" -I/usr/local/include   -fPIC  -Wall -g -O2 -c RcppExports.cpp -o RcppExports.o
In file included from RcppExports.cpp:4:
In file included from /Library/Frameworks/R.framework/Versions/3.4/Resources/library/RcppArmadillo/include/RcppArmadillo.h:31:
In file included from /Library/Frameworks/R.framework/Versions/3.4/Resources/library/RcppArmadillo/include/RcppArmadilloForward.h:26:
In file included from /Library/Frameworks/R.framework/Versions/3.4/Resources/library/Rcpp/include/RcppCommon.h:29:
In file included from /Library/Frameworks/R.framework/Versions/3.4/Resources/library/Rcpp/include/Rcpp/r/headers.h:48:
In file included from /Library/Frameworks/R.framework/Versions/3.4/Resources/library/Rcpp/include/Rcpp/platform/compiler.h:100:
In file included from /usr/local/clang4/bin/../include/c++/v1/cmath:305:
/usr/local/clang4/bin/../include/c++/v1/math.h:301:15: fatal error: 'math.h' file not found
#include_next <math.h>
              ^~~~~~~~
1 error generated.
make: *** [RcppExports.o] Error 1
ERROR: compilation failed for package ‘estimatr’
* removing ‘/Library/Frameworks/R.framework/Versions/3.4/Resources/library/estimatr’
* restoring previous ‘/Library/Frameworks/R.framework/Versions/3.4/Resources/library/estimatr’
Installation failed: Command failed (1)

Master Progress Tracker for CRAN Complete Submission

This is a roadmap to CRAN bliss.

Estimators

Package wide

CRAN Submission Prep

Look up any checks needed besides the default R build/check process
- WinBuilder
Confirm licenses, copyright statements, and other submission metadata

Performance with BM clustered standard errors

BM standard errors can take a very long time to estimate or even hang R, as reported by akb13. Test with 300,000 observations with many 20 clusters and with many more clusters (3+ covariates)

broom integration

is apparently BROKE.

MWE:

library(estimatr)
library(broom)

X <- rnorm(100)
Y <- rnorm(100)

fit <- lm_robust(Y ~ X)
tidy(fit) # errors
estimatr::tidy(fit)

iv_robust

A new function that mimics the functionality of lm_robust i.e. using Rcpp but for IV regression. The idea is to have a faster, simpler version of the IV functions that are out there and to do the right things re: standard errors. Tricky bit will be the formula.

vignette on output using texreg, xtable, and apsrtable

(also, not doing stargazer for now)

bug in coefficient_name

The NSE seems to be working well with cluster names and weight names, but not with coefficient. name. Is this intentional?

This works:

> dat <- data.frame(x = runif(10), y = runif(10))
> 
> lm_robust(y ~ x, data = dat,  coefficient_name = "x")
  coefficient_name        est        se         p  ci_lower  ci_upper df
2                x -0.2791966 0.3253278 0.4157517 -1.029404 0.4710105  8

But this does not:

> lm_robust(y ~ x, data = dat,  coefficient_name = x)
Error in lm_robust_fit(y = model_data$outcome, X = model_data$design_matrix,  : 
  object 'x' not found

Thanks!

lm_lin for multi-valued treatments

lm_lin supports binary treatments; seems like wouldn't be too hard to do same formula editing for multiple-valued treatments. should we? @acoppock @lukesonnet

benchmarking clustered SEs

Not sure where we are on benchmarking our clustered standard errors. But it would be great to test against stata?

I also came across this http://rforpublichealth.blogspot.com/2014/10/easy-clustered-standard-errors-in-r.html

which summarizes the main ways that ppl currently do clustered SEs in R.

lm_robust breaks when cluster_variable_name is a character

... this is a problem, because fabricatr gives the ids as characters by default.

rm(list = ls())
library(DeclareDesign)
library(dplyr)

population <-
  declare_population(
    cluster = level(N = 5, 
                    individuals_per_cluster = 5,
                    cluster_effect = rnorm(N)),
    individual = level(N = individuals_per_cluster,
                       noise = rnorm(N)))

potential_outcomes <- declare_potential_outcomes(
  formula = Y ~ Z + cluster_effect + noise)

assignment <- declare_assignment(clust_var = cluster, 
                                 m = 2)

estimator_factor <- declare_estimator(formula = Y ~ Z,
                               model = lm_robust,
                               cluster_variable_name = as.factor(cluster))

# Fine:
population() %>% potential_outcomes() %>% assignment() %>% reveal_outcomes() %>%  estimator_factor()

estimator_integer <- declare_estimator(formula = Y ~ Z,
                               model = lm_robust,
                               cluster_variable_name = as.integer(cluster))

# Fine:
population() %>% potential_outcomes() %>% assignment() %>% reveal_outcomes() %>%  estimator_integer()

estimator_character <- declare_estimator(formula = Y ~ Z,
                               model = lm_robust,
                               cluster_variable_name = cluster)

# Regression machine broke:
population() %>% potential_outcomes() %>% assignment() %>% reveal_outcomes() %>%  estimator_character()

Undocumented default for draw_rs

What is this doing? Let's add an explanation to the Rd.

draw_rs(N=10, clust_var=gl(5,2))
 [1] 0 0 1 1 1 1 1 1 0 0

duplicate coefficient_name option in DD & estimatr

We have the option coefficient_name in declare_estimator and lm_robust (and lm_lin). In both cases, it just subsets the output df. We definitely need it in declare_, we don't strictly need it in lm_robust. They don't (I think) conflict, because declare_estimator will just not pass on the coefficient_name option.

Dealing with rank-deficient X matrices (and speed improvements to lm_robust)

Currently, lm_robust() fails when people pass a rank-deficient matrix. model.matrix() takes care of a lot of these problems, but not all of them, and often with messy data, lots of dummy variables, or some other edge cases, users will still get design matrices that are rank-deficient.

lm() currently handles this using a rank-revealing QR decomposition, and allows it to go ahead with a note. See how it does on the following:

n <- 10
X <- cbind(1, 2, rnorm(n))
y <- rnorm(n)
summary(lm(y~X+0))

Call:
lm(formula = y ~ X + 0)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.2680 -0.6358 -0.1516  0.2116  2.6752 

Coefficients: (1 not defined because of singularities)
    Estimate Std. Error t value Pr(>|t|)
X1 -0.163576   0.383550  -0.426    0.681
X2        NA         NA      NA       NA
X3 -0.003863   0.521907  -0.007    0.994

Residual standard error: 1.212 on 8 degrees of freedom
Multiple R-squared:  0.02224,	Adjusted R-squared:  -0.2222 
F-statistic: 0.09098 on 2 and 8 DF,  p-value: 0.914

It warns that 1 was not defined because of singularities, and puts NA for all of those values, and ignores them in all other calculations (i.e. vcov(lmo) will just return a 2x2 in this case and ignore the dropped variable).

Question: I think we should essentially mimic this functionality. lm() has a singular.ok = TRUE default, that I think we should copy with the singular_ok or strict. However, a question is whether we should default to allowing singularities or erroring instead. Defaulting to erroring might be annoying for some, but it's a little less sneaky that your data aren't well behaved. I vote to default to pass, but I'd like to warn in the call, not just the summary like lm does. Thoughts?

Currently, we have no support for rank-deficient X matrices. I've written code that does the following:

Try LLT cholesky decomposition (fastest)
Does cholesky decomposition thinks that the matrix is full rank?
- If yes, proceed with solution using cholesky decomposition
- If no, proceed with solution using rank-revealing QR decomposition

Now, if X is full-rank, we are as fast as we can be. If X is rank deficient, it would be best to just go straight to the RRQR decomposition.

Therefore, a question: Do we allow people to let us know that something is rank-deficient, and so they can pass a flag telling us this and we go straight to QR?

Re: whether LLT can tell whether something is full rank, it seems like NO, from here. However, we don't actually need to know whether is full rank, but whether LLt is worried it might not be, and it passes a flag that covers just that. It seems like the Eigen people think it will warn in that case, and it works in some toy examples.

Nonetheless, is it too risky to try the cholesky? Should we go straight for QR? I can't really evaluate these claims.

Here are some speed comparisons to help you out (N = 1000, p = 20),

Rank-deficient matrix (two vars):
Eigen LLT -> QR: 381ms
Eigen QR: 276ms
Old: NA (can't do it)

Full rank matrix:
Eigen LLT -> QR (basically just LLT): 188ms
Eigen QR: 333ms
Old: 414ms

Clarify difference with Stata's HC2 and HC3 weighted estimator

variable_names -> coefficient_name

In all the return objects from d-i-m and lm_robust etc.,

the return frame should have a column named "coefficient_name" rather than the current "variable_names"
right before returning, it should clobber the row names of the return frame via rownames(obj) <- NULL

Transition to outputting "model fits" and building S3 methods

There should be a suite of classes named after the estimators that created them:

horvitz_thompson
difference_in_means
lm_robust

Each of these should have their own s3 methods for:

summary
print
coef
vcov
confint
tidy
predict (only for lm_robust? Actually I think we need a separate lm_lin class so we know to construct the centered and interacted data.)

Also Alex asked about anova, plot, and residuals. I think no on these, except maybe residuals on lm_robust?

Thoughts from alex that I didn't incorporate into the above:

We discussed this week that lm_robust should return a “model fit”. Normal model fits appear to be extremely heavy objects. lm fits store the data!

The value probably has to be a list (though we could have this be a dataframe with a vcov attribute!)
I don’t think we want to ape lm like syntax with summary(robust_fit)$coefficients.
I think that all computations on the data should be done when the model fit is created. Though I admit that this intuition is just coming from my aversion to carrying around the data, which seems computationally expensive.

Luke: how expensive this is remains a ram vs. cpu problem, I'm not convinced it will be that expensive but I'll try and avoid it.

I’d also love it if the outputs of difference_in_means, horvitz_thompson and lm_robust were all unified somehow. Maybe they all have to be lists.

Luke: I think they all have to be lists, but they can be similarly structured lists and tidy() will return that data frame we have always wanted <3

I think it would behoove us to take a peek around at other modelling functions in R (I guess those handled by stargazer is a good list?) and try to understand the variation in “model objects.” Maybe inspection of the broom package would also lead us in the right direction.

tests

The tests should verify for lm_robust, lm_lin, difference_in_means, and iv_robust that the following works:

(1) all combinations of weights, subsets, cluster variable, and block variable (if defined);
(2) faster than their base R equivalent with small data;
(3) faster than their base R equivalent with big data (something like 5000 observations?) -- this could have the flag not run on CRAN so that they are only run on our computers during testing;
(4) check against the answer for as many as possible;
(5) all outputs in all four functions have exact same column names.

Missingness in data causes many parts of lm_robust to fail

Here is a MWE:

df <- data.frame(x = rnorm(100),
                 y = rnorm(100),
                 cluster = sample(1:3, size = 100, replace = T),
                 weights = runif(100))

# all work
estimatr::lm_robust(y ~ x, data = df)
estimatr::lm_robust(y ~ x, data = df, cluster_variable_name = cluster)
estimatr::lm_robust(y ~ x, data = df, weights = weights)

# add missingness, failures
df$cluster[23] <- NA
estimatr::lm_robust(y ~ x, data = df, cluster_variable_name = cluster)
df$x[23] <- NA
estimatr::lm_robust(y ~ x, data = df)

The problem lies in us using the original data instead of model.response to get the outcome, and in doing the same with the cluster.

Ad DV name to tidy output

It'd be awesome if there were a column in the output called dv that reported the name of the dependent variable.

hashtag feature request.

Add residuals and weights S3 methods

These methods would be nice, but they involve storing data of length n in the output lists, which we have so far avoided.

implement clustered difference-in-means

Suggest the following syntax:

difference_in_means(Y ~ Z, cluster_variable_name = my_clusters)

i.e. bare, unquoted.

This would first collapse appropriately using by the mean of the outcome (I'm thinking no reason to allow user to specify a custom collapse function). It would error if the treatment variable is not actually clustered (i.e. multiple treatment values within cluster).

Thoughts? We deleted this functionality before because we thought you would put in a step to collapse the data by clusters into the design, but if you want to have two estimators one that collapses and one that e.g. uses our clustered lm estimator that might be a pain.

Implement weighting with cluster-robust SEs

Check the R code here: https://github.com/jepusto/clubSandwich for how to implement the same small-sample correction for clustered errors with WLS.

Changes to difference_in_means

The difference_in_means function should have two paths. The first path is for pair-matched designs, which don't work in our setup right now because there's no variance with the treated group (or control group) within a pair. The second path is the existing estimators.

For pair-matched designs that are not clustered, the solution in Section 3.6.1.1 (pg. 78-79) in Gerber-Green should be implemented. For clustered pair-matched designs, the estimators for SATE from Imai-Nall-King should be implemented.

For both the pair-matched and existing estimators, we need to add the ability to handle clusters using the bare (unquoted) name of the cluster variable.

For the existing estimators, we need to add the ability to calculate clustered standard errors using that cluster variable, from eq. 3.2.3 in Gerber and Green. To handle the block-and-clustered design this should be added to the internal difference in means function (and thus it will be automatically handled for the blocked design by doing the estimator within each block).

In addition, we need to add a check that all clusters are in the same block. There is fast code in randomizr to do this, can point you to it if you can't find it.

Could you verify (1) that as cluster size converges to 1 that the pair-matched clustered design standard errors are similar or identical to the Gerber-Green pair-matched standard errors?

my matrix is too large!

df <- data.frame(Y = rnorm(100), Z = rbinom(100, 1, .5), X = rnorm(100), W = runif(100))
lm_robust(Y ~ Z, data = df)

error: Mat::init(): requested size is too large
 Error in lm_robust_helper(y = y, X = design_matrix, cluster = cluster,  : 
  Mat::init(): requested size is too large

https://media.giphy.com/media/EYJjKIDi5FKEg/giphy.gif

(this is from the tests...)

Allow declarations from `declare_ra()` to inform estimators about study design

We should allow users to pass a declaration created by declare_ra() that will learn whether the design is a simple random sample, a complete random sample, blocked, clustered, or blocked and clustered. Furthermore, if there is variance in the probability of treatment across the sample, then our estimators should default to using IPW in analysis.

A few choices:

We might consider adding a declaration_options argument that can take a list. It should take the following arguments to begin with:
- ipw = TRUE governing whether or not to take the probability of treatment and set weights to be the inverse propensity weights for difference in means or lm_robust
- treatment_condition = 2 governing which of the arms (column position in the declaration as well as condition name should work) to take as the treatment arm. The next argument would do the same for the control condition. This means that if someone declares a three-armed trial, they just have to tell us which arm to take as the treatment and which as the control. I think default behavior should be NULL so that it errors and says you have a multi-armed trial to please specify the two arms to compare. Another consideration, is whether we should allow this and the next argument to be numeric vectors. in that case if treatment_condition = c(2, 3) then we could treat both treatment conditions 2 and 3 as the same condition and group them together. Of course, it would be nice if there was a way to validate that the Z we are passed was the same as treatment_condition but there isn't a way that works across common use cases that I can see.
- control_condition = 1

The main hiccup I see with this is handling missing data and subsetting. I think if pre-subsetting nrow(data) is different from length(declaration$block_var) or length(declaration$clust_var) that we should error, and we should go ahead and apply subsetting to the declaration variables.

Another question: if someone passes declare_ra, it is clear what the implications are for horvitz_thompson and difference_in_means but not lm_robust. We cluster SEs at the cluster level, do we add dummy variables for the blocks if they aren't there? I think that's a little too paternalistic.

Error in Travis-CI when building vignettes

Because this vignette will have a lot of math, I think it's best to build it as a PDF unless there are objections. If there are, perhaps we can discuss them in #10.

However, because it is a PDF, it is built using LaTeX and there is an error in the Mac set-up on Travis-CI (seen here) where we basically only need to add texlive-extras to the install unless there are other suggestions.

However, we don't currently use a .travis.yml so I'm not sure where these settings exist. @graemeblair?

Error when adding lots of dummy variables

In some standard FE models, lm() doesn't fail (or even return NAs for coefficients) while lm_robust() will fail saying the (X'X) is singular.

I can't come up with a simple MWE, but I have an example in my own data to work off of.

Travis-CI build error downloading BasicTex

I'm getting a build error caused by a 403 error to download BasicTex.

Build log: https://travis-ci.org/DeclareDesign/estimatr/jobs/262133520

Any ideas would be helpful, but I'll just leave this here until I can find a solution.

Names of key variables

Currently, the estimators here have a bunch of arguments for clusters, blocks, weights, and more. Currently, many of them are cluster_variable_name or block_variable_name. While I understand the intention behind making it clear these are names and not the variables themselves, they are really unwieldy. Furthermore, we currently have the weights variable passed as weights. I propose everything be changed to clusters, blocks, and weights, or we change weights to weight_variable_name for consistency.

I obviously prefer the former, but please let me know your thoughts.