Giter VIP home page Giter VIP logo

estimatr's Introduction

DeclareDesign: Declare and Diagnose Research Designs

CRAN status CRAN RStudio mirror downloads Build status Code coverage Replications

DeclareDesign is a system for describing research designs in code and simulating them in order to understand their properties. Because DeclareDesign employs a consistent grammar of designs, you can focus on the intellectually challenging part – designing good research studies – without having to code up simulations from scratch. For more, see declaredesign.org.

Installation

To install the latest stable release of DeclareDesign, please ensure that you are running version 3.5 or later of R and run the following code:

install.packages("DeclareDesign")

Usage

Designs are declared by adding together design elements. Here’s a minimal example that describes a 100 unit randomized controlled trial with a binary outcome. Half the units are assigned to treatment and the remainder to control. The true value of the average treatment effect is 0.05 and it will be estimated with the difference-in-means estimator. The diagnosis shows that the study is unbiased but underpowered.

library(DeclareDesign)

design <-
  declare_model(
    N = 100, 
    potential_outcomes(Y ~ rbinom(N, size = 1, prob = 0.5 + 0.05 * Z))
  ) +
  declare_inquiry(ATE = 0.05) +
  declare_assignment(Z = complete_ra(N, m = 50)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) + 
  declare_estimator(Y ~ Z, .method = lm_robust, inquiry = "ATE")

diagnosands <-
  declare_diagnosands(bias = mean(estimate - estimand),
                      power = mean(p.value <= 0.05))

diagnosis <- diagnose_design(design, diagnosands = diagnosands)
diagnosis
Inquiry Estimator Outcome Bias SE(Bias) Power SE(Power) n sims
ATE estimator Y -0.004 0.004 0.076 0.01 500

Companion software

The core DeclareDesign package relies on four companion packages, each of which is useful in its own right.

  1. randomizr: Easy to use tools for common forms of random assignment and sampling.
  2. fabricatr: Imagine your data before you collect it.
  3. estimatr: Fast estimators for social scientists.
  4. DesignLibrary: Templates to quickly adopt and adapt common research designs.

Learning DeclareDesign

  1. To get started, have a look at this vignette on the idea behind DeclareDesign, which covers the main functionality of the software.

  2. For an explanation of the philosophy behind DeclareDesign, examples in code and words of declaring and diagnosing common research designs in the social sciences, as well as examples of how to incorporate DeclareDesign into your own research, see the book Research Design in the Social Sciences (Blair, Coppock, Humphreys, 2023).

Package structure

Each of these declare_*() functions returns a function.

  1. declare_model() (describes dimensions and distributions over the variables, including potential outcomes)
  2. declare_inquiry() (takes variables in the model and calculates estimand value)
  3. declare_sampling() (takes a population and selects a sample)
  4. declare_assignment() (takes a population or sample and adds treatment assignments)
  5. declare_measurement() (takes data and adds measured values)
  6. declare_estimator() (takes data produced by sampling, assignment, and measurement and returns estimates linked to inquiries)
  7. declare_test() (takes data produced by sampling, assignment, and measurement and returns the result of a test)

To declare a design, connect the components of your design with the + operator.

Once you have declared your design, there are four core post-design-declaration commands used to modify or diagnose your design:

  1. diagnose_design() (takes a design and returns simulations and diagnosis)
  2. draw_data() (takes a design and returns a single draw of the data)
  3. draw_estimates() (takes a design and returns a single simulation of estimates)
  4. draw_estimands() (takes a design and returns a single simulation of estimands)

A few other features:

  1. A designer is a function that takes parameters (e.g., N) and returns a design. expand_design() is a function of a designer and parameters that return a design.
  2. You can change the diagnosands with declare_diagnosands().

This project was generously supported by a grant from the Laura and John Arnold Foundation and seed funding from EGAP.

estimatr's People

Contributors

aaronrudkin avatar acoppock avatar elbersb avatar gedevan-aleksizde avatar graemeblair avatar jaspercooper avatar katrinleinweber avatar kuriwaki avatar lilymedina avatar lukesonnet avatar nfultz avatar nick-rivera avatar royalts avatar rvlenth avatar vincentarelbundock avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

estimatr's Issues

confusing clean_model_data error

test <- structure(list(catchment_area = c("01", "02", "03", "04", "05", 
                                          "06", "07", "08", "09", "10", "11", "12", "13", "14", "15", "16", 
                                          "17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", 
                                          "28", "29", "30", "31", "32", "33", "34", "35", "36", "37", "38", 
                                          "39", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49", 
                                          "50"), infant_alive = c(0.709677419354839, 1, 0.821428571428571, 
                                                                  0.925925925925926, 0.923076923076923, 0.714285714285714, 0.84375, 
                                                                  0.791666666666667, 0.714285714285714, 0.958333333333333, 0.709677419354839, 
                                                                  0.774193548387097, 0.961538461538462, 0.85, 0.888888888888889, 
                                                                  0.869565217391304, 0.846153846153846, 0.875, 0.818181818181818, 
                                                                  0.88, 0.869565217391304, 0.733333333333333, 0.821428571428571, 
                                                                  0.970588235294118, 0.769230769230769, 0.8, 0.958333333333333, 
                                                                  0.695652173913043, 0.88, 0.80952380952381, 0.952380952380952, 
                                                                  0.931034482758621, 0.8, 0.92, 0.807692307692308, 1, 1, 0.892857142857143, 
                                                                  0.848484848484849, 0.875, 0.896551724137931, 0.863636363636364, 
                                                                  0.8, 0.838709677419355, 0.84, 0.903225806451613, 0.736842105263158, 
                                                                  0.925925925925926, 0.761904761904762, 0.96), Z = c(0, 1, 1, 1, 
                                                                                                                     1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 
                                                                                                                     0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 
                                                                                                                     0, 1, 0, 1)), .Names = c("catchment_area", "infant_alive", "Z"
                                                                                                                     ), row.names = c(NA, -50L), class = "data.frame")

lm_robust(infant_alive ~ Z, data = test) 

Error in clean_model_data(formula = infant_alive ~ Z, data = test, subset = , :
unused argument (block = NULL)

Rstudio yellow squigglies

When I'm using lm_robust, everything works great!

But rstudio is upset. It has a yellow triangle and a yellow squiggly underline.

When I hover over it says

argument 'weights' is missing, with no default
argument 'subset' is missing, with no default
argument 'clusters' is missing, with no default

maybe some NSE thing?

lm_lin

A new function which implements the estimator proposed in Lin (2013). It is basically a thin wrapper that pre-processes the formula to do the transformation suggested in the paper and then sends to lm_robust. The options should be:

lm_lin(formula = Y ~ Z, covariates = ~ x1 + x2 + x3)

where formula is like in difference_in_means, just including the outcome and treatment indicator.

Make output stargazable

While possibly keeping the output as a dataframe, try to make it so that it can be easily passed to stargazer.

horvitz_thompson

We'd like to implement horvitz_thompson as a second estimator, mimicking the functionality of difference_in_means.

Weighted estimators in difference_in_means

None of our weighted DIM estimators have the same standard errors as lm_robust(), except for the clustered and block-clustered cases which match because they farm all the work out to lm_robust (there may still be degrees of freedom discrepancies anyways).

Currently, I am able to replicate weights::wtd.t.test() and fits with the estimates I've been able to find online, but this doesn't match lm_robust(..., se_type = "HC2")

devtools::install_github("DeclareDesign/estimatr", ref = "weightdim")
n <- 8
dat <- data.frame(y = rnorm(n), z = c(0, 1), w = runif(8))
lm_robust(y ~ z, data = dat, weights = w)
difference_in_means(y ~ z, data = dat, weights = w)

with(dat, weights::wtd.t.test(y[z==1], y[z==0], w[z==1], w[z==0])

What underlies weights::wtd.t.test() are the same weighted variance and means we get from SDMtools that @acoppock linked in Slack.

We can:

a) figure this out
b) farm all weighted estimation to lm_robust, erroring if there are matched pairs and weights as lm_robust can't accomadate this case (Imai et al talk about weights, but for specific estimands)
c) basically error whenever someone tries to do weighted d-i-m for now and (i) do nothing or (ii) transparently tell them they should use lm_robust

Build vignette for estimatr

Should show different SEs, different choices about how many coefficients to return, perhaps show speed gains from not calculating CIs with BM SEs so that users who only want point estimates are aware of these gains.

warning in windows on package load

Macartan is getting this warning when he loads estimatr on windows even after a full, clean reinstall of all the packages via github:

Warning in serialize(data, node$con): 'package:randomizr' may not be

available when loading

This may be windows specific; I haven't seen it on mac.

"outcome" missing when calling lm_robust_fit directly

happy new year!

MWE:

X <- matrix(c(1, 1, 1, 1, 2, 3), ncol = 2)
y <- c(3,  4, 3)
test <- lm_robust_fit(y = y, X = X, 
              cluster = NULL, 
              se_type = "none", 
              coefficient_name = "X2", 
              weights = NULL, ci = FALSE,
              alpha = 0.05, 
              return_vcov = FALSE, trychol = FALSE)

test$est
test$outcome

this means that tidy(test) errors.

I didn't know where to fix because of lm_return

I think that the "outcome" should just be whatever variable is supplied to y

THANKS

benchmarking

Testing/benchmarking: for a 10-unit experiment, can you show in a test that various designs (as many as you can think of) are unbiased for the true SATE? You could start with one or two to get us started.

Since there are only 252 ways to assign 5 of 10 units to treatment (you can get them all from randomizr::obtain_permutation_matrix) this shouldn't be too expensive in terms of computation time. Make really heterogeneous and weird potential outcomes, then loop or apply through a function that reveals outcomes, runs the estimators and records the output.

Error in appveyor when checking PRs

There's an error that only exists in a PR, and not the branch itself or the resulting merged branch (which I suppose the pr check is supposed to be doing). Seems to be a problem with the drat repo?

See it around like 400 here: https://ci.appveyor.com/project/DeclareDesign/estimatr/build/1.0.201/job/cw717ysnrufj9pn2

I'm not sure of the nitty gritty here but I'll get to this around when prepping for CRAN submission if no one takes it on before then, or feel free to close this if this is part of a larger issue that people are aware of.

dataset too large

I'm using lm_robust on a dataset with 391,112 observations. My model includes 3 variables. I am receiving the following error:

Error in lm_robust_helper(y = outcome, X = design_matrix, cluster = cluster,  : 
  SpMat::init(): requested size is too large; suggest to compile in C++11 mode or enable ARMA_64BIT_WORD

I have no issues when I use a dataset of reduced size (it runs with 78,000 observations but does not with 100,000).

Allow input to be a model fit

Argument for clusters should take a vector of the same length as the data in the model matrix.

We should split the lm_robust function in to:

  • Two functions, both called lm_robust. A generic function that prepares the model matrix and such and a function that recognizes a model fit.
  • A function that takes the result of the above functions, does all fitting + inference as necessary and returns our usual output

Better argument error handling

Check that arguments being passed to lm_robust are correct, and return informative errors if not:

  • se_type that is not one of the allowed
  • coefficient name that isn't in the model
  • weights or cluster variable that aren't in the data frame
  • weights that are non-numeric
  • weights must be non-negative (?)

Mimic check_randomizr_arguments

error installing from GitHub

Hello,
I'm having trouble installing the latest version of estimatr.

This is what I ran to install the latest version (I have an older version installed):
devtools::install_github('DeclareDesign/estimatr')

This is the output:


Downloading GitHub repo DeclareDesign/estimatr@master
from URL https://api.github.com/repos/DeclareDesign/estimatr/zipball/master
Installing estimatr
'/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file --no-environ --no-save --no-restore  \
  --quiet CMD INSTALL  \
  '/private/var/folders/kp/29_s7k2n4pj3tl2fllng_ncc0000gp/T/Rtmp3ZLeQa/devtools387b48ca81e7/DeclareDesign-estimatr-c7dc222'  \
  --library='/Library/Frameworks/R.framework/Versions/3.4/Resources/library' --install-tests 

* installing *source* package ‘estimatr’ ...
Warning in as.POSIXlt.POSIXct(x, tz) :
  unknown timezone 'zone/tz/2017c.1.0/zoneinfo/America/Los_Angeles'
** libs
/usr/local/clang4/bin/clang++ -std=gnu++11 -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -DARMA_64BIT_WORD=1 -I"/Library/Frameworks/R.framework/Versions/3.4/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.4/Resources/library/RcppArmadillo/include" -I/usr/local/include   -fPIC  -Wall -g -O2 -c RcppExports.cpp -o RcppExports.o
In file included from RcppExports.cpp:4:
In file included from /Library/Frameworks/R.framework/Versions/3.4/Resources/library/RcppArmadillo/include/RcppArmadillo.h:31:
In file included from /Library/Frameworks/R.framework/Versions/3.4/Resources/library/RcppArmadillo/include/RcppArmadilloForward.h:26:
In file included from /Library/Frameworks/R.framework/Versions/3.4/Resources/library/Rcpp/include/RcppCommon.h:29:
In file included from /Library/Frameworks/R.framework/Versions/3.4/Resources/library/Rcpp/include/Rcpp/r/headers.h:48:
In file included from /Library/Frameworks/R.framework/Versions/3.4/Resources/library/Rcpp/include/Rcpp/platform/compiler.h:100:
In file included from /usr/local/clang4/bin/../include/c++/v1/cmath:305:
/usr/local/clang4/bin/../include/c++/v1/math.h:301:15: fatal error: 'math.h' file not found
#include_next <math.h>
              ^~~~~~~~
1 error generated.
make: *** [RcppExports.o] Error 1
ERROR: compilation failed for package ‘estimatr’
* removing ‘/Library/Frameworks/R.framework/Versions/3.4/Resources/library/estimatr’
* restoring previous ‘/Library/Frameworks/R.framework/Versions/3.4/Resources/library/estimatr’
Installation failed: Command failed (1)

Master Progress Tracker for CRAN Complete Submission

This is a roadmap to CRAN bliss.

Estimators

  • horvitz_thompson() issue #18
    • Implement Young's inequality and constant effects variance estimators
    • Implement support for blocked and clustered designs
    • ~~~Implement support for complete randomized designs where there are some joint probabilities == 0 within clusters~~~ Warn if variance is < 0, which is possible!
    • Implement function to take any arbitrary permutation matrix and build condition probability matrix
    • Change how data is processed from declare_ra() or data, depends partially on DeclareDesign/randomizr#26
      • Need to add support for learning condition_pr_mat from blocked designs
    • Macartan feedback
    • Resolving "constant" effects estimator
    • Documentation
    • Tests
  • difference_in_means() issue #12
    • Support for clustered designs
    • Not supporting Support for learning from declare_ra() issue #29
    • Confirm weighted variance is correct, see #15
    • Confirm dof corrections for clustered and blocked designs
    • Documentation
    • Tests
  • lm_robust()
    • Implement clubSandwich variance (support for cluster-level dummies + weights in cluster robust inference)
    • Not supporting Support for learning from declare_ra() issue #29
    • Check weights with the stata cluster robust variance estimation
    • Move over to RcppEigen and support rank-deficient X #53
    • Documentation
    • Tests
  • lm_lin()
    • Multi-valued treatments - issue #66
    • Documentation

Package wide

  • Vignette - issue #10
  • Benchmarking vignette
  • Style
  • Stargazable output - issue #7 post-CRAN
  • Margins support - issue #63
  • Predict method for lm_robust and lm_lin
  • Jasper feedback
  • Overall coverage
  • Overall error handling - issue #11
  • Check tidy collision with tidyverse/broom
  • Naming of cluster_variable_name arguments - issue #47

CRAN Submission Prep

  • Look up any checks needed besides the default R build/check process
    • WinBuilder
  • Confirm licenses, copyright statements, and other submission metadata

Performance with BM clustered standard errors

BM standard errors can take a very long time to estimate or even hang R, as reported by akb13. Test with 300,000 observations with many 20 clusters and with many more clusters (3+ covariates)

broom integration

is apparently BROKE.

MWE:

library(estimatr)
library(broom)

X <- rnorm(100)
Y <- rnorm(100)

fit <- lm_robust(Y ~ X)
tidy(fit) # errors
estimatr::tidy(fit)

iv_robust

A new function that mimics the functionality of lm_robust i.e. using Rcpp but for IV regression. The idea is to have a faster, simpler version of the IV functions that are out there and to do the right things re: standard errors. Tricky bit will be the formula.

bug in coefficient_name

The NSE seems to be working well with cluster names and weight names, but not with coefficient. name. Is this intentional?

This works:

> dat <- data.frame(x = runif(10), y = runif(10))
> 
> lm_robust(y ~ x, data = dat,  coefficient_name = "x")
  coefficient_name        est        se         p  ci_lower  ci_upper df
2                x -0.2791966 0.3253278 0.4157517 -1.029404 0.4710105  8

But this does not:

> lm_robust(y ~ x, data = dat,  coefficient_name = x)
Error in lm_robust_fit(y = model_data$outcome, X = model_data$design_matrix,  : 
  object 'x' not found

Thanks!

lm_robust breaks when cluster_variable_name is a character

... this is a problem, because fabricatr gives the ids as characters by default.

rm(list = ls())
library(DeclareDesign)
library(dplyr)

population <-
  declare_population(
    cluster = level(N = 5, 
                    individuals_per_cluster = 5,
                    cluster_effect = rnorm(N)),
    individual = level(N = individuals_per_cluster,
                       noise = rnorm(N)))

potential_outcomes <- declare_potential_outcomes(
  formula = Y ~ Z + cluster_effect + noise)

assignment <- declare_assignment(clust_var = cluster, 
                                 m = 2)

estimator_factor <- declare_estimator(formula = Y ~ Z,
                               model = lm_robust,
                               cluster_variable_name = as.factor(cluster))

# Fine:
population() %>% potential_outcomes() %>% assignment() %>% reveal_outcomes() %>%  estimator_factor()

estimator_integer <- declare_estimator(formula = Y ~ Z,
                               model = lm_robust,
                               cluster_variable_name = as.integer(cluster))

# Fine:
population() %>% potential_outcomes() %>% assignment() %>% reveal_outcomes() %>%  estimator_integer()

estimator_character <- declare_estimator(formula = Y ~ Z,
                               model = lm_robust,
                               cluster_variable_name = cluster)

# Regression machine broke:
population() %>% potential_outcomes() %>% assignment() %>% reveal_outcomes() %>%  estimator_character()

duplicate coefficient_name option in DD & estimatr

We have the option coefficient_name in declare_estimator and lm_robust (and lm_lin). In both cases, it just subsets the output df. We definitely need it in declare_, we don't strictly need it in lm_robust. They don't (I think) conflict, because declare_estimator will just not pass on the coefficient_name option.

Dealing with rank-deficient X matrices (and speed improvements to lm_robust)

Currently, lm_robust() fails when people pass a rank-deficient matrix. model.matrix() takes care of a lot of these problems, but not all of them, and often with messy data, lots of dummy variables, or some other edge cases, users will still get design matrices that are rank-deficient.

lm() currently handles this using a rank-revealing QR decomposition, and allows it to go ahead with a note. See how it does on the following:

n <- 10
X <- cbind(1, 2, rnorm(n))
y <- rnorm(n)
summary(lm(y~X+0))
Call:
lm(formula = y ~ X + 0)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.2680 -0.6358 -0.1516  0.2116  2.6752 

Coefficients: (1 not defined because of singularities)
    Estimate Std. Error t value Pr(>|t|)
X1 -0.163576   0.383550  -0.426    0.681
X2        NA         NA      NA       NA
X3 -0.003863   0.521907  -0.007    0.994

Residual standard error: 1.212 on 8 degrees of freedom
Multiple R-squared:  0.02224,	Adjusted R-squared:  -0.2222 
F-statistic: 0.09098 on 2 and 8 DF,  p-value: 0.914

It warns that 1 was not defined because of singularities, and puts NA for all of those values, and ignores them in all other calculations (i.e. vcov(lmo) will just return a 2x2 in this case and ignore the dropped variable).

Question: I think we should essentially mimic this functionality. lm() has a singular.ok = TRUE default, that I think we should copy with the singular_ok or strict. However, a question is whether we should default to allowing singularities or erroring instead. Defaulting to erroring might be annoying for some, but it's a little less sneaky that your data aren't well behaved. I vote to default to pass, but I'd like to warn in the call, not just the summary like lm does. Thoughts?


Currently, we have no support for rank-deficient X matrices. I've written code that does the following:

  1. Try LLT cholesky decomposition (fastest)
  2. Does cholesky decomposition thinks that the matrix is full rank?
    • If yes, proceed with solution using cholesky decomposition
    • If no, proceed with solution using rank-revealing QR decomposition

Now, if X is full-rank, we are as fast as we can be. If X is rank deficient, it would be best to just go straight to the RRQR decomposition.

Therefore, a question: Do we allow people to let us know that something is rank-deficient, and so they can pass a flag telling us this and we go straight to QR?

Re: whether LLT can tell whether something is full rank, it seems like NO, from here. However, we don't actually need to know whether is full rank, but whether LLt is worried it might not be, and it passes a flag that covers just that. It seems like the Eigen people think it will warn in that case, and it works in some toy examples.

Nonetheless, is it too risky to try the cholesky? Should we go straight for QR? I can't really evaluate these claims.


Here are some speed comparisons to help you out (N = 1000, p = 20),

Rank-deficient matrix (two vars):
Eigen LLT -> QR: 381ms
Eigen QR: 276ms
Old: NA (can't do it)

Full rank matrix:
Eigen LLT -> QR (basically just LLT): 188ms
Eigen QR: 333ms
Old: 414ms

variable_names -> coefficient_name

In all the return objects from d-i-m and lm_robust etc.,

  • the return frame should have a column named "coefficient_name" rather than the current "variable_names"
  • right before returning, it should clobber the row names of the return frame via rownames(obj) <- NULL

Transition to outputting "model fits" and building S3 methods

There should be a suite of classes named after the estimators that created them:

  • horvitz_thompson
  • difference_in_means
  • lm_robust

Each of these should have their own s3 methods for:

  • summary
  • print
  • coef
  • vcov
  • confint
  • tidy
  • predict (only for lm_robust? Actually I think we need a separate lm_lin class so we know to construct the centered and interacted data.)

Also Alex asked about anova, plot, and residuals. I think no on these, except maybe residuals on lm_robust?

Thoughts from alex that I didn't incorporate into the above:

We discussed this week that lm_robust should return a “model fit”. Normal model fits appear to be extremely heavy objects. lm fits store the data!

The value probably has to be a list (though we could have this be a dataframe with a vcov attribute!)
I don’t think we want to ape lm like syntax with summary(robust_fit)$coefficients.
I think that all computations on the data should be done when the model fit is created. Though I admit that this intuition is just coming from my aversion to carrying around the data, which seems computationally expensive.

Luke: how expensive this is remains a ram vs. cpu problem, I'm not convinced it will be that expensive but I'll try and avoid it.

I’d also love it if the outputs of difference_in_means, horvitz_thompson and lm_robust were all unified somehow. Maybe they all have to be lists.

Luke: I think they all have to be lists, but they can be similarly structured lists and tidy() will return that data frame we have always wanted <3

I think it would behoove us to take a peek around at other modelling functions in R (I guess those handled by stargazer is a good list?) and try to understand the variation in “model objects.” Maybe inspection of the broom package would also lead us in the right direction.

tests

The tests should verify for lm_robust, lm_lin, difference_in_means, and iv_robust that the following works:

(1) all combinations of weights, subsets, cluster variable, and block variable (if defined);
(2) faster than their base R equivalent with small data;
(3) faster than their base R equivalent with big data (something like 5000 observations?) -- this could have the flag not run on CRAN so that they are only run on our computers during testing;
(4) check against the answer for as many as possible;
(5) all outputs in all four functions have exact same column names.

Missingness in data causes many parts of lm_robust to fail

Here is a MWE:

df <- data.frame(x = rnorm(100),
                 y = rnorm(100),
                 cluster = sample(1:3, size = 100, replace = T),
                 weights = runif(100))

# all work
estimatr::lm_robust(y ~ x, data = df)
estimatr::lm_robust(y ~ x, data = df, cluster_variable_name = cluster)
estimatr::lm_robust(y ~ x, data = df, weights = weights)

# add missingness, failures
df$cluster[23] <- NA
estimatr::lm_robust(y ~ x, data = df, cluster_variable_name = cluster)
df$x[23] <- NA
estimatr::lm_robust(y ~ x, data = df)

The problem lies in us using the original data instead of model.response to get the outcome, and in doing the same with the cluster.

Ad DV name to tidy output

It'd be awesome if there were a column in the output called dv that reported the name of the dependent variable.

hashtag feature request.

implement clustered difference-in-means

Suggest the following syntax:

difference_in_means(Y ~ Z, cluster_variable_name = my_clusters)

i.e. bare, unquoted.

This would first collapse appropriately using by the mean of the outcome (I'm thinking no reason to allow user to specify a custom collapse function). It would error if the treatment variable is not actually clustered (i.e. multiple treatment values within cluster).

Thoughts? We deleted this functionality before because we thought you would put in a step to collapse the data by clusters into the design, but if you want to have two estimators one that collapses and one that e.g. uses our clustered lm estimator that might be a pain.

Changes to difference_in_means

The difference_in_means function should have two paths. The first path is for pair-matched designs, which don't work in our setup right now because there's no variance with the treated group (or control group) within a pair. The second path is the existing estimators.

For pair-matched designs that are not clustered, the solution in Section 3.6.1.1 (pg. 78-79) in Gerber-Green should be implemented. For clustered pair-matched designs, the estimators for SATE from Imai-Nall-King should be implemented.

For both the pair-matched and existing estimators, we need to add the ability to handle clusters using the bare (unquoted) name of the cluster variable.

For the existing estimators, we need to add the ability to calculate clustered standard errors using that cluster variable, from eq. 3.2.3 in Gerber and Green. To handle the block-and-clustered design this should be added to the internal difference in means function (and thus it will be automatically handled for the blocked design by doing the estimator within each block).

In addition, we need to add a check that all clusters are in the same block. There is fast code in randomizr to do this, can point you to it if you can't find it.

Could you verify (1) that as cluster size converges to 1 that the pair-matched clustered design standard errors are similar or identical to the Gerber-Green pair-matched standard errors?

my matrix is too large!

df <- data.frame(Y = rnorm(100), Z = rbinom(100, 1, .5), X = rnorm(100), W = runif(100))
lm_robust(Y ~ Z, data = df)
error: Mat::init(): requested size is too large
 Error in lm_robust_helper(y = y, X = design_matrix, cluster = cluster,  : 
  Mat::init(): requested size is too large 

https://media.giphy.com/media/EYJjKIDi5FKEg/giphy.gif

(this is from the tests...)

Allow declarations from `declare_ra()` to inform estimators about study design

We should allow users to pass a declaration created by declare_ra() that will learn whether the design is a simple random sample, a complete random sample, blocked, clustered, or blocked and clustered. Furthermore, if there is variance in the probability of treatment across the sample, then our estimators should default to using IPW in analysis.

A few choices:

  • We might consider adding a declaration_options argument that can take a list. It should take the following arguments to begin with:
    • ipw = TRUE governing whether or not to take the probability of treatment and set weights to be the inverse propensity weights for difference in means or lm_robust
    • treatment_condition = 2 governing which of the arms (column position in the declaration as well as condition name should work) to take as the treatment arm. The next argument would do the same for the control condition. This means that if someone declares a three-armed trial, they just have to tell us which arm to take as the treatment and which as the control. I think default behavior should be NULL so that it errors and says you have a multi-armed trial to please specify the two arms to compare. Another consideration, is whether we should allow this and the next argument to be numeric vectors. in that case if treatment_condition = c(2, 3) then we could treat both treatment conditions 2 and 3 as the same condition and group them together. Of course, it would be nice if there was a way to validate that the Z we are passed was the same as treatment_condition but there isn't a way that works across common use cases that I can see.
    • control_condition = 1

The main hiccup I see with this is handling missing data and subsetting. I think if pre-subsetting nrow(data) is different from length(declaration$block_var) or length(declaration$clust_var) that we should error, and we should go ahead and apply subsetting to the declaration variables.

Another question: if someone passes declare_ra, it is clear what the implications are for horvitz_thompson and difference_in_means but not lm_robust. We cluster SEs at the cluster level, do we add dummy variables for the blocks if they aren't there? I think that's a little too paternalistic.

Error in Travis-CI when building vignettes

Because this vignette will have a lot of math, I think it's best to build it as a PDF unless there are objections. If there are, perhaps we can discuss them in #10.

However, because it is a PDF, it is built using LaTeX and there is an error in the Mac set-up on Travis-CI (seen here) where we basically only need to add texlive-extras to the install unless there are other suggestions.

However, we don't currently use a .travis.yml so I'm not sure where these settings exist. @graemeblair?

Error when adding lots of dummy variables

In some standard FE models, lm() doesn't fail (or even return NAs for coefficients) while lm_robust() will fail saying the (X'X) is singular.

I can't come up with a simple MWE, but I have an example in my own data to work off of.

Names of key variables

Currently, the estimators here have a bunch of arguments for clusters, blocks, weights, and more. Currently, many of them are cluster_variable_name or block_variable_name. While I understand the intention behind making it clear these are names and not the variables themselves, they are really unwieldy. Furthermore, we currently have the weights variable passed as weights. I propose everything be changed to clusters, blocks, and weights, or we change weights to weight_variable_name for consistency.

I obviously prefer the former, but please let me know your thoughts.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.