declaredesign / fabricatr

fabricatr: Imagine Your Data Before You Collect It

Home Page: https://declaredesign.org/r/fabricatr

License: Other

fabricatr's Introduction

DeclareDesign: Declare and Diagnose Research Designs

DeclareDesign is a system for describing research designs in code and simulating them in order to understand their properties. Because DeclareDesign employs a consistent grammar of designs, you can focus on the intellectually challenging part – designing good research studies – without having to code up simulations from scratch. For more, see declaredesign.org.

Installation

To install the latest stable release of DeclareDesign, please ensure that you are running version 3.5 or later of R and run the following code:

install.packages("DeclareDesign")

Usage

Designs are declared by adding together design elements. Here’s a minimal example that describes a 100 unit randomized controlled trial with a binary outcome. Half the units are assigned to treatment and the remainder to control. The true value of the average treatment effect is 0.05 and it will be estimated with the difference-in-means estimator. The diagnosis shows that the study is unbiased but underpowered.

library(DeclareDesign)

design <-
  declare_model(
    N = 100, 
    potential_outcomes(Y ~ rbinom(N, size = 1, prob = 0.5 + 0.05 * Z))
  ) +
  declare_inquiry(ATE = 0.05) +
  declare_assignment(Z = complete_ra(N, m = 50)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) + 
  declare_estimator(Y ~ Z, .method = lm_robust, inquiry = "ATE")

diagnosands <-
  declare_diagnosands(bias = mean(estimate - estimand),
                      power = mean(p.value <= 0.05))

diagnosis <- diagnose_design(design, diagnosands = diagnosands)

diagnosis

Inquiry	Estimator	Outcome	Bias	SE(Bias)	Power	SE(Power)	n sims
ATE	estimator	Y	-0.004	0.004	0.076	0.01	500

Companion software

The core DeclareDesign package relies on four companion packages, each of which is useful in its own right.

randomizr: Easy to use tools for common forms of random assignment and sampling.
fabricatr: Imagine your data before you collect it.
estimatr: Fast estimators for social scientists.
DesignLibrary: Templates to quickly adopt and adapt common research designs.

Learning DeclareDesign

To get started, have a look at this vignette on the idea behind DeclareDesign, which covers the main functionality of the software.
For an explanation of the philosophy behind DeclareDesign, examples in code and words of declaring and diagnosing common research designs in the social sciences, as well as examples of how to incorporate DeclareDesign into your own research, see the book Research Design in the Social Sciences (Blair, Coppock, Humphreys, 2023).

Package structure

Each of these declare_*() functions returns a function.

declare_model() (describes dimensions and distributions over the variables, including potential outcomes)
declare_inquiry() (takes variables in the model and calculates estimand value)
declare_sampling() (takes a population and selects a sample)
declare_assignment() (takes a population or sample and adds treatment assignments)
declare_measurement() (takes data and adds measured values)
declare_estimator() (takes data produced by sampling, assignment, and measurement and returns estimates linked to inquiries)
declare_test() (takes data produced by sampling, assignment, and measurement and returns the result of a test)

To declare a design, connect the components of your design with the + operator.

Once you have declared your design, there are four core post-design-declaration commands used to modify or diagnose your design:

diagnose_design() (takes a design and returns simulations and diagnosis)
draw_data() (takes a design and returns a single draw of the data)
draw_estimates() (takes a design and returns a single simulation of estimates)
draw_estimands() (takes a design and returns a single simulation of estimands)

A few other features:

A designer is a function that takes parameters (e.g., N) and returns a design. expand_design() is a function of a designer and parameters that returns a design.
You can change the diagnosands with declare_diagnosands().

This project was generously supported by a grant from the Laura and John Arnold Foundation and seed funding from EGAP.

fabricatr's Issues

draw_discrete - shouldn't have to pre/append +-Inf

Currently, for draw_discrete you have to manually prepend -Inf and append Inf to the breaks. This is non-ergonomic and ugly, and it also means you have to set names separately on a different argument.
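A minimal base R sketch of the padding being asked for, using cut() rather than draw_discrete itself. The helper name pad_breaks is hypothetical and is not part of fabricatr; it only illustrates adding the infinite endpoints automatically so the caller supplies interior cut points alone.

```r
# Hypothetical helper (not a fabricatr function): pad interior cut points
# with -Inf and Inf so that every value falls into some bin.
pad_breaks <- function(breaks) {
  c(-Inf, sort(breaks), Inf)
}

set.seed(1)
latent <- rnorm(10)

# Two interior cut points give three ordered categories.
interior <- c(-0.5, 0.5)
labels <- c("low", "medium", "high")

binned <- cut(latent,
              breaks = pad_breaks(interior),
              labels = labels,
              ordered_result = TRUE)

# One value from each region, to show the mapping explicitly.
example <- cut(c(-1, 0, 1), breaks = pad_breaks(interior), labels = labels)
```

With the padding done inside the helper, breaks and labels could be passed together, with one more label than interior break.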

Time series data

@graemeblair suggested that Fabricatr should support the construction of time series data. I agree. I think at this juncture it is probably best to mark this as a post-CRAN submission task, in part because I need to think about what constructing time series data even looks like.

@nfultz wisely suggested that random walk data could be generated by doing i.i.d. random draws and then setting the observed value to the cumsum() of the values up until that point -- but that more complicated time series models with actual structure would be slower.
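A small base R sketch of the random walk idea, plus an AR(1) series for contrast. The AR(1) part and the coefficient phi = 0.7 are illustrative assumptions, not something proposed in this thread.

```r
set.seed(42)
n <- 100

# i.i.d. shocks
shocks <- rnorm(n)

# Random walk: each observation is the running sum of the shocks so far.
random_walk <- cumsum(shocks)

# AR(1) with an assumed coefficient phi: y[t] = phi * y[t - 1] + shocks[t].
# stats::filter() with method = "recursive" applies exactly this recursion,
# starting from an initial value of 0.
phi <- 0.7
ar1 <- as.numeric(stats::filter(shocks, filter = phi, method = "recursive"))

head(data.frame(t = seq_len(n), shocks, random_walk, ar1))
```

The random walk has unlimited memory (phi = 1), while a smaller phi makes past shocks decay geometrically, which is one way to frame the "memory of the process" question.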

I'm going to leave this as an open issue so people can suggest what they think time series data generation ought actually to do. In other words, what are the ideal forms that the data should take? I have vaguely covered exponential smoothing and I've had a very cursory overview of AR/I/MA models, but in part it's not clear to me whether these are convenient modelling assumptions or if we actually believe data is generated this way. What kinds of assumptions do users need to make about the memory of the process in order to generate data which flexibly resembles real-world data?

I would love anyone to suggest a MWE in base R of what some data might look like so I could adapt it to fabricatr and see what functions we need to do.

draw_discrete syntax

@graemeblair expressed concern about the syntax of draw_discrete (specifically, us having overloaded "x"). I'm not sure there's an easy way around this -- if we disambiguate x, people will need to learn multiple names. We could use the splat to forcibly de-alias the argument and require users to go by position, but this is messy too. We're going to discuss this later, offering an issue for now so it doesn't get lost in the shuffle.

variables replacing each other bug

This correctly replaces Y: fabricate(fabricate(N = 2, Y = rnorm(N)), Y = 1:2)

This leaves A as c(1,2): fabricate(N=2, A = 1:2, A = 3)

Should replace A with 3

cross_level with multiple external data frames

Another option instead of fabricating the levels above is bringing in multiple data.frames and then doing cross_level. h/t @acoppock

Resampling: Allow users to provide names for N instead of N and ID_labels

Dummy issue for the feature discussed in Neal's suggestion on the passthrough resampling issue, will close when implemented.

argument name for icc

Should we name the args ICC rather than rho?

error when doing math on ID vars

fabricate( cities = level(N = 3), 
           citizens = level(N = 1:3, T = cities + citizens) )
## Error in cities + citizens: non-numeric argument to binary operator

Need to make these numeric?

bug when level above is modify_level

I'm not sure exactly what's going on here (if modify_level being above matters, having data provided, or if just not having an add_level is the problem), but this works:

fabricate(
  city = add_level(
    N = 10,
    city_var = rnorm(N),
    new_city_var = rnorm(N),
    city_name = letters[1:10]
  ),
  neighborhood = add_level(
    N = sample(20:50, N, replace = TRUE)
  )
)

And the same neighborhood specification with same size level above it does not work:

cities <- data.frame(city_var = rnorm(10), city = letters[1:10])

fabricate(
  data = cities,
  city = modify_level(
    new_city_var = rnorm(N)
  ),
  neighborhood = add_level(
    N = sample(20:50, N, replace = TRUE)
  )
)

need to find a way to allow outcomes i.e. in declare_potential_outcomes() not to be at the lowest level

Hierarchy is created nicely in declare_population, but currently the way we recommend using declare_potential_outcomes() just creates outcomes at the lowest level. If it's using only covariates used in higher levels, the outcome may respect the hierarchy. However, if it has a stochastic component such as rbinom(), it will implicitly be drawing that outcome at the bottom of the hierarchy even if the probabilities are created using hierarchical variables (i.e. if you set the probability to vary by village, you would want rbinom() in some cases to be a village-level binary outcome -- but instead it will draw at the individual level). Let's make a way to do this neatly! Should be an easy extension of the use of level().

bug with no variables

This works:

fabricate_data(
  regions = level(N = 5, gdp = rnorm(N)),
  cities = level(N = sample(1:5), subways = rnorm(N, mean = gdp)))

This does not:

fabricate_data(
  regions = level(N = 5),
  cities = level(N = sample(1:5), subways = rnorm(N, mean = gdp)))

Diagnosing problem with resample_data

I have some test nested data -- ignore how ugly this generation function is, it was a quick hack to test something, regardless you can verify that the data itself is sane:

set.seed(19861108)

data_gen = function()
{
  countries = 1:200
  provinces = lapply(1:200, function(i) { 1:runif(1,30,50) })
  c_set = unlist(lapply(1:200, function(i) { rep(i, length(provinces[[i]])) }))
  d = cbind(unname(c_set), unlist(provinces))
  cities = lapply(1:nrow(d), function(i) { 1:runif(1,1,10) })
  cit_set = lapply(1:nrow(d), function(i) { d[rep(i, length(cities[[i]])), ] })
  full = cbind(do.call(rbind, cit_set), unlist(cities))
  colnames(full) = c("country", "prov", "city")
  full = as.data.frame(full)
  return(full)
}

data = data_gen()

Short description: there are 200 countries, each of which has 30-50 provinces, each of which has 1-10 cities.

Currently, this will cause resample_data() to fail: resample_data(data, N=c(20, 8, 5), ID_labels=c("country", "prov", "city")). It will only fail if you try to resample data from all three levels, just sampling the first two levels will not fail.

At some point in the process of drawing data, bootstrap_single_level will be handed a data frame with 0 rows, and trying to sample from a data frame with 0 rows will cause it to fail.

(Mostly just opening this issue to track progress in isolating what is breaking the resample_data function -- don't necessarily need feedback on this)

remove by functionality

To add variables from multiple datasets, users should pre-merge their data and provide to

fabricate_data(data = my_merged_data, countries = level(), regions = level())

Main argument is that merging is complicated and shouldn't be in fabricatr. Currently you could do this case in another way, which would be removed, which does:

countries_data <-
    fabricate_data(N = 2,
                   ID_label = "countries",
                   gdp = rnorm(N))

  regions_data <- fabricate_data(countries = level(N = 2),
                                 regions = level(N = 2, elevation = rnorm(N)))


  full_data <- fabricate_data(
    countries = level(data = countries_data,
                      new_country_variable = rnorm(N)),
    regions = level(data = regions_data, by = "countries",
                    new_region_variable = rnorm(N))
  )

Stochastic N in inner level not inheriting N from outer level

@graemeblair identified an issue where if you have two-level data, and the inner level's definition of N depends on the outer level's N, N isn't being inherited correctly. Issuing this to confirm I fix it, test, and verify stochastic inner N working.

I'd like to have n()

... or something like it, such as the option to create "global variables", that are expressions run on the whole nested data.frame at once.

I want heterogeneous block sizes with a variable telling me the size of the block. Here are four ways I would like to be able to do that, none of which work at present:

population <- declare_population(
  block = level(
    N =  10^4,
    block_effect = rnorm(N),
    # Method 1: pass block_size argument to lower level:
    n_1 = sample(
      x = 2:3,
      size = N,
      replace = TRUE
    )
    # And then pass this to the N argument for the individual level 
    ),
  individual = level(
    N = sample(
      x = 2:3,
      size = 10 ^ 4,
      replace = TRUE
    ),
   # Methods 2 and 3: create an n() function a la dplyr:
   n_2 = n(),
   n_3 = N,
   noise = rnorm(N)
  ),
  # Method 4: global variables
  global_variables = list(
    n_4 = group_by(block_ID) %>% mutate(n = n())
 )
)

At present I have to do this:

# Blocks of two or three units
block_data <-
  data.frame(
    block_size = sample(
      x = 2:3,
      size = 10 ^ 4,
      replace = TRUE
    )
  )

population <- declare_population(
  block = level(
    level_data = block_data,
    block_effect = rnorm(N)
    ),
  individual = level(
    N = block_data$block_size,
    noise = rnorm(N)
  )
)
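For reference, the target data structure can be sketched in base R, outside fabricatr and DeclareDesign entirely. The names below (blocks, individuals, n_in_block) are illustrative; ave() with FUN = length plays the role of dplyr's n() here.

```r
set.seed(7)
n_blocks <- 5

# Block-level data with a stochastic block size of two or three units.
block_size <- sample(2:3, size = n_blocks, replace = TRUE)
blocks <- data.frame(
  block_ID = seq_len(n_blocks),
  block_effect = rnorm(n_blocks),
  block_size = block_size
)

# Expand to the individual level: repeat each block row block_size times.
individuals <- blocks[rep(seq_len(n_blocks), times = block_size), ]
rownames(individuals) <- NULL
individuals$noise <- rnorm(nrow(individuals))

# Count rows within each block after the fact, as n() would.
individuals$n_in_block <- ave(individuals$noise, individuals$block_ID,
                              FUN = length)
```

Both block_size (carried down from the upper level) and n_in_block (computed on the nested data) hold the same value for every row.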

Free speed gains?!? Too good to be true?!?!

I've been profiling the bootstrap/resampling functionality in fabricatr this week. I have a branch where I've implemented a few incremental speedups, but these are child's play.

One big speed boost we can get is replacing rbind with rbindlist (a function from the data.table package). In benchmarks with moderately large data, rbindlist runs about 9x faster than rbind, and the overall resample process runs about 2x faster using rbindlist than rbind. This is a pretty huge gain and I am very much in favour of it.

One issue is, of course, the "Malawi problem": we don't want to increase the number of dependencies for people who are extremely bandwidth constrained. But what if we could have it both ways, allowing users who have data.table installed to make use of it, while allowing users who don't to use our package without being told to install it?

Consider the following snippet:

        if(!requireNamespace("data.table")) {
            res = do.call(rbind, results_all)
            rownames(res) = NULL
        } else {
            # User has data.table, give them a speed benefit for it
            res = data.table::rbindlist(results_all)
            class(res) = "data.frame"
            attr(res, ".internal.selfref") = NULL
        }

requireNamespace will return FALSE if data.table is not installed, so people without the package will get the do.call rbind version. I've benchmarked various ways of zapping the row names, don't worry about that.

If, on the other hand, the user DOES have data.table, then we can call the rbindlist function. I add the class and attr lines so that our function will return a data.frame -- in other words, the returned data will be exactly identical and pass an identical() call whether you have the data.table package or not.

In terms of how we signal this to users, we modify the docs/vignette. Neal assures me we can add arbitrary key/value pairs to the DESCRIPTION file, so we could also add a key/value pair that has no ordinary meaning to let people know (i.e. FasterWith: data.table)

I'll post a full standalone profile script in Slack so you guys can play around with this

Summary:

Users with data.table get a speed boost (2X speed boost across full resample)
Users without data.table see no change
No additional dependencies
Output from the function will be exactly identical regardless of which version runs

@graemeblair suggested I post an issue to make clear my intent here and see if anyone has a strong objection, but I really think this is a solution that's great!

Example needed for inner N defined by function

Need to verify docs have an example for an inner N defined by a user-specified function, and also need to tweak the error message if the user passes a closure instead of a function call.

Allow *_level to specify ID_Labels

Currently, the only way to specify the level variable name is to set it as the parameter name from fabricate() - we should make these two equivalent:

fabricate(sleep, group=modify_level(m=1:2))
fabricate(sleep, modify_level(m=1:2, ID_label="group"))

providing cluster means/probs to draw_icc_* in fabricatr

draw_normal_icc and draw_binary_icc take x as the first arg, which is the mean or probability for each cluster (or a single number to set them equal). This makes x not workable in fabricate, where you could have x per cluster but it'd be repeated across individual observations. Can we allow x to have the same length as cluster and figure out the cluster means using the two columns, i.e. unique(data[, c(clusters, x)])?
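A base R sketch of the recovery step suggested above, assuming the data hold one row per individual with the cluster-level value repeated within each cluster. The column names clusters and x are illustrative, and nothing here calls the draw_*_icc functions.

```r
# One row per individual; x is a cluster-level mean repeated within cluster.
data <- data.frame(clusters = rep(1:3, each = 4))
cluster_means <- c(-1, 0, 2)
data$x <- cluster_means[data$clusters]

# Collapse back to one row per cluster.
per_cluster <- unique(data[, c("clusters", "x")])

# If x varied within a cluster, the same cluster would appear twice here,
# which is the case a real implementation would need to reject.
x_is_constant_within_cluster <- !anyDuplicated(per_cluster$clusters)
```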

Private arguments to level functions

This is just a placeholder issue to remind me that I currently have several private arguments in the suite of functions associated with levels, and I have documentation telling users not to use them in order to silence the R CMD check documentation warnings -- this should be replaced with CRAN-friendly public-facing documentation.

What to do when some variables are levels and others aren't

The following use case is not handled by fabricate_data:

population <- declare_population(
  N = 10^5,
  noise = rnorm(N),
  block = level(
    N = 5^4,
    block_effect = rnorm(N)
    )
)
pop <- population()

When not all options are levels then it goes to fabricate_data_single_level_, which throws the following error:

 Error in paste(c(ID_label, "ID"), collapse = "_") : 
  argument "ID_label" is missing, with no default

As an aside, I can't quite understand why it throws this error because fabricate_data_single_level_ handles cases in which ID_label is NULL...

Anyway, I see two solutions to the "not all variables are levels" problem (i.e., if(any(options == "level") & !all(options == "level"))), and I'm happy to implement either of them:

We throw an error when the user uses any levels but doesn't have all variables as levels
(My preference:) We treat all variables in the non-level list as belonging to the same level (since we have an N), with the default level-ID, and merge in the other levels as we would otherwise do if those variables had been declared inside a level() wrapper.

tests for recycle and N

recycle
N used from add_level(N=sample(N))

Vignette for common social science variables

This is stubbed, adding to remind myself to address. The stub page has a target date of Wednesday.

Quicker generation of correlated variables

@acoppock requested a quick helper function that would enable something like this:

example = fabricate(
  respondent = add_level(attitude_immigration = draw_ordered(...),
                         attitude_defence = draw_ordered(..., 
                                                         correlated=f(attitude_immigration, rho=0.7)))
)

I agree this would be super handy.

The two methodological issues I think we'd need to clear to make this possible:

Code introspection to confirm that the DGP specified in attitude_defence is exchangeable so that we can use the copula strategy to link the two resulting variables. In other words, if someone gives us attitude_defence = draw_ordered(rnorm(N, mean=how_conservative_am_i), correlated=f(attitude_immigration, rho=0.7)), the observation-level dependence of the variable on other variables makes the draw have a deterministic characteristic that is incompatible with the correlation specified.
Ensuring that for low Ns, at least, we generate a superpopulation before doing the correlated join so that we don't suffer from data density issues degrading the correlation.

Most of the copula code from the cross_level joins can be reused here.
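To make the copula idea concrete, here is a base R sketch using a Gaussian copula. This is an assumption about what "the copula strategy" would look like and is not the cross_level implementation; the category cut points are made up for illustration.

```r
set.seed(11)
N <- 5000
rho <- 0.7

# Correlated standard normals: z2 has correlation rho with z1.
z1 <- rnorm(N)
z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(N)

# Map to uniforms, then to ordered categories via assumed cumulative shares.
u1 <- pnorm(z1)
u2 <- pnorm(z2)
shares <- c(0, 0.2, 0.5, 0.8, 1)

attitude_immigration <- cut(u1, breaks = shares, labels = 1:4,
                            include.lowest = TRUE, ordered_result = TRUE)
attitude_defence <- cut(u2, breaks = shares, labels = 1:4,
                        include.lowest = TRUE, ordered_result = TRUE)

latent_cor <- cor(z1, z2)
observed_cor <- cor(as.integer(attitude_immigration),
                    as.integer(attitude_defence),
                    method = "spearman")
```

Note that rho is set on the latent scale; the correlation between the discretized variables is generally somewhat lower, so a helper would need to document which scale its rho argument refers to.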

Milestone: post-CRAN Future

Function for higher level modify to replicate group by with lower levels

Graeme wanted to see if we could have a function that allows it to be possible to modify a higher level and aggregate data from the lower level in an arbitrary function. Hypothetical working example:

data = fabricate(
  states = add_level(N = 100, ...),
  cities = add_level(N = 5, city_var_1 = ...),
  states = modify_level(state_var_n = f(city_var_1))
)

We can't do this because modify discards down to stuff not varying at the level you're modifying before processing.

Currently, the closest equivalent is to do:

data = fabricate(
  states = add_level(N = 100, ...),
  cities = add_level(N = 5, city_var_1 = ..., state_var_n = f_like_ave(city_var_1, states)),
)

Where f_like_ave is a function that will split the first argument by the value of the second argument and then calculate the statistic or whatever of interest on the splits.
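In base R, stats::ave() already behaves this way, so f_like_ave could be a thin wrapper around it. The sketch below uses a plain data frame rather than fabricate(), and the wrapper's signature is a guess at what is meant.

```r
# City-level data nested in three states.
cities <- data.frame(
  states = rep(1:3, each = 2),
  city_var_1 = c(1, 3, 10, 20, 5, 5)
)

# Hypothetical wrapper: split x by group, apply FUN to each split, and
# return a vector aligned with the original rows.
f_like_ave <- function(x, group, FUN = mean) {
  ave(x, group, FUN = FUN)
}

cities$state_var_n <- f_like_ave(cities$city_var_1, cities$states)
cities$state_var_n
# 2 2 15 15 5 5
```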

Rewrite of Nesting / fabricate to support cross-classified data

Right now, nesting is determined by position of levels in a single fabricate call, which means level() has to pull double duty as both a generation

Potential alternatives for expressing "students in schools":

Nesting via parentheses: schools = level(N_1, students=level(N_2, verbal=rnorm(), math=rnorm()))
slashes: schools = level(N_1) / student := level(N_2)
or the other way: students = level(N_2) %nested_in% schools := level(N_1)

Note the := in there, we need a way to set names inside an exp instead of only using arg names. We could still allow the first one to be a = or we could make the := required.

We could express cross classified designs similarly by defining * or perhaps %cross%

This would be a breaking change that would require us to fix many things, but it's probably better to figure it out now before we have a ton of users writing code against fabricatr.

Weird behavior when user supplies an unnamed level argument and fabricate interprets it as a data input

fabricate(add_level(N = 5, gdp = rnorm(N))) should fail because the level is not named. Instead, due to a variety of reasons (changes in how we handle invalid ID_label parameters), it works and produces a valid data frame -- because fabricate believes the add_level call is the data argument, evaluates it, gets a data frame back, and uses that as data, and then it's an empty fabricate() call with imported data.

On the other hand, this isn't necessarily incorrect behavior -- to the extent that someone ran an add level call outside fabricate (which we technically allow but do not suggest), they would get a data frame. And you can give fabricate a data frame as an anonymous first argument.

Attempting to do this with two levels will fail because the second anonymous level will parse an invalid N (and even if valid, would error because you provided data and N).

One fix for this is to move the data argument after the splat in the fabricate function definition, but this breaks the default compatibility with piping, so I won't do that.

This is an issue that I'm creating to remind myself to follow up on this. It might not actually be problematic behaviour, but it feels like it is. Notably, I need to check exactly where in the order of operations the check to ensure that a level call has a name occurs and why that doesn't fail here.

bootstrap speed

Curious how our resample function compares to the bootstrap package or this: https://github.com/topepo/rsample in terms of speed, do we know?

User provides data, makes single-level modification, does not specify ID label

One of the examples I use in the vignette is this:

simulated_quake_data = fabricate(data=quakes,
                                 fatalities = round(pmax(0, rnorm(N, mean=mag)) * 100),
                                 insurance_cost = fatalities * runif(N, 1000000, 2000000))

Basically, importing existing data (quakes is a dataset in base R), and adding new variables at that same level. If the user specifies an ID_label it will staple an ID label column with the name specified onto the dataset. If the user does NOT specify an ID_label, the call should:

(1) Staple an ID_label with a default name (ID)
(2) Assume the user does not require an ID_label and do nothing
(3) Error
(4) Warn
(5) Other

I was doing (2) until I closed that last issue and now don't want any way for an add_level call (which is how this is processed internally) to not take an ID_label -- since this requires me either to do (1) or to make a special exception to do (2), I figured I'd take everyone's temperature.

Please vote below.

duplicated var names across levels

This replaces the var if the names are duplicated:

fabricate(
    cities = level(N = 2, elevation = runif(n = N)),
    citizens = level(N = 2, elevation = runif(n = N)),
  )

This, because you're going back to a higher level, does not:

fabricate(
    cities = level(N = 2, elevation = runif(n = N)),
    citizens = level(N = 2, elevation = runif(n = N)),
    cities = level(N = 3, elevation = runif(n = N))
  )
##   cities elevation.x citizens elevation.y
## 1      1   0.5728534        1   0.9446753
## 2      1   0.9082078        2   0.9446753
## 3      2   0.2016819        3   0.6607978
## 4      2   0.8983897        4   0.6607978

Not sure what the right behavior is, but not this!

fabricate() hierarchical issues

I was using fabricatr today and I may be missing something, but it appears the way fabricate() is handling hierarchical data is not working quite right:

> fabricate(
  regions = level(N = 3, gdp = rnorm(N)),
  districts = level(
    N = 2,
    var1 = rnorm(2)
  )
)

  regions         gdp districts       var1
1       1 -0.05955841         1  0.2869096
2       1 -0.05955841         2 -2.9556124
3       2 -0.07986005         3  0.2869096
4       2 -0.07986005         4 -2.9556124
5       3 -1.62880298         5  0.2869096
6       3 -1.62880298         6 -2.9556124

This should have produced different values of var1 in each district (it should have run rnorm(2) three times, once for each region).

When a vector is provided that is too long, it used to use the first N values, but now it incorrectly expands the dataset. The following code should also have produced a dataset with 6 rows:

fabricate(
  regions = level(N = 3, gdp = rnorm(N)),
  districts = level(
    N = 2,
    var1 = month.abb
  )
)

   regions       gdp districts var1
1        1 -1.522291         1  Jan
2        1 -1.522291         2  Feb
3        2  1.498186         3  Mar
4        2  1.498186         4  Apr
5        3  0.330097         5  May
6        3  0.330097         6  Jun
7        1 -1.522291         1  Jul
8        1 -1.522291         2  Aug
9        2  1.498186         3  Sep
10       2  1.498186         4  Oct
11       3  0.330097         5  Nov
12       3  0.330097         6  Dec

@nfultz since Aaron isn't on tomorrow, would you be able to take a look?

Tolerance for correlations in cross_level tests

Right now, a very small number of builds fail because the observed correlations are outside the tolerance levels I set. I think this is mostly a stochastic property and not anything not working (restarting the build will have it complete correctly), but I should look at this a little while later to ensure the tolerance is set to minimize this possibility.

Message not informative when names are omitted

fabricate(N=2, 1:N)
## Error in fabricate(N = 2, 1:N): object 'N' not found

N should be found; issue is what to do with 1:N

FasterWith -> Enhances

For the next version, I suggest we add Enhances: to replace the lost FasterWith: info in DESCRIPTION.

Variables with fixed ICC

@graemeblair requested variable generation with fixed ICC.

Design decision: Single level data where variable lengths are not equal

Considering just single level data, suppose a user makes the call: fabricate(N = 4, test1=runif(6)). Then they get an error correctly noting that N and the data variable imply differing data lengths.

But if they enter an N that is an even divisor of length(variable), the ID numbers get recycled: fabricate(N = 3, test1=runif(6)) -- the resulting ID values are 1, 2, 3, 1, 2, 3.

Likewise, if the actual data are of even divisible length, the data get recycled:
fabricate(N = 6, test1=runif(6), test2=runif(3)) -- the resulting test2 variables get recycled.

This suggests two questions we should answer:

1. Should it be possible to provide multiple columns of data with differing lengths?

2a. If so, should it be possible to provide an N different than the maximum length of the generated variables?
2b. If not, it surely sounds as though we should error if the expected N is not the length of the data rows.

We might imagine that users are typically going to use n=N in an argument, but there are perhaps some cases where they might not, and so we should have either a clear error message or explicitly document that this behaviour is permitted and expected.
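For comparison, base R's data.frame() takes a middle position on the same question: shorter columns are recycled only when they fit a whole number of times, and anything else is an error. The sketch below shows only base R behaviour and makes no claim about what fabricate() does.

```r
# A length-3 column fits twice into 6 rows, so it is silently recycled.
ok <- data.frame(test1 = 1:6, test2 = 1:3)

# A length-4 column does not divide 6 evenly, so data.frame() errors.
bad <- tryCatch(
  data.frame(test1 = 1:6, test2 = 1:4),
  error = function(e) conditionMessage(e)
)

ok$test2
# 1 2 3 1 2 3
```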

Cleanup refactor from Neal's code review

Stuff from Neal's code review of the mega merge that we might reasonably consider outstanding business:

Use of lists inside environments -- I assume @nfultz would rather switch to environments inside environments?
~~The shelving code could probably be abstracted into a helper function~~
~~Split the level functions into separate files to keep file length manageable~~
As per @nfultz, the ordered_indices line in cross_classify_helpers.R could be refactored out of the lapply into a sweep
~~As per @nfultz, the index_maps[] line (484-487 on the current fabricate.R, about a page down in modify_level()) is an ugly hack.~~

Parallelization

We want to add parallelization to resample and fabricate. This will be a dummy issue to track progress.

Allow users to provide a parallelization backend
Allow users to ask us to provide a parallelization backend
Ensure sane seed / RNG behavior across parallelization
Benchmark parallelization to see under what cases this actually provides reasonable additional speed
Figure out testing options to reflect parallelization across operating systems

Searching EGAP for common pre-registration patterns we want to support

This is a catch-all issue for the results of a search I'm going to do of EGAP to read pre-registration plans and see if there are common paradigms in terms of variable or data structure that we might want to support and don't currently.

change describe_variable to describe_variables, accept >1

Return a data.frame summarizing the variable(s)

Unhappy with ID column stapling

MWE:

  df <- fabricate(N = 2, Y = 10)
  df2 <- fabricate(data = df, Y2 = Y + 1)
  ncol(df2)

In general we do want fabricatr to staple an extra ID column onto imported data; but in this specific case, it results in two identical columns. If the user explicitly sets an ID_label, this will of course not happen.

Proposal: Use a heuristic to check if any existing column on the imported data is exactly identical to the proposed ID column, and if so, don't add one on.

This is a mildly breaking change in that the data being output is not exactly identical to the data being output without this change, but I'd rather mildly break now than mildly break later and I'm reasonably confident no real users are impacted by this.

arg names in draw_binary() etc. with links

When we have a function like draw_binary with a link argument, right now we have renamed x to prob, even though if you change link to be logistic, for example, prob will have to be the latent variable. I thought we'd decided to include prob but also latent or something like that (and have it check consistency between the args).

Current circumstance of providing draw_binary(prob = runif(N), link = 'logistic') does not seem right.
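The mismatch can be seen in base R, assuming a logistic link means the argument is passed through the inverse logit (plogis) before the Bernoulli draw. That reading is an assumption for illustration; the code below does not call draw_binary.

```r
set.seed(9)
N <- 1000

# With a logistic link, the natural input is an unbounded latent variable.
latent <- rnorm(N)
prob <- plogis(latent)               # inverse logit, maps to (0, 1)
y <- rbinom(N, size = 1, prob = prob)

# Passing runif(N) as if it were a probability, under the same link, would
# squeeze every implied probability into roughly [0.5, 0.73].
implied <- plogis(runif(N))
range(implied)
```

Separate prob and latent arguments, with a consistency check against link, would make it harder to pass a probability where a latent value is expected.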

Master Progress Tracker for CRAN Complete Submission

This is a roadmap to CRAN submission.

Feature Additions

Testing and Test Coverage

Fabricate (dummy item pending rewrite)
Fabricatr does not depend on other packages, but we should ensure fabricatr is running in the test matrix for the other packages

Getting Started Vignette

Rewrite
Examples with new syntax
Reduce dependence on examples that don't make substantive sense
Front page example

CRAN Submission Prep

Look up any checks needed besides the default R build/check process
- WinBuilder
Confirm licenses, copyright statements, and other submission metadata
Submit

variables shorter than level's data

We've talked about this before, but right now if you create, for example, a time series 18 months long and want a "month name" var, you will get an error if you do this:

months = add_level(N = 18, month = month.abb)

because month.abb is 12 long. Can we think of a clever way to help users do this instead of having to type c(month.abb, month.abb[1:6]), like a helper that rep's them an appropriate number of times? (Need not be a new func, just an example of how to do it that is simple and non-programmery!)
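For reference, base R's rep() already recycles a vector to a target length through its length.out argument, so no new function may be needed. A short sketch (the fabricatr call in the comment is untested here and assumes N is visible inside the level, as in the other examples in this tracker):

```r
# rep() with length.out recycles the 12 month names out to 18 values.
month_names <- rep(month.abb, length.out = 18)

length(month_names)   # 18
month_names[11:14]    # "Nov" "Dec" "Jan" "Feb"

# Possible use inside a level (assumption, not run here):
# months = add_level(N = 18, month = rep(month.abb, length.out = N))
```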

Categorical data generation needs sane defaults

Currently categorical data generation expects a matrix of probabilities. So, for instance, if you have a population of size 10 that is generated to have a partisan affiliation (D, R, I with some probability on each), draw_discrete expects a 10x3 matrix of probabilities and hard errors if provided with the vector c(0.4, 0.3, 0.3).

The matrix model makes sense if someone is simulating a population with heterogeneity with respect to the probabilities between units -- the vignette example is like this. I would expect, though, that more users would want to make N i.i.d. draws from a categorical distribution.

If there's no strong objection, I will change the behavior to the following:

If the user supplies a matrix, process as intended
If the user supplies a vector, assume constant probabilities for each unit and potentially warn the user that this implicit assumption is being made (this should also allow users to generate degenerate categorical draws)
If the user supplies something that is neither of these things, then it is invalid data and we should error.

I'll close the issue and commit the code change on the weekend if no one has feedback. Thanks.
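The three cases above could be sketched roughly as follows in base R. `expand_probs` and `draw_categorical_sketch` are hypothetical names used only for illustration, and the optional warning for vector input is left out:

```r
# Hypothetical helper: normalise `prob` into an N x k matrix.
expand_probs <- function(prob, N) {
  if (is.matrix(prob)) {
    # Matrix input: one row of probabilities per unit, used as supplied.
    if (nrow(prob) != N) stop("`prob` must have one row per unit.")
    return(prob)
  }
  if (is.numeric(prob) && is.null(dim(prob))) {
    # Vector input: assume every unit shares the same probabilities.
    return(matrix(prob, nrow = N, ncol = length(prob), byrow = TRUE))
  }
  # Anything else is invalid input.
  stop("`prob` must be a numeric vector or a matrix.")
}

# Draw one category index per unit from that unit's row of probabilities.
draw_categorical_sketch <- function(prob, N) {
  p <- expand_probs(prob, N)
  apply(p, 1, function(row) sample.int(length(row), size = 1, prob = row))
}

set.seed(42)
party <- draw_categorical_sketch(c(0.4, 0.3, 0.3), N = 10)
c("D", "R", "I")[party]
```

A length-one vector such as `1` also passes through this path, which gives the degenerate categorical draw mentioned above.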

argument names singular/plural

We have a few arg names for the variable creation functions that are plural -- probs, means, etc. -- which I think are inconsistent with base functions such as rnorm/rbinom, i.e. we have

draw_binary(N = N, probs = 0.2)

whereas base R has:

rbinom(n, size, prob)

Suggest we make them all singular?

"Passthrough" resampling: Naming, Vignette, Documentation

We have talked on and off about resampling that passes through levels of the data transparently. An example of this is "For each state, I wanted to resample 10 cities" or "For each school, I want to resample 10 students" -- this is different than resampling N cities or N schools, of course.

I have this implemented, but we should probably think a little bit about naming. Right now I'm calling this ALL_UNITS. In caps because I grew up on C-family languages which use all-caps variable names for constants or flags; and ALL_UNITS because you quite literally want all of the units, unchanged. The name could be longer (ALL_UNITS_AT_LEVEL) and more descriptive. We might also try PASSTHROUGH (The "through" / "thru" corruption in American English might make this non-obvious). There's also a question of whether there should be case sensitivity at all.

Finally, we need tests (I will take care of these), a vignette entry (if we want to expose this functionality), and to update the documentation.

The current syntax is -- we should also talk about what we want to do here:

my_resample = resample(my_data, N=c(ALL_UNITS=TRUE, 5), ID_labels=c("state", "city"))

With this syntax, all of the implementation details are solved and pretty trivial (the TRUE will autoresolve to 1, the variable will set a names() on the vector unit, and we can read from there ignoring the number; all of the error handling doesn't need to be changed because the 1 is a positive numeric integer).
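The coercion and naming behaviour this relies on can be checked directly in base R. The flag-detection line is only a sketch of how the token might be read back, not the code in the branch:

```r
# Mixing a named logical with a number yields a named numeric vector.
N <- c(ALL_UNITS = TRUE, 5)

unname(N)   # TRUE has been coerced to 1, next to the 5
names(N)    # the unnamed element gets an empty-string name

# Sketch: detect which levels are flagged as passthrough by name.
passthrough <- names(N) == "ALL_UNITS"
passthrough
```

Note that names() returns NULL when no element of the vector is named, so a real implementation would need to treat that case as "no passthrough levels".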

Since N can be a variable containing a vector, a directly provided anonymous vector, a scalar, etc., using NSE to capture the ALL_UNITS token is not ideal -- it's difficult to extract each value of a vector for individual NSE, where some are evaluated and some are deparsed.

If there are no suggestions on syntax, then I think the only thing I need from you guys is to give some thought about the vignette / documentation.

The code will be in a branch called passthrough_resample shortly.

better error if variable doesn't exist at current level

Would be great if the following threw an error that the var doesn't exist at this level:

fabricate(L1 = level(N = 2, A = runif(N)),
          L2 = level(N = 1:2, B = runif(N)),
          L1 = level(N = 2, C = runif(N), D = A+B)
          )
## Error in overscope_eval_next(overscope, expr): object 'B' not found

Vignettes for working with other data generating packages

I just wrote a very short two-example vignette for working with wakefield. The two examples are: 1) Using wakefield when generating a variable within a fabricate call; 2) piping wakefield data frame (tibble) output into fabricate to then nest data.

Let's use this issue to track other packages we want to work with and what we think a MWE for using that package's functionality is.

coerce to numeric when possible

Was surprised to learn draw_ordered results in a factor. When the labels are numeric for this or categorical, can we coerce to numeric?
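A sketch of what the coercion could look like in base R. `coerce_if_numeric` is a hypothetical helper, not part of fabricatr. It converts through the level labels, because calling as.numeric() directly on a factor returns the internal integer codes rather than the labels:

```r
# Hypothetical helper: return a numeric vector when every label parses as
# a number, otherwise return the factor unchanged.
coerce_if_numeric <- function(f) {
  values <- suppressWarnings(as.numeric(levels(f)))
  if (anyNA(values)) {
    return(f)  # at least one non-numeric label: keep the factor
  }
  # Index the parsed labels by the factor's integer codes.
  values[as.integer(f)]
}

f <- factor(c("30", "10", "20"))
as.numeric(f)          # integer codes, not the labels
coerce_if_numeric(f)   # the labels as numbers

coerce_if_numeric(factor(c("low", "high")))  # stays a factor
```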