Giter VIP home page Giter VIP logo

randomizr's Introduction

DeclareDesign: Declare and Diagnose Research Designs

CRAN status CRAN RStudio mirror downloads Build status Code coverage Replications

DeclareDesign is a system for describing research designs in code and simulating them in order to understand their properties. Because DeclareDesign employs a consistent grammar of designs, you can focus on the intellectually challenging part – designing good research studies – without having to code up simulations from scratch. For more, see declaredesign.org.

Installation

To install the latest stable release of DeclareDesign, please ensure that you are running version 3.5 or later of R and run the following code:

install.packages("DeclareDesign")

Usage

Designs are declared by adding together design elements. Here’s a minimal example that describes a 100 unit randomized controlled trial with a binary outcome. Half the units are assigned to treatment and the remainder to control. The true value of the average treatment effect is 0.05 and it will be estimated with the difference-in-means estimator. The diagnosis shows that the study is unbiased but underpowered.

library(DeclareDesign)

design <-
  declare_model(
    N = 100, 
    potential_outcomes(Y ~ rbinom(N, size = 1, prob = 0.5 + 0.05 * Z))
  ) +
  declare_inquiry(ATE = 0.05) +
  declare_assignment(Z = complete_ra(N, m = 50)) +
  declare_measurement(Y = reveal_outcomes(Y ~ Z)) + 
  declare_estimator(Y ~ Z, .method = lm_robust, inquiry = "ATE")

diagnosands <-
  declare_diagnosands(bias = mean(estimate - estimand),
                      power = mean(p.value <= 0.05))

diagnosis <- diagnose_design(design, diagnosands = diagnosands)
diagnosis
Inquiry Estimator Outcome Bias SE(Bias) Power SE(Power) n sims
ATE estimator Y -0.004 0.004 0.076 0.01 500

Companion software

The core DeclareDesign package relies on four companion packages, each of which is useful in its own right.

  1. randomizr: Easy to use tools for common forms of random assignment and sampling.
  2. fabricatr: Imagine your data before you collect it.
  3. estimatr: Fast estimators for social scientists.
  4. DesignLibrary: Templates to quickly adopt and adapt common research designs.

Learning DeclareDesign

  1. To get started, have a look at this vignette on the idea behind DeclareDesign, which covers the main functionality of the software.

  2. For an explanation of the philosophy behind DeclareDesign, examples in code and words of declaring and diagnosing common research designs in the social sciences, as well as examples of how to incorporate DeclareDesign into your own research, see the book Research Design in the Social Sciences (Blair, Coppock, Humphreys, 2023).

Package structure

Each of these declare_*() functions returns a function.

  1. declare_model() (describes dimensions and distributions over the variables, including potential outcomes)
  2. declare_inquiry() (takes variables in the model and calculates estimand value)
  3. declare_sampling() (takes a population and selects a sample)
  4. declare_assignment() (takes a population or sample and adds treatment assignments)
  5. declare_measurement() (takes data and adds measured values)
  6. declare_estimator() (takes data produced by sampling, assignment, and measurement and returns estimates linked to inquiries)
  7. declare_test() (takes data produced by sampling, assignment, and measurement and returns the result of a test)

To declare a design, connect the components of your design with the + operator.

Once you have declared your design, there are four core post-design-declaration commands used to modify or diagnose your design:

  1. diagnose_design() (takes a design and returns simulations and diagnosis)
  2. draw_data() (takes a design and returns a single draw of the data)
  3. draw_estimates() (takes a design and returns a single simulation of estimates)
  4. draw_estimands() (takes a design and returns a single simulation of estimands)

A few other features:

  1. A designer is a function that takes parameters (e.g., N) and returns a design. expand_design() is a function of a designer and parameters that return a design.
  2. You can change the diagnosands with declare_diagnosands().

This project was generously supported by a grant from the Laura and John Arnold Foundation and seed funding from EGAP.

randomizr's People

Contributors

aaronrudkin avatar acoppock avatar ck37 avatar fkohrt avatar graemeblair avatar lukesonnet avatar nfultz avatar nick-rivera avatar yihui avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

randomizr's Issues

permutation matrix bug

something's wrong!

library(randomizr)
# What's going on here?
declaration <- declare_ra(N = 8, m_each = c(2, 2, 2, 2)) 
perms <- obtain_permutation_matrix(declaration) 
> table(perms[, 1])

T2 T3 T4 
 2  2  4 ```

Simple random assignment within blocks not allowed with declare_ra

As far as I can tell, conduct_ra() and declare_ra() do nothing with the simple = T/F flag if the design is blocked and nothing about simple vs. complete is stored in a declare object for blocked designs, other than in the original_call value which isn't accessed by conduct_ra().

Unless you get a bad seed and the simple randomization returns complete randomization, this is a MWE:

library(randomizr)

## Not blocked, works as expected, information about simple stored in 'ra_type'
d <- declare_ra(N = 1000)
d$ra_type # returns 'complete'
table(conduct_ra(d))

d <- declare_ra(N = 1000, simple = T)
d$ra_type # returns 'simple'
table(conduct_ra(d))

## Block, does not work as expected
block <- rep(1:2, times = c(500, 500))

d_block_simp <- declare_ra(N = 1000, block_var = block, simple = T)
d_block_simp$ra_type
d_block_comp <- declare_ra(N = 1000, block_var = block, simple = F)
d_block_comp$ra_type
# information only stored in 'original_call' which is never used elsewhere in randomizr

y_simp <- conduct_ra(d_block_simp)
table(y_simp) # rerun it, always 0.5/0.5
y_comp <- conduct_ra(d_block_comp)
table(y_comp)

Is this the preferred behavior or should you allow SRS within blocks? How should that information be stored?

add support for custom ra declarations

need to add three args to declare_ra

  1. custom_ra (a function that returns a vector of random assignments
  2. permutation_matrix
  3. sims

3 ways to declare a custom ra

  1. custom_ra + sims. sims defautls to 1000. we estimate empriical probs of assignment by simulation.
  2. custom_ra + permutation matrix. No simulation needed; we estimate empirical probs of assignmetn from the permutation matrix
  3. permutation matrix only. The "random assignment function" is selecting one of the columns of the permutation matrix at random.

add block_prob

should be analogous to block_m

block_prob <- c(.1, .2)

means block 1 has a .1 probabbility of assignment and block 2 has a .2 probability of assignment

increase speed of block_ra

Hey @acoppock,

I was playing around with the block_ra() function to improve speed for large datasets and wanted to share some preliminary thoughts. For large datasets, it looks like using lapply() is much faster than using a for loop to store the assignment by block. Here's the ratio of time using lapply versus for loop for a few different dataset sizes:

  • 5 blocks, 2000 observations -- 1 : 1.02.
  • 100 blocks, 5000 observations -- 1 : 0.99.
  • 5000 blocks, 1 million observations -- 1 : 18.

Looking at the code profile, it looks like the for loop version is less efficient because it has to allocate a new object every time it creates this inside vector: assign[block_var == blocks[i]].

Have you considered using lapply before / did it get nixed for other reasons? I've really only focused on one use-case (block assignment based on the prob_each param, which is our most common use). As you pointed out in your email, one drawback of this approach is that requires independent assignment by block. In my experience, this assumption has always been true, but I'm not sure if there are other use-cases.

Note: all tests below run on an early 2015 Macbook Air, 8 GB memory, 1.6 Ghz processor, no multicore set up.

# original function for block randomization in randomizr 
# this handles only one use-case, which is prob_each.
# note the use of for-loop for randomization assignment.
func_orig <- function(block_var, 
                      blocks = sort(unique(block_var)), 
                      prob_each = c(0.5, 0.25, 0.25),
                      N_per_block = as.numeric(table(block_var)),
                      condition_names = c('a', 'b', 'c')){

  # create assignment
  assign <- rep(NA, length(block_var))

  # assign using for-loop
  for (i in 1:length(blocks)) {
    assign[block_var == blocks[i]] <-
      randomizr::complete_ra(
        N = N_per_block[i],
        prob_each = prob_each,
        condition_names = condition_names
      )
  }

  assign <- condition_names[assign]
  factor(assign, levels = condition_names)

}
# new function for block randomization in randomizr 
# this handles only one use-case, which is prob_each.
# note the use of lapply for randomization assignment.
func_new <- function(block_var, 
                     blocks = sort(unique(block_var)), 
                     prob_each = c(0.5, 0.25, 0.25),
                     N_per_block = as.numeric(table(block_var)),
                     condition_names = c('a', 'b', 'c')){

  # set order
  o1 <- order(block_var)

  # create assignment
  assign1 <- lapply(1:length(blocks), function(x){
      randomizr::complete_ra(
        N = N_per_block[x],
        prob_each = prob_each,
        condition_names = condition_names
      )
    })

  # return list in the original order
  # http://stackoverflow.com/questions/15464793/restoring-original-order-of-a-vector-matrix-in-r
  unlist(assign1)[order(o1)]

}

#helper function to generate a random set of blocks
gen_block <- function(n_blks, n) sample(1:n_blks, size = n, replace = T)

Speed tests --- small dataset

set.seed(12345)
system.time(
  z_new <- replicate(1000, func_new(gen_block(5, 2000)))
  )
##    user  system elapsed 
##   2.970   0.108   3.344
set.seed(12345)
system.time(
  z_orig <- replicate(1000, func_orig(gen_block(5, 2000)))
  )
##    user  system elapsed 
##   3.119   0.086   3.421
testthat::expect_equal(z_new, z_orig)

Speed tests --- medium dataset

set.seed(12345)
system.time(
  z_new <- replicate(1000, func_new(gen_block(100, 5000)))
  )
##    user  system elapsed 
##  25.176   0.399  27.391
set.seed(12345)
system.time(
  z_orig <- replicate(1000, func_orig(gen_block(100, 5000)))
  )
##    user  system elapsed 
##  25.707   0.491  27.079
testthat::expect_equal(z_new, z_orig)

Speed test -- large dataset

set.seed(12345)
system.time(
  z_new <- replicate(20, func_new(gen_block(5000, 1000000)))
)
##    user  system elapsed 
##  37.920   1.844  45.688
set.seed(12345)
system.time(
  z_orig <- replicate(20, func_orig(gen_block(5000, 1000000)))
)
##    user  system elapsed 
## 435.903 230.442 821.574
testthat::expect_equal(z_new, z_orig)

Suggested feature: nearest-neighbor matching for continuous variables

(Apologies if there's a way to do this that I missed; maybe this is something for the blockTools package, but I cannot understand how to 'exact' block on some variables and nearest-match on others in that package).

Suppose I want to block-randomize on Gender and Race, and then within these categories 'alternate' assignment by Age, so that within any small window of Age we will have an even balance of Control and Treatment(s). Is this do-able through your (great!) package? I could not find it in the docs.

fixed number of units per stratum block?

Proposed syntax for sampling

strata_var <- rep(c("A", "B","C"), times = c(50, 100, 200))
S <- strata_rs(strata_var = strata_var, n = 10)
table(strata_var, S)

10 As, 10 bs, 10 cs.

for random assignment

block_var <- rep(c("A", "B","C"), times = c(50, 100, 200))
Z <- block_ra(block_var = block_var, m = 10)
table(strata_var, Z)

treat 10 of 50 As, 10 of 100 Bs, etc.

Installing randomizr

Hey--having trouble installing the package v 0.9: I just updated R and restarted everything and all the magic that usually works. Any ideas?

`> install.packages("randomizr", repos = "http://r.declaredesign.org")
Installing package into ‘C:/Users/kevin/Documents/R/win-library/3.4’
(as ‘lib’ is unspecified)
Package which is only available in source form, and may need compilation of C/C++/Fortran:
‘randomizr’
Do you want to attempt to install these from sources?
y/n: y
installing the source package ‘randomizr’

trying URL 'http://r.declaredesign.org/src/contrib/randomizr_0.9.0.tar.gz'
Content type 'application/gzip' length 76479 bytes (74 KB)
downloaded 74 KB

  • installing source package 'randomizr' ...
    ** libs

*** arch - i386
c:/Rtools/mingw_32/bin/gcc -I"C:/PROGRA1/R/R-341.3/include" -DNDEBUG -O3 -Wall -std=gnu99 -mtune=generic -c onload.c -o onload.o
c:/Rtools/mingw_32/bin/gcc: not found
make: *** [onload.o] Error 127
Warning: running command 'make -f "C:/PROGRA1/R/R-341.3/etc/i386/Makeconf" -f "C:/PROGRA1/R/R-341.3/share/make/winshlib.mk" SHLIB="randomizr.dll" OBJECTS="onload.o restrictedparts.o"' had status 2
ERROR: compilation failed for package 'randomizr'

  • removing 'C:/Users/kevin/Documents/R/win-library/3.4/randomizr'
    In R CMD INSTALL
    Warning in install.packages :
    running command '"C:/PROGRA1/R/R-341.3/bin/x64/R" CMD INSTALL -l "C:\Users\kevin\Documents\R\win-library\3.4" C:\Users\kevin\AppData\Local\Temp\Rtmp2ZNnqq/downloaded_packages/randomizr_0.9.0.tar.gz' had status 1
    Warning in install.packages :
    installation of package ‘randomizr’ had non-zero exit status`

factors still weird

str(complete_ra(N = N, num_arms = 2))
str(complete_ra(N = N, num_arms = 3))

vsample type mismatch

from

https://www.stats.ox.ac.uk/pub/bdr/LTO/randomizr.out

* installing *source* package ‘randomizr’ ...
** package ‘randomizr’ successfully unpacked and MD5 sums checked
** using staged installation
** libs
make[2]: Entering directory '/data/gannet/ripley/R/packages/tests-LTO/randomizr/src'
gcc -I"/data/gannet/ripley/R/LTO9/include" -DNDEBUG   -I/usr/local/include   -fpic  -g -O2 -Wall -pedantic -mtune=native -flto -c onload.c -o onload.o
gcc -I"/data/gannet/ripley/R/LTO9/include" -DNDEBUG   -I/usr/local/include   -fpic  -g -O2 -Wall -pedantic -mtune=native -flto -c restrictedparts.c -o restrictedparts.o
gcc -I"/data/gannet/ripley/R/LTO9/include" -DNDEBUG   -I/usr/local/include   -fpic  -g -O2 -Wall -pedantic -mtune=native -flto -c vsample.c -o vsample.o
gcc -shared -g -O2 -Wall -pedantic -mtune=native -flto -fpic -L/usr/local/lib64 -o randomizr.so onload.o restrictedparts.o vsample.o
onload.c:9:13: warning: type of ‘randomizr_vsample’ does not match original declaration [-Wlto-type-mismatch]
    9 | extern SEXP randomizr_vsample(SEXP, SEXP);
      |             ^
vsample.c:5:6: note: type mismatch in parameter 2
    5 | SEXP randomizr_vsample(SEXP pmat) {
      |      ^
vsample.c:5:6: note: type ‘void’ should match type ‘struct SEXPREC *’
vsample.c:5:6: note: ‘randomizr_vsample’ was previously declared here
make[2]: Leaving directory '/data/gannet/ripley/R/packages/tests-LTO/randomizr/src'
installing to /data/gannet/ripley/R/packages/tests-LTO/Libs/randomizr-lib/00LOCK-randomizr/00new/randomizr/libs
** R
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
** checking absolute paths in shared objects and dynamic libraries
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (randomizr)
Time 0:13.99, 4.95 + 0.53

Clarify ordering of block probs in documentation

Help is not too clear on order in which to specify block probs. Same for block_prob_each.

Appears to depend on sort order. eg:

> b <- sample(rep(c("a", "d", "_b", 2), 100))
> z <- block_ra(b, block_prob = (1:4)/4)
> table(b, z)
    z
b      0   1
  _b  75  25
  2   50  50
  a   25  75
  d    0 100 

specify probs and other args that are length 1 or N for full dataset, not just of length clusters / strata / blocks

When you specify prob and other arguments referring to the prob of sampling clusters / prob within strata or blocks, you are required to provide a vector that is shorter than the dataset. This makes things complex in a tidy framework and, for related reasons, with DeclareDesign. Can it just take the probs / etc. for the full dataset, with repeated values? Then you don't have to worry about ordering the numbers in the same way.

If you want to keep your functionality, it could accept of length 1, n_clusters, and N.

unquoted variable names in prob argument

For probability-proportional-to-size sampling (and other designs), you may want to use an unquoted variable name in the probability to set the inclusion probability as a function of that variable. This doesn't seem to work in the same way that block names etc. work.

my_pop <- declare_population(N = 1000, smp_prob = runif(N, 0, 1))
my_smp <- declare_sampling(prob = smp_prob)

dat <- my_pop()
> my_smp(dat)
Error in sample.int(length(x), size, replace, prob) : 
  incorrect number of probabilities

Seems like the NSE should work in the same way taking the blocks variable out of data as taking the variable here smp_prob out of data:

decl <- eval_tidy(quo(declare_rs(N=!!nrow(data), !!!options)), data)
  
  data <- fabricate(data,
    !!S := draw_rs(!!decl),
    !!samp := obtain_inclusion_probabilities(!!decl),
    ID_label = NA
  )

Any ideas why this isn't working?

two arm trial condition names

cookie_id <- paste0("cookie_id_", str_pad(1:72, 2, pad = "0"))
cookie_type <- rep(c("sugar", "chip"), c(36, 36))
batch <- block_ra(block_var = cookie_type, block_m = c(18, 18), condition_names = c("batch_1", "batch_2"))

load balancing

"load balancing" has long been a goal for randomizr. Various solutions have been attempted and even implemented, but were finally abandoned for some reason or other. This issue is for version 2!

some background

block_ra and variants conduct complete_ra within blocks. The various arguments to block_ra allow users either to give the same arguments to complete_ra for each block or different arguments in each block. We implement block_ra with an mapply, which calls complete_ra separately for each block with right arguments passed at the right time.

The issue

Imagine that we have 2 blocks of 3. If we do complete_ra within each, we will assign either 1 or 2 to treatment in each block. There are three ways this could come out:

  1. a total of 4 units will be treated (two in each block),
  2. a total of 3 (one in one block, 2 in the other)
  3. a total of 2 will be treated (1 in each block).

The goal of load balancing is to ensure that only scenario 2 (a total of three units treated) ever occurs.

Troubles

We think that the two-arm trial with load balancing is feasible. The user sets something like a "total treated" argument, then we allocated the number to be treated across blocks in a way that preserves the nominal probabilities of assignment. I.e., the randomization is weighted in such a way that the true probabilities of assignment for each unit are exactly what they would be for the otherwise equivalent block_ra call.

Extending load_balancing to the multi-arm is very hard (and may actually be logically infeasible, i.e., it may be impossible to achieve the nominal probabilties of assignment while maintaining the load balance.

Here's an email that I think explains the issue with multi-arm trials:

If we take this out of the blocking scenario, let’s just first imagine using prob_each in complete random assignment

15 units,

prob_each = c(.1, .2, .7)

we know we should assign

floor(15 * prob_each) = c(1, 3, 10), i.e. 1 unit to T1, 3 to T2 and 10 to T3.

That leaves us with 1 unit left to assign, which we should assign to each of the three treatments with prob_each probabilities, i.e.

10% of the time to T1
20% of the time to T2
70% of the time to T3

This procedure works great, even with 0 probabilities.

Now imagine that we have two blocks of 15 units.

We could:

A) do complete_ra within blocks independently
B) assign the remaining 1 unit within each block with EQUAL probability to each condition, but implement “load balancing” so that it’s never the case that both remaining units gets assigned (for example) to T1.

We might think that there would be an option C) in which we respect the prob_each within each block, and also load balance, but that doesn’t work: if in the first block, we assign the remaining unit to T3 with 70% probability, that means that the remaining unit in the second block is assigned to T3 with 30% probability, i.e. exactly 1 - p of the correct probability.  This problem is compounded if there is a zero probability in some condition.

Solution B) has a problem too with zero probabilities.  if the remainders are assigned with equal probability, it can happen that, within a block that is supposed to have no units assigned to T3, 1/3rd of the time there WILL be such an assignment.

For this reason, I think I’ve come down to preferring solution A.  In that case, if users want to implement some version of load balancing across blocks, they’ll have to specify a custom function (and figure out the analytic probabilities of assignment themselves…)

Way forward

I think that we should probably implement a new function balanced_block_ra or similar that handles the two arm case.

There may also be special cases of the multi arm trials that we could accomodate

Attachments

I'm attaching a pdf of a write-up of the problem that Macartan circulated a while ago.

Systematic load balancing block randomization for arbitrary individual assignment probabilities.pdf

Changes in treatment assignment from 0.12 to 0.16

See the following MWE:

install.packages("randomizr") # currently on version 0.16

library(randomizr)
prs <- rep(1/7, times = 7)
bls <- paste0(rep(letters[1:6], each = 100), rep(1:3, times = c(40, 10, 50)))
set.seed(42)
t1 <- block_ra(blocks = bls, prob_each = prs)

devtools::install_version("randomizr", "0.12.0")

library(randomizr)
set.seed(42)
t2 <- block_ra(blocks = bls, prob_each = prs)
table(t1==t2)

The result is that there are a few cases where the randomization is not equivalent. Indeed I believe this is due to a change in load balancing (Issue #35). This isn't ideal for those of us who use randomizr in old randomizations and rely on rerunning those scripts to ensure the randomization protocol doesn't change as data upstream from the randomization changes. Also, would like to know (for a friend) if the old version was "wrong" because then I've got a bigger problem.

add back block_m

block_m should be a vector of length n, analogous to m in complete_ra

complete_ra():block_ra()
::
m:block_m
::
m_each::block_m_each
::
prob_each:block_prob_each

prob and prob_each work the same in both complete_ra and block_ra

odd behavior for stata format block variables




blocks <- structure(c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                      0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 
                      1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
                      1, 1, 1, 1, 1, 1, 1), label = "0 = texas; 1 = arkansas", format.stata = "%8.0g")



declaration <- declare_ra(
  N = 66,
  # bug??
  blocks = blocks,
  block_m = c(15, 18)
)

# thinks it's complete
declaration



blocks2 <- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
             0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 
             1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
             1, 1, 1, 1, 1, 1, 1)

declaration <- declare_ra(
  N = 66,
  blocks = blocks2,
  block_m = c(15, 18)
)
# gets it right
declaration

Unhelpful error if condition names are duplicated

Maybe most users aren't stupid enough to make this mistake, but if you repeat an element in the condition_names vector, then you just get lots of warnings without an actual description of the error.

It's not a big deal with small datasets because you eventually just see the outcome, but if you're randomizing a large dataset with lots of strata, then you just get lots and lots of the same warning.

Might be a good idea to add a stopifnot(length(unique(condition_names)) == length(condition_names)) or similar?

See example below

> block_var <- rep(c("A", "B","C"), times = c(50, 100, 200))
> Z <- block_ra(block_var = block_var, prob_each = c(1/3, 1/3, 1/3), condition_names = c('C', 'T1', 'T1'))
Warning messages:
1: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels,  :
  duplicated levels in factors are deprecated
2: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels,  :
  duplicated levels in factors are deprecated
3: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels,  :
  duplicated levels in factors are deprecated
4: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels,  :
  duplicated levels in factors are deprecated
> table(block_var, Z)
         Z
block_var   C  T1  T1
        A  17  33   0
        B  34  66   0
        C  67 133   0
Warning message:
In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels,  :
  duplicated levels in factors are deprecated

In my case, I was randomizing a dataset with a few thousand strata, so my screen just overloaded with warning messages.

complete_ra sometimes returns more than N randomisation values

Reproducible example:

> set.seed(1); randomizr::complete_ra(1, prob_each = c(0.5, 0.5), conditions = 1:2)
[1] 2 1
> set.seed(4); randomizr::complete_ra(1, prob_each = c(0.5, 0.5), conditions = 1:2)
[1] 1
> sessionInfo()
R version 4.1.0 (2021-05-18)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C               LC_TIME=en_AU.UTF-8       
 [4] LC_COLLATE=en_AU.UTF-8     LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] randomizr_0.20.0     pbmcapply_1.5.0      rstan_2.21.2         StanHeaders_2.21.0-7 ggplot2_3.3.3       
[6] data.table_1.14.0   

loaded via a namespace (and not attached):
 [1] tidyselect_1.1.1    xfun_0.23           purrr_0.3.4         V8_3.4.2            colorspace_2.0-1   
 [6] vctrs_0.3.8         generics_0.1.0      stats4_4.1.0        loo_2.4.1           utf8_1.2.1         
[11] rlang_0.4.11        pkgbuild_1.2.0      pillar_1.6.1        glue_1.4.2          withr_2.4.2        
[16] DBI_1.1.1           matrixStats_0.58.0  lifecycle_1.0.0     munsell_0.5.0       gtable_0.3.0       
[21] stringfish_0.15.1   codetools_0.2-18    labeling_0.4.2      inline_0.3.19       RApiSerialize_0.1.0
[26] knitr_1.33          callr_3.7.0         ps_1.6.0            curl_4.3.1          fansi_0.5.0        
[31] Rcpp_1.0.6          scales_1.1.1        RcppParallel_5.1.4  jsonlite_1.7.2      farver_2.1.0       
[36] gridExtra_2.3       digest_0.6.27       processx_3.5.2      dplyr_1.0.6         qs_0.24.1          
[41] grid_4.1.0          cli_2.5.0           tools_4.1.0         magrittr_2.0.1      tibble_3.1.2       
[46] crayon_1.4.1        pkgconfig_2.0.3     ellipsis_0.3.2      prettyunits_1.1.1   assertthat_0.2.1   
[51] rstudioapi_0.13     R6_2.5.0            compiler_4.1.0   

Register native routines

File 'randomizr/libs/x64/randomizr.dll':
  Found no calls to: 'R_registerRoutines', 'R_useDynamicSymbols'
It is good practice to register native routines and to disable symbol
search.

Drop partition dependency

Right now, randomizr pulls in the partitions package for one function. We may be able to reimplement it.

Convo from slack:

So I'm taking at quick look at your use of partitions, it looks like you only import restrictedparts - thoughts below

  • That function is backed by a C implementation - it is probably pretty fast - https://github.com/cran/partitions/blob/master/src/partitions.c#L222
  • He's calling through the legacy .C interface instead of the better .Call interface - this can be slow for large inputs
  • References /* algorithm on p232 of Andrews */ - do you know what that is?
  • None of the C is using lin/lapack - everything is procedural scalar arithmetic and it's speed really will depend on how optimized the compiler can get it- there's a decent chance that a pure R implementation that used matrix math properly would outperform it - an RcppArmadillo implementation has an even better chance.
  • partitions is GPL, randomizr is MIT, so you probably shouldn't copy-paste the functions into randomizr.
    GitHub
    cran/partitions
    ❗ This is a read-only mirror of the CRAN R package repository. partitions — Additive Partitions of Integers

Found it • G. E. Andrews. "The theory of partitions", Cambridge University Press, 1998`

Sampling terminology is incorrect

I am a sampling statistician. Your terminology is not compatible with any of the existing sampling books.

What you call complete random sampling is hard to name, but the term you came up with is not used in sampling work. What you call simple random sampling is called Poisson sampling. The definition of simple random sampling is that all samples of size n from the finite population of size N have equal probabilities of selection. Equal probabilities of selection of individual units is not a sign of SRS; when I taught sampling to grad students, an exercise I used to have in the sampling section of the survey statistics class was to came up with multiple designs that are equal probability but not SRS (1 for C -- you need to know at least one to understand what SRS is vs. what it isn't, 3 designs for B, 5 different designs for A).

I understand that you are deep in development with a lot of parts that have already been documented, exposed, put on the website, published, etc. Still... this is fundamentals.

A standard introductory reference on sampling is Lohr (3rd edition 2022). The next level is probably Kish (1965). The graduate level textbooks are Valliant Dever Kreuter (2018) accompanied by library(PracTools), and Tille 2006 accompanied by library(sampling) written by his student.

A technical problem that you have is that the unequal probability sampling you use is not a proper probability proportional to size (PPS) sample: you can simulate and see that probabilities of selection are not maintained. The problem is with the underlying base::sample(); proper procedures are implemented in library(sampling) by Alina Matei and Yves Tille, e.g. sampling::UPmaxentropy().

# example from Tille (2006)
set.seed(25)
# sum of probabilities == expected sample size == 3L
uneq_p = c(0.07,0.17,0.41,0.61,0.83,0.91)
sim_uneq_p_base <- as.data.frame(matrix(rep(0,6*20000),ncol=6))
for (k in 1:nrow(sim_uneq_p_base)) {
  this <- sample.int(n=length(uneq_p),size=sum(uneq_p),prob=uneq_p)
  sim_uneq_p_base[k,this] <- 1
}
colSums(sim_uneq_p_base)/nrow(sim_uneq_p_base)
uneq_p
# done right
sim_uneq_p_done_right <- as.data.frame(matrix(rep(0,6*20000),ncol=6))
for (k in 1:nrow(sim_uneq_p_done_right)) {
   sim_uneq_p_done_right[k,] <- sampling::UPmaxentropy(uneq_p)
}
colSums(sim_uneq_p_done_right)/nrow(sim_uneq_p_done_right)
uneq_p

Assigning block probs after sampling

Dealing with a situation where blocks may be empty (this could happen with some sampling, though in current application because blocks are based on XY combinations some of which may be empty).

Can we link block probs to blocks using factors? Or perhaps have block_probs calculated from a column of unit level probs.

#OK
blocks = factor(rep(4:1, each = 2))
block_ra(blocks, block_prob = c(1:4)/4)

# Not OK
blocks = blocks[3:8]
block_ra(blocks, block_prob = c(1:4)/4)

Allow varying probabilities of assignment in simple_ra

I think it should be like this:

if prob is of length 1, we assume constant probs
if prob is of length N, we assume varying probs.

Mutatis mutandis for prob_each.

The probabilities matrix will also need to be updated.

Non-informative error from obtain_permutation_matrix with some blocked designs

When I try to get permutation matrices for blocked designs I keep getting the following error:

> treatment_perms <- obtain_permutation_matrix(declaration)
Error in 1:ncol(mat_2) : argument of length 0

Below are some examples that fail or succeed as labeled:

library(randomizr)

## Following all fail
declaration <- declare_ra(N = 10,
                          block_var = rep(1:2, each = 5),
                          block_m = c(3, 3))
treatment_perms <- obtain_permutation_matrix(declaration)
## Following all fail
declaration <- declare_ra(N = 10,
                          block_var = rep(1:2, each = 5))
treatment_perms <- obtain_permutation_matrix(declaration)

## even sizes
declaration <- declare_ra(N = 8,
                          block_var = rep(1:2, c(4, 4)))
treatment_perms <- obtain_permutation_matrix(declaration)

## three groups
declaration <- declare_ra(N = 9,
                          block_var = rep(1:3, c(3, 3, 3)))
treatment_perms <- obtain_permutation_matrix(declaration)

## following works!
declaration <- declare_ra(N = 9,
                          block_var = rep(1:3, c(4, 4, 2)),
                          block_m = c(2, 2, 1))
treatment_perms <- obtain_permutation_matrix(declaration)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.