Chapter 4 Randomization and Design

After working together with our agency partners to translate insights from the social and behavioral sciences into policy recommendations following our process, our combined OES and agency teams assess these new ideas by observing differences and changes in real-world outcomes (usually measured using existing administrative data). Nearly always, we design a randomized controlled trial (an RCT) to ensure that the differences and changes we observe arise from the policy intervention and not from some other pre-existing difference or change. Here we show examples of how we create the random assignments that form the core of the different types of RCTs we use to build evidence about the effectiveness of a new policy.

4.1 Coin flipping randomization versus Urn-drawing randomization

Many discussions of RCTs begin by talking about the intervention being assigned to units (people, schools, offices, districts) “by the flip of a coin”. We rarely use this method directly even though it is a useful way to introduce the idea that RCTs guarantee fair access to the new policy.

The following code contrasts coin-flipping style randomization (where the number of units in each condition is not guaranteed) with drawing-from-an-urn style, or complete, randomization (where the number of units in each condition is guaranteed). We try to avoid the coin-flipping style of randomization because of this lack of control over the number of units in each condition — coin-flipping based experiments are still valid and tell us about the underlying counterfactuals, but they can have less statistical power.

Notice that the simple randomization implemented in the code below results in more observations in the treatment group (group 1) than in the control group (group 0). The complete randomization will always assign 5 units to the treatment, 5 to the control.

## Start with a small experiment with only 10 units
## The randomizr package (loaded here) provides simple_ra() and complete_ra()
library(randomizr)
n <- 10
## Set a seed for the pseudo-random number generator so that we always get the same results
set.seed(12345)
## Coin flipping does not guarantee half and half treated and control
## This next bit of code shows the base R version of coin flipping randomization
## trt_coinflip <- rbinom(10, size = 1, prob = .5)
trt_coinflip <- simple_ra(n)
## Drawing from an urn or shuffling cards, guarantees half treated and control
## trt_urn <- sample(rep(c(1, 0), n / 2))
trt_urn <- complete_ra(n)
table(trt_coinflip)
trt_coinflip
0 1 
3 7 
table(trt_urn)
trt_urn
0 1 
5 5 

4.2 Urn-Drawing or Complete Randomization into 2 or more groups

We tend to use the randomizr R package (Coppock 2022) for simple designs rather than the base R sample function because randomizr does some quality-control checks. Notice that we implement a check on our code below with the stopifnot command: the code will stop and issue an error if we did not actually assign 1/4 of the observations to the treatment condition. Here, we assign the units first to 2 arms with equal probability (Z2armEqual); then, to show how the code works, to 2 arms where units have only a 1/4 probability of receiving treatment, imagining a design with an expensive intervention (Z2armUnequalA and Z2armUnequalB); and then to a design with 4 arms, each with equal probability. (We often use \(Z\) to refer to the variable recording our intervention arms.)

N <- nrow(dat1)
## Two equal arms
dat1$Z2armEqual <- complete_ra(N)
## Two unequal arms: .25 chance of treatment
set.seed(12345)
dat1$Z2armUnequalA <- complete_ra(N, prob = .25)
stopifnot(sum(dat1$Z2armUnequalA) == N / 4)
dat1$Z2armUnequalB <- complete_ra(N, m = N / 4)
## Four equal arms
dat1$Z4arms <- complete_ra(N, m_each = rep(N / 4, 4))

table(Z2armEqual = dat1$Z2armEqual)
Z2armEqual
 0  1 
50 50 
table(Z2armUnequalA = dat1$Z2armUnequalA)
Z2armUnequalA
 0  1 
75 25 
table(Z2armUnequalB = dat1$Z2armUnequalB)
Z2armUnequalB
 0  1 
75 25 
table(Z4arms = dat1$Z4arms)
Z4arms
T1 T2 T3 T4 
25 25 25 25 

4.3 Factorial Designs

One can often test the effects of more than one intervention without losing much statistical power by randomly assigning more than one treatment independently of each other. The simplest design that we use for this purpose is the \(2 \times 2\) factorial design. For example, in the next table we see that we have assigned 50 observations to each arm of both of two interventions. Since the randomization of treatment1 is independent of treatment2, we can assess the effects of each treatment separately and pay little power penalty unless one of the treatments dramatically increases the variance of the outcome compared to the experiment with only one treatment assigned.

## Two equal arms, second cross treatment
dat1$Z2armEqual2 <- complete_ra(N)
table(treatment1 = dat1$Z2armEqual, treatment2 = dat1$Z2armEqual2)
          treatment2
treatment1  0  1
         0 22 28
         1 28 22

Although factorial designs allow us to test more than one intervention at the same time, they tend to provide little statistical power for testing hypotheses about the interaction of the two treatments. If we want to learn about how two different interventions work together, then the sample size requirements will be much larger than if we want to learn only about each treatment separately.4
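
As a rough sketch of this point, the code below fits a single linear model that includes both treatments and their interaction. The outcome ySim is simulated here purely so that the code runs; it is not part of our example data, and the coefficient values are not meaningful.

## Simulate an illustrative outcome (not part of the original data) so that the
## factorial analysis below can run
set.seed(2022)
dat1$ySim <- with(dat1, 0.2 * Z2armEqual + 0.2 * Z2armEqual2 + rnorm(N))
## A single linear model returns estimates for both treatments and their interaction
factorialFit <- lm(ySim ~ Z2armEqual * Z2armEqual2, data = dat1)
coef(summary(factorialFit))

The interaction coefficient is estimated much less precisely than a simple treated-versus-control difference that uses all of the units, which is one way to see why tests of interactions require larger samples.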

4.4 Block Random Assignment

Statistical power depends not only on the size of the experiment and the strength of the treatment effect, but also on the amount of noise, or non-treatment-related variability, in outcomes. Block-randomized designs can help reduce the noise in outcomes, while simultaneously minimizing estimation error – the amount that our particular experiment’s estimate differs from the truth.

In a block-randomized, or stratified, experiment, we randomly assign units to the policy interventions within groups.

Suppose we are evaluating whether dedicated navigators can increase the percentage of students living in public housing who complete the Free Application for Federal Student Aid (FAFSA). Our experiment will send navigators to two of four eligible buildings, two of which are large and two of which are small. Though we can never know the outcome in all buildings both with and without navigators (the “fundamental problem of causal inference”), if we could, we might have the data below.

In this case, the true average treatment effect for this sample is the average under treatment minus the average under control: \(\text{ATE} = 50-25=25\) percent more applications if navigators are deployed everywhere.

If we randomly allocate two buildings to treatment and two to control, we might treat the first two buildings and observe

yielding an estimate of the treatment effect of \(65-25 = 40\) percent more applications – an estimate larger than the true value of 25.

Or, we might observe

yielding an estimate of the treatment effect of \(45-30 = 15\) percent more applications – an estimate smaller than the true value of 25.

All of the possible equiprobable assignments and their estimated treatment effects are below.

These possible estimates have mean equal to the true value of 25, showing that the difference in means is an unbiased estimator. However, some of these estimates are far from the truth, and they have a lot of variability.

To design our experiment to best estimate the true value, and to do so with more statistical power, we can block the experiment. Here, this means restricting our randomization to the four possible assignments that balance the large and small buildings across the treatment groups.

With the blocked design, we will get an estimate no more than 10 percentage points from the truth. Further, our estimates will have less variability (an SD of 8.16 rather than 11.4), improving the power of our design.
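
The tables for this small example are not reproduced here, but the code below constructs one set of hypothetical potential outcomes that is consistent with the numbers quoted above (a true effect of 25, estimates of 40 and 15 for the two assignments shown, and standard deviations of roughly 11.4 and 8.16) and then enumerates every possible assignment. The particular values are only for illustration.

## Hypothetical potential outcomes for the four buildings (percent completing the
## FAFSA with and without navigators); chosen to match the numbers quoted in the text
buildings <- data.frame(
  id = 1:4,
  size = c("large", "large", "small", "small"),
  y0 = c(30, 20, 30, 20),
  y1 = c(70, 60, 40, 30)
)
## All 6 ways to assign navigators to 2 of the 4 buildings
assignments <- combn(4, 2)
estimates <- apply(assignments, 2, function(trt) {
  mean(buildings$y1[trt]) - mean(buildings$y0[-trt])
})
## Which assignments treat one large and one small building (the blocked design)?
balanced <- apply(assignments, 2, function(trt) {
  length(unique(buildings$size[trt])) == 2
})
mean(estimates) ## unbiased: equals the true effect of 25
sd(estimates) ## about 11.4 across all 6 possible assignments
sd(estimates[balanced]) ## about 8.16 across the blocked assignments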

For a more realistic example, suppose our experiment has two hospitals. We might randomly assign people to treatment and control within each hospital, assigning half of the people in hospital “A” to treatment and half to control, and likewise in hospital “B”.5 Below, we have 50 units in hospital “A” and 50 in hospital “B”.

dat1$blockID <- gl(n = 2, k = N / 2, labels = c("A", "B"))
with(dat1, table(blockID = dat1$blockID))
blockID
 A  B 
50 50 

We assign half of the units in each hospital to each treatment condition:

dat1$Z2armBlocked <- block_ra(blocks = dat1$blockID)
with(dat1, table(blockID, Z2armBlocked))
       Z2armBlocked
blockID  0  1
      A 25 25
      B 25 25

If, say, there were fewer treatment slots available in hospital “B” (or perhaps the intervention was more expensive in that block), we might assign treatment with different probabilities within each block. The code below assigns half of the hospital “A” units to treatment, but only a quarter of those from hospital “B”. Again, we check that this code worked by including a test, an informal version of a general best practice in writing code called “unit testing”. See the EGAP Guide to Workflow for more examples.

dat1$Z2armBlockedUneqProb <- block_ra(blocks = dat1$blockID, block_prob = c(.5, .25))
with(dat1, table(blockID, Z2armBlockedUneqProb))
       Z2armBlockedUneqProb
blockID  0  1
      A 25 25
      B 38 12
stopifnot(sum(dat1$Z2armBlockedUneqProb == 1 & dat1$blockID == "B") == ceiling(sum(dat1$blockID == "B") / 4) |
  sum(dat1$Z2armBlockedUneqProb == 1 & dat1$blockID == "B") == floor(sum(dat1$blockID == "B") / 4))

Our team tries to implement block-randomized assignment whenever possible in order to increase the statistical power of our experiments. We also often find it useful in cases where different administrative units are implementing the treatment or where we expect different groups of people to have different reactions to the treatment.

4.4.1 Using only a few covariates to create blocks

If we have background information on a few covariates, we can create blocks by hand using the kind of process demonstrated here:

## For example, make three groups from the cov2 variable
dat1$cov2cat <- with(dat1, cut(cov2, breaks = 3))
table(dat1$cov2cat, exclude = c())

(-7.32,-2.6]   (-2.6,2.1]   (2.1,6.82] 
          11           68           21 
with(dat1, tapply(cov2, cov2cat, summary))
$`(-7.32,-2.6]`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  -7.30   -5.03   -3.67   -4.14   -3.01   -2.77 

$`(-2.6,2.1]`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-2.3916 -0.7630 -0.0194 -0.0218  0.7592  2.0864 

$`(2.1,6.82]`
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2.13    2.46    2.95    3.24    3.68    6.80 
## And we can make blocks that are the same on two covariates
dat1$cov1bin <- as.numeric(dat1$cov1 > median(dat1$cov1))
dat1$blockV2 <- droplevels(with(dat1, interaction(cov1bin, cov2cat)))
table(dat1$blockV2, exclude = c())

0.(-7.32,-2.6] 1.(-7.32,-2.6]   0.(-2.6,2.1]   1.(-2.6,2.1]   0.(2.1,6.82]   1.(2.1,6.82] 
             7              4             38             30              5             16 
## And then assign within these blocks
set.seed(12345)
dat1$ZblockV2 <- block_ra(blocks = dat1$blockV2)
with(dat1, table(blockV2, ZblockV2, exclude = c()))
                ZblockV2
blockV2           0  1
  0.(-7.32,-2.6]  4  3
  1.(-7.32,-2.6]  2  2
  0.(-2.6,2.1]   19 19
  1.(-2.6,2.1]   15 15
  0.(2.1,6.82]    2  3
  1.(2.1,6.82]    8  8

4.4.2 Multivariate blocking using many covariates

If we have many background variables, we can increase precision by thinking about the problem of blocking as a problem of matching or creating sets of units which are as similar as possible in terms of the collection of those covariates (Moore 2012; Moore and Schnakenberg 2016). Here we show two approaches.

Creating pairs:

## using the blockTools package
library(blockTools)
mvblocks <- block(dat1, id.vars = "id", block.vars = c("cov1", "cov2"), algorithm = "optimal")
dat1$blocksV3 <- createBlockIDs(mvblocks, data = dat1, id.var = "id")
dat1$ZblockV3 <- block_ra(blocks = dat1$blocksV3)
## just show the first ten pairs
with(dat1, table(blocksV3, ZblockV3, exclude = c()))[1:10, ]
        ZblockV3
blocksV3 0 1
      1  1 1
      2  1 1
      3  1 1
      4  1 1
      5  1 1
      6  1 1
      7  1 1
      8  1 1
      9  1 1
      10 1 1

Creating larger blocks:

## using the distances and quickblock packages
library(distances)
library(quickblock)
distmat <- distances(dat1,
  dist_variables = c("cov1", "cov2"),
  id_variable = "id",
  normalize = "mahalanobize"
)
distmat[1:5, 1:5]
       1      2      3      4      5
1 0.0000 1.4766 0.3313 0.2835 0.6736
2 1.4766 0.0000 1.6919 1.5891 0.9681
3 0.3313 1.6919 0.0000 0.5477 0.7729
4 0.2835 1.5891 0.5477 0.0000 0.9089
5 0.6736 0.9681 0.7729 0.9089 0.0000
quantile(as.vector(distmat), seq(0, 1, .1))
      0%      10%      20%      30%      40%      50%      60%      70%      80%      90%     100% 
-3.13279 -1.18748 -0.79019 -0.38455 -0.14051  0.08898  0.26701  0.53818  0.89242  1.34321  2.91687 
## The caliper argument is supposed to prevent the inclusion of ill-matched
## points.
mvbigblock <- quickblock(distmat, size_constraint = 6L) # ,caliper=2.5)
## Look for missing points
table(mvbigblock, exclude = c())
mvbigblock
 0  1  2  3  4  5  6  7  8  9 10 11 
 6  6  6  6  8 11 14  7  7 10  7 12 
dat1$blocksV4 <- mvbigblock
dat1$ZblockV4 <- block_ra(blocks = dat1$blocksV4)
with(dat1, table(blocksV4, ZblockV3, exclude = c()))[1:10, ]
        ZblockV3
blocksV4 0 1
       0 3 3
       1 2 4
       2 4 2
       3 3 3
       4 5 3
       5 6 5
       6 6 8
       7 4 3
       8 4 3
       9 4 6

Here we produce some description of the blocks: the within-block spread of the continuous covariate (the difference between the largest and smallest absolute value of cov2) and the within-block mean of cov1.

library(dplyr) # for group_by() and summarize()
blockingDescEval <- dat1 %>%
  group_by(blocksV4) %>%
  summarize(
    cov2diff = max(abs(cov2)) - min(abs(cov2)),
    cov1 = mean(cov1)
  )
blockingDescEval
# A tibble: 12 × 3
   blocksV4   cov2diff    cov1
   <qb_blckn>    <dbl>   <dbl>
 1  0             3.62  0.555 
 2  1             2.92  4.21  
 3  2             1.91 -3.89  
 4  3             3.64 -1.38  
 5  4             2.78  3.60  
 6  5             1.28  1.51  
 7  6             3.44  0.577 
 8  7             1.91 -0.0277
 9  8             1.46  1.56  
10  9             2.61 -2.41  
11 10             2.68 -1.35  
12 11             1.16 -0.770 

4.5 Cluster random assignment

We often implement a new policy intervention at the level of some group of people — like a doctor’s practice, or a building, or some other administrative unit. Even though we have 100 units in our example data, imagine that they are grouped into 10 buildings and that the policy intervention operates at the building level. Below, we assign 5 buildings (50 units) to treatment and 5 buildings (50 units) to control. Everyone in a given building has the same treatment assignment.

ndat1 <- nrow(dat1)
## Make an indicator of cluster membership: 10 buildings of 10 units each
dat1$buildingID <- rep(1:(ndat1 / 10), length.out = ndat1)
set.seed(12345)
dat1$Zcluster <- cluster_ra(clusters = dat1$buildingID)
with(dat1, table(Zcluster, buildingID))
        buildingID
Zcluster  1  2  3  4  5  6  7  8  9 10
       0 10  0 10 10  0  0  0  0 10 10
       1  0 10  0  0 10 10 10 10  0  0

Cluster randomized designs raise new questions about estimation and testing and thus statistical power. We describe our approaches to analysis and power analysis of cluster randomized designs in the section on the analysis of Cluster Randomized Trials.

4.6 Other designs

Our team has also designed stepped-wedge style designs; saturation designs aimed at discovering whether the effects of the experimental intervention are communicated across people (via some spillover or network mechanism); and designs where we try to isolate certain experimental units (like buildings) from one another so that we can focus on learning about the effects of the intervention itself rather than the effects of communication about the intervention across people. In later versions of this document we will include simple descriptions and code for these other, less common designs.

4.7 Randomization assessment

If we have covariates, we can assess the performance of the randomization procedure by testing the hypothesis that the treatment-vs-control differences, or differences across treatment arms, in covariates are consistent with our claimed mechanism of randomization. In the absence of covariates, we assess whether the number of units assigned to each arm (conditional on other design features, such as blocking or stratification) are consistent with the claimed random assignment.

Here is an example with a binary treatment, a continuous outcome, and two covariates. In this case we use the \(d^2\) omnibus balance test implemented in the balanceTest() function in the RItools package (see Hansen and Bowers 2008; Bowers, Fredrickson, and Hansen 2016).

The “overall” \(p\)-value below shows us that we have little evidence against the idea that treatment (\(Z\)) was assigned at random — at least in terms of the relationships between the two covariates and the treatment assignment. The test statistics here are mean differences. The overall or omnibus test is the key here: if we had many covariates, it would be easy to discover one or a few covariates with means that differ detectably from zero just by chance. That is, even in a well-operating experiment, we would expect some baseline imbalances (as measured using randomization-based tests) – roughly 5 significant differences in every 100 tests at the .05 level. Thus the value of the omnibus test is that we will not be misled by such chance false positives.

library(RItools) # provides balanceTest()
randAssessV1 <- balanceTest(Z ~ cov1 + cov2, data = dat1, report = "all")
randAssessV1$overall[, ]
chisquare        df   p.value 
   1.9978    2.0000    0.3683 

Here is an example of randomization assessment with block-randomized assignment:

randAssessV3 <- balanceTest(ZblockV3 ~ cov1 + cov2 + strata(blocksV3), data = dat1, report = "all")
randAssessV3$overall[, ]
         chisquare df p.value
blocksV3    2.8342  2  0.2424
--          0.1365  2  0.9340

One can also assess the randomization given both block and cluster random assignment using a formula like so: Z ~ cov1 + cov2 + strata(blockID) + cluster(clusterID).
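
As a sketch, using hypothetical variable names rather than our example data (in dat1 the buildings are not nested within the blocks, so the code is shown commented out): assume a data frame dat2 with clusters (clusterID) nested within blocks (blockID) plus the covariates.

## Hypothetical sketch: assignment and assessment with both blocks and clusters
## dat2$Z <- block_and_cluster_ra(blocks = dat2$blockID, clusters = dat2$clusterID)
## randAssessV4 <- balanceTest(Z ~ cov1 + cov2 + strata(blockID) + cluster(clusterID), data = dat2)
## randAssessV4$overall[, ]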

This approach, using the RItools package, works well for experiments that are not overly small. In a very small experiment, say, an experiment with 5 clusters assigned to treatment and 5 clusters assigned to control, we would do the same test, but would not use the \(\chi^2\) distribution. Instead, we would do a permutation-based test.
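
Here is a minimal sketch of one way to do such a permutation-based check. The tiny dataset below is simulated purely for illustration: we re-draw the cluster-level assignment many times and compare the observed covariate difference to that re-randomization distribution.

## Simulated example: 10 clusters of 4 units each, with one baseline covariate
set.seed(54321)
smalldat <- data.frame(
  clusterID = rep(1:10, each = 4),
  cov1 = rnorm(40)
)
smalldat$Z <- cluster_ra(clusters = smalldat$clusterID)
## Observed treated-vs-control difference in the covariate mean
obs_diff <- with(smalldat, mean(cov1[Z == 1]) - mean(cov1[Z == 0]))
## Re-randomize the clusters many times and recompute the difference
perm_diffs <- replicate(1000, {
  Zstar <- cluster_ra(clusters = smalldat$clusterID)
  with(smalldat, mean(cov1[Zstar == 1]) - mean(cov1[Zstar == 0]))
})
## Two-sided permutation p-value for the balance check
mean(abs(perm_diffs) >= abs(obs_diff))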

We do not use \(F\)-tests or Likelihood Ratio tests to assess randomization. See Hansen and Bowers (2008) for some evidence that a test based on randomization-inference (like the \(d^2\) test developed in that article) maintains false positive rates better than the sampling- or likelihood-justified \(F\) and likelihood ratio tests.

4.7.1 What to do with “failed” randomization assessments?

A \(p\)-value less than .05 on a test of the randomization mechanism ought to trigger extra scrutiny of how the randomization was conducted and how the data were recorded by the agency. For example, we might contact our agency partner to learn more details about how the random numbers themselves were generated, or we may ask for the SQL or SAS code or other code that might have been used to randomize. In most cases, we will learn that randomization worked well but that our understanding of the design and the data was incorrect. In other cases, we will learn that the randomization occurred as we had initially understood; in such cases, we tend to treat the rejection of the null in our randomization assessment as a false positive from our testing procedure (we would expect about 5 such errors in every 100 experiments).

If the rejection of the null appears to be driven by one or more particularly substantively relevant covariates (say, the variable age looks very imbalanced between treated and control groups in a health study), then we will present both the unadjusted results and results that adjust for that covariate via stratification and/or covariance adjustment, as we describe later in this document. A chance rejection of the null hypothesis that the experiment was randomized as planned should not cause us to use the adjusted estimate as our primary estimate of the causal effect: after all, the adjusted estimate will be biased whereas the unadjusted estimate will not be biased under chance departures from the null. But large differences between the two estimates can inform the qualitative task of substantive interpretation of the study: looking at different effects by age group, for example, might tell us something about the particular context of the study and, in turn, help us think about what we have learned.
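
For example, if age looked imbalanced, a minimal sketch of the two estimates we would report might look like the following (shown commented out because y, age, and analysisdat are placeholders rather than variables in our example data):

## Placeholder sketch: report both unadjusted and covariate-adjusted estimates
## unadjusted <- lm(y ~ Z, data = analysisdat)
## adjusted <- lm(y ~ Z + age, data = analysisdat) # covariance adjustment for age
## coef(unadjusted)["Z"]
## coef(adjusted)["Z"]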

4.7.2 How to minimize large chance imbalances in randomization?

Our approach to block randomization helps us avoid these problems. We can also restrict our randomization in ways that are more flexible than requiring blocks but that should still minimize chance imbalances. Among methods of restricted randomization, we tend not to use re-randomization, mainly because we can usually block-randomize instead and thus minimize the complexity of later analysis. However, since we pursue a randomization-based approach to the analysis of experiments, we can, in principle, estimate causal effects and test hypotheses about causal effects after such modes of randomization as well.
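
For completeness, here is a minimal sketch of one simple form of restricted randomization (re-randomization), shown only for illustration since, as noted, we rarely use it: keep drawing complete random assignments until the imbalance on a covariate falls below a chosen threshold.

## Re-randomization sketch (illustration only; the 0.1 SD threshold is arbitrary)
set.seed(2468)
repeat {
  Zcand <- complete_ra(N)
  imbalance <- abs(mean(dat1$cov1[Zcand == 1]) - mean(dat1$cov1[Zcand == 0]))
  if (imbalance < 0.1 * sd(dat1$cov1)) break
}
dat1$Zrerand <- Zcand

Analysis after re-randomization has to account for the restricted assignment, which is one reason we prefer blocking when it is feasible.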

References

Bowers, Jake, Mark Fredrickson, and Ben Hansen. 2016. RItools: Randomization Inference Tools.
Coppock, Alexander. 2022. Randomizr: Easy-to-Use Tools for Common Forms of Random Assignment and Sampling. https://CRAN.R-project.org/package=randomizr.
Hansen, Ben B., and Jake Bowers. 2008. “Covariate Balance in Simple, Stratified and Clustered Comparative Studies.” Statistical Science 23 (2): 219–36.
Moore, Ryan T. 2012. “Multivariate Continuous Blocking to Improve Political Science Experiments.” Political Analysis 20 (4): 460–79.
Moore, Ryan T., and Keith Schnakenberg. 2016. blockTools: Blocking, Assignment, and Diagnosing Interference in Randomized Experiments. http://www.ryantmoore.org/html/software.blockTools.html.
Small, D. S., K. G. Volpp, and P. R. Rosenbaum. 2011. “Structured Testing of \(2 \times 2\) Factorial Effects: An Analytic Plan Requiring Fewer Observations.” The American Statistician 65 (1): 11–15.

  4. But see Small, Volpp, and Rosenbaum (2011) for a way to increase the power of tests for such questions. This approach is not yet part of our standard practice.↩︎

  5. We use “block” rather than “strata” throughout this document to refer to groups of experimental units that are fixed before randomization, where group membership cannot be changed by the experiment itself.↩︎