In randomized experiments, treatment and control groups should be roughly the same—balanced—in their distributions of pretreatment variables. But how nearly so? Can descriptive comparisons meaningfully be paired with significance tests? If so, should there be several such tests, one for each pretreatment variable, or should there be a single, omnibus test? Could such a test be engineered to give easily computed p-values that are reliable in samples of moderate size, or would simulation be needed for reliable calibration? What new concerns are introduced by random assignment of clusters? Which tests of balance would be optimal?
¶
To address these questions, Fisher’s randomization inference is applied to the question of balance. Its application suggests the reversal of published conclusions about two studies, one clinical and the other a field experiment in political participation.