Suppose an experimenter __\( E \)__ wishes to compare the effectiveness of two treatments, __\( A \)__ and __\( B \)__, on a somewhat vaguely defined population. As individuals arrive, __\( E \)__ decides whether they are in the population, and if he decides that they are, he administers __\( A \)__ or __\( B \)__ and notes the result, until __\( n \)__ __\( A \)__’s and __\( n \)__ __\( B \)__’s have been administered. Plainly, if __\( E \)__ is aware, before deciding whether an individual is in the population, which treatment is to be administered next, he may, not necessarily deliberately, introduce a bias into the experiment. This bias we call selection bias.

We propose to investigate the extent to which a statistician __\( S \)__, by determining the order in which treatments are administered, and not revealing to __\( E \)__ which treatment comes next until after the individual who is to receive it has been selected, can control this selection bias. Thus a design __\( d \)__ is a distribution over the set __\( T \)__ of the __\( \binom{2n}{n} \)__ sequences of length __\( 2n \)__ containing __\( n \)__ __\( A \)__’s and __\( n \)__ __\( B \)__’s.

We shall measure the bias of a design by the maximum expected number of correct guesses which an experimenter can achieve, knowing __\( d \)__, attempting to guess the successive elements of a sequence __\( t \in T \)__ selected by __\( d \)__, and being told after each guess whether or not it is correct. The distribution of the number __\( G \)__ of correct guesses depends both on __\( d \)__ and on the prediction method __\( p \)__ used by the experimenter. We shall consider particularly two designs, the truncated binomial, in which the successive treatments are selected independently with probability __\( 1/2 \)__ each until __\( n \)__ treatments of one kind have occurred, and the sampling design, in which all __\( \binom{2n}{n} \)__ sequences are equally likely.

We shall consider particularly two prediction methods: the convergent prediction, which predicts the treatment that has hitherto occurred less often, and the divergent prediction, which predicts the treatment that has hitherto occurred more often, except that after __\( n \)__ treatments of one kind have been administered, the divergent prediction agrees with the convergent prediction that the other treatment will follow. When both treatments have occurred equally often, either method predicts __\( A \)__ or __\( B \)__ by tossing a fair coin, independently for each case of equality.
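The two designs and the two prediction methods can be made concrete with a short simulation. The following Python sketch is only an illustration: the function names (`truncated_binomial`, `sampling_design`, `predict`, `count_correct_guesses`) and the string labels `"convergent"` and `"divergent"` are assumptions introduced here, not notation from the paper.

```python
import random


def truncated_binomial(n, rng):
    """Fair coin tosses until one treatment has occurred n times; the rest is forced."""
    seq, a, b = [], 0, 0
    while a < n and b < n:
        if rng.random() < 0.5:
            seq.append("A")
            a += 1
        else:
            seq.append("B")
            b += 1
    # One treatment is exhausted, so the remaining slots take the other one.
    seq.extend(["B"] * (n - b) if a == n else ["A"] * (n - a))
    return seq


def sampling_design(n, rng):
    """All sequences with n A's and n B's equally likely (a uniform shuffle)."""
    seq = ["A"] * n + ["B"] * n
    rng.shuffle(seq)
    return seq


def predict(a, b, n, rule, rng):
    """Guess the next treatment from the counts a (A's) and b (B's) seen so far."""
    if a == b:
        return rng.choice("AB")  # tie: fair coin, independent each time
    if a == n:
        return "B"  # both rules agree once one treatment is exhausted
    if b == n:
        return "A"
    fewer, more = ("A", "B") if a < b else ("B", "A")
    return fewer if rule == "convergent" else more


def count_correct_guesses(seq, rule, rng):
    """Number G of correct guesses along one sequence for the given rule."""
    n = len(seq) // 2
    a = b = correct = 0
    for treatment in seq:
        correct += predict(a, b, n, rule, rng) == treatment
        if treatment == "A":
            a += 1
        else:
            b += 1
    return correct


if __name__ == "__main__":
    rng = random.Random(1)
    seq = truncated_binomial(5, rng)
    print("".join(seq), count_correct_guesses(seq, "convergent", rng))
```

Averaging `count_correct_guesses` over many simulated sequences gives a Monte Carlo estimate of the expected number of correct guesses for a chosen design and rule.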

We find that among all designs, the truncated binomial minimizes the maximum expected number of correct guesses. For this design, the expected number of correct guesses is independent of the prediction method, and is
__\[ n + n \binom{2n}{n} \big/ 2^{2n} \sim n + \Bigl(\frac{n}{\pi}\Bigr)^{1/2} .\]__
With the truncated binomial design, the variance in the number of correct guesses is largest for the divergent prediction, for which it is
__\[ \frac{3n}{2} - D - \frac{D^2}{4} \sim \frac{(3\pi - 2)n}{2\pi} - 2\Bigl(\frac{n}{\pi}\Bigr)^{1/2}, \]__
where __\( D = n \binom{2n}{n} \big/ 2^{2n - 1} \)__, and is smallest for the convergent prediction, for which it is
__\[ \frac{n}{2} - \frac{D^2}{4} \sim \frac{(\pi - 2)n}{2\pi} .\]__
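The exact expressions for the truncated binomial design can be checked for small __\( n \)__ by direct computation. The sketch below tracks the joint distribution of the treatment counts and the number of correct guesses in exact rational arithmetic, then returns the mean and variance. The function name `guess_moments` and the rule labels are assumptions made for this illustration.

```python
from collections import defaultdict
from fractions import Fraction
from math import comb


def guess_moments(n, rule):
    """Exact mean and variance of correct guesses under the truncated binomial design."""
    half = Fraction(1, 2)
    layer = {(0, 0, 0): Fraction(1)}  # (A count, B count, correct guesses) -> probability
    for _ in range(2 * n):
        nxt = defaultdict(Fraction)
        for (a, b, g), p in layer.items():
            # Design: probability that the next treatment is A.
            if a == n:
                next_a = Fraction(0)
            elif b == n:
                next_a = Fraction(1)
            else:
                next_a = half
            # Predictor: probability that the guess is A.
            if a == b:
                guess_a = half
            elif a == n:
                guess_a = Fraction(0)
            elif b == n:
                guess_a = Fraction(1)
            else:
                guess_a = Fraction(int((a < b) == (rule == "convergent")))
            for is_a, p_treat in ((True, next_a), (False, 1 - next_a)):
                if p_treat == 0:
                    continue
                p_hit = guess_a if is_a else 1 - guess_a
                na, nb = (a + 1, b) if is_a else (a, b + 1)
                if p_hit != 0:
                    nxt[(na, nb, g + 1)] += p * p_treat * p_hit
                if p_hit != 1:
                    nxt[(na, nb, g)] += p * p_treat * (1 - p_hit)
        layer = nxt
    mean = sum(p * g for (_, _, g), p in layer.items())
    second = sum(p * g * g for (_, _, g), p in layer.items())
    return mean, second - mean * mean


if __name__ == "__main__":
    n = 4
    D = Fraction(n * comb(2 * n, n), 2 ** (2 * n - 1))
    print(guess_moments(n, "convergent"), (n + D / 2, Fraction(n, 2) - D * D / 4))
    print(guess_moments(n, "divergent"), (n + D / 2, Fraction(3 * n, 2) - D - D * D / 4))
```

For each small __\( n \)__ tried, the computed mean is the same for both rules, and the two variances agree with the expressions in __\( D \)__ given above.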
For the sampling design, convergent prediction maximizes the expected number of correct guesses; this maximum is
__\[ n + 2^{2n - 1} \!\big/ \binom{2n}{n} - \frac{1}{2} \sim n + \Bigl(\frac{\pi n}{4}\Bigr)^{1/2}. \]__
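The value for the sampling design can likewise be verified exactly. In the sketch below (the function name `expected_correct_sampling` is an assumption for this illustration), the convergent prediction is modelled as guessing the treatment with more units remaining, with a fair coin at ties. The expectation is obtained by summing, over every reachable pair of counts, the probability of reaching that pair times the probability of a correct guess there.

```python
from fractions import Fraction
from math import comb


def expected_correct_sampling(n):
    """Exact expected correct guesses for convergent prediction under the sampling design."""
    total = comb(2 * n, n)
    expected = Fraction(0)
    for a in range(n + 1):
        for b in range(n + 1):
            if a + b == 2 * n:
                continue  # sequence finished, nothing left to guess
            # Probability that the first a + b treatments contain exactly a A's.
            reach = Fraction(comb(a + b, a) * comb(2 * n - a - b, n - a), total)
            remaining_a, remaining_b = n - a, n - b
            # Guessing the treatment with more units left succeeds with this probability;
            # at a tie it equals 1/2, matching the fair coin.
            hit = Fraction(max(remaining_a, remaining_b), remaining_a + remaining_b)
            expected += reach * hit
    return expected


if __name__ == "__main__":
    for n in (1, 2, 3, 10):
        closed = n + Fraction(2 ** (2 * n - 1), comb(2 * n, n)) - Fraction(1, 2)
        print(n, expected_correct_sampling(n), closed)
```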
Finally we note that, if treatments are selected independently at random, bias of the kind we discuss disappears, but the treatment numbers can no longer be preassigned. Three such designs are considered: the fixed total design, in which the total number of treatments is a fixed number __\( s \)__, the fixed factor design, in which we continue until
__\[ \frac{1}{X} + \frac{1}{Y} \leq \frac{2}{n} ,\]__
where __\( X \)__ is the number of __\( A \)__ treatments and __\( Y \)__ is the number of __\( B \)__ treatments administered, and the fixed minimum design, in which we continue until __\( \min (X, Y) = n \)__. For the fixed total design, we find that, for __\( s = 2n + 4 \)__,
__\[ \mathrm{Pr}\Bigl(\frac{1}{X} + \frac{1}{Y} \leq \frac{2}{n}\Bigr) \sim 0.955 \]__
for large __\( n \)__; at the expense of 4 extra observations, we have a bias-free design whose variance factor will with probability __\( 0.955 \)__ be smaller than that in which treatment numbers are preassigned. For the fixed factor design, the additional number of observations required to achieve the given precision has for large __\( n \)__ the distribution of the square of a normal deviate. For the fixed minimum design, in which we guarantee precision for the estimated effect of each treatment, the expected number of additional observations is roughly __\( 1.13\, n^{1/2} \)__.
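Two of these figures can be reproduced numerically for finite __\( n \)__. The sketch below assumes each treatment is chosen independently with probability __\( 1/2 \)__; the function names `prob_fixed_total_meets_target` and `expected_extra_fixed_minimum` are illustrative only. The first sums binomial probabilities over the values of __\( X \)__ satisfying __\( 1/X + 1/Y \leq 2/n \)__ with __\( X + Y = 2n + 4 \)__. The second conditions on the count __\( k \)__ of the lagging treatment at the moment the other first reaches __\( n \)__, after which an expected __\( 2(n - k) \)__ further observations are needed.

```python
from fractions import Fraction
from math import comb, sqrt


def prob_fixed_total_meets_target(n, extra=4):
    """Pr(1/X + 1/Y <= 2/n) when X ~ Binomial(s, 1/2), Y = s - X and s = 2n + extra."""
    s = 2 * n + extra
    # For x, y > 0 the condition 1/x + 1/y <= 2/n is equivalent to n(x + y) <= 2xy.
    favourable = sum(comb(s, x) for x in range(1, s) if n * s <= 2 * x * (s - x))
    return Fraction(favourable, 2 ** s)


def expected_extra_fixed_minimum(n):
    """Expected observations beyond 2n needed until min(X, Y) = n."""
    # The leading treatment reaches n at trial n + k (k < n units of the other)
    # with probability 2 * C(n + k - 1, k) / 2^(n + k). The lagging treatment then
    # needs an expected 2(n - k) further trials, so the total exceeds 2n by n - k.
    numerator = sum(
        2 * comb(n + k - 1, k) * (n - k) * 2 ** (n - 1 - k) for k in range(n)
    )
    return Fraction(numerator, 2 ** (2 * n - 1))


if __name__ == "__main__":
    n = 500
    print(float(prob_fixed_total_meets_target(n)))
    print(float(expected_extra_fixed_minimum(n)) / sqrt(n))
```

For moderately large __\( n \)__ the first value lies close to __\( 0.955 \)__ and the second ratio close to __\( 1.13 \)__, with small deviations at finite __\( n \)__ because __\( X \)__ is discrete.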