On Estimating the Size and Confidence of a Statistical Audit
Javed A. Aslam 
Abstract: We consider the problem of statistical sampling for auditing elections, and we develop a remarkably simple and easilycalculated upper bound for the sample size necessary for determining with probability at least c if a given set of n objects contains fewer than b “bad” objects. While the size of the optimal sample drawn without replacement can be determined with a computer program, our goal is to derive a highly accurate and simple formula that can be used by election officials equipped with only a handheld calculator. We actually develop several formulae, but the one we recommend for use in practice is:As a practical matter, this formula is essentially exact: we prove that it is never too small, and empirical testing for many representative values of n ≤ 10,000, and b ≤ n/2, and c ≤ 0.99 never finds it more than one too large. Theoretically, we show that for all n and b this formula never exceeds the optimal sample size by more than 3 for c≤ 0.9975, and by more than (−ln(1−c))/2 for general c.
U_{3}(n,b,c)=
⌈

/

\n −
(b−1) 2 \

/· /

\1  (1−c)^{1/b} \

/⌉

=
⌈

/

\n −
(b−1) 2 \

/· /

\1 − exp(ln(1−c)/b) \

/⌉


Given the increased popularity of voting systems with voterverified paper ballots, there is increased need for effective audits to confirm that those paper ballots agree with their electronic counterparts (which might be the result of scanning those ballots). Since auditing is expensive (it is typically done by hand), it is important to minimize the expense by choosing a sample size for the audit that is as small as possible, while guaranteeing a desired level of statistical confidence. This paper addresses the question of determining the appropriate sample size and develops nearly exact approximations that can be evaluated easily on a handheld calculator. We believe that these formulae will turn out to be useful in practice.
Given a universe of n objects, how large a sample should be tested to determine with high confidence if fewer than a given number b of them are bad? (In the voting context, these objects are typically voting precincts.)
As noted, our goal is to develop approximations that are both accurate and simple enough to be usable, if not by hand, then at least with the use of only a calculator, with no computer needed. (Your calculator must be a “scientific” one, though, so that you can compute arbitrary powers.^{1})
We first present a simple approximate “rule of thumb” (the “Rule of Three”) for estimating how big such a statistical sample should be, when using sampling with replacement.
This “Rule of Three” is simple and known, although perhaps not particularly wellknown. Jovanovic and Levy [15] discuss the Rule of Three, its derivation, and its application to clinical studies. See also van Belle [24].
We then address the question of sampling without replacement, which is the desired procedure for an election audit, of course, and provide improved formulae for sample size when sampling without replacement.
This paper justifies and improves approximations originally developed by Rivest [20], who attempted to correct for the bias in the Rule of Three due to sampling with replacement instead of sampling without replacement, by only sampling (now without replacement) the expected number of distinct elements that the Rule of Three sample (with replacement) would have contained. While that may be an interesting approach, the current paper derives its approximation formulae more directly, and provides rigorous upper and lower bounds on their approximation error.
Finally, in Section 5, we address three related questions: (1) determining the confidence level for a given audit size and level of fraud one wishes to detect, (2) determining the minimum amount of fraud one can detect for a given audit size with a given confidence level, and (3) auditing with constraints, such as the requirement that at least one precinct in each county be audited.
Saltman [22, Appendix B] was the first to study sample size (for sampling without replacement) in the context of voting; the basic formulae he develops for the optimal sample size are the ones we are trying to approximate here.
(There is much earlier relevant work on sampling theory, particularly the notion of “lot acceptance sampling” in statistical quality control. For example, the DodgeRomig Sampling Inspection Tables [6], developed in the 1930's and first published in 1940, provide generalizations of the simple sampling methods used here.)
Previous work by Neff [17] is noteworthy, particularly with regard to the economies resulting from having a larger universe of many smaller, easilytestable, objects. Brennan Center report [3, Appendix J] gives some simple estimation formula, based on sampling with replacement. An excellent report [8] on choosing appropriate audit sizes by Dopp and Stenger from the National Election Data Archive Project is now also available; there is also a nice associated audit size calculation utility on a web site [16]. Stanislevic [23] also examines the issue of choosing a sufficient audit size; he gives a particularly nice treatment of handling varying precinct sizes.
Some states, such as California, mandate a certain level (e.g. 1%) of auditing [19]. As we shall see, using a fixed level of auditing is not a well justified approach; sometimes one may need more auditing, and sometimes less, to obtain a given level of confidence that no fraud has occurred.
Suppose we have n “objects.” In a voting context, such an “object” might typically correspond to a precinct; it could also correspond to a particular voting machine or even an individual ballot, depending on the situation; the math is the same.
In this paper, we assume an adversarial situation, where an adversary may have corrupted some of the objects. For example, the adversary might have tampered with the results of some precincts in a state.
Thus, after the adversary has acted, each object is either “good” (that is, clean, untampered with, uncorrupted), or “bad” (that is, tampered with, corrupted).
We now wish to test a sample of the objects to determine with high confidence that the adversary has not committed a “large” amount of fraud.
(As we shall see, this is related to the following standard combinatorial problem: We have an urn containing n balls, b of which are black and n−b of which are white; we wish to sample enough balls to have a sufficiently high probability of sampling at least one black ball.)
We assume that each object is independently auditable. That is, there is a test or audit procedure that can determine whether a given object is good or bad. We assume this procedure is always correct.
For example, testing the results in a precinct may involve comparing the electronic results from the precinct with a hand count of the corresponding voterverified paper ballots. The precinct may be judged to be good if the results are equal (or perhaps if they are “sufficiently close”).
Of course, there may easily be explanations for a discrepancy other than malicious behavior on the part of some “adversary.” Indeed, one of the normal goals of such a postelection audit is to assess if there were systematic errors in the results due to undetected equipment or procedural problems, such a misprogramming ballot style or other configuration data, in addition to assessing if fraud was present. In general, electronic officials will try to determine the cause of any discrepancies found, whether it be due to malicious causes or not, and take appropriate corrective or remedial action as necessary (which might involve further auditing and investigations).
While discrepancies found are typically not due to adversarial behavior (fraud), in this paper we nonetheless focus on the problem of detecting fraud in election results through the use of appropriate postelection auditing. This is in part because this is the most difficult case; we feel that systematic problems not due to malicious behavior are likely to be uncovered pretty well anyway during an audit designed to detect fraud. (Although we do favor requirements such as ensuring that at least one precinct in each county participate in a postelection audit.)
To continue with our modeling: we'll now simply assume that each object tested is found to be “good” or “bad.” In the voting scenario, we assume each precinct whose results have been in any way manipulated by an adversary will test as “bad,” and each precinct which the adversary has not touched will test as “good.”
This is of course a bit of a simplification, since an adversary may try to influence an election by making very small modifications to the results a large number of precincts, hoping that each such modification will be judged as too small to cause the precinct to be flagged as “bad” during the audit.
We note in this context that Ansolabehere and Reeves [2] determined from historical recount data in New Hampshire that there is typically a difference in range of 0.5%–1.0% between an initial machine count of a set of opscan ballots and a hand recount of those ballots; adversarial manipulation in a similar range might pass as nonanomalous. Similar results were found by Alvarez, Katz, and Hill [1] in their study of the recounting of punchcard ballots in California. (Another interesting study of recounts in New Hampshire by Herron and Wand [12] examines the question as to whether the choice of different voting technologies by different precincts was a source of partisan bias in the 2004 election results.)
Nonetheless, in spite of the clear existence of a small level of such “measurement noise,” we'll continue to make the simplifying assumption for our purposes in this paper that each precinct measures cleanly as “good” or “bad.”
For more general discussions of fraud in elections, see the references collected by Hill [13], the report by the EAC [4], and the report by the Brennan Center [3]. For more general discussions of auditing election results, see the case study by electionline.org [9], and the Brennan Center web site [10] (including the House testimony by Norden [18]).
To determine whether any fraud at all occurred, we would need to test all objects. Here we give up the ability to detect any fraud, and test only a sample of the objects in order to determine, with high confidence, that a large amount of fraud has not occurred. We lose a bit of confidence in return for a large increase in efficiency, as is usually the case for a statistical test.
Let b denote the number of “bad” objects we wish to detect, where b is a given constant, 1≤ b≤ n. That is, we wish to determine, with high confidence, that the number of corrupted objects is not b or greater.
Since the adversary wishes to escape detection, he will corrupt as few objects as possible, consistent with achieving his evil goals. We assume that corrupting b objects suffices, and so the adversary corrupts exactly b objects. (For voting, this implies that all precincts are assumed to have roughly the same size; see Section 2.1.)
We let c denote our desired “confidence level”—that is, we want the probability of detecting corruption of b or more objects to be at least c, where c is a given parameter, 0≤ c≤ 1 (e.g. c = 0.95).
We let
f = b/n (1) 
denote the fraction of bad objects we wish to detect; we call f the “fraud rate.” Given one of b or f, the other is determined, via equation 1).
We will be considering samples drawn both with replacement and without replacement. For mnemonic convenience, we use t to denote sample sizes when the sample is drawn with replacement, and u to denote sample sizes when the sample is drawn without replacement. (Think of “u” for “unique” or “distinct”.)
The auditing process can be cast in the terminology of a conventional “hypothesistesting” framework. We set the null hypothesis to be the hypothesis we wish to refute,
H_{0} = the reported election outcome is incorrect 
(i.e., there was fraud or other error in the electronic totals sufficient to change the election outcome), while the alternative hypothesis is its complement,
H_{1} = the reported election outcome is correct 
(i.e., the electronic totals give the correct result).
We then randomly select a sample of precincts to audit and handcount the corresponding paper ballots. If the handcounts all match the electronic totals in those precincts, we have event:
D_{0} = no sampled precincts were “bad” 
(where “bad” means “showed evidence of possible fraud”); otherwise, we have event:
D_{1} = at least one sampled precinct was “bad” . 
Our statistical test is designed in such a way that if the null hypothesis were true (sufficient fraud to change the election outcome exists), then it is very unlikely (probability at most 1−c) that no “bad” objects would be detected (event D_{0}) due to the random nature of sampling. Thus, the absence of any “bad” objects in our sample permits us to reject the null hypothesis with high confidence, and in this case, we would typically declare the winner to be the winner as shown by the electronic totals.
However, if we do detect “bad” objects (event D_{1}), we cannot reject the null hypothesis. This does not necessarily imply that the null hypothesis is true; rather, we simply do not have sufficient evidence to reject the null hypothesis and declare a winner. (Indeed, if b−1 objects were corrupted, a quantity posited to be insufficient to alter the election outcome, then it is quite likely nonetheless that a “bad” object would be detected.) In the case where we have evidence that some fraud may have occurred, one would typically proceed with a wider investigation of the election. Thus, our statistical test is inherently onesided: the absence of any “bad” objects in our sample allows us to reject the null hypothesis with high confidence and typically declare a winner, while the presence of any “bad” objects would typically trigger a wider investigation.
In hypothesis testing terms, one considers two types of errors. A Type I error occurs when the null hypothesis is rejected incorrectly. For a given statistical test, the probability that a Type I error occurs is denoted by α; in our case, this type of error occurs with probability
α = Pr[D_{0} ∣ H_{0}] . 
Here we have failed to detect fraud (or other significant error) sufficient to change the outcome of the election.
Similarly, a Type II error occurs when the null hypothesis is accepted incorrectly. For a given statistical test, the probability that a Type II error occurs is denoted by β; in our case, this type of error occurs with probability
β = Pr[D_{1} ∣ H_{1}] . 
Here we have detected errors or fraud in some precincts even though the election outcome is correctly determined by the electronic totals.^{2}
The quantity α is the statistical significance level of the test, and the quantity 1−β is the statistical power of the test. We are primarily concerned with the significance level of our test, i.e, the probability that we fail to detect an incorrect election outcome due to the nature of random sampling, a Type I error that occurs with probability α. In this paper, for historical reasons, we also refer to the confidence level c of our test, where c = 1 − α. (In the final version of this paper we may stick with the more typical usage.)
If we choose b appropriately (as, for example, suggested in the next section), then we have that
α = Pr[D_{0} ∣ H_{0}] ≤ Pr[D_{0} ∣ H_{0}'] 
where
H_{0}' = b precincts are “bad” . 
By determining an appropriate value of b, and then choosing the appropriate sample size to make Pr[D_{0} ∣ H_{0}'] sufficiently small, we will make the probability of a Type I error at most α = 1−c. Thus, the probability of reporting an incorrect election outcome will be at most 1−c.
We now explain how a suitable value for b might be determined for an election audit from the apparent margin of victory, using a method suggested by Dopp and Stenger [8]. Here, b is the number of precincts that an adversary would have had to corrupt to swing the election. If we assume (as is suggested by Dopp and Stenger to be reasonable) that the adversary wouldn't dare to change more than a fraction s = 0.20 (i.e. 20%) of the votes in a precinct, and that the “winner” won by a margin of m of the votes (where 0≤ m≤ 1), then the adversary would have had to have corrupted a fraction
f=m/(2s)=2.5m (2) 
of the precincts—or, equivalently,
b=mn/(2s)=2.5mn (3) 
precincts.
(We assume all precincts have the same size. If all of the votes changed had been moved from the actual winner to the alleged winner, then a margin of victory of a fraction m of the votes cast for the alleged winner must have involved at least a fraction m/(2*0.20)=2.5m of the precincts, since each precinct corrupted changed the difference in vote count between the top two candidates by 2s=40% of the vote count of that precinct.) If the apparent winner has won by m=1% in a county with 400 precincts, you would want to test for b=2.5mn=10 or more bad precinct counts.
This approach can be modified in various ways (e.g. by adjusting s) to suit particular situations; in any case a value of b is determined that represents the minimum number of precincts that an adversary “must” corrupt in order to have changed the election outcome.
See Saltman [22], Stanislevic [23], or Dopp et al. [8] for further examples and excellent treatment of the issue of computing appropriate target values b (or f) given a set of election results and possibly varying precinct sizes. Rivest [21] also treats the case of varying precinct sizes.
We begin by examining sampling with replacement (where the sample may contain an element more than once). Although this wouldn't be used in practice for auditing an election, it is a useful starting point for our analyses, and provides some reasonably accurate estimation formulae that can be easily computed in one's head.
For sampling with replacement, we use t to denote the sample size, and t_{*}(n,b,c) to denote the optimal sample size (when sampling a set of size n with replacement, in order to find at least one bad element, with probability at least c, when b bad elements are present). We'll later use the analogous notation u_{*}(n,b,c) for the optimal sample size for sampling without replacement.
Here now is a simple “rule of thumb” for sampling with replacement.
ft = 
 ≥ 3. (4) 
t ≥ 3n / b . (5) 
As a simple example: to detect a 1% fraud rate (f=0.01) (with 95% confidence), you then need to test t=300 objects.
Note that for a given fraud rate f, the rule's sample size is independent of the universe size n. This may seem counterintuitive, but is to be expected. If you have some wellmixed sand where most sand grains are white, but a fraction f are black, you need only sample a handful to be confident of obtaining a black grain, no matter whether the amount of sand to be examined is a cupful, a bucketfull, or a beach.
The sample size t may even be greater than n (if b<3); this is OK since we are sampling with replacement, and it may take more than n samples (when sampling with replacement) to get adequate coverage when b is so small.
We now justify the Rule of Three, and then generalize it to handle an arbitrary confidence level (not just c=0.95). Let f=b/n be the fraud rate, and let t be the sample size.
We first justify the Rule of Three for a confidence level of 95%; this analysis follows that given by Jovanovic and Levy [15].
The probability that a fraud rate of f or greater goes undetected (when drawing a sample of size t with replacement) is:
(1−b/n)^{t} = (1 − f) ^{t} . (6) 
so t must be large enough so that
(1−f)^{t} ≤ 0.05 
or equivalently:
t ≥ 
 (7) 
Since
ln(0.05) = − ln(20) = −2.9957 ≈ −3 
—isn't it so very nice that ln(20) is almost exactly 3?—equation (7) is implied by
t ≥ 
 . (8) 
Using the wellknown approximation
ln(1 − f) ≈ −f , (9) 
which is quite accurate for small values of f (and −f is an lower bound on ln(1−f)), equation (8) is implied by:
t ≥ 

which can be rewritten as
t ≥ 
 (10) 
or equivalently as
ft ≥ 3 . (11) 
Equation (11) has a very nice and intuitive interpretation. Since t is the number of objects tested, and f is the fraud rate, then ft is the number of objects among the test objects that we would expect to find corrupted.
The sample should be large enough so that you expect it to contain at least three corrupted objects. If you sample enough so that you expect to see at least three corrupted objects on the average, then you'll see at least one corrupted object almost always (i.e., at least 95% of the time).
(Similarly, a random variable X distributed according to the Poisson distribution with mean λ>3 satisfies Pr[X=0]=e^{−λ}<e^{−3}=0.04978….)
As a running example, suppose that n = 400, b=10, and f = b/n = 0.025; the Rule of Three says to pick a sample of size 3n/b=3*400/10 = 120.
(We shall see that the optimal sample size for sampling without replacement for these parameters is a little smaller—103—, so considering sample size with replacement may be a good firstcut approximation to the sample size needed for sampling without replacement.This “Rule of Three” ( t ≥ 3n/b ) is simple enough for some practical guidance.)
The Rule of Three is also easily generalized to handle other confidence levels. For a general confidence level c, 0≤ c≤ 1, we need that
(1−f)^{t} ≤ (1−c) (12) 
so we obtain the following formulae for the optimal sample size t_{*}(n,b,c), when sampling with replacement:

We note that equation (13) may give “optimal” values for t_{*} that are nonintegral, while in practice the sample size must be an integer. Of course, the optimal integral sample size is then just t_{*} rounded up to the next integer, yielding T_{*}:
T_{*}(n,b,c) = ⌈ t_{*}(n,b,c) ⌉ . 
Using equation (9), we obtain the generalized form of the Rule of Three as an approximation:
t_{1}(n,b,c) = 
 . (15) 
This completes our discussion of sample sizes for sampling with replacement.
Suppose we pick u objects to test, where 0< u≤ n. These u objects are chosen independently at random, without replacement—the objects are distinct.^{3}
In an election, if any of the u tested objects (e.g. precincts or voting machines) turns out to be “bad,” then we may declare that “evidence of possible fraud is detected” (i.e., at least one bad object was discovered). Otherwise, we report that “no evidence of fraud was detected.” When a bad object is detected, additional investigation and further testing may be required to determine the actual cause of the problem.
We wish it to be the case that if a large amount of fraud has occurred (i.e., if the number of corrupted objects is b or greater), then we have a high chance of detecting at least one bad object.^{4}
Given that we are drawing, without replacement, a sample of size u from a universe of size n containing b bad objects, the chance that no bad objects are detected (i.e. all bad objects escape detection) is:

the chance that at least one bad object is detected is:

We note here the convenient duality between b and u, which we shall use later:

(If we think of the b bad objects as the “sample” and the u audit objects as the targets to be detected, then we are just switching the role of the bad objects and the audited objects.) This duality gives us another expression for e(n,b,u), dual to equation (18):
e(n,b,u) = 

 . (24) 
For a given confidence level c (e.g. c = 0.95), the optimal sample size u_{*}=u_{*}(n,b,c) is the least value of u making d(n,b,u) at least c:

We now address again the issue of nonintegral sample sizes. Although of course sample sizes are integral in practice, our formulae work perfectly well for nonintegral sample sizes, and it is convenient for us to work with them: note that e(n,b,u) equation (24) is well defined when u is any real number, and so d(n,b,u)=1−e(n,b,u) is also well defined when u is any real number. In practice, a nonintegral optimal sample size u_{*}(n,b,c) would be rounded up to the next integer ⌈ u_{*}(n,b,c) ⌉, which we denote as U_{*}(n,b,c).
Equations (16)–(20) and (25)–(26) are not new here; they have been given and studied by others (e.g. [22, 17, 8]).
In our running example, we have n=400 and b=10; we wish to determine if a set of 400 objects contains 10 or more bad ones. Using a computer program to try successive values of u yields the result:
U_{*}(400,10,0.95) = 103 ; (27) 
we need to test a sample (drawn without replacement) of size at least 103 in order to determine if our set of 400 objects contains 10 or more bad objects, with probability at least 95%.
In some sense, this completes the analysis of the problem; it is easy for a computer program to determine the optimal sample size U_{*}(n,b,c), given n, b, and c. (See http://uscountvotes.org where such a program may be posted.)
However, it is useful to find simple but accurate approximations for this optimal value U_{*}(n,b,c) of u, that can be easily calculated without the use of a computer. That is the main purpose of this paper—to derive accurate and rigorously justified approximations for U_{*} that can be evaluated by election officials using only a pocket calculator.
The formulae of the previous section for T_{*} (for sampling with replacement) are of course crude estimates for U_{*} (sampling without replacement); they are overestimates.
To see this, note that equation (18) implies that
e(n,b,u) ≤  /   \  1 − 
 \   / 
 (28) 
Now (1−b/n)^{u} is the probability of drawing a multiset of size u with replacement having no bad objects. Thus, for a fixed sample size, the probability of failure when drawing samples without replacement is, as one would expect, upper bounded by the probability of failure when drawing samples with replacement. The quality of this upper bound is a function of the difference between the righthand sides of equation (15) and inequality (25). Note that this difference grows as u increases, and for high probability results with large n and small b, u can be quite large. (Indeed, when b=1 and c very large, t_{*}(n,b,c) is approximately nln(n) — this is the “coupon collector's problem” — while u_{*}(n,b,c) is clearly no larger than n.)
Thus, we can in fact use the Rule of Three or other formulae from the preceding section to get an upper bound on the sample size needed for sampling without replacement; in many cases this may give a satisfactory firstcut answer. But we can do better, as the next section shows.
We now develop an upper bound on the optimal sample size when sampling without replacement to detect at least one of b bad objects in a universe of size n with probability at least c.
From equation (24), one can derive (analogous to the derivation of equation (28) from equation (18)), the following bound:
e(n,b,u) ≤  /   \  1 − 
 \   / 
 (29) 
Our goal is to determine a value u is sufficiently large to guarantee that e(n,b,u) is at most 1−c; from the bound (29) we can obtain such a sufficiently large u:
/   \  1− 
 \   / 
 ≤ 1−c 
⇔ 1−u/n ≤ (1−c)^{1/b} 
⇔ u/n ≥ 1 − (1−c)^{1/b} 
⇔ u ≥ n (1 − (1−c)^{1/b}) (30) 
Since (29) holds for any u satisfying the above relation, u_{*}(n,b,c) is no larger than the right hand side of the above relation. This upper bound on u_{*}(n,b,c) is our first major result for sampling without replacement; it is a formula that is easy to calculate, yet remarkably accurate.
We designate this bound as u_{1}:
u_{*}(n,b,c) ≤ u_{1}(n,b,c) (31) 

The formula for u_{1}(n,b,c) is the same as the that proposed by Rivest [20] as an approximation for u_{*}(n,b,c); however, that paper only justified u_{1} as an approximation heuristically and empirically; here we have shown that it is a firm upper bound for u_{*}(n,b,c).
Of course, if we round up u_{1}(n,b,c) to obtain U_{1}(n,b,c), we obtain an integer upper bound on the optimal integral sample size:

We can obtain a tighter upper bound by analyzing the product in equation (24) directly. Using the following wellknown inequalities relating the harmonic, geometric, and arithmetic means for nonnegative values x_{i} [11]
 ≤ 
 ≤ 
 (33) 
we proceed as follows, where H_{k} is the kth harmonic number, i.e., H_{k} = 1 + 1/2 + ⋯ + 1/k.
As before, our goal is to determine a u sufficient to guarantee that the above quantity is at most 1−c. Solving the inequality
/   \  1 − u · 
 \   / 
 ≤ 1−c 
in much the same manner as the derivation of inequality (30), we obtain

Note that the bound obtained in inequality (35) was derived using only one approximation, inequality (34) above. The righthand side of inequality (35) is our second upper bound on the optimal sample size required for sampling without replacement. We call this upper bound u_{2}(n,b,c); we also let U_{2}(n,b,c)=⌈ u_{2}(n,b,c) ⌉; this is of course an upper bound on U_{*}(n,b,c).
u_{*}(n,b,c) ≤ u_{2}(n,b,c) (36) 

Unfortunately, most calculators don't have a “harmonic number” button, so inequality (35) isn't so useful in practice!
To fix this situation, without weakening our bound too much, we note that
 = 

is the harmonic mean of the set of values {n, …, n−b+1 }; thus, we can obtain a simpler though slightly weaker bound by employing inequality (33) and replacing this harmonic mean by the corresponding (and somewhat larger) arithmetic mean (n − (b−1)/2), which yields

This gives our third and final upper bound:
u_{*}(n,b,c) ≤ u_{3}(n,b,c) (39) 

Note the similarity of inequalities (30) and (38): the factor n has been replaced with (n − (b−1)/2). Thus, the new inequality (38) (and inequality (35) which precedes it) is a strict improvement over inequality (30) for all b > 1 (and the same for b=1).
We let U_{3}(n,b,c)=⌈ u_{3}(n,b,c) ⌉; this is of course also an upper bound on U_{*}(n,b,c).
Inequality (38) is our third (and final) upper bound on the optimal sample size required for sampling without replacement; it is the inequality that we recommend for actual use in practice.^{5} As we see in the next section, it should never give a sample size that is more than 3 too large, assuming that c≤ 0.9975.
Here is a simple proof that our bound (38) does not exceed u_{*}(n,b,c) by too much. Interestingly, the amount that it exceeds u_{*}(n,b,c) is largely independent of both n and b.
We now give a lower bound on our probability of failure, derived from equation (24), complementary to our previous upper bound (29):

Thus, our probability of failure is at least 1−c if
/   \  1− 
 \   / 
 ≥ 1−c . 
Solving for u, this is equivalent to
u ≤ (n−(b−1)) · (1−(1−c)^{1/b}) . 
Thus,
u_{*}(n,b,c) ≥ (n−(b−1)) · (1−(1−c)^{1/b}) (41) 
Note the resemblance of this lower bound on u_{*} to the upper bound of inequality (38):
u_{*}(n,b,c) ≤ (n−(b−1)/2) · (1−(1−c)^{1/b}). 
Now we can show that the bound (38) does not exceed u_{*}(n,b,c) by much; the difference is at most
 · (1−(1−c)^{1/b}). (42) 
Note that this is independent of n. It is also effectively independent of b: Using elementary calculus, one can show that the difference (42) above is monotonically increasing in b and that

 = 

Thus, our bound u_{3}(n,b,c) never exceeds u_{*}(n,b,c) by more than (−ln(1−c))/2, independent of n and b, and this quantity is less than 3 for all c≤ 0.9975. (It follows that U_{3}(n,b,c)−U_{*}(n,b,c) is at most 3.)
Similar reasoning shows that our bound u_{1}(n,b,c) never exceeds u_{*}(n,b,c) by more than twice as much as u_{3}(n,b,c) does: it is off by no more than (−ln(1−c)), independent of n and b, and this quantity is less than 6 for all c≤ 0.9975.
In conclusion, we have a sample size
u_{3}(n,b,c) =  /  \  n − 
 \  /  ·  /  \  1 − (1−c)^{1/b}  \  / 
that is
The following table demonstrates the accuracy of our formula for n=500 (slightly larger than the number of precincts in a typical U.S. Congressional district), for c=0.95 and c=0.99, and various values of b. The “low” column gives the lower bound of equation (41), the “opt” column gives the optimal sample size U_{*}(n,b,c), and the “up” column gives our upper bound u_{3}(n,b,c).
c=0.95  c=0.99  
n  b  low  opt  up  low  opt  up 
500  1  475  475  475  495  495  495 
500  2  388  388  388  450  450  450 
500  5  224  225  225  299  300  300 
500  10  128  129  129  182  183  183 
500  20  67  69  69  99  101  101 
500  50  27  28  28  40  42  42 
500  100  12  14  14  19  21  21 
500  200  5  6  6  7  9  10 
This paper has largely been concerned with determining the size of a statistical audit u for a given universe of size n, desired fraud detectability level b, and desired confidence c. However, there are related “inverse” questions which are frequently asked that our bounds and techniques can usefully address.
For example, the size u of a statistical audit may be mandated by law (e.g., u=0.02n for 2% audit), and one may wish to know for this u and a given b what confidence level c one has in detecting corruption of b (or more) objects. This is the “confidence level” question.
Or, one may wish to know for this u and a given c the smallest number b of corrupted objects one can detect with confidence at least c. This is the “level of fraud detectability” question.
These two questions can be effectively answered using the bounds or techniques developed above. Essentially, the four variables n, u, b, and c are related by the equation
/  \ 
 \  /  ÷  /  \ 
 \  /  =  /  \ 
 \  /  ÷  /  \ 
 \  /  = 1−c 
and fixing any three of these variables, one can approximate the fourth.
Besides the “inverse” questions, our techniques can be adapted to handle various restrictions imposed on the auditing process. The proposed Holt bill [14], for instance, specifies that at least one precinct from each county needs to be audited.
We show how to answer the “inverse” as well as the restriction questions above using our bounds and techniques.
Given a universe of size n and a given audit size u, what confidence can one have in being able to detect one (or more) of b “bad” objects?
This confidence is given exactly by
c = d(n,b,u) = 1 − e(n,b,u). (43) 
Much of Section 4 was effectively devoted to proving the following bounds on e(n,b,u):

Applying these inequalities to equation (43), we obtain:

The above inequalities may be useful, say, when considering legislation that mandates some fixed level u of auditing (see [7] as one example of this sort of consideration). This scenario is also useful for a “tiered” auditing approach [18]. In the tiered approach, thresholds for the margin of victory correspond to particular auditing percentages. For example, for a margin of victory of 1.75%, one should audit 5% of the precincts because it offers a confidence of 61% compared to 43% as given by a 3% auditing strategy [18]. In this case, our techniques are useful to compute the confidence level achieved when auditing a certain percent of precincts for a given margin of victory.
Given a universe of size n, a fixed audit size u, and a confidence level c, what is the smallest b for which can one detect one (or more) of b “bad” objects with confidence at least c?
While our original problem was solved by approximating the quantity
e(n,b,u) =  /  \ 
 \  /  ÷  /  \ 
 \  /  , 
this dual problem is best solved by approximating the equivalent quantity
e(n,b,u) =  /  \ 
 \  /  ÷  /  \ 
 \  /  . 
Using the techniques developed in Section 4, one can derive the following analogous bounds on e(n,b,u):

Setting e(n,b,u) = 1−c and solving for b, we obtain:

As before, one can show that these bounds are never different by more than (−ln(1−c))/2, which is less than 3 for all c ≤ 0.9975.
One could then apply these results using relationship (3) to estimate what is the corresponding smallest margin of victory that one could confirm with an audit of the given size u, to the given confidence level c, in a straightforward manner.
Some election systems might place constraints on the auditing process. In this section, we illustrate how one can employ our techniques in the case of the Holt bill when at least one precinct from each county needs to be audited (as specified in [14]).
Let z denote the number of counties and a_{i} the number of precincts in county i. We are still aiming for a total confidence of c, when sampling n precincts out of which b are corrupted. There are at least two ways to audit the precincts:
In this case, we prefer the first approach because the latter can audit as many as u(n,b,c)+z−1 if all the precincts audited with our technique belong to the same county.
Specifically, we propose the following procedure:
At step 2, b^{*} = b because no fraud was detected so far and n^{*} = n−z because z precincts were already audited. We show that c^{*} = 1 − 1−c/(1−1/a_{max})^{b} guarantees a global confidence level of at least c, where a_{max} is the maximum number of precincts in a county. For brevity, let us define the following events: A = “fraud goes undetected after conducting the procedure above,” A_{1} = “fraud is undetected at step 1,” and A_{2} = “fraud is undetected at step 2.” We now have:
Pr[A] = Pr[A_{1} A_{2}] = Pr[A_{1}] Pr[A_{2} ∣ A_{1}] (44) 

where b_{i} is the number of corrupted precincts from the ith county. We also used the fact that (1−b_{i}/a_{i}) ≤ (1−1/a_{i})^{bi}.
According to our formula in Section 4,
Pr[A_{2} ∣ A_{1}] ≤ 1−c^{*} ≤ 
 , 
if we audit u(n^{*}, b^{*}, c^{*}).
Therefore, using equation (44), the probability that corruption goes undetected is less than 1−c.
Note that the formulae for c^{*} makes sense because because when 1−c > (1−1/a_{max})^{b} step 1 suffices to guarantee the desired confidence. Also, since n^{*} < n and c^{*} < c, u(n^{*},b^{*},c^{*})+z < u(n,b,c) + z; hence, this method is more efficient method than the first.
Furthermore, the number of precincts audited beyond the u(n,b,c) that we would audit when no constraints exist is small as follows. We make use of the formulae in equation (32) for simplicity.

where the inequality comes from the fact that 1 − c ≤ 1 − (1−1/a_{max})^{b} since otherwise one would stop auditing at step 1. For a balanced distribution of the precincts in counties, z − n/a_{max} is close to 0 which means that the additional number of precincts we audit with the new constraint is small.
A related question that may be encountered in elections is when the candidate that lost the election is allowed to audit z precincts of his/her choice. In this scenario, we have a similar choice: to use our techniques before or after the loser's auditing. If the loser audits first, we will next audit an additional number of u(n−z,b,c) precincts. The sample space is reduced by z while the confidence required, c, stays the same because we do not know the loser's probability of detecting fraud (and thus assume it to be 0). On the other hand, if the loser picks after we employ our auditing techniques, we will have to audit u(n,b,c). The sample space remains the same because the loser's strategy may be so bad that he/she selects only uncorrupted precincts. The first procedure turns out to be better once again.
We note (as other authors have as well) that overly simple rules, such as “sample at a 1% rate”, are not statistically justified in general. Using the Rule of Three, we see that a 1% sample rate is appropriate only when
t ≤ 0.01n 
or
3n/b ≤ 0.01n 
or
b ≥ 300 . 
Since b is the total number of corrupted objects, we see that a 1% sampling rate may be inadequate when n is small, or the fraud rate is small…(Of course, the Rule of Three is only for sampling with replacement, but the intuition it gives carries over to the case of sampling without replacement.)
We hope that the rules presented here will provide useful guidance for those designing sampling procedures for audits.
Indeed, since the formula
U_{3}(n,b,c) = ⌈ (n − (b−1)/2) ( 1 − (1−c)^{1/b}) ⌉ (46) 
is so simple, so accurate, and always conservative, one could imagine just always using this sample size (instead of the optimal value), or writing this formula into election law legislation mandating audit sample sizes. Along with this formula, one could perhaps mandate use of equation (3) deriving the number b of bad objects to test from the apparent margin of victory m of the winner. (But it would probably be best to merely mandate a sample size sufficient to detect, with a specified level of confidence, any election fraud sufficient to have changed the outcome. In addition, one may wish to ensure that objects (e.g. precincts) with surprising or suspicious results also get examined.)
We thank Mike Alvarez and the anonymous reviewers for their helpful comments. We also gratefully acknowledge the support provided by the CalTech/MIT Voting Technology Project and NSF grants CCF0418390 and IIS0534482.
n·(1 − (1−c) 
 ) + 1 
This document was translated from L^{A}T_{E}X by H^{E}V^{E}A.