

Confidence and Accuracy

Benchmarking can never produce an exact result because complex systems exhibit inherent variability in their behavior. The best we can do is to make a probabilistic claim about the interval in which the ``true'' value for a metric lies based on measurements from multiple independent trials [13]. Such a claim can be characterized by a confidence level and the confidence interval at this confidence level. For example, by observing the mean response time $ \bar{R}$ at a test load $ \lambda$ for $ 10$ independent trials, we may be able to claim that we are $ 95$% confident (the confidence level) that the correct value of $ \bar{R}$ for that $ \lambda$ lies within the range $ [25 ms,30 ms]$ (the confidence interval).

Basic statistics tells us how to compute confidence intervals and levels from a set of trials. For example, if the mean of the server response time $ \bar{R}$ over $ t$ trials is $ \mu$, and the standard deviation is $ \sigma$, then the confidence interval for $ \mu$ at confidence level $ c$ is given by:


$\displaystyle [\mu - \frac{z_{c}\sigma}{\sqrt{t}}, ~~ \mu + \frac{z_{c}\sigma}{\sqrt{t}}]$ (1)


Here $ z_c$ is the critical value read from the table of the standard normal distribution for confidence level $ c$. If $ t \le 30$, we use Student's t distribution instead, after verifying that the $ t$ runs come from a normal distribution [13].
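As a minimal sketch of Equation 1, assuming Python 3.8 or later and its standard `statistics` module: the function name `confidence_interval`, its parameters, and the trial values below are illustrative and do not come from the paper. The sketch uses the sample standard deviation (an assumption; the text does not say which estimator is meant) and the normal critical value $ z_c$. For $ t \le 30$, the caller would pass a Student's t critical value, looked up for $ t-1$ degrees of freedom, through `critical_value`.

```python
import math
from statistics import NormalDist, mean, stdev


def confidence_interval(samples, confidence=0.95, critical_value=None):
    """Return (low, high) for the mean of `samples`, following Equation 1.

    samples        -- one measurement (e.g. mean response time) per trial
    confidence     -- confidence level c, as a fraction in (0, 1)
    critical_value -- optional override for z_c, e.g. a Student's t value
                      taken from a table when the number of trials is small
    """
    t = len(samples)
    if t < 2:
        raise ValueError("need at least two trials to estimate sigma")
    mu = mean(samples)
    sigma = stdev(samples)  # sample standard deviation (assumption)
    if critical_value is None:
        # Two-sided critical value of the standard normal distribution.
        critical_value = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half_width = critical_value * sigma / math.sqrt(t)
    return mu - half_width, mu + half_width


# Hypothetical response times (ms) from 40 independent trials.
trials = [26.0, 28.0] * 20
low, high = confidence_interval(trials, confidence=0.95)
print(f"95% confidence interval: [{low:.2f} ms, {high:.2f} ms]")
```

Passing the critical value explicitly keeps the sketch free of third-party dependencies; a statistics library could supply the Student's t quantile instead of a table lookup.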

The tightness of the confidence interval captures the accuracy of the estimate of the metric's true value. A tighter interval implies that the mean response time measured from a set of trials is likely to be closer to its true value. For a confidence interval $ [low,high]$, we compute the percentage accuracy as:


$\displaystyle accuracy = 1 - error = 1 - \frac{high-low}{high+low}$ (2)
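Equation 2 can be checked with a short sketch, again in Python. The function name `accuracy` is illustrative; the interval $ [25 ms,30 ms]$ is the example used earlier in this section. Note that for an interval produced by Equation 1, $ high-low$ is twice the half-width and $ high+low$ is $ 2\mu$, so the error term is the half-width relative to the mean.

```python
def accuracy(low, high):
    """Accuracy of a confidence interval [low, high], following Equation 2."""
    if low > high:
        raise ValueError("low must not exceed high")
    if high + low <= 0:
        raise ValueError("Equation 2 assumes a positive metric")
    error = (high - low) / (high + low)
    return 1.0 - error


# The example interval from the text: [25 ms, 30 ms].
# error = 5 / 55, so the accuracy is 50 / 55.
print(f"accuracy = {accuracy(25.0, 30.0):.1%}")  # prints "accuracy = 90.9%"
```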



Table 2: Benchmarking parameters used in this paper.
$ \lambda ^*$ Peak rate for a given server configuration and workload.
$ \lambda$ Offered load (arrival rate) for a given test load level.
$ \rho$ Load factor $ =\lambda/\lambda^*$ for a test load $ \lambda$.
$ \bar{R}$ Mean server response time for a test load.
$ R_{sat}$ Threshold for $ \bar{R}$ at the peak rate: the server is saturated if $ \bar{R} > R_{sat}$.
$ s$ Factor that determines the width of the peak-rate region $ [R_{sat} \pm sR_{sat}]$ (Section 3.3).
$ a$ Target accuracy (based on confidence interval width) for the estimated value of $ \lambda ^*$ (Section 2.3).
$ c$ Target confidence level for the estimated $ \lambda ^*$ (Section 2.3).
$ t$ Number of independent trials at a test load.
$ r$ Runlength: the test interval over which to observe the server latency for each trial.



[Algorithm: ``Mapping Response Surfa...'' (listing truncated in the source; last visible step: ``...he server to the saturation state'')]

varun 2008-05-13