Confidence and Accuracy

Benchmarking can never produce an exact result because complex systems exhibit inherent variability in their behavior. The best we can do is to make a probabilistic claim about the interval in which the ``true'' value for a metric lies, based on measurements from multiple independent trials [13]. Such a claim can be characterized by a confidence level and the confidence interval at this confidence level. For example, by observing the mean response time $\bar{R}$ at a test load $\lambda$ over a set of independent trials, we may be able to claim that we are confident to a stated percentage (the confidence level) that the correct value of $\bar{R}$ for that $\lambda$ lies within a stated range (the confidence interval).

Basic statistics tells us how to compute confidence intervals and levels from a set of trials. For example, if the mean server response time $\bar{R}$ from $t$ trials is $\mu$, and the standard deviation is $\sigma$, then the confidence interval for $\mu$ at confidence level $c$ is given by:

$$\left[\mu - z\frac{\sigma}{\sqrt{t}},\; \mu + z\frac{\sigma}{\sqrt{t}}\right]$$

where $z$ is a reading from the table of the standard normal distribution for confidence level $c$. If the number of trials is small, then we use Student's t distribution instead, after verifying that the $t$ runs come from a normal distribution [13].
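The interval computation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' tooling: the function name `normal_confidence_interval` and its parameters are hypothetical, and the sketch covers only the standard normal case. For a small number of trials, the quantile `z` would be replaced by a Student's t quantile with one fewer degrees of freedom than the number of trials (for example, from SciPy's `scipy.stats.t.ppf`).

```python
import math
import statistics


def normal_confidence_interval(samples, confidence):
    """Return (low, high) for the mean of `samples` at `confidence`.

    Hypothetical helper: uses the standard normal quantile, so it is only
    appropriate when the number of trials is reasonably large.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two trials")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")

    mean = statistics.fmean(samples)
    stdev = statistics.stdev(samples)  # sample standard deviation

    # Two-sided interval: confidence 0.95 maps to the 0.975 quantile.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2.0)

    half_width = z * stdev / math.sqrt(n)
    return mean - half_width, mean + half_width
```

Calling `normal_confidence_interval(response_times, 0.95)` on a list of per-trial mean response times returns the lower and upper bounds of the interval. Raising the confidence level widens the interval, while adding trials narrows it.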

The tightness of the confidence interval captures the accuracy of the estimate of the metric's true value. A tighter bound implies that the mean response time from a set of trials is closer to its true value. For a given confidence interval, we compute the percentage accuracy from the width of the interval: the narrower the interval relative to the values it bounds, the higher the accuracy.
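As a rough sketch of how a percentage accuracy might be derived from an interval, the Python below assumes accuracy is one minus the interval's half-width relative to its midpoint. That definition is an assumption made for illustration and is not necessarily the one used in this paper. The function name `percentage_accuracy` is likewise hypothetical.

```python
def percentage_accuracy(low, high):
    """Return an accuracy percentage for the confidence interval [low, high].

    ASSUMPTION: accuracy = 1 - (high - low) / (high + low), i.e. one minus
    the half-width of the interval divided by its midpoint. This is an
    illustrative definition, not one confirmed by the text.
    """
    if low > high:
        raise ValueError("low must not exceed high")
    if low + high <= 0:
        raise ValueError("interval must bound a positive quantity")

    relative_error = (high - low) / (high + low)
    return 100.0 * (1.0 - relative_error)
```

Under this assumed definition, an interval that collapses to a single point yields 100% accuracy, and the accuracy falls as the interval widens around the same midpoint.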

Table 2: Benchmarking parameters used in this paper.

$\lambda ^*$	Peak rate for a given server configuration and workload.
$\lambda$	Offered load (arrival rate) for a given test load level.
$\rho$	Load factor $=\lambda/\lambda^*$ for a test load $\lambda$ .
$\bar{R}$	Mean server response time for a test load.
$R_{sat}$	Threshold for $\bar{R}$ at the peak rate: the server is saturated if $\bar{R} > R_{sat}$ .
$s$	Factor that determines the width of the peak-rate region $[R_{sat} \pm sR_{sat}]$ (§3.3).
$a$	Target accuracy (based on confidence interval width) for the estimated value of $\lambda ^*$ (§2.3).
$c$	Target confidence level for the estimated $\lambda ^*$ (§2.3).
$t$	Number of independent trials at a test load.
$r$	Runlength: the test interval over which to observe the server latency for each trial.
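The relationships among the parameters in Table 2 can be made concrete with a short Python sketch. The function names are hypothetical, and reading the peak-rate region as the band of mean response times within $[R_{sat} \pm sR_{sat}]$ is an interpretation of the table entry rather than a definition taken from the text.

```python
def load_factor(offered_load, peak_rate):
    """Load factor rho = lambda / lambda* for a test load."""
    if peak_rate <= 0:
        raise ValueError("peak rate must be positive")
    return offered_load / peak_rate


def is_saturated(mean_response_time, r_sat):
    """The server is saturated when the mean response time exceeds R_sat."""
    return mean_response_time > r_sat


def in_peak_rate_region(mean_response_time, r_sat, s):
    """Check whether the mean response time lies in [R_sat - s*R_sat, R_sat + s*R_sat].

    ASSUMPTION: the peak-rate region is interpreted as a band of response
    times centered on R_sat, with s giving its relative half-width.
    """
    lower = r_sat * (1.0 - s)
    upper = r_sat * (1.0 + s)
    return lower <= mean_response_time <= upper
```

For instance, with a threshold of 100 and a factor of 0.05, a mean response time of 104 falls inside the region, while 110 lies outside it and also counts as saturated.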

Algorithm: Mapping Response Surfa...