

Evaluation

This work is motivated by four hypotheses.

Hypothesis 1
Data dependencies and computation must be independently modeled during replay; otherwise, the replay may differ from the traced application.
Hypothesis 2
By throttling every node and delaying every I/O, the I/O dependencies and compute time can be discovered and accurately replayed.
Hypothesis 3
Not every I/O necessarily needs to be delayed in order to achieve good replay accuracy.
Hypothesis 4
Not every node necessarily needs to be throttled in order to achieve good replay accuracy.

To test these hypotheses, three applications are traced and replayed across three different storage systems. The applications and storage systems were chosen to have different performance characteristics, both to highlight how application I/O rates scale differently across storage systems and to illustrate how //TRACE can collect traces on one storage system and accurately replay them on another. Recall that the primary goal of this work is to evaluate a new storage system, using trace replay to simulate the application; as such, traces are normally collected from one storage system and replayed on another.

There are three replay modes we could use as a baseline for comparison: a closed-loop as-fast-as-possible replay that ignores the think time between I/Os (AFAP), a closed-loop replay that replays the think time (which we call think-limited), and an open-loop replay that issues each I/O at the same time it was issued in the trace (timing-accurate [3]). Think-limited assumes that the think time (some combination of compute and synchronization) between I/Os is fixed. In general, we find think-limited to be more accurate than AFAP and therefore use it as our baseline. A timing-accurate replay is not considered because, by definition, it has a running time identical to that of the traced application. Note that a replayer that models only compute time (and ignores synchronization) requires some mechanism to distinguish compute time from synchronization time (e.g., a causality engine); think-limited is therefore the best one can do before introducing such a mechanism.
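
To make the distinction concrete, the sketch below contrasts the three policies for a single node's trace. It is only a minimal illustration: the (issue_time, think_time, io) record layout and the do_io helper are assumptions for this example, not //TRACE's actual interfaces.

\begin{verbatim}
import time

def do_io(io):
    # Placeholder for issuing the traced I/O against the
    # storage system under evaluation.
    pass

def replay(trace, mode):
    # trace: list of (issue_time, think_time, io) records, where
    # issue_time is the offset from the start of the original run.
    start = time.time()
    for issue_time, think_time, io in trace:
        if mode == "afap":
            pass  # closed loop: issue the next I/O immediately
        elif mode == "think-limited":
            time.sleep(think_time)  # closed loop: replay recorded think time
        elif mode == "timing-accurate":
            # open loop: issue at the same wall-clock offset as in the
            # trace (a true open loop would issue asynchronously; this
            # sketch blocks on each I/O for simplicity)
            delay = issue_time - (time.time() - start)
            if delay > 0:
                time.sleep(delay)
        do_io(io)  # blocking call; the closed-loop modes wait here
\end{verbatim}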

Experiment 1 (Hypothesis 1) compares the running time of think-limited against the application. Because think-limited assumes a fixed synchronization time, one should expect high replay error when an application with significant synchronization time is traced on one storage system and replayed on another that has different performance.

Experiment 2 (Hypothesis 2) uses the causality engine to create annotated I/O traces. The traces are replayed and compared against think-limited.
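
For intuition, the difference between the two trace formats can be pictured as follows. The record layouts are hypothetical, used only to illustrate what the causality engine's annotations add relative to a single folded think time.

\begin{verbatim}
from dataclasses import dataclass, field

@dataclass
class ThinkLimitedRecord:
    # compute and synchronization time folded into one value
    think_time: float
    io: str

@dataclass
class AnnotatedRecord:
    # pure computation preceding this I/O
    compute_time: float
    # I/Os (possibly on other nodes) that must complete first
    depends_on: list = field(default_factory=list)
    io: str = ""
\end{verbatim}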

Experiment 3 (Hypothesis 3) uses I/O sampling to explore the trade-off between tracing time and replay accuracy. Similarly, Experiment 4 (Hypothesis 4) uses node sampling to illustrate that not all nodes necessarily need to be throttled in order to achieve good replay accuracy.
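
The sketch below illustrates the two sampling knobs. The Bernoulli I/O sampling and random node selection shown here are assumptions made for illustration, not necessarily the exact policies //TRACE uses.

\begin{verbatim}
import random

def should_delay_io(io_sampling_rate):
    # Experiment 3: during tracing, delay (and thus measure) only a
    # random fraction of the I/Os; think times for the remainder are
    # inferred from the sampled ones.
    return random.random() < io_sampling_rate

def nodes_to_throttle(nodes, node_fraction):
    # Experiment 4: throttle only a subset of the compute nodes and
    # let the rest run at full speed during tracing.
    k = max(1, int(len(nodes) * node_fraction))
    return random.sample(nodes, k)
\end{verbatim}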

For all experiments, the traces used during replay are obtained from a storage system other than the one being evaluated. In other words, if storage system A is being evaluated, then the traces used for replay were collected on either storage system B or C. We report the worst case, i.e., the error of the trace that produced the greatest replay error.

In all tests, running time is used to determine replay accuracy, and percent error is the evaluation metric. The reported errors are averages over at least three runs. More specifically, percent error is calculated as follows:

\begin{displaymath}
\frac{\mathit{ApplicationTime} - \mathit{ReplayTime}}{\mathit{ApplicationTime}} \times 100
\end{displaymath}
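
As a worked instance of the metric, a 100-second application whose replays take 97, 98, and 99 seconds has an average error of 2%:

\begin{verbatim}
def percent_error(application_time, replay_time):
    # The formula above, for a single run.
    return (application_time - replay_time) / application_time * 100

def mean_percent_error(application_time, replay_times):
    # Reported error: average over repeated runs (at least three).
    errors = [percent_error(application_time, t) for t in replay_times]
    return sum(errors) / len(errors)

print(mean_percent_error(100.0, [97.0, 98.0, 99.0]))  # -> 2.0
\end{verbatim}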

Average bandwidth and throughput are not reported, as these are simply functions of the running time.


