Next: I/O throttling Up: Background & motivation Previous: Trace replay models

Synchronization and the effect on I/O

A variety of synchronization mechanisms are in use today, including standard operating system mechanisms (signals, pipes, lock files, memory-mapped I/O) [35], message passing [20], shared memory [11], and remote procedure calls [44]. Some applications also use hybrid approaches [34] (e.g., shared memory together with message passing). Although many of these mechanisms can be traced with a conventional tracing tool (e.g., ltrace, strace, mpitrace), it is unclear how one could replay asynchronous communication (e.g., applications using select or poll) without a semantic understanding of the application. Such asynchronous operations are used extensively in parallel applications to overlap communication with computation. Further, some of these synchronization mechanisms (e.g., shared memory) cannot be traced with conventional tracing software.
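To make the difficulty with asynchronous communication concrete, the following minimal sketch uses Python's standard `select` and `socket` modules. The helper name `drain_ready`, the three local socket pairs, and the message payloads are hypothetical and chosen only for illustration; they are not taken from any traced application.

```python
import select
import socket


def drain_ready(receivers, timeout=1.0):
    """Read one message from each socket that becomes readable.

    Returns (index, payload) pairs in the order select reports them.
    Stops when every socket has delivered or when select times out.
    """
    events = []
    pending = set(range(len(receivers)))
    while pending:
        watch = [receivers[i] for i in sorted(pending)]
        ready, _, _ = select.select(watch, [], [], timeout)
        if not ready:
            break  # timeout: the remaining peers sent nothing
        for sock in ready:
            idx = receivers.index(sock)
            events.append((idx, sock.recv(64)))
            pending.discard(idx)
    return events


# Hypothetical setup: three local socket pairs stand in for three peers.
pairs = [socket.socketpair() for _ in range(3)]
senders = [a for a, _ in pairs]
receivers = [b for _, b in pairs]

# Peers 2 and 0 send; peer 1 stays silent.
senders[2].sendall(b"from-2")
senders[0].sendall(b"from-0")

events = drain_ready(receivers, timeout=0.2)

for a, b in pairs:
    a.close()
    b.close()
```

A system call trace of this program would show the `select` and `recv` calls and which descriptors were ready, but not why each peer sent when it did. That causal information lives in the application, which is why replaying such calls seems to require application-level knowledge.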

For these reasons, tracing and replaying synchronization calls is difficult: the variety of synchronization mechanisms and their semantics would need to be understood, determining causality for asynchronous messages would require application-level knowledge, and ``untraceable'' calls would not be easy to capture. Unfortunately, ignoring synchronization is not a viable option.

Consider Figure 2, which illustrates a hypothetical parallel application modifying a shared data structure; barriers [33] keep the nodes synchronized between stages. As the figure shows, I/O time accounts for only a fraction of the overall running time; the rest is compute time and synchronization (``wait'') time. Moreover, if the speed of the storage system changes, the time each node spends waiting on other nodes could also change. These effects must be modeled during replay.
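The dependence of wait time on storage speed can be sketched with a simple model of one barrier-delimited stage. This is an illustrative assumption, not the replay model of this paper: each node's busy time is taken to be its compute time plus its I/O time divided by a hypothetical `io_speedup` factor, and every node waits until the slowest node reaches the barrier. The per-node times below are made-up numbers.

```python
def stage_times(compute, io, io_speedup=1.0):
    """Return per-node (busy, wait) times for one barrier-delimited stage.

    compute and io are per-node times in arbitrary units; io_speedup
    scales the storage system (2.0 means I/O takes half as long).
    """
    busy = [c + i / io_speedup for c, i in zip(compute, io)]
    barrier = max(busy)  # the barrier releases when the slowest node arrives
    return [(b, barrier - b) for b in busy]


# Hypothetical per-node times for nodes 0, 1 and 2.
compute = [4.0, 2.0, 3.0]
io = [2.0, 6.0, 1.0]

baseline = stage_times(compute, io)
faster = stage_times(compute, io, io_speedup=2.0)

for node, ((_, w0), (_, w1)) in enumerate(zip(baseline, faster)):
    print(f"node {node}: wait {w0} -> {w1}")
```

With these numbers, doubling the storage speed leaves node 1 with no wait time in either case, while the wait times of nodes 0 and 2 shrink. A replayer that reissued only the I/O calls, with the original inter-arrival times, would miss this shift.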

Figure 2: A hypothetical parallel application. [Timeline figure; the caption is truncated in the source and ends: ``...sulting in changes in the synchronization time for nodes 0 and 2.'']


Michael Mesnier 2006-12-22