
Experimental setup

Three parallel applications are used in the evaluation: Pseudo, Fitness, and Quake. All three use MPI [20] for synchronization (none use MPI-IO).

Pseudo is a pseudo-application from Los Alamos National Laboratory [28]. It simulates the defensive checkpointing process of a large-scale computation: MPI processes write a checkpoint file (with interleaved access), synchronize, and then read the file back. Optional flags specify whether nodes also synchronize after every write I/O and whether there is computation on the data between read I/Os. Three versions of the pseudo-application are evaluated: one without any flags specified (Pseudo), one with barrier synchronization (PseudoSync), and one with both synchronization and computation (PseudoSyncData).
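To make this access pattern concrete, the following is a minimal sketch of Pseudo's checkpoint loop in C with MPI. It is not LANL's actual code: the file name, chunk size, and chunk count are illustrative, and the optional flags are indicated in comments.

    /* Minimal sketch of Pseudo's checkpoint pattern. Each of the n
     * processes writes interleaved chunks of a shared file, all
     * processes synchronize, and each then reads its data back. */
    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define CHUNK   (1 << 20)   /* 1 MB per write (assumed size) */
    #define NCHUNKS 16          /* chunks per process (assumed count) */

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        char *buf = malloc(CHUNK);
        memset(buf, rank, CHUNK);
        int fd = open("checkpoint.dat", O_RDWR | O_CREAT, 0644);

        /* Write phase: chunk i of rank r lands at offset
         * (i*size + r)*CHUNK, interleaving the processes' data
         * within the shared file. */
        for (int i = 0; i < NCHUNKS; i++) {
            pwrite(fd, buf, CHUNK, ((off_t)i * size + rank) * CHUNK);
            /* PseudoSync adds a barrier after every write:
             *   MPI_Barrier(MPI_COMM_WORLD); */
        }

        MPI_Barrier(MPI_COMM_WORLD);   /* synchronize before read-back */

        /* Read phase: with PseudoSyncData, computation on buf would
         * occur between these reads. */
        for (int i = 0; i < NCHUNKS; i++)
            pread(fd, buf, CHUNK, ((off_t)i * size + rank) * CHUNK);

        close(fd);
        free(buf);
        MPI_Finalize();
        return 0;
    }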

Fitness is a parallel workload generator from Intel [22]. The generator is configured so that $n$ MPI processes read non-overlapping portions of a file in turn: node 0 reads its portion, then node 1, and so on. There are only $n-1$ data dependencies: node 0 signaling node 1, node 1 signaling node 2, etc. This test illustrates a case where nodes do not proceed strictly in parallel, but rather have an ordering that must be respected.
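A minimal sketch of this pipelined pattern, again in C with MPI, is shown below; it is not Intel's Fitness code, and the file name and region size are assumptions. The $n-1$ dependencies appear as point-to-point messages between neighboring ranks.

    /* Each process waits for a token from its predecessor, reads its
     * non-overlapping region of the file, and then signals its
     * successor; rank 0 starts immediately. */
    #include <mpi.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define REGION (64 << 20)   /* 64 MB per process (assumed size) */

    int main(int argc, char **argv)
    {
        int rank, size, token = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank > 0)   /* wait for the predecessor's signal */
            MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        char *buf = malloc(REGION);
        int fd = open("shared.dat", O_RDONLY);        /* assumed name */
        pread(fd, buf, REGION, (off_t)rank * REGION); /* own slice only */
        close(fd);
        free(buf);

        if (rank < size - 1)   /* signal the successor to begin */
            MPI_Send(&token, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }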

Quake is a parallel application developed at Carnegie Mellon University for simulating earthquakes [2]. It uses the finite element method to solve a set of partial differential equations that describe how seismic waves travel through the Earth (modeled as a mesh). The execution is divided into three phases. Phase 1 builds a multi-resolution mesh to model the region of ground under evaluation. The model, represented as an etree [47], is an on-disk database structure; the portion of the database accessed by each node depends on the region of ground assigned to that node. Phase 2 writes the mesh structure to disk; node 0 collects the mesh data from all other nodes and performs the write. Phase 3 solves the equations to propagate the waves through time; computation is interleaved with the I/O, and the state of the simulated region is periodically written to disk by all nodes. Quake runs on a parallel file system (PVFS2 [10]), which is mounted on the storage system under evaluation.
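As a rough illustration of phase 3's structure, the schematic below interleaves a stand-in computation with periodic state output by every node. It is only a sketch of the control flow described above; the timestep kernel, output interval, and per-node file naming are hypothetical, not Quake's actual FEM solver.

    /* Each timestep advances the simulated wavefield; every
     * OUT_INTERVAL steps, each node writes its portion of the
     * simulation state to disk (over PVFS2 in the real setup). */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NSTEPS       100        /* assumed */
    #define OUT_INTERVAL 10         /* assumed */
    #define NLOCAL       (1 << 16)  /* per-node mesh points (assumed) */

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double *state = calloc(NLOCAL, sizeof(double));
        char name[64];
        snprintf(name, sizeof(name), "state.%d.out", rank);
        FILE *out = fopen(name, "w");

        for (int step = 0; step < NSTEPS; step++) {
            for (int i = 0; i < NLOCAL; i++)   /* stand-in for the */
                state[i] += 1e-3 * (rank + 1); /* FEM timestep     */

            if (step % OUT_INTERVAL == 0)      /* periodic output  */
                fwrite(state, sizeof(double), NLOCAL, out);
        }

        fclose(out);
        free(state);
        MPI_Finalize();
        return 0;
    }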

The applications are traced and replayed on three storage systems. The storage systems are iSCSI [38] RAID arrays with different RAID levels and varying amounts of disk and cache space. Specifically, VendorA is a 14-disk (400 GB 7200 RPM Hitachi Deskstar SATA) RAID-50 array with 1 GB of RAM; VendorB is a 6-disk (250 GB 7200 RPM Seagate Barracuda SATA) RAID-0 array with 512 MB of RAM; and VendorC is an 8-disk (250 GB 7200 RPM Seagate Barracuda SATA) RAID-10 array with 512 MB of RAM.

The applications and replayer are run on dedicated compute clusters. Pseudo and Fitness run on Dell PowerEdge 650s (2.67 GHz Pentium 4, 1 GB RAM, GbE, Linux 2.6.12); Fitness is configured for 4 nodes and Pseudo for 8. Quake runs on a cluster of Supermicro servers (3.0 GHz dual Pentium 4 Xeon, 2.0 GB RAM, GbE, Linux 2.6.12) and is configured for 8 nodes. The local disk is used only to store the trace files and the operating system. Pseudo and Fitness access the arrays in raw mode: each machine in the cluster connects to the same array using an open-source iSCSI driver [22]. For Quake, each node runs PVFS2 and connects to the same PVFS2 server, which in turn connects to one of the storage arrays via iSCSI.
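For concreteness, raw-mode access amounts to opening the iSCSI LUN as a local block device and issuing reads and writes at absolute offsets, with no file system in between. The sketch below shows the idea; the device path is hypothetical, and the use of O_DIRECT (which bypasses the page cache and requires aligned buffers) is an assumption rather than something the setup specifies.

    #define _GNU_SOURCE   /* for O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        /* The iSCSI array appears as a local SCSI disk; path assumed. */
        int fd = open("/dev/sdb", O_RDWR | O_DIRECT);
        if (fd < 0)
            return 1;

        void *buf;
        posix_memalign(&buf, 512, 1 << 20);  /* sector-aligned buffer */
        pread(fd, buf, 1 << 20, 0);          /* 1 MB read from the array */

        free(buf);
        close(fd);
        return 0;
    }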

