
Experimental Environment

Our SMFS prototype is implemented as a user-mode application written in Java. SMFS borrows heavily from our earlier file system, Antiquity [39]; however, the log address was changed to a segment identifier and an offset into that segment. A hash of each block can optionally be computed, but it serves as a checksum rather than as part of the block's address in the log. We focus our evaluation on the append() operation, since that is by far the dominant operation mirrored between the two sites.
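
To make the log-addressing scheme concrete, the sketch below shows one way a segment-plus-offset address and an append() path could be expressed in Java. The class and method names (LogAddress, Segment, checksum) and the choice of SHA-1 are illustrative assumptions, not the actual SMFS interfaces.

    // Illustrative sketch: a block is named by a segment identifier plus an
    // offset into that segment; a hash is kept only as an integrity checksum,
    // not as part of the address. SHA-1 is a placeholder choice.
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    final class LogAddress {
        final long segmentId;  // which append-only segment holds the block
        final int offset;      // byte offset of the block within the segment

        LogAddress(long segmentId, int offset) {
            this.segmentId = segmentId;
            this.offset = offset;
        }
    }

    final class Segment {
        private final long id;
        private final byte[] buf;
        private int tail = 0;  // next free offset in this segment

        Segment(long id, int capacity) {
            this.id = id;
            this.buf = new byte[capacity];
        }

        // append() returns the address of the newly written block.
        LogAddress append(byte[] block) {
            LogAddress addr = new LogAddress(id, tail);
            System.arraycopy(block, 0, buf, tail, block.length);
            tail += block.length;
            return addr;
        }

        // Optional checksum over a block, verified on read.
        static byte[] checksum(byte[] block) {
            try {
                return MessageDigest.getInstance("SHA-1").digest(block);
            } catch (NoSuchAlgorithmException e) {
                throw new AssertionError(e);
            }
        }
    }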

SMFS uses the Maelstrom network appliance [10] as the implementation of the network-sync option. Maelstrom can run as a user-mode module, but for the experiments reported here it runs in the kernel as a Linux 2.6.20 module with hooks into the kernel packet filter [2]. Packets destined for the opposite site are routed through a pair of Maelstrom appliances, one located at each site. More importantly, situating a network appliance at the egress and ingress router of each site creates a virtual link between the two sites, which presents many opportunities for increasing mirroring reliability and performance.

The Maelstrom egress router captures outgoing packets and processes them to create redundant packets. The original IP packets are forwarded unaltered; the redundant packets are then sent to the ingress router over a UDP channel. The ingress router captures and stores a window consisting of the last $K$ IP packets it has seen. Upon receiving a redundant packet, it checks the packet against these last $K$ IP packets; if a lost IP packet can be recovered, it does so and forwards the newly recovered packet through a raw socket to the intended destination. Note that each appliance operates in both egress and ingress mode, since traffic is duplex.
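
As a rough illustration of the recovery step just described, the following Java fragment sketches XOR-based recovery over a window of recently seen packets. It assumes each redundant packet carries the XOR of a small group of original packets along with their identifiers, and that payloads are padded to a common length; the actual Maelstrom module runs in the kernel and its encoding differs in detail [10].

    // A minimal sketch of recovery at the ingress appliance (not the actual
    // Maelstrom code): a redundant packet is assumed to carry the XOR of a
    // small group of original packets plus their identifiers, and payloads
    // are assumed padded to a common length.
    import java.util.HashMap;
    import java.util.Map;

    final class IngressRecovery {
        // Window of the last K packets seen, keyed by packet identifier.
        // Eviction of entries older than K is omitted for brevity.
        private final Map<Long, byte[]> window = new HashMap<>();

        void onDataPacket(long id, byte[] payload) {
            window.put(id, payload);
        }

        // Returns a recovered packet to forward through a raw socket, or null
        // if no packet is missing or too many are missing to recover.
        byte[] onRedundantPacket(long[] groupIds, byte[] xorOfGroup) {
            byte[] candidate = xorOfGroup.clone();
            long missingId = -1;
            int missing = 0;
            for (long id : groupIds) {
                byte[] p = window.get(id);
                if (p == null) {
                    missing++;
                    missingId = id;
                } else {
                    for (int i = 0; i < candidate.length; i++) {
                        candidate[i] ^= p[i];
                    }
                }
            }
            if (missing != 1) {
                return null;
            }
            window.put(missingId, candidate);
            return candidate;
        }
    }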

To implement network-sync redundancy feedback, the Maelstrom kernel module tracks each TCP flow and sends an acknowledgment to the sender. Each acknowledgment carries the byte offset, measured from the beginning of the stream, of the most recent byte covered by an error-correcting packet sent to the ingress router.
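
The fragment below sketches, again in user-mode Java rather than in the kernel, what the per-flow bookkeeping behind this feedback might look like: for each tracked flow the appliance records the highest stream offset covered by an error-correcting packet and reports that offset back to the sender. The flow key and acknowledgment format are assumptions for illustration.

    // Illustrative per-flow bookkeeping for network-sync feedback; the real
    // module does this inside the Linux kernel.
    import java.util.HashMap;
    import java.util.Map;

    final class NetworkSyncFeedback {
        // Highest byte offset (from the start of the stream) already covered
        // by an error-correcting packet, per TCP flow.
        private final Map<String, Long> coveredOffset = new HashMap<>();

        // Called when a redundant packet covering the flow's bytes up to
        // streamOffset has been sent toward the remote ingress appliance.
        void onRedundancySent(String flowKey, long streamOffset) {
            coveredOffset.merge(flowKey, streamOffset, Long::max);
        }

        // Value placed in the acknowledgment sent back to the local sender:
        // data up to this offset is now protected by redundancy in flight.
        long feedbackAck(String flowKey) {
            return coveredOffset.getOrDefault(flowKey, 0L);
        }
    }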

We used TCP with the Reno congestion control algorithm for communication between the mirrored storage systems in all experiments. We also experimented with other congestion control algorithms such as CUBIC; the results were nearly identical, since we were measuring packets lost after a primary site failure due to a disaster.

We tested the setup on Emulab [40]; our topology emulates two clusters of eight machines each, separated by a high-capacity wide-area link with 1 Gbps of bandwidth and a 50 to 200 ms RTT. Each machine has one 3.0 GHz 64-bit Pentium Xeon processor, 2.0 GB of memory, and a 146 GB disk. Nodes are connected locally via a gigabit Ethernet switch. We apply load to these deployments using up to 64 testers located in the same cluster as the primary; a single tester is an individual application with only one outstanding request at a time. Figure 3 shows the topology of our Emulab experimental setup (with the difference that we used eight nodes per cluster, not four). Throughout all subsequent experiments, link loss is random and independent and identically distributed; see Balakrishnan et al. [10] for an analysis with bursty link loss. Finally, all experiments report the average and standard deviation over five runs.

The overall SMFS prototype is fast enough to saturate a gigabit wide-area link, so our decision to work with a user-mode Java implementation has little bearing on the experiments we now report: even if SMFS were implemented in C in the kernel, the network would still be the bottleneck.

