Scalable Error Isolation for Distributed Systems
Diogo Behrens, Technische Universität Dresden; Marco Serafini, Qatar Computing Research Institute; Sergei Arnautov, Technische Universität Dresden; Flavio P. Junqueira, Microsoft Research Cambridge; Christof Fetzer, Technische Universität Dresden
In distributed systems, data corruption on a single node can propagate to other nodes in the system and cause severe outages. The probability of data corruption is already non-negligible today in large computer populations (e.g., in large datacenters). The resilience of processors is expected to decline in the near future, making it necessary to devise cost-effective software approaches to deal with data corruption.
In this paper, we present SEI, an algorithm that tolerates Arbitrary State Corruption (ASC) faults and prevents data corruption from propagating across a distributed system. SEI scales in three dimensions: memory, number of processing threads, and development effort. To evaluate development effort, fault coverage, and performance with our library, we hardened two real-world applications: a DNS resolver and memcached. Hardening these applications required minimal changes to the existing code base, and the performance overhead is negligible in the case of applications that are not CPU-intensive, such as memcached. The memory overhead is negligible independent of the application when using ECC memory. Finally, SEI covers faults effectively: it detected all hardware-injected errors and reduced undetected errors from 44% down to only 0.15% of the software-injected computation errors in our experiments.
author = {Diogo Behrens and Marco Serafini and Flavio P. Junqueira and Sergei Arnautov and Christof Fetzer},
title = {Scalable Error Isolation for Distributed Systems},
booktitle = {12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15)},
year = {2015},
isbn = {978-1-931971-218},
address = {Oakland, CA},
pages = {605--620},
url = {https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/behrens},
publisher = {USENIX Association},
month = may
}