Scalable Error Isolation for Distributed Systems

Authors: 

Diogo Behrens, Technische Universität Dresden; Marco Serafini, Qatar Computing Research Institute; Sergei Arnautov, Technische Universität Dresden; Flavio P. Junqueira, Microsoft Research Cambridge; Christof Fetzer, Technische Universität Dresden

Abstract: 

In distributed systems, data corruption on a single node can propagate to other nodes in the system and cause severe outages. The probability of data corruption is already non-negligible today in large computer populations (e.g., in large datacenters). The resilience of processors is expected to decline in the near future, making it necessary to devise cost-effective software approaches to deal with data corruption.

In this paper, we present SEI, an algorithm that tolerates Arbitrary State Corruption (ASC) faults and prevents data corruption from propagating across a distributed system. SEI scales in three dimensions: memory, number of processing threads, and development effort. To evaluate development effort, fault coverage, and performance with our library, we hardened two real-world applications: a DNS resolver and memcached. Hardening these applications required minimal changes to the existing code base, and the performance overhead is negligible in the case of applications that are not CPU-intensive, such as memcached. The memory overhead is negligible independent of the application when using ECC memory. Finally, SEI covers faults effectively: it detected all hardware-injected errors and reduced undetected errors from 44% down to only 0.15% of the software-injected computation errors in our experiments.

BibTeX
@inproceedings {189030,
author = {Diogo Behrens and Marco Serafini and Flavio P. Junqueira and Sergei Arnautov and Christof Fetzer},
title = {Scalable Error Isolation for Distributed Systems},
booktitle = {12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15)},
year = {2015},
isbn = {978-1-931971-21-8},
address = {Oakland, CA},
pages = {605--620},
url = {https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/behrens},
publisher = {USENIX Association},
month = may,
}