Transparent Fault Tolerance for Parallel Applications on Networks of Workstations

Daniel J. Scales, Digital Equipment Corporation, Western Research Laboratory; Monica S. Lam, Stanford University

This paper describes a new method for providing transparent fault tolerance for parallel applications on a network of workstations. We have designed our method in the context of a shared object system called SAM, a portable run-time system which provides a global name space and automatic caching of shared data. SAM incorporates a novel design intended to address the problem of high communication overheads in distributed memory environments and is implemented on a variety of distributed memory platforms. Our fundamental approach to providing fault tolerance is to ensure the replication of all data on more than one workstation using the dynamic caching already provided by SAM. The replicated data is accessible to the local processor like other cached data, making access to shared data faster and potentially offsetting some of the fault-tolerance overhead. In addition, our method uses information available in SAM applications on how processes access shared data to enable several optimizations which reduce the fault-tolerance overhead. We have built an implementation of our fault-tolerance method in SAM for heterogeneous networks of workstations running PVM3. In this paper, we present our fault-tolerance method and describe its implementation in detail. We give performance results and overhead numbers for several large SAM applications running on a cluster of Alpha workstations connected by an ATM network. Our method is successful in providing transparent fault tolerance for parallel applications running on a network of workstations and is unique in requiring no global synchronizations and no disk operations to a reliable file server.
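
The core idea in the abstract, keeping every shared object cached on more than one workstation so that a single failure never destroys the last copy, can be illustrated with a toy model. The sketch below is not SAM's API or the authors' implementation: the names `Node`, `Cluster`, `publish`, `read`, and `fail` are hypothetical, the replication factor of 2 is an assumption, and the model ignores consistency, process recovery, and the PVM3 messaging layer entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One workstation with a local cache of shared objects (hypothetical model)."""
    name: str
    cache: dict = field(default_factory=dict)  # object id -> value
    alive: bool = True


class Cluster:
    """Toy cluster that keeps each shared object cached on at least `copies` live nodes."""

    def __init__(self, names):
        self.nodes = {n: Node(n) for n in names}

    def live(self):
        return [n for n in self.nodes.values() if n.alive]

    def holders(self, key):
        """Live nodes that currently cache `key`."""
        return [n for n in self.live() if key in n.cache]

    def _ensure_copies(self, key, copies):
        # Copy the value into further live caches until enough nodes hold it.
        holders = self.holders(key)
        if not holders:
            raise KeyError(f"all copies of {key!r} were lost")
        value = holders[0].cache[key]
        for node in self.live():
            if len(holders) >= copies:
                break
            if key not in node.cache:
                node.cache[key] = value
                holders.append(node)

    def publish(self, owner, key, value, copies=2):
        """Create a shared object on `owner` and replicate it into other caches."""
        self.nodes[owner].cache[key] = value
        self._ensure_copies(key, copies)

    def read(self, reader, key):
        """Read through the local cache; on a miss, fetch from any live holder."""
        node = self.nodes[reader]
        if key not in node.cache:
            holders = self.holders(key)
            if not holders:
                raise KeyError(f"all copies of {key!r} were lost")
            node.cache[key] = holders[0].cache[key]
        return node.cache[key]

    def fail(self, name, copies=2):
        """Simulate a workstation crash, then restore the replication level."""
        node = self.nodes[name]
        node.alive = False
        lost = list(node.cache)
        node.cache.clear()
        for key in lost:
            self._ensure_copies(key, copies)


cluster = Cluster(["a", "b", "c"])
cluster.publish("a", "x", 42)
cluster.fail("a")                 # the creator of "x" crashes
value = cluster.read("c", "x")    # the object is still available from a replica
```

In this model the replica doubles as an ordinary cache entry, so a node holding a backup copy can read it locally, which mirrors the abstract's point that replication can offset part of its own overhead.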

BibTeX
@inproceedings {260510,
author = {Daniel J. Scales and Monica S. Lam},
title = {Transparent Fault Tolerance for Parallel Applications on Networks of Workstations},
booktitle = {USENIX 1996 Annual Technical Conference (USENIX ATC 96)},
year = {1996},
address = {San Diego, CA},
url = {https://www.usenix.org/conference/usenix-1996-annual-technical-conference/transparent-fault-tolerance-parallel-applications},
publisher = {USENIX Association},
month = jan
}

Paper: http://usenix.org/publications/library/proceedings/sd96/full_papers/scales.ps