We present the design and initial evaluation of a resilient operating system architecture that leverages hardware architectures combining a few resilient CPU cores with many non-resilient ones. To this end, we build our system around a Reliable Computing Base (RCB) consisting of those software components that must work for reliable operation, and run the RCB on the resilient cores. The remainder of the system runs replicated on the unreliable cores. Our system’s RCB consists of an L4 microkernel, a runtime environment, and a replication manager. In this paper we state and justify assumptions about the hardware architecture, motivate the corresponding software architecture, and evaluate communication mechanisms between the RCB and the replicas.
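As a hedged illustration of what the replication manager running on the resilient cores might do (the paper does not give this code; the function and record shapes below are invented for the sketch), one core duty is to compare the outputs of replicas running on unreliable cores and vote out a disagreeing minority:

```python
# Illustrative majority-voting sketch for a replication manager.
# All names here are hypothetical; the paper's RCB comprises an L4
# microkernel, a runtime environment, and a replication manager.
from collections import Counter

def vote(replica_outputs):
    """Return (majority output, indices of disagreeing replicas)."""
    majority, _ = Counter(replica_outputs).most_common(1)[0]
    faulty = [i for i, out in enumerate(replica_outputs) if out != majority]
    return majority, faulty

# Three replicas of the same computation; replica 2 returned a bad value.
result, faulty = vote([0xACCE55, 0xACCE55, 0xBAD])
```

In the paper's setting the interesting part is precisely the communication path this comparison requires between the unreliable replicas and the RCB.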
Adam J. Oliner, Anand Iyer, University of California, Berkeley; Eemil Lagerspetz and Sasu Tarkoma, University of Helsinki; Ion Stoica, University of California, Berkeley
We aim to detect and diagnose code misbehavior that wastes energy, which we call energy bugs. This paper describes a method and implementation, called Carat, for performing such diagnosis on mobile devices. Carat takes a collaborative, black-box approach. A non-invasive client app sends intermittent, coarse-grained measurements to a server, which identifies correlations between higher expected energy use and client properties like the running apps, device model, and operating system. Carat successfully detected all energy bugs in a controlled experiment and, during a deployment to 883 users, identified 5434 instances of apps exhibiting buggy behavior in the wild.
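The correlation step can be sketched roughly as follows (a hypothetical simplification, not Carat's actual implementation, which accounts for many more client properties such as device model and OS version): compare the average battery drain rate observed while an app is running against the rate when it is not.

```python
# Hypothetical sketch of Carat's correlation idea: flag apps whose
# presence correlates with markedly higher energy drain. The sample
# format and threshold are invented for illustration.
from collections import defaultdict

def flag_energy_hogs(samples, threshold=1.5):
    """samples: list of (set of running apps, drain rate in %/hour)."""
    with_app = defaultdict(list)
    without_app = defaultdict(list)
    all_apps = {app for apps, _ in samples for app in apps}
    for apps, rate in samples:
        for app in all_apps:
            (with_app if app in apps else without_app)[app].append(rate)
    hogs = []
    for app in all_apps:
        if with_app[app] and without_app[app]:
            avg_with = sum(with_app[app]) / len(with_app[app])
            avg_without = sum(without_app[app]) / len(without_app[app])
            if avg_with > threshold * avg_without:
                hogs.append(app)
    return sorted(hogs)

samples = [
    ({"maps", "mail"}, 9.0),
    ({"mail"}, 3.0),
    ({"maps"}, 8.5),
    ({"browser"}, 3.2),
]
hogs = flag_energy_hogs(samples)
```

The black-box appeal of this style of analysis is that it needs only coarse-grained, intermittent samples from clients, never app instrumentation.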
Ingo Weber and Hiroshi Wada, NICTA and University of New South Wales; Alan Fekete, NICTA and University of Sydney; Anna Liu and Len Bass, NICTA and University of New South Wales
The facility to roll back a collection of changes, i.e., reverting to a previous acceptable state, a checkpoint, is widely recognised as valuable support for dependability [3, 5, 10]. This paper considers the particular needs of users of cloud computing resources who wish to manage those resources. Cloud computing provides infrastructure that is programmatically managed through a fixed set of simple system administration commands. For instance, creating and configuring a virtualized Web server on Amazon Web Services (AWS) can be done with a few calls to operations that are offered through the AWS management API. This improves the efficiency of system operations; but having simple yet powerful system operations may increase the chances of human-induced faults, which play a large role in overall dependability [24, 25]. Catastrophic errors, like deleting a disk volume in a production environment, can happen easily with a few wrong API calls.
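One common way to support rollback over a fixed set of administration operations is an undo log of compensating actions, sketched below under stated assumptions: the class and the in-memory "volume" state are invented for illustration and stand in for real management API calls.

```python
# Illustrative undo-log sketch (not the authors' system): each forward
# administration operation records a compensating undo action; rolling
# back to a checkpoint replays the undo log in reverse order.
class UndoLog:
    def __init__(self):
        self.log = []

    def record(self, description, undo_fn):
        self.log.append((description, undo_fn))

    def checkpoint(self):
        return len(self.log)

    def rollback(self, mark):
        undone = []
        while len(self.log) > mark:
            description, undo_fn = self.log.pop()
            undo_fn()  # compensate the forward operation
            undone.append(description)
        return undone

# Stand-in for cloud state; a real system would issue API calls instead.
volumes = {"vol-1"}
log = UndoLog()
mark = log.checkpoint()
volumes.add("vol-2")  # forward operation: create a volume
log.record("delete vol-2", lambda: volumes.discard("vol-2"))
undone = log.rollback(mark)  # revert to the checkpoint
```

Note that real cloud operations are not always perfectly compensable (e.g., a deleted volume's contents are gone), which is exactly why checkpoint support at this layer is a research problem rather than a bookkeeping exercise.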
Bowen Zhou, Milind Kulkarni, and Saurabh Bagchi, Purdue University
A key challenge in developing large-scale applications (both in system size and in input size) is finding bugs that are latent at the small scales of testing, only manifesting when a program is deployed at large scales. Traditional statistical techniques fail because no error-free run is available at deployment scales for training purposes. Prior work used scaling models to detect anomalous behavior at large scales without being trained on correct behavior at that scale. However, that work cannot localize bugs automatically. In this paper, we extend that work with an automatic diagnosis technique, based on feature reconstruction, and validate our design through case studies with two real bugs from an MPI library and a DHT-based file sharing application.
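The scale-extrapolation idea can be sketched in a few lines (a deliberate simplification with invented feature names; the paper's models and reconstruction-based diagnosis are more sophisticated): fit each program feature against run scale using small-scale training runs, then rank features by how far their large-scale values deviate from the model's prediction.

```python
# Simplified sketch of anomaly ranking via scaling models. Feature
# names and the linear model are illustrative assumptions only.
def fit_line(xs, ys):
    """Least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

def rank_anomalous_features(train_scales, train_features, scale, observed):
    """train_features: {name: [value per small-scale run]};
    observed: {name: value at deployment scale}."""
    errors = {}
    for name, ys in train_features.items():
        slope, intercept = fit_line(train_scales, ys)
        predicted = slope * scale + intercept
        errors[name] = abs(observed[name] - predicted) / max(abs(predicted), 1e-9)
    return sorted(errors, key=errors.get, reverse=True)

ranking = rank_anomalous_features(
    [4, 8, 16],                                   # training scales
    {"loop_count": [40, 80, 160], "msgs_sent": [8, 16, 32]},
    1024,                                         # deployment scale
    {"loop_count": 10240, "msgs_sent": 2100},     # msgs_sent drifts at scale
)
```

The key property, matching the abstract, is that no correct large-scale run is ever needed: the model trained at small scales supplies the expected behavior.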
Reinhard Tartler, Friedrich-Alexander University Erlangen-Nuremberg; Anil Kurmus, IBM Research—Zurich; Bernhard Heinloth, Valentin Rothberg, and Andreas Ruprecht, Friedrich-Alexander University Erlangen-Nuremberg; Daniela Dorneanu, IBM Research—Zurich; Rüdiger Kapitza, TU Braunschweig; Wolfgang Schröder-Preikschat and Daniel Lohmann, Friedrich-Alexander University Erlangen-Nuremberg
The Linux kernel can be a threat to the dependability of systems because of its sheer size. A simple approach to produce smaller kernels is to manually configure the Linux kernel. However, the more than 11,000 configuration options available in recent Linux versions render this a demanding task. We report on designing and implementing the first automated generation of a workload-tailored kernel configuration and discuss the security gains such an approach offers in terms of reduction of the Trusted Computing Base (TCB) size. Our results show that the approach prevents the inclusion of 10% of functions known to be vulnerable in the past.
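The core of workload tailoring can be sketched as a set computation (a hedged illustration only: the function-to-option mapping below is invented, whereas the real system derives it from the kernel build system together with a workload trace): keep only the options whose code the workload actually exercised.

```python
# Hypothetical sketch of workload-tailored kernel configuration.
# The mapping and option names are illustrative assumptions.
def tailor_config(traced_functions, option_of_function, baseline_options):
    """Keep only baseline options whose code the trace exercised."""
    needed = {option_of_function[f] for f in traced_functions
              if f in option_of_function}
    return sorted(opt for opt in baseline_options if opt in needed)

option_of_function = {
    "ext4_readdir": "CONFIG_EXT4_FS",
    "tcp_sendmsg": "CONFIG_INET",
    "btrfs_lookup": "CONFIG_BTRFS_FS",
}
baseline = ["CONFIG_EXT4_FS", "CONFIG_INET", "CONFIG_BTRFS_FS"]
# The workload never touched Btrfs, so that option is dropped.
enabled = tailor_config({"ext4_readdir", "tcp_sendmsg"},
                        option_of_function, baseline)
```

The hard part in practice, which this sketch elides, is computing the function-to-option mapping across thousands of interdependent Kconfig options, and ensuring the resulting configuration still satisfies option dependencies.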
Nicolas Schiper, Vincent Rahli, Robbert Van Renesse, Mark Bickford, and Robert L. Constable, Cornell University
This paper describes ShadowDB, a replicated version of the BerkeleyDB database. ShadowDB implements a primary-backup replication protocol in which failure handling, the critical part of the protocol, is taken care of by a synthesized consensus service that is correct by construction. The service has been proven correct semi-automatically using the Nuprl proof assistant. We describe the design of the consensus protocol and the process of proving it correct, and present the database replication protocol. The performance of ShadowDB is good in the normal case, and recovering from a failure takes only seconds. Our approach offers simplified means to diversify the code in a way that preserves correctness.
Baris Kasikci, Cristian Zamfir, and George Candea, École Polytechnique Fédérale de Lausanne
Modern concurrent software is riddled with data races and these races constitute the source of many problems. Data races are hard to detect accurately before software is shipped and, once they cause failures in production, developers find it challenging to reproduce and debug them.
Ideally, all data races should be known before software ships. Static data race detectors are fast and have few false negatives, but unfortunately have many false positives. Conversely, dynamic data race detectors do not have false positives, but have many false negatives and incur high runtime overhead. There is no silver bullet and, as a result, modern software still ships with numerous data races.
We present CoRD, a collaborative distributed testing framework that aims to combine the best of the two approaches: CoRD first statically detects races and then dynamically validates them via crowdsourced executions of the program. Our initial results show that CoRD is more effective than static or dynamic detectors alone, and it introduces negligible runtime overhead.
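The static-then-dynamic pipeline can be sketched as a simple triage step (an illustrative simplification, not CoRD's implementation; the race-record format is invented): the static pass yields candidate racy pairs, some of which are false positives, and crowdsourced executions report which candidates were actually observed executing concurrently without ordering.

```python
# Illustrative triage of statically detected race candidates against
# dynamic evidence from crowdsourced runs. Record shapes are invented.
def triage_races(static_candidates, crowd_reports):
    """static_candidates: set of (variable, access1, access2) tuples;
    crowd_reports: list of sets of tuples observed racing in real runs."""
    observed = set().union(*crowd_reports) if crowd_reports else set()
    confirmed = sorted(static_candidates & observed)   # true races
    unvalidated = sorted(static_candidates - observed)  # still suspect
    return confirmed, unvalidated

static_candidates = {("x", "T1:write", "T2:read"),
                     ("y", "T1:write", "T3:write")}
# Two user executions: one observed the race on x, one saw nothing.
crowd_reports = [{("x", "T1:write", "T2:read")}, set()]
confirmed, unvalidated = triage_races(static_candidates, crowd_reports)
```

The division of labor mirrors the abstract: the static pass supplies completeness, while cheap dynamic validation across many users supplies precision at negligible per-client overhead.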
Muntasir Raihan Rahman, HP Labs, Palo Alto and University of Illinois at Urbana Champaign; Wojciech Golab, Alvin AuYoung, Kimberly Keeton, and Jay J. Wylie, HP Labs, Palo Alto
Large-scale key-value storage systems sacrifice consistency in the interest of dependability (i.e., partition tolerance and availability), as well as performance (i.e., latency). Such systems provide eventual consistency, which—to this point—has been difficult to quantify in real systems. Given the many implementations and deployments of eventually-consistent systems (e.g., NoSQL systems), attempts have been made to measure this consistency empirically, but they suffer from important drawbacks. For example, state-of-the-art consistency benchmarks exercise the system only in restricted ways and disrupt the workload, which limits their accuracy.
In this paper, we take the position that a consistency benchmark should paint a comprehensive picture of the relationship between the storage system under consideration, the workload, the pattern of failures, and the consistency observed by clients. To illustrate our point, we first survey prior efforts to quantify eventual consistency. We then present a benchmarking technique that overcomes the shortcomings of existing techniques to measure the consistency observed by clients as they execute the workload under consideration. This method is versatile and minimally disruptive to the system under test. As a proof of concept, we demonstrate this tool on Cassandra.
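To make the notion of client-observed consistency concrete, a minimal staleness measurement for a single key might look as follows (a hedged sketch under invented record formats, not the authors' benchmark): for each read, report how long the returned value had already been overwritten at the moment of the read.

```python
# Illustrative staleness computation for one key. A real benchmark
# must reconcile clocks and interleave this with the live workload.
def read_staleness(writes, reads):
    """writes: [(time, value)] sorted by time; reads: [(time, value)].
    Returns, per read, how long its value had been stale (0 = current)."""
    results = []
    for r_time, r_value in reads:
        overwritten_at = None
        for i, (w_time, w_value) in enumerate(writes):
            if w_value == r_value and i + 1 < len(writes):
                overwritten_at = writes[i + 1][0]  # next write wins
        if overwritten_at is not None and overwritten_at <= r_time:
            results.append(r_time - overwritten_at)
        else:
            results.append(0)
    return results

writes = [(0, "a"), (10, "b"), (20, "c")]
reads = [(25, "c"), (25, "a")]  # the second read returns a stale value
staleness = read_staleness(writes, reads)
```

The abstract's point is that collecting such measurements from the clients' own workload, rather than from a synthetic probe that perturbs the system, is what makes the benchmark both accurate and minimally disruptive.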
Takeshi Yoshimura, Keio University; Hiroshi Yamada and Kenji Kono, Keio University and CREST/JST
A Linux kernel oops is invoked when the kernel detects an erroneous state inside itself. It kills the offending process and allows Linux to continue its operation, albeit with compromised reliability. In this paper, we investigate how reliable Linux is after a kernel oops. To do so, we analyze the scope of error propagation through an experimental campaign of fault injection in Linux 2.6.38. The error propagation scope is process-local if an error is confined to the process context that activated it, while the scope is kernel-global if an error propagates to other processes’ contexts or global data structures. If the scope is process-local, Linux can be reliable even after a kernel oops. Our findings are twofold. First, the error propagation scope is mostly process-local. Thus, Linux remains consistent after a kernel oops in most cases. Second, when kernel-global errors occur, Linux stops its execution before accessing inconsistent states, because synchronization primitives prevent the inconsistent states from being accessed by other processes.
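The process-local versus kernel-global distinction amounts to a simple classification rule over each injection run's observations, sketched here with an invented record format (the paper's campaign extracts this from actual kernel state, not from such records):

```python
# Illustrative classifier for fault-injection outcomes. The record
# format is a hypothetical stand-in for real kernel observations.
def classify_scope(injection):
    """injection: {'faulting_pid': int, 'touched_by_pids': set of int,
                   'corrupted_global_data': bool}"""
    others = injection["touched_by_pids"] - {injection["faulting_pid"]}
    if injection["corrupted_global_data"] or others:
        return "kernel-global"
    return "process-local"

runs = [
    {"faulting_pid": 42, "touched_by_pids": {42},
     "corrupted_global_data": False},   # confined to faulting process
    {"faulting_pid": 42, "touched_by_pids": {42, 7},
     "corrupted_global_data": False},   # another process saw the error
    {"faulting_pid": 42, "touched_by_pids": {42},
     "corrupted_global_data": True},    # global structure corrupted
]
scopes = [classify_scope(r) for r in runs]
```

The paper's reliability argument rests on most runs falling into the first category, with synchronization primitives halting execution before the other two categories can spread inconsistency.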
Wei-Chiu Chuang, Bo Sang, Charles Killian, and Milind Kulkarni, Purdue University
An attractive approach to leveraging the ability of cloud computing platforms to provide resources on demand is to build elastic applications, which can scale up or down based on resource requirements. To ease the development of elastic applications, it is useful for programmers to write applications with simple, inelastic semantics and rely on runtime systems to provide elasticity. While this approach has been useful in restricted domains, such as MapReduce, we argue in this paper that adding elasticity to general distributed applications introduces new fault-tolerance challenges that must be addressed at the programming-model and runtime level. We discuss a programming model for writing elastic, distributed applications, and describe an approach to fault tolerance that integrates with the elastic runtime to provide transparent elasticity and fault tolerance.