Next: Conclusion Up: Adaptive and Reliable Previous: Cilk-NOW macroscheduling

Related work

Cilk-NOW is unique in delivering adaptive and reliable execution for parallel programs on networks of workstations. Traditionally, systems such as PVM [41], TreadMarks [2], and others [11, 16, 23, 29] that are designed to support parallel programs on networks of workstations have not provided adaptive parallelism or fault tolerance. On the other hand, most systems that do provide support for adaptive execution or fault tolerance take a "process-centric" approach. That is, they provide an abstraction of mobile processes and/or an abstraction of reliable processes. As such these systems are very general in their potential application, but they do not provide much support for parallel programs. In contrast, Cilk-NOW does provide support for parallel programs and it does provide adaptive parallelism and fault tolerance, but it does so only for the Cilk parallel programming model. Such specificity allows the Cilk-NOW design to take an end-to-end approach [38] that leverages properties of the Cilk programming model in order to implement adaptive parallelism and fault tolerance simply and efficiently.

Distributed operating systems [17, 36, 43, 46] and remote execution facilities [18, 19, 31, 35, 47] provide services such as remote process execution and, in some cases, process migration. These systems are not intended to be parallel programming environments, though presumably a parallel programming environment could by built atop one of these systems. In fact, Orca [42], which has been built on top of Amoeba, is such a system. These systems are process-centric in that they adapt only by remotely executing and/or migrating processes.

A small number of parallel programming and runtime systems have been built that are adaptively parallel, but unlike Cilk-NOW, none are fault tolerant. Possibly the first adaptively parallel system is the Benevolent Bandit Laboratory (BBL) [22], and Cilk-NOW borrows some of its overall system architecture from BBL. The Piranha system [24, 27], which is based on the Linda programming model [12], is also adaptively parallel. (In fact, the authors of Piranha appear to have coined the term "adaptively parallel.") These systems support programming models that are quite different from Cilk's, but as with the Cilk-NOW design, both leverage properties of their programming model in order to implement adaptive parallelism. A runtime system for the programming language COOL [14] running on symmetric multiprocessors [44] and cache-coherent, distributed, shared-memory machines uses process control to support adaptive parallelism. This system relies on special-purpose operating system and hardware support. In contrast, Cilk-NOW supports adaptive parallelism entirely in user-level software on top of commercial hardware and operating systems. The Spawn system [45] supports concurrent applications with dynamic and adaptive resource management policies based on microeconomic principles. Unlike Cilk-NOW, none of these systems are fault tolerant.

A growing number of systems do provide fault tolerance, but unlike Cilk-NOW, none provide "application" fault tolerance in a high-level parallel programming environment. The Hive [15] distributed operating system provides "system" fault tolerance, meaning that a fault in one component does not bring down the entire system. Hive does not, however, provide "application" fault tolerance, meaning that with Hive, if an application is using a failed component, then the entire application crashes (unless the application itself has taken care to be fault tolerant). Application fault tolerance is provided by the Manetho system [21] via the technique of message logging. The Sam system [39] uses message logging to implement a fault tolerant distributed shared memory. The Sam implementation leverages properties of its shared-memory consistency model in order to avoid logging certain messages. Unlike Cilk-NOW, both Manetho and Sam are process-centric, as they both provide the abstraction of reliable processes, and neither is really a high-level parallel programming environment.

In comparing Cilk-NOW with these other process-centric systems, an interesting question to ask is, why not build the Cilk-NOW runtime system on top of one these other systems? After all, these systems already implement adaptive and/or fault-tolerant execution. The answer is performance. As we have seen, the overhead of adaptive parallelism and fault tolerance in Cilk-NOW is amortized against the overhead in Cilk's provably efficient scheduling algorithm. This amortization is only possible because all facets of the design are specialized with high-level knowledge of algorithmic structure in the Cilk programming model.

Next: Conclusion Up: Adaptive and Reliable Previous: Cilk-NOW macroscheduling

Robert D. Blumofe
Wed Nov 13 10:25:03 CST 1996