USENIX - USENIX'97 Conference Summaries

Adaptive and Reliable Parallel Computing on Networks of Workstations

By Robert D. Blumofe, University of Texas, Austin, and Philip A. Lisiecki, Massachusetts Institute of Technology

Summary by Gordon Galligher

The parallel system was built on the Cilk language, which is an algorithmic parallel language built on C with runtime features. It allows the multithreaded approaches by the use of the divide() call and the parallelism through the use of the spawn() call. The spawn() call simply may make something parallel, but it does not guarantee that it will be parallel. A sync() call is then used to wait for the parallel operations to come back together to process the rest in serial. One design approach is that a parent may synchronize with its children, but the siblings may not synchronize with each other.

The particular implementation of the Cilk language is called Cilk-NOW for Network of Workstations. This approach uses a "work-stealing" algorithm. This algorithm is implemented by having daemons running on processors that randomly pick other processors to attempt to steal any work that they have in their queue but have not yet started. It asks the randomly selected processor if there is on its stack a job that it has not yet had a chance to start. If the answer is yes, it tries to steal the least recently used job.

Bob started a demo of an image being rendered on a single laptop, and then the other three laptops started stealing the work. The demo had the picture separated into four component parts if a particular image area was large, and it then started on one part. The other workstations then checked to see if there was work to steal, and they would grab a quadrant. Once they successfully stole a quadrant, they also performed the same algorithmic separation if the picture they stole was large enough. This adequately demonstrated that all nodes could steal from the others.

There was also a pseudo-fault tolerance with work that was stolen. The primary work node remembers which node stole which part of the process. If it does not hear back from that node in a timely fashion, it will resume ownership of that part, which makes it a candidate for stealing by another node. This was impressively demonstrated by turning off one of the nodes while it was in the middle of the rendering. Eventually, the main node took back the job and partitioned it, with the remaining two nodes then stealing the other partitions to finish it up. It was mentioned that if the primary node has a crash, then the job must be manually restarted. There are periodic checkpoints made of the processing, and the job can restart from the last checkpoint.

For more information on the Cilk language (but, unfortunately, not the Cilk-NOW software) the following URL can be used: https://theory.lcs.mit.edu/~cilk.

Originally published in ;login: Vol. 22, No.2, April 1997.

webster@usenix.org
Last changed: May 28, 1997 pc