
Cilk-NOW macroscheduling

The Cilk-NOW runtime system contains components that perform macroscheduling [30]. The macroscheduler identifies idle machines and determines which machines work on which jobs. In this section, we discuss each component of the macroscheduler, and we show how they work together.

Like the workers and the clearinghouse, which together comprise a single parallel Cilk job, the components of the macroscheduler are distributed across the network. As already mentioned, each machine in the network runs a node manager, an instance of the CilkNodeManager program, that monitors the machine's idleness and the status of a worker if one is present. In addition to the clearinghouse, each Cilk job executes a single job manager, an instance of the CilkJobManager program, that services requests for the job description. Each of these components registers with a central job broker, an instance of the CilkJobBroker program. The job broker keeps track of the set of node managers and job managers running in the network. The Cilk-NOW runtime system can continue operation even if some of these components, including the job broker, fail.
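To make the bookkeeping concrete, the following sketch shows the sort of registry a job broker might maintain; the type and field names are assumptions for illustration, not the actual Cilk-NOW data structures.

#include <netinet/in.h>

/* One registry entry per registered component (node manager or job manager).
 * These names are hypothetical; the paper does not give the real structures. */
struct registration {
    struct sockaddr_in   addr;      /* UDP address of the registered component */
    char                 user[32];  /* account the component runs as */
    struct registration *next;      /* next entry in the registry list */
};

struct job_broker_registry {
    struct registration *node_managers;  /* one entry per machine in the network */
    struct registration *job_managers;   /* one entry per running Cilk job */
};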

Each machine in the network runs a node manager that is responsible for determining when the machine is idle. When the machine is in use, the node manager wakes up every 5 seconds to determine whether the machine has gone idle. It looks at how much time has elapsed since the keyboard and mouse were last touched, the number of users logged in, and the processor load averages. The node manager then passes these values through a predicate to decide whether the machine is idle; a sketch of such a check appears after the examples below. A typical predicate might require that the keyboard and mouse have not been touched for at least 5 minutes and that the 1-minute processor load average be below 0.35. The predicate can be customized for each machine, because we believe that maintaining the owner's sovereignty is essential if owners are to allow their machines to be used for parallel computation. A user can change a predicate with a simple command-line utility called CilkPred. For example, issuing the command

CilkPred user=lisiecki global add idletime=900

causes any workstation on the network to require that the user "lisiecki" be idle for at least 900 seconds. Alternatively, a user might issue the command

CilkPred always node=vulture add load=.2,.2,.2

which applies only to the workstation Vulture and requires that each of its 1-, 5-, and 15-minute load averages be 0.2 or less.
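The following sketch in C illustrates how a node manager might evaluate such a predicate. The structure and field names are illustrative assumptions; the actual Cilk-NOW predicate supports per-machine and per-user customization beyond what is shown here.

#include <stdbool.h>

/* Values the node manager samples when it wakes up. */
struct machine_status {
    double idle_seconds;    /* time since the keyboard or mouse was last touched */
    int    users_logged_in;
    double load_avg[3];     /* 1-, 5-, and 15-minute processor load averages */
};

/* Thresholds configured with CilkPred. */
struct idle_predicate {
    double min_idle_seconds;  /* e.g., 900 for "idletime=900" */
    double max_load_avg[3];   /* e.g., {0.2, 0.2, 0.2} for "load=.2,.2,.2" */
};

/* The machine is considered idle only when every applicable condition holds. */
static bool machine_is_idle(const struct machine_status *s,
                            const struct idle_predicate *p)
{
    if (s->idle_seconds < p->min_idle_seconds)
        return false;
    for (int i = 0; i < 3; i++)
        if (s->load_avg[i] > p->max_load_avg[i])
            return false;
    return true;
}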

When all applicable conditions of the predicate are satisfied, the machine is idle, and the node manager obtains a job description from a job manager (using an address given by the job broker or another node manager) and forks a worker. The node manager then monitors the worker and continues to monitor the machine's idleness. With a worker running, the node manager wakes up once every second to determine if the machine is still idle (adding an estimate of the running job's processor usage to any processor load-average threshold). If the machine is no longer idle, then the node manager sends a kill signal to the worker as previously described. When the worker process dies for any reason, the node manager takes one of two possible actions. If the machine is still idle, then it obtains a new job description and forks a new worker. If the machine is no longer idle, then it returns to monitoring the machine once every 5 seconds.
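A rough sketch of this monitoring loop appears below. The helper functions are hypothetical placeholders for the real Cilk-NOW mechanisms (predicate evaluation, obtaining a job description, and forking a worker), and the particular kill signal shown is an assumption.

#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern bool  machine_idle(void);   /* evaluate the idleness predicate */
extern pid_t fork_worker(void);    /* obtain a job description and fork a worker */

static void monitor_machine(void)
{
    for (;;) {
        if (!machine_idle()) {     /* no worker running: poll every 5 seconds */
            sleep(5);
            continue;
        }
        pid_t worker = fork_worker();
        while (waitpid(worker, NULL, WNOHANG) == 0) {
            sleep(1);              /* worker running: poll once every second */
            if (!machine_idle()) {
                kill(worker, SIGTERM);     /* machine reclaimed by its owner */
                waitpid(worker, NULL, 0);  /* wait for the worker to die */
                break;
            }
        }
        /* When the worker dies for any reason, loop back: fork a new worker
         * if the machine is still idle, otherwise resume 5-second polling. */
    }
}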

When a Cilk job begins execution, a job manager is started automatically by the clearinghouse. The clearinghouse submits the job description to the job manager as alluded to in Section 3, and then the job manager registers itself with the job broker. The job manager then goes to sleep, and it periodically wakes up to reregister with the job broker in case the job broker has crashed and restarted. When the job terminates, the job manager unregisters with the job broker. The job manager is the central authorizing agent for the job. Any time a node manager forks a worker, it receives a copy of the job description directly from the job's job manager.
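The job manager's life cycle can be summarized by the following sketch; the helper functions and the reregistration interval are assumptions, since this section does not specify them.

#include <stdbool.h>
#include <unistd.h>

#define REREGISTER_INTERVAL 60    /* seconds; an assumed value */

extern void register_with_broker(void);    /* placeholders for the real protocol */
extern void unregister_with_broker(void);
extern bool job_has_terminated(void);

static void job_manager_main_loop(void)
{
    register_with_broker();
    while (!job_has_terminated()) {
        sleep(REREGISTER_INTERVAL);
        register_with_broker();    /* reregister in case the broker restarted */
    }
    unregister_with_broker();
}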

When a node manager forks a new worker, it must take special precautions to ensure that the user specified in the job description has actually authorized the job to be run. Failing to do so would allow an outsider to gain unauthorized access to a user's account. Furthermore, the macroscheduler's protocols should be secure against unauthorized messages. For these reasons, all of the macroscheduler's protocols are secured with an abstraction on top of UDP called secure active messages [30]. This abstraction preserves the semantics of the split-phase protocols mentioned earlier but adds a guarantee to the receiver that messages are authentic. Unlike a normal UDP message, which is sent from one network address to another, a secure active message is sent between "principals." A principal is a pair consisting of a network address and a claim as to the identity of the sender. Each secure active message contains user data, the sending principal, and whatever additional data is required by the underlying authentication protocol, whether that protocol is the standard UNIX rsh protocol or one like Kerberos [34]. The security layer is deliberately simple, providing only enough functionality to secure the protocols in a manner independent of the authentication protocol.
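The sketch below suggests what a secure active message might carry, based on the description above; the field names and sizes are assumptions, and the authentication data is left opaque because its contents depend on the underlying protocol.

#include <netinet/in.h>
#include <stdint.h>

/* A principal: a network address plus a claimed identity. */
struct principal {
    struct sockaddr_in addr;          /* network address of the sender */
    char               identity[32];  /* claimed identity of the sender */
};

/* A secure active message: user data, the sending principal, and whatever
 * extra data the authentication protocol (rsh-style or Kerberos) requires. */
struct secure_active_message {
    struct principal sender;      /* sending principal */
    uint32_t         handler;     /* which handler to run at the receiver */
    uint32_t         data_len;
    uint8_t          data[512];   /* user data */
    uint32_t         auth_len;
    uint8_t          auth[256];   /* protocol-specific authentication data */
};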

The decision to receive the job description directly from the job manager stems from security considerations with protocols such as Kerberos, in which the job broker and node managers are not trusted with the user's credentials. Only a user in possession of the appropriate tickets can be trusted to start a remote process. Since the job manager runs as the desired user, retrieving the job description directly and securely from the job manager ensures that the user has actually authorized running the job.

The task of scheduling jobs on workstations is shared between the job broker and the node managers. The job broker is responsible for ensuring that each job is running on at least one workstation. The node managers then use a distributed, randomized algorithm to divide the workstations evenly among the jobs. Because the node managers are capable of performing scheduling, temporary outages of the job broker do not impede progress in scheduling jobs on the network of workstations. We are currently experimenting with a distributed, randomized macroscheduling algorithm that uses steal rates to estimate worker utilization. Each job should get its fair share of the idle machines, but no job should get more machines than it can efficiently utilize.
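As a simple illustration of how a node manager might choose among jobs, the sketch below picks a registered job uniformly at random. This is only an assumed stand-in: the actual Cilk-NOW algorithm, including its use of steal rates to estimate worker utilization, is more sophisticated and is not specified in this section.

#include <stdlib.h>

struct job_entry {
    int job_id;    /* opaque handle for a registered job */
};

/* Pick uniformly at random among the jobs known to this node manager, so
 * that in aggregate the idle workstations divide evenly among the jobs. */
static const struct job_entry *choose_job(const struct job_entry *jobs, int njobs)
{
    if (njobs == 0)
        return NULL;
    return &jobs[rand() % njobs];
}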

