The following paper was originally published in the Proceedings of the USENIX SEDMS IV Conference (Experiences with Distributed and Multiprocessor Systems), San Diego, California, September 22-23, 1993.

For more information about the USENIX Association, contact:

1. Phone: 510 528-8649
2. FAX: 510 548-5738
3. Email: office@usenix.org
4. WWW URL: https://www.usenix.org

[This paper appears in "Proceedings of the 1993 Symposium on Experience with Distributed and Multiprocessor Systems", pages 195-211. Postscript and paper copies of the paper are available by sending email to any one of the authors (e.g. david.sims@uc.edu).]


Performance of Concurrent Servers Generated Automatically from Sequential Servers

David L. Sims
Debra A. Hensgen
Lantz Moore

Department of Electrical and Computer Engineering
University of Cincinnati
Cincinnati, Ohio, USA
david.sims@uc.edu
debra.hensgen@uc.edu
lmoore@leo.ece.uc.edu


Abstract

Concurrent servers in a multiprocessor system, concurrent graphical user interfaces, and operating systems containing concurrent objects are fairer on a uniprocessor and can yield better performance on a multiprocessor than their sequential counterparts. However, concurrent servers are more difficult to implement correctly. Our contribution is a tool that generates correct concurrent servers from correct sequential ones. The only restriction on the sequential servers is that they must be programmed in a slightly restricted subset of Modula-3 in the Generic Server Format using the Defensive Object Model. Our tool uses well known static analysis techniques from compiler theory to insert locks that are guaranteed to be free from deadlock. It uses information obtained from exception handling statements to insert synchronization primitives. The tool produces very readable Modula-3 programs that are subsequently compiled using standard Modula-3 compilers. In addition to describing the rationale, design, and implementation of our tool, this paper presents performance comparisons between hand-tuned and automatically generated queues and search structures. Moreover, we report on the lessons that were learned from automatically generating various concurrent servers.

1. Introduction

Concurrent servers in a multiprocessor system, concurrent graphical user interfaces, and operating systems containing concurrent objects are fairer on a uniprocessor and can yield better performance on a multiprocessor. However, concurrent objects are more difficult to implement correctly than their sequential counterparts, as evidenced by the abundance of research in parallel and distributed debugging. Programmers must determine where mutual exclusion is needed, where synchronization is needed, and how to ensure liveness, fairness, and absence of deadlock. A high degree of confidence in the concurrent implementation of a server can be obtained only through a formal proof of correctness or thorough testing. Thorough testing is extremely computationally complex due to non-determinism [2].

Our contribution is a tool that generates correct concurrent servers from correct sequential ones.
The only restriction on the servers is that they must be programmed in Modula-3 [4, 18], be written in the Generic Server Format, and use the Defensive Object Model. Our tool uses well known static analysis techniques from compiler theory to insert read and write locks that are guaranteed to be free from deadlock. It uses information obtained from exception handling statements to insert synchronization mechanisms. This paper describes the rationale for our tool as well as its design and implementation. We also report on the performance of the concurrent servers generated by the tool.

Our goal is to aid programmers whose main program fits the Generic Server Format shown in Figure 1. Programs usually written in this format include servers in a multiprocessor or distributed system awaiting request messages, graphical user interfaces awaiting mouse clicks, and operating systems awaiting service calls. Furthermore, microkernel operating systems are also collections of servers usually written in the Generic Server Format. Our goal is to allow the programmer to concentrate on implementing a correct, sequential version of such a program, where the Thread.Fork calls in Figure 1 have been replaced by procedure invocations. This sequential version is easier to implement because it frees the programmer from concerns about synchronization, mutual exclusion, and deadlock. Using our system, the sequential server must be developed, and then tested or verified. Then the sequential server can be automatically mapped to a correct, concurrent implementation.

    PROCEDURE Server() =
      VAR request: Request;
      BEGIN
        Initialize();
        LOOP
          ReceiveRequest(request);
          CASE request.type OF
          | RequestType1 => Thread.Fork(ServiceRequest1);
          | RequestType2 => Thread.Fork(ServiceRequest2);
            ...
          | RequestTypeN => Thread.Fork(ServiceRequestN);
          END; (* CASE *)
        END; (* LOOP *)
      END Server;

    Figure 1  Generic Server Format for a Concurrent Server

Our tool requires the programmer to use the Defensive Object Model of programming described in Section 2.2. After the programmer is satisfied with the correctness of the sequential version, our tool can be used to produce the corresponding concurrent version in two passes. The first pass inserts synchronization mechanisms, and the second pass inserts mutual exclusion mechanisms. The correctness of the algorithms that insert mutual exclusion and synchronization mechanisms depends only on the correct placement of these mechanisms, not on which mechanism is actually inserted. Our tool correctly solves the major problems associated with implementing a concurrent server: mutual exclusion without deadlock, and synchronization. It treats each of the forked threads shown in Figure 1 as a transaction in order to guarantee synchronization atomicity. We also ensure that all locks are released whether the service request fails or succeeds.
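For contrast, a minimal sketch of the sequential version that the programmer actually writes and tests is shown below; it differs from Figure 1 only in that each service routine is invoked directly rather than forked (the Request type and the ServiceRequest procedures are schematic, exactly as in Figure 1):

    PROCEDURE SequentialServer() =
      VAR request: Request;
      BEGIN
        Initialize();
        LOOP
          ReceiveRequest(request);
          CASE request.type OF
          | RequestType1 => ServiceRequest1();  (* direct call, no Thread.Fork *)
          | RequestType2 => ServiceRequest2();
            ...
          | RequestTypeN => ServiceRequestN();
          END; (* CASE *)
        END; (* LOOP *)
      END SequentialServer;

Because requests are serviced one at a time, this version can be tested or verified without reasoning about interleavings.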
Our tool is not the first one to handle some of these problems automatically. Database programmers are usually required to insert BeginTransaction and EndTransaction primitives around their transactions. Doing so would correspond to inserting BeginTransaction and EndTransaction statements around each Fork statement in the Generic Server Format. The database system then uses a dynamic scheduling technique, such as two-phase locking [7], to obtain locks as needed. If the DBMS does not use static scheduling, locks are not released until all locks have been obtained. Deadlock is possible in these systems, but it is usually detected, and transaction abortion releases the deadlock. Because these systems do not commit transactions until all locks have been obtained, there is no possibility that a transaction would be partially committed when it needed to be aborted.

Herlihy [12] applies this technique to sequential objects to make them concurrent, and, like the database systems, Herlihy's system requires that data structures be copied upon modification. His system is similar to techniques used in the distributed language ARGUS [16], which also has atomic object facilities. In the ARGUS system, locks are held until transactions are committed, which also requires copying all modified data. ARGUS and Herlihy's system require significantly more overhead than our system due to substantial copying overhead and because they do not insert cooperative synchronization mechanisms. However, Herlihy's solution is more fault tolerant than ours because it guarantees that even if processes try to hold mutual exclusion forever (or a very long time), progress will be made. Barnes [3] has improved upon Herlihy's technique, avoiding some of the copying. Our technique uses copying only when synchronization is required after the server's state has been modified.

Our tool differs from database systems in several major ways. Our tool analyzes the code statically and inserts the locking mechanisms once, rather than inserting them for every execution during runtime. Our tool differs from both typical database systems and Herlihy's technique because it does not allow deadlock to occur and it inserts cooperative synchronization mechanisms. Our tool uses well known compiler techniques, and known locking and synchronization mechanisms. The major contribution of our work lies in the verification of the correctness of the insertion algorithms, in deriving synchronization conditions from exception handling statements, and in reporting on the performance of our algorithms.

In the next section, we present the program model, including the Defensive Object Model. Then in Section 3, we present our design, that is, our algorithms: four for the insertion of mutual exclusion mechanisms and one for the insertion of synchronization mechanisms. We discuss the implementation and current status of our tool in Section 4. Section 5 gives results from experiments with sequential servers, hand-coded concurrent servers, and automatically generated concurrent servers. Finally, we present our conclusions in Section 6.

2. Program Model

We model servers using objects. An object is a set of procedures, or methods, that operates on a set of data. Each method implements a service offered to clients. The data, called object variables, are shared among all methods of an object, but other objects can access the data only through approved methods. Each method can contain local data (parameters and local variables) that are private.

Threads are asynchronous, sequential processes that execute concurrently with other threads. Each thread's environment is the same as its parent's environment, that is, the parent and child share data and code segments although the stack segments are independent. A client thread accesses the server by invoking methods of a server object.
Hence, in the most general case, several instances of each object's methods may be executing concurrently although individual object methods are sequential.

A concurrent server, then, is an object whose methods may execute simultaneously using threads. Conversely, a sequential server is an object whose methods execute one at a time. The programmer of a sequential object may assume that while a method is executing, no other method can execute.

2.1 Modula-3

The examples throughout this paper are expressed in Modula-3 [4, 18], a simple programming language that supports object-orientation, threads, synchronization, mutual exclusion, exception handling, and garbage collection. Since Modula-3 supports threads and synchronization, it neatly fits our server model.

Our concurrent object generator, Concurra, takes as input Modula-3 programs written using the Defensive Object Model, as explained in the next section. The notion of an object in Modula-3 is represented in Figure 2. The object's data are declared after the keywords BRANDED OBJECT. These data are not accessible from outside the object. They may be accessed only by the methods listed below the keyword OVERRIDES. In Figure 2 the method insert is implemented by the procedure my_insert. Procedures reference the object's data through the self parameter. For example, the object variable some_data is referenced via the expression self.some_data.

    REVEAL
      T = Public BRANDED OBJECT
        some_data : Some_Type;
        more_data : Some_Other_Type;
      OVERRIDES
        insert := my_insert;
        delete := my_delete;
      END;

    PROCEDURE my_insert(self : T; newvalue : Some_Type) =
      BEGIN ... END my_insert;

    PROCEDURE my_delete(self : T) : Some_Type =
      BEGIN ... END my_delete;

    Figure 2  Implementation of a Modula-3 Object

2.2 Defensive Object Model

We now present the Defensive Object Model that sequential objects must adhere to in order to guarantee the correctness of our transformations. The model that we describe is simply good programming practice, but we explicitly describe it here because our system depends upon it to correctly insert mutual exclusion and synchronization mechanisms.

When errors occur in object methods, they should be handled in an organized manner. Exceptions have been devised as a means to interrupt normal program flow when an error occurs and to recover from the error. When a Modula-3 method discovers an error condition, it should execute the statement RAISE ExceptionName, where ExceptionName has been previously declared as an exception. For example, if a method intends to insert elements into a table, but finds that the table is full, it might execute a RAISE TableFull statement to indicate the error condition to the caller of the method. Once an exception has been raised, the Modula-3 runtime system uses dynamic scoping rules to find an appropriate response, otherwise known as an exception handler, for the exception.

Untrusted clients make requests of servers. Good programming practice dictates that a server must neither make any assumptions about the parameters passed to it in a client request nor make assumptions about the state of the object when a client request arrives, except that the object state is consistent. A properly written server method always verifies that client requests are legal.

As an additional example, we consider a stack server. Typical methods include push and pop. When push and pop are invoked, they assume nothing about the state of the stack, except that it is consistent. Therefore, the pop method must check that the stack is not empty before proceeding with the rest of the method. If this requirement is not met, the server typically raises an exception to the client such as RAISE IsEmpty. The pop method cannot complete successfully unless the stack is non-empty.
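A minimal sketch of a pop method written in this defensive style follows; the stack representation (an items array and a top index) and the element type Elem are illustrative, not taken from the paper:

    EXCEPTION IsEmpty;

    PROCEDURE my_pop(self : T): Elem RAISES {IsEmpty} =
      VAR value: Elem;
      BEGIN
        (* Defend against an illegal request: verify the precondition
           instead of trusting the client. *)
        IF self.top = 0 THEN
          RAISE IsEmpty;
        END;
        DEC(self.top);
        value := self.items[self.top];
        RETURN value;
      END my_pop;

As the next section explains, it is precisely such RAISE sites that the tool later converts into synchronization conditions; here, the condition is self.top = 0.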
2.3 Synchronizing Methods

Sequential object methods never need to synchronize. In order to extract synchronization conditions from sequential methods, we rely on the sequential programmer to use exceptions when indicating error conditions. As described in the next section, we use the exception handling information within a method to generate synchronization conditions when constructing concurrent objects. Basically, a synchronization condition holds when an exception might be raised. As long as a synchronization condition holds, we consider the method unsynchronized and put the method to sleep. Later, when the synchronization condition no longer holds, the method awakens and proceeds normally.

3. Design

Our tool makes two passes over each correct sequential object. Both passes use well known compiler technology, which we note where it is used. In the first pass, our tool inserts synchronization mechanisms; we describe the algorithm that we use to insert them below. In the second pass, mutual exclusion mechanisms are inserted. In the last part of this section, we describe four specific algorithms for inserting mutual exclusion mechanisms, a modification that can be applied to all four, and a deadlock avoidance algorithm. Different algorithms will yield better speedup with different objects. The algorithms differ in the amount of overhead they impose as well as the amount of concurrency they permit.

Our system makes use of some well known techniques from compiler theory. For brevity, we do not describe those techniques but refer the reader to appropriate references on definition-use graphs [1] and data flow dependence [20]. These algorithms can be made very efficient [22, 15].

3.1 Synchronization

Sequential objects are not normally thought of as having to perform synchronization because there are no concurrently executing threads of control. In this section we give a general overview of our algorithms used to insert synchronization mechanisms. More details can be found in another paper [23].

The algorithm determines the conditions that cause exceptions to be raised. However, exceptions that are raised due to bad parameters, such as a value out of range, do not become synchronization conditions. The algorithm inserts statements at the beginning of a method to determine whether these synchronization conditions are true. It also inserts statements to wait for these conditions to become false in case they are found to be true. In the above mentioned paper [23], we consider two cases. The first case is when no object variables are modified before an exception is raised. This case is straightforward. We also consider the case where some object variables are modified before an exception can be raised. The problem here is that, once mutual exclusion has been added, write locks will be obtained to modify the object's state. If we allow the write locks to be held while the method waits to synchronize, deadlock can occur. If we release the locks, inconsistent states can be seen and synchronization atomicity would be violated. Therefore, we choose a database system solution. We make copies of those variables, storing the modified versions in temporary variables. Once the method becomes synchronized, we transfer these temporary values back to the original variables. In this way, we guarantee failure atomicity.
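Section 4 notes that the tool does not yet generate this second case, but the intended shape of the output can be sketched by analogy with Figure 3. In the hypothetical method below, the object variable count is written before the point at which an exception could be raised, so the generated code shadows it in a temporary and commits the temporary only after synchronizing. This is a schematic of the buffering pattern only: the names my_take and count are illustrative, and the lock acquisition and release that Figure 3 would interleave here are elided.

    PROCEDURE my_take(self: T): QType =
      VAR
        tmp_count: INTEGER;  (* shadow copy of the modified object variable *)
        value: QType;
      BEGIN
        (* The sequential method decremented self.count here; the
           generated code modifies a private copy instead, so no
           partial update is ever visible to other threads. *)
        tmp_count := self.count - 1;
        (* Synchronization condition derived from the RAISE site;
           locks would be released before each wait, as in Figure 3. *)
        WHILE self.head = self.tail DO
          Thread.Wait(self.Condition);
        END;
        value := self.values[self.head];
        self.head := (self.head + 1) MOD MAX;
        self.count := tmp_count;  (* synchronized: commit the shadow copy *)
        RETURN value;
      END my_take;

Keeping the committed copy consistent with concurrent updates made while the method sleeps is exactly what makes this the hard case.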
3.2 Mutual Exclusion

In this section we consider several mutual exclusion algorithms and show how to automatically map each one onto a sequential object. The resulting concurrent object has no points of interference. We take a conservative approach when inserting mutual exclusion primitives. That is, our algorithms do not take into account the possibility that the server may not actually invoke methods that could interfere with each other. If two methods can potentially interfere with each other, then they are prevented from doing so under all circumstances. We give an overview of the algorithms here and refer the interested reader to another paper [24].

Each of the following algorithms allows no less, and in most cases more, concurrency than the straightforward technique of enclosing each object in a monitor. While monitors prevent interference in concurrent objects, they also prevent methods from overlapping with one another. If two methods do not both need to atomically read or update an object variable, they can overlap their executions.

The first two algorithms are used to obtain fine-grained locking while avoiding deadlock. They differ in the amount of overhead due to locking and deadlock avoidance, as well as the amount of concurrency that they permit. The next two algorithms use a very coarse-grained locking technique. Finally, we give an example illustrating the synchronization mechanism and one of the mutual exclusion techniques.

Conservative Two-phase Locking

This insertion algorithm is based on traditional two-phase locking [7] and hierarchical ranking of locks [10]. The tool inspects the assignment and conditional statements in each object method to determine which object variables must be read or write locked. Requests for locks are inserted at the beginning of the method according to the hierarchical ranking and released near the end to assure atomicity.
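Applied to the sequential dequeue method shown later in Figure 3, conservative two-phase locking yields code of roughly the following shape: every lock the method could need is requested up front, in hierarchical (here lexicographic) order, and all are released at the end. This is a minimal sketch; the Locks interface is the one that appears in Figure 3, and the synchronization pass is omitted for brevity.

    PROCEDURE dequeue (self: T): QType =
      VAR value: QType;
      BEGIN
        (* All locks acquired at the beginning, in lexicographic
           rank order: head, then tail, then values. *)
        Locks.acquire_read_write_locks(self.lock_head);
        Locks.acquire_readlock(self.lock_tail);
        Locks.acquire_readlock(self.lock_values);
        TRY
          IF self.head = self.tail THEN
            RAISE queue_empty;
          END;
          value := self.values[self.head];
          self.head := (self.head + 1) MOD MAX;
          RETURN value;
        FINALLY
          (* Released only near the end, assuring atomicity. *)
          Locks.release_read_write_locks(self.lock_head);
          Locks.release_readlock(self.lock_tail);
          Locks.release_readlock(self.lock_values);
        END;
      END dequeue;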
Liberal Two-phase Locking

This insertion algorithm is similar to conservative two-phase locking. After inspecting each method's assignment and conditional statements, the tool inserts requests for locks before each object variable is used, unless an appropriate lock has already been acquired. Locks are held until an object variable has been used for the last time to assure synchronization atomicity. Read locks are acquired for object variables that are not modified. Locks are requested in a hierarchical order to avoid deadlock; hence, the tool must move some lock requests upward in the method to ensure that locks are acquired in the hierarchical ordering. Conditional statements present additional problems. If a lock L is needed inside a conditional statement and is hierarchically ordered before another lock requested before the conditional statement, lock L is requested before the first lock. It is possible that the conditional branch will not be taken and lock L will have been acquired needlessly. Care is taken to release all locks. Modula-3's TRY ... FINALLY statements are used to ensure that all locks are released in the event of failure of a method.

Another difference between these two-phase locking protocols is that the liberal technique permits write locks to be downgraded to read locks based upon knowledge derived from dependency graphs. Downgrading locks allows more concurrency since write locks can be downgraded before the end of a method. However, it also carries extra overhead, as eventually the downgraded locks must be completely released.

Monitors

The monitor is another well known algorithm that is used to build concurrent objects [13]. Implementing a monitor is straightforward. We associate a semaphore with the object. Then each object method must acquire the semaphore before executing. Thus, at most one method may be executing at any one time.

Crowd Monitors

The crowd monitor generalizes the monitor to allow some object methods to overlap their execution [8]. We use a slight simplification of crowd monitors. First, we associate a single lock with the object. Then if an object method might modify an object variable, the method must acquire a write lock before it begins execution. The write lock ensures that no other method will execute concurrently with the method that owns the write lock. On the other hand, if a method only reads object variables, then it must acquire a read lock before execution. Because several threads can hold the read lock, some object methods will be able to overlap their executions and increase the concurrency in the object.
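In standard Modula-3 (as opposed to the Locks package our tool uses), this simplified crowd monitor amounts to a single readers-writer lock: read-only methods bracket their bodies with enterRead/exitRead, updating methods with enterWrite/exitWrite, and an ordinary monitor is the degenerate case in which every method enters as a writer. A minimal sketch, with an illustrative CrowdLock type that is not part of the tool:

    TYPE
      CrowdLock = MUTEX OBJECT
        readers : INTEGER := 0;      (* active read-only methods *)
        writer  : BOOLEAN := FALSE;  (* is an updating method active? *)
        changed : Thread.Condition;
      METHODS
        enterRead()  := EnterRead;
        exitRead()   := ExitRead;
        enterWrite() := EnterWrite;
        exitWrite()  := ExitWrite;
      END;

    PROCEDURE EnterRead(self: CrowdLock) =
      BEGIN
        LOCK self DO
          WHILE self.writer DO Thread.Wait(self, self.changed) END;
          INC(self.readers);
        END;
      END EnterRead;

    PROCEDURE ExitRead(self: CrowdLock) =
      BEGIN
        LOCK self DO
          DEC(self.readers);
          IF self.readers = 0 THEN Thread.Broadcast(self.changed) END;
        END;
      END ExitRead;

    PROCEDURE EnterWrite(self: CrowdLock) =
      BEGIN
        LOCK self DO
          (* No writer priority; adequate as a sketch. *)
          WHILE self.writer OR self.readers > 0 DO
            Thread.Wait(self, self.changed)
          END;
          self.writer := TRUE;
        END;
      END EnterWrite;

    PROCEDURE ExitWrite(self: CrowdLock) =
      BEGIN
        LOCK self DO
          self.writer := FALSE;
          Thread.Broadcast(self.changed);
        END;
      END ExitWrite;

A server object would allocate one such lock at creation, e.g. lck := NEW(CrowdLock, changed := NEW(Thread.Condition)), since a condition variable cannot be given as a field default.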
Combining Locks

We can parameterize our tool so that, using any of the above algorithms, we can reduce overhead due to lock acquisition and release. We give our tool a percentage. If locks for two or more object variables are obtained together a percentage of the time at least as great as our parameter, we obtain just a single lock, associated with that set of object variables, in the event that any of the object variables are used. This merging of object variables to use a single lock cuts down on the locking overhead, but also on some amount of concurrency.

Banker's Algorithm

The Banker's algorithm is a well known algorithm for preventing deadlock [5]. Since static analysis of methods tells us all of the locks that could be used, we can apply this algorithm. Again, this algorithm increases both concurrency and the overhead due to deadlock avoidance, but optimized versions of the algorithm mitigate this problem [14, 17]. The Banker's algorithm originally applied to resources that were held exclusively. An extension of the Banker's algorithm that uses read and write locks to permit shared and exclusive access to resources (in our case, object variables) is used in our tool [11].
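For reference, the heart of the classical (exclusive-resource) Banker's algorithm is the safety check sketched below: a request is granted only if, assuming it were granted, every thread could still acquire its declared maximum and run to completion. This sketch is illustrative only; Concurra uses the read/write-lock extension of [11], not this exclusive form, and the sizes and matrix layout are assumptions of the sketch.

    CONST NThreads = 8; NLocks = 3;  (* illustrative sizes *)

    TYPE
      Vec = ARRAY [0 .. NLocks - 1] OF INTEGER;
      Mat = ARRAY [0 .. NThreads - 1] OF Vec;

    (* avail[l] = free units of lock l (passed by value; mutated locally),
       alloc[t] = units thread t currently holds,
       need[t]  = units thread t may still request. *)
    PROCEDURE Safe (avail: Vec; READONLY alloc, need: Mat): BOOLEAN =
      VAR
        done: ARRAY [0 .. NThreads - 1] OF BOOLEAN;
        progress := TRUE;
        ok: BOOLEAN;
      BEGIN
        FOR t := 0 TO NThreads - 1 DO done[t] := FALSE END;
        WHILE progress DO
          progress := FALSE;
          FOR t := 0 TO NThreads - 1 DO
            IF NOT done[t] THEN
              ok := TRUE;
              FOR l := 0 TO NLocks - 1 DO
                IF need[t][l] > avail[l] THEN ok := FALSE END;
              END;
              IF ok THEN
                (* Thread t could finish and return what it holds. *)
                FOR l := 0 TO NLocks - 1 DO
                  avail[l] := avail[l] + alloc[t][l];
                END;
                done[t] := TRUE;
                progress := TRUE;
              END;
            END;
          END;
        END;
        FOR t := 0 TO NThreads - 1 DO
          IF NOT done[t] THEN RETURN FALSE END;
        END;
        RETURN TRUE;
      END Safe;

A lock request is granted only when Safe holds on the state that would result from granting it; otherwise the requesting thread waits.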
An Example

Finally, we illustrate the use of some of the synchronization and mutual exclusion mechanisms with a simple example. Figure 3 depicts an abbreviated sequential queue object written in Modula-3, followed by the same queue after it has been transformed using the synchronization and liberal two-phase locking mechanisms. (In the original two-column figure the two versions appear side by side, with the automatically inserted statements in boldface; in the plain-text listing below, the inserted statements are the Locks and Thread calls together with the TRY ... FINALLY frame.)

    MODULE Queue;

    CONST MAX = 100;

    EXCEPTION queue_empty;

    TYPE ...

    REVEAL
      T = Public BRANDED OBJECT
        values: ARRAY [0..MAX - 1] OF QType;
        head, tail: INTEGER := 0;
      OVERRIDES
        dequeue := dequeue;
      END;

    PROCEDURE dequeue (self: T): QType =
      VAR value: QType;
      BEGIN
        IF self.head = self.tail THEN
          RAISE queue_empty;
        END;
        value := self.values[self.head];
        self.head := (self.head + 1) MOD MAX;
        RETURN value;
      END dequeue;


    MODULE CQueue;
    IMPORT Locks, Thread;

    CONST MAX = 100;

    EXCEPTION queue_empty;

    TYPE ...

    REVEAL
      T = Public BRANDED OBJECT
        values: ARRAY [0..MAX - 1] OF QType;
        head, tail: INTEGER := 0;
        lock_head, lock_tail, lock_values: Locks.T;
        Condition: Thread.Condition;
      OVERRIDES
        dequeue := dequeue;
      END;

    PROCEDURE dequeue (self: T): QType =
      VAR value: QType;
      BEGIN
        TRY
          (* perform synchronization *)
          Locks.acquire_read_write_locks(self.lock_head);
          Locks.acquire_readlock(self.lock_tail);
          WHILE self.head = self.tail DO
            Locks.release_read_write_locks(self.lock_head);
            Locks.release_readlock(self.lock_tail);
            Thread.Wait(self.Condition);
            Locks.acquire_read_write_locks(self.lock_head);
            Locks.acquire_readlock(self.lock_tail);
          END;
          (* procedure is now synchronized *)
          IF self.head = self.tail THEN
            RAISE queue_empty;
          END;
          Locks.acquire_readlock(self.lock_values);
          Locks.release_readlock(self.lock_tail);
          value := self.values[self.head];
          Locks.release_readlock(self.lock_values);
          self.head := (self.head + 1) MOD MAX;
          Locks.release_writelock(self.lock_head);
          Locks.release_readlock(self.lock_head);
          RETURN value;
        (* make sure all locks are released *)
        (* and signal any waiting threads *)
        FINALLY
          Locks.release_read_write_locks(self.lock_head);
          Locks.release_read_write_locks(self.lock_tail);
          Locks.release_read_write_locks(self.lock_values);
          Thread.Signal(self.Condition);
        END;
      END dequeue;

    Figure 3  Sequential Queue Transformed to Concurrent Queue

4. Status of the Tool

Our tool processes a sequential object written in Modula-3 [4, 18]. The tool statically analyzes each object method to determine what mutual exclusion and synchronization primitives need to be added to the object to guarantee synchronization atomicity. In this section we review some of the limitations of the tool and directions for future work.

To date our tool implements the conservative and liberal two-phase locking schemes as well as monitors and crowd monitors for mutual exclusion. In addition, synchronization conditions are generated as long as the object state is never modified before an exception could be raised.

The tool's static analysis of objects proceeds in a number of steps. First, the tools Flex [19] and Bison [6] are used to generate a parser for the Modula-3 source file and create a definition-use tree. A definition-use tree is simply a parse tree where each node in the parse tree lists the variables that are written (defined) and read (used) at that node.

The input to the tool is a Modula-3 source file that consists of the definition of a single object. All the object's methods must be defined in the same file. Parameters to each method must be passed by value. The tool supports all legal Modula-3 data types, including records and arrays, except references (pointers). References pose unusual problems, but we see no obstacle in supporting them in the near future.

Several, but not all, Modula-3 statements are supported. These statements include assignment statements, IF-THEN-ELSE, WHILE, EXIT-LOOP, and RETURN statements. In the current version of our system, object methods are not permitted to invoke other procedures or methods. There are some additional obstacles in ensuring mutual exclusion without deadlock and synchronization when invoking other procedures. Solutions to these problems are currently under investigation. Nevertheless, the statements that are currently supported are sufficient for writing a wide variety of useful servers.

The locking granularity is restricted to coarse-grained locking. Locks are obtained on entire variables whether they are integers, records, or arrays. Locks are acquired according to a lexicographic hierarchical ordering to avoid deadlock. Since arrays are indexed by variables whose values are not known until runtime, it is difficult to hierarchically order lock requests on array elements; hence, we lock entire arrays. We are investigating techniques for achieving a finer grain of locking in order to increase the concurrency in the objects that the tool generates.

Our tool does not support synchronization when it is possible for the object's state to be modified before an exception is raised. Although we will eventually lift this restriction, it is not as harsh as it may sound. It is considered risky programming practice to modify the state of an object and subsequently raise an exception [9].

5. Experiments

To test the effectiveness of our tool, we compare six implementations for each of two data structures. The first data structure is the queue, whose methods include enqueue, dequeue, and front_of_queue. The other data structure is the search structure, which includes the methods insert, delete, and search. We implemented these data structures manually, using eventcounts and sequencers [21], and with our tool using conservative two-phase locking, liberal two-phase locking, monitors, and crowd monitors. Our evaluations were conducted on an 8-processor Encore Multimax, a shared-memory, bus-architecture multiprocessor. Each experiment consists of forking 8 threads that repeatedly invoke methods on an object. The object represents the server, each thread represents a client, and threads invoking methods represent client requests.

5.1 Queue

First, we developed a common implementation of a sequential queue of integers using an array. Since integers are usually the same size as pointers, this experiment also shows the effects of a queue of pointers, where each queue element points to a memory block of arbitrary size. Next, our tool was used to automatically generate several concurrent versions of the queue. Finally, a concurrent queue based on eventcounts and sequencers [21] was developed manually. All the queues are written in Modula-3. In the first experiment each of the 8 threads first enqueues 1,000 integers. Next, each thread invokes front_of_queue 8,000 times, which simply returns the element at the front of the queue without dequeuing it. Finally, each thread dequeues 1,000 values. The front_of_queue method does not modify the object, allowing our tool to generate objects that allow several instances of front_of_queue to run simultaneously.
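This workload can be paraphrased as the following driver. It is a sketch of our reading of the setup, not the authors' harness, and it assumes a Queue.T object with enqueue, front_of_queue, and dequeue methods, in the style of Figure 3:

    MODULE Driver EXPORTS Main;
    IMPORT Thread, Queue;

    TYPE
      Worker = Thread.Closure OBJECT
        q: Queue.T;
      OVERRIDES
        apply := Run;
      END;

    PROCEDURE Run (self: Worker): REFANY =
      BEGIN
        FOR i := 1 TO 1000 DO self.q.enqueue(i) END;             (* fill *)
        FOR i := 1 TO 8000 DO EVAL self.q.front_of_queue() END;  (* read-only *)
        FOR i := 1 TO 1000 DO EVAL self.q.dequeue() END;         (* drain *)
        RETURN NIL;
      END Run;

    VAR
      q := NEW(Queue.T);
      threads: ARRAY [0 .. 7] OF Thread.T;  (* one client per processor *)
    BEGIN
      FOR i := FIRST(threads) TO LAST(threads) DO
        threads[i] := Thread.Fork(NEW(Worker, q := q));
      END;
      FOR i := FIRST(threads) TO LAST(threads) DO
        EVAL Thread.Join(threads[i]);  (* timing would bracket the forks and joins *)
      END;
    END Driver.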
Since the queue's elements are small, the overhead for the concurrency mechanisms is quite significant. In all cases the amount of processor time required for each queue's concurrency mechanism is far greater than the actual time spent in performing queuing work. Because each queue operation is computationally inexpensive, each thread spends a significant amount of time acquiring and releasing locks. The process of acquiring and releasing a lock involves reading and writing a single memory location. Since all 8 threads acquire and release the same locks, the threads perform a large number of read and write operations to a small number of memory locations. These activities create a bottleneck, resulting in memory and bus contention. As a consequence of this bottleneck, throughput slows significantly. In fact, as the timings depicted in Table 1 show, the 8 threads run faster executing on 1 processor than on 8 processors. With fewer processors, there is less memory contention, less bus contention, and fewer executions of the cache coherence protocol.

Table 1 gives the times, in seconds, that each thread spent in executing all 10,000 object operations, multiplied by 8 threads. The sequential queue ran in just 1.4 seconds. However, that time is deceptive. The sequential queue is not a concurrent server because it does not provide synchronization atomicity. The sequential queue's time appears as a benchmark for comparison against the execution times of the concurrent queues. The handwritten queue ran in 60 seconds. The overhead in acquiring and releasing semaphores is a significant percentage of the total running time. This overhead is especially evident for the two-phase locking protocols. These concurrent queues acquire and release several read and write locks. Each time a lock is acquired or released, a semaphore must be acquired and later released. The task of acquiring and releasing so many semaphores explains these queues' lengthy running times. Finally, we come to the monitors. The monitor acquires a single semaphore for each object method, the fewest of any of the concurrent queues. This low overhead explains the monitor's low execution time of 38 seconds. The crowd monitor is implemented using a single lock. As with the two-phase locking protocols, the lock is acquired at the beginning of a method and released at the end. Thus, the crowd monitor contains more overhead than the monitor and, as a result, ran slower than the monitor implementation.
    Type of Queue                     1 Processor   8 Processors
    ------------------------------------------------------------
    Sequential                            1.4            1.4
    Handwritten using Eventcounts          40             60
    Conservative Two-phase Locking        167            198
    Liberal Two-phase Locking             243            291
    Monitor                                24             38
    Crowd Monitor                          69             95

    Table 1  Running Times (in seconds) of Handwritten and Automatically Generated Queues

The problems with memory and bus contention are not new. However, they do highlight the problem that occurs when the cost of the mutual exclusion mechanisms overwhelms the cost of the operation to be performed. This experiment illustrates why it can be advantageous to use the least expensive mutual exclusion mechanism for computationally inexpensive operations. Even though such a mechanism theoretically provides less concurrency, in practice it can provide more throughput. This experiment also shows how an automatically generated queue can outperform a handwritten one.

5.2 Search Structure

Next we address a data structure whose operations are computationally expensive and examine its performance using handwritten and automatically generated implementations. A search structure stores and retrieves data. It contains three methods. The method insert places data in the structure, delete removes data from the structure, and search explores the structure looking for data that match a set of search criteria.

Because search structures are in common use and their methods can be computationally expensive, it is useful to investigate their susceptibility to concurrency. In this experiment we compare a sequential implementation of a search structure and a handwritten concurrent structure to four search structures automatically generated from the sequential implementation. By making use of eventcounts and sequencers, the handwritten concurrent search structure is able to overlap the executions of the insert, delete, and search methods to a large extent. However, the automatically generated concurrent search structures tend to serialize the executions of the insert and delete methods.

For each of the experiments involving search structures, we present a table, like Table 2, portraying the results of the experiment. Each table shows the running times of threads that invoke search structure methods. The rows in the table show the results for a sequential search structure, a handwritten one using eventcounts and sequencers, followed by four concurrent search structures generated automatically by Concurra. The second column of numbers shows the running times, in seconds, on 8 processors. The third column of numbers indicates how fast a search structure ran in comparison to the handwritten search structure. Implementations with relative speeds of 1 ran as fast as the handwritten structure. Relative speeds less than 1 correspond to slower running implementations, and relative speeds greater than 1 indicate implementations that ran faster than the handwritten structure.
    Type of Data Structure            Running Time   Speed Relative to Handwritten
    -----------------------------------------------------------------------------
    Sequential                             164              0.47
    Handwritten using Eventcounts           77              1.00
    Conservative Two-phase Locking          80              0.96
    Liberal Two-phase Locking               77              1.00
    Monitor                                168              0.46
    Crowd Monitor                           78              0.99

    Table 2  Running Times (in seconds) when Probability of Read-only Request is 1

For example, Table 2 indicates that the speed of the monitor implementation was about half that of the handwritten implementation. That is, the monitor took twice as long to execute. On the other hand, the liberal two-phase locking implementation ran as fast as the handwritten implementation, as denoted by a 1 in the third column.

Read-only Method Occurs with Probability 1

Two of the search structure's methods, insert and delete, modify the structure. On the other hand, the search method only reads the structure; hence, we call search a read-only method. In our first table of results for the search structure, we forked 8 threads that invoked read-only methods of the search structure with probability 1. That is, the threads called the search method exclusively.

The search structure was already populated with a large amount of data. Searching for the data was computationally expensive so, unlike the previous queue data structure, the search structure's processing was not dominated by the acquiring and releasing of locks. Instead, the computation is dominated by searching for data, and the previous problems with bus and memory contention are avoided.

Table 2 shows the execution times of the search structure implementations. The probability of a read-only request occurring is 1. Because all 8 processors can search through the structure simultaneously, it follows that most of the implementations run as fast as the handwritten search structure. Consequently, the speeds of these implementations are close to or match the speed of the handwritten search structure. This fact is denoted by relative speeds of 1 or nearly 1 for the automatically generated search structures. The two exceptions are the sequential and monitor implementations. These implementations are slower because they serialize all the methods.

Read-only Probability of 0.75

Table 3 shows the results when the probability of invoking a read-only method of the search structure is 0.75. The handwritten search structure is able to overlap the executions of update methods and methods that only read the state of the structure. However, the automatically generated search structures do not perform as well. Automatically generated methods that update the data structure tend to become serialized.
    Type of Data Structure            Running Time   Speed Relative to Handwritten
    -----------------------------------------------------------------------------
    Sequential                             164              0.48
    Handwritten using Eventcounts           78              1.00
    Conservative Two-phase Locking         114              0.68
    Liberal Two-phase Locking              111              0.70
    Monitor                                170              0.46
    Crowd Monitor                          119              0.66

    Table 3  Running Times (in seconds) when Probability of Read-only Request is 0.75

Our results show that while the handwritten search structure runs about as fast as before, the performance of the automatically generated structures declines. The fastest of these implementations, using liberal two-phase locking, runs 0.70 times as fast as the handwritten structure, whereas it ran as fast as the handwritten structure when read-only methods were invoked exclusively.

Read-only Probabilities of 0.50, 0.25, and 0

In our final three experiments, we progressively reduced the probability of a thread invoking a read-only method of the search structure. As the probabilities decrease, the automatically generated implementations of the search structure become progressively serialized. Finally, when the probability reaches 0, all the automatically generated structures become fully serialized. All the automatically generated search structures run as slow as the sequential structure. The results of these experiments are depicted in Table 4.

    Read-only Probability 0.50
    Type of Data Structure            Running Time   Speed Relative to Handwritten
    -----------------------------------------------------------------------------
    Sequential                             165              0.49
    Handwritten using Eventcounts           80              1.00
    Conservative Two-phase Locking         139              0.58
    Liberal Two-phase Locking              134              0.60
    Monitor                                168              0.48
    Crowd Monitor                          143              0.56

    Read-only Probability 0.25
    Type of Data Structure            Running Time   Speed Relative to Handwritten
    -----------------------------------------------------------------------------
    Sequential                             164              0.49
    Handwritten using Eventcounts           80              1.00
    Conservative Two-phase Locking         164              0.49
    Liberal Two-phase Locking              158              0.51
    Monitor                                168              0.48
    Crowd Monitor                          157              0.51

    Read-only Probability 0
    Type of Data Structure            Running Time   Speed Relative to Handwritten
    -----------------------------------------------------------------------------
    Sequential                             164              0.49
    Handwritten using Eventcounts           80              1.00
    Conservative Two-phase Locking         166              0.48
    Liberal Two-phase Locking              165              0.49
    Monitor                                168              0.48
    Crowd Monitor                          165              0.49

    Table 4  Running Times (in seconds) when Probability of Read-only Request is 0.50, 0.25, and 0

Summarizing the Experimental Results

As reported above, when the probability that a read-only method will be invoked is low, the concurrent search structures generated by Concurra perform worse than the handwritten search structure. As this probability increases, so does the relative performance of the automatically generated search structures. In fact, when the probability of invoking a read-only method approaches 1, almost all of the automatically generated search structures perform as well as the handwritten search structure.

We document the relative speed of an automatically generated search structure using liberal two-phase locking in Figure 4. The x-axis represents the probability of a read-only method invocation. The y-axis represents the relative speed of this automatically generated search structure as compared to the handwritten search structure. To reiterate, a relative speed of 0.50 means the automatically generated search structure is half as fast (twice as slow) as the handwritten one. When the probability of invoking a read-only method is close to 0, the handwritten search structure handily outperforms the automatically generated one. Nevertheless, as the probability of invoking a read-only method increases, the relative performance of the automatically generated search structure increases exponentially.

    [Figure 4  Speed of Automatically Generated Search Structure Relative to
    Handwritten Search Structure (graph not reproduced in this text version)]

6. Conclusions

We have presented algorithms for inserting mutual exclusion and synchronization into sequential servers. We have built a tool that automatically inserts primitives for these mechanisms. We described the rationale, design, and implementation of the tool. Finally, we analyzed results from concurrent servers generated with the tool.
It is possible to achieve better speedup with handwritten servers, but the handwritten servers are not as fault tolerant as the ones generated using our tool; they do not provide failure atomicity when an exception is raised after the state of the server is modified. Handwritten concurrent servers also require substantially more man-hours to program and substantially more time to debug.

Even when our automatically generated concurrent search structure is outclassed by the handwritten concurrent search structure, we recall that the concurrent structure generated by our tool has synchronization automatically inserted, guarantees synchronization atomicity, and guarantees transaction atomicity. These advantages are important and should not be discounted. Moreover, for that class of servers where most of the client requests do not update the server's state, the automatically generated servers perform nearly as well as the handwritten server. For this class of servers, the slight gains in performance for handwritten servers should be weighed against the advantages of automatically constructing concurrent servers using Concurra.

Future work for Concurra includes investigating how we can automatically generate concurrent servers that perform well when the probability of a read-only request is low. While performing the experiments discussed in this paper, we noticed that Concurra does not generate highly concurrent code for common program fragments like the following.

    VAR values: ARRAY [1..Max] OF some_type;
    BEGIN
      (* find out which array element will be written to *)
      index := expression;
      (* perform some computationally expensive operation involving data *)
      (* in the "values" array and put the result in "solution" *)
      (* update the array element *)
      values[index] := solution;
    END;

Because Concurra locks entire arrays, a read lock on values is acquired before computing the solution. Unfortunately, the hierarchical ordering scheme that avoids deadlock forces Concurra to obtain a write lock on values when it acquires the read lock. Hence, the entire values array is write locked and this method becomes serialized. No two threads may simultaneously invoke this method because only one thread will acquire the write lock on values.

However, by employing the Banker's algorithm to avoid deadlock, Concurra could acquire locks on individual array elements. Additionally, it could delay acquiring the write lock on values[index] until the method was ready to write to values[index]. Such a scheme would allow two threads trying to write to different elements of values to execute concurrently.
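Under such a scheme, the fragment above might be transformed roughly as follows. This is purely illustrative: the per-element lock array, the write-lock operation names (modeled on Figure 3), and ExpensiveComputation are assumptions of the sketch, not output of the current tool.

    VAR
      values      : ARRAY [1..Max] OF some_type;
      value_locks : ARRAY [1..Max] OF Locks.T;  (* hypothetical: one lock per element *)
    BEGIN
      index := expression;
      (* Read lock only the element actually used during the
         expensive computation. *)
      Locks.acquire_readlock(value_locks[index]);
      solution := ExpensiveComputation(values[index]);
      (* The write lock is delayed until the method is ready to write;
         the Banker's algorithm, rather than a lexicographic ranking,
         would certify each runtime-indexed request as deadlock-free. *)
      Locks.acquire_writelock(value_locks[index]);
      values[index] := solution;
      Locks.release_writelock(value_locks[index]);
      Locks.release_readlock(value_locks[index]);
    END;

Two threads with different values of index could then proceed in parallel; only threads that collide on the same element would serialize.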
Further work in this area could allow Concurra to generate servers that rival the performance of handwritten servers even when the probability of invoking a read-only method is low.

Acknowledgments

We thank C. R. Snow of the University of Newcastle upon Tyne for the use and support of his Modula-3 threads package, and the University of Newcastle upon Tyne for the use of its Encore Multimax.

References

[1] A. V. Aho, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, 1986.

[2] G. R. Andrews. Concurrent Programming: Principles and Practice. Benjamin/Cummings, 1991.

[3] G. Barnes. A method for implementing lock-free shared data structures. In Proceedings of the 1993 ACM Symposium on Parallel Algorithms and Architectures (June 1993).

[4] L. Cardelli, J. Donahue, L. Glassman, M. Jordan, B. Kalsow, and G. Nelson. Modula-3 report (revised). Tech. Rep. 52, Systems Research Center, Digital Equipment Corporation, November 1989.

[5] E. W. Dijkstra. Co-operating sequential processes. In Programming Languages, F. Genuys, Ed. Academic Press, 1968, pp. 103-110.

[6] C. Donnelly and R. Stallman. BISON, the YACC-compatible Parser Generator, December 1991.

[7] K. P. Eswaran, J. N. Gray, R. A. Lorie, and I. L. Traiger. The notions of consistency and predicate locks in a database system. Communications of the ACM 19, 11 (November 1976), 624-633.

[8] R. A. Finkel. An Operating Systems Vade Mecum, 2nd ed. Prentice Hall, 1988.

[9] S. P. Harbison. Modula-3. Prentice Hall, 1992.

[10] J. W. Havender. Avoiding deadlock in multitasking systems. IBM Systems Journal 7, 2 (1968), 74-84.

[11] D. A. Hensgen, D. L. Sims, and D. Charley. A fair Banker's algorithm for read and write locks. Information Processing Letters (1993). To appear.

[12] M. Herlihy. A methodology for implementing highly concurrent data objects. Tech. Rep. 91-10, Cambridge Research Laboratory, Digital Equipment Corporation, October 1991.

[13] C. A. R. Hoare. Monitors: An operating system structuring concept. Communications of the ACM 17, 10 (October 1974), 549-557.

[14] R. C. Holt. Some deadlock properties of computer systems. ACM Computing Surveys 4, 3 (September 1972), 179-196.

[15] M. Jordan. An extensible programming environment for Modula-3. In Proceedings of the Fourth ACM SIGSOFT Symposium on Software Development Environments (December 1990). Also appeared in Software Engineering Notes, volume 15, number 6.

[16] B. Liskov and R. Scheifler. Guardians and actions: Linguistic support for robust, distributed programs. ACM Transactions on Programming Languages and Systems 5, 3 (July 1983), 381-404.

[17] H. Madduri and R. Finkel. Extension of the Banker's algorithm for resource allocation in a distributed system. Information Processing Letters 19, 1 (1984), 1-8.
[18] G. Nelson, Ed. Systems Programming with Modula-3. Prentice Hall, 1991.

[19] V. Paxson. FLEX, Fast Lexical Analyzer Generator.

[20] A. Podgurski and L. A. Clarke. A formal model of program dependences and its implications for software testing, debugging, and maintenance. IEEE Transactions on Software Engineering 16, 9 (September 1990), 965-979.

[21] D. P. Reed and R. K. Kanodia. Synchronization with eventcounts and sequencers. Communications of the ACM 22, 2 (February 1979), 115-123.

[22] D. J. Richardson, T. O. O'Malley, C. T. Moore, and S. L. Aha. Developing and integrating ProDAG in the Arcadia environment. In Proceedings of the Fifth ACM SIGSOFT Symposium on Software Development Environments (December 1992), pp. 109-119.

[23] D. L. Sims and D. A. Hensgen. Automatically mapping sequential objects to concurrent objects: The synchronization problem. Preprint, 1993.

[24] D. L. Sims and D. A. Hensgen. Automatically mapping sequential objects to concurrent objects: The mutual exclusion problem. In Proceedings of the 1993 International Conference on Parallel Processing (1993).