The following paper was originally published in the Proceedings of the USENIX SEDMS IV Conference (Experiences with Distributed and Multiprocessor Systems), San Diego, California, September 22-23, 1993.

For more information about the USENIX Association, contact:

1. Phone: 510 528-8649
2. FAX: 510 548-5738
3. Email: office@usenix.org
4. WWW URL: https://www.usenix.org

[This paper appears in "Proceedings of the 1993 Symposium on Experience with Distributed and Multiprocessor Systems", pages 195-211. Postscript and paper copies of the paper are available by sending email to any one of the authors (e.g. david.sims@uc.edu).]


Performance of Concurrent Servers Generated Automatically from Sequential Servers

David L. Sims
Debra A. Hensgen
Lantz Moore

Department of Electrical and Computer Engineering
University of Cincinnati
Cincinnati, Ohio, USA
david.sims@uc.edu
debra.hensgen@uc.edu
lmoore@leo.ece.uc.edu


Abstract

Concurrent servers in a multiprocessor system, concurrent graphical user interfaces, and operating systems containing concurrent objects are fairer on a uniprocessor and can yield better performance on a multiprocessor than their sequential counterparts. However, concurrent servers are more difficult to implement correctly. Our contribution is a tool that generates correct concurrent servers from correct sequential ones. The only restriction on the sequential servers is that they must be programmed in a slightly restricted subset of Modula-3 in the Generic Server Format using the Defensive Object Model. Our tool uses well known static analysis techniques from compiler theory to insert locks that are guaranteed to be free from deadlock. It uses information obtained from exception handling statements to insert synchronization primitives. The tool produces very readable Modula-3 programs that are subsequently compiled using standard Modula-3 compilers. In addition to describing the rationale, design, and implementation of our tool, this paper presents performance comparisons between hand-tuned and automatically generated queues and search structures. Moreover, we report on the lessons that were learned from automatically generating various concurrent servers.

1. Introduction

Concurrent servers in a multiprocessor system, concurrent graphical user interfaces, and operating systems containing concurrent objects are fairer on a uniprocessor and can yield better performance on a multiprocessor. However, concurrent objects are more difficult to implement correctly than their sequential counterparts, as evidenced by the abundance of research in parallel and distributed debugging. Programmers must determine where mutual exclusion is needed, where synchronization is needed, and how to ensure liveness, fairness, and absence of deadlock. A high degree of confidence in the concurrent implementation of a server can be obtained only through a formal proof of correctness or thorough testing. Thorough testing is extremely computationally complex due to non-determinism [2].

Our contribution is a tool that generates correct concurrent servers from correct sequential ones.
The only restriction on the servers is that they must be programmed in Modula-3 [4, 18], be written in the Generic Server Format, and use the Defensive Object Model. Our tool uses well known static analysis techniques from compiler theory to insert read and write locks that are guaranteed to be free from deadlock. It uses information obtained from exception handling statements to insert synchronization mechanisms. This paper describes the rationale for our tool as well as its design and implementation. We also report on the performance of the concurrent servers generated by the tool.

Our goal is to aid programmers whose main program fits the Generic Server Format shown in Figure 1. Programs usually written in this format include servers in a multiprocessor or distributed system awaiting request messages, graphical user interfaces awaiting mouse clicks, and operating systems awaiting service calls. Furthermore, microkernel operating systems are also collections of servers usually written in the Generic Server Format. Our goal is to allow the programmer to concentrate on implementing a correct, sequential version of such a program, where the Thread.Fork calls in Figure 1 have been replaced by procedure invocations. This sequential version is easier to implement because it frees the programmer from concerns about synchronization, mutual exclusion, and deadlock. Using our system, the sequential server must be developed, and then tested or verified. Then the sequential server can be automatically mapped to a correct, concurrent implementation.

    PROCEDURE Server() =
      VAR request: Request;
      BEGIN
        Initialize();
        LOOP
          ReceiveRequest(request);
          CASE request.type OF
          | RequestType1 => Thread.Fork(ServiceRequest1);
          | RequestType2 => Thread.Fork(ServiceRequest2);
            ...
          | RequestTypeN => Thread.Fork(ServiceRequestN);
          END; (* CASE *)
        END; (* LOOP *)
      END Server;

    Figure 1  Generic Server Format for a Concurrent Server

Our tool requires the programmer to use the Defensive Object Model of programming described in Section 2.2. After the programmer is satisfied with the correctness of the sequential version, our tool can be used to produce the corresponding concurrent version in two passes. The first pass inserts synchronization mechanisms, and the second pass inserts mutual exclusion mechanisms. The correctness of the algorithms that insert mutual exclusion and synchronization mechanisms depends only on the correct placement of these mechanisms, not on which mechanism is actually inserted. Our tool correctly solves the major problems associated with implementing a concurrent server: mutual exclusion without deadlock, and synchronization. It treats each of the forked threads shown in Figure 1 as a transaction in order to guarantee synchronization atomicity. We also ensure that all locks are released whether the service request fails or succeeds.
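For contrast, a minimal sketch of the sequential version that the programmer actually writes and tests is shown below; it differs from Figure 1 only in that each service routine is invoked directly rather than forked (the Request type and the ServiceRequest procedures are schematic, exactly as in Figure 1):

    PROCEDURE SequentialServer() =
      VAR request: Request;
      BEGIN
        Initialize();
        LOOP
          ReceiveRequest(request);
          CASE request.type OF
          | RequestType1 => ServiceRequest1();  (* direct call, no Thread.Fork *)
          | RequestType2 => ServiceRequest2();
            ...
          | RequestTypeN => ServiceRequestN();
          END; (* CASE *)
        END; (* LOOP *)
      END SequentialServer;

Because requests are serviced one at a time, this version can be tested or verified without reasoning about interleavings.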
Our tool is not the first one to handle some of these problems automatically. Database programmers are usually required to insert BeginTransaction and EndTransaction primitives around their transactions. Doing so would correspond to inserting BeginTransaction and EndTransaction statements around each Fork statement in the Generic Server Format. The database system then uses a dynamic scheduling technique, such as two-phase locking [7], to obtain locks as needed. If the DBMS does not use static scheduling, locks are not released until all locks have been obtained. Deadlock is possible in these systems, but it is usually detected, and transaction abortion releases the deadlock. Because these systems do not commit transactions until all locks have been obtained, there is no possibility that a transaction would be partially committed when it needed to be aborted.

Herlihy [12] applies this technique to sequential objects to make them concurrent, and, like the database systems, Herlihy's system requires that data structures be copied upon modification. His system is similar to techniques used in the distributed language ARGUS [16], which also has atomic object facilities. In the ARGUS system, locks are held until transactions are committed, which also requires copying all modified data. ARGUS and Herlihy's system require significantly more overhead than our system due to substantial copying overhead and because they do not insert cooperative synchronization mechanisms. However, Herlihy's solution is more fault tolerant than ours because it guarantees that even if processes try to hold mutual exclusion forever (or a very long time), progress will be made. Barnes [3] has improved upon Herlihy's technique, avoiding some of the copying. Our technique uses copying only when synchronization is required after the server's state has been modified.

Our tool differs from database systems in several major ways. Our tool analyzes the code statically and inserts the locking mechanisms once, rather than inserting them for every execution during runtime. Our tool differs from both typical database systems and Herlihy's technique because it does not allow deadlock to occur and it inserts cooperative synchronization mechanisms. Our tool uses well known compiler techniques, and known locking and synchronization mechanisms. The major contribution of our work lies in the verification of the correctness of the insertion algorithms, in deriving synchronization conditions from exception handling statements, and in reporting on the performance of our algorithms.

In the next section, we present the program model, including the Defensive Object Model. Then in Section 3, we present our design, that is, our algorithms: four for the insertion of mutual exclusion mechanisms and one for the insertion of synchronization mechanisms. We discuss the implementation and current status of our tool in Section 4. Section 5 gives results from experiments with sequential servers, hand-coded concurrent servers, and automatically generated concurrent servers. Finally, we present our conclusions in Section 6.

2. Program Model

We model servers using objects. An object is a set of procedures, or methods, that operates on a set of data. Each method implements a service offered to clients. The data, called object variables, are shared among all methods of an object, but other objects can access the data only through approved methods. Each method can contain local data (parameters and local variables) that are private.

Threads are asynchronous, sequential processes that execute concurrently with other threads. Each thread's environment is the same as its parent's environment, that is, the parent and child share data and code segments although the stack segments are independent. A client thread accesses the server by invoking methods of a server object.
Hence, in the most general case, several instances of each object's methods may be executing concurrently although individual object methods are sequential.

A concurrent server, then, is an object whose methods may execute simultaneously using threads. Conversely, a sequential server is an object whose methods execute one at a time. The programmer of a sequential object may assume that while a method is executing, no other method can execute.

2.1 Modula-3

The examples throughout this paper are expressed in Modula-3 [4, 18], a simple programming language that supports object-orientation, threads, synchronization, mutual exclusion, exception handling, and garbage collection. Since Modula-3 supports threads and synchronization, it neatly fits our server model.

Our concurrent object generator, Concurra, takes as input Modula-3 programs written using the Defensive Object Model, as explained in the next section. The notion of an object in Modula-3 is represented in Figure 2. The object's data are declared after the keywords BRANDED OBJECT. These data are not accessible from outside the object. They may be accessed only by the methods listed below the keyword OVERRIDES. In Figure 2 the method insert is implemented by the procedure my_insert. Procedures reference the object's data through the self parameter. For example, the object variable some_data is referenced via the expression self.some_data.

    REVEAL
      T = Public BRANDED OBJECT
        some_data : Some_Type;
        more_data : Some_Other_Type;
      OVERRIDES
        insert := my_insert;
        delete := my_delete;
      END;

    PROCEDURE my_insert(self : T; newvalue : Some_Type) =
      BEGIN ... END my_insert;

    PROCEDURE my_delete(self : T) : Some_Type =
      BEGIN ... END my_delete;

    Figure 2  Implementation of a Modula-3 Object

2.2 Defensive Object Model

We now present the Defensive Object Model that sequential objects must adhere to in order to guarantee the correctness of our transformations. The model that we describe is simply good programming practice, but we explicitly describe it here because our system depends upon it to correctly insert mutual exclusion and synchronization mechanisms.

When errors occur in object methods, they should be handled in an organized manner. Exceptions have been devised as a means to interrupt normal program flow when an error occurs and to recover from the error. When a Modula-3 method discovers an error condition, it should execute the statement RAISE ExceptionName, where ExceptionName has been previously declared as an exception. For example, if a method intends to insert elements into a table, but finds that the table is full, it might execute a RAISE TableFull statement to indicate the error condition to the caller of the method. Once an exception has been raised, the Modula-3 runtime system uses dynamic scoping rules to find an appropriate response, otherwise known as an exception handler, for the exception.

Untrusted clients make requests of servers. Good programming practice dictates that a server must neither make any assumptions about the parameters passed to it in a client request nor make assumptions about the state of the object when a client request arrives, except that the object state is consistent. A properly written server method always verifies that client requests are legal.

As an additional example, we consider a stack server. Typical methods include push and pop. When push and pop are invoked, they assume nothing about the state of the stack, except that it is consistent. Therefore, the pop method must check that the stack is not empty before proceeding with the rest of the method. If this requirement is not met, the server typically raises an exception to the client such as RAISE IsEmpty. The pop method cannot complete successfully unless the stack is non-empty.
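A minimal sketch of a pop method written in this defensive style follows; the stack representation (an items array and a top index) and the element type Elem are illustrative, not taken from the paper:

    EXCEPTION IsEmpty;

    PROCEDURE my_pop(self : T): Elem RAISES {IsEmpty} =
      VAR value: Elem;
      BEGIN
        (* Defend against an illegal request: verify the precondition
           instead of trusting the client. *)
        IF self.top = 0 THEN
          RAISE IsEmpty;
        END;
        DEC(self.top);
        value := self.items[self.top];
        RETURN value;
      END my_pop;

As the next section explains, it is precisely such RAISE sites that the tool later converts into synchronization conditions; here, the condition is self.top = 0.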
2.3 Synchronizing Methods

Sequential object methods never need to synchronize. In order to extract synchronization conditions from sequential methods, we rely on the sequential programmer to use exceptions when indicating error conditions. As described in the next section, we use the exception handling information within a method to generate synchronization conditions when constructing concurrent objects. Basically, a synchronization condition holds when an exception might be raised. As long as a synchronization condition holds, we consider the method unsynchronized and put the method to sleep. Later, when the synchronization condition no longer holds, the method awakens and proceeds normally.

3. Design

Our tool makes two passes over each correct sequential object. Both passes use well known compiler technology, which we note where it is used. In the first pass, our tool inserts synchronization mechanisms; we describe the algorithm that we use to insert them below. In the second pass, mutual exclusion mechanisms are inserted. In the last part of this section, we describe four specific algorithms for inserting mutual exclusion mechanisms, a modification that can be applied to all four, and a deadlock avoidance algorithm. Different algorithms will yield better speedup with different objects. The algorithms differ in the amount of overhead they impose as well as the amount of concurrency they permit.

Our system makes use of some well known techniques from compiler theory. For brevity, we do not describe those techniques but refer the reader to appropriate references on definition-use graphs [1] and data flow dependence [20]. These algorithms can be made very efficient [22, 15].

3.1 Synchronization

Sequential objects are not normally thought of as having to perform synchronization because there are no concurrently executing threads of control. In this section we give a general overview of our algorithms used to insert synchronization mechanisms. More details can be found in another paper [23].

The algorithm determines the conditions that cause exceptions to be raised. However, exceptions that are raised due to bad parameters, such as a value out of range, do not become synchronization conditions. The algorithm inserts statements at the beginning of a method to determine whether these synchronization conditions are true. It also inserts statements to wait for these conditions to become false in case they are found to be true. In the above mentioned paper [23], we consider two cases. The first case is when no object variables are modified before an exception is raised. This case is straightforward. We also consider the case where some object variables are modified before an exception can be raised. The problem here is that, once mutual exclusion has been added, write locks will be obtained to modify the object's state. If we allow the write locks to be held while the method waits to synchronize, deadlock can occur. If we release the locks, inconsistent states can be seen and synchronization atomicity would be violated. Therefore, we choose a database system solution. We make copies of those variables, storing the modified versions in temporary variables. Once the method becomes synchronized, we transfer these temporary values back to the original variables. In this way, we guarantee failure atomicity.
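Section 4 notes that the tool does not yet generate this second case, but the intended shape of the output can be sketched by analogy with Figure 3. In the hypothetical method below, the object variable count is written before the point at which an exception could be raised, so the generated code shadows it in a temporary and commits the temporary only after synchronizing. This is a schematic of the buffering pattern only: the names my_take and count are illustrative, and the lock acquisition and release that Figure 3 would interleave here are elided.

    PROCEDURE my_take(self: T): QType =
      VAR
        tmp_count: INTEGER;  (* shadow copy of the modified object variable *)
        value: QType;
      BEGIN
        (* The sequential method decremented self.count here; the
           generated code modifies a private copy instead, so no
           partial update is ever visible to other threads. *)
        tmp_count := self.count - 1;
        (* Synchronization condition derived from the RAISE site;
           locks would be released before each wait, as in Figure 3. *)
        WHILE self.head = self.tail DO
          Thread.Wait(self.Condition);
        END;
        value := self.values[self.head];
        self.head := (self.head + 1) MOD MAX;
        self.count := tmp_count;  (* synchronized: commit the shadow copy *)
        RETURN value;
      END my_take;

Keeping the committed copy consistent with concurrent updates made while the method sleeps is exactly what makes this the hard case.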
3.2 Mutual Exclusion

In this section we consider several mutual exclusion algorithms and show how to automatically map each one onto a sequential object. The resulting concurrent object has no points of interference. We take a conservative approach when inserting mutual exclusion primitives. That is, our algorithms do not take into account the possibility that the server may not actually invoke methods that could interfere with each other. If two methods can potentially interfere with each other, then they are prevented from doing so under all circumstances. We give an overview of the algorithms here and refer the interested reader to another paper [24].

Each of the following algorithms allows no less, and in most cases more, concurrency than the straightforward technique of enclosing each object in a monitor. While monitors prevent interference in concurrent objects, they also prevent methods from overlapping with one another. If two methods do not both need to atomically read or update an object variable, they can overlap their executions.

The first two algorithms are used to obtain fine-grained locking while avoiding deadlock. They differ in the amount of overhead due to locking and deadlock avoidance, as well as the amount of concurrency that they permit. The next two algorithms use a very coarse-grained locking technique. Finally, we give an example illustrating the synchronization mechanism and one of the mutual exclusion techniques.

Conservative Two-phase Locking

This insertion algorithm is based on traditional two-phase locking [7] and hierarchical ranking of locks [10]. The tool inspects the assignment and conditional statements in each object method to determine which object variables must be read or write locked. Requests for locks are inserted at the beginning of the method according to the hierarchical ranking and released near the end to assure atomicity.
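Applied to the sequential dequeue method shown later in Figure 3, conservative two-phase locking yields code of roughly the following shape: every lock the method could need is requested up front, in hierarchical (here lexicographic) order, and all are released at the end. This is a minimal sketch; the Locks interface is the one that appears in Figure 3, and the synchronization pass is omitted for brevity.

    PROCEDURE dequeue (self: T): QType =
      VAR value: QType;
      BEGIN
        (* All locks acquired at the beginning, in lexicographic
           rank order: head, then tail, then values. *)
        Locks.acquire_read_write_locks(self.lock_head);
        Locks.acquire_readlock(self.lock_tail);
        Locks.acquire_readlock(self.lock_values);
        TRY
          IF self.head = self.tail THEN
            RAISE queue_empty;
          END;
          value := self.values[self.head];
          self.head := (self.head + 1) MOD MAX;
          RETURN value;
        FINALLY
          (* Released only near the end, assuring atomicity. *)
          Locks.release_read_write_locks(self.lock_head);
          Locks.release_readlock(self.lock_tail);
          Locks.release_readlock(self.lock_values);
        END;
      END dequeue;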
Liberal Two-phase Locking

This insertion algorithm is similar to conservative two-phase locking. After inspecting each method's assignment and conditional statements, the tool inserts requests for locks before each object variable is used, unless an appropriate lock has already been acquired. Locks are held until an object variable has been used for the last time to assure synchronization atomicity. Read locks are acquired for object variables that are not modified. Locks are requested in a hierarchical order to avoid deadlock; hence, the tool must move some lock requests upward in the method to ensure that locks are acquired in the hierarchical ordering. Conditional statements present additional problems. If a lock L is needed inside a conditional statement and is hierarchically ordered before another lock requested before the conditional statement, lock L is requested before the first lock. It is possible that the conditional branch will not be taken and lock L will have been acquired needlessly. Care is taken to release all locks. Modula-3's TRY ... FINALLY statements are used to ensure that all locks are released in the event of failure of a method.

Another difference between these two-phase locking protocols is that the liberal technique permits write locks to be downgraded to read locks based upon knowledge derived from dependency graphs. Downgrading locks allows more concurrency since write locks can be downgraded before the end of a method. However, it also carries extra overhead, as eventually the downgraded locks must be completely released.

Monitors

The monitor is another well known algorithm that is used to build concurrent objects [13]. Implementing a monitor is straightforward. We associate a semaphore with the object. Then each object method must acquire the semaphore before executing. Thus, at most one method may be executing at any one time.

Crowd Monitors

The crowd monitor generalizes the monitor to allow some object methods to overlap their execution [8]. We use a slight simplification of crowd monitors. First, we associate a single lock with the object. Then if an object method might modify an object variable, the method must acquire a write lock before it begins execution. The write lock ensures that no other method will execute concurrently with the method that owns the write lock. On the other hand, if a method only reads object variables, then it must acquire a read lock before execution. Because several threads can hold the read lock, some object methods will be able to overlap their executions and increase the concurrency in the object.
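In standard Modula-3 (as opposed to the Locks package our tool uses), this simplified crowd monitor amounts to a single readers-writer lock: read-only methods bracket their bodies with enterRead/exitRead, updating methods with enterWrite/exitWrite, and an ordinary monitor is the degenerate case in which every method enters as a writer. A minimal sketch, with an illustrative CrowdLock type that is not part of the tool:

    TYPE
      CrowdLock = MUTEX OBJECT
        readers : INTEGER := 0;      (* active read-only methods *)
        writer  : BOOLEAN := FALSE;  (* is an updating method active? *)
        changed : Thread.Condition;
      METHODS
        enterRead()  := EnterRead;
        exitRead()   := ExitRead;
        enterWrite() := EnterWrite;
        exitWrite()  := ExitWrite;
      END;

    PROCEDURE EnterRead(self: CrowdLock) =
      BEGIN
        LOCK self DO
          WHILE self.writer DO Thread.Wait(self, self.changed) END;
          INC(self.readers);
        END;
      END EnterRead;

    PROCEDURE ExitRead(self: CrowdLock) =
      BEGIN
        LOCK self DO
          DEC(self.readers);
          IF self.readers = 0 THEN Thread.Broadcast(self.changed) END;
        END;
      END ExitRead;

    PROCEDURE EnterWrite(self: CrowdLock) =
      BEGIN
        LOCK self DO
          (* No writer priority; adequate as a sketch. *)
          WHILE self.writer OR self.readers > 0 DO
            Thread.Wait(self, self.changed)
          END;
          self.writer := TRUE;
        END;
      END EnterWrite;

    PROCEDURE ExitWrite(self: CrowdLock) =
      BEGIN
        LOCK self DO
          self.writer := FALSE;
          Thread.Broadcast(self.changed);
        END;
      END ExitWrite;

A server object would allocate one such lock at creation, e.g. lck := NEW(CrowdLock, changed := NEW(Thread.Condition)), since a condition variable cannot be given as a field default.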
Combining Locks

We can parameterize our tool so that, using any of the above algorithms, we can reduce overhead due to lock acquisition and release. We give our tool a percentage. If locks for two or more object variables are obtained together a percentage of the time at least as great as our parameter, we obtain just a single lock, associated with that set of object variables, in the event that any of the object variables are used. This merging of object variables to use a single lock cuts down on the locking overhead, but also on some amount of concurrency.

Banker's Algorithm

The Banker's algorithm is a well known algorithm for preventing deadlock [5]. Since static analysis of methods tells us all of the locks that could be used, we can apply this algorithm. Again, this algorithm increases both concurrency and the overhead due to deadlock avoidance, but optimized versions of the algorithm mitigate this problem [14, 17]. The Banker's algorithm originally applied to resources that were held exclusively. An extension of the Banker's algorithm that uses read and write locks to permit shared and exclusive access to resources (in our case, object variables) is used in our tool [11].
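For reference, the heart of the classical (exclusive-resource) Banker's algorithm is the safety check sketched below: a request is granted only if, assuming it were granted, every thread could still acquire its declared maximum and run to completion. This sketch is illustrative only; Concurra uses the read/write-lock extension of [11], not this exclusive form, and the sizes and matrix layout are assumptions of the sketch.

    CONST NThreads = 8; NLocks = 3;  (* illustrative sizes *)

    TYPE
      Vec = ARRAY [0 .. NLocks - 1] OF INTEGER;
      Mat = ARRAY [0 .. NThreads - 1] OF Vec;

    (* avail[l] = free units of lock l (passed by value; mutated locally),
       alloc[t] = units thread t currently holds,
       need[t]  = units thread t may still request. *)
    PROCEDURE Safe (avail: Vec; READONLY alloc, need: Mat): BOOLEAN =
      VAR
        done: ARRAY [0 .. NThreads - 1] OF BOOLEAN;
        progress := TRUE;
        ok: BOOLEAN;
      BEGIN
        FOR t := 0 TO NThreads - 1 DO done[t] := FALSE END;
        WHILE progress DO
          progress := FALSE;
          FOR t := 0 TO NThreads - 1 DO
            IF NOT done[t] THEN
              ok := TRUE;
              FOR l := 0 TO NLocks - 1 DO
                IF need[t][l] > avail[l] THEN ok := FALSE END;
              END;
              IF ok THEN
                (* Thread t could finish and return what it holds. *)
                FOR l := 0 TO NLocks - 1 DO
                  avail[l] := avail[l] + alloc[t][l];
                END;
                done[t] := TRUE;
                progress := TRUE;
              END;
            END;
          END;
        END;
        FOR t := 0 TO NThreads - 1 DO
          IF NOT done[t] THEN RETURN FALSE END;
        END;
        RETURN TRUE;
      END Safe;

A lock request is granted only when Safe holds on the state that would result from granting it; otherwise the requesting thread waits.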
An Example

Finally, we illustrate the use of some of the synchronization and mutual exclusion mechanisms with a simple example. Figure 3 depicts an abbreviated sequential queue object written in Modula-3, followed by the same queue after it has been transformed using the synchronization and liberal two-phase locking mechanisms. (In the original two-column figure the two versions appear side by side, with the automatically inserted statements in boldface; in the plain-text listing below, the inserted statements are the Locks and Thread calls together with the TRY ... FINALLY frame.)

    MODULE Queue;

    CONST MAX = 100;

    EXCEPTION queue_empty;

    TYPE ...

    REVEAL
      T = Public BRANDED OBJECT
        values: ARRAY [0..MAX - 1] OF QType;
        head, tail: INTEGER := 0;
      OVERRIDES
        dequeue := dequeue;
      END;

    PROCEDURE dequeue (self: T): QType =
      VAR value: QType;
      BEGIN
        IF self.head = self.tail THEN
          RAISE queue_empty;
        END;
        value := self.values[self.head];
        self.head := (self.head + 1) MOD MAX;
        RETURN value;
      END dequeue;


    MODULE CQueue;
    IMPORT Locks, Thread;

    CONST MAX = 100;

    EXCEPTION queue_empty;

    TYPE ...

    REVEAL
      T = Public BRANDED OBJECT
        values: ARRAY [0..MAX - 1] OF QType;
        head, tail: INTEGER := 0;
        lock_head, lock_tail, lock_values: Locks.T;
        Condition: Thread.Condition;
      OVERRIDES
        dequeue := dequeue;
      END;

    PROCEDURE dequeue (self: T): QType =
      VAR value: QType;
      BEGIN
        TRY
          (* perform synchronization *)
          Locks.acquire_read_write_locks(self.lock_head);
          Locks.acquire_readlock(self.lock_tail);
          WHILE self.head = self.tail DO
            Locks.release_read_write_locks(self.lock_head);
            Locks.release_readlock(self.lock_tail);
            Thread.Wait(self.Condition);
            Locks.acquire_read_write_locks(self.lock_head);
            Locks.acquire_readlock(self.lock_tail);
          END;
          (* procedure is now synchronized *)
          IF self.head = self.tail THEN
            RAISE queue_empty;
          END;
          Locks.acquire_readlock(self.lock_values);
          Locks.release_readlock(self.lock_tail);
          value := self.values[self.head];
          Locks.release_readlock(self.lock_values);
          self.head := (self.head + 1) MOD MAX;
          Locks.release_writelock(self.lock_head);
          Locks.release_readlock(self.lock_head);
          RETURN value;
        (* make sure all locks are released *)
        (* and signal any waiting threads *)
        FINALLY
          Locks.release_read_write_locks(self.lock_head);
          Locks.release_read_write_locks(self.lock_tail);
          Locks.release_read_write_locks(self.lock_values);
          Thread.Signal(self.Condition);
        END;
      END dequeue;

    Figure 3  Sequential Queue Transformed to Concurrent Queue

4. Status of the Tool

Our tool processes a sequential object written in Modula-3 [4, 18]. The tool statically analyzes each object method to determine what mutual exclusion and synchronization primitives need to be added to the object to guarantee synchronization atomicity. In this section we review some of the limitations of the tool and directions for future work.

To date our tool implements the conservative and liberal two-phase locking schemes as well as monitors and crowd monitors for mutual exclusion. In addition, synchronization conditions are generated as long as the object state is never modified before an exception could be raised.

The tool's static analysis of objects proceeds in a number of steps. First, the tools Flex [19] and Bison [6] are used to generate a parser for the Modula-3 source file and create a definition-use tree. A definition-use tree is simply a parse tree where each node in the parse tree lists the variables that are written (defined) and read (used) at that node.

The input to the tool is a Modula-3 source file that consists of the definition of a single object. All the object's methods must be defined in the same file. Parameters to each method must be passed by value. The tool supports all legal Modula-3 data types, including records and arrays, except references (pointers). References pose unusual problems, but we see no obstacle in supporting them in the near future.

Several, but not all, Modula-3 statements are supported. These statements include assignment statements, IF-THEN-ELSE, WHILE, EXIT-LOOP, and RETURN statements. In the current version of our system, object methods are not permitted to invoke other procedures or methods. There are some additional obstacles in ensuring mutual exclusion without deadlock and synchronization when invoking other procedures. Solutions to these problems are currently under investigation. Nevertheless, the statements that are currently supported are sufficient for writing a wide variety of useful servers.

The locking granularity is restricted to coarse-grained locking. Locks are obtained on entire variables whether they are integers, records, or arrays. Locks are acquired according to a lexicographic hierarchical ordering to avoid deadlock. Since arrays are indexed by variables whose values are not known until runtime, it is difficult to hierarchically order lock requests on array elements; hence, we lock entire arrays. We are investigating techniques for achieving a finer grain of locking in order to increase the concurrency in the objects that the tool generates.

Our tool does not support synchronization when it is possible for the object's state to be modified before an exception is raised. Although we will eventually lift this restriction, it is not as harsh as it may sound. It is considered risky programming practice to modify the state of an object and subsequently raise an exception [9].

5. Experiments

To test the effectiveness of our tool, we compare six implementations for each of two data structures. The first data structure is the queue, whose methods include enqueue, dequeue, and front_of_queue. The other data structure is the search structure, which includes the methods insert, delete, and search. We implemented these data structures manually, using eventcounts and sequencers [21], and with our tool using conservative two-phase locking, liberal two-phase locking, monitors, and crowd monitors. Our evaluations were conducted on an 8-processor Encore Multimax, a shared-memory, bus-architecture multiprocessor. Each experiment consists of forking 8 threads that repeatedly invoke methods on an object. The object represents the server, each thread represents a client, and threads invoking methods represent client requests.

5.1 Queue

First, we developed a common implementation of a sequential queue of integers using an array. Since integers are usually the same size as pointers, this experiment also shows the effects of a queue of pointers, where each queue element points to a memory block of arbitrary size. Next, our tool was used to automatically generate several concurrent versions of the queue. Finally, a concurrent queue based on eventcounts and sequencers [21] was developed manually. All the queues are written in Modula-3. In the first experiment each of the 8 threads first enqueues 1,000 integers. Next, each thread invokes front_of_queue 8,000 times, which simply returns the element at the front of the queue without dequeuing it. Finally, each thread dequeues 1,000 values. The front_of_queue method does not modify the object, allowing our tool to generate objects that allow several instances of front_of_queue to run simultaneously.
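This workload can be paraphrased as the following driver. It is a sketch of our reading of the setup, not the authors' harness, and it assumes a Queue.T object with enqueue, front_of_queue, and dequeue methods, in the style of Figure 3:

    MODULE Driver EXPORTS Main;
    IMPORT Thread, Queue;

    TYPE
      Worker = Thread.Closure OBJECT
        q: Queue.T;
      OVERRIDES
        apply := Run;
      END;

    PROCEDURE Run (self: Worker): REFANY =
      BEGIN
        FOR i := 1 TO 1000 DO self.q.enqueue(i) END;             (* fill *)
        FOR i := 1 TO 8000 DO EVAL self.q.front_of_queue() END;  (* read-only *)
        FOR i := 1 TO 1000 DO EVAL self.q.dequeue() END;         (* drain *)
        RETURN NIL;
      END Run;

    VAR
      q := NEW(Queue.T);
      threads: ARRAY [0 .. 7] OF Thread.T;  (* one client per processor *)
    BEGIN
      FOR i := FIRST(threads) TO LAST(threads) DO
        threads[i] := Thread.Fork(NEW(Worker, q := q));
      END;
      FOR i := FIRST(threads) TO LAST(threads) DO
        EVAL Thread.Join(threads[i]);  (* timing would bracket the forks and joins *)
      END;
    END Driver.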
Since the queue's elements are small, the overhead for the concurrency mechanisms is quite significant. In all cases the amount of processor time required for each queue's concurrency mechanism is far greater than the actual time spent in performing queuing work. Because each queue operation is computationally inexpensive, each thread spends a significant amount of time acquiring and releasing locks. The process of acquiring and releasing a lock involves reading and writing a single memory location. Since all 8 threads acquire and release the same locks, the threads perform a large number of read and write operations to a small number of memory locations. These activities create a bottleneck, resulting in memory and bus contention. As a consequence of this bottleneck, throughput slows significantly. In fact, as the timings depicted in Table 1 show, the 8 threads run faster executing on 1 processor than on 8 processors. With fewer processors, there is less memory contention, less bus contention, and fewer executions of the cache coherence protocol.

Table 1 gives the times, in seconds, that each thread spent in executing all 10,000 object operations, multiplied by 8 threads. The sequential queue ran in just 1.4 seconds. However, that time is deceptive. The sequential queue is not a concurrent server because it does not provide synchronization atomicity. The sequential queue's time appears as a benchmark for comparison against the execution times of the concurrent queues. The handwritten queue ran in 60 seconds. The overhead in acquiring and releasing semaphores is a significant percentage of the total running time. This overhead is especially evident for the two-phase locking protocols. These concurrent queues acquire and release several read and write locks. Each time a lock is acquired or released, a semaphore must be acquired and later released. The task of acquiring and releasing so many semaphores explains these queues' lengthy running times. Finally, we come to the monitors. The monitor acquires a single semaphore for each object method, the fewest of any of the concurrent queues. This low overhead explains the monitor's low execution time of 38 seconds. The crowd monitor is implemented using a single lock. As with the two-phase locking protocols, the lock is acquired at the beginning of a method and released at the end. Thus, the crowd monitor contains more overhead than the monitor and, as a result, ran slower than the monitor implementation.
    Type of Queue                     1 Processor   8 Processors
    ------------------------------------------------------------
    Sequential                            1.4            1.4
    Handwritten using Eventcounts          40             60
    Conservative Two-phase Locking        167            198
    Liberal Two-phase Locking             243            291
    Monitor                                24             38
    Crowd Monitor                          69             95

    Table 1  Running Times (in seconds) of Handwritten and Automatically Generated Queues

The problems with memory and bus contention are not new. However, they do highlight the problem that occurs when the cost of the mutual exclusion mechanisms overwhelms the cost of the operation to be performed. This experiment illustrates why it can be advantageous to use the least expensive mutual exclusion mechanism for computationally inexpensive operations. Even though such a mechanism theoretically provides less concurrency, in practice it can provide more throughput. This experiment also shows how an automatically generated queue can outperform a handwritten one.

5.2 Search Structure

Next we address a data structure whose operations are computationally expensive and examine its performance using handwritten and automatically generated implementations. A search structure stores and retrieves data. It contains three methods. The method insert places data in the structure, delete removes data from the structure, and search explores the structure looking for data that match a set of search criteria.

Because search structures are in common use and their methods can be computationally expensive, it is useful to investigate their susceptibility to concurrency. In this experiment we compare a sequential implementation of a search structure and a handwritten concurrent structure to four search structures automatically generated from the sequential implementation. By making use of eventcounts and sequencers, the handwritten concurrent search structure is able to overlap the executions of the insert, delete, and search methods to a large extent. However, the automatically generated concurrent search structures tend to serialize the executions of the insert and delete methods.

For each of the experiments involving search structures, we present a table, like Table 2, portraying the results of the experiment. Each table shows the running times of threads that invoke search structure methods. The rows in the table show the results for a sequential search structure, a handwritten one using eventcounts and sequencers, followed by four concurrent search structures generated automatically by Concurra. The second column of numbers shows the running times, in seconds, on 8 processors. The third column of numbers indicates how fast a search structure ran in comparison to the handwritten search structure. Implementations with relative speeds of 1 ran as fast as the handwritten structure. Relative speeds less than 1 correspond to slower running implementations, and relative speeds greater than 1 indicate implementations that ran faster than the handwritten structure.
    Type of Data Structure            Running Time   Speed Relative to Handwritten
    -----------------------------------------------------------------------------
    Sequential                             164              0.47
    Handwritten using Eventcounts           77              1.00
    Conservative Two-phase Locking          80              0.96
    Liberal Two-phase Locking               77              1.00
    Monitor                                168              0.46
    Crowd Monitor                           78              0.99

    Table 2  Running Times (in seconds) when Probability of Read-only Request is 1

For example, Table 2 indicates that the speed of the monitor implementation was about half that of the handwritten implementation. That is, the monitor took twice as long to execute. On the other hand, the liberal two-phase locking implementation ran as fast as the handwritten implementation, as denoted by a 1 in the third column.

Read-only Method Occurs with Probability 1

Two of the search structure's methods, insert and delete, modify the structure. On the other hand, the search method only reads the structure; hence, we call search a read-only method. In our first table of results for the search structure, we forked 8 threads that invoked read-only methods of the search structure with probability 1. That is, the threads called the search method exclusively.

The search structure was already populated with a large amount of data. Searching for the data was computationally expensive so, unlike the previous queue data structure, the search structure's processing was not dominated by the acquiring and releasing of locks. Instead, the computation is dominated by searching for data, and the previous problems with bus and memory contention are avoided.

Table 2 shows the execution times of the search structure implementations. The probability of a read-only request occurring is 1. Because all 8 processors can search through the structure simultaneously, it follows that most of the implementations run as fast as the handwritten search structure. Consequently, the speeds of these implementations are close to or match the speed of the handwritten search structure. This fact is denoted by relative speeds of 1 or nearly 1 for the automatically generated search structures. The two exceptions are the sequential and monitor implementations. These implementations are slower because they serialize all the methods.

Read-only Probability of 0.75

Table 3 shows the results when the probability of invoking a read-only method of the search structure is 0.75. The handwritten search structure is able to overlap the executions of update methods and methods that only read the state of the structure. However, the automatically generated search structures do not perform as well. Automatically generated methods that update the data structure tend to become serialized.
    Type of Data Structure            Running Time   Speed Relative to Handwritten
    -----------------------------------------------------------------------------
    Sequential                             164              0.48
    Handwritten using Eventcounts           78              1.00
    Conservative Two-phase Locking         114              0.68
    Liberal Two-phase Locking              111              0.70
    Monitor                                170              0.46
    Crowd Monitor                          119              0.66

    Table 3  Running Times (in seconds) when Probability of Read-only Request is 0.75

Our results show that while the handwritten search structure runs about as fast as before, the performance of the automatically generated structures declines. The fastest of these implementations, using liberal two-phase locking, runs 0.70 times as fast as the handwritten structure, whereas it ran as fast as the handwritten structure when read-only methods were invoked exclusively.

Read-only Probabilities of 0.50, 0.25, and 0

In our final three experiments, we progressively reduced the probability of a thread invoking a read-only method of the search structure. As the probabilities decrease, the automatically generated implementations of the search structure become progressively serialized. Finally, when the probability reaches 0, all the automatically generated structures become fully serialized. All the automatically generated search structures run as slow as the sequential structure. The results of these experiments are depicted in Table 4.

    Read-only Probability 0.50
    Type of Data Structure            Running Time   Speed Relative to Handwritten
    -----------------------------------------------------------------------------
    Sequential                             165              0.49
    Handwritten using Eventcounts           80              1.00
    Conservative Two-phase Locking         139              0.58
    Liberal Two-phase Locking              134              0.60
    Monitor                                168              0.48
    Crowd Monitor                          143              0.56

    Read-only Probability 0.25
    Type of Data Structure            Running Time   Speed Relative to Handwritten
    -----------------------------------------------------------------------------
    Sequential                             164              0.49
    Handwritten using Eventcounts           80              1.00
    Conservative Two-phase Locking         164              0.49
    Liberal Two-phase Locking              158              0.51
    Monitor                                168              0.48
    Crowd Monitor                          157              0.51

    Read-only Probability 0
    Type of Data Structure            Running Time   Speed Relative to Handwritten
    -----------------------------------------------------------------------------
    Sequential                             164              0.49
    Handwritten using Eventcounts           80              1.00
    Conservative Two-phase Locking         166              0.48
    Liberal Two-phase Locking              165              0.49
    Monitor                                168              0.48
    Crowd Monitor                          165              0.49

    Table 4  Running Times (in seconds) when Probability of Read-only Request is 0.50, 0.25, and 0

Summarizing the Experimental Results

As reported above, when the probability that a read-only method will be invoked is low, the concurrent search structures generated by Concurra perform worse than the handwritten search structure. As this probability increases, so does the relative performance of the automatically generated search structures. In fact, when the probability of invoking a read-only method approaches 1, almost all of the automatically generated search structures perform as well as the handwritten search structure.

We document the relative speed of an automatically generated search structure using liberal two-phase locking in Figure 4. The x-axis represents the probability of a read-only method invocation. The y-axis represents the relative speed of this automatically generated search structure as compared to the handwritten search structure. To reiterate, a relative speed of 0.50 means the automatically generated search structure is half as fast (twice as slow) as the handwritten one. When the probability of invoking a read-only method is close to 0, the handwritten search structure handily outperforms the automatically generated one. Nevertheless, as the probability of invoking a read-only method increases, the relative performance of the automatically generated search structure increases exponentially.

    [Figure 4  Speed of Automatically Generated Search Structure Relative to
    Handwritten Search Structure (graph not reproduced in this text version)]

6. Conclusions

We have presented algorithms for inserting mutual exclusion and synchronization into sequential servers. We have built a tool that automatically inserts primitives for these mechanisms. We described the rationale, design, and implementation of the tool. Finally, we analyzed results from concurrent servers generated with the tool.
It is possible to achieve better speedup with handwritten servers, but the handwritten servers are not as fault tolerant as the ones generated using our tool; they do not provide failure atomicity when an exception is raised after the state of the server is modified. Handwritten concurrent servers also require substantially more man-hours to program and substantially more time to debug.

Even when our automatically generated concurrent search structure is outclassed by the handwritten concurrent search structure, we recall that the concurrent structure generated by our tool has synchronization automatically inserted, guarantees synchronization atomicity, and guarantees transaction atomicity. These advantages are important and should not be discounted. Moreover, for that class of servers where most of the client requests do not update the server's state, the automatically generated servers perform nearly as well as the handwritten server. For this class of servers, the slight gains in performance for handwritten servers should be weighed against the advantages of automatically constructing concurrent servers using Concurra.

Future work for Concurra includes investigating how we can automatically generate concurrent servers that perform well when the probability of a read-only request is low. While performing the experiments discussed in this paper, we noticed that Concurra does not generate highly concurrent code for common program fragments like the following.

    VAR values: ARRAY [1..Max] OF some_type;
    BEGIN
      (* find out which array element will be written to *)
      index := expression;
      (* perform some computationally expensive operation involving data *)
      (* in the "values" array and put the result in "solution" *)
      (* update the array element *)
      values[index] := solution;
    END;

Because Concurra locks entire arrays, a read lock on values is acquired before computing the solution. Unfortunately, the hierarchical ordering scheme that avoids deadlock forces Concurra to obtain a write lock on values when it acquires the read lock. Hence, the entire values array is write locked and this method becomes serialized. No two threads may simultaneously invoke this method because only one thread will acquire the write lock on values.

However, by employing the Banker's algorithm to avoid deadlock, Concurra could acquire locks on individual array elements. Additionally, it could delay acquiring the write lock on values[index] until the method was ready to write to values[index]. Such a scheme would allow two threads trying to write to different elements of values to execute concurrently.
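Under such a scheme, the fragment above might be transformed roughly as follows. This is purely illustrative: the per-element lock array, the write-lock operation names (modeled on Figure 3), and ExpensiveComputation are assumptions of the sketch, not output of the current tool.

    VAR
      values      : ARRAY [1..Max] OF some_type;
      value_locks : ARRAY [1..Max] OF Locks.T;  (* hypothetical: one lock per element *)
    BEGIN
      index := expression;
      (* Read lock only the element actually used during the
         expensive computation. *)
      Locks.acquire_readlock(value_locks[index]);
      solution := ExpensiveComputation(values[index]);
      (* The write lock is delayed until the method is ready to write;
         the Banker's algorithm, rather than a lexicographic ranking,
         would certify each runtime-indexed request as deadlock-free. *)
      Locks.acquire_writelock(value_locks[index]);
      values[index] := solution;
      Locks.release_writelock(value_locks[index]);
      Locks.release_readlock(value_locks[index]);
    END;

Two threads with different values of index could then proceed in parallel; only threads that collide on the same element would serialize.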
Further work in this area could allow Concurra to generate servers that rival the performance of handwritten servers even when the probability of invoking a read-only method is low.

Acknowledgments

We thank C. R. Snow of the University of Newcastle upon Tyne for the use and support of his Modula-3 threads package, and the University of Newcastle upon Tyne for the use of its Encore Multimax.

References

[1] A. V. Aho, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, 1986.

[2] G. R. Andrews. Concurrent Programming: Principles and Practice. Benjamin/Cummings, 1991.

[3] G. Barnes. A method for implementing lock-free shared data structures. In Proceedings of the 1993 ACM Symposium on Parallel Algorithms and Architectures (June 1993).

[4] L. Cardelli, J. Donahue, L. Glassman, M. Jordan, B. Kalsow, and G. Nelson. Modula-3 report (revised). Tech. Rep. 52, Systems Research Center, Digital Equipment Corporation, November 1989.

[5] E. W. Dijkstra. Co-operating sequential processes. In Programming Languages, F. Genuys, Ed. Academic Press, 1968, pp. 103-110.

[6] C. Donnelly and R. Stallman. BISON, the YACC-compatible Parser Generator, December 1991.

[7] K. P. Eswaran, J. N. Gray, R. A. Lorie, and I. L. Traiger. The notions of consistency and predicate locks in a database system. Communications of the ACM 19, 11 (November 1976), 624-633.

[8] R. A. Finkel. An Operating Systems Vade Mecum, 2nd ed. Prentice Hall, 1988.

[9] S. P. Harbison. Modula-3. Prentice Hall, 1992.

[10] J. W. Havender. Avoiding deadlock in multitasking systems. IBM Systems Journal 7, 2 (1968), 74-84.

[11] D. A. Hensgen, D. L. Sims, and D. Charley. A fair Banker's algorithm for read and write locks. Information Processing Letters (1993). To appear.

[12] M. Herlihy. A methodology for implementing highly concurrent data objects. Tech. Rep. 91-10, Cambridge Research Laboratory, Digital Equipment Corporation, October 1991.

[13] C. A. R. Hoare. Monitors: An operating system structuring concept. Communications of the ACM 17, 10 (October 1974), 549-557.

[14] R. C. Holt. Some deadlock properties of computer systems. ACM Computing Surveys 4, 3 (September 1972), 179-196.

[15] M. Jordan. An extensible programming environment for Modula-3. In Proceedings of the Fourth ACM SIGSOFT Symposium on Software Development Environments (December 1990). Also appeared in Software Engineering Notes, volume 15, number 6.

[16] B. Liskov and R. Scheifler. Guardians and actions: Linguistic support for robust, distributed programs. ACM Transactions on Programming Languages and Systems 5, 3 (July 1983), 381-404.

[17] H. Madduri and R. Finkel. Extension of the Banker's algorithm for resource allocation in a distributed system. Information Processing Letters 19, 1 (1984), 1-8.
[18] G. Nelson, Ed. Systems Programming with Modula-3. Prentice Hall, 1991.

[19] V. Paxson. FLEX, Fast Lexical Analyzer Generator.

[20] A. Podgurski and L. A. Clarke. A formal model of program dependences and its implications for software testing, debugging, and maintenance. IEEE Transactions on Software Engineering 16, 9 (September 1990), 965-979.

[21] D. P. Reed and R. K. Kanodia. Synchronization with eventcounts and sequencers. Communications of the ACM 22, 2 (February 1979), 115-123.

[22] D. J. Richardson, T. O. O'Malley, C. T. Moore, and S. L. Aha. Developing and integrating ProDAG in the Arcadia environment. In Proceedings of the Fifth ACM SIGSOFT Symposium on Software Development Environments (December 1992), pp. 109-119.

[23] D. L. Sims and D. A. Hensgen. Automatically mapping sequential objects to concurrent objects: The synchronization problem. Preprint, 1993.

[24] D. L. Sims and D. A. Hensgen. Automatically mapping sequential objects to concurrent objects: The mutual exclusion problem. In Proceedings of the 1993 International Conference on Parallel Processing (1993).