Cooperative Task Management
without Manual Stack Management

or, Event-driven Programming is Not the Opposite of Threaded Programming

Atul Adya
Jon Howell
Marvin Theimer
William J. Bolosky
John R. Douceur
Microsoft Research
1 Microsoft Way, Redmond, Washington 98052
{adya, howell, theimer, bolosky, johndo}@microsoft.com

Abstract

Cooperative task management can provide program architects with ease of reasoning about concurrency issues. This property is often espoused by those who recommend ``event-driven'' programming over ``multithreaded'' programming. Those terms conflate several issues. In this paper, we clarify the issues, and show how one can get the best of both worlds: reason more simply about concurrency in the way ``event-driven'' advocates recommend, while preserving the readability and maintainability of code associated with ``multithreaded'' programming.

We identify the source of confusion about the two programming styles as a conflation of two concepts: task management and stack management. Those two concerns define a two-axis space in which ``multithreaded'' and ``event-driven'' programming are diagonally opposite; there is a third ``sweet spot'' in the space that combines the advantages of both programming styles. We point out pitfalls in both alternative forms of stack management, manual and automatic, and we supply techniques that mitigate the danger in the automatic case. Finally, we exhibit adaptors that enable automatic stack management code and manual stack management code to interoperate in the same code base.

Introduction

Our team embarked on a new project and faced the question of what programming model to use. Each team member had been burned by concurrency issues in the past, encountering bugs that were difficult to even reproduce, much less identify and remove. We chose to follow the collective wisdom of the community as we understood it, which suggests that an ``event-driven'' programming model can simplify concurrency issues by reducing opportunities for race conditions and deadlocks [Ous96]. However, as we gained experience, we realized that the popular term ``event-driven'' conflates several distinct concepts; most importantly, it suggests that a gain in reasoning about concurrency cannot be had without cumbersome manual stack management. By separating these concerns, we were able to realize the ``best of both worlds.''

In the next section, we define the two distinct concepts whose conflation is problematic, and we touch on three related concepts to avoid confusing them with the central ideas. The key concept is that one can choose the reasoning benefits of cooperative task management without sacrificing the readability and maintainability of automatic stack management. The section on stack management then describes how software evolution exacerbates problems both for code using manual stack management and for code using automatic stack management; we show how the most insidious problem with automatic stack management can be alleviated. The section after that presents our hybrid stack-management model, which allows code using automatic stack management to coexist and interoperate in the same program with code using manual stack management; this model helped us find peace within a group of developers that disagreed on which method to use. We then discuss our experience implementing these ideas in two different systems, relate our observations to other work, and summarize our conclusions.

Definitions


In this section, we define and describe five distinct concepts: task management, stack management, I/O management, conflict management, and data partitioning. These concepts are not completely orthogonal, but considering them independently helps us understand how they interact in a complete design. We tease apart these concerns and then return to look at how the concepts have been popularly conflated.

Task management


One can often divide the work a program does into conceptually separate tasks: each task encapsulates a control flow, and all of the tasks access some common, shared state. High-performance programs are often written with preemptive task management, wherein execution of tasks can interleave on uniprocessors or overlap on multiprocessors. The opposite approach, serial task management, runs each task to completion before starting the next task. Its advantage is that there is no conflict of access to the shared state; one can define inter-task invariants on the shared state and be assured that, while the present task is running, no other tasks can violate the invariants. The strategy is inappropriate, however, when one wishes to exploit multiprocessor parallelism, or when slow tasks must not defer later tasks for a long time.

A compromise approach is cooperative task management. In this approach, a task's code yields control to other tasks only at well-defined points in its execution, usually when the task must wait for long-running I/O. The approach is valuable when tasks must interleave to avoid waiting on one another's I/O, but multiprocessor parallelism is not crucial for good application performance.

Cooperative task management preserves some of the advantage of serial task management in that invariants on the global state only need be restored when a task explicitly yields, and they can be assumed to be valid when the task resumes. Cooperative task management is harder than serial in that, if the task has local state that depends on the global state before yielding, that state may be invalid when the task resumes. The same problem appears in preemptive task management when releasing locks for the duration of a slow I/O operation [cite birrell:threads].

One penalty for adopting cooperative task management is that every I/O library function called must be wrapped so that instead of blocking, the function initiates the I/O and yields control to another task. The wrapper must also arrange for its task to become schedulable when the I/O completes.
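
To make the wrapping concrete, here is a minimal sketch of such a wrapper; AsyncRead, RegisterIoCompletion, CurrentTask, and Yield are illustrative names rather than an actual interface:

int WrappedRead(File *f, void *buf, int len) {
    // Initiate the I/O, but do not wait for it.
    IoRequest *req = AsyncRead(f, buf, len);
    // Arrange for this task to become schedulable
    // again when the I/O completes.
    RegisterIoCompletion(req, CurrentTask());
    // Yield so other tasks can run in the meantime.
    Yield();
    // We are resumed only after the I/O has completed.
    return req->bytesTransferred;
}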

Stack management

The common approach to achieving cooperative task management is to organize a program as a collection of event handlers. Say a task involves receiving a network message, reading a block from disk, and replying to the message. The receipt of the message is an event; one procedure handles that event and initiates the disk I/O. The receipt of the disk I/O result is a second event; another procedure handles that event and constructs the network reply message. The desired task management is achieved, in that other tasks may make progress while the present task is waiting on the disk I/O.

We call the approach just described manual stack management. As we argue in the section on stack management below, the problem is that the control flow for a single conceptual task and its task-specific state are broken across several language procedures, effectively discarding language scoping features. This problem is subtle because it causes the most trouble as software evolves. It is important to observe that one can choose cooperative task management for its benefits while exploiting the automatic stack management afforded by a structured programming language. We describe how in the section on our hybrid approach.

Some languages have a built-in facility for transparently constructing closures; Scheme's call-with-current-continuation is an obvious example [HFW84, FHK84]. Such a facility obviates the idea of manual stack management altogether. This paper focuses on the stack management problem in conventional systems languages without elegant closures.

I/O management

While this paper focuses on the first two axes, we explicitly mention three other axes to avoid confusing them with the first two. The first concerns the question of synchronous versus asynchronous I/O management, which is orthogonal to the axis of task management. An I/O programming interface is synchronous if the calling task appears to block at the call site until the I/O completes, and then resume execution. An asynchronous interface call appears to return control to the caller immediately. The calling code may initiate several overlapping asynchronous operations, then later wait for the results to arrive, perhaps in arbitrary order. This form of concurrency is different from task management because I/O operations can be considered independently from the computation they overlap, since the I/O does not access the shared state of the computation. Code obeying any of the forms of task management can call either type of I/O interface. Furthermore, with the right primitives, one can build wrappers to make synchronous interfaces out of asynchronous ones, and vice versa; we do just that in our systems.
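
As a sketch of one direction of this adaptation, the following wrapper presents a synchronous interface built from an asynchronous one under cooperative task management; AsyncWrite, WaitEvent, and GetResult are illustrative names:

int SyncWrite(File *f, const void *buf, int len) {
    // Initiate the I/O; control returns immediately.
    EventHandle eh = AsyncWrite(f, buf, len);
    // Cooperatively yield this task until the event fires.
    WaitEvent(eh);
    // Safe to collect the result: the operation has finished.
    return GetResult(eh);
}

The reverse adaptation is equally mechanical: run the synchronous call as a task of its own and signal an event when it returns.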

Conflict management

Different task management approaches offer different granularities of atomicity on shared state. Conflict management considers how to convert available atomicity to a meaningful mechanism for avoiding resource conflicts. In serial task management, for example, an entire task is an atomic operation on shared state, so no explicit mechanism is needed to avoid inter-task conflicts on shared resources. In the limiting case of preemptive task management, where other tasks are executing concurrently, tasks must ensure that invariants hold on the shared state all the time.

The general solution to this problem is synchronization primitives, such as locks, semaphores, and monitors. Based on small atomic operations supplied by the machine or runtime environment, synchronization primitives let us construct mechanisms that maintain complex invariants on shared state at all times. Synchronization mechanisms may be pessimistic or optimistic. A pessimistic mechanism locks other tasks out of the resources it needs to complete a computation. An optimistic mechanism computes results speculatively; if the computation turns out to conflict with a concurrent task's computation, the mechanism retries, perhaps also falling back on a pessimistic mechanism if no forward progress is being made.

Cooperative (or serial) task management effectively provides arbitrarily large atomic operations: all of the code executed between two explicit yield points is executed atomically. Therefore, it is straightforward to build many complex invariants safely. This approach is analogous to the construction of atomic sequences with interrupt masking in uniprocessor OS kernels. We discuss in the subsection on hidden concurrency assumptions how to ensure that code dependent on atomicity stays atomic as software evolves.

Data partitioning

Task management and conflict management work together to address the problem of potentially concurrent access to shared state. By partitioning that state, we can reduce the number of opportunities for conflict. For example, task-specific state needs no concurrency considerations because it has been explicitly partitioned from shared state. Data is transferred between partitions by value; one must be careful to handle implicit references (such as a value that actually indexes an array) thoughtfully.

Explicitly introducing data partitions to reduce the degree of sharing of shared state can make it easier to write and reason about invariants on each partition; data partitioning is an orthogonal approach to those mentioned previously.

How the concepts relate

We have described five distinct concepts. They are not all precisely orthogonal, but it is useful to consider the effects of choices in each dimension separately. Most importantly, for the purposes of this paper, the task management and stack management axes are indeed orthogonal (see Figure 1).


Figure 1: Two axes that are frequently conflated.

The idea behind Figure 1 is that conventional concurrent programming uses preemptive task management and exploits the automatic stack management of a standard language. We often hear this point in the space referred to by the term ``threaded programming.'' The second interesting point in the space is ``event-driven programming,'' where cooperative tasks are organized as event handlers that yield control by returning control to the event scheduler, manually unrolling their stacks. This paper is organized around the observation that one can choose cooperative task management while preserving the automatic stack management that makes a programming language ``structured;'' in the diagram, this point is labeled the ``sweet spot.''

Stack management


Given our diagram, one might ask, ``What are the pros and cons of the two forms of stack management?'' We address that question here. We present the principal advantages and disadvantages of each form, emphasizing how software evolution exacerbates the disadvantages of each. We also present a technique that mitigates the principal disadvantage of automatic stack management.

Automatic versus manual


Programmers can express a task employing either automatic stack management or manual stack management. With automatic stack management, the programmer expresses each complete task as a single procedure in the source language. Such a procedure may call functions that block on I/O operations such as disk or remote requests. While the task is waiting on a blocking operation, its current state is kept in data stored on the procedure's program stack. This style of control flow is one meaning often associated with the term ``procedure-oriented.''

In contrast, manual stack management requires a programmer to rip the code for any given task into event handlers that run to completion without blocking. Event handlers are procedures that can be invoked by an event-handling scheduler in response to events, such as the initiation of a task or the response from a previously-requested I/O. To initiate an I/O, an event handler E_1 schedules a request for the operation but does not wait for the reply. Instead, E_1 registers a task-specific object called a continuation [FHK84] with the event-handling scheduler. The continuation bundles state indicating where E_1 left off working on the task, plus a reference to a different event-handler procedure E_2 that encodes what should be done when the requested I/O has completed. After initiating the I/O and registering the continuation, E_1 returns control to the event-handling scheduler.

When the event representing the I/O completion occurs, the event-handling scheduler calls E_2, passing E_1's bundled state as an argument. This style of control flow is often associated with the term ``event-driven.''

To illustrate these two stack-management styles, consider the code for a function, GetCAInfo, that looks in an in-memory hash table for a specified certificate-authority id and returns a pointer to the corresponding object. A certificate authority is an entity that issues certificates, for example for users of a file system.

CAInfo GetCAInfo(CAID caId) {
    CAInfo caInfo = LookupHashTable(caId);
    return caInfo;
}

Suppose that initially this function was designed to handle a few globally known certificate authorities and hence all the CA records could be stored in memory. We refer to such a function as a compute-only function: because it does not pause for I/O, we need not consider how its stack is managed across an I/O call, and thus the automatic stack management supplied by the compiler is always appropriate.

Now suppose the function evolves to support an abundance of CA objects. We may wish to convert the hash table into an on-disk structure, with an in-memory cache of the entries in use. GetCAInfo has become a function that may have to yield for I/O. How the code evolves depends on whether it uses automatic or manual stack management.

Following is code with automatic stack management that implements the revised function:

CAInfo GetCAInfoBlocking(CAID caId) {
    CAInfo caInfo = LookupHashTable(caId);
    if (caInfo != NULL) {
        // Found node in the hash table
        return caInfo;
    }

    caInfo = new CAInfo();
    // DiskRead blocks waiting for 
    // the disk I/O to complete.
    DiskRead(caId, caInfo);
    InsertHashTable(caId, caInfo);
    return caInfo;
}

To achieve the same goal using manual stack management, we rip the single conceptual function GetCAInfoBlocking into two source-language functions, so that the second function can be called from the event-handler scheduler to continue after the disk I/O has completed. Here is the continuation object that stores the bundled state and function pointer:

class Continuation {
    // The function called when this
    // continuation is scheduled to run.
    void (*function)(Continuation *cont);
    // Return value set by the I/O operation.
    // To be passed to the continuation.
    void *returnValue;
    // Bundled-up state
    void *arg1, *arg2, ...;
};

Here is the original function, ripped into the two parts that function as event handlers:

void GetCAInfoHandler1(CAID caId, Continuation *callerCont) {
    // Return the result immediately if in cache
    CAInfo *caInfo = LookupHashTable(caId);    
    if (caInfo != NULL) {
        // Call caller's continuation with result
        callerCont->returnValue = caInfo;
        (*callerCont->function)(callerCont);
        return;
    }

    // Make buffer space for disk read
    caInfo = new CAInfo();  
    // Save return address & live variables
    Continuation *cont =
        new Continuation( &GetCAInfoHandler2, caId, caInfo, callerCont);
    // Send request
    EventHandle eh = InitAsyncDiskRead(caId, caInfo);    
    // Schedule event handler to run on reply
    // by registering continuation
    RegisterContinuation(eh, cont); 
}

void GetCAInfoHandler2(Continuation *cont) {
    // Recover live variables
    CAID caId = (CAID) cont->arg1;
    CAInfo *caInfo = (CAInfo*) cont->arg2;
    Continuation *callerCont = (Continuation*) cont->arg3;
    // Stash CAInfo object in hash
    InsertHashTable(caId, caInfo);    
    // Now ``return'' results to original caller
    callerCont->returnValue = caInfo;
    (*callerCont->function)(callerCont);
}

Note that the signature of GetCAInfo is different from that of GetCAInfoHandler1. Since the desired result from what used to be GetCAInfo will not be available until GetCAInfoHandler2 runs sometime later, the caller of GetCAInfoHandler1 must pass in a continuation that GetCAInfoHandler2 can later invoke in order to return the desired result via the continuation record. That is, with manual stack management, a statement that returns control (and perhaps a value) to a caller must be simulated by a function call to a continuation procedure.

Stack Ripping


In conventional systems languages, such as C++, which have no support for closures, the programmer has to do a substantial amount of manual stack management to yield for I/O operations. Note that the function in the previous section was ripped into two parts because of one I/O call. If there are more I/O calls, there are even more rips in the code. The situation gets worse still in the presence of control structures such as for loops. The programmer deconstructs the language stack, reconstructs it on the heap, and reduces the readability of the code in the process.

Furthermore, debugging is impaired because when the debugger stops in GetCAInfoHandler2, the call stack only shows the state of the current event handler and provides no information about the sequence of events that the ripped task performed before arriving at the current event handler invocation. Theoretically, one can manually recover the call stack by tracing through the continuation objects; in practice we have observed that programmers hand-optimize away tail calls, so that much of the stack goes missing.

In summary, for each routine that is ripped, the programmer will have to manually manage procedural language features that are normally handled by a compiler:

function scoping
Now two or more language functions represent a single conceptual function.
automatic variables
Variables once allocated on the stack by the language must be moved into a new state structure stored on the heap to survive across yield points.
control structures
The entry point to every basic block containing a call to a function that might block must be reachable from a continuation, and hence must be a separate language-level function. That is, conceptual functions with loops must be ripped into more than two pieces (see the sketch following this list).
debugging stack
The call stack must be manually recovered when debugging, and manual optimization of tail calls may make it unrecoverable.
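
The following sketch illustrates the control-structure point with a loop that reads n disk blocks. It reuses the Continuation machinery from above; buffer and Process are illustrative:

// Blocking original: the loop is trivial to express.
void ReadAllBlocking(int n) {
    for (int i = 0; i < n; i++) {
        DiskRead(i, &buffer[i]);   // blocks on I/O
    }
    Process(buffer, n);
}

// Ripped version: the loop head must be re-entered from a
// continuation, so it becomes an event handler of its own.
// It is initially scheduled with cont->arg1 == 0.
void ReadAllHandler(Continuation *cont) {
    int i = (int) cont->arg1;
    int n = (int) cont->arg2;
    if (i == n) { Process(buffer, n); return; }
    cont->arg1 = (void *) (i + 1);   // save the loop state
    EventHandle eh = InitAsyncDiskRead(i, &buffer[i]);
    RegisterContinuation(eh, cont);  // re-enter the loop on reply
}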

Software evolution substantially magnifies the problem of function ripping: when a function evolves from being compute-only to potentially yielding, all functions along every path from the function whose concurrency semantics have changed to the root of the call graph may have to be ripped in two. (More precisely, all functions up a branch of the call graph will have to be ripped until a function is encountered that already makes its call in continuation-passing form.) We call this phenomenon ``stack ripping'' and see it as the primary drawback to manual stack management. Note that, as with all global evolutions, functions on the call graph may be maintained by different parties, making the change difficult.

Hidden concurrency assumptions


The huge advantage of manual stack management is that every yield point is explicitly visible in the code at every level of the call graph. In contrast, the call to DiskRead in GetCAInfoBlocking hides potential concurrency. Local state extracted from shared state before the DiskRead call may need to be reevaluated after the call. Absent a comment, the programmer cannot tell which function calls may yield and which local state to revalidate as a consequence thereof.

As with manual stack management, software evolution makes the situation even worse. A call that did not yield yesterday may be changed tomorrow to yield for I/O. However, when a function with manual stack management evolves to yield for I/O, its signature changes to reflect the new structure, and the compiler will call attention to any callers of the function unaware of the evolution. With automatic stack management, such a change is syntactically invisible and yet it affects the semantics of every function that calls the evolved function, either directly or transitively.

The dangerous aspect of automatic stack management is that a semantic property of a called procedure (whether or not it may yield) dramatically affects how the calling procedure should be written, but there is no check that the calling procedure is honoring the property. Happily, concurrency assumptions can be declared explicitly and checked statically or dynamically.

A static check would be ideal because it detects violations at compile time. Functions that yield are tagged with a yields property, and each block of a calling function that assumes that it runs without yielding is marked atomic. The compiler or a static tool checks that functions that call yields functions are themselves marked yields, and that no calls to yields functions appear inside atomic blocks. In fact, one could reasonably abuse an exception declaration mechanism to achieve this end.

A dynamic check is less desirable than a static one because violations are only found if they occur at runtime. It is still useful in that violations cause an immediate failure, rather than subtly corrupting system state in a way that is difficult to trace back to its cause. We chose a dynamic check because it was quick and easy to implement. Each block of code that depends on atomicity begins with a call to a routine that increments a private counter and ends with a call to one that decrements it. When any function tries to block on I/O, yield() asserts that the counter is zero, and dumps core otherwise.
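
A minimal sketch of the dynamic check follows; since the exact identifiers are unimportant, we use the illustrative names beginAtomic, endAtomic, and atomicCount:

static int atomicCount = 0;   // private counter

void beginAtomic() { atomicCount++; }
void endAtomic()   { atomicCount--; }

void yield() {
    // Yielding inside an atomic block is a bug:
    // fail immediately rather than corrupt state.
    assert(atomicCount == 0);
    SwitchToScheduler();      // illustrative name
}

// Usage: a block whose correctness depends on not yielding.
void UpdateCache(CAID caId, CAInfo *caInfo) {
    beginAtomic();
    InsertHashTable(caId, caInfo);  // must not yield in here
    endAtomic();
}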

Note that in evolving code employing automatic stack management, we may also have to modify every function along every path up the call graph from a function whose concurrency semantics have changed. However, whereas manual stack management implies that each affected function must be torn apart into multiple pieces, automatic-stack-management code may require no changes or far less intrusive changes. If the local state of a function does not depend on the yielding behavior of a called function, then the calling function requires no change. If the calling function's local state is affected, the function must be modified to revalidate its state; this surgery is usually local and does not require substantial code restructuring.
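
For example, the following sketch shows the typical local surgery on a caller of the now-yielding GetCAInfoBlocking; the Revoked predicate and the caId field are illustrative:

bool CheckCert(User user, Certificate *cert) {
    // This call may now yield for disk I/O.
    CAInfo *ca = GetCAInfoBlocking(cert->caId);
    // Revalidate: another task may have revoked the
    // certificate while we were waiting on the I/O.
    if (Revoked(cert)) {
        return false;
    }
    return CACheckCert(ca, user, cert);
}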

Hybrid approach


In our project there are passionate advocates for each of the two styles of stack management. There is a hybrid approach that enables both styles to coexist in the same code base, using adaptors to connect between them. This hybrid approach also enables a project to be written in one style but incorporate legacy code written in the other.

In the Windows operating system, ``threads'' are scheduled preemptively and ``fibers'' are scheduled cooperatively. Our implementation achieves cooperative task management by scheduling multiple fibers on a single thread; at any given time, only one fiber is active.

In our design, a scheduler runs on a special fiber called MainFiber and schedules both manual-stack-management code (event handlers) and automatic-stack-management code. Code written with automatic stack management that expects to block for I/O always runs on a fiber other than MainFiber; when it blocks, it always yields control back to MainFiber, where the scheduler selects the next task to schedule. Compute-only functions, of course, may run on any fiber, since they may be freely called from either context.

Both types of stack management code are scheduled by the same scheduler because the Windows fiber package only supports the notion of explicitly switching from one fiber to another specified fiber; there is no notion of a generalized yield operation that invokes a default fiber scheduler. Implementing a combined scheduler also allowed us to avoid the problem of having two, potentially conflicting, schedulers running in parallel: one for event handlers and one for fibers.
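
A sketch of the combined scheduler loop illustrates the structure; readyQueue and WaitForNextEvent are illustrative names:

// Runs on MainFiber.
void SchedulerLoop() {
    for (;;) {
        while (!readyQueue.Empty()) {
            Continuation *cont = readyQueue.Pop();
            // A plain event handler runs to completion here.
            // An adaptor continuation (such as FiberContinue)
            // may switch to another fiber and back before
            // this call returns.
            (*cont->function)(cont);
        }
        // Nothing is runnable: wait for the next I/O
        // completion and turn it into a ready continuation.
        Event e = WaitForNextEvent();
        readyQueue.Push(e.continuation);
    }
}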

There are other ways in which the two styles of code can be made to interact. We aimed for simplicity and to preserve our existing code base that uses manual stack management. Our solution ensures that code written in either style can call a function implemented in the other style without being aware that the other stack management discipline even exists.

To illustrate the hybrid approach, we show an example that includes calls across styles in both directions. The example involves four functions: FetchCert, GetCertData, VerifyCert, and GetCAInfo. (GetCAInfo was introduced earlier, in the subsection on automatic versus manual stack management.) FetchCert fetches a security certificate using GetCertData and then calls VerifyCert in order to confirm its validity. VerifyCert, in turn, calls GetCAInfo in order to obtain a CA with which to verify a certificate. Here is how the code would look with serial task management:

bool FetchCert(User user, Certificate *cert) {
    // Get the certificate data from a
    // function that might do I/O
    cert = GetCertData(user);
    if (!VerifyCert(user, cert)) {
        return false;
    }
    return true;
}

bool VerifyCert(User user, Certificate *cert) {
    // Get the Certificate Authority (CA)
    // information and then verify cert
    CAInfo *ca = GetCAInfo(cert);
    if (ca == NULL) { return false; }
    return CACheckCert(ca, user, cert);
}

Certificate* GetCertData(User user) {
    // Look up the certificate in the memory
    // cache and return the answer.
    // Else fetch from disk/network.
    Certificate *cert;
    if (Lookup(user, &cert)) {
        return cert;
    }
    cert = DoIOAndGetCert();
    return cert;
}

Of course, we want to rewrite the code to use cooperative task management, allowing other tasks to run during the I/O pauses, with different functions adhering to each form of stack management. Suppose that VerifyCert is written with automatic stack management and the remaining functions (FetchCert, GetCertData, GetCAInfo) are implemented with manual stack management (using continuations). We will define adaptor functions that route control flow between the styles.

Manual calling automatic


Figure 2 is a sequence diagram illustrating how code with manual stack management calls code with automatic stack management. In the figure, the details of a call in the opposite direction are momentarily obscured behind dashed boxes. The first event handler for FetchCert, FetchCert1, calls the function GetCertData1, which initiates an I/O operation, and the entire stack unrolls in accordance with manual stack management. Later, when the I/O reply arrives, the scheduler executes the GetCertData2 continuation, which ``returns'' (by a function call) to the second handler for FetchCert. This is pure manual stack management.


Figure 2: GetCertData, code with manual stack management, calls VerifyCert, a function written with automatic stack management.


When a function written with manual stack management calls code with automatic stack management, we must reconcile the two styles. The caller code is written expecting never to block on I/O; the callee expects always to block on I/O. To reconcile these styles, we create a new fiber and execute the callee code on that fiber. The caller resumes (to manually unroll its stack) as soon as the first burst of execution on the fiber completes. The fiber may run and block for I/O several times; when it finishes its work on behalf of the caller, it executes the caller's continuation to resume the caller's part of the task. Thus, the caller code does not block and the callee code can block if it wishes.

In our example, the manual-stack-management function FetchCert2 calls through an adaptor to the automatic-stack-management function VerifyCert. FetchCert2 passes along a continuation pointing at FetchCert3 so that it can eventually regain control and execute the final part of its implementation. The following code is for the CFA adaptor, ripped into its call and return parts; CFA stands for ``Continuation-to-Fiber Adaptor.''

void VerifyCertCFA(CertData certData, Continuation *callerCont) {
    // Executed on MainFiber
    Continuation *vcaCont = new Continuation(VerifyCertCFA2, callerCont);
    Fiber *verifyFiber = new VerifyCertFiber(certData, vcaCont);
    // On fiber verifyFiber, start executing
    // VerifyCertFiber::FiberStart
    SwitchToFiber(verifyFiber);
    // Control returns here when
    // verifyFiber blocks on I/O
}

void VerifyCertCFA2(Continuation *vcaCont) {
    // Executed on MainFiber.
    // Scheduled after verifyFiber is done
    Continuation *callerCont = (Continuation*) vcaCont->arg1;
    callerCont->returnValue = vcaCont->returnValue;
    // ``return'' to original caller (FetchCert)
    (*callerCont->function)(callerCont);
}

The first adaptor function accepts the arguments of the adapted function and a continuation (``stack frame'') for the calling task. It constructs its own continuation vcaCont and creates an object called verifyFiber that represents a new fiber (VerifyCertFiber is a subclass of the Fiber class); this object keeps track of the function arguments and vcaCont so that it can transfer control to VerifyCertCFA2 when verifyFiber's work is done. Finally, it performs a fiber switch to verifyFiber. When verifyFiber begins, it executes the glue routine VerifyCertFiber::FiberStart to unpack the parameters and pass them to VerifyCert, which may block on I/O:

void VerifyCertFiber::FiberStart() {
    // Executed on a fiber other than MainFiber
    // The following call could block on I/O.
    // Do the actual verification.
    this->vcaCont->returnValue = VerifyCert(this->certData);
    // The verification is complete.
    // Schedule VerifyCertCFA2
    scheduler->schedule(this->vcaCont);
    SwitchTo(MainFiber);
}

This start function simply calls into the function VerifyCert. At some point, when VerifyCert yields for I/O, it switches control back to the MainFiber using a SwitchTo call in the I/O function (not the call site shown in the FiberStart() routine above). Control resumes in VerifyCertCFA, which unrolls the continuation stack (i.e., GetCertData2 and FetchCert2) back to the scheduler. Thus, the hybrid task has blocked for the I/O initiated by the code with automatic stack management while ensuring that event handler FetchCert2 does not block.

Later, when the I/O completes, verifyFiber is resumed (for now, we defer the details on how this resumption occurs). After VerifyCert has performed the last of its work, control returns to FiberStart. FiberStart stuffs the return value into VerifyCertCFA2's continuation, schedules it to execute, and switches back to the MainFiber a final time. At this point, verifyFiber is destroyed. When VerifyCertCFA2 executes, it ``returns'' (with a function call, as code with manual stack management normally does) the return value from VerifyCert back to the adaptor caller's continuation, FetchCert3.

Automatic calling manual


Figure 3: VerifyCert, code with automatic stack management, calls GetCAInfo, a function written with manual stack management.


We now discuss how the code interactions occur when a function with automatic stack management calls a function that manually manages its stack. In this case, the former function needs to block for I/O, but the latter function simply schedules the I/O and returns. To reconcile these requirements, we supply an adaptor that calls the manual-stack-management code with a special continuation and relinquishes control to the MainFiber, thereby causing the adaptor's caller to remain blocked. When the I/O completes, the special continuation runs on the MainFiber and resumes the fiber of the blocked adaptor, which resumes the original function waiting for the I/O result.

Figure 3 fills in the missing details of Figure 2 to illustrate this interaction. In this example, VerifyCert blocks on I/O when it calls GetCAInfo, a function with manual stack management. VerifyCert calls the adaptor GetCAInfoFCA, which hides the manual-stack-management nature of GetCAInfo (FCA stands for ``Fiber-to-Continuation Adaptor''):

Boolean GetCAInfoFCA(CAID caid) {
    // Executed on verifyFiber
    // Get a continuation that switches control
    // to this fiber when called on MainFiber
    FiberContinuation *cont = new FiberContinuation(FiberContinue, this);
    GetCAInfoHandler1(caid, cont);
    if (!cont->shortCircuit) {
        // GetCAInfo did block.
        SwitchTo(MainFiber);
    }
    return cont->returnValue;
}

void FiberContinue(Continuation *cont) {
    if (!Fiber::OnMainFiber()) {
        // Manual stack mgmt code did not perform
        // I/O: just mark it as short-circuited
        FiberContinuation *fcont = (FiberContinuation *) cont;
        fcont->shortCircuit = true;
    } else {
        // Resumed after I/O: simply switch
        // control to the original fiber
        Fiber *f = (Fiber *) cont->arg1;
        f->Resume();
    }
}

The adaptor, GetCAInfoFCA, sets up a special continuation that will later resume verifyFiber via the code in FiberContinue. It then passes this continuation to GetCAInfoHandler1, which initiates an I/O operation and returns immediately to what it believes to be the event-handling scheduler; of course, in this case, control returns to GetCAInfoFCA. Since I/O was scheduled and short-circuiting did not occur (discussed later in this section), GetCAInfoFCA must ensure that control does not yet return to VerifyCert; to achieve this effect, it switches control to the MainFiber.

On the MainFiber, the continuation code that started this burst of fiber execution, VerifyCertCFA, returns several times to unroll its stack and the scheduler runs again. Eventually, the I/O result arrives and the scheduler executes GetCAInfoHandler2, the remaining work of GetCAInfo. GetCAInfoHandler2 fills the local hash table (recall its implementation from the subsection on automatic versus manual stack management) and ``returns'' control by calling a continuation. In this case, it calls the continuation (FiberContinue) that had been passed to GetCAInfoHandler1.

FiberContinue notices that verifyFiber has indeed been blocked and switches control back to that fiber, where the bottom half of the adaptor, GetCAInfoFCA, extracts the return value and passes it up to the automatic-stack-management code that called it (VerifyCert).

The short-circuit branch not followed in the example handles the case where GetCAInfo returns a result immediately, without waiting for I/O. When it can do so, it must not allow control to pass to the scheduler. This is necessary so that a caller can optionally determine whether or not a routine has yielded control, and hence whether or not local state must be revalidated. Without a short-circuit path, this important optimization and an associated design pattern that we describe in the Implementation Experience section cannot be achieved. Figure 4 illustrates the short-circuit sequence: the short-circuit code detects the case where GetCAInfo runs locally, performs no I/O, and executes (``returns to'') the current continuation immediately. FiberContinue detects that it was not executed directly by the scheduler, and sets the shortCircuit flag to prevent the adaptor from switching to the MainFiber.


Figure 4: A variation where GetCAInfo does not need to perform I/O.

Discussion

An important observation is that, with adaptors in place, each style of code is unaware of the other. A function written with automatic stack management sees what it expects: deep in its stack, control may transfer away, and return later with the stack intact. Likewise, the event-handler scheduler cannot tell that it is calling anything other than just a series of ordinary manual-stack-management continuations: the adaptors deftly swap the fiber stacks around while looking like any other continuation. Thus, integrating code in the two styles is straightforward: fiber execution looks like a continuation to the event-driven code, and the continuation scheduler looks like any other fiber to the procedure-oriented code. This adaptability enables automatic-stack-management programmers to work with manual-stack-management programmers, and to evolve a manual-stack-management code base with automatic-stack-management functions and vice versa.

Implementation Experience


We have employed cooperative task management in two systems: Farsite [BDET00], a distributed, secure, serverless file system running over desktops, and UCoM [SBS02], a wireless phone application for handheld devices such as PDAs. The Farsite code is designed to run as a daemon process servicing file requests on the Windows NT operating system. The UCoM system, designed for the Windows CE operating system, is a client application that runs with UI and audio support.

The Farsite system code was initially written in event-driven style (cooperative task management and manual stack management) to enable simplified reasoning about the concurrency conditions of the system. As our code base grew and evolved over a period of two years, we came to appreciate the costs of employing manual stack management and devised the hybrid approach discussed in the previous section to introduce automatic stack management code into our system. The UCoM system uses automatic stack management exclusively.

Farsite uses fibers, the cooperative threading facility available in Windows NT. With Windows fibers, each task's state is represented with a stack, and control is transferred by simply swapping one stack pointer for another, as with setjmp and longjmp. Since fibers are unavailable in the Windows CE operating system, UCoM uses preemptive threads and condition variables to achieve a cooperative threading facility: each thread blocks on its condition variable and the scheduler ensures that at most one condition variable is signalled at any moment. When a thread yields, it blocks on its condition variable and signals the scheduler to continue; the scheduler selects a ready thread and signals its condition variable.
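
The following sketch shows that discipline using portable C++ primitives in place of the Windows CE facilities we actually used; the CoopThread structure and the hand-off protocol are illustrative:

#include <mutex>
#include <condition_variable>

// At most one of the scheduler and the task threads is
// runnable at any moment.
struct CoopThread {
    std::condition_variable cv;
    bool runnable = false;
};

static std::mutex m;
static CoopThread scheduler;

// Called by a task thread to yield control.
void Yield(CoopThread *self) {
    std::unique_lock<std::mutex> lock(m);
    scheduler.runnable = true;   // wake the scheduler...
    scheduler.cv.notify_one();
    self->runnable = false;      // ...and sleep until chosen
    self->cv.wait(lock, [self] { return self->runnable; });
}

// Called by the scheduler to run one ready task thread.
void Dispatch(CoopThread *next) {
    std::unique_lock<std::mutex> lock(m);
    next->runnable = true;       // hand the token to the task
    next->cv.notify_one();
    scheduler.runnable = false;  // sleep until it yields back
    scheduler.cv.wait(lock, [] { return scheduler.runnable; });
}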

We implemented the hybrid adaptors in each direction with a series of mechanically-generated macros. There are two groups of macros, one for each direction of adaptation. Within each group, there are variations to account for varying numbers of arguments, void or non-void return type, and whether the function being called is a static function or an object method; multiple macros are necessary to generate the corresponding variations in syntax. Each macro takes as arguments the signature of the function being adapted. The macros declare and create appropriate Fiber and Continuation objects.
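
For flavor, here is a sketch of what one such macro can look like, for a static function of one argument with a non-void return; the FuncFiber helper and the exact naming are illustrative, not our actual macros:

// Generates the continuation-to-fiber adaptor pair for
// RET NAME(ARG). FuncFiber is an illustrative Fiber subclass
// that stores the target function, its argument, and the
// continuation to schedule when the call finishes.
#define DEFINE_CFA_1(RET, NAME, ARG)                          \
    void NAME##CFA2(Continuation *c);                         \
    void NAME##CFA(ARG a1, Continuation *callerCont) {        \
        Continuation *c =                                     \
            new Continuation(NAME##CFA2, callerCont);         \
        Fiber *f = new FuncFiber<RET, ARG>(&NAME, a1, c);     \
        SwitchToFiber(f); /* runs until NAME blocks or ends */\
    }                                                         \
    void NAME##CFA2(Continuation *c) {                        \
        Continuation *callerCont = (Continuation *) c->arg1;  \
        callerCont->returnValue = c->returnValue;             \
        (*callerCont->function)(callerCont);                  \
    }

With this in place, DEFINE_CFA_1(bool, VerifyCert, CertData) would expand to adaptors much like the hand-written VerifyCertCFA and VerifyCertCFA2 above.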

Our experience with both systems has been positive and our subjective impression is that we have been able to preempt many subtle concurrency problems by using cooperative task management as the basis for our work. Although the task of wrapping I/O functions (see the Task management subsection) can be tedious, it can be automated, and we found that paying an up-front cost to reduce subtle race conditions was a good investment.

Both systems use extra threads for converting blocking I/O operations to non-blocking operations and for scheduling I/O operations, as is done in many other systems, such as Flash [PDZ99]. Data partitioning prevents synchronization problems between the I/O threads and the state shared by cooperatively-managed tasks.
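
A sketch of one such I/O thread follows; the two queues are the only data it shares with the cooperative side, and their names are illustrative:

// An I/O thread converts a blocking OS call into a
// non-blocking operation for the cooperative tasks.
void IoThreadLoop() {
    for (;;) {
        // The request queue is the partition boundary: it is
        // the only state shared with the cooperative tasks.
        IoRequest *req = ioQueue.PopBlocking();
        req->result = BlockingDiskRead(req->key, req->buf);
        // Hand the completion continuation back to the
        // cooperative scheduler, which alone touches the
        // shared task state.
        completionQueue.PushBlocking(req->continuation);
    }
}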

Cooperative task management avoids the concurrency problems of locks only if tasks can complete without having to yield control to any other task. To deal with tasks that need to perform I/O, we found that we could often avoid the need for a lock by employing a particular design pattern. In this pattern, which we call the Pinning Pattern, I/O operations are used to pin resources in memory where they can be manipulated without yielding. Note that pinning does not connote exclusivity: a pinned resource is held in memory (to avoid the need to block on I/O to access it), but when other tasks run, they are free to manipulate the data structures it contains. Functions are structured in two phases: a loop that repeatedly tries to execute all potentially-yielding operations until they can all be completed without yielding, and an atomic block that computes results and writes them into the shared state.

An important detail of the design pattern is that there may be dependencies among the potentially-yielding operations. A function may need to compute on the results of a previously-pinned resource in order to decide which resource to pin next; for example, in Farsite this occurs when traversing a path in a directory tree. Thus, in the fully general version of the design pattern, a check after each potentially-yielding operation ascertains whether the operation did indeed yield, and if so, restarts the loop from the top. Once the entire loop has executed without interruption, we know that the set of resources we have pinned in memory are related in the way we expect, because the final pass through the loop executed atomically.
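
A sketch of the fully general pattern, using a directory-tree traversal in the spirit of the Farsite example; PinDirectory, DidYield, and MoveEntry are illustrative names:

// PinDirectory may yield to bring a directory into memory;
// DidYield() reports whether the preceding call yielded.
void RenameFile(Path src, Path dst) {
    Dir *srcDir, *dstDir;
    // Phase 1: pin every needed resource, restarting whenever
    // a pin yields, until a full pass completes atomically.
    for (;;) {
        srcDir = PinDirectory(src);
        if (DidYield()) continue;   // earlier pins may be stale
        dstDir = PinDirectory(dst);
        if (DidYield()) continue;
        break;   // this pass ran without yielding
    }
    // Phase 2: compute results and write them into the shared
    // state; no yields occur between here and return.
    MoveEntry(srcDir, dstDir);
}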

Related Work


Birrell offers a good overview of the conventional threaded programming model with preemptive task management [Bir89]. Of his reasons for using concurrency (p. 2), cooperative task management can help with all but exploiting multiprocessors, a shortcoming we mention in the Task management subsection. Birrell advises that ``you must be fastidious about associating each piece of data with one (and only one) mutex'' (p. 28); consider cooperative task management as the limiting case of that advice. There is the complexity that whenever a task yields it effectively releases the global mutex, and must reestablish its invariants when it resumes. But even under preemptive task management, Birrell comments that ``you might want to unlock the mutex before calling down to a lower level abstraction that will block or execute for a long time'' (p. 12); hence this complexity is not introduced by the choice of cooperative task management.

Ousterhout points out the pitfalls of preemptive task management, such as subtle race conditions and deadlocks [Ous96]. We argue that his ``threaded'' model conflates preemptive task management with automatic stack management, and his ``event-driven'' model conflates cooperative task management with manual stack management. We wish to convince designers that the choices are orthogonal, that Ousterhout's arguments are really about the task management decision, and that programmers should exploit the ease-of-reasoning benefits of cooperative task management while exploiting the features of their programming language by using automatic stack management.

Other system designers have advocated non-threaded programming models because they observe that for a certain class of high-performance systems, such as file servers and web servers, substantial performance improvements can be obtained by reducing context switching and carefully implementing application-specific cache-conscious task scheduling [HS99, PDZ99, BDM98, MY98]. These factors become especially pronounced during high load situations, when the number of threads may become so large that the system starts to thrash while trying to give each thread its fair share of the system's resources. We argue that the context-switching overhead for user-level threads (fibers) is in fact quite low; we measured the cost of switching in our fiber package to be less than ten times the cost of a procedure call.

Furthermore, application-specific cache-conscious task scheduling should be just as achievable with cooperative task management and automatic stack management: the scheduler is given precisely the same opportunities to schedule as in event-driven code; the only difference is whether stack state is kept on stacks or in chains of continuations on the heap.

For the classes of applications we reference here, processing is often partitioned into stages [WCB01, LP01]. The partitioning of system state into disjoint stages is a form of data partitioning, which addresses concurrency at the coarse grain. Within each stage, the designer of such a system must still choose a form of conflict management, task management, and stack management. Careful construction of stages avoids I/O calls within a stage; in that case, cooperative task management within the stage degenerates to serial task management, and no distinction arises in stack management. In practice, at the inter-stage level, a single task strings through multiple stages, and reads as in manual stack management. Typically, the stages are monotonic: once a task leaves a stage, it never returns. This at least avoids the ripping associated with looping control structures.

Lauer and Needham show two programming models to be equivalent up to syntactic substitution [LN79]. We describe their models in terms of our axes: their procedure-oriented system has preemptive task management, automatic stack management (``a process typically has only one goal or task''), monitors for conflict management, and one big data partition protected by those monitors. Their message-oriented system has manual stack management with task state passed around in messages, and no conflicts to manage due to many partitions of the state so that it is effectively never concurrently shared.

Notably, of the message-oriented system, they say ``neither procedural interfaces nor global naming schemes are very useful,'' that is, the manual stack management undermines structural features of the language. Neither model uses cooperative task management as we regard it, since both models require identically detailed reasoning about conflict management. Thus their comparison is decidedly not between the models we associate with multithreaded and event-driven programming.

Conclusions


In this paper we clarify an ongoing debate about ``event-driven'' versus ``threaded'' programming models by identifying two separable concerns: task management and stack management. Thus separated, the paper assumes cooperative task management and focuses on issues of stack management in that context. Whereas the choice of task management strategy is fundamental, the choice of stack management can be left to individual taste. Unfortunately, the term ``event-driven programming'' conflates both cooperative task management and manual stack management. This prevents many people from considering using a readable automatic-stack-management coding style in conjunction with cooperative task management.

Software evolution is an important factor affecting the choice of task management strategy. When concurrency assumptions evolve it may be necessary to make global, abstraction-breaking changes to an application's implementation. Evolving code with manual stack management imposes the cumbersome code restructuring burden of stack ripping; evolving either style of code involves revisiting the invariant logic due to changing concurrency assumptions and sometimes making localized changes to functions in order to revalidate local state.

Finally, a hybrid model adapts between code with automatic and with manual stack management, enabling cooperation among disparate programmers and software evolution of disparate code bases.

Acknowledgements

We would like to thank Jim Larus for discussing this topic with us at length. Thanks also to the anonymous reviewers for their thoughtful comments.

References

BDET00 William Bolosky, John Douceur, David Ely, and Marvin Theimer. Feasibility of a serverless distributed file system deployed on an existing set of desktop PCs. In Proceedings of the ACM SIGMETRICS 2000 Conference, pages 34--43, June 2000.

BDM98 Gaurav Banga, Peter Druschel, and Jeffrey C. Mogul. Better operating system features for faster network servers. In Proceedings of the Workshop on Internet Server Performance (held in conjunction with ACM SIGMETRICS '98), Madison, WI, 1998.

Bir89 Andrew D. Birrell. An introduction to programming with threads. Technical Report 35, Digital Systems Research Center, January 1989.

FHK84 Daniel P. Friedman, Christopher T. Haynes, and Eugene E. Kohlbecker. Programming with continuations. In P. Pepper, editor, Program Transformation and Programming Environments, pages 263--274. Springer-Verlag, 1984.

HFW84 C. T. Haynes, D. P. Friedman, and M. Wand. Continuations and coroutines. In ACM Symposium on LISP and Functional Programming, pages 293--298, Austin, TX, August 1984.

HS99 J. Hu and D. Schmidt. JAWS: A Framework for High Performance Web Servers. In Domain-Specific Application Frameworks: Frameworks Experience by Industry. Wiley & Sons, 1999.

LN79 H. C. Lauer and R. M. Needham. On the duality of operating system structures. Operating Systems Review, 13(2):3--19, January 1979.

LP01 J. Larus and M. Parkes. Using Cohort Scheduling to Enhance Server Performance. Technical Report MSR-TR-2001-39, Microsoft Research, March 2001.

MY98 S. Mishra and R. Yang. Thread-based vs event-based implementation of a group communication service. In Proceedings of the 12th International Parallel Processing Symposium, Orlando, FL, 1998.

Ous96 John Ousterhout. Why threads are a bad idea (for most purposes). In USENIX Technical Conference (Invited Talk), Austin, TX, January 1996.

PDZ99 V. S. Pai, P. Druschel, and W. Zwaenepoel. Flash: An efficient and portable web server. In USENIX Technical Conference, Monterey, CA, June 1999.

SBS02 E. Shih, P. Bahl, and M. Sinclair. Wake on Wireless: An event driven power saving strategy for battery operated devices. Technical Report MSR-TR-2002-40, Microsoft Research, April 2002.

WCB01 Matt Welsh, David Culler, and Eric Brewer. SEDA: An architecture for well-conditioned, scalable internet services. In Proceedings of the Eighteenth Symposium on Operating Systems Principles (SOSP-18), Banff, Canada, October 2001.

