Check out the new USENIX Web site. [top | prev | next]

Isolation with Flexibility:
A Resource Management Framework for Central Servers

David G. Sullivan, Margo I. Seltzer
Division of Engineering and Applied Sciences
Harvard University, Cambridge, MA 02138

{sullivan,margo}@eecs.harvard.edu



4. Prototype Implementation

We have implemented the extended framework in VINO 0.50, and we use it to manage CPU time, memory, and disk bandwidth. In the following sections, we describe the key details of our implementation, including the scheduling and allocation mechanisms that we employ.

4.1 Threads and Currencies

A thread can be thought of as a special type of currency. Like all currencies, threads hold backing tickets, and they also issue temporary transfer tickets to threads on which they are waiting (see Section 4.3). By having VINO's thread class inherit from its currency class 4, threads can issue and receive tickets using the same methods as non-thread currencies, and other methods (e.g., the one that recursively computes a ticket's base value) can also treat threads and currencies interchangeably. Threads also reuse currency data members to keep track of the tickets that they hold and have issued.

Currencies that are also threads are identified by the process id of the thread. All other currencies are identified by a unique currency identifier (cid).

4.2 Currency Configuration and Permissions

By default, the system creates one currency for each active user of the system, and it funds these user currencies equally from the base currency. Each user's currency in turn funds the tasks of that user. The user currencies are created by means of a function, cid_for_client(), that is invoked when a process changes its real user id (uid); this function uses a uid to cid mapping to determine which currency should be used to fund the process. If no mapping exists for a given user, a new currency is created and funded. In either case, the process' existing funding is revoked and replaced with funding from the appropriate user currency. Once a user's login shell is correctly funded, processes forked by that shell are funded by the same currency. More generally, child processes are funded by the issuer of the first ticket in the parent's list of backing tickets for that resource. 5

Each user currency is owned by the corresponding user and has a currency mode (see Section 2.3) that allows the user to manipulate it using a set of system calls added for this purpose. A user's tasks receive equal ticket allocations by default; the user can modify these allocations and thereby alter the relative resource rights of the tasks. Users can also create currencies and fund them with tickets from their user currency. However, they cannot increase the funding of their user currency itself, because they do not have permission to issue other currencies' tickets. Thus, each user's tasks are securely isolated from the tasks of other users, and each user has the same total resource rights.

Other currency-configuration policies could also be specified. On extensible operating systems like VINO [Sma98], superusers could safely download specialized versions of the cid_for_client function (which is also called when a process' real group id (gid) changes) to specify arbitrary configuration schemes based on uid and gid, as well as alternative access-control policies for currencies. Approaches that do not involve extensibility could also be employed to accommodate a more limited range of possible configurations and access controls.

4.3 Managing CPU Time

Our prototype uses the original lottery-scheduling algorithm [Wal94] to schedule the CPU, randomly choosing an active ticket and traversing the runnable queue to find the thread that holds the ticket. In searching for the winning thread, the system computes the base value of each thread's CPU backing tickets. We cache these base values to avoid unnecessary computation, although the cached values must be invalidated whenever a change occurs in a currency's count of active tickets (e.g., when a thread starts up, blocks, or exits).

Our prototype also uses two other features of the original lottery-scheduling framework, compensation tickets and ticket transfers. Compensation tickets are issued to threads who do not use their full quantum, temporarily inflating their resource rights to give them a higher probability of being chosen when they next become runnable. Ticket transfers occur when a thread blocks attempting to acquire a kernel mutex or to allocate memory. Because the thread is itself a currency (see Section 4.1), it issues a ticket and uses it to fund the thread on which it is waiting (the holder of the mutex or the pageout daemon), transferring its resource rights to that thread. This can reduce the time that the waiting thread spends blocked, and it prevents priority inversion from occurring. When the thread is made runnable again, its transfer tickets are revoked.

A number of deterministic algorithms, including stride scheduling [Wal95] and EEVDF [Sto96], can also be used within the lottery-scheduling framework to provide increased accuracy and lower response-time variability for interactive processes. Because our work primarily addresses ways of overcoming the limits that proportional-share frameworks impose on flexible resource allocation, the algorithms used are not crucial.

4.4 Managing Memory

Effective proportional-share memory management is complicated by the difficulty of determining which processes are actively competing for memory and by the undesirability of a strict partitioning of memory among processes. When scheduling the CPU, threads that are blocked are simply ignored and the values of their tickets are not counted. Our data [Sul99b] indicates that a similar approach to memory management is not effective. When the system experiences heavy memory pressure, any process that blocks, even momentarily, can lose a large number of pages to the activity of the pageout daemon, resulting in erratic paging behavior and poor throughput. The obvious alternative, namely leaving the memory tickets of all processes active at all times, is also not viable, because pages belonging to idle processes tend to remain in memory indefinitely, reducing the number of pages available to active processes and effectively partitioning the total memory of the system. We have therefore chosen to give memory guarantees only to privileged processes that explicitly request them. Other processes compete for the unreserved portion of memory, which we ensure comprises at least five percent of the memory not wired by the kernel. While this approach is limited (e.g., it can easily lead to thrashing), it allows us to experiment with the resource trade-offs that applications can make.

Processes running as root can obtain hard memory shares from the base currency. Once a currency is funded with memory tickets, resource principals with appropriate permissions can obtain soft or hard subshares of its allocation. As with any resource, hard shares are obtained using the reserve() system call and soft shares using the fund() system call. In the common case of obtaining a hard share from the base currency, users simply specify the size in kilobytes of the memory share they are requesting. In other cases, users either specify a number of memory tickets (for soft shares) or a percentage of the issuing currency's share (for hard shares).

To maintain guaranteed shares, we altered the behavior of VINO's pageout daemon so that pages owned by processes that have not exceeded their memory guarantee are not reclaimed. This approach does not limit processes to their memory shares, but merely ensures that they can receive at least that amount. In the absence of memory pressure, processes receive as much memory as they need. If a process holds soft memory tickets, the number of pages to which it is entitled can change dynamically as the value of its tickets changes. The pageout daemon thus needs to compute the current base value of the processes' memory tickets; cached values are used whenever possible.

4.5 Managing Disk Bandwidth

Our prototype supports proportional sharing of disk bandwidth using the hierarchical YFQ algorithm [Bru99b]. YFQ approximates ideal proportional sharing by maintaining a virtual time function and per-disk queues of outstanding disk requests for each resource principal. Each of the queues has an associated finish tag that reflects the principal's past disk activity, its current share of the disk, and the length of the request at the head of the queue. Requests from queues with the smallest finish tags are forwarded to the device driver in small batches that can be reordered by the driver or device to achieve better aggregate throughput.

A principal's disk tickets are active whenever it has an outstanding request. To adjust to dynamic changes in the number of active tickets, the base value of a principal's disk tickets is recomputed (using cached values if possible) whenever a request reaches the head of its queue, and this value is used to compute the queue's new finish tag.

4.6 Emulating Nice

To support the semantics of nice, we created a utility that runs with root privileges and executes programs with funding from the base currency. This utility reduces the funding of the caller's user currency by the amount requested for the new job, thus preserving isolation. The utility actually creates a new currency for the task, funds that currency with the requested number of tickets, and uses the new currency's tickets to fund the task (see Fig. 4, bottom). This level of indirection is needed in case the task spawns any children; if so, they will share the funding of the new currency. While this utility would typically be used to give a small percentage of the CPU to long-running, CPU-intensive jobs, it can be used with other resources as well. It successfully overcomes the lower limits imposed by currencies without employing VINO's extensibility mechanism, as we had originally planned [Sul99a]. The other broker methods could also be overridden using similar setuid-root utilities.

4.7 Carrying Out Exchanges

As discussed in Section 3.2.1, a number of challenging questions must be answered before a system that fully supports dynamic ticket exchanges can be built. At this point, we have implemented a framework that allows us to easily test the effects of ticket exchanges and thereby gain insight into the issues involved.

We provide two mechanisms for experimenting with exchanges. First, users with appropriate permissions can employ our  reserve(), fund(), and unfund() system calls to implement static exchanges, preset modifications to the default ticket allocations. Second, we have implemented a simple central dealer in the kernel, and we allow applications to propose exchanges dynamically using a system call (exch_offer) added for this purpose. When an exchange is carried out, the four tickets involved (see Section 3.2.2) are linked to each other in a circular queue so that the exchange can be invalidated when one of the principals exits, loses too much funding, or retracts the exchange. If one of the tickets is deleted, all four of them are, thereby cancelling the exchange.

____________________

4. VINO is written in C++.

5. Actually, if a process holds tickets from more than one currency, its children should be funded by all of these currencies. We plan to extend our implementation to deal with this case.

[top | prev | next]