During some sunny Seattle days in October, the program co-chairs Karin Petersen (Xerox PARC) and Willy Zwaenepoel (Rice University) opened the Second Symposium on Operating System Design and Implementation - OSDI '96. OSDI has rapidly become one of the two premier forums for reporting cutting-edge OS research. The conference was well attended, with 365 attendees at the technical sessions. A day of tutorials preceded the technical sessions, covering IPv6, mobile networking, Java, NT internals, SimOS, and security on the Web. Two provocative invited talks, on "JavaOS" and "Active Networks," complemented the technical program, and a panel and BOFs were thrown in as well.
This year, the chairs ran a well-organized process - better than I did as program chair for OSDI '94 - and raised the already high standard of review, resulting in 19 high-quality papers, including several that were downright exciting to many. Willy and Karin attracted 110 full paper submissions, obtained some 750 usually lengthy reviews, and attained a remarkable average of nearly 9 lengthy reviews per second-round paper. Twelve other program committee members and 184 external referees helped in this process. In a process sometimes painful to both sides, all of the accepted papers were fully shepherded, further improving them. The co-chairs reported that they found working together always helpful and never painful. As a PC member, I can vouch that they did a great job from our point of view, too.
Karin announced the award papers, two that stood out at every stage of the process: "Automatic Compiler-Inserted I/O Prefetching for Out-Of-Core Applications" from the University of Toronto, and "Safe Kernel Extensions Without Run-Time Checking" from Carnegie Mellon University. Both papers are from compiler and programming language researchers, who take their sometimes novel work and apply it in always novel and clever ways to traditional and recent OS problems. One may speculate as to the reasons papers from those communities were chosen, and there are many, but the bottom line is that these are two lovely, clear papers reporting innovative work with impressive results. For me, it was a personal honor to shepherd these papers, and the authors cooperated by making significant improvements to their already strong papers. The other 17 papers were also strong, and many were substantially revised during shepherding, incorporating additional results and writing revisions. A strong and interesting program resulted.
After the opening remarks, Dr. Jim Mitchell, Sun Fellow and VP of Technology and Architecture at JavaSoft, kicked off OSDI '96 proper with a provocative invited talk. Mitchell said that future generations will look back and see that now is the time when computers started to "disappear" into other things. Small, embedded systems are now becoming dominant in the world of computing, and the Internet of the future will contain not only business networks but also "home nets" and personal wireless networks linking TVs, telephones, car navigation systems, remote monitoring systems, and even light switches! This is the world for which Java and JavaOS were designed.
Java is a "write once, run anywhere" programming language. However, in order to fit into the myriad computing environments of the future, Java defines different APIs for different domains: applications and applets, embedded systems, consumer electronics, factory automation, telephony, and so on. Sun and JavaSoft are encouraging partners to write open APIs for new computing domains. Sun is committed to Java as an open but governed standard. This means, for example, that Java licensees agree not to change or remove core APIs. JavaSoft is developing conformance and compliance tests to validate different Java implementations. Most Java implementations today are embedded in specific applications (e.g., Web browsers), but this will soon change. Practically all major OS vendors have agreed to provide Java services as part of their OSes. JavaOS from JavaSoft is a new OS designed around Java.
JavaOS implements Java directly on a hardware platform (e.g., x86, SPARC, and ARM) and provides "just enough" to implement Java, including the Java virtual machine, threads, interrupts, device drivers, and boot facilities. JavaOS has a small footprint: 1MB of ROM and 1.5MB of RAM. For embedded systems the footprint is even smaller: just 512KB of ROM and 128KB of RAM while retaining the ability to download new Java code. JavaOS runs in a single address space along with all other Java code; the Java language, virtual machine, and runtime provide for system integrity and security. JavaOS works on small machines - local file systems are not required - and is itself implemented almost entirely in the Java language. (Virtual memory is not supported, with the apparent implication that it is not and will not be necessary; in later hallway conversations many found that highly unlikely for many of the environments envisioned for JavaOS, such as "Network Computers.") A principal goal of JavaOS is to reduce the cost of ownership of computing devices such as network computers (NCs). JavaOS provides ease of installation and administration, thereby eliminating most of the estimated $2,500/year cost of maintaining a typical PC. The first JavaOS-based product, Sun's JavaStation NC, was announced in New York during the OSDI '96 conference.
After presenting JavaOS, Mitchell opened the floor to questions. The first question compared Java and JavaOS to Lisp machines: systems that had many Java-like features but which failed in the marketplace. Mitchell said that Java will succeed where Lisp machines failed because Java is more popular than Lisp and will be available in small and cheap devices. A similar question from the audience compared the new class of network computers to previous devices such as X terminals. Mitchell answered that NCs are directed toward a different market. NCs will succeed because they are not designed to replace high-performance workstations. Rather, they will replace desktop PCs that are used in very stylized ways, principally those dedicated to running in-house applications. NCs will replace PCs in circumstances in which a complete PC is not required. NCs with JavaOS will be cost-effective because they eliminate most network and device administration.
Another audience member asked about incorporating subsets of the Java language into very small devices such as smart cards. Mitchell said that subsetting Java is a dangerous concept in that it violates the "write once, run anywhere" principle. On the other hand, that principle is unreasonable in a world where computers range from thermostats to workstations. The solution is for Java to provide a broad range of APIs for different classes of computing, as described previously. A final question asked about compiling Java to native target CPUs rather than to the Java virtual machine. Mitchell answered that this issue is a trade-off. Native compilation of critical JavaOS sections will yield significant speedups on devices such as NCs. However, native compilation isn't an option for devices with very limited memory such as smart cards.
The session on Caching and Prefetching in I/O Systems opened with the presentation of one of the award papers, "Automatic Compiler-Inserted I/O Prefetching for Out-Of-Core Applications," by Todd C. Mowry, Angela Demke (presenter), and Orran Krieger, all from the University of Toronto. The work addresses a problem encountered by numeric applications that have data sets larger than a machine's main memory. When such applications rely on the local virtual memory system to migrate data between main memory and the disk(s), they frequently suffer from poor performance caused by excessive I/O stall time. Traditionally, this has meant that programmers who wished to optimize the performance of these applications had to do it by hand, modifying their programs to explicitly transfer data to and from disk at the appropriate times. This labor-intensive task had the additional negative side effects of reducing application portability (because of built-in assumptions about the underlying memory system) and sometimes reducing the numerical stability of the application algorithms.
To address these problems, the authors built a system that attacks the performance problems of these large out-of-core applications. Their goal was to improve application performance with a minimal amount of operating system support, while maintaining the traditional virtual memory abstractions that free programmers from explicitly managing I/O. Their solution was to use the compiler to automatically insert prefetch directives into the application binaries.
In the resulting system, the programmer writes an application without any extra thought or concern for how its I/O operations will be managed. Compiler analysis is used to determine the application's data reference patterns, and the appropriate prefetch and release operations are inserted into the resulting binary. The compiler techniques are not described in detail in the paper, but are derived from known techniques for prefetching data into the processor cache from memory. At runtime, the release and prefetch operations are intercepted by a runtime library. This library maintains a shared bitmap with the kernel, specifying which pages are already resident in memory. Using this bitmap, the library filters out redundant requests and passes the rest to the operating system.
In addition to maintaining the shared bitmap, the operating system also provides support for prefetch and release requests. A prefetch request loads the specified page into memory, if there are free pages available. A release operation specifies that a page is no longer needed, and the operating system adds the page to the free list.
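The division of labor between the runtime library and the kernel might be sketched roughly as follows. This is a minimal model of the bitmap-based filtering described above; all names are hypothetical, since the paper does not give a concrete API:

```python
# Sketch of bitmap-filtered prefetch/release (hypothetical names; the actual
# runtime library and kernel interface are not specified in the paper).

class ResidencyBitmap:
    """Bitmap shared between the runtime library and the kernel,
    recording which virtual pages are currently resident in memory."""
    def __init__(self, num_pages):
        self.resident = [False] * num_pages

class CountingOS:
    """Stand-in for the kernel: just counts the requests that actually
    reach it, so filtering can be observed."""
    def __init__(self):
        self.prefetches = 0
        self.releases = 0
    def prefetch(self, page):
        self.prefetches += 1   # real kernel: read page in if a frame is free
    def release(self, page):
        self.releases += 1     # real kernel: move page to the free list

class PrefetchRuntime:
    """Runtime library that intercepts compiler-inserted prefetch/release
    operations and filters out redundant requests via the shared bitmap."""
    def __init__(self, bitmap, os_iface):
        self.bitmap = bitmap
        self.os = os_iface

    def prefetch(self, page):
        if self.bitmap.resident[page]:
            return False              # already resident: no system call
        self.os.prefetch(page)
        self.bitmap.resident[page] = True
        return True

    def release(self, page):
        if not self.bitmap.resident[page]:
            return False              # not resident: nothing to release
        self.os.release(page)
        self.bitmap.resident[page] = False
        return True
```

In this model a repeated prefetch of a resident page costs only a bitmap check in user space, which is the point of sharing the bitmap with the kernel.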
To analyze the performance of their system, the authors ran a suite of large numerical applications with out-of-core data set sizes. The applications were run both with and without the compiler-inserted prefetching operations. In all of the applications, the prefetching improved end-to-end application performance; most of the page faults incurred by the original versions of the applications were replaced by prefetch operations, reducing the I/O stall time of the applications at the cost of a slight increase in user execution time. The resulting performance increases ranged from 9 to 270 percent.
In their future work, the authors plan to examine the behavior of their system with multiprogrammed and parallel workloads.
During the question and answer period, a member of the audience asked whether the authors had compared the performance of their compiler-inserted prefetching with hand-optimized versions of the same programs. The authors hope to perform this comparison as part of their future work.
Another questioner asked whether the authors' techniques are also applicable to database applications. Since their compiler analysis exploits references to array-based structures, the authors felt that their techniques would not be as useful in applications, such as databases, that make extensive use of pointers.
The second paper of the session, "A Trace-Driven Comparison of Algorithms for Parallel Prefetching and Caching," would have won "most-author," "most-institution," and "most-faculty" awards. Graduate students were the first three authors: Tracy Kimbrel (University of Washington), Andrew Tomkins (Carnegie Mellon University), and R. Hugo Patterson (CMU), with six advisors from UW, CMU, the University of Wisconsin, and Princeton. There was a reason if not a method to this madness, as the paper was the result of the shotgun marriage of two separate submissions to the conference. While the logistics of coordinating nine authors at four different sites must have been nightmarish, the result is an excellent piece of research. Tracy Kimbrel presented the paper.
This paper provided an interesting contrast with the previous paper on compiler-inserted I/O prefetching. In contrast with the Toronto work, Kimbrel and his colleagues focused on the question of how the operating system can best exploit advance knowledge of an application's I/O request stream.
The authors used trace-driven simulation to evaluate four different algorithms for prefetching and caching file data on systems with multiple disks. The simulations used a variety of application workloads and varied the number of disks in the I/O system. It was assumed that the operating system has advance knowledge of the application's complete read request patterns, although not all of the algorithms required this much knowledge.
The first two algorithms, "fixed horizon" and "aggressive," were based on prefetching strategies used in two previously published studies. The third algorithm, "reverse aggressive," was a more complex strategy that is provably close to optimal. The comparison of these three algorithms showed several test cases in which either the fixed horizon or the aggressive algorithm did not perform as well as the near-optimal reverse aggressive algorithm. In general, the fixed horizon algorithm did not prefetch aggressively enough on I/O-bound workloads, and the aggressive algorithm issued too many unnecessary prefetch requests on CPU-bound workloads, resulting in excessive driver overhead.
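The contrast between the first two policies can be caricatured as a pair of decision rules. This is my own paraphrase for illustration, not the paper's formulation, and it omits the disk and buffer bookkeeping that a real simulator needs:

```python
# Caricature of the two baseline prefetching policies (illustrative only).

def should_prefetch_fixed_horizon(next_access_distance, horizon):
    """Fixed horizon: prefetch a block only once the application is within
    a fixed number of accesses of needing it. Conservative, so it can
    under-prefetch on I/O-bound workloads."""
    return next_access_distance <= horizon

def should_prefetch_aggressive(disk_idle, buffer_available):
    """Aggressive: prefetch the next missing block whenever a disk is idle
    and a buffer can be obtained. Eager, so it can issue unnecessary
    prefetches (and driver overhead) on CPU-bound workloads."""
    return disk_idle and buffer_available
```

The forestall algorithm described below can be thought of as switching between these behaviors, prefetching eagerly only when doing otherwise would cause a stall.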
The fourth algorithm, "forestall," combines the best features of the fixed horizon and aggressive algorithms. (The provably near-optimal reverse aggressive algorithm is too complex to be implemented in practice.) Comparing the forestall algorithm with the fixed horizon and aggressive strategies, the authors found that in each test case their new prefetching strategy performed as well as the better of the other two algorithms.
In their future work, the authors plan to study the performance of prefetching algorithms when less complete lookahead information is available and in the presence of multiprogrammed workloads. They also hope to develop an API to allow applications to inform the operating system of their future read accesses.
During the question and answer period, a member of the audience asked about the performance of applications that issue writes as well as reads. The authors have not studied write performance, focusing instead on a large class of read-intensive applications and believing that existing techniques, such as delayed writes, are sufficient to insulate applications from the write performance of the I/O system.
Another questioner asked whether the authors had consid ered allowing speculative reads. They had not, observing that at the lookahead depths that they are using the penalty for an unnecessary read is too great.
The final paper of the session was "Efficient Cooperative Caching Using Hints," by Prasenjit Sarkar and John Hartman (presenter), of the University of Arizona. The idea behind cooperative caching is to add an additional layer to the file caching hierarchy, allowing client machines to access file blocks cached by other clients on the network instead of requiring that all file requests be sent to a central file server. The goal is to exploit the large amount of total memory on the client machines, reducing the load on the server by allowing client machines to handle a large share of the file system request traffic.
Because a cooperative file cache is distributed across multiple machines, managing the cache can potentially involve every machine in the system. In this paper, the authors propose the use of "hints" to provide the information individual nodes need to make caching decisions. Hints provide clients with imprecise information about the global state of the cooperative cache, allowing them to make cache management decisions locally. This local decision making reduces the load on server machines by eliminating centralized control of the cooperative cache.
The use of hints involves a trade-off; eliminating centralized control reduces the overhead of the system, but the system is penalized for each inaccurate hint that it tries to use. Thus, the more accurate the hints are, the more successful a hint-based system is likely to be.
The authors use two types of hints to make caching decisions. First, for each file that a client uses, it maintains a set of hints describing the probable location of each block in that file. Whenever the client opens a file, the server gives the client a set of hints for that file. The server gets these hints from the last client to access the file. When a client has a local cache miss for a block, it forwards the request to another client based on the hinted location for that block. If the initial client's hint is incorrect, then the second client forwards the request based on its own hints, and so on until the block is found.
Clients also use hints to approximate a global LRU replacement algorithm for the cooperative cache. Whenever a block is evicted from the local cache of a client, the client can either discard the block, or forward it to another node in the system. Ideally, the client would discard the block if it was the oldest block in the entire system, and otherwise would forward it to the node that did have the oldest block in the system. In order to approximate this behavior, each client maintains a list of the oldest blocks on each other client, and uses this list to make replacement decisions. These lists contain hints, and are not guaranteed to be correct. Each time a block is forwarded from one client to another, the clients exchange the age of their current oldest blocks, updating each others' oldest block lists.
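The hint-chasing lookup of the preceding paragraphs can be sketched as follows. The structures and names are hypothetical simplifications; the real protocol also involves the server, which supplies the initial hints and serves as the fallback:

```python
# Toy model of hint-based block location (hypothetical; the real system
# works at the file/block level and falls back to the server on failure).

class Client:
    def __init__(self, name):
        self.name = name
        self.cache = {}   # block id -> data actually cached at this client
        self.hints = {}   # block id -> Client we *believe* caches the block

    def lookup(self, block, max_hops=16):
        """Follow possibly stale hints from client to client until the block
        is found; each hop consults that client's own (newer) hints."""
        node, hops = self, 0
        while hops <= max_hops:
            if block in node.cache:
                return node.cache[block], hops
            node = node.hints.get(block)
            if node is None:
                return None, hops   # give up: ask the server (not modeled)
            hops += 1
        return None, hops
```

An incorrect hint costs an extra forwarding hop rather than a failure, which is why the scheme's performance depends on hint accuracy, as noted above.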
Sarkar and Hartman used a simulation-based study to compare hint-based cooperative caching with four other cache management schemes, two taken from earlier research in the field and two idealized algorithms. They found that the average block access time for their hint-based algorithm was very close to the optimal performance of the two idealized algorithms. Compared to the two real algorithms, the hint-based scheme substantially reduced the replacement-related network traffic and the average load on the server.
Unlike most software DSM sessions at OS conferences, two of these three papers focused on issues other than performance. The first looked at using a DSM system for debugging concurrent programs, and the second exploited the redundant information available in DSM systems to provide fault tolerance very cheaply. The final paper focused on performance, examining a new protocol as well as multiprocessor issues.
The first talk was "Online Data-Race Detection via Coherency Guarantees," presented by Dejan Perkovic (University of Maryland). The work focuses on data-race detection in explicitly parallel shared memory applications that are written for the lazy release consistent (LRC) memory model. A data-race condition arises when multiple unsynchronized accesses to the same shared data occur, at least one of them a write. Perkovic described a technique to detect data-race conditions dynamically using consistency information from the underlying software distributed shared memory.
This technique differs from previous work in that it does not require compiler support. Like other work, on the other hand, the new technique can detect only data-races that occur in a given execution. The system consists of a software distributed shared memory system called CVM and instrumentation code inserted into application programs by the ATOM code rewriter.
The LRC model divides the execution of a process into intervals (the execution phases between an acquire and a release) and collects information about write accesses to shared data inside them. In the proposed system, the LRC protocols are modified to collect information on read accesses in addition to write accesses, since a data-race may involve both access types. Although data-race detection in principle requires comparing every pair of memory accesses, the system considers only those that occur in concurrent intervals, because the underlying memory model (LRC) guarantees race-free access in other stages of (parallel) execution. During the execution of an application, CVM detects and collects information for shared memory accesses. A race detection algorithm is then used to report actual data-races at existing synchronization points within the application.
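The core check can be illustrated with a toy model. This is my own sketch, not CVM's implementation: intervals carry vector clocks, only concurrent intervals are compared, and a race is an overlap of their access sets with at least one write:

```python
# Toy interval-based race check (illustrative; CVM records accesses at page
# granularity using information the LRC protocol already maintains).

def happens_before(vc1, vc2):
    """Standard vector-clock ordering: vc1 precedes vc2 iff vc1 <= vc2
    pointwise and the two clocks differ somewhere."""
    return all(a <= b for a, b in zip(vc1, vc2)) and vc1 != vc2

def concurrent(i1, i2):
    """Two intervals are concurrent iff neither happens before the other."""
    return (not happens_before(i1["vc"], i2["vc"]) and
            not happens_before(i2["vc"], i1["vc"]))

def races(i1, i2):
    """Return the shared locations on which the two intervals race: they
    must be concurrent and touch a common location with at least one write.
    Ordered intervals are skipped entirely, which is how LRC's guarantees
    prune most of the pairwise comparison work."""
    if not concurrent(i1, i2):
        return set()
    w1, w2, r1, r2 = i1["writes"], i2["writes"], i1["reads"], i2["reads"]
    return (w1 & w2) | (w1 & r2) | (w2 & r1)
```

Read/read overlap between concurrent intervals is deliberately not reported, matching the definition of a data-race above.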
Four shared memory applications (Water, SOR, FFT, and TSP) were analyzed using the system and data-races were detected in two of them. The TSP program allows data-races to improve performance without violating correctness; however, the data-race in Water was a bug! Perkovic claimed that the system can eliminate over 70% of the program execution from online analysis using the information already collected by the LRC protocol. Despite its approximately 3-fold overhead, the new technique can be used for debugging complex applications when traditional methods are insufficient.
The second talk in this session was "Lightweight Logging for Lazy Release Consistent Distributed Shared Memory," presented by Manuel Costa (INESC). The motivation for this work is that distributed shared memory (DSM) systems generally execute long-running, highly compute-bound applications, and thus suffer an increased probability of node failures. Therefore, fault tolerance mechanisms are required to allow successful completion of such applications.
Costa's group developed an algorithm by applying checkpointing and logging-and-playback, the well-known recovery techniques, to distributed shared memory. This new algorithm tolerates single-node failures by maintaining an online log of data dependencies and provides recovery in a lazy release consistent (LRC) DSM system. Only a small amount of data is collected and logged in main memory. Also, since the algorithm requires no messages other than those used by the LRC protocols, the performance overhead is minimal.
The system model assumes a reliable message exchange platform, fail-stop operation, data-race-free applications, and deterministic memory accesses to shared data, except for read accesses. The algorithm is composed of three modules: (1) logging to tolerate single-node failures, (2) independent checkpointing to speed up recovery, and (3) recovery. The logging module is implemented by making small changes to the TreadMarks distributed shared memory system, essentially adding code to log vector clock (VC) and logical time pairs. Independent checkpointing is implemented with the libckpt library. Currently, the recovery module is only partially implemented.
Experiments have been done to measure the execution time and disk space overhead of logging compared to the Message Logging (ML) approach. ML logs copies of acquire messages and references to page differences at access misses and flushes them to a local disk. The time overhead of VC logging was found to be less than 1%, and the space overhead varies between 3% and 8% of the log sizes of ML. Similarly, independent checkpointing adds only another 2-22% overhead, depending on the application tested. Another set of experiments showed that recovery time with the new approach is between 72% and 95% of the normal execution time of the test applications. Costa concluded that good performance is achieved by tightly integrating recovery with the LRC protocols, and that recovery with the new algorithm is potentially faster than with pure consistent checkpointing.
Liviu Iftode (Princeton University) presented the session's last talk, "Performance Evaluation of Two Home-Based Lazy Release Consistency Protocols for Shared Virtual Memory Systems." Two new relaxed memory consistency protocols were evaluated, although time permitted presentation of only one.
There are two major factors that limit the performance of page-based software shared memory: false sharing and protocol overhead. Although false sharing can be substantially reduced by the use of relaxed memory consistency models, with protocols such as Lazy Release Consistency (LRC), these models do not scale well because of their high protocol overhead. Iftode's group studied the scalability of the LRC protocol and developed two new home-based LRC protocols (HLRC), in which one copy of each shared page is eagerly updated.
The new protocol is similar to the authors' earlier work, Automatic Update Release Consistency (AURC), and to the LRC protocols. The AURC protocol eliminates expensive diff (difference) processing by propagating updates automatically to a home node as they occur, via a hardware mechanism. The HLRC protocol uses the "home" concept of AURC and maintains a home node for every shared page, to which all updates (diffs) are sent. The distinction from LRC is that in HLRC the diffs are sent to the home node at the end of each interval (i.e., when an acquire message arrives at a node that holds a diff). This approach brings pages up to date sooner, with a single round-trip message, and also eliminates the hardware support requirement of the AURC protocol. The advantages of HLRC are low diff processing overhead, as diffs are applied only once; a simpler protocol, eliminating multiple diffs; low memory overhead, because of the short lifetime of diffs; and a reduced number of messages to update a shared page.
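The home-based diffing idea can be sketched in miniature. This is an illustrative model only, with pages represented as lists of words and communication reduced to a function call; the real protocol's twin creation, timing, and messaging are more involved:

```python
# Sketch of home-based diffing (illustrative; hypothetical representation).

def make_diff(twin, page):
    """Compare a dirty page against its pristine twin and keep only the
    words that changed, as {word index: new value}."""
    return {i: new for i, (old, new) in enumerate(zip(twin, page))
            if old != new}

def apply_diff(home_page, diff):
    """The home node applies each diff exactly once, keeping the master
    copy of the page up to date. Non-home nodes can then fetch the whole
    page from the home instead of collecting and merging many diffs,
    which is the source of HLRC's lower diff-processing overhead."""
    for i, new in diff.items():
        home_page[i] = new
```

Because a diff is consumed as soon as the home applies it, diffs are short-lived, which accounts for the low memory overhead claimed above.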
Both the LRC and the new HLRC protocols are implemented on a 64-node Paragon. Experiments show that the new protocol outperforms the LRC protocol by more than a factor of 2 on the implementation platform.
The HLRC variant that was not described uses a communication processor, available on each Paragon node, to do some protocol processing in parallel, providing a modest improvement.
Tuesday afternoon closed with a fast-paced session of 17 presentations of Works in Progress, moderated by Karin Petersen, who kept it running on time. Speakers were allotted a big 4 minutes for presentation and, in an interesting departure from usual practice, two additional minutes were given for questions and answers.
Jason Nieh (Stanford/Sun) presented "The SMART Scheduler," a scheduler for continuous media processes that provides efficient CPU usage to both real-time and conventional activities.
Patrick Sobalvarro (MIT) described "FM-DCS: An Implementation of Dynamic Co-scheduling on a Workstation Cluster." Dynamic co-scheduling coordinates the scheduling of parallel processes. Message arrivals are used to trigger scheduling decisions, which implies that the co-scheduling only affects those processes that would benefit from co-scheduling. The project is to develop an adaptive mechanism for ensuring fairness to other processes on a system where some processes are influenced by dynamic co-scheduling. More information is available at: http://www.research.digital.com/SRC/scheduling.
Anurag Acharya (University of Maryland) described "Resource-aware Mobile Programs" that can adapt to changes in resource availability dynamically, specifically network latency and bandwidth. They are working on cheap-to-compute measurements of such. Someone asked what applications they had looked at: they had implemented an adaptive multi-user talk application that migrated between nodes.
Daniel Hagimont (INRIA) presented "A Protection Scheme for Java Mobile Code." While Java provides the mobility for code (in platform independence), greater security will be needed for servers to host arbitrary mobile code. This work proposes a capability-based protection scheme for Java applets to utilize server resources. Protected interfaces to server objects are provided to the Java applets.
Chengjie Liu (University of Wisconsin, Madison) presented "Strong Cache Consistency for the World-Wide Web." This work hopes to answer the question "Is strong cache consistency feasible on the Web?" Three kinds of consistency: adaptive time-to-live, poll-every-time, and invalidation were modeled on the Harvest cache system. More information is available at http://cs.wisc.edu/~chengjie/web.
Amin Vahdat (University of California, Berkeley) presented "WebFS: A Global File System for Fine-Grained Sharing." Several collaborative applications (Internet chat, video conferencing, a shared whiteboard, among others) have been prototyped on the WebFS which aims to address limitations in sharing over the Web. They are looking at application specific caching, fault tolerance issues, adaptive caching, and using multicast to maintain cache consistency. More information can be found at http://now.cs.berkeley.edu/WebOS.
Jeanna Neefe Matthews (University of California, Berkeley) presented "Improving the Performance of Log-Structured File Systems with Adaptive Methods." This work looks at improving the performance of LFS (Log-Structured File Systems) across a wider range of workloads. LFS has been shown to have very good performance under some workloads, but poor performance under other common workloads. This work will look at adaptive methods (specifically adaptive "hole-plugging") to improve LFS performance. One listener asked how the proposed adaptive method could identify the point at which to switch to/from hole-plugging. Answer: by using empirically estimated costs of cleanup and plugging. Another audience member asked how the system will improve write clustering (to improve read performance). Answer: the proposed system will store an access graph, and a partitioning algorithm will expose clustered writes.
Miguel Castro (MIT) presented "Efficient Client Caching for Distributed Object Storage Systems." This work combines the features of coarse page-grained caching with fine object-grained caching by using 1KB subpages. Adaptive fetching tries to maintain the good effects of coarse page-grained caching (under good locality). Fragmentation is avoided by compacting and discarding objects on subpages. This computation is performed while waiting on fetch replies. Someone asked a question clarifying that this system needs semantic information about the objects (like a fine-grained caching system).
Harrick Vin (University of Texas, Austin) presented "The Case for an Integrated Multimedia File System." A prototype system, Peacock, was introduced that hopes to address the shortcomings of current file systems for continuous media such as audio and video. It targets both best-effort and real-time demands. Datatype-specific placement policies are allowed, as are specialized recovery mechanisms.
Garth Gibson (Carnegie Mellon University) presented "Network Attached Secure Disks." These are devices which will allow low-level file accesses to be given directly to the disk device which will sit directly on the network as a first-class entity. Higher-level access will be through a traditional server. An astute observer noted that no explanation for the "Security" in NASD was mentioned. There are security issues involved in having a raw disk on the network; it is proposed that code will be downloadable to the NASD via a private connection. Much more information is available at http://www.pdl.cs.cmu.edu/NASD.
Elizabeth Borowsky presented "Eliminating Storage Headaches Through Self-Management." This work describes a system administration tool that will configure, manage and optimize very large disk arrays. A feedback system will optimize based on usage patterns. A simple tool will allow the administrator to specify goals and tradeoffs. A prototype is implemented; future work involves more complicated workload analysis and more devices. One listener asked if this would scale down to smaller systems (yes). Another asked what the costs of administration/maintenance are; the answer was that generally 2-3 times the capital cost of the system. Jon Shapiro of the University of Pennsylvania won the "Best Question" award of the conference (worth a Riddler (TM) action figure) by then asking: "Well then, why not throw out the system every year and just replace it with a new one?" More information is available at http://www.hpl.hp.com/SSP/.
Olin Shivers (MIT) presented "Express," a Standard ML-based operating system which provides only threads, continuations, and events. Interrupts are not needed (they don't parallelize, nor do they scale). Because the entire system is based on ML, it is amenable to lambda-calculus analysis and optimizations. Three basic operators provide a policy-free mechanism for scheduling. One shameless questioner (Jay Lepreau from the University of Utah) asked how Olin was able to host Standard ML directly on the hardware. The answer is the University of Utah's OS Toolkit.
Patty Kostkova (City University, London) presented "Component Based Operating System." This work describes a system for increasing the flexibility of microkernels by late binding component interfaces. Additionally, the layers of abstraction may be bypassed on a per-application basis. All layers are available to an application, thus avoiding unnecessary or extraneous abstractions. It aims to provide flexibility, extensibility and dynamic reconfigurability.
Phil Winterbottom (Lucent) presented "Inferno: la Commedia Interattiva." Inferno is an operating system that provides a new concurrent programming language (Limbo) and virtual machine (Dis) for that language. Inferno provides standard interfaces to network access, standard security protocols, naming, and graphical interfaces. Dis runs as either a user process or on bare hardware. The abstractions defined by Inferno borrow heavily from Plan 9 - everything is a file. Dis also incorporates a real-time garbage collector. One audience member asked for a sample of network access code: it is available at the Inferno home page: http://www.inferno.lucent.com.
Greg Ganger (MIT) presented "System Software Support for High Performance Servers." In the context of the exokernel system, this work explores new system abstractions targeted at providing high performance for servers, while retaining portability. Specialized resource management and tighter integration of key abstractions (memory, disk and network) provide performance. A prototype HTTP server demonstrates some of the advantages of this system.
Dylan McNamee (University of Washington and OGI) presented "Virtual memory alternatives for client buffer management in transaction systems." This is a technique for integrating database buffer management with operating system virtual memory support, sitting between the two extremes found in current systems: no cooperation at all, or an OS that provides database memory management functionality in the system.
Carl Waldspurger (DEC SRC) presented "CPI: Continuous Profiling Infrastructure," a profiling system for the Digital Alpha that profiles unmodified system workloads. Profiling includes all system services, including the kernel. A comprehensive profile of an entire workload is efficiently generated with single-instruction accuracy, including memory access stalls. More information is available at http://www.research.digital.com/SRC/cpi.
Another provocative "Invited Talk" opened activities on Wednesday, as David Tennenhouse of MIT presented his vision of an "Active Network Architecture," in which network packets - so-called "capsules" - contain user-supplied code that is executed at network nodes along their routes. Routers would receive capsules into a transient execution environment where they could perform computations on the capsule contents, possibly with access to some "external" resources such as local storage, and then schedule the retransmission of capsules.
This work is motivated on several fronts. The first is the recognition that networks have been progressing from more passive entities such as dedicated circuit-switched networks to more "active" networks where some computation must be performed at each node, such as the packet-switched model. The next logical step in this direction is to allow the packet to perform controlled, specific computation on its own behalf at each node.
Tennenhouse suggests another motivation for an active network architecture is the network users' "pull." Network users and developers have requirements that tend to suggest an active network implementation, such as firewalls, web proxies, and mobile/nomadic computing systems, in which the message may control aspects of the communication and processing depending on where and how the end user is currently connected.
Finally, there is a technological "push" providing increasing support for active networks. Earlier examples of "active" technologies include PostScript, in which print "files" are actually programs sent by the user and executed by the printer, and Active Messages, in which the message specifies its handler on the receive side. The current development of secure or containable execution environments is providing the foundations for the transient capsule execution environment. The Active Network architecture leverages these new technologies to bring safety, efficiency, flexibility, and mobility together.
Two architectures were proposed: the first is a discrete approach which would separate the data packets from the "processing" (instruction/control) packets. Data packets would remain essentially the same, while the routers would be programmed via special packets and a "back-door" mechanism. The second approach would be an integrated approach using the described "capsules." There would be a standard router API and instruction set for the encapsulated environment, and available node resources would be specified and controlled. There would have to be methods for authentication and authorization, as well as safe resource allocation and scheduling.
Initial work at MIT includes an IP-based proof-of-concept called "ActiveIP" which uses the standard IPv4 option facility to include variable-length Tcl code (!) within an IP packet. Other on-going work is the Active Node Transfer System (ANTS) that will provide a Java-based set of base classes and a class hierarchy that will support capsule assembly, transfer, and interpretation; and a capsule evaluation engine for active nodes.
Active Networks provide a mechanism by which core network functionality can be easily extended and new network services may be implemented. With a standard Active Network architecture, developers can more quickly and easily provide new network services without dependence on hardware and router changes. For instance, strategically located active nodes could support "active boundaries" between domains, while traditional nodes could operate unaffected in between. Tennenhouse contends that network scale, diversity, and heterogeneity will be more easily managed.
The question was raised about security and the potential for denial of service of the whole network should a malicious or faulty capsule program reach an infinite loop. This is essentially a resource or "commodity" management issue and as such must be dealt with by the resource specification and allocation mechanism. Resource allocation could perhaps include a maximum number of CPU cycles allowed to any capsule.
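A cycle budget of this kind is simple to picture. The sketch below is purely hypothetical (a toy capsule "instruction set" invented for illustration, not any proposed Active Network API): the node grants each capsule a fixed budget of evaluation steps, so a capsule stuck in an infinite loop is aborted rather than monopolizing the router.

```python
# Hypothetical capsule interpreter with a resource bound. The opcodes
# ("fwd", "loop") and the budget mechanism are invented for illustration.

def run_capsule(program, env, budget=1000):
    steps = 0
    pc = 0
    while pc < len(program):
        if steps >= budget:
            return "aborted"      # budget exhausted: capsule is discarded
        op = program[pc]
        steps += 1
        if op == "loop":          # a faulty/malicious capsule: pc never advances
            continue
        elif op == "fwd":
            env["forwarded"] = True
        pc += 1
    return "done"

env = {}
print(run_capsule(["fwd"], env))                # done
print(run_capsule(["loop"], env, budget=50))    # aborted
```

The budget plays the role of the "commodity" in the resource-management view: the node sells a bounded amount of CPU to each capsule and no more.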
Another question concerned the cost of the added computation on network performance. It was noted that IP was not particularly fast originally, but has been improved over time by identifying the critical paths. The response was that an Active Network could provide "performance hooks" to the critical paths, such as a default-forward mechanism with minimal computation. One of the strengths of Active Networks is the trade-off of computation for bandwidth: if messages may make intelligent decisions along the way, they may be able to decrease the overall load on the network and therefore improve the performance of the network as a whole.
Beyond MIT, Active Networks are already being researched at several other universities. I observe that research activity is guaranteed to increase because DARPA has initiated a significant program in that area. The challenges are great, but so is the potential.
The session on "Scheduling and Synchronization" kicked off with a talk by Bryan Ford, from the University of Utah, on "CPU Inheritance Scheduling," a processor scheduling framework. This framework allows different policies to peacefully coexist. It supports better usage accounting, localized policies, avoidance of priority inversion, multiprocessors, processor affinity, and scheduler activations. The key idea of the framework is that arbitrary threads may "donate" their CPU time to another thread. Thus any thread may be a scheduler for other threads, facilitating hierarchical composition of policies. A low-level policy-free dispatcher in the kernel implements blocking and unblocking. Experimental results for a prototype implementation (in user mode on BSD) demonstrated the composition of policies, and provided an analysis of the associated overheads, which were projected to be reasonable.
Someone asked about composition of policies, and whether it always makes sense. Answer: this is a mechanism that allows composition, but not all compositions make sense. Another asked if any work had been done on a logic for composing policies. Again, all that has been worked out is a flexible mechanism for composing scheduling policies.
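The donation idea composes naturally. As a rough sketch (in Python, with invented thread and scheduler names, not the Utah prototype's code), a scheduler is simply a thread that passes each timeslice it receives on to one of its clients, and any client may itself be another scheduler:

```python
# Illustrative sketch of CPU inheritance scheduling: schedulers are threads
# that donate their CPU time, so scheduling policies nest hierarchically.

class Thread:
    def __init__(self, name):
        self.name = name
    def donate(self):              # a leaf thread consumes the timeslice
        return self.name

class RoundRobin(Thread):
    # A scheduler thread: when it receives CPU time, it donates it onward.
    def __init__(self, name, clients):
        super().__init__(name)
        self.clients, self.next = clients, 0
    def donate(self):
        c = self.clients[self.next]
        self.next = (self.next + 1) % len(self.clients)
        return c.donate()          # pass the timeslice down the hierarchy

# Compose policies: the root alternates between a foreground thread and a
# nested round-robin scheduler over two background threads.
root = RoundRobin("root", [Thread("fg"),
                           RoundRobin("bg", [Thread("bg-a"), Thread("bg-b")])])
trace = [root.donate() for _ in range(6)]
print(trace)  # ['fg', 'bg-a', 'fg', 'bg-b', 'fg', 'bg-a']
```

The low-level dispatcher's only job, as in the talk, would be blocking and unblocking; every policy decision lives in the (nested) scheduler threads.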
"A Hierarchical CPU Scheduler for Multimedia Operating Systems" was then presented by Pawan Goyal of the University of Texas, Austin. The paper describes a method and algorithm for hierarchical partitioning of CPU bandwidth. The authors describe three classes of applications, hard real-time, soft real-time and best effort, that have distinct requirements of the CPU. An algorithm, Start-time Fair Queuing (SFQ), was presented, along with results from an implementation under Solaris 2.4. Someone asked what guarantees on delay bound could be made under SFQ. The response indicated that weights in SFQ control bandwidth and fairness, not deadlines.
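The core of SFQ is compact. The following simulation is illustrative only (unit-length quanta, flow names, and lexicographic tie-breaking are assumptions of this sketch, not details from the paper): each backlogged flow's next quantum carries a start tag, the flow with the smallest tag runs, and a flow's next tag grows inversely with its weight.

```python
import heapq

# Illustrative Start-time Fair Queuing over CPU quanta of length 1.
def sfq(weights, quanta):
    v = 0.0                                   # virtual time
    finish = {f: 0.0 for f in weights}
    heap = [(max(v, finish[f]), f) for f in weights]
    heapq.heapify(heap)
    counts = {f: 0 for f in weights}
    for _ in range(quanta):
        s, f = heapq.heappop(heap)            # smallest start tag runs
        v = s                                 # v = start tag of quantum in
        counts[f] += 1                        #     service (matters if a
        finish[f] = s + 1.0 / weights[f]      #     flow goes idle)
        heapq.heappush(heap, (finish[f], f))  # next quantum's start tag
    return counts

print(sfq({"video": 2, "batch": 1}, 9))  # {'video': 6, 'batch': 3}
```

Over 9 quanta the 2:1 weights yield a 6:3 split, illustrating the claim that weights control bandwidth shares rather than deadlines.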
The last paper in this session provoked interesting controversy both in the question period and in hallway conversations. It was "The Synergy Between Non-blocking Synchronization and Operating System Structure," presented by Michael Greenwald of Stanford University. The authors' Cache Kernel uses Type Safe Memory (TSM), which requires that the type of an object never changes (even across a delete). This inherently makes traversing structures with pointers safer. It also facilitates non-blocking synchronization (NBS). Additionally, the atomic Double Compare and Swap (DCAS) instruction provides an important mechanism. NBS provided several key features to the Cache Kernel. First, it allowed asynchronous signal handlers to update synchronized data structures without deadlock. Second, it avoids various forms of priority inversion, and finally, it provides greater insulation from failures.
Several people in the audience were aghast at the use of a goto in the example. It is a simple work-around for the lack of a labeled break in C. Others questioned the dependence on TSM, and in general, the claim that programming with non-blocking synchronization is easier. All in all, an interesting paper and thesis.
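The retry loop at the center of the debate has a simple shape. The sketch below is a stand-in, not the Cache Kernel's code: real NBS relies on hardware CAS or DCAS, which is simulated here on a one-slot cell (and only single-word CAS, so the ABA issues DCAS helps with do not arise in this sequential toy).

```python
# Illustrative non-blocking stack using a simulated compare-and-swap.
class Cell:
    def __init__(self, value):
        self.value = value
    def cas(self, expected, new):
        # Atomic on real hardware; a sequential stand-in here.
        if self.value is expected:
            self.value = new
            return True
        return False

def nb_push(head, item):
    while True:                 # the retry loop a C goto would implement
        old = head.value        # read current head (tuple-encoded list)
        if head.cas(old, (item, old)):
            return

def nb_pop(head):
    while True:
        old = head.value
        if old is None:
            return None
        item, rest = old
        if head.cas(old, rest):
            return item

s = Cell(None)
for x in (1, 2, 3):
    nb_push(s, x)
print(nb_pop(s), nb_pop(s), nb_pop(s))  # 3 2 1
```

If the CAS fails because another thread intervened, the loop simply re-reads and retries; no thread ever holds a lock, which is what lets signal handlers update the structure without deadlock.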
This session contained two papers. Each was the first detailed publication describing the new operating systems being developed by two major ARPA-funded OS projects, at the Universities of Arizona and Utah.
The first talk was "Making Paths Explicit in the Scout Operating System," presented by David Mosberger (University of Arizona).
Today's computer application environments are moving towards "network appliance" environments. A network appliance (NA) can be a smart card, a network disk, a multimedia station, or even a scalable server node. In essence, NAs have several distinctive properties: they are communications oriented; they have special and diverse functionality; their performance is predictable; and they have scarce resources. One way to build software for NAs is to use a modular (horizontally structured) design, forming flexible and reusable modules. However, this approach suffers from the high overhead of modularity. An alternative is to adopt a more complex and expensive vertical integration approach (using tightly integrated, task-specific routines) to achieve better performance.
Using paths provides a better solution. A path gives the "vertical component" in a horizontally layered system, to solve problems requiring global context. The fundamental idea behind paths is exposing and exploiting non-local context in a systematic manner. For example, knowing code sequence(s) to execute in advance can be used as a hint to improve code quality and to specialize the code for a given path (task). A path is a linear flow of data that starts at a source device and ends at a destination device. A path architecture is composed of a "router" graph, defined at build time, describing dependency information among modules (routers); invariants; and a path defining a sequence of "fixed" routing decisions.
A path is created incrementally using local routing decisions and optimized globally by applying (data) transformations that match the created path. Paths can be narrow (per data item) or wide (per connection), and either long for best optimization, or short for generality, where dynamic routing decisions can be made.
To demonstrate the benefits of using paths, a simple application was implemented in the new Scout operating system. The application receives and displays MPEG video streams. The router graph for this application consists of seven routers. ETH, IP, and UDP implement standard networking protocols, MFLOW implements a flow control protocol, while MPEG decompresses the messages received from MFLOW and passes them to the DISPLAY router. The seventh router, SHELL, sits on top of the UDP router at the same level as MFLOW, and is used to create paths dynamically. For example, consider a packet arriving at the ETH router. It is known that the packet belongs to the MPEG path, thus it can be scheduled for processing with an appropriate policy. Also, if it is not possible to display that frame on time, it can immediately be discarded.
This work shows how paths can be made explicit OS abstractions and the benefits of this approach: early segregation of work and early elimination of (potentially) unnecessary work. Also, packets can be scheduled promptly as they arrive, with an appropriate policy.
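The MPEG example above can be sketched in a few lines. This is an invented illustration, not Scout code: a path is modeled as a precomposed linear sequence of module functions, and because the whole sequence is known when the packet arrives at the source device, a frame that will miss its deadline is dropped before any decoding work is done.

```python
# Illustrative Scout-style path: a fixed linear sequence of module stages,
# known at packet arrival. Module names mirror the paper's MPEG example;
# the stage bodies are stand-ins that just record the traversal.

def stage(name):
    def run(p):
        p["layers"].append(name)
        return p
    return run

eth, ip, udp, mflow, mpeg, display = (
    stage(n) for n in ("ETH", "IP", "UDP", "MFLOW", "MPEG", "DISPLAY"))

MPEG_PATH = (eth, ip, udp, mflow, mpeg, display)

def deliver(packet, now):
    # Early elimination of work: the path is known on arrival, so a frame
    # that cannot be displayed on time is discarded before any decoding.
    if packet["deadline"] < now:
        return None
    for s in MPEG_PATH:
        packet = s(packet)
    return packet

late = deliver({"deadline": 5, "layers": []}, now=9)
ok = deliver({"deadline": 5, "layers": []}, now=3)
print(late, ok["layers"][-1])  # None DISPLAY
```

The same early classification is what lets the whole path be scheduled under one policy, since the destination is known at the source device rather than discovered layer by layer.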
The second talk of the session was "Microkernels Meet Recursive Virtual Machines," presented by Bryan Ford (University of Utah).
This work introduces a new approach to providing modular and extensible functionality in operating systems that combines the benefits of two orthogonal concepts (microkernels and virtual machines) to support efficient decomposition of traditional system services. A microkernel moves OS services outside the traditional kernel and runs them in user mode, but without inherent structure between servers. In contrast, a recursive virtual machine approach implements basic functionality in stackable "virtual machine monitors" (or "hypervisors"), each of which provides an interface to the next underlying (virtual) machine.
Central to this approach is the "nested process model," in which a parent process has complete control over its child processes by encapsulating them. The nested process model provides OS modularity, extensibility, flexibility, security, and strong resource control. A software architecture called Fluke (Flux u-Kernel Environment) has been specified that can support the efficient implementation, by a microkernel, of this process model. The Fluke microkernel allows user processes to provide hierarchical management of the basic system resources, such as CPU and memory, as well as direct management of low level kernel abstractions (e.g., threads and ports). The Fluke architecture allows creation of virtual machine monitors, called nesters, efficiently providing an execution environment for applications or other virtual machine monitors to run on top of each other.
There are three components of the Fluke architecture. The Basic Instruction Set allows all processes to directly execute ordinary unprivileged x86 instructions. The Low Level System Call API implements the base kernel objects and services, also directly executable by any process. Finally, the Common Protocols API defines higher level communication conventions such as file access and memory allocation. An important difference between the first two components and the last is that only the Common Protocols API component can be (re-)implemented differently by separate virtual machines; the others are fixed for a particular kernel implementation and base computer.
Besides the Fluke microkernel, basic Common Protocol libraries for the client and server sides are implemented in a prototype. Also implemented are several nesters including a debugger, a tracer, a process manager, a virtual memory manager, a checkpointer and a file system nester. In addition to several test programs that were developed to evaluate the performance effects of the nesting of processes, many GNU applications, such as compile, binutils, and fileutils, have also been ported to Fluke. Several microbenchmarks were performed on Fluke, comparing the kernel's absolute performance with FreeBSD's: some were faster, some slower. More evaluation was done of the overhead from layering, and preliminary results were encouraging, showing 0--30% overhead per layer.
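The interposition that nesters perform can be pictured concretely. The sketch below is invented for illustration (the class names and the single `open` operation are not Fluke's API): each nester implements the Common Protocols interface for its children, so it can transparently trace, confine, or otherwise reinterpret their requests, and nesters stack like recursive virtual machines.

```python
# Illustrative nested-process interposition on a single "protocol" operation.

class BaseEnv:
    def open(self, path):
        return f"real:{path}"

class TracerNester:
    # Forwards requests to its parent environment, logging each one.
    def __init__(self, parent):
        self.parent, self.log = parent, []
    def open(self, path):
        self.log.append(path)
        return self.parent.open(path)

class JailNester:
    # Re-implements the protocol: confines children to a subtree.
    def __init__(self, parent, root):
        self.parent, self.root = parent, root
    def open(self, path):
        return self.parent.open(self.root + path)

# Nesters stack: a tracer running inside a jail running on the base machine.
env = TracerNester(JailNester(BaseEnv(), "/jail"))
print(env.open("/etc/passwd"))  # real:/jail/etc/passwd
print(env.log)                  # ['/etc/passwd']
```

This is the sense in which a parent "encapsulates" its children: every request the child makes passes through, and can be reinterpreted by, the nesters above it.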
Wednesday's regular sessions concluded with two papers covering new performance measurement methods and analyses. The first was "Studies of Windows NT Performance Using Dynamic Execution Traces," from DEC SRC. The first author, Sharon Perl, was engaged in something even more important than an OSDI presentation - childbirth - so the paper was presented by her co-author Dick Sites, who gave a stimulating talk.
By patching application binaries, it is possible to obtain an accurate trace of the execution of a program. This concept allowed performance measurements to be taken on DEC Alpha computers running Windows NT. As a result of these measurements, it was observed that the available pin (processor to memory) bandwidth is often inadequate for existing processors when running large applications. It was shown that under some real applications, this was the limiting factor in the processor's performance. Every new version of the Alpha has since had increased pin bandwidth. It was also noted that on a small-scale multiprocessor, significant time was wasted under NT due to lock conflicts. This locking problem significantly impacted scalability and degraded the performance. The most critical lock was in the operating system itself, in the dispatcher. NT has since been modified to reduce the locking problem in the dispatcher.
The second paper was "Using Latency to Evaluate Interactive System Performance," presented by Yasuhiro Endo of Harvard University. Conventionally, performance measurements have focused on the throughput of a system. However, it is the user-visible latency in processing an individual request that matters most for common interactive workloads. The Harvard researchers have developed simple and general techniques for measuring latency and for visualizing the results. Running Windows 95 and NT on a Pentium PC, the latency of several common tasks was measured in the context of real applications. A program would simulate a key-press or other event, and measure the time until the computer had responded; this usually involved redrawing part of the screen. Differences in latencies between the operating systems were discovered and pointed out. There was an overview of some of the architectural differences between the systems, but the goal was not to pinpoint the cause of every difference, rather simply to point the differences out.
Wednesday afternoon closed with a panel discussing "What the OS industry wants from OS research." It featured seven people from industry: six high-ranking engineers and managers with major responsibilities for developing their companies' operating systems, plus one from a research lab. IBM (Marc Auslander), Chorus (Michel Gien), Sun (Rob Gingell), SGI (Larry McVoy), DEC (Jeff Mogul), HP (John Sontag), and Microsoft (Rob Short) were the panelists, moderated by Jay Lepreau from the University of Utah.
To this observer, two points of consensus seemed to emerge. The first was that the management of complexity in OS code was the single most important and intractable problem they faced. In other words, they need solutions to the classic software engineering problems of handling a long-lived, huge, and complex "ball of mud" of software. For example, a couple of panel members expressed interest in a solution to the ever-present problem of real component software. The second point of consensus was that industry doesn't want to hear about 5-10% performance improvements; they want to hear about improvements by orders of magnitude. They "know many places where they could squeeze another 10% out of their own systems" if they decide it's worth the investment. However, in general, the software engineering costs of integrating small optimizations are usually not worth it. Some hallway conversation was skeptical of this claim of wanting only revolutionary advances, noting that incremental improvements are the (only) ones that most vendors adopt.
Finally, to spice up the event, Lepreau incited people to comment on the relevance of the research topics represented at OSDI's regular sessions. From the audience, Dave Black took the challenge, putting the panelists on the spot to choose the session least relevant to industry. Software DSM was the unfortunate winner of this contest, but the reader is cautioned about flawed methodology.
Attendance on Thursday did not decline even though it was the last official conference day. This can certainly be attributed to the papers chosen for the first morning session on "Extensibility and Safety."
Przemyslaw Pardyak, a graduate student at the University of Washington, presented the first paper, "Dynamic Binding for an Extensible System." The work was done in the context of SPIN, an extensible operating system that allows untrusted applications to dynamically add code to the kernel in order to provide new services, or to replace old services with implementations that better fit the application's needs. These extensions must be flexible, simple, fast, and safe. There are many existing approaches: dynamic linking, service-specific interfaces, object-oriented languages, and publish-subscribe communications. SPIN uses an event-based approach that includes language support and runtime code generation to combine the best of all these worlds. In this model, events are described in a procedure syntax; more specifically, as MODULA-3 procedures. An event handler is an implementation corresponding to a given event signature - this naturally matches a programmer's view in which everything is a procedure.
Unlike conventional procedures, however, procedure invocation can be controlled by guards, and it also allows more advanced features such as broad- or multicast. To avoid the runtime costs usually associated with dispatch, the dispatcher code is generated on the fly. Guards are optimized into jump tables where possible. In a reply to questions from Linus Kamb from the University of Utah and Laszlo Boszormenyi from the University of Klagenfurt, the speaker pointed out that this runtime code generation is done in a manner similar to loop unrolling, based on the data structures used for guards and arguments, and that this code generation can be done incrementally, so that adding and removing event handlers does not require regeneration of the complete dispatcher.
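The dispatch model is easy to sketch. The following is an illustrative Python rendering of the idea rather than SPIN's Modula-3 (the event name, guards, and handlers are invented): an event holds guarded handlers, and raising it invokes every handler whose guard holds, giving the broadcast behavior described above.

```python
# Illustrative guarded event dispatch in the spirit of SPIN's model.
# (SPIN compiles this dispatch into generated code; here it is interpreted.)

class Event:
    def __init__(self):
        self.handlers = []              # list of (guard, handler) pairs
    def register(self, guard, handler):
        self.handlers.append((guard, handler))
    def raise_(self, *args):            # "raise" is a Python keyword
        results = []
        for guard, handler in self.handlers:
            if guard(*args):            # the guard controls invocation
                results.append(handler(*args))
        return results                  # broadcast: many handlers may run

ip_packet_arrived = Event()
ip_packet_arrived.register(lambda proto: proto == "tcp", lambda proto: "tcp-stack")
ip_packet_arrived.register(lambda proto: proto == "udp", lambda proto: "udp-stack")
ip_packet_arrived.register(lambda proto: True,           lambda proto: "sniffer")

print(ip_packet_arrived.raise_("udp"))  # ['udp-stack', 'sniffer']
```

SPIN's runtime code generation replaces the interpreted loop above with specialized dispatcher code (jump tables where guards allow), which is where the performance comes from.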
While the type safety of MODULA-3 makes the invocation of handlers safe, it does not prevent denial-of-service attacks. For this purpose, procedures may be declared ephemeral, meaning that they run only for a short time and are prepared to be killed at any time afterwards. As David Black from the Open Group pointed out, this might cause problems if ephemeral handlers invoke other system procedures with possibly damaging side-effects. The solution is to restrict the kinds of procedures that ephemeral procedures can call.
Margo Seltzer of Harvard University presented the second paper in the session, describing their work on VINO, an extensible kernel. They named their paper: "Dealing with Disaster: Surviving Misbehaved Kernel Extensions."
One of VINO's base assumptions is that "minor tweaks can fix major problems" - therefore, the main goal in developing an extensible kernel should be to minimize the effort required to modify the kernel's behavior. Extensions should be fine-grained (i.e., procedure calls), they should look just like kernel code, and they should be able to call kernel functions.
The authors identified several possible ways in which kernel extensions, which are appropriately called grafts in VINO, can misbehave, and devised solutions for each of these scenarios. Grafts are prevented from accessing illegal data through the use of software fault isolation. A digital signature ensures that the graft code has been processed by MiSFIT, a software fault isolation tool for the i386. Resource hoarding can occur when grafts try to maliciously consume system resources, for instance by infinitely looping. Since these can't be detected statically, grafts can be preempted and, if necessary, terminated. If a graft is terminated, all changes that this graft has made to kernel state must be undone.
The technique employed is borrowed from databases: grafts are essentially transactions, managed by VINO's transaction managers. The only way grafts can manipulate kernel state is through accessor functions. These accessor functions push a corresponding undo operation on the undo call stack of the currently running thread. If a transaction needs to be aborted, the transaction manager invokes each undo operation found on this stack.
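The undo-stack discipline can be sketched in a few lines. This is an invented illustration of the mechanism described above, not VINO's C++ (the accessor names and the state dictionary are hypothetical): every accessor that mutates state first pushes a compensating operation, and aborting replays the stack in reverse.

```python
# Illustrative graft transaction: accessors push undo operations; abort
# replays them in reverse order to roll back all kernel-state changes.

kernel_state = {"quota": 10, "cache": []}

class GraftTransaction:
    def __init__(self, state):
        self.state, self.undo_stack = state, []
    def set_field(self, key, value):               # accessor function
        old = self.state[key]
        self.undo_stack.append(lambda: self.state.__setitem__(key, old))
        self.state[key] = value
    def append_cache(self, item):                  # another accessor
        self.state["cache"].append(item)
        self.undo_stack.append(lambda: self.state["cache"].pop())
    def abort(self):
        while self.undo_stack:                     # undo in reverse order
            self.undo_stack.pop()()

tx = GraftTransaction(kernel_state)
tx.set_field("quota", 0)
tx.append_cache("bad-entry")
tx.abort()                                         # graft misbehaved: roll back
print(kernel_state)  # {'quota': 10, 'cache': []}
```

Because the graft can only touch state through accessors, the transaction manager is guaranteed a complete undo log for whatever the graft did before termination.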
VINO was written from scratch in C++ with the device drivers and some lower level parts taken from NetBSD. It is highly object-oriented; in fact, every feature or policy decision that might conceivably be extended or changed was cast as an object method. Using C++ was a decision Seltzer didn't want to elaborate on - clearly, using a language with a more constrained interface and garbage collection, such as MODULA-3, would have made the task of cleaning up after misbehaved grafts easier.
In the paper as well as in the talk, a great deal of detail was devoted to performance issues. A variety of possible paths was considered and measured: the base path, with all extra indirection and graft support removed; multiple VINO paths; and a path in which grafts had to be aborted. It was found that the overhead of their techniques is high relative to the cost of the extensions themselves, but low relative to the benefits that extensibility brings.
In the question period, Jeff Chase from Duke University remarked that transactions are usually associated with lots of baggage, and since side-effects must not be visible during the transaction, special locking protocols such as two-phase locking need to be employed. Seltzer responded that experience has shown that this places relatively few restrictions on the graft designer and does not seem to increase the overall number of locks.
Brian Bershad of the University of Washington pointed out a very important philosophical difference in the approaches to extensibility as pursued by the SPIN and VINO projects. By requiring the VINO kernel to provide undo functions for every possible state change, a lot of responsibility is pushed to the base service. SPIN, on the other hand, does not want to complicate the design of the base OS - even if that makes the design of extensions harder. Margo Seltzer acknowledged the importance of this difference: yes, VINO wants to make grafting as easy as possible.
The third and last paper in the session, one of the two award papers, came from George Necula (presenter) and his advisor Peter Lee from Carnegie Mellon University, entitled "Safe Kernel Extensions Without Run-Time Checking."
When unknown code is executed by a code consumer, be it a kernel downloading special extensions or a Web browser interpreting mobile content, there is always the potential that the foreign code might do harm to the consumer's system. Different solutions to this problem have been proposed and implemented, including hardware protection, software fault isolation, and typesafe languages. All of these solutions carry a performance penalty or depend on cryptography.
The technique described in this paper, "Proof-Carrying Code" (PCC), on the other hand, allows even hand-optimized assembly code to run without any runtime checks. What makes this code special is that it comes with a proof that it is safe for the consumer to execute. This proof shows compliance with a given safety policy defined by the code consumer. For the example of packet filters, this policy could state that only memory within the packet's bounds may be accessed and, in addition, that some scratch memory provided by the OS may be written to.
PCC does not require a cryptographic signature to ensure that the proof has not been tampered with. Instead, the representation of a proof in LF0, a subset of the LF (Logical Framework) language, can be checked to see whether the proof is valid and corresponds to the native code in the PCC binary. Moreover, this check is nothing more than simple typechecking. The LF object can be viewed as a "witness" to the proof, certifying that the proof was performed in accordance with first-order logic proof rules, possibly extended by a set of application-specific rules previously agreed upon by both the code producer and the consumer.
PCC uses a great deal of type theory, logic, and formal verification. For instance, the safety predicate that is computed for the code (and whose proof establishes its safety) is just the verification condition of a Floyd-style VC generator for the program sequence. The safety policy provided by the code consumer appears as the precondition in this model.
Generating proofs, on the other hand, is more difficult. Fortunately, this can be done offline before the code is shipped. Currently, proofs for simple applications such as the packet filter can often be found automatically; in other applications, such as those containing backward branches, the manual specification of loop invariants is needed.
A nice metaphor was given to compare the difficulty of finding a proof vs. checking one - a maze. While finding a way through a maze may be puzzling, checking that a given way actually leads through it is easy. As one might expect, the PCC packet filter outperforms all other tested packet filters: the standard BSD packet filter, packet filters using SFI, and those written in a typesafe language. The initial startup cost involved in verifying the proof is quickly amortized.
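The producer/consumer asymmetry can be illustrated with a deliberately tiny toy (this is not LF, a real proof, or the authors' system; the "bytecode," the bound, and the checker are all invented): the producer ships code plus evidence of safety, the consumer runs a fast check once, and execution then proceeds with no runtime checks at all.

```python
# Toy illustration of the PCC division of labor for a packet filter whose
# safety policy is "every load stays within the declared packet bounds."

MAX_PACKET = 64   # assumed bound declared in the (toy) safety policy

def check_safety(program):
    # The consumer-side check is trivial compared with constructing a safe
    # program: checking a route through the maze vs. finding one.
    return all(op == "load" and 0 <= off < MAX_PACKET for op, off in program)

def run_unchecked(program, packet):
    # After verification, execution performs no bounds checks at all.
    total = 0
    for _, off in program:
        total += packet[off]
    return total

good = [("load", 0), ("load", 12)]
bad = [("load", 4096)]
print(check_safety(good), check_safety(bad))  # True False
print(run_unchecked(good, bytes(range(64))))  # 12
```

In real PCC the "evidence" is a logical proof of the verification condition, checked by typechecking in LF, rather than this syntactic scan, but the payoff is the same: all cost is paid once, up front.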
Andrew Hume from AT&T Research wanted to know what the biggest interesting thing was that they were able to prove so far, and upon what the complexity of proofs depends. It depends mostly on how complex the safety policy is. Phil Winterbottom from Bell Labs asked for an example in which PCC was used in the presence of composition, that is, multiple interacting components. Unfortunately, this hasn't been tried yet. Jay Lepreau from the University of Utah pointed out that the nonreliance of PCC on cryptography is good, since cryptographic techniques are proving vulnerable. Necula elaborated by noting that PCC does not require a notion of trust. Even if the code was changed in an allowed manner, such as by an instruction scheduler, the validity of the safety proof can still be decided.
"Lazy Receiver Processing (LRP): A Network Subsystem Architecture for Server Systems" was presented by Peter Druschel (Rice University). He notes that in traditional network servers, as inbound network traffic increases, the traffic actually delivered to applications reaches a maximum and then declines as the system undergoes congestive collapse. Problems like this one are the result of eager, interrupt-driven receive processing, lack of effective load shedding, lack of inter-application traffic separation, and inappropriate resource accounting. LRP provides stable overload behavior by performing early demultiplexing at receive interrupt priority, which leads to a feedback mechanism for early packet discarding. Receive CPU usage is charged to the receiving process. The performance of a sample LRP implementation is comparable to traditional implementations under low network load, and remains stable under high load. See http://www.cs.rice.edu/CS/Systems/LRP/ for more information.
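The early-demultiplex-and-discard idea can be sketched simply. The code below is an invented simulation, not the Rice implementation (socket names, the queue limit, and the packet format are assumptions): arriving packets are classified to a per-socket queue at "interrupt" time, and packets destined for a full queue are discarded immediately, before any protocol processing is spent on them.

```python
from collections import deque

# Illustrative LRP-style receive path: early demultiplexing plus early discard.
QUEUE_LIMIT = 4
sockets = {"httpd": deque(), "dns": deque()}
dropped = 0

def receive_interrupt(packet):
    global dropped
    q = sockets.get(packet["dst"])          # early demultiplex
    if q is None or len(q) >= QUEUE_LIMIT:  # early discard: shed load now,
        dropped += 1                        # before any protocol processing
        return
    q.append(packet)                        # protocol processing happens
                                            # later, at the receiver's
                                            # priority, charged to it

for i in range(10):
    receive_interrupt({"dst": "httpd", "seq": i})
print(len(sockets["httpd"]), dropped)  # 4 6
```

Because excess packets cost almost nothing (one classification step each), delivered throughput stays flat under overload instead of collapsing, and a flood aimed at one socket cannot starve the others.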
Jose Carlos Brustoloni (Carnegie Mellon University) spoke on the "Effects of Buffering Semantics on I/O Performance." He described a new taxonomy of software-only and hardware-assisted data passing techniques in the context of network data buffering. The taxonomy is structured as the Cartesian product of Buffer Allocation strategy (System or Application) with Integrity model (Strong or Weak), to yield traditional UNIX "Copy" semantics (Application/Strong), the "Move" semantics of V (System/Strong), "Share" semantics (Application/Weak) visible in Fbufs, and "Weak Move" semantics (System/Weak). He presented a number of optimizations that produce semantically similar but better performing implementations, and can be used without modifying applications. The CMU group's Genie system is an I/O framework that implements these semantic models and optimizations for performance comparison. He reports that using Emulated (optimized) Copy in place of traditional Copy semantics results in an almost threefold throughput improvement on an OC-12 network.
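The Cartesian-product structure of the taxonomy is worth seeing laid out; as a small illustrative table (labels as reported above, the dictionary encoding being this summary's own):

```python
# The buffering-semantics taxonomy as (allocation, integrity) -> semantics.
semantics = {
    ("application", "strong"): "Copy (traditional UNIX)",
    ("system",      "strong"): "Move (V)",
    ("application", "weak"):   "Share (fbufs)",
    ("system",      "weak"):   "Weak Move",
}
print(semantics[("application", "weak")])  # Share (fbufs)
```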