Check out the new USENIX Web site.


1999 USENIX Annual Technical Conference
June 6-11, 1999
Monterey, California, USA

These reports were originally published in the December 1999 issue of ;login:.

Keynote Address
Refereed Papers Track
Session: Virtual Memory
Session: Web Servers
Session: Caching
Invited Talks
Session: File Systems
Session: Device Drivers
Session: File Systems II
Session: Networking
Session: Business
Session: Systems
Session: Kernel
Session: Applications
Session: Kernel II
  Our thanks to the summarizers:
Aaron Brown
Peter Collinson
Jeffrey Hsu
Bruce Jones
Brian Kurotsuchi
Art Mulder
David Oppenheimer
Jerry Peek
Arthur Richardson
Chris van den Berg


Integration Applications: The Next Frontier in Programming
John Ousterhout, Scriptics Corporation

Summary by Peter Collinson

The 1999 USENIX conference at Monterey opened on a somewhat cooler-than-expected Wednesday morning in early June. After the usual rounds of announcements and thanks, including the announcement of the USENIX Lifetime Achievement Award and the Software Tools User Group Award, John Ousterhout came to the podium to give his keynote presentation. Ousterhout's been a frequent visitor to USENIX conferences over the years, presenting papers that reflect diverse interests that sprang originally from his academic base as a professor of computer science at the University of California, Berkeley.

Ousterhout's shift from campus to industry grew out of his creation of Tcl (pronounced "tickle"), an extensible scripting language designed to be embedded easily to control both hardware and software. Tcl gave rise to Tk, the GUI-builder library, which has had a wide impact, providing a GUI-building API for other scripting languages, notably Perl.

I'll guess that Ousterhout's move from the groves of academe surprised both the university and Ousterhout. He became a distinguished engineer at Sun and recently has moved again to be the CEO of his own company, Scriptics. Scriptics promotes Tcl and the use of Tcl, and is part of the new wave of companies that are based on open-source software.

The talk concentrated on, you guessed it, scripting languages and their impact. He started with the premise that much of today's software development is actually the integration of applications to make a greater whole, and that many programmers are engaged in the often difficult business of making different software components interact.

Several application areas are driving the need to provide integration interfaces: we now have the ongoing shift from the command-line interface to the GUI; Web sites are creating a need to integrate legacy applications such as databases with users on a network, either local or remote; special-purpose black boxes are on the rise, and there's a need to configure such embedded devices; many applications consolidate legacy applications within an enterprise, perhaps a department in a hospital, or after mergers and acquisitions; finally, there are the various component frameworks—COM, EJB, or CORBA.

His contention is that scripting languages are better at providing the necessary flexible glue than traditional system-programming languages because these integration tasks have different characteristics from traditional programming tasks. The fundamental problem is not the algorithms or data structures of the application but how to connect, coordinate, and customize different parts. The integration exercise must support a variety of interfaces, protocols, and formats; it often involves automating business processes and requires rapid and unpredictable evolution. Finally, integration often involves less sophisticated programmers.

Ousterhout went on to give a short history of system-programming languages in terms of their original design goals and compared their approach with scripting languages. He highlighted two areas where he feels that the traditional languages fail when used for integration. First, the traditional languages use strong variable typing designed to reduce errors by using compile-time checking. Second, the design of traditional languages encourages the generation of errors, which can be avoided by providing sensible defaults.

Ousterhout dismissed the traditional concerns about scripting languages. The main one is usually performance. This is no longer a problem—machines are 500 times faster now than they were in 1980, and anyway most expensive operations can be done in libraries. Second, people often complain that it's hard to find errors in scripting languages because there are fewer compile-time checks. He counters this problem by saying that there is better runtime checking in scripting languages and that they are "safe" because they provide sensible defaults. (I had to "hmm" a bit at this. I've played the game of "find the missing bracket" a little too often in both Perl and Tcl.) Finally, he dismissed the notion that scripting code is hard to maintain, on the ground that there is much less code to deal with in the first place.

Ousterhout then moved on to talk about Tcl. Tcl arose because Ousterhout wanted to create a simple command language for applications, one that could be reused in many different applications. The ideas gave birth to the Tool Command Language, or Tcl, which is a simple language that's embeddable and extensible. Tcl provides generic programming facilities while anything really hard or performance-impacting can be placed in a library for the application and accessed by invoking a command.

Tcl today has more than 500,000 developers worldwide. The Scriptics site is supplying 40,000 downloads every month. There's an active open-source community with strong grassroots support. There are thousands of commercial applications: automated testing, Web sites, electronic design automation, finance, media control, health care, animation, and industrial control.

Ousterhout concluded by saying that we are experiencing a fundamental shift in software development, moving toward more integration applications. These applications are better served by scripting languages that are supporting a new style of programming. The programming style seeks to minimize differences between components and to eliminate special cases. The use of simple interchangeable types in the language helps to keep the size of the code down and aids the programmer. It also helps to minimize errors. Scripting languages are providing a challenge to the community by making programming available to more people and also by allowing more to be done with less code.

(Thanks to John for the slides of his talk, whose contents I have cheerfully stolen for this report.)


Session: Virtual Memory
Summary by Brian Kurotsuchi

The Region Trap Library: Handling Traps on Application-Defined Regions of Memory
Tim Brecht, University of Waterloo; Harjinder Sandhu, York University

Tim Brecht presented a library in which a user-level program can mark arbitrarily sized memory regions as being invalid, read-only, or read-write. The advantage of placing this capability into a library is that the protection can be assigned at whatever granularity the application wants, not just the page-by-page basis that the operating system provides.

This library does not use the mprotect mechanism because that form of protection is page based and does not offer the flexibility this library is looking for. Instead, the library takes a different approach based on address "swizzling" and custom trap handlers. Applications that choose to use this library allocate memory by using custom functions inside the library.

Inside the library, memory is allocated but the application is provided with an address that has been swizzled to point into kernel space. Clearly the application will cause a page fault when it tries to access that memory region in the future. When that segmentation violation occurs, the preassigned trap handler will look up the requested memory address in an AVL tree and unswizzle the address into the appropriate register. The application can then continue with its work as if there were no problem.

The Case for Compressed Caching in Virtual Memory Systems
Scott Kaplan, Paul R. Wilson, and Yannis Smaragdakis, University of Texas at Austin

Memory subsystems in modern computers still suffer from the fact that the CPU can use memory much faster than RAM can supply it. The general solution today is to place faster caches between the CPU and main memory, but these caches only go so far and can hold only a limited amount of data. In this presentation, Scott Kaplan made a case for adding yet another level between main memory and the CPU with the hope of increasing performance.

They added a cache of compressed memory between the main memory subsystem and the paging mechanism. Kaplan pointed out that this strategy was attempted in the past but ended up with results that show little to no benefit. He asserts that with modern-day hardware we can compress/decompress the data fast enough to gain a significant advantage when compared to the cost of paging the data to disk, whereas past experiments have lacked the spare CPU cycles to make the scheme feasible.

In their implementation of this new cache, they created the Wilson-Kaplan compression algorithm, further improving on past work in this field. They claim to have a 2:1 compression ratio when compressing the type of data found in a typical chunk of memory. Measuring their results, they claim to have found a gain of about 40% in paging requirements, but they are unable to present a comparison to previous work because the previous work could not be reproduced.
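The Wilson-Kaplan algorithm itself isn't in standard libraries, but the compress-or-page-out decision can be sketched with zlib as a stand-in compressor; the 2:1 threshold below mirrors the ratio they report, and the function name is hypothetical:

```python
import zlib

def try_compress(page: bytes, min_ratio: float = 2.0):
    """Keep the page in the compressed cache only if it compresses well enough;
    return the compressed bytes, or None to signal it should be paged to disk."""
    comp = zlib.compress(page)
    if len(page) / max(len(comp), 1) >= min_ratio:
        return comp
    return None

# Typical in-memory data is highly redundant and compresses easily:
page = b"the quick brown fox " * 200
comp = try_compress(page)
assert comp is not None and zlib.decompress(comp) == page
```

The feasibility argument in the talk is exactly this trade: a compression call costs CPU cycles that modern machines have to spare, while a page fault to disk costs milliseconds.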

Session: Web Servers
Summary by Aaron Brown

Web++: A System for Fast and Reliable Web Service
Radek Vingralek and Yuri Breitbart, Lucent Technologies—Bell Laboratories; Mehmet Sayal and Peter Scheuermann, Northwestern University

Radek Vingralek's presentation described Web++, a system that addresses the problems of Web-server response time and reliability by using "Smart Clients" and cooperating servers to balance and replicate Web content dynamically across a group of distributed Web servers. Vingralek began with several motivating examples that illustrated the problems of poor and inconsistent Web performance, especially when transcontinental links are involved. He then presented the solution of content replication across geographically distributed servers, which addresses both the performance and reliability problems of single-site servers and which does not suffer the flaws of the proxy-caching approach (low hit rates and cache bypassing by providers) or of the server-clustering approach (single dispatcher and no way to avoid network bottlenecks).

The Web++ approach combines a Smart Client, implemented as a signed Java applet and downloaded on demand to the user's browser, with Java servlet-based server extensions that maintain the replicas and provide clients with information on how to find them. The server preprocesses all HTML files sent to the clients, replacing each HTTP URL with a list of URLs pointing to the various replicas of the original object. It keeps a persistent directory of replica locations and uses a genealogy-tree-based algorithm to maintain eventual consistency with other servers; all communication among servers is handled with HTTP/1.1 put, delete, and post commands. The client architecture is based on a Java applet that intercepts the JavaScript event handlers to capture all page requests. For every page request, the applet uses the replica information embedded by the server to select the actual destination for the request. The replica-selection algorithm attempts to select the replica located on the server with the overall best request latency in the recent past; to do this, it keeps a persistent per-server latency table on the client machine. To avoid suboptimal selections due to stale data, the client periodically and asynchronously polls servers at a rate selected to balance overhead with data freshness. This selection algorithm outperforms most standard algorithms, including random selection, selection based on the RTT or number of network hops, and probabilistic selection.
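The replica-selection idea — keep a per-server table of recent latencies and pick the best-performing replica, falling back to a random choice for servers never measured — can be sketched as follows (the class name and the exponential-decay smoothing are assumptions for illustration, not the paper's exact algorithm):

```python
import random

class ReplicaSelector:
    """Pick the replica on the server with the best recent request latency."""
    def __init__(self, decay=0.7):
        self.latency = {}   # server -> smoothed latency in ms
        self.decay = decay  # weight given to history vs. the newest sample

    def record(self, server, ms):
        old = self.latency.get(server, ms)
        self.latency[server] = self.decay * old + (1 - self.decay) * ms

    def pick(self, replicas):
        known = [r for r in replicas if r in self.latency]
        if not known:
            return random.choice(replicas)  # no history yet: pick at random
        return min(known, key=lambda r: self.latency[r])

s = ReplicaSelector()
s.record("us-server", 100)
s.record("de-server", 300)
assert s.pick(["us-server", "de-server"]) == "us-server"
```

The periodic asynchronous polling mentioned in the talk exists to keep this table fresh, so a once-slow server isn't shunned forever.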

Vingralek presented a brief performance evaluation of Web++ that demonstrated an average of 47% improvement in response time using a workload of fixed-size files and three geographically distributed servers (in California, Kentucky, and Germany). Server response time degraded by at most 14% because of the servlet extension. The Web++ client algorithms underperformed the optimal algorithm (sending each request to every replica simultaneously and using the fastest response) by only 2.2%. The speaker concluded by bemoaning the poor support for smart clients in the current Java applet model, and by emphasizing the importance of developing good models and easy-to-use algorithms for replica consistency. One audience member questioned the complexity of the Web++ system relative to the benefits it provides, especially relative to simpler solutions. Vingralek responded that Web++ can significantly outperform most of the standard replica-selection algorithms such as random, number-of-hops, and RTT.

Efficient Support for P-HTTP in Cluster-Based Web Servers
Mohit Aron, Peter Druschel, and Willy Zwaenepoel, Rice University

Mohit Aron described extensions that add support for persistent HTTP (P-HTTP) to the LARD (Locality-Aware Request Distribution) technique for cache-aware load balancing in cluster-based Web servers. The first part of the talk introduced traditional LARD, which relies on a front-end machine to examine each incoming request and route it to the back-end cluster node that is most likely to have the requested content in its cache. LARD outperforms traditional algorithms such as weighted round-robin (WRR) in both load balance (a LARD system remains CPU-bound as the size of the cluster is increased, whereas WRR becomes disk-bound) and cache behavior (the effective cache size of a LARD system is the sum of the sizes of each node's cache, as opposed to WRR, in which the effective cache size is the size of one node's cache).

LARD was developed for the HTTP/1.0 protocol and thus assumes that one TCP connection carries only one request. If used unmodified, simple LARD does not perform well with HTTP/1.1's P-HTTP, since LARD balances load on the granularity of a connection, which with P-HTTP can contain multiple independent requests. In the next part of the talk, Aron described various options for updating LARD to perform request-granular load balancing with P-HTTP. The first option, multiple TCP handoff, requires that the front-end examine each request in the connection and hand each request off to the appropriate back-end node, which then sends the response to just that request directly to the client. This achieves request-granular balancing but adds the overhead of creating a new back-end connection with each request, defeating much of the benefit of P-HTTP. With the other option, back-end forwarding, the front-end hands the entire connection over to a back-end node, which services the first request and then sends additional requests directly to appropriate (potentially different) back-end nodes. In simulation, neither the naive connection-based method nor these two attempts at request-based redistribution come close to the ideal of zero-overhead-per-request redistribution.

Aron investigated a modified version of back-end forwarding for LARD that takes into account the extra cost of forwarding a request between back-end nodes. In this version, a connection is assigned (using LARD) to a back-end node X on the basis of its first request. For each subsequent request, if that request is already cached at X, it is serviced by X. Otherwise, if X's disk utilization (load) is low, then the request is still serviced by X, avoiding the cost of an extra hop. Otherwise, the request is sent to another back-end node that has the appropriate data cached. In simulation, this policy comes very close to the ideal. Aron also described an implementation of this policy in a cluster of FreeBSD machines running Apache; the Web servers were unmodified, but the kernel was enhanced with loadable modules to implement TCP handoff and the front-end dispatcher. In experiments, the enhanced back-end forwarding policy exceeded the performance of simple LARD with HTTP/1.0, simple LARD with P-HTTP, and all weighted-round-robin variants. It outperformed LARD-HTTP/1.0 by 26%, demonstrating the benefit of P-HTTP in a LARD system.
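The modified back-end forwarding decision has a simple three-way structure, sketched below (function and parameter names are mine; the paper's actual load metric and thresholds differ):

```python
def route_request(conn_node, request, caches, disk_load, low_load=0.5):
    """Decide which back-end serves `request` on a P-HTTP connection
    pinned to `conn_node` — the modified LARD back-end forwarding policy."""
    if request in caches[conn_node]:
        return conn_node                  # cache hit: serve locally
    if disk_load[conn_node] < low_load:
        return conn_node                  # absorb the miss locally; skip the hop
    for node, cache in caches.items():
        if request in cache:
            return node                   # forward to the node holding the data
    return conn_node                      # nobody has it cached: serve locally

caches = {"A": {"/x.html"}, "B": {"/y.html"}}
busy = {"A": 0.9, "B": 0.1}
assert route_request("A", "/y.html", caches, busy) == "B"
```

The middle branch is the whole contribution: forwarding is only worth its extra hop when the connection's home node is actually disk-bound.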

One questioner claimed that 90% of hits on big sites fit in a small cache and can be serviced by a single machine, and asked why a complicated cluster-based solution like LARD was necessary. Aron replied that their experiments were based on real traces that did not have such a small working set, that it was common for sites to have static-request working sets in the gigabytes, and that LARD was targeted at situations where the working set does not fit in a single-machine cache. He also argued that cluster-based solutions provide an easy means of incremental scalability as the working set increases (simply add more machines).

Flash: An Efficient and Portable Web Server
Vivek S. Pai, Peter Druschel, and Willy Zwaenepoel, Rice University

The authors' motivation in building yet another Web server was both to create a server with good portability and high throughput over a range of workloads, and to gain a better understanding of the impact of different concurrency architectures on Web-server performance. To that end, Flash implements several different concurrency architectures in a common implementation, so that architecture can be examined independently of implementation.

In the first part of his talk, Vivek Pai introduced the various architectures for handling multiple concurrent Web requests. The first of these is the Multiple Process (MP) architecture, which uses multiple processes, each handling one request at a time. MP is simple to program but suffers from high context-switch overhead and poor caching. The next option, Multithreaded (MT), uses one process with multiple threads; each thread handles one request at a time. This approach reduces overhead and improves caching, but requires robust kernel-threads support for large numbers of threads, blocking I/O in threads, and synchronization. Another architecture is Single Process Event Driven (SPED), in which only one process/thread is used, and in which multiple requests are handled in an event-driven manner via an event-dispatcher using select() and asynchronous I/O. This model removes the need for threads and synchronization, but often in practice performs poorly because of the lack of asynchronous disk I/O in most OSes. Finally, Pai introduced a new architecture, Asymmetric Multiple-Process Event Driven (AMPED), which uses a SPED-like model of a central event dispatcher, but which also uses independent helper processes to handle disk and network I/O operations asynchronously.
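The SPED model — one thread, one event loop, callbacks fired as descriptors become ready — can be sketched with Python's selectors module; a socketpair stands in for a client connection, and the AMPED helper processes for disk I/O are omitted:

```python
import selectors
import socket

# Minimal SPED-style dispatcher: a single thread multiplexes all connections
# through one readiness loop instead of one process or thread per request.
sel = selectors.DefaultSelector()
client, server_side = socket.socketpair()
received = []

def on_readable(sock):
    """Callback invoked by the dispatcher when a connection has data."""
    received.append(sock.recv(1024))

sel.register(server_side, selectors.EVENT_READ, on_readable)
client.send(b"GET /index.html")

# One iteration of the event loop: wait for readiness, dispatch callbacks.
for key, _events in sel.select(timeout=1):
    key.data(key.fileobj)

assert received == [b"GET /index.html"]
client.close()
server_side.close()
```

AMPED keeps exactly this loop but hands operations that would block on disk to helper processes, which notify the dispatcher when the data is ready.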

Pai next described an implementation of the AMPED architecture in the Flash Web server. Besides implementing AMPED (as well as several other concurrency models), Flash also incorporates additional optimizations such as the use of memory-mapped files and gather writes. Most important, Flash uses aggressive application-level caching of pathname translations, response headers, and file mappings. In simple experiments in which a single page is repeatedly fetched, this application-level caching is the dominant performance factor (accounting for a doubling in performance in some cases), and the concurrency architecture is not a major factor. In trace-based experiments, Flash with AMPED in general outperformed or was competitive with all other servers and architectures because of its optimizations, application-level caches, and the good cache locality achieved by its single-address-space design. It performed up to 30% faster than the commercial Zeus SPED server, and up to 50% faster than MP-based Apache. In particular, Flash approached SPED performance where SPED performs best (on cacheable workloads) and exceeded MP performance on disk-bound workloads (where MP performs best), demonstrating that AMPED combines the best features of both architectures and works well across a range of workloads.

Session: Caching
Summary by David Oppenheimer

NewsCache—A High-Performance Cache Implementation for Usenet News
Thomas Gschwind and Manfred Hauswirth, Technische Universität Wien

Thomas Gschwind described NewsCache, a USENET news cache. Noting that the network bandwidth requirement for carrying a typical USENET news feed is 3-5 gigabytes per day and growing, Gschwind suggested separating the USENET article-distribution infrastructure from the access infrastructure, with the latter being handled by cache servers and the former being served by a dedicated distribution backbone. NewsCache is designed to serve as part of the access infrastructure: it is accessed by clients using NNRP and itself accesses the news infrastructure using NNRP.

NewsCache uses several techniques to achieve high performance. Most significantly, it stores small articles (by default, those less than 16KB) in memory-mapped databases, one database per newsgroup, with only large articles stored in the file system. Gschwind studied a number of article-replacement strategies for the cache, including BAF (biggest article first), LFU (least frequently used first), LRU (least recently used first), and LETF (least expiration time first). Both the LFU and LRU strategies were studied on a per-article and per-newsgroup basis (i.e., replacement of the least recently used article in the system, or of the entire least recently used newsgroup in the system). Gschwind examined hit rate and bytes transferred as a function of spool size for each of the replacement strategies. In space-constrained situations the LRU-group strategy generally performed the best.
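The winning LRU-group strategy — evict the entire least recently used newsgroup, not individual articles — can be sketched with an OrderedDict; the class and sizes below are illustrative, not NewsCache's actual data structures:

```python
from collections import OrderedDict

class GroupLRUCache:
    """Article spool that evicts whole least-recently-used newsgroups."""
    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.groups = OrderedDict()  # group name -> {article_id: size}
        self.used = 0

    def access(self, group, article, size):
        arts = self.groups.setdefault(group, {})
        self.groups.move_to_end(group)       # this group is now most recent
        if article not in arts:
            arts[article] = size
            self.used += size
            # Over budget: drop entire LRU groups until we fit again.
            while self.used > self.max_bytes and len(self.groups) > 1:
                _victim, victim_arts = self.groups.popitem(last=False)
                self.used -= sum(victim_arts.values())

cache = GroupLRUCache(max_bytes=100)
cache.access("comp.unix", "a1", 60)
cache.access("rec.arts", "a2", 60)   # spool full: comp.unix is evicted whole
assert "comp.unix" not in cache.groups
```

Evicting by group matches the one-database-per-newsgroup storage layout, where dropping a group frees a whole database file at once.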

Besides caching, NewsCache provides transparent multiplexing among multiple-source news servers and can perform prefetching. NewsCache is distributed with Debian/GNU Linux. More information is available at <>.

Reducing the Disk I/O of Web Proxy Server Caches
Carlos Maltzahn and Kathy J. Richardson, Compaq Computer Corporation; Dirk Grunwald, University of Colorado, Boulder

Carlos Maltzahn described techniques for reducing the amount of disk I/O required by Web-proxy-server caches running on top of a generic OS filesystem. His study used Squid as its reference Web cache. Squid stores every cached object as a single file in a two-level directory structure and uses a round-robin mechanism to ensure that all directories in the two-level structure remain balanced. Maltzahn compared and contrasted Web-cache workloads with generic-filesystem workloads: the significant differences are that Web-cache workloads show slowly changing object popularity, while filesystem workloads show more temporal locality of reference; the hit rate of Web caches is lower than that of file systems because of a higher fraction of writes in Web caches; and crash recovery is less critical in Web caches than in file systems because the objects stored by Web caches are by definition redundant copies.

Maltzahn described two changes to the Squid cache architecture that were found to reduce disk I/O. The first is to hash each object's URL's hostname to determine where the object is stored, in order to store all objects from the same server in the same directory. This storage scheme reduced the number of disk I/Os to 47% of the number using unmodified Squid. The second change is to store all objects in one large memory-mapped file rather than in individual per-object files. Maltzahn used the original Squid scheme for objects larger than 8KB and a memory-mapped file for all other objects. This scheme reduced disk I/O to 38% of the original value, and in combination with the server-URL-hashing approach yielded 29% of the original number of disk I/Os compared to unmodified Squid.
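The first change — hash the hostname so all objects from one server land in one directory — is easy to sketch; the directory count and MD5 choice here are assumptions, not Squid's or the paper's exact scheme:

```python
import hashlib
from urllib.parse import urlparse

def cache_dir_for(url, n_dirs=256):
    """Hash the URL's hostname so all objects from the same server
    share a cache directory (and thus nearby disk blocks)."""
    host = urlparse(url).hostname or ""
    digest = int(hashlib.md5(host.encode()).hexdigest(), 16)
    return "cache/%02x" % (digest % n_dirs)

# Objects from one server co-locate, regardless of path:
assert cache_dir_for("http://example.com/a.html") == \
       cache_dir_for("http://example.com/img/b.png")
```

The I/O saving comes from locality: pages and their inline images are fetched together, so storing them together turns scattered seeks into clustered ones.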

Maltzahn next compared three replacement strategies for the memory-mapped cache by analyzing the strategies' ability to minimize disk I/O. The strategies studied were LRU, FBC (frequency-based cyclic), and a near-optimal "future-looking" replacement strategy derived from the entire reference stream. Maltzahn found that LRU performed poorly. Compared to LRU, FBC provided an almost identical hit rate, a small reduction in the number of disk I/Os, and a good reduction in wall-clock time (mostly due to a reduction in seek time). The near-optimal policy fared much better than either LRU or FBC, suggesting that more careful coordination of memory and disk could lead to more significant performance improvements.

An Implementation Study of a Detection-Based Adaptive Block Replacement Scheme
Jongmoo Choi, Seoul National University; Sam H. Noh, Hong-Ik University; Sang Lyul Min and Yookun Cho, Seoul National University

Sam Noh described DEAR, DEtection-based Adaptive Replacement, a filesystem buffer cache management scheme that adapts its replacement strategy to the disk block reference patterns of applications. A monitoring module in the kernel VFS layer observes each application's disk block reference pattern over time. The application's reference pattern is inferred by examining the relationship between blocks' backward distance (time to last reference) and reference frequency, and the expected time to the blocks' next reference. DEAR uses this information to categorize an application's reference pattern as sequential, looping, temporally clustered, or probabilistic (the last meaning blocks are associated with a stationary probability of reference). As the application runs, the program's reference pattern is dynamically detected, and the buffer cache block replacement algorithm is updated. A detected sequential or looping pattern triggers MRU replacement, a detected probabilistic pattern triggers LFU replacement, and a temporally clustered or undetectable reference pattern triggers LRU replacement.
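The pattern-to-policy mapping at the heart of DEAR is small enough to write down directly; the function name and string labels below are mine, but the mapping itself is as described above:

```python
def replacement_policy(pattern):
    """Map a detected disk-block reference pattern to the block-replacement
    policy DEAR switches to (undetectable patterns fall back to LRU)."""
    return {
        "sequential": "MRU",            # blocks won't be re-referenced soon
        "looping": "MRU",               # the loop will wrap before LRU helps
        "probabilistic": "LFU",         # stationary probabilities favor frequency
        "temporally-clustered": "LRU",  # classic recency locality
    }.get(pattern, "LRU")

assert replacement_policy("looping") == "MRU"
```

The non-obvious entries are the MRU ones: for a loop larger than the cache, LRU evicts exactly the block that will be needed next, so evicting the most recently used block instead is the right move.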

DEAR uses a two-level buffer cache management scheme that is implemented in the kernel VFS layer. One application cache manager (ACM) per application performs reference pattern detection and block replacement, and one systemwide system cache manager (SCM) allocates blocks to processes. DEAR was implemented in FreeBSD 2.2.5 and its performance evaluated using a number of applications. Compared to the default LRU replacement scheme, DEAR reduced disk I/Os by an average of 23% and response time by an average of 12% for single applications. When multiple applications were run simultaneously, disk I/Os were reduced by an average of 12% and response time by an average of 8%. A trace-driven simulation was used to compare DEAR with the application-controlled file caching scheme developed by Cao, Felten, and Li, which requires explicit programmer hints to select the replacement policy. DEAR achieved performance comparable to that of application-controlled file caching, but without requiring explicit programmer hints.


IP Telephony—Protocols and Architectures
Melinda Shore, Nokia IP Telephony Division

Summary by Jeffrey Hsu

Melinda Shore began her talk by noting that telecommunications and telephony are undergoing a radical change, and that information on the topic has mostly been tied up in expensive-to-join committees and thus not readily available to the public. The driving factor behind all the interest in IP telephony is the potential cost savings and efficiencies of using a data network to transport voice. In addition to voice, IP telephony is also used for video and to integrate voice and email.

Shore described in detail the various scenarios in which IP telephony can be used, such as end-to-end IP, calls originating in IP network and terminating in switched circuit network, calls originating and terminating in switched circuit network but passing through an IP network, and various other permutations.

IP telephony is heavily standards-driven, since interoperability among different vendors and with the traditional voice networks is key. Two communities are working on the standards: those from the traditional voice networks (the bellheads) and those from an IP networking background (the netheads). The difference in opinion between the two revolves around the issue of centralized versus decentralized call control. The netheads view the intelligence as being in the terminals, while the bellheads view the intelligence as residing in the network.

Shore then described the various standards bodies, such as the European Telecommunications Standards Institute (ETSI), the ITU-T, and the IETF. It turns out that many of these standards groups are attended by the same people, so the standard bodies are not all that different.

Shore discussed in depth the H.323 standard, which is produced by the ITU-T. H.323 is not a technical specification itself, but rather an umbrella specification that refers to other specifications such as H.225 and H.245. H.323 is actually a multimedia conferencing specification, but it is used mainly for voice telephony. H.225 is the call-control part of H.323; it specifies call establishment and call tear-down. H.245 is the connection-control part of H.323 and is encoded using ASN.1 PER. H.235 is the security part of H.323.

H.323 is the most widely used IP telephony signaling protocol, but it is very complex and H.323 stacks are very expensive, costing hundreds of thousands of dollars. There is a new open-source H.323 project that can be referenced at <>.

Shore explained the role of a gatekeeper in an IP telephony network. It handles address translation, bandwidth control, and zone management. A gatekeeper is needed for billing purposes. Call signaling may also be routed through a gatekeeper. Shore went over several alternative ways to set up call signaling and the various phases of a telephone call.

Shore wrapped up by talking about some of the addressing issues in IP telephony. The standard that covers this is ITU-T E.164. There are open issues involved with locating users and telephone numbers in an IP network.

Will There Be an IPv6 Transition?
Allison Mankin, USC/Information Sciences Institute

Summary by Bruce Jones

Allison Mankin's talk explored the problems, concerns, and potentials surrounding the IETF's proposal to move the Internet to IPv6—the "next generation" of the Internet Protocol.

The problem with the current generation of IP—IPv4—is simple: there are not enough addresses for everyone to set up and operate the networks they want. (We won't go into "needs" here, following the as-apolitical-as-possible model of the IETF. As Mankin notes, not everyone is using all of their addresses to best advantage, but . . . )

Compound this shortage with coming uses of IP for things like networks for houses, cars, and Asia, and even with 2^32 (4 billion) nodes possible in IPv4 you see that you couldn't cover the last of these even if all its users wanted was access to a free email account at Yahoo! and a remote-control refrigerator.

So the IETF, in its infinite wisdom, organized the "IP Next Generation Working Group," whose job was to see whether it could produce a standard for the generation of IP to succeed IPv4: IPv6. Mankin was the co-director of the IETF Steering Group process that led to formation of the working group.

The IETF, prior to the formation of the IPng Working Group, had generated three proposals for a new standard: CLNP, proposed by the International Standards Organization; a "radical change candidate"; and the successful candidate, "Simple IP." This plan—now called IPv6—is capable of supporting billions of networks and trillions of end-nodes in its 128-bit address space.

The IPv6 address space is broken up in interesting ways. Toss out the three bits for a format prefix and eight bits "Reserved for future use," and you're left with bits for a "Top-Level Aggregation" (8K of global ISPs—a number and scheme designed to reduce load on major routers), Next-Level Aggregation for ISPs (137 billion prefixes), and Site-Level Aggregation, with 65.5K networks for every subscriber site. Polish it off with 64 bits for an Interface Identifier (IID), which, if I understood correctly, is just the MAC address of the device, and you have the potential for enough addresses for Internet light switches in every flat in China.
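The arithmetic behind those counts checks out; a few lines make the field widths of the aggregatable global unicast format explicit (the widths are from RFC 2374, which the talk's numbers match):

```python
# Field widths, in bits, of the IPv6 aggregatable global unicast format:
# Format Prefix, Top-Level Aggregation, Reserved, Next-Level Aggregation,
# Site-Level Aggregation, Interface Identifier.
FP, TLA, RES, NLA, SLA, IID = 3, 13, 8, 24, 16, 64

assert FP + TLA + RES + NLA + SLA + IID == 128   # the whole address
assert 2**TLA == 8192                            # "8K of global ISPs"
assert 2**(TLA + NLA) == 137_438_953_472         # ~137 billion prefixes
assert 2**SLA == 65536                           # 65.5K networks per site
```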

Along the way this plan was modified to include address space for Sub-Top-Level Aggregators because of the demands of the more conservative address managers.

While IPv6 is a coming thing, some strong currents in the Internet world are counteracting the need for a broader address space. Primary among these is NAT (Network Address Translation). A NAT is a device that "connect[s] an isolated address realm with private addresses to an external realm with globally unique registered addresses" (<>). Put simply, if all you have is one address and you want to put several machines on the Internet, then a NAT will handle the job by letting you give your machines pseudo IP addresses while it handles traffic outside your shop via your single address.
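That translation amounts to a table keyed on the private address and port. A toy model (real NATs also track protocol state, timeouts, and much more; all names here are invented for illustration):

```python
class SimpleNAT:
    """Toy port-translating NAT: many private hosts share one public IP."""

    def __init__(self, public_ip):
        self.public_ip = public_ip
        self.next_port = 40000          # arbitrary start of the port pool
        self.out = {}                   # (priv_ip, priv_port) -> pub_port
        self.back = {}                  # pub_port -> (priv_ip, priv_port)

    def outbound(self, priv_ip, priv_port):
        # Outgoing traffic is rewritten to come from the single public
        # address, on a port unique to this private (host, port) pair.
        key = (priv_ip, priv_port)
        if key not in self.out:
            self.out[key] = self.next_port
            self.back[self.next_port] = key
            self.next_port += 1
        return self.public_ip, self.out[key]

    def inbound(self, pub_port):
        # Replies to the public port are mapped back to the private host.
        return self.back.get(pub_port)
```

The pseudo addresses (10.0.0.x and friends) never leave the shop; only the one registered address is visible outside.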

NATs are delaying the transition to IPv6 because they offer a solution to address shortages that works in many areas and for most applications. However, NATs will not be able to forestall that transition completely, because they do not work as well at the provider level as IPv6 addresses.

Finally, returning to the question in the title—will there be a transition to IPv6?—Mankin finds that the answer depends on what is meant by transition. If by transition one means that the entire network rushes to embrace IPv6, then the answer is clearly no. People with systems to keep up are understandably loath to replace current working technology with something new simply because the backers of the new tech say it's better. For many, IPv4 serves current needs perfectly well, thank you very much. On the other hand, for those for whom IPv4 is not a solution, there is movement toward IPv6, as would be expected.

The biggest push toward IPv6 will come as the number of places that have made the transition begins to exceed the number that haven't. As Asia comes online in really large numbers, if no NAT solution is discovered that can handle those numbers, there will be business pressure on non-IPv6 users to make the transition too.

The slides from Mankin's talk are at <>.

The Joys of Interpretive Languages: Real Programmers Don't Always Use C
Henry Spencer, SP Systems

Summary by Arthur Richardson

Henry Spencer started his talk by claiming that far too often a programmer will take the wrong approach when trying to solve a problem. In many cases programmers immediately start to code a solution in C. This approach usually causes unnecessary work, and working at too low a level yields too complex a solution. An example he mentioned is the use of C to write the program man.

The typical reason someone will use C is the perception that it will result in a more efficient program. Spencer reminds us that writing in C does not guarantee efficient code. The algorithm the programmer uses is more important to efficiency than the language. First-cut C code is not always fast.

One of the characteristics of an interpretive language is having significant smarts at runtime as opposed to compile time. Java, he feels, is a half-breed that tries to be all things to all people.

His reasons for the use of an interpretive language include: fast turnaround times, better debugging tools, dynamic code, and working at a much higher level. Although each of these benefits can be derived from compiled languages, they are usually much harder to achieve with them.

The first goal in using an interpretive language is that it must work to solve the problem. Often, this is good enough. Performance may not always be important, and a working solution can always be used to please management. It also allows for others involved in the project to work on their areas of the solution much sooner in the process. Documentation can be created much earlier in the development process. Clean code is a second goal. The programmer should make it easy to alter the code for requirements that change later and for modifications such as improving the user interface. Languages can't demand clear code, but they can encourage it. A third goal is for the program to run fast enough. Often soft realtime is all that is required. Spencer has written a mark sense reader, for entering test answers, that was coded entirely in Awk. There are circumstances when you can just throw hardware at it if performance is still not acceptable. Other times there may be a real performance requirement that requires a lower-level language, but he claims that doesn't happen as often as people would guess.

Spencer went on to describe a few of the interpretive languages and to point out some of their benefits and weaknesses:

SH. SH is often adequate. Although it is slow and clumsy at low-level operations, it has good primitives. The only problem with some of those primitives is that they weren't designed with shells in mind: often a program doesn't work in a way that allows inline use of itself in a script. An example is the program quota. The shell is also very weak at arithmetic.

Awk. Awk is usually considered a glue language. It is very good at doing small data manipulation. It is capable of doing larger jobs, such as the troff clone that Spencer wrote, but can sometimes be slow. It is very clumsy at doing some things such as splitting strings or subtracting fields. It sometimes feels like development on this language was stopped before it was completely built. There aren't any methods built into the language to allow for extensions, so it will most likely remain as a glue language.

Perl. Spencer described Perl as "Awk with skin cancer." Perl is better evolved and has a better implementation than Awk. The strongest benefits of Perl are the large number of extensions built for it and its large and active user community. Spencer then characterized Perl as having readability problems and said that the structure of the language makes it hard to write good, readable code.

Tcl. Tcl was designed to control other things and not be its own language. There are many extensions available for it, including Tk for programming in X and Expect for controlling interactive programs. Historically, Tcl suffered from performance problems, but Spencer feels that it has improved over time. It suffers from a user community that is not very well organized, but that is improving as well.

Python. Spencer claims that Python is a long step away from scripting. Much more syntax and more data types are used, which makes it feel more like a programming language instead of an interpretive one. The object-oriented design of the language requires much more design consideration prior to beginning the coding process. All of these factors may take Python a step too far away from the traditional interpretive languages.

When you are deciding what language to use when attacking a problem, one thing to avoid is language fanaticism. A solution that mixes interpretive languages with compiled extensions can often be the best answer.

The downsides of interpretive languages include late error checking, limited data types, and the overhead in using mixed solutions.

In the choice of a language to use, availability is one of the more important considerations. You must also consider what each language is good at. Compare the need for data manipulation and pipes against the need for arithmetic calculations. Familiarity with a language is very overrated, and often the benefit of learning a language that suits the problem outweighs the cost of the time involved in learning it.

E-Mail Bombs, Countermeasures, and the Langley Cyber Attack
Tim Bass, Consultant

Summary by Bruce Jones

"You gotta keep the mail running 100%—the mission is to filter mail, kill the spam, stop the trouble, keep the stuff running 100% of the time and never let the system go down." --Tim Bass

Bass began with a long thank-you to the creators of UNIX, sendmail, and the other tools available to him, pointing out that, while he hadn't invented any of the utilities, facilities, and software packages he used to stop the attack, neither would there have been tools or the network to use them on if it were not for many of the folks in the USENIX audience. This proved a nice segue into the core theme of his talk, that computers and systems and software and users and admins are just interrelated nodes on a system.

"We are in a network of other people, and anything that we want to do that is significant requires other people." In Bass's case, "significant" will come to be defined as anything to do with setting up, running, maintaining, or protecting a network.

Bass then turned his talk to a loose history of the events of the Langley Cyber Attack:

An obviously forged email message from Clinton to Bass's boss alerts Bass to the fact that the logging on his machines is not sufficient for intrusion detection. When he turns up the logging, he finds that Langley machines are relaying the trash of the Internet—porno, hate mail, advertising spam, get-rich-quick schemes, anything and everything.

At Langley, the initial response is to retaliate: bomb the spammers and porno generators back; turn on error messages so they would be bombed automatically. Bass convinces these folks to simply absorb all the traffic but not to retaliate, not to reply. As Bass noted, "Archive all traffic [and] stop all error messages because if you forward to the sender and you send a bad message, the reply goes to the victim." Bass's strategy is the "OODA loop": observe, orient, decide, act.

Like anyone trying to solve a complex problem, these folks had some lessons to learn along the way:

First they try to clean up all outgoing mail. This proves to be an impossible task, as they are getting ~3K messages every couple of minutes. The second alternative is to queue all outgoing mail. Bass notes two problems with this tactic: The first is technical—writing scripts to do the work. This is fairly easy, if labor-intensive. The second is political. Sysadmins have to worry about security—sensitive mail that should not be seen by other than the intended recipients; they have to worry about the privacy rights of the individuals—sysadmins are not allowed to look at someone else's message without good reason; and they have to worry about resource allocation and use.

Bass decides that the mail header files are fair game. He can key on those and decide which messages to dump out of the delivery queue and into a holding area for later use, if such use becomes possible or necessary.
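A header-keyed filter of that general shape can be sketched in a few lines (the patterns and policy here are invented for illustration, not Bass's actual rule sets):

```python
import email

def should_hold(raw_message, blocked_relays):
    """Decide from headers alone whether a queued message belongs in the
    holding area.  Only the headers are inspected, never the body, which
    sidesteps the privacy problem with reading other people's mail."""
    msg = email.message_from_string(raw_message)
    received = " ".join(msg.get_all("Received", []))
    return any(relay in received for relay in blocked_relays)
```

Keying on headers keeps the sysadmin on the right side of the privacy line while still letting a script pull suspect traffic out of the delivery queue.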

He also immediately stops relaying mail, which has foreseeable effects: "When we cut off the relays it pissed off the hackers. . . . People started bombing and probing us," and they started trying to work around the fences: "Every rule set we came up with, they figured out." Many of the standard methods of defense fail. "We tried firewalls. That worked for about two seconds."

Other lines of defense were in place, even though Bass didn't realize it: "Having a really slow network is a good line of defense."

After finishing his history lesson, Bass then gave a guided tour of some readily available hacker tools and techniques on the Web. He covered four types of mail attack: chain bombing; error-message bombing; covert channel distribution; and mailing-list bombing. Then he ran through a few of the available Windows-based GUI mail-bomber's tools: Unabomber; Kaboom; Avalanche; Death & Destruction; Divine Intervention; Genesis.

The conclusion of Bass's talk was an overview of the future of the work of protecting systems against these kinds of attacks: "Intrusion detection systems and firewalls are largely ineffective because you can't understand a network from a GUI. In networks we need to be looking at another paradigm for the future. We need to be teaching our operators and the people on the network awareness of what's happening on the network. We need to take the concepts of situational awareness and begin building awareness systems that allow people to understand network infrastructure. . . . Our systems haven't learned to fuse sensor information with long-term knowledge and then develop mid-term situational awareness."

As Bass noted, much of the hacker/cracker menace is just juveniles engaging in the same kinds of (what used to be mildly destructive) vandalism that kids have engaged in for decades. Unfortunately for the system administrators whose systems are the targets of the bad guys, these kids have more time, tools, and energy (and, in some cases, computer horsepower) available than working sysadmins. Their available resources turn what might look like mild vandalism from the user-interface end into serious problems at the receiving end.

To paraphrase Bass, sysadmins and security people are going to expend a lot of resources in the next few years dealing with this sort of stuff. Sysadmins have to prepare for the day when they have multiple attacks on their network(s) with multiple decoy targets and actual targets. They have to have systems where "the average 17-year-old operator who's working a summer job managing a network is now able to differentiate between what's real and what's just someone having fun on the network."

You can read all the details in Bass's paper at <www/>.

Big Data and the Next Wave of InfraStress Problems, Solutions, Opportunities
John R. Mashey, Chief Scientist, SGI

Summary by Art Mulder

John Mashey, current custodian of the California "UNIX" license plate, presented an overview of where computer technology appears to be heading and outlined areas where we need to be concerned and prepared. A key opening thought was that if we don't understand the upcoming technology trends, then watch out, we'll be like people standing on the shore when a large wave comes rushing in to crash over us.

Mashey began with a definition of the term "infrastress," a word that he made up by combining "infrastructure" and "stress." You experience infrastress when computing subsystems and usage change more quickly than the underlying infrastructure can change to keep up. The symptoms include bottlenecks, workarounds, and instability.

We all know that computer technology is growing: disk capacities, CPU speeds, and RAM capacities constantly increase. But we need to understand how those technologies interact, especially when their growth rates are not parallel. The audience looked at a lot of log charts to understand this; on a log chart, for instance, we could clearly see that CPU speed is increasing far faster than DRAM access times are improving.

Most (all?) computer textbooks teach that a memory access costs roughly the same as a CPU instruction. But with new technologies the reality is that a memory operation, like a cache miss, may cost you 1,000 CPU instructions. We need to be aware of this and change our programming practices accordingly. The gap between CPU and disk latency is even worse, so avoid disk access at all costs: for instance, how can I change my program to use more memory and avoid going to disk? Similarly, programs should minimize trips across the network, since network latency is another concern.

Disk capacity and latency are another area where two technologies are growing at different rates. Disk capacity is growing faster than disk-access time is shrinking. We are packing in a lot more data, but our ability to read it back is not speeding up at the same rate. This is a big concern for backups. Mashey suggested that we may need to move from tape backups to other techniques — RAIDs, mirrors, or maybe backup on cartridge disks. We also need to change our disk filesystems and algorithmic practices to deal with the changing technology.

One interesting side comment had to do with digital cameras and backups. Virtually everyone in attendance probably has to deal with backups at work. Yet how many people bother with backups at home? Probably very few, since most people don't generate that much data on their home systems. A few letters or spreadsheets, but for the rest the average home system these days is most likely full of games and other purchased software, all of which are easily restored from CD-ROM after a system crash. Yet very soon, with the proliferation of digital cameras, we can expect that home computer systems are going to become filled with many gigabytes of irreplaceable data in the form of family snapshots and photo albums. Easy and reliable backup systems are going to be needed to handle this.

Mashey's technology summary: On the good side, CPU is growing in MHz, and RAM, disk and tape are all growing in capacity. On the bad side, all those technologies have problems with latency. This means that there is lots of work to be done in software and exciting times for system administrators.

The slides for this talk are available at <>.

What's Wrong with HTTP and Why It Doesn't Matter
Jeffrey C. Mogul, Compaq Western Research Laboratory

Summary by Jerry Peek

You probably know Jeff Mogul or use products of his work, such as subnetting. One of his main projects in the '90s has been the HTTP protocol version 1.1. Even his mother uses HTTP; it carries about 75% of the bytes on the Internet. HTTP didn't have a formal specification until 1.1, and that process took four years. It wasn't an easy four years. Mogul started by saying that the talk is completely his own opinion and that some people would "violently disagree."

The features of HTTP are well documented; the talk covered them briefly. It's a request-response protocol with ASCII headers and an optional unformatted binary body. HTTP/1.1 made several major changes. In 1.1, connections are persistent, and clients can pipeline multiple requests per connection, which increases efficiency. Handling of caching is much improved. HTTP/1.1 supports partial transfers—for example, to request just the start of a long document or to resume after an error. It's more extensible than version 1.0. There's digest authentication without 1.0's cleartext passwords. And there's more. (For details, see the paper "Key Differences Between HTTP/1.0 and HTTP/1.1" <>.)
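A partial transfer, for example, is just an ordinary request carrying a Range header. A minimal sketch of building one (the helper function is my own illustration, though the header names come from the HTTP/1.1 specification):

```python
def range_request(host, path, start, end):
    """Build a raw HTTP/1.1 request for bytes start..end (inclusive) of a
    resource, reusing a persistent connection via keep-alive."""
    return ("GET {} HTTP/1.1\r\n"
            "Host: {}\r\n"              # Host is mandatory in HTTP/1.1
            "Range: bytes={}-{}\r\n"    # ask for just part of the resource
            "Connection: keep-alive\r\n"
            "\r\n").format(path, host, start, end).encode()
```

A client resuming an interrupted download would send such a request with `start` set to the number of bytes it already holds.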

Most of the talk was a long series of critiques, many more than can be mentioned here. Here's an example. Calling HTTP "object-oriented" is a mistake because the terms "object" and "method" aren't used correctly. For instance, HTTP transfers the current resource response instead of the resource itself. A resource can be dynamic and vary over time; there's no cache consistency for updatable resources. There's no precise term for the "thing that a resource gives in response to a GET request at some particular point in time." This and other problems led to "fuzzy thinking" and underspecification. For example, if a client requests a file, the connection terminates early, and a partial transfer is used to get the rest of the file—there's no way to make an MD5 checksum of the entire instance (only of each of the two messages).

There were procedural problems in the HTTP-WG working group (of which Jeff was a member). The spec took more than four years to write. Lots of players joined relatively late, or moved on. There was a tendency to rush decisions ("gotta finish!") but, on the other hand, architectural issues tended to drift because the group wanted to get the "big-picture" view. The protocol was deployed in 1996 as RFC 2068, an IETF Proposed Standard. Normally, RFCs in this stage aren't ready for widespread deployment; they're usually revised to fix mistakes. The early deployment made it hard to change the protocol. There weren't enough resources for tedious jobs that needed doing. On the good side, the long process gave the group time to reflect, find many bugs in the original design, and come to a consensus. The vendors "behaved themselves," not trying to bias toward their code. ("Engineers cooperate better than marketers," Jeff pointed out.) He said that HTTP/1.1 has a good balance between fixes and compatibility.

The bottom line, he said, is that the bugs in HTTP don't matter. (If technical excellence always mattered, then FORTRAN, Windows, QWERTY keyboards, and VHS tapes would be dead and buried.) HTTP works well enough, and revising it again would be too hard. For instance, poor cache coherence can be fixed by cache-busting or by telling users, "For latest view, hit Shift-Reload." Inefficiency in the protocol (he gave several examples) might be irrelevant, as bandwidth keeps increasing and as "site designers get a clue . . . some day." No single protocol can support every feature; there will be other protocols, such as RealAudio, that suit particular needs. Human nature adapts readily to circumstance. HTTP isn't perfect, but it'll be hard to revise again — especially as the installed base gets massive and sites become mission-critical.

The presentation slides are at <>.

UNIX to Linux in Perspective
Peter Salus, USENIX Historian

Summary by Jerry Peek

USENIX historian Peter Salus gave a warm and fascinating talk full of tidbits, slides showing early UNIX relics, and lots of interaction with the audience (many of whom had stories to contribute).

1999 is the thirtieth anniversary of "everything that makes it possible for us to be here": the birth of both the ARPANET (the predecessor of the Internet) and UNIX. He wove those two histories together into this talk because, without the Internet, we wouldn't have Linux at all.

Peter started by showing the first page of the first technical report that was the foundation of ARPANET, "the best investment that . . . the government has ever made." The bottom line was $1,077,727 (for everything: salaries, phone links, equipment), over a period of 15 months, to set up a network with five nodes. On April 7, 1969, RFC 1 was released. As we think about 128-bit addressing, remember that RFC 1 provided for five-bit addressing; no one had "the foggiest idea" what the possible growth was. The first host was plugged in on September 2, 1969. Peter showed Elmer Shapiro's first network map: a single line. By the end of 1969, the size of the Net had quadrupled . . . to four sites . . . and each site had a different architecture. The first two network protocols let you log into a remote machine and transfer files. By 1973, there were two important network links by satellite: to Hawaii and to Europe.

In October 1973 the Symposium on Operating Systems Principles at IBM Research in Yorktown Heights was where Ken and Dennis gave the first UNIX paper. (Before that, almost all UNIX use was at Bell Labs.) Peter said that the paper "absolutely blew people away," leaving a lasting mark on people's lives. Lou Katz remembers Cy Levinthal telling him to "get UNIX"; he did, from Ken Thompson, and he was the first outside person to get UNIX (on RK05s, which he didn't have a way to read at the time). There was no support, "no anything"; this forced users to get together—and, eventually, to become USENIX. May 15, 1974, was the first UNIX users' meeting.

The first text editor, and the one that all UNIX systems have, is ed: "my definition of user-hostile," said Peter. George Coulouris, in the UK, had "instant hate" for ed. He rewrote it into em, which stands for "ed for mortals." Then em came to UC Berkeley, on tape, when George went there on sabbatical. One day, George was using em at a glass terminal (Berkeley had two glass terminals!), and a "wild-eyed graduate student" sitting next to him took a copy. Within a couple of weeks, that student, Bill Joy, had rewritten the editor into ex . . . which was released in the next edition of UNIX from Bell Labs. The story here is of software going from a commercial company in New Jersey, to an academic institution in the UK, to another academic institution in California, back to the same company in New Jersey. This kind of exchange gave rise to the sort of user community that fostered Linux . . . and brought many of us to where we are today. (It wasn't the first "open source" software, though. Peter mentioned SHARE, the IBM users software exchange, which began in 1955.)

Ken and Dennis's paper appeared in the July 1974 issue of CACM. At the same time, a group of students at the University of Illinois started the foundation of RFC 681. The students said they were "putting UNIX on the net." In actuality, they were doing the opposite: causing the network to ride on top of UNIX. Suddenly, the community realized that using UNIX as the basis of the network changed everything.

One big meeting of the USENIX Association was in June 1979 in Toronto. The meeting was preceded by a one-day meeting of the Software Tools User Group, STUG. At that meeting, the first speaker was Al Arms from AT&T, who announced a big increase in UNIX licensing prices. Now UNIX V7 would cost $20,000 per CPU, and 32V would be $40,000 per CPU. Although academic institutions paid much less, "I don't think anybody was very happy," Peter quipped. This was "the sort of mistake, . . . a corporate lack of common sense," that drives users to create things like MINIX, which Andy Tanenbaum did that year. And MINIX was what helped Linus Torvalds, a dozen years later, to write Linux. "If it's good and you make it exorbitant, you drive the bright guys to find alternatives."

You can read much more history in Peter's ;login: articles.


Session: File Systems
Summary by Chris van den Berg

Soft Updates: A Technique for Eliminating Most Synchronous Writes
in the Fast Filesystem

Marshall Kirk McKusick, Author and Consultant; Gregory R. Ganger, Carnegie Mellon University

Marshall Kirk McKusick presented soft updates, a project he has been working on for much of the last few years. As the title of the paper suggests, the central intention of soft updates is to increase filesystem performance by reducing the need for synchronous writes in the Fast Filesystem (and its current derivatives, most commonly today's UFS). Soft updates also provide an important alternative to file systems that use write-ahead logging, another common implementation for tracking synchronous writes. Additionally, soft updates can eliminate the need to run a filesystem-check program, such as fsck, by ensuring that unclaimed blocks or inodes are the only inconsistencies in a file system. Soft updates can also take snapshots of a live filesystem, useful for doing filesystem backups on nonquiescent systems.

The soft updates technique uses delayed writes for metadata changes, tracking the dependencies between the updates and enforcing these dependencies during write-back. Dependency tracking is performed on a per-pointer basis, allowing blocks to be written in any order and reducing the circular dependencies that occur when dependencies are recorded only at the block level. Updates in a metadata block can be rolled back before the block is written and rolled forward after the write. In this scheme, applications are guaranteed to see current metadata blocks, and the disk sees copies that are consistent with its contents.
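The write-back ordering can be sketched as a toy model (it simply orders writes by dependency and omits the per-pointer roll-back machinery that real soft updates uses to break cycles):

```python
class Block:
    def __init__(self, name):
        self.name = name
        self.deps = []       # blocks that must reach disk before this one

def flush(dirty):
    """Write dirty blocks to 'disk' in an order that honors dependencies."""
    written, order = set(), []

    def visit(block, path=()):
        if block.name in written:
            return
        if block in path:
            # Real soft updates rolls back entries to break such cycles;
            # this toy model just refuses.
            raise RuntimeError("circular dependency")
        for dep in block.deps:
            visit(dep, path + (block,))
        written.add(block.name)
        order.append(block.name)

    for block in dirty:
        visit(block)
    return order
```

The classic case: a newly allocated inode must be initialized on disk before the directory block that points at it is written, or a crash could leave a directory entry referencing garbage.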

McKusick discussed the incorporation of soft updates into the 4.4BSD-based Fast File System (FFS) used by NetBSD, OpenBSD, FreeBSD, and BSDI. The three examples of soft updates in real environments were very impressive. These tests compared the speed of a standard BSD FFS, a file system mounted asynchronously, and a file system using soft updates. The first is McKusick's "filesystem torture test," which showed asynchronous and soft updates requiring 42% fewer writes (with synchronous writes almost nonexistent) and a 28% shorter running time than the tests run on the BSD FFS. The second test involved building and installing the FreeBSD system (known as "make world" in FreeBSD parlance). Soft updates resulted in 75% fewer writes and 21% less running time. The last test involved the BSDI central mail server (comparing only the BSD FFS and soft updates, since asynchronous mounts are obviously too dangerous for real systems requiring data coherency). Soft updates required a total of 70% fewer writes than the BSD FFS, dramatically increasing the performance of the file system.

The soft updates code is available for commercial use in BSDI's BSD/OS, versions 4.0 and later, and on FreeBSD, NetBSD, and OpenBSD. Also, McKusick announced that Sun Microsystems has agreed to consider testing and incorporation of soft updates into Solaris.

Design and Implementation of a Transaction-Based Filesystem on FreeBSD
Jason Evans, The Hungry Programmers

Jason Evans discussed transactional database-management systems (DBMSes), which are structured to avoid data loss and corruption. One of his key points was that the traditional BSD Fast File System (FFS) doesn't address the data-integrity requirements facing designers of transactional database-management systems.

Typically, programmers of transaction-based applications must ensure that atomic changes to files occur in order to avoid the possibility of data corruption. In the FFS, the use of triple redundancy is common in order to implement atomic writes. The principal downside of a triple-redundancy scheme is that performance tends to suffer greatly. The alternative that Evans proposed is the use of a Block Repository (BR), which is similar in many respects to a journaled filesystem. The major highlights of the BR are that it provides:

  • A simple block-oriented, rather than a file-oriented, interface.

  • Userland implementation that provides improved performance and control. The block repository library, which is linked into applications, controls access to allocated storage resources.

  • Data storage on multiple devices, named backing stores, which is similar in many ways to concepts found in volume managers.

A block repository contains at least four backing stores, which are files or raw devices with a header and data space. The backing-store header is triple-redundant to permit atomicity of header updates.
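The effect of a triple-redundant header can be sketched as follows (the format here, a sequence number plus checksum per copy, is an assumed illustration, not SQRL's actual layout):

```python
import zlib

def write_header(copies, seq, payload):
    """Update all three header copies.  Real code writes them one at a
    time, so a crash partway through leaves at least one older-but-valid
    copy intact."""
    for i in range(3):
        copies[i] = (seq, payload, zlib.crc32(payload) ^ seq)

def read_header(copies):
    """Return the payload of the newest copy whose checksum verifies."""
    valid = []
    for entry in copies:
        if entry is None:
            continue
        seq, data, crc = entry
        if crc == zlib.crc32(data) ^ seq:
            valid.append((seq, data))
    return max(valid)[1] if valid else None
```

Because a reader takes the newest verifiable copy, a torn write to one copy cannot corrupt the header as a whole, which is what makes header updates effectively atomic.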

The block repository is designed to be long-running and allow uninterrupted access to the data in the BR. Online insertion and removal of backing stores is possible, which allows modification of the BR size without downtime for configuration changes or maintenance. The repository scheme also allows for block caching, block locking, data-block management, and transaction commit-log processing. Additionally, the BR supports full and incremental backup while online.

The block repository is part of SQRL, a project sponsored by the Hungry Programmers (<>). Information on SQRL is available at <>.

The Global File System: A Shared Disk File System for *BSD and Linux
Kenneth Preslan and Matthew O'Keefe, University of Minnesota; John Lekashman, NASA Ames

The Global File System is a Shared Disk File System (SDFS) that implements a symmetric-share distributed filesystem. It is distributed under an open-source GPL license and implements a high-performance 64-bit network storage filesystem intended for IRIX, Linux, and FreeBSD.

The basic design of the GFS includes a number of GFS clients, a Storage Area Network, and a Network Storage Pool. Multiple clients are able to access the Storage Area Network simultaneously.

Some of the key design features of the Global File System are:

  • Increased availability. If one client fails, another may continue to process its tasks while still accessing the failed client's files on the shared disk.

  • Load balancing. A client can quickly access any portion of the dataset on any of the disks.

  • Pooling. Multiple storage devices are made into a unified disk volume accessible to all machines in the system.

  • Scalability in terms of capacity, connectivity, and bandwidth. This avoids many of the bottlenecks in file systems such as NFS, which typically depend upon a centralized server holding the data.

The implementation includes a pool driver, which is a logical-volume driver for network-attached storage. It handles disks that change IDs because of network rearrangement. A pool is made up of subpools of devices with similar characteristics. The file system presents a high-performance local filesystem with intermachine locking and is optimized for network storage. Device locks are global locks that provide the synchronization necessary for a symmetric Shared Disk File System. Each lock lives at the level of the network storage device and is accessed with the Dlock SCSI command. GFS also uses dynamic inodes and flat, 64-bit metadata structures (all sizes, offsets, and block addresses are 64 bits). Additionally, file metadata trees are of uniform height. To increase performance, hashing of directories is used for fast directory access, and full use is made of the buffer cache. Performance testing showed 50MB/s write performance, with slightly less on read.
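The device-lock idea can be modeled as a test-and-set cell held per lock on the storage device itself (a toy model; the real Dlock is a SCSI command, and the names here are invented):

```python
class DeviceLocks:
    """Toy model of locks held by a storage device rather than a server."""

    def __init__(self, n_locks):
        self.owner = [None] * n_locks

    def dlock(self, lock_id, client):
        """Test-and-set: succeeds if the lock is free or already ours."""
        if self.owner[lock_id] in (None, client):
            self.owner[lock_id] = client
            return True
        return False

    def unlock(self, lock_id, client):
        if self.owner[lock_id] == client:
            self.owner[lock_id] = None
```

Because the device arbitrates, any number of clients on the Storage Area Network can synchronize metadata updates without a central lock server, which is what keeps the design symmetric.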

More information on the Global File System can be found at <>.

Session: Device Drivers
Summary by Chris van den Berg

Standalone Device Drivers in Linux
Theodore Ts'o, MIT

Theodore Ts'o discussed the distribution of device drivers outside of the Linux kernel. Device drivers have traditionally been developed and distributed inside of the kernel, a situation that can have a number of disadvantages. For example, versions of drivers are inherently tied to a given kernel version. This could lead to a situation in which someone wanting to run a stable device-driver release would be required to run a bleeding-edge kernel—or vice versa: someone running a stable kernel release ends up running a device driver that may still have a number of problems. Additionally, device-driver distribution doesn't scale well. The size of kernel distributions increases proportionally with the number of device drivers that are added. Growth of the kernel therefore can't be tied to the number of device drivers available for it if long-term scalability is desired.

Initially, one method of having separate device-driver distribution simply involved supplying patches for a given driver, reconfiguring, and recompiling the kernel. As time went on, loadable kernel modules were introduced that allowed developers to reduce the configuration, compilation, and test time dramatically. This also keeps kernel bloat to a minimum. Kernel modules, Ts'o noted, are an excellent distribution mechanism for standalone device drivers.

One complication in building modules outside of the kernel tree is that the kernel build infrastructure may no longer be present. Linux does not, however, typically require a complex system for building kernels. For very simple modules, a few C preprocessor flags in the Makefile suffice. Similar modifications to the Makefile can be made in order for drivers to be built both standalone and inside the kernel.

One of the issues in installing modularized device drivers is user-friendliness. This can be addressed by ensuring during installation that the driver is placed in the right location (which may be a little trickier than it sounds, depending on the kernel version), and setting up the rc scripts necessary to load the module at startup. Also, for modules that export functions, it's important to modify kernel header files so that other modules can use a driver's exported interfaces. This leads to the conclusion that a shell script is a better option than a Makefile target for performing all of these functions. Some amount of work is also needed within Linux in order to better support the development of standalone device drivers, such as standardization of rc.d scripts and binary compatibility. The latter could be achieved through an adapter layer, which could maintain compatibility at the ABI level but may raise performance concerns. Taking compatibility issues one step further, a project to create a "Uniform Device Interface" has been undertaken by some industry vendors (most notably SCO). This could allow device drivers to be portable across many OSes, but, again, performance concerns are a major issue.

Design and Implementation of Firewire Device Driver on FreeBSD
Katsushi Kobayashi, Communication Research Laboratory

Katsushi Kobayashi discussed the implementation of a Firewire device driver under FreeBSD. This driver includes an IP network stack, a socket system interface, and a stream-device interface.

Firewire (the IEEE 1394 high-performance serial bus, also marketed as iLink) is currently the area of greatest interest in audio-visual computing. As a standard, Firewire encompasses everything from the physical layer up to network-management functions and is capable of high network bandwidth. It also supports online insertion and removal, as well as the integration of numerous different peripheral types into one bus system.

The standard itself defines a raw packet-level communication protocol; applications depend on higher-level protocols layered over Firewire. Firewire device drivers have already been implemented for Windows 98, Windows NT, and Linux. Firewire is being standardized for use with IP networking, audio-visual devices, and other peripherals through protocols such as SBP (Serial Bus Protocol), a SCSI adaptation protocol for Firewire.

The FreeBSD implementation of the Firewire device driver is divided into two parts: the common parts of the Firewire system that are hardware-independent, and the device-dependent parts. The device driver currently supports two types of Firewire chipsets, the Texas Instruments PCILynx and the Adaptec AIC5800, with plans to develop driver code for newer-generation chipsets, such as OHCI and PCILynx2, capable of 400Mbps transmission. The API specification currently developed is still not complete, and compatibility with other types of UNIX is an important goal in further development.

The FreeBSD Firewire device driver can be found at <>.

newconfig: A Dynamic-Configuration Framework for FreeBSD
Atsushi Furuta, Software Research Associates, Inc.; Jun-ichiro Hagino, Research Laboratory, Internet Initiative Japan, Inc.

The original inspiration for newconfig was work done by Chris Torek in 4.4BSD, and the framework for it is currently being ported to FreeBSD-current. Its motivations are PAO development, CardBus support, and dealing with the difficulties of the IRQ abstraction (especially for CardBus support).

The goals of the newconfig project are to merge newconfig into FreeBSD-current, implement dynamic configuration, and add support for any type of drivers and buses. The eventual removal of the old config(8) is also one of the purposes of newconfig, which has the advantage of bus and machine independence. Newconfig supports separation of bus-dependent parts of device drivers from the generalized parts. Auto-configuration includes configuration hints to the device drivers, bus and device hierarchy information, inter-module dependency information, and device-name-to-object-filename mappings. Currently newconfig handles these components by statically linking them to the kernel. Part of the future work for newconfig includes dynamic configuration.

Information on newconfig is available at <>.

Session: File Systems II
Summary by Chris van den Berg

The Vinum Volume Manager
Greg Lehey, Nan Yang Computer Services Ltd.

Greg Lehey discussed the Vinum Volume Manager, a block device driver implementing virtual disk drives. In Vinum, disk hardware is isolated from the block device interface, and data is stored with an eye toward increasing performance, flexibility, and reliability.

Vinum addresses a number of issues pressing upon current disk-drive and filesystem technology:

  • Disk drives are too small for current storage needs. Disk drivers that can create abstract storage devices spanning multiple disks provide much greater flexibility for current storage technology.

  • Disk subsystems can often become a bottleneck, not necessarily because of slow hardware but because of the load that multiple concurrent processes can place on the subsystem. Effective transfer capacity, for example, is greatly reduced in the presence of many small random accesses.

  • Data integrity is critical for most installations. Vinum addresses this through volume management with both RAID-1 and RAID-5.

Vinum is open-source volume-management software available under FreeBSD. It was inspired, according to its author, by the VERITAS volume manager. It implements RAID-0 (striping), RAID-1 (mirroring), and RAID-5 (rotated block-interleaved parity). Vinum allows for the possibility of striped mirrors (a.k.a. RAID-10). Vinum also provides an easy-to-use command-line interface.

Vinum objects are divided into four types: Volumes, Plexes, Subdisks, and Drives.

Volumes are essentially virtual disks that are much like a traditional UNIX disk drive, with the principal exception that volumes have no inherent size limitations. Volumes are made up of plexes. Plexes represent the total address space of a volume and are the key hierarchical component in providing redundancy. Subdisks are the building blocks of plexes. Rather than tie subdisks to UNIX partitions, which are limited in number, Vinum subdisks allow plexes to be composed of numerous subdisks, for increased flexibility. Drives are the Vinum representation of UNIX partitions; they can contain an unlimited number of subdisks. Also, an entire drive is available to the volume manager for storage.
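The hierarchy just described can be sketched in miniature. The class and device names below are invented for illustration, not Vinum's actual data structures; the sketch shows how a volume offset resolves through each plex (a full replica) down to a subdisk, which is a contiguous region of a drive:

```python
# Hypothetical sketch of the Vinum object hierarchy (names invented):
# Volume -> Plexes (replicas) -> Subdisks -> regions of Drives.

class Subdisk:
    def __init__(self, drive, drive_offset, length):
        self.drive, self.drive_offset, self.length = drive, drive_offset, length

class Plex:
    """A concatenated plex: subdisks laid end to end."""
    def __init__(self, subdisks):
        self.subdisks = subdisks

    def resolve(self, offset):
        for sd in self.subdisks:
            if offset < sd.length:
                return sd.drive, sd.drive_offset + offset
            offset -= sd.length
        raise ValueError("offset beyond plex")

class Volume:
    """Each plex holds a full copy, so a write must reach every plex."""
    def __init__(self, plexes):
        self.plexes = plexes

    def resolve_write(self, offset):
        return [p.resolve(offset) for p in self.plexes]

# A two-way mirror: one plex built from two small subdisks,
# the other from a single larger subdisk.
vol = Volume([Plex([Subdisk("da0", 0, 100), Subdisk("da1", 0, 100)]),
              Plex([Subdisk("da2", 0, 200)])])
```

Because redundancy lives at the plex level, a mirrored write fans out to one location per plex, while a read can be satisfied from any one plex.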

Vinum has a configuration database that contains the objects known to the system. The vinum(8) utility allows the user to construct volumes from a configuration file. Copies of the configuration database are stored on each drive that Vinum manages.

One of the interesting issues in performance, especially for RAID-0 stripes, is the choice of stripe size. Administrators frequently set the stripe size too low and actually degrade performance by causing single I/O requests to or from a volume to be converted into more than one physical request. Since the most significant performance factor is seek time, multiple physical requests can cause significant slowdowns in volume performance. Lehey empirically determined 256KB to be the optimal stripe size for RAID-0 and RAID-5 volumes. This should ensure that disk access isn't concentrated in one area, while also ensuring that almost all single I/O operations won't result in multiple physical transfers.
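The stripe-size argument is simple arithmetic: an I/O request is split wherever it crosses a stripe boundary. A small back-of-the-envelope sketch (the sizes are illustrative, not from the paper):

```python
# Count how many physical requests a single I/O generates on a
# striped volume: one per stripe boundary crossed, plus one.

def physical_requests(offset, length, stripe_size):
    first = offset // stripe_size
    last = (offset + length - 1) // stripe_size
    return last - first + 1

# A 64KB transfer over a 4KB stripe is shredded into 16 requests,
# while a 256KB stripe usually leaves it as one physical transfer
# (two in the unlucky case where it straddles a boundary).
```

With a 256KB stripe, only transfers that happen to straddle a boundary pay for a second seek, which is why the larger stripe wins despite intuition favoring "spreading the load."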

Future directions for Vinum include hot-spare capability, logging changes to a degraded volume, volume snapshots, SNMP management interface, extensible UFS, remote data replication, and extensible RAID-0 and RAID-5 plexes.

Vinum is available as part of the FreeBSD 3.1 distribution (without RAID-5) and under license from Cybernet Inc. <>.

Porting the Coda File System to Windows
Peter J. Braam, Carnegie Mellon University; Michael J. Callahan, The Roda Group, Inc.; M. Satyanarayanan and Marc Schnieder, Carnegie Mellon University

This presentation described the porting of the Coda distributed filesystem to Windows 95 and Windows 98. (A Windows NT port is still in the incipient stages.) Coda contains user-level cache managers and servers as well as kernel code for filesystem support. It is a distributed filesystem that includes many interesting features:

  • read/write server replication

  • a persistent client cache

  • a good security model

  • access control lists

  • disconnected and low-bandwidth operation for mobile hosts

  • assurance of continuing operation even during network or server failure

The port to Windows 9x involved a number of steps. The port of the user-level code was relatively straightforward; much of the difficulty lay in implementing the kernel code under Windows 9x. Coda's user-level programs include Vice, the file server that services network requests from different clients, and Venus, which acts as the client cache manager. The kernel module is called Minicache. Filesystem metadata for clients and servers is mapped into the address space for Vice and Venus, employing rvm, a transaction package.

The kernel-level module translates Win32 requests into requests that Venus can service. Its initial design was built around BSD UNIX filesystems, and so it required modifications to account for differences between the filesystem models.

Part of the task of getting clients running under Windows 9x involved developing Potemkin Venus, a program that mimics a genuine client cache manager and allows easier testing of the Minicache kernel code. Complications arose with the Win32 API: filesystem I/O calls serviced by a Win32 Potemkin Venus would attempt to acquire the Win16Mutex, resulting in deadlock if the process making the request had already acquired it. This led to the decision to implement the cache manager as a DOS program rather than as a Win32 process, by hosting Venus in a DOS box, which was made possible in part by using DJGPP's compiler and libc. Once workarounds for missing APIs (BSD sockets, select, mmap) were in place, the port became more straightforward. Also, because Windows 95 cannot dynamically load filesystems, a separate filesystem driver was required, and communication between Venus and the Minicache was modified to use UDP sockets.
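The shape of that UDP channel between kernel module and cache manager can be illustrated with two loopback sockets. This is not Coda's actual message format or API; the message strings and the "open" upcall are invented to show the request/reply pattern:

```python
# Illustrative only: two UDP sockets on the loopback interface stand
# in for the Minicache (kernel module) and Venus (cache manager)
# exchanging an upcall and its reply.
import socket

venus = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
venus.bind(("127.0.0.1", 0))           # cache manager's endpoint
minicache = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
minicache.bind(("127.0.0.1", 0))       # kernel module's endpoint

# The Minicache forwards a (hypothetical) "open" upcall to Venus...
minicache.sendto(b"open /coda/demo", venus.getsockname())
request, kernel_addr = venus.recvfrom(1024)

# ...and Venus answers with a status for the kernel module.
venus.sendto(b"ok handle=42", kernel_addr)
reply, _ = minicache.recvfrom(1024)
```

Using datagram sockets this way sidesteps the Win32/Win16 mutex problem entirely, since the DOS-hosted Venus never makes Win32 filesystem calls on behalf of the kernel.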

In summary, many of the complex porting problems were overcome through the use of freely available software packages and the implementation of mechanisms to circumvent the user-level Win32/Win16 mutex problems.

More information on Coda is available at <>.

A Network File System over HTTP: Remote Access and Modification of Files and "files"
Oleg Kiselyov

Oleg Kiselyov discussed the HTTP filesystem (HTTPFS), which allows access to remote files, directories, and other objects via HTTP mechanisms. Standard operations such as file retrieval, creation, and modification are possible as if one were working on a local filesystem. The remote host can be any that supports HTTP and can run Perl CGI scripts, reached either directly or via a Web proxy or gateway. The program runs in user space and currently supports creating, reading, writing, appending, and truncating files on a remote server.

Using standard HTTP request methods such as GET, PUT, HEAD, and DELETE, something akin to a network file system is created, with the added advantages that the system is cross-platform and can run against almost any HTTP server. Additionally, both programmatic and interactive support models exist.

The HTTPFS is a user-level filesystem written as a C++ class library on the client side and requiring a Perl CGI script on the remote server. The C++ classes can be employed directly or via different applications which link to a library that replaces many standard filesystem calls such as open(), stat(), and close(). Modifications to the kernel and system libraries are not necessary, and in fact the system does not even need to be run with administrative privileges.
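The core idea, file operations expressed as HTTP methods, can be sketched compactly. This is not Kiselyov's C++ API; the class, method names, and the stub transport below are invented, with the transport made pluggable so the sketch runs against an in-memory stand-in for the server-side CGI script:

```python
# Sketch (names invented): file operations mapped onto HTTP methods.
# A real client would send these over the network; here a stub
# transport plays the role of the remote CGI script.

class HTTPFile:
    def __init__(self, url, transport):
        self.url, self.transport = url, transport

    def read(self):                 # GET retrieves the "file"
        return self.transport("GET", self.url, None)

    def write(self, data):          # PUT creates or replaces it
        return self.transport("PUT", self.url, data)

    def stat(self):                 # HEAD fetches metadata only
        return self.transport("HEAD", self.url, None)

    def unlink(self):               # DELETE removes it
        return self.transport("DELETE", self.url, None)

store = {}                          # the stub server's "disk"

def stub(method, url, body):
    if method == "PUT":
        store[url] = body
        return b""
    if method == "GET":
        return store[url]
    if method == "HEAD":
        return {"size": len(store[url])}
    if method == "DELETE":
        del store[url]
        return b""

f = HTTPFile("/notes.txt", stub)
f.write(b"hello, world")
```

Swapping the stub for a function that issues real HTTP requests is exactly the kind of substitution the library-replacement approach (intercepting open(), stat(), close()) makes transparent to applications.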

Another advantage of HTTPFS is that the server can apply many of the request methods to objects as if they were files without necessarily being files, such as databases, documents, system attributes, or process I/O.

Kiselyov noted that security risks are inherent in the use of the HTTPFS and that access controls concordant with administrators' authentication and authorization policies should be in place.

Session: Networking
Summary by Chris van den Berg

Trapeze/IP: TCP/IP at Near-Gigabit Speeds
Andrew Gallatin, Jeff Chase, and Ken Yocum, Duke University

This presentation focused on high-speed TCP/IP networking on a gigabit-per-second Myrinet network, which employs a messaging system called Trapeze. Common optimizations above and below the TCP/IP stack are important. They include zero-copy sockets, large packets combined with scatter/gather I/O, checksum offloading, adaptive message pipelining, and interrupt suppression. The tests were conducted on a range of current desktop hardware using a modified FreeBSD 4.0 kernel (dated 04/15/1999), and showed bandwidth as high as 956 Mb/s with Myrinet, and 988 Mb/s with Gigabit Ethernet NICs from Alteon Networks.

It is now widely believed that current TCP implementations are capable of utilizing a high percentage of available bandwidth on gigabit-per-second speed links. Nevertheless, TCP/IP implementations will depend upon a number of critical modifications both above and below a host TCP/IP stack, to reduce data movement overhead. One of this paper's critical foci was to profile the current state of the art in short-haul networks with low latency and error rates, and close to gigabit-per-second bandwidth, as well as to provide quantitative data to support the importance of different optimizations.

The Trapeze messaging system consists of a messaging library linked into the kernel or user applications, and firmware that runs on a Myrinet NIC. Trapeze firmware communicates with the host via NIC memory that is addressable in the host-address space by means of programmed I/O. The firmware controls host-NIC data movement and allows for a number of features important for high-bandwidth TCP/IP. These include:

  • Header/payload separation (handled by the firmware and messaging system), which allows payloads to be moved to and from aligned page frames in host memory. This in turn provides a mechanism for zero-copy optimizations.

  • Large MTUs and scatter/gather DMA. Myrinet operates without requiring a fixed MTU, and scatter/gather DMA lets payload buffers utilize multiple noncontiguous page frames.

  • Adaptive message pipelining, which minimizes latency for large packets while falling back to unpipelined DMA under high bandwidth demand.

Interrupt suppression is also important in minimizing per-packet overhead for smaller MTUs and is implemented on NICs such as the Alteon Gigabit Ethernet NIC, which is capable of amortizing interrupts across multiple packets via adaptive interrupt suppression. Interrupt suppression doesn't provide much benefit for MTUs larger than 16KB and is therefore not used for packet reception in Myrinet.

Low-overhead data movement is critical in conserving CPU, though it's important to note that, because of memory-bandwidth constraints, faster CPUs are not necessarily a panacea for higher data movement. Optimizations to the FreeBSD I/O manipulation routines such as zero-copy sockets are integral to reducing data-movement overhead. Page-remapping techniques eliminate data movement while preserving the copy semantics of the current socket interface. Zero-copy TCP/IP at the socket was implemented following John Dyson's read/write syscall interface for zero-copy I/O. Zero-copy reads map kernel buffer pages into the process address space via uiomoveco, a variant of uiomove. A read from a file instantiates a copy-on-write mapping to a page in the unified buffer cache, while a read from a socket requires no copy-on-write, since the kernel buffer need not be maintained after the read; any physical page frames backing the remapped virtual pages in the user buffer are freed. For a send, copy-on-write is used in case the sending process writes to its send buffer before the send has completed. Copy-on-write mappings are freed when the mbuf is released after transmit. This applies only to anonymous VM pages, as zero-copy transmission of memory backed by mapped files would duplicate the existing sendfile routine authored by David Greenman.

Checksum offloading reduces overhead by moving checksum computation to the NIC hardware. This is available in Myricom's LANai-5 adapter and the Alteon Gigabit Ethernet NIC. The host PCI-DMA engine employs checksum offloading during the DMA transfer to and from host memory and can be done with little modification to the TCP/IP stack. A few complications arise: multiple DMA transfers for single packets require modifications to checksum computations; TCP/UDP checksumming in conjunction with separate IP checksumming requires movement of the checksum calculation below the IP stack (i.e., in the driver or NIC firmware); and the complete packet must be available before checksum computation can occur.

Tests of the Myrinet system were performed on a variety of commercially available desktop hardware, with a focus on TCP bandwidth, CPU utilization, TCP overhead, and UDP latency. The tests showed the importance of techniques to reduce communication overhead for TCP/IP performance on currently available desktop platforms under FreeBSD 4.0. The results of 956 Mb/s using Trapeze and the Myrinet messaging system and 988 Mb/s with the Alteon Gigabit Ethernet card are currently the highest publicly recorded TCP bandwidths, with a DEC Monet (21264 model 500MHz Alpha) achieving these bandwidths at around 20% CPU utilization.

Trapeze is available at: <>. FreeBSD modifications are available in the FreeBSD code base.

Managing Traffic with ALTQ
Kenjiro Cho, Sony Computer Science Laboratories, Inc.

Kenjiro Cho discussed ALTQ, a package for traffic management that includes a framework and numerous queueing disciplines, and also supports diffserv and RSVP. The advantages and disadvantages of different designs, as well as the available technologies for traffic management, were discussed.

Kenjiro noted that traffic management typically boils down to queue management. Many disciplines have been proposed that meet a variety of requirements. Different functional blocks determine the type of queueing available for a router, and while functional blocks can appear at the ingress interface, they exist most commonly on the egress interface. Common functional blocks:

  • Classifiers categorize traffic based on header content, and packet-matching rules are used to determine further processing.

  • Meters measure traffic streams for certain characteristics that are saved as flow state and are available to other functions.

  • Markers set particular values within a header, such as priority, congestion information, or application type.

  • Droppers discard packets in order to limit queue length or for congestion notification.

  • Queues are buffers that store packets; different queues can exist for different types of traffic.

  • Schedulers perform packet-forwarding determination for a given queue.

  • Shapers shape traffic streams by delaying certain packets and may discard packets if insufficient space exists in available buffers.

Different queueing disciplines serve different requirements. The available types of queues are: a standard FIFO; Priority Queueing (PQ); Weighted Fair Queueing (WFQ), which assigns a different queue to every flow; Stochastic Fairness Queueing, an easier-to-implement form of Weighted Fair Queueing; Class-Based Queueing (CBQ), which divides link bandwidth using hierarchically structured classes; and Random Early Detection (RED), a "fair" form of queue management that drops packets with a probability that increases as the buffer fills.
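The RED idea mentioned above fits in a few lines: the drop probability rises linearly with the (averaged) queue length between two thresholds. The thresholds and maximum probability below are illustrative constants, not values from the paper:

```python
# Sketch of RED's drop decision: no drops below min_th, forced
# drops at or above max_th, and a linearly increasing probability
# (up to max_p) in between. Constants are invented for illustration.

def red_drop_probability(avg_qlen, min_th=5, max_th=15, max_p=0.1):
    if avg_qlen < min_th:
        return 0.0
    if avg_qlen >= max_th:
        return 1.0
    return max_p * (avg_qlen - min_th) / (max_th - min_th)
```

Because drops begin early and probabilistically, TCP senders back off before the queue overflows, which is what makes RED "fair" compared with tail-drop FIFO.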

Kenjiro discussed some of the major issues in queueing, including the wide variety of mechanisms available and the fact that many of them cover only specific needs in a queueing environment. Employing multiple types of queueing can also be difficult, because each queue discipline is designed to meet a specific, not necessarily compatible, set of criteria and design goals. The most common uses of traffic management are bandwidth control and congestion control. Additionally, traffic-management needs must be balanced with ease of administration. It's also important to note that queueing delays can have a significant impact on latency, especially in comparison to link-speed delays.

ALTQ is a framework for FreeBSD that allows for numerous queueing disciplines for both research and operational needs. The queueing interface is implemented as a switch to a set of disciplines. The struct ifnet has several fields added to it, such as discipline type, a general state field, a pointer to discipline state, and pointers to enqueue and dequeue functions. Queueing disciplines have a common set of queue operations, and other parts of the kernel employ four basic queue-management operations: enqueue, dequeue, peek, and flush. Drivers can then refer to these structures rather than using the ifqueue structure. This adds flexibility to driver support for queueing mechanisms. Queueing disciplines are controlled by ioctl system calls via character device interfaces in /dev, with each discipline defined as a minor device for the primary character device. ALTQ implements CBQ, WFQ, RED, ECN, and RIO queueing disciplines. CBQ meets many requirements for traffic management, thanks, in part, to its flexibility.
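The switch structure described above (a common four-operation interface, with the interface holding a pointer to whichever discipline is configured) can be sketched as follows. The class names are invented for illustration; this is not ALTQ's C code:

```python
# Sketch of an ALTQ-style discipline switch: every discipline
# implements enqueue/dequeue/peek/flush, and the interface object
# (playing the role of struct ifnet) just points at one of them.
from collections import deque

class FIFO:
    def __init__(self):
        self.q = deque()
    def enqueue(self, pkt):
        self.q.append(pkt)
    def dequeue(self):
        return self.q.popleft() if self.q else None
    def peek(self):
        return self.q[0] if self.q else None
    def flush(self):
        self.q.clear()

class PriorityQueueing:
    def __init__(self, levels=2):
        self.qs = [deque() for _ in range(levels)]
    def enqueue(self, pkt):
        # pkt = (priority, payload); 0 is the highest priority.
        self.qs[pkt[0]].append(pkt)
    def dequeue(self):
        for q in self.qs:           # always drain higher priorities first
            if q:
                return q.popleft()
        return None
    def peek(self):
        for q in self.qs:
            if q:
                return q[0]
        return None
    def flush(self):
        for q in self.qs:
            q.clear()

class Interface:
    """Stand-in for struct ifnet: swap .discipline to change queueing."""
    def __init__(self, discipline):
        self.discipline = discipline

iface = Interface(PriorityQueueing())
```

Because drivers call only the four generic operations, replacing the discipline never requires touching driver code, which is precisely the flexibility the ALTQ switch buys.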

Additionally, Kenjiro mentioned some of the ways Linux lends itself to queueing in comparison to *BSD. Specifically, the number of fields that Linux's sk_buff contains gives it more flexibility than the BSD mbuf structure. The Linux network device layer also adds flexibility by allowing queue-discipline classifiers to access network or transport layer information more readily.

ALTQ for FreeBSD is available at <>.

Opening the Source Repository with Anonymous CVS
Charles D. Cranor, AT&T Labs—Research; Theo de Raadt, The OpenBSD Project

Charles Cranor discussed Anonymous CVS, a source-file distribution mechanism intended to allow open-source software projects to more readily distribute source code and information regarding that code to the Internet community. Anonymous CVS is built on top of CVS, the Concurrent Version System, which provides revision control. Anonymous CVS is currently in use by a number of open-source projects.

Anonymous CVS was initially developed to provide access to an open-source software project for Internet users who did not have write access to the CVS repository. This greatly enhanced the ability of developers and users to access the repository without compromising the security of the repository itself. Anonymous CVS also provides a much better format for distribution of open-source software than previous mechanisms such as Usenet, anonymous FTP, Web, SUP, rsync, or CTM. One of the critical features of anonymous CVS is that it allows access to the metadata for a source repository, e.g., modification times and version information for individual files.

Some of the principal design goals for anonymous CVS were security, efficiency, and convenience. One particularly interesting aspect of its development was a chroot'd anoncvs shell, which limits what a malicious user can do by confining client access to a restricted environment. This environment integrates nicely with the CVS server system and can be reached by standard means such as ssh or rsh.

One major implementation issue for anonymous CVS involved limitations in the CVS file-locking mechanisms. Since a user cannot write to the CVS repository when accessing it anonymously, file locking was disabled for read-only access. Though there were cases where this could lead to some inconsistencies in the files on the CVS servers, the likelihood was very low. Future versions of anonymous CVS may look to provide some type of file-locking mechanism for anonymous access.

New tools based upon CVS have been developed, e.g., the CVS Pserver, CVSWeb, and CVSup. Pserver, distributed with CVS, requires a login for access to the CVS repository. One downside of Pserver is that it does not operate in a chroot'd environment as the anoncvs shell does. CVSWeb provides a GUI interface for browsing the repository and updating a local source tree. Additionally, the CVSup package provides an efficient and flexible mechanism for file distribution based on anonymous CVS. CVSup is capable of multiplexing stream requests between the client and server, and it operates very quickly. It understands RCS files, CVS repositories, and append-only log files, which make up most of the CVS environment. CVSup provides a command-line and GUI interface. Its one major drawback is that it's written in Modula-3.

Session: Business
Summary by Jerry Peek

Open Software in a Commercial Operating System
Wilfredo Sánchez, Apple Computer Inc.

As Apple considered major rewrites to the MacOS after version 7, it faced the fact that writing a new operating system is hard. An OS must be very reliable. But OSes are complex, so new ones will have bugs. Apple acquired NeXT Software and got their expertise in OSes. But Apple's core OS team saw that BSD and Mach, two freely available OSes, had a lot of the features they needed. These tried-and-true OSes have been refined for years. An active developer community is adding features all the time. As a bonus, Apple would get Internet services, Emacs, Perl, CVS, and other useful packages.

So why should Apple bother having its own OS? Why not just give all Apple users a copy of (for example) Linux? One reason is that many Apple customers don't want raw UNIX-type systems; they want the familiar look and feel. So Apple added application toolkits, did hardware integration, and merged the free code into its new Mac OS X.

Apple decided to contribute much of its own work on the open code base, including some of the proprietary code, back to the community. They also have an in-house system to let developers propose that certain new code be made open source. Why? After all, the BSD license doesn't require release of new code. One reason is that by sharing code and staying in sync with the open base, Apple's code wouldn't fall behind and have to track larger and larger differences. Staying in sync also lets Apple take advantage of the better testing and quality feedback that the open base gets on multiple platforms. One surprising side effect of this code sharing is that, as an active open-source project, Apple's source for PowerPC processors still contains unused code for Intel processors.

Business Issues in Free Software Licensing
Don Rosenberg, Stromian Technologies

Don Rosenberg's talk discussed how a commercial software vendor should deal with open source and what those vendors really want. In general, software vendors want to protect their financial investment and to recover that investment. They also want to make a profit: revenues to keep the doors open and the programmers fed. How can companies do this? The talk covered several current models.

The GNU General Public License (GPL) is good for operating systems because OSes are so widely used. In general, there are many more users of a particular OS than of any single application under that OS. Here, vendors can make money by distributing the source code and, possibly, binaries. Red Hat Software is a good example of this model; Don quoted Bob Young, Red Hat's chairman, as saying that Red Hat "gives away its software and sells its sales promotion items."

Scriptics Corporation distributes Tcl/Tk, a freely available language and toolkit with hundreds of thousands of users. Scriptics improves that core material for free while developing Tcl applications for sale. Because users can modify and distribute Tcl and extensions themselves, Scriptics has to work hard to keep them happy if it wants to stay at the center of development. Profits from commercial applications pay for the free software work. Scriptics' Web site also aims to be the principal resource for Tcl and its extensions.

Aladdin's Ghostscript has different free and revenue versions, under different licenses, distributed by different enterprises. Free users are restricted in how they can distribute the product; they get yearly updates from Aladdin. Licensed commercial users, on the other hand, can distribute Ghostscript more freely; they also receive more frequent updates.

More restrictive licenses, such as the Sun Community Source License, are appearing. Sun's license lets users read the source code but requires that any modifications be made by Sun. Rosenberg didn't have a prediction for the success of this kind of license.

Next came a long discussion of the problems with the Troll Tech Qt library and the Q Public License; I'll have to refer you to the paper for all the details. The new QPL improves on the old Qt Free Edition License by allowing distribution of Qt with patches, but it did not change the restrictions that the license puts on the popular Linux desktop KDE, which uses Qt. Troll Tech "wants to control the toolkit . . . makes the product free on Linux in hope of collecting improvements from users, and wants to reserve the Windows and Macintosh platforms for their revenue product."

The Qt license problems were a good example of the trouble with licensing dependencies. Licensing concerns meant that Debian and Red Hat wouldn't distribute KDE or Qt. Movements have sprung up to clone a Qt that can be distributed as freely as Linux. Will Troll Tech survive?

Finally, Don presented a model for licensing layers of an operating system and its applications. The base operating system is most likely to succeed if it's free. Toolkits and extensions, as well as applications that build on them, can be either free or proprietary—but the free side should be carefully separated from the proprietary side to ensure that licensing dependencies don't cause serious problems.

There's an open-source licensing page and more information at <>.

"Magicpoint" Presentation Tool
Jun-ichiro Hagino, KAME Project

A last-minute substitution for another talk featured Magicpoint, a presentation tool similar to the commercial Microsoft PowerPoint software. Magicpoint runs on the X Window System and is distributed under a BSD-style license. The speaker used Magicpoint to give his talk, and the slides, and the idea in general, drew kudos and applause from the crowd.

There were three design goals:

  • It should be possible to prepare a presentation on demand, in five minutes.

  • The display should be screen-size independent.

  • It should look good.

The presentation source is plain text with encoding (the same idea as, say, HTML or troff—though this encoding doesn't resemble those). You can "%include" a style file. There's no built-in editor; you choose your own. As soon as you edit and save a source file, Magicpoint updates the slide on the screen.

You can place inline images; they'll be rescaled relative to the screen size. Fancy backgrounds, such as your company logo, are no problem. Magicpoint handles animated text and pauses. You can invoke xanim and mpegplay. The speaker ran xeyes from one of his slides! It's also possible to invoke UNIX commands interactively and have them appear, together with their results, on the screen.

Magicpoint has better font rendering than X11 (which isn't good at big fonts, he says). It uses any font size; all length metrics are relative to the screen size. Magicpoint uses freetype (for TrueType fonts) and vflib (a vector font renderer for Japanese fonts).

Here are some of the other goodies in this amazing tool:

  • PostScript output generation (for paper printouts)

  • A converter to HTML and LaTeX

  • Remote control by PocketPoint (from <>)

  • A handy automatic presentation timer on the screen, a histogram that lets the presenter keep track of the time a slide has been up

  • An interactive page selector that puts all the page numbers and titles at the bottom of the screen

  • Handling of multiple languages, including Japanese (of course) and other Asian languages; it can even mix several Asian languages in one presentation

Future work includes file conversion to and from PowerPoint, revisiting the rendering engine, an improved syntax (though not too complex), better color handling in a limited-color environment, and better math and table support (right now, these are made by piping eqn or tbl output into groff).

The initial idea, the key concepts, and much of the coding for Magicpoint were by Yoshifumi Nishida, <>. To get the code and more details, see <>.

Session: Systems
Summary by Chris van den Berg

Sendmail Evolution: 8.10 and Beyond
Gregory Neil Shapiro and Eric Allman, Sendmail, Inc.

Gregory Neil Shapiro started out by recounting the history of sendmail and how this led to the formation of the Sendmail Consortium. The overwhelming number of feature requests received by the Sendmail Consortium then led to the formation of Sendmail, Inc. Sendmail, Inc. now has full-time engineers and a formal infrastructure consisting of seven different platforms that are tested on each release.

Shapiro next discussed the driving forces for the evolution of sendmail: changing usage patterns in the form of increased message volume; spam and virus control; new standards such as SMTP authentication, message submissions, and the IETF SMTP Update standard; and finally, friendly competition with other open-source MTAs.

After that, Shapiro talked about the new features slated for sendmail v8.10. These include SMTP authentication, a mail-filter API (which will probably be deferred to v8.11), IPv6 support, and performance improvements such as multiple queues and the use of buffered file I/O to avoid creating temporary files on disk. The buffered file I/O optimizations require the Torek stdio, which allows stdio calls such as fprintf to operate on in-memory buffers rather than on-disk files. The BSD implementations of UNIX, including FreeBSD, OpenBSD, NetBSD, and BSDi, all use the Torek stdio library.

Finally, Shapiro concluded with some directions for the future of sendmail. These include a complete mail-filter API; threading and better memory management, using memory pools instead of forking and exiting to clean up memory; a Windows NT port; and more performance tuning.

The GNOME Desktop Project
Miguel de Icaza, Universidad de México

With great animation, Miguel de Icaza discussed the GNOME desktop project. The goal of the project is to bring new technology to free software systems: a component model, a compound document model, a printing architecture, and GUI development tools. On top of these, the project will then build missing applications such as the desktop, the GNOME workshop—consisting of a spreadsheet, word processor, and presentation programs—and groupware tools like distributed mail, calendaring, and contact management.

The GNOME project is structured to allow volunteers who cannot commit to long-term development to work on small components separately. GNOME makes use of the CORBA framework to tie components together. The GNOME component and document model is called Bonobo (a type of chimpanzee). There are query interfaces to ask whether an operation is supported; this query interface is similar to the OLE2/ActiveX design. CORBA services support mail and printing.

On the graphics side, the GNOME GUI builder is called Glade. It generates C, Ada, and C++ code plus XML-based definitions of the layout. The GNOME canvas is very similar to Tk's canvas, but without doing everything using strings, which de Icaza pointed out as a shortcoming of Tcl/Tk. Finally, he went on to talk about the GNOME printing architecture. It's the PostScript imaging model with anti-aliasing plus alpha channels.

An audience question concerned KDE (K Desktop Environment) support and integration with GNOME. Miguel feels this can be done but needs help from volunteers. Those interested in GNOME are directed to <> and <>.

Meta: A Freely Available Scalable MTA
Assar Westerlund, Swedish Institute of Computer Science; Love Hörnquist-Åstrand, Dept. of Signals, Sensors and Systems, KTH; Johan Danielsson, Center for Parallel Computers, KTH

Meta addresses the problem of building a high-capacity, secure mail hub. The main protocols supported are SMTP and POP. IMAP could be supported, but the authors feel that IMAP is too big to support at this time. Meta is meant to replace the traditional solution consisting of sendmail, the local mail-delivery program mail.local, and popper.

The goals of the Meta MTA are simplicity, efficiency, scalability, security, and little or no configuration. It uses the techniques of SMTP pipelining and omitting fsyncs, and the authors conclude that it's not hard to do better than sendmail simply by omitting the expensive fsync calls.

The spool files are kept in a special POP-wire format to speed up retrievals. Meta servers are clustered: mail is received by any server and fetched by querying all the servers. Simple load-sharing is achieved through multiple DNS A records, though more sophisticated schemes such as load-aware name servers or hardware TCP routers are possible.

As for security, all spool files are owned by the Meta nonprivileged user. Users never access files directly, so they don't need shell accounts. A user database, not /etc/passwd, containing information such as full names, quotas, and spam filters, is kept and replicated on all servers.

The audience asked when Meta would be available; the authors made no promises.

See <> for further information.

Session: Kernel
Summary by Chris van den Berg

Porting Kernel Code to Four BSDs and Linux
Craig Metz, ITT Systems and Sciences Corporation

Craig Metz discussed some of the issues involved in porting the U.S. Naval Research Lab's IPv6 and IPSec distribution to different BSDs and Linux. Both the specifics of the porting process and some general discoveries about porting software to different OSes were presented. The software was ported to FreeBSD, OpenBSD, NetBSD, BSDI's BSD/OS, and Linux.

One general observation: don't port code that shouldn't be ported. Anything can be ported in principle, but architectural dissimilarities between systems, for example, may make porting certain software infeasible. Porting software across the kernel/user-space boundary generally shouldn't be attempted. Most code exists where it does for a reason.

One technique for building portable code involves the use of abstraction. When the operations being performed are substantially similar in nature, abstraction can be a powerful tool for portable code. For example, abstract macros can expand to system-specific code, such as the similar but differing use of malloc under BSD systems and Linux. BSD systems use the function malloc(size, type, wait), whereas Linux employs kmalloc(size, flags). These two forms of malloc were abstracted to OSDEP_MALLOC(size), which expands to the correct macro for each system. Additionally, macros can be preferable to functions in kernel space, because the overhead associated with a function call can be significant when performance and memory use are critical, as they are in kernel space.

For significantly different parts of a system, abstraction is still useful but may depend on large functions, or groups of smaller functions, wrapped in conditionals. When data structures differ greatly across platforms, abstraction can likewise aid portability. Metz discussed struct nbuf, which was developed as an abstract data structure for portable packet buffers. The nbuf incorporates aspects of both the traditional BSD mbuf and Linux's sk_buff. From BSD, the nbuf took small headers and few extraneous fields; from Linux it borrowed packet data that is contiguous in memory, with payload data copied to its final location and headers assembled around it. The design also ensured that converting a system-native buffer to an nbuf is quick in most cases, and that converting such an nbuf back to its native buffer is quick as well. Building a native buffer from an arbitrary nbuf did not have to be fast, since nbufs are never the initial data structure; this helped reduce code complexity. The nbuf contains pointers to sections of the buffer and packet data, a pointer to the encapsulated native buffer, and a few OS-specific fields.

In summary, Metz mentioned that they were able to achieve a significant degree of portability for the IPv6 and IPSec implementations with these techniques and that porting kernel code to multiple systems can be a feasible project.

strlcpy and strlcat—Consistent, Safe, String Copy and Concatenation
Todd C. Miller, University of Colorado, Boulder; Theo de Raadt, The OpenBSD Project

Todd Miller gave a brief presentation on strlcpy and strlcat, which are intended as safe and efficient alternatives to the traditional C string copy and concatenation routines strcpy, strcat, strncpy, and strncat. OpenBSD undertook the project of auditing its source base for potential security holes in 1996, with an emphasis on possible buffer overflows in uses of strcpy, strcat, and sprintf. In many places, strncat and strncpy had been used, but in ways indicating that the API for these functions is easily misunderstood. An alternative set of routines was created that is safe, efficient, and has a more intuitive API.

One common misconception about strncpy is that it NUL-terminates the destination string; this is true only when the length of the source string is less than the size parameter. Another misconception is that strncpy causes no performance degradation compared with strcpy. Miller pointed out that strncpy zero-fills the remaining bytes of the destination, which can cause degradation when the size of the destination buffer greatly exceeds the length of the source string. With strncat, a common mistake is passing an incorrect size parameter, because the space for the NUL should not be counted in it. Additionally, the parameter is the amount of space available rather than the total size of the destination, and this is often computed incorrectly.

strlcpy and strlcat guarantee to NUL-terminate the destination string whenever the size parameter is nonzero. They take the full size of the destination buffer as the size parameter, which can typically be computed easily at compile time. Finally, strlcpy and strlcat do not zero-fill the destination beyond the compulsory NUL termination. Both functions return the total length of the string they tried to create, which makes checking for truncation easy.

strlcpy runs almost as quickly as strcpy, and significantly faster than strncpy for copies into large buffers (i.e., those that would require significant zero-filling by strncpy). This was evident in tests run on different architectures and is, in all likelihood, a by-product not only of less zero-filling but also of the fact that the CPU cache is not needlessly flushed by the zero padding. strlcpy and strlcat are included in OpenBSD and are slated for future versions of Solaris. They are also available at <>.

pk: A POSIX Threads Kernel
Frank W. Miller, Cornfed Systems, Inc.

pk is an operating-system kernel whose primary targets are embedded and realtime applications. It includes documentation with literate programming techniques based on the noweb tool. The noweb tool allows documentation and code to be written concurrently: the noweave utility extracts the documentation portion of a noweb source file, generating LaTeX documentation, and the notangle utility extracts the source-code portion. This tends to force a programmer to document while coding, because the source is mixed in with the documentation.

pk is based on the POSIX threads concurrency model. Pthreads assumes that all threads operate within the same address space, originally intended to be a UNIX process. pk modifies the Pthreads design by adding page-based memory protection in conjunction with the MMU. Additionally, pk does not rely on paging or swapping, since realtime applications can't depend on that model. pk is designed so that threads have direct access to physical memory, with the MMU providing memory-access restrictions rather than separate address spaces. The three types of memory protection provided are inter-thread, kernel-thread, and intra-thread: the first restricts a thread to its own address space; the second restricts thread access to kernel memory to the syscall entry points; and the third allows portions of a thread's code to be marked read-only. Because the design of pk differs from the Pthreads model of a monolithic unprotected address space, some modifications were necessary to account for these differences, such as placing restrictions on certain data structures or routines defined in the API.

Further information about pk is available at <>.

Session: Applications
Summary by Arthur Richardson

Berkeley DB
Michael A. Olson, Keith Bostic, and Margo Seltzer, Sleepycat Software, Inc.

Sleepycat Software is the company responsible for the embedded database system called Berkeley DB. Michael Olson listed a few of the larger applications which use Berkeley DB, including sendmail, some of Netscape's products, and LDAP.

Berkeley DB was first released in 1991. The current version, 2.6, was released early in 1999. The original versions were released prior to the creation of Sleepycat Software, which was formed to provide commercial support for Berkeley DB. Upon the creation of the company, they updated the software to 2.x, adding better concurrency handling and transaction locking.

One of the strengths of the Berkeley DB package is that it runs everywhere: on POSIX-compliant systems as well as on Windows. It's considered the best embedded database software available because of its features, scalability, and performance. It has been used for directory servers, messaging, authentication, security, and as a backing store for Web sites.

The Berkeley DB system is a library that is linked in with other applications. It is very versatile, allowing for use by single or multiple users, single or multiple processes, and single or multiple threads. It has several built-in on-disk storage structures and a full-fledged transaction system, and it is capable of recovering from all kinds of crashes.

In the final part of Olson's presentation, he described how a company such as Sleepycat Software could make money and stay in business by having its main product released under an open-source license. It may be distributed freely with applications whose source is also distributed. To distribute it as a binary within a proprietary application, however, a commercial license is required. Sleepycat Software also sells support, consulting, and training. Most of the company's money comes from the commercial licenses, but it sells a number of support contracts. It has experienced strong growth over the last two to three years. Olson's final comment: Open source can be the basis for a thriving business.

The FreeBSD Ports Collection
Satoshi Asami, The FreeBSD Project

Satoshi Asami has built a system for the distribution of a set of sanctioned applications to FreeBSD systems. The Ports system keeps track of software packages' version numbers, dependencies, and other useful details. With it, installing software onto a FreeBSD system is much easier and better organized.

The Ports system itself maintains a small footprint, currently about 50MB for the collection of over 2,400 ports. It is categorized, using symbolic links, to make it easier to search for applications by either real or virtual categories.

Package dependency is maintained within the Ports system for both compile-time and runtime dependencies. Currently, file checksums are used for maintaining package integrity and as a method of security.

New packages get added into the current Ports collection via the send-pr command. Groups of people, whom Satoshi called Committers, first test the package and then commit the package for addition. The Ports manager, Satoshi, does the actual incorporation of the package into the Ports collection.

The Ports tree changes every day. It supports both FreeBSD-current (4.0) and FreeBSD-stable (3.2). Both the current and stable packages are built three times a week and updated once a week on <>. The Ports system is frozen a few days before release.

Satoshi next spent some time talking about the maintenance of the Ports collection. Under the original Ports system, the packages were built by issuing the commands cd /usr/ports and make package. This method was slow and required a great deal of human intervention. They had problems with incomplete dependency checks, and the system took about three days to compile on dual PII-300s and used more than 10GB of disk space. They now use chroot to isolate each build environment, which provides greater control of package dependencies. They have also added parallel processes to help speed up the build. Currently they use a master system, a dual PII-300 with 128MB of memory and 10 SCSI disks, and eight clients, all K6-3D 350s with 64MB of memory and a single IDE disk. The master does some load balancing and passes the work off to the clients. With this new system, it takes about 16 hours to build the 2,000+ packages, plus an additional four hours in which the cvs update runs and the INDEX is built. All build errors are available at <>.

Current problems include the number of packages in the collection and the size of the system. Having over 2000 packages can sometimes make it hard to find the right port. They are considering some sort of keyword database to help organize things. One size problem is that now they have so many small files that it slows down CVS. Another size problem is that the collection no longer fits on a single CD, and the weekly 2GB+ update to all the FTP mirrors can cause some network stress.

They have two other unsolved problems. There is no built-in method for doing updates. Currently they only allow installing and deleting a single port. There is also more thought going into increasing security of the ports. Satoshi mentioned PGP signatures as a possible solution to that problem.

Multilingual vi Clones: Past, Now and the Future
Jun-ichiro Hagino, KAME Project; and Yoshitaka Tokugawa, WIDE Project

Jun-ichiro has built a multilingual version of vi. His goal was to build a product that could be used throughout the world by being able to support any language format.

He planned to rely on experience he had gained in Asian support. He decided that Unicode was not an option because it doesn't seem to be widely used, and some Chinese, Korean, and Japanese characters are mapped to the same codepoint.

His problems had to do with some of the assumptions that most vi clones had about text encoding. These included the idea that each single byte was a character and that a space was used to separate words. Asian users need multi-byte encoding support.

He set out to build something that would allow switching between various external encoding methods in order to have seamless multilingual support. He also wanted his product to be able to mix character sets in the text and to be able to preserve that information within the saved file—all while still behaving like the standard vi editor.

The first attempt resulted in JElvis. It was based on Elvis with an updated internal encoding method. It was limited to only a few external encoding methods but was a step closer to what he wanted to accomplish. His current product is nvi-m17n. It is based on Keith Bostic's nvi but has better multilingual capabilities.

Nvi-m17n solves problems in most cases, but he's still working on a few things, including word boundary issues and the regex library. He would also like to switch to using widechar and multi-encoding.

Jun-ichiro's key to multilingual programming includes reducing the number of assumptions made by the software.

Session: Kernel II
Summary by Jeffrey Hsu

Improving Application Performance through Swap Compression
R. Cervera, T. Cortes, and Y. Becerra, Universitat Politècnica de Catalunya—Barcelona

Toni Cortes started out by explaining that the motivation for his group's work was not to run super-large applications, but to let laptops run larger applications. The goals were to increase swapping performance without adding more memory and to increase swap space without adding more disk. He went on to describe three novel optimizations in the project: different paths for reads and writes, batched writes, and not splitting pages between buffers. The speed-ups on the benchmarked applications range from 1.2x to 6.5x. In performing the benchmarking, they discovered there is no perfect cache size: large caches take memory away from the application, while too-small caches won't allow the application to run. He then described related work, such as that done by Douglis in 1993, which did not handle reads and writes differently, had limited batch writes, and showed performance gains only for some applications. A question from the audience asked which compression scheme was used. The answer was LZO, but it is easy to change the algorithm if a better one comes along.

New Tricks for an Old Terminal Driver
Eric Fischer, University of Chicago

Eric Fischer started by posing the basic question of how to get the arrow keys to work with different shells and applications. The problem is that some applications support keyboard editing directly and others rely on the operating system. He then discussed the three possible places to implement basic editing facilities: individual applications, the operating system, and a middle layer. He concluded that it's best to add support in the OS. He gave an overview of the line terminal code in the kernel and how it parses line input, looking for editing keys. One complication is that the VT100 and vi-mode bindings require the kernel to keep track of state, since they are multi-key sequences. History, in the form of the up and down keys, is kept in a daemon rather than managed inside the kernel. There are ioctls that an application can use to store lines, read lines, get the contents of the current line, and change the contents of the current line. For compatibility reasons, the OS implementation of the editing keys needs to preserve to the application the illusion that the cursor is always on the right. He does this by removing characters when moving left and placing them back when moving right. He uses VT100 key bindings for the cursor keys. A question was raised during the session about how non-VT100 terminals are supported. The reply was that most modern terminals use ANSI sequences, which are the VT100 ones. It is hard to get access to termcap info inside the kernel; therefore, these sequences are hard-wired.

The Design of the DENTS DNS Server
Todd Lewis, MindSpring Enterprises

Todd Lewis first recounted his frustrations configuring and administering the BIND DNS server and how they motivated him to work on a flexible name server with easy-to-configure graphical admin tools and the flexibility to use multiple back ends, rather than flat files, to store data. Dents was written from scratch in ANSI C. It uses glib, which is like the STL, except for C. Internally, it uses CORBA interfaces to control facilities. Lewis talked about the internal structure of the code and how one can write different adapter drivers to retrieve zone information from many disparate sources, including flat files, RDBMSs, and even scripts. Lewis believes that server restarts should be rare events, unlike in Windows, where configuration changes invariably require a system reboot; yet in UNIX we suffer from the same problem for many of our services. His solution is to use persistent objects, not config files, to store parameters. With a well-defined CORBA IDL interface to the server, configuration changes can be made without a restart. Furthermore, the use of CORBA allows for transactional zone editing. Dents currently uses the ORBit ORB.

Lewis feels that the later revisions of the DNS spec confuse the primary role of DNS—answering queries—with new features such as editing. The new BIND supports dynamic updates, but the changes are not persistent; they are lost when the server shuts down.

The first public release of Dents came in fall 1998. Ongoing work includes control-facility enhancements and more drivers to interface with different zone data stores. Interested developers should see <>.


Last changed: 9 Dec. 1999 jr