

Eleventh Systems Administration Conference (LISA '97)

SAN DIEGO, CA
October 26-31, 1997

KEYNOTE ADDRESS

Generation X in IT
Randy Johnson and Harris Kern, R&H Associates Inc.

Summary by Carolyn M. Hennings

Randy Johnson and Harris Kern spoke about the characteristics of a portion of today's workforce referred to as Generation X and the impact it has on traditional IT departments. The challenge to existing IT departments is identifying the nature of the Generation X workforce, clarifying why these characteristics are potentially an issue, and determining how to manage the situation in the future.

In the early 1990s, industry labelled Generation X (persons born between 1964 and 1978) as "slackers"; however, most are entrepreneurial, like change and diversity, and are technically literate. In contrast, the traditional IT organization was built on control and discipline.

As technology has moved away from a single machine to a networked computing model, the nature of the IT business has changed. The speakers noted that IT departments had historically relinquished control of personal computers and local area networks. IT management has come to the realization that these are essential elements of the success of mission-critical applications. As a result, there must be some control.

Johnson and Kern suggested IT management focus on the following areas:

  • Teamwork. Encourage people to work together and rely on individuals to do their jobs.
  • Communication. Improve communication within the organization and with the customer.
  • Involvement. Rather than direction from above, involve the team in decisions and planning.
  • People. Encourage a "can do, be smart" attitude with some discipline.
  • Process. Institute the minimum and sufficient processes to support the organization.

They suggested that this could be considered "creating Generation Y." These people and relationships will be needed to build successful IT organizations. The IT department must become a true services organization. To accomplish this, the department must win back the responsibility for technology decisions, reculture the staff to support diversity and change, market and sell the services, train staff members, and focus on customer satisfaction.

The department must communicate within the IT organization and with customers. Defining architectures, standards, support agreements, and objectives will make great strides in this area. The definition and support of the infrastructure from the desktop to the network, data center, and operations is an essential step. Defining "production" and what it means to the customer in terms of reliability, availability, and serviceability goes a long way in opening communication and expectations.

System management processes with standards and procedures modified from the mainframe discipline are necessary steps. The speakers cautioned organizations against bureaucracy and suggested focusing on producing only "minimum and sufficient documentation." Implementing deployment methodologies and processes was strongly encouraged, as well as developing tools for automating these processes.

REFEREED PAPERS TRACK

Session: Monitoring

Summaries by Bruce Alan Wynn

Implementing a Generalized Tool for Network Monitoring
Marcus J. Ranum, Kent Landfield, Mike Stolarchuk, Mark Sienkiewicz, Andrew Lambeth, and Eric Wall, Network Flight Recorder Inc.

Most network administrators realize that it is impossible to make a network unbreachable; the key to network security is to make your site more difficult than another so would-be intruders find easier pickings elsewhere.

In this presentation, Ranum further postulated that when a network break-in does occur, the best reaction (after repelling the invader) is to determine how access was gained so you can block that hole in your security. To do this, the author presents us with an architecture and toolkit for building network traffic analysis and event records: the Network Flight Recorder (NFR). The name reflects the similarity of purpose to that of an aircraft's flight recorder, or "black box," which can be analyzed after an event to determine the root cause.

Further, he postulated that information about network traffic over time may be used for trend analysis: identifying approaching bottlenecks as traffic increases, monitoring the use of key applications, and even monitoring the network traffic at peak usage periods in order to plan the best time for network maintenance. Thus, this information would be useful for network managers in planning their future growth.

The NFR monitors a promiscuous packet interface in order to pass visible traffic to an internally programmed decision engine. This engine uses filters, which are written in a high-level filter description language, read into the engine, compiled, and preserved as byte-code instructions for fast execution. Events that pass through the filters are passed to a combination of statistical and logging back-end programs. The output of these back-ends can be represented graphically as histograms or as raw data.
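The flow described above (packets in, filters applied, matches handed to a statistical back end) can be illustrated with a small sketch. This is not NFR's filter language or engine; the packet records, the `make_port_filter` helper, and the per-source counter are all hypothetical stand-ins chosen for illustration.

```python
from collections import Counter

# Hypothetical packet records; a real recorder would read them from a
# promiscuous network interface rather than from a list.
PACKETS = [
    {"src": "10.0.0.1", "dst_port": 80, "size": 512},
    {"src": "10.0.0.2", "dst_port": 23, "size": 64},
    {"src": "10.0.0.1", "dst_port": 23, "size": 70},
    {"src": "10.0.0.3", "dst_port": 80, "size": 1500},
]

def make_port_filter(port):
    """Build a filter once so it can be applied cheaply to every packet."""
    def matches(packet):
        return packet["dst_port"] == port
    return matches

def run_engine(packets, filters, backend):
    """Pass each packet through every filter; hand matches to the back end."""
    for packet in packets:
        for name, matches in filters.items():
            if matches(packet):
                backend(name, packet)

# Statistical back end (assumed): count matching packets per filter and source.
histogram = Counter()

def count_by_source(filter_name, packet):
    histogram[(filter_name, packet["src"])] += 1

run_engine(PACKETS, {"telnet": make_port_filter(23)}, count_by_source)
print(dict(histogram))
```

Keeping the filters separate from the back ends means a new kind of report only needs a new back-end function, not a change to the matching loop.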

Ranum can be reached at <mjr@clark.com>; the complete NFR source code, including documentation, Java class source, decision engine, and space manager, is currently available from <http://www.nfr.net> for noncommercial research use.

Extensible, Scalable Monitoring for Clusters of Computers
Eric Anderson, University of California, Berkeley

The Cluster Administration using Relational Databases (CARD) system is capable of monitoring large clusters of cooperating computers. Using a Java applet as its primary interface, CARD allows users to monitor the cluster through their browser.

CARD monitors system statistics such as CPU utilization, disk usage, and executing processes. These data are stored in a relational database for ease and flexibility of retrieval. This allows new CARD subsystems to access the data without modifying the old subsystems. CARD also includes a Java applet that graphically displays information about the data. This visualization tool utilizes statistical aggregation to display increasing amounts of data without increasing the amount of screen space used. The resulting information loss is reduced by varying shades of the same color to display dispersion.
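One way to picture statistical aggregation with a dispersion indicator is the sketch below. It is an assumption-laden illustration, not CARD's code: the `aggregate` function and the sample values are hypothetical, and the spread is computed as a population standard deviation, which a display could then map to a color shade.

```python
from statistics import mean, pstdev

def aggregate(samples, buckets):
    """Collapse samples into at most `buckets` (mean, spread) pairs."""
    if not samples:
        return []
    size = -(-len(samples) // buckets)  # ceiling division
    summary = []
    for start in range(0, len(samples), size):
        chunk = samples[start:start + size]
        # The mean is the value drawn; the spread could drive the shade.
        summary.append((mean(chunk), pstdev(chunk)))
    return summary

# Hypothetical CPU utilization readings (percent) from eight machines.
cpu_percent = [10, 12, 11, 90, 15, 14, 50, 52]
summary = aggregate(cpu_percent, buckets=4)
for average, spread in summary:
    print(f"mean={average:5.1f}  spread={spread:5.1f}")
```

The second bucket averages a nearly idle machine with a very busy one, so its mean alone is misleading; the large spread is what tells the viewer to look closer.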

Anderson can be reached at <eanders@u98.cs.berkeley.edu>. CARD is available from <http://now.cs.berkeley.edu/Sysadmin/esm/intro.html>.

Monitoring Application Use with License Server Logs
Jon Finke, Rensselaer Polytechnic Institute

Many companies purchase software licenses using their best estimate of the number required. Often, the only time this number changes is when users need additional licenses. A side effect of this is that many companies pay for unused software licenses. In this presentation, Jon Finke described a tool for monitoring the use of licensed software applications by examining license server logs.

This tool evolved from one designed to track workstation usage by monitoring entries in the wtmp files. Because most license servers record similar information (albeit in often radically different formats), the tool was modified to monitor license use.

Information can be displayed in a spreadsheet or as a series of linear graphs. The graphs provide an easy visual estimate of the number of software licenses actually in use at a given point in time, or over a period of time. Analysis of this information can quickly uncover unneeded licenses at your site.

Currently, the tool interfaces with Xess (a commercial spreadsheet available from Applied Information Services), Oracle, and Simon (available from <ftp://ftp.rpi.edu/pub/its-release/simon/README.simon>).
Finke can be contacted at <finkej@rpi.edu>.

Session: The Business of System Administration

Summaries by Brad C. Johnson

Automating 24x7 Support Response to Telephone Requests
Peter Scott, California Institute of Technology

Scott has designed a system called helpline that answers a help desk telephone automatically during nonpeak hours and notifies on-call staff of emergencies within minutes or seconds of a situation being logged in the system (scheduler) database. The system was designed mainly to be cheap and is therefore mostly applicable to sites with low support budgets. It consists of source code written in Perl, the main scheduler information base written in SGML, and two dedicated modems: one for incoming calls (for problem reporting) and one for outgoing calls (for notification).

The rationale for creating helpline is that most other available software capable of providing automated support cost more than $100,000. Several less expensive tools were found, but they did not provide sufficient notification methods (such as voice, pager, and email according to a schedule). Recent entries into this market include a Telamon product called Tel Alert, which requires proprietary hardware, and VoiceGuide from Katalina Technologies, which runs only on Windows. A freeware package called tpage is also available, but it concentrates on pagers, not on voice phones.

The key to the system is a voice-capable modem. When an incoming call is answered by the modem daemon, it presents a standard hierarchical phone menu: a series of prerecorded sound files, each linked to the appropriate menu choice. Independent of the phone menu system is the notifier and scheduler component. When an emergency notification occurs, the scheduler parses a schedule file (written in SGML) to determine who is on call at the time, determines what profile (i.e., action) is appropriate based on the time and situation, and takes the action to contact the designated on-call person. Multiple actions can be associated with an event, and if the primary notification method fails, alternate methods can be invoked.
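The lookup-then-fallback behavior of the scheduler can be sketched roughly as below. helpline itself is written in Perl with an SGML schedule file; this sketch substitutes a plain Python table, and the names, hours, and `send` callback are hypothetical.

```python
# Hypothetical schedule: (start hour, end hour, person, ordered methods).
SCHEDULE = [
    (0, 8, "alice", ["pager", "voice"]),
    (8, 18, "bob", ["email", "pager"]),
    (18, 24, "carol", ["voice", "pager", "email"]),
]

def on_call(hour):
    """Find who is on call at the given hour and how to reach them."""
    for start, end, person, methods in SCHEDULE:
        if start <= hour < end:
            return person, methods
    raise LookupError(f"no one scheduled for hour {hour}")

def notify(hour, send):
    """Try each method in order; `send` returns True when delivery works."""
    person, methods = on_call(hour)
    for method in methods:
        if send(person, method):
            return person, method
    return person, None  # every method failed

# Simulated transport in which the paging service is down.
def fake_send(person, method):
    return method != "pager"

print(notify(3, fake_send))
```

Returning `None` for the method when every attempt fails leaves the caller free to decide what to do next, for example moving on to a secondary contact.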

Unfortunately, this software may not be completed for a long time (if ever) because funding and staff have been assigned to other projects, although the current state of the source code is available for review. Send email with contact information and the reason for your request to <jks@jpl.nasa.gov>. Additionally, in its current state, there are some significant well-known deficiencies such as data synchronization problems (which require specialized modem software), (over)sensitivity to the load of the local host (a host that is assumed to be reliably available), and virtually no available hard copy documentation.

Turning the Corner: Upgrading Yourself from "System Clerk" to "System Advocate"
Tom Limoncelli, Bell Labs, Lucent Technologies

Limoncelli believes that many administrators can be classified in one of two ways: as a system clerk or as a system advocate. A system clerk takes orders, focuses mainly on clerical tasks, and performs many duties manually. A system advocate is focused on making things better, automates redundant tasks, works issues and plans from soup to nuts, and treats users as customers to create respectful, understanding partnerships for resolving problems. The key to job satisfaction, feeling better, and getting better raises is to make the transition from clerk to advocate.

Making a successful transition to system advocate requires converting bad (subservient) habits into good (cooperative) ones, creating spare time for better communication and quality time for planning and research, and automating mundane and repetitive tasks.

Although changing habits is always hard, it's important to concentrate on getting a single success. Follow that with another and another, and over time these experiences will accumulate and become the foundation for good habits.

To find spare time, people need to think outside of the box and be more critical and selective about where their time is spent. Suggestions for regaining time include: stop reading USENET, get off mailing lists, take a time management course, filter mail (and simply delete the messages that you can't get to in a reasonable time, e.g., by the end of the week), and meet with your boss (or key customer) to prioritize your tasks and remove extraneous activities from your workload.

Automating tasks, within the realm of a system administrator, requires competency in programming languages such as Perl, Awk, and MAKE. These are languages that have proven to be robust and provide the functionality that is necessary to automate complex tasks.

Transforming the role of clerk to advocate is hard and requires a change in attitude and working style to improve the quality of work life, provide more value to customers, and create a more professional and rewarding environment. However, the effort required to make this transition is worth it. Simply put, vendors can automate the clerical side of system administration, but no vendor can automate the value of a system advocate.

How to Control and Manage Change in a Commercial Data Center Without Losing Your Mind
Sally J. Howden and Frank B. Northrup, Distributed Computing Consultants Inc.

Howden and Northrup presented a methodology to ensure rigor and control over changes to a customer's computing environment. They (strongly) believe that the vast majority of problems created today are caused by change. When change occurs unsuccessfully, the result can range from lost productivity to financial loss. Change is defined as any action that has the potential to change the environment and must consider the impact from software, hardware, and people. Using the rigorous method that was outlined will lower the overall risk and time spent on problems. They believe that this rigor is required for all changes, not just for significant or complex ones.

There are eight main steps outlined in this methodology: (1) Establish and document a base line for the entire environment. (2) Understand the characteristics of the change. (3) Test the changes in both an informal test and formal preproduction environment. (4) Fully document the change before, during, and after implementation. (5) Review the change with all involved parties before placing it into the production environment. (6) Define a detailed back-out strategy if the change fails in the production environment. (7) Provide training and education for all parties involved in the change. (8) Periodically revisit the roles and responsibilities associated with the change.

The authors were quite firm about testing a change in three physically distinct and separate environments. The first phase includes (unit) testing of the change on the host(s) involved in development. The second phase requires testing in a preproduction environment that, in the best case, is an exact duplicate of the production environment. The third phase is placing the change in the actual production environment.

When pressed on the suitability of using this (heavyweight) process on all changes, the authors stated that the highest priority activities are to fully document change logs and to create thorough work plans. The paper notes, however, that although this process does generate a significant amount of work by the administrators before a given change, it has (over time) been shown to reduce the overall time spent, especially for repeated tasks, when transferring information to other staff, when secondary staff are on duty, and when diagnosing problems.

Session: System Design Perspectives

Summaries by Mark K. Mellis

Developing Interim Systems
Jennifer Caetta, NASA Jet Propulsion Laboratory

Caetta addressed the opportunities presented by building systems in the real world and keeping them running in the face of budgetary challenges.

She discussed the role of interim systems in a computing environment: systems that bridge the gap between today's operational necessities and the upgrades that are due three years from now. She presented the principles behind her system design philosophy, including her extensions to the existing body of work in the area. Supporting the more academic discourse are a number of cogent examples from her work supporting the Radio Science Systems Group at JPL. I especially enjoyed her description of interfacing a legacy stand-alone DSP to a SPARCstation 5 via the DSP's console serial port, which exposed the original programmer's assumption that no one would type more than 1,024 commands at the console without rebooting.

Caetta described points to consider when evaluating potential interim systems projects, leveraging projects to provide options when the promised replacement system is delayed or canceled, and truly creative strategies for financing system development.

A Large Scale Data Warehouse Application Case Study
Dan Pollack, America Online

Pollack described the design and implementation of a greater-than-one-terabyte data warehouse used by his organization for decision support. He addressed such issues as sizing, tuning, backups, performance tradeoffs and day-to-day operations.

He presented in a straightforward manner the problems faced by truly large computing systems: terabytes of disk, gigabytes of RAM, double-digit numbers of CPUs, 50 Mbyte/sec backup rates, all in a single system. America Online has more than nine million customers, and when you keep even a little bit of data on each of them, it adds up fast. When you manipulate that data, it is always computationally expensive.

The bulk of the presentation discussed the design of the mass storage IO subsystem, detailing various RAID configurations, controller contention factors, backup issues, and nearline storage of "dormant" data sets. It was a fascinating examination of how to balance the requirements of data availability, raw throughput, and the state of the art in UNIX computation systems. He also described the compromises made in the system design to allow for manageable system administration. For instance, if AOL strictly followed the database vendor's recommendations, they would have needed to use several hundred file systems to house their data set. By judicious use of very large file systems so as to avoid disk and controller contention, they were able to use a few large (!) file systems and stripe the two gigabyte data files across multiple spindles, thereby preserving both system performance and their own sanity.

Shuse At Two: Multi-Host Account Administration
Henry Spencer, SP Systems

Spencer's presentation described his experiences in implementing and maintaining the Shuse system he first described at LISA '96. He detailed the adaptation of Shuse to support a wholesale ISP business and its further evolution at its original home, Sheridan College, and imparted further software engineering and system design wisdom.

Shuse is a multi-host administration system for managing user accounts in large user communities, into the tens of thousands of users. It uses a centralized architecture. It is written almost entirely in the expect language. (There are only about one hundred lines of C in the system.) Shuse was initially deployed at Sheridan College in 1995.

Perhaps the most significant force acting on Shuse was its adaptation for ISP use. Spencer described the changes needed, such as a distributed account maintenance UI, and reflected that along with exposing Sheridan-specific assumptions, the exercise also revealed unanticipated synergy, with features requested by the ISP being adopted by Sheridan.

A principal area of improvement has been in generalizing useful facilities. Spencer observed in his paper, "Every time we've put effort into cleaning up and generalizing Shuse's innards, we've regretted not doing it sooner. Many things have become easier this way; many of the remaining internal nuisances are concentrated in areas which haven't had such an overhaul lately."

Other improvements have been in eliminating shared knowledge by making data-transmission formats self-describing, and in ripping out "bright ideas" that turned out to be dim and replacing them with simpler approaches. These efforts have paid off handsomely by making later changes easier.

Spencer went on to describe changes in the administrative interfaces of Shuse, and in its error recovery and reporting.

Shuse is still not available to the general public, but Spencer encourages those who might be interested in using Shuse to contact him at <henry@zoo.toronto.edu>.

Spencer's paper is the second in what I hope will become a series on Shuse. As a system designer and implementor myself, I look forward to objective presentations of experiences with computing systems. It's a real treat when I can follow the growth of a system and learn how it has changed in response to real-world pressures and constraints. Often papers describe a system that has just been deployed or is in the process of being deployed; it is rare to see how that system has grown and what the development team has learned from it.

Session: Works in Progress

Summaries by Bruce Alan Wynn

Service Level Monitoring
Jim Trocki, Transmeta Corp.

Many system and network administrators have developed their own simple tools for automating system monitoring. The problem, proposes Jim Trocki, is that these tools often evolve into something unlike the original and in fact are not "designed" at all.

Instead, Jim presents us with mon: a Perl 5 utility, developed on Linux and tested on Solaris. mon attempts to solve 85% of the typical monitoring problems. The authors developed mon based upon these guidelines:

  • Simple works best.
  • Separate testing code from alert generation code.
  • Status must be tracked over time.

The mon tool accepts input from external events and "monitors" (programs that test conditions and return a true/false value). The mon processes then examine these data and decide which should be presented directly to clients and which should trigger an alarm.
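The three guidelines (keep it simple, separate testing from alerting, track status over time) can be illustrated with the sketch below. It is not mon's code or configuration format; the `record` and `should_alert` functions and the two-consecutive-failures rule are assumptions chosen for the example.

```python
from collections import defaultdict

# Status over time: service name -> list of True/False monitor results.
history = defaultdict(list)

def record(service, monitor):
    """Run a monitor (any callable returning true/false) and store the result."""
    ok = bool(monitor())
    history[service].append(ok)
    return ok

def should_alert(service, failures_needed=2):
    """Alert only after several consecutive failures (an assumed policy)."""
    recent = history[service][-failures_needed:]
    return len(recent) == failures_needed and not any(recent)

# Simulated monitor that succeeds once and then fails twice.
results = iter([True, False, False])
alerts = []
for _ in range(3):
    record("web", lambda: next(results))
    alerts.append(should_alert("web"))
print(alerts)
```

Because the test code only records results and the alert decision only reads the history, either side can be replaced without touching the other.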

The authors are currently expanding the functionality of mon to include dependency checking of events, severity escalation, alert acknowledgments via the client, "I'm okay now" events, asynchronous events, a Web interface, and a better name.

The current version of mon is available at <http://consult.ml.org/~trockij/mon>.
Jim Trocki can be reached at <trockij@transmeta.com>.

License Management: LICCNTL - Control License Protected Software Tools Conveniently
Wilfried Gaensheimer, Siemens AG

Gaensheimer presented an overview of a number of tools that can help control and monitor the use of software licenses. The tools can also generate reports of license use over time.

For additional information on these tools, contact Gaensheimer at <wig@HL.Siemens.DE>.

Inventory Control
Todd Williams, MacNeal-Schwendler Corp.

One of the less exciting tasks that system and network administrators are often faced with is that of taking a physical inventory. Typical reasons for this requirement include:

  • Maintenance contract renewal
  • Charge backs for resource use
  • Identifying the type of machinery

Williams began tracking his inventory by including comments in the system's "hosts" files, but quickly outgrew this mechanism when devices appeared that did not have an IP address, and when the amount of information desired made the "hosts" table unwieldy.

Instead, Williams developed a database to track this information. He developed procedures to keep this information up to date as machinery moves in and out of the work site.

For additional information on these software tools for tracking inventory, contact Todd Williams at <todd.williams@macsch.com>.

Values Count
Steve Tylock, Kodak

Although it may initially seem a surprising topic for a technical conference, Tylock reintroduced the basic values of a Fortune 500 company:

  • respect for the dignity of the individual
  • uncompromising integrity
  • trust
  • credibility
  • continuous improvement and personal renewal

Instead of applying these to the company itself, Tylock suggested that system and network administrators could increase their professionalism and efficiency by applying these basic values to their daily work.

For more information on this topic, contact Steve Tylock at <tylock@kodak.com>.

Extending a Problem-Tracking System with PDAs
Dave Barren, Convergent Group

Many system and network administrators use one type of problem-tracking system or another. But because working on the typical system or network problem often means working away from one's desk, administrators must keep track of ticket status independently of the tracking system. When administrators return to their desk, they must "dump" the information into the tracking system, hoping that they don't mis-key data or get interrupted by another crisis.

To help alleviate this problem, Barren suggests using a PDA to track ticket status. Barren has developed a relatively simple program for his Pilot that allows him to download the tickets, work the problems, track ticket status on the Pilot, then return to his desk and upload the changes in ticket status in one easy step.

This allows Barren to work on more tickets before returning to his desk and increases the validity of the tracking system. Barren hopes to encourage more users to implement this plan so that the increased number of Pilots will allow him to upload ticket status information at virtually any desk instead of returning to his own.

For additional information on this concept and the software tools Barren has developed, contact him at <dcbarro@nppd.com>.

Survey of SMTP
Dave Parter, University of Wisconsin

One of the beautiful things about the Simple Mail Transfer Protocol is that it allows people to use any number of transfer agents to deliver electronic mail across the world. The down side is that there is a hodgepodge of versions and "brands" of transfer agents in use, and nobody really knows what is in use these days. Except, perhaps, Dave Parter.

To examine this issue, Parter monitored the incoming mail at his site for a short period of time. For each site that sent mail to his, he tested the SMTP greeting and tried to identify the type and version of the agent. His results:

  • 60%
  • 17%
  • 3%
  • 2%

Parter was able to identify 140 distinct versions of sendmail in use in this small sampling.
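A survey of this kind depends on recognizing the agent from its greeting line. The sketch below shows one possible approach under stated assumptions: it only recognizes banners that name sendmail, the sample greetings are invented, and agents are free to omit or alter what they announce, so results are approximate at best. It is not Parter's method.

```python
import re

# Matches "Sendmail <version>" in a banner; the version ends at ';',
# whitespace, or end of line. Banner formats vary, so this is a heuristic.
SENDMAIL = re.compile(r"\bSendmail\s+(\S+?)(?:;|\s|$)", re.IGNORECASE)

def classify(greeting):
    """Return (agent, version) guessed from an SMTP 220 greeting line."""
    if not greeting.startswith("220"):
        return ("not-a-greeting", None)
    match = SENDMAIL.search(greeting)
    if match:
        return ("sendmail", match.group(1))
    return ("unknown", None)

# Invented sample greetings for illustration only.
greetings = [
    "220 mail.example.com ESMTP Sendmail 8.8.7/8.8.7; Mon, 27 Oct 1997 10:00:00 -0800",
    "220 relay.example.net ESMTP Sendmail 8.8.5; ready",
    "220 mx.example.org ESMTP ready",
]
versions = {v for agent, v in map(classify, greetings) if agent == "sendmail"}
print(len(versions), "distinct sendmail versions seen")
```

Collecting the version strings in a set is what yields a count of distinct versions like the one reported above.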

Where, Parter asks, do we go from here with these data? He isn't sure. If you would like to discuss these findings, or conduct your own survey, contact Parter at <dparter@cs.wisc.edu>.

Session: Net Gains

Summaries by Mark K. Mellis

Creating a Network for Lucent Bell Labs Research South
Tom Limoncelli, Tom Reingold, Ravi Narayan, and Ralph Loura, Bell Labs, Lucent Technologies

This presentation described how, as a result of the split of AT&T Bell Labs Research into AT&T Labs and Lucent Bell Labs, they transitioned from an "organically grown" network consisting of four main user communities and ten main IP nets (out of a total of 40 class C IP nets) to a systematically designed network with two main user communities on four main IP nets, renumbering, rewiring, cleaning up, and "storming the hallways" as they went.

Unlike many projects of this scope, the authors planned the work as a phased transition, using techniques such as running multiple IP networks on the same media and operating the legacy NIS configuration in parallel with the new config to transition slowly to the new configuration, rather than make all the changes during an extended down time and discover a critical error at the end. They relate their experiences in detail, including a comprehensive set of lessons learned about strategy, end-user communications, and morale maintenance. ("Yell a loud chant before you storm the hallways. It psyches you up and makes your users more willing to get out of the way.")

Having been faced with a network unfortunately typical in its complexity, and real-world constraints on system downtime, this group described their thought processes and methodologies for solving one of the problems of our time, corporate reorganization. In the face of obstacles such as not having access to the union-run wiring closets and "The Broken Network Conundrum," where one must decide between fixing things and explaining to the users why they don't work, they divided their networks, fixed the problems, and got a cool T-shirt with a picture of a chainsaw on it, to boot.

Some of the tools constructed for this project are available at <http://www.bell-labs.com/user/tal>.

Pinpointing System Performance Issues
Douglas L. Urner, BSDI

Urner gave us a well-structured presentation that within the context of a case study on Web server performance optimization presents a systematic model for tuning services from the network connection, through the application and operating system, all the way to the hardware. His paper is a vest-pocket text on how to make it go faster, regardless of what "it" might be.

Urner began the paper by describing an overview of system tuning: methodology, architecture, configuration, application tuning, and kernel tuning. He discussed the need to understand the specifics of the problem at hand: protocol performance, application knowledge, and data collection and reduction. He then described tuning at the subsystem level, including file system, network, kernel, and memory. He presented a detailed explanation of disk subsystem performance, then went on to examine CPU performance, kernel tuning, and profiling both application and kernel code.

Urner's paper is about optimizing Web server performance, but it is really about much more. He describes, in detail, how to look at performance optimization in general. He encourages readers to develop their intuition and to establish reasonable bounds on performance. By estimating optimal performance, the system designer can determine which of the many "knobs" in an application environment are worth "turning", and help set reasonable expectations on what can be accomplished through system tuning.

Session: Configuration Management

Summaries by Karl Buck

The first two papers deal with the actual implementations of tools written to handle the specific problems. The third paper is an attempt to get a higher level view of where configuration management is today and make suggestions for improving existing CM models.

Automation of Site Configuration Management
Jon Finke, Rensselaer Polytechnic Institute

Finke presented his implementation of a system that not only tracks interesting physical configuration aspects of UNIX servers, but also stores and displays dependencies between the servers and the services that they provide. The configuration management system has an Oracle engine and outputs data to a Web tree, making for a very extensible, useful tool. For instance, if a license server is to be updated, one can find out not only all the other services that will be affected, but also the severity of those outages and who to contact for those services. Source code is available; see
<ftp://ftp.rpi.edu/pub/its-release/simon/README.simon>
for details.
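The license server example amounts to walking a dependency graph outward from the service being changed. A minimal sketch follows, assuming an in-memory table in place of the Oracle database Finke describes; the service names and the `affected_by` function are hypothetical.

```python
from collections import deque

# Hypothetical map: service -> services that depend on it directly.
DEPENDENTS = {
    "license-server": ["cad-tools", "compiler"],
    "cad-tools": ["design-lab"],
    "compiler": [],
    "design-lab": [],
}

def affected_by(service):
    """Return every service that depends, directly or not, on `service`."""
    seen, queue = set(), deque([service])
    while queue:
        for dependent in DEPENDENTS.get(queue.popleft(), []):
            if dependent not in seen:  # also guards against cycles
                seen.add(dependent)
                queue.append(dependent)
    return sorted(seen)

print(affected_by("license-server"))
```

Severity and contact details, which the paper's system also reports, could be attached to each entry in the table and looked up for every name the walk returns.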

Chaos Out of Order: A Simple, Scalable File Distribution Facility for "Intentionally Heterogeneous" Networks
Alva L. Couch, Tufts University

The core of this paper is a file distribution tool written by Couch called DISTR. Using DISTR, administrators of unrelated networks can use the same file distribution system, yet retain control of their own systems. DISTR can "export" and "import" files to and from systems managed by other people. Frank discussion is given to the existing limitations and potential. DISTR is available at <ftp://ftp.eecs.tufts.edu/pub/distr>.

An Analysis of UNIX System Configuration
Remy Evard, Argonne National Laboratory

This paper is an attempt to step back and take a look at what is available for use in UNIX configuration and file management, examine a few case studies, and make some observations concerning the current configuration process. Finally, Evard argues for a "stronger abstraction" model in systems management, and makes some suggestions on how this can be accomplished.

Session: Mail

Summaries by Mark K. Mellis

Tuning Sendmail for Large Mailing Lists
Rob Kolstad, BSDI

Kolstad delivered a paper that described the efforts to reduce delivery latency in the <inet-access@earth.com> mailing list. This mailing list bursts to up to 400,000 message deliveries per day. As a result of the tuning process, latency was reduced to less than five minutes from previous levels that reached five days.

Kolstad described himself as a member of Optimizers Anonymous, and he shared his obsession with us. He described the process by which he and his team analyzed the problem, gathered data on the specifics, and iterated on solutions. He took us through several rounds of data analysis and experimentation, and illustrated how establishing realistic bounds on performance and pursuing those bounds can lead to insights on the problem at hand.

Kolstad and his team eventually homed in on the approach of increasing the parallelism to the extreme of using hundreds of concurrent sendmail processes to deliver the list. They also reduced timeouts for nonresponsive hosts. This, of course, required the creation of a number of scripts to automate the parallel queue creation. These scripts are available upon request from Kolstad, <kolstad@bsdi.com>.
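The idea of splitting one large delivery into many parallel queues can be sketched as follows. This is not Kolstad's script (those are available from him on request); it is a minimal Python illustration, and grouping recipients by domain before dealing them into queues is an assumption made here so that one slow host ties up only one queue.

```python
from collections import defaultdict


def split_into_queues(recipients, n_queues):
    """Group addresses by domain, then deal the domains round-robin
    into n_queues separate recipient lists."""
    by_domain = defaultdict(list)
    for addr in recipients:
        domain = addr.rsplit("@", 1)[-1].lower()
        by_domain[domain].append(addr)

    queues = [[] for _ in range(n_queues)]
    for i, domain in enumerate(sorted(by_domain)):
        queues[i % n_queues].extend(by_domain[domain])
    return queues


if __name__ == "__main__":
    # Hypothetical subscriber list.
    subscribers = ["a@x.com", "b@y.com", "c@x.com", "d@z.com"]
    for number, queue in enumerate(split_into_queues(subscribers, 2)):
        print(number, queue)
```

Each resulting list would then be handed to its own delivery process; how those processes are started and how their timeouts are set is left out of the sketch.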

Kolstad closed by noting that after the optimizations were made, the biggest remaining problem was unavailability of recipients. He expressed his amazement that in a mailing list dedicated to Internet service providers, some one to three per cent of recipients were unreachable at any point in time. Also, even with these improvements, the mailing list traffic of mostly small messages doesn't tax even a single T-1 to its limits.

Selectively Rejecting SPAM Using Sendmail
Robert Harker, Harker Systems

Harker offered a presentation that addressed one of the hottest topics on the Internet today: unsolicited commercial email, otherwise known as spam. He characterized spam, examined the different requirements for antispam processing at different classes of sites, and offered concrete examples of sendmail configurations that address these diverse needs.

After his initial discussion of the nature of spam, Harker outlined the different criteria that can be used for accepting and rejecting email. His approach differs from others in that he spends sendmail CPU cycles to get finer granularity in the decision to reject a message. He goes on to treat the problem of spammers sending their wares to internal aliases and mailing lists.

The remainder of the presentation was devoted to detailed development of the sendmail rulesets necessary to implement these policies. He discussed the specific rulesets and databases needed, and how to test the results. His discussion and code are available at <http://www.harker.com/sendmail/anti-spam>.

A Better E-mail Bouncer
Richard J. Holland, Rockwell Collins

Holland presented work that was motivated by corporate reorganization: how to handle email address namespace collisions in a constructive way.

As email usage becomes more accessible to a wider spectrum of our society, fewer and fewer email users are able to parse the headers in a bounced message. Holland talked about his bouncer, implemented as a mail delivery agent, which provides a clearly written explanation of what happened and why when an email message bounces due to an address change. This helps the sender understand how to get a message through, helps the recipient get a message, and helps the postmaster by automating another portion of her workload.

The bouncer was originally implemented as a simple filter. Because of the diversity in headers and issues related to envelope vs. header addresses, especially in the case of bcc: addresses, the bouncer was reimplemented as a delivery agent. The bouncer, written in Perl, relinquishes its privilege and runs as "nobody." Many aspects of bouncer operation are configurable, including the text of the explanation to be returned. A particularly nice feature is the ability to send a reminder message to the recipient's new address when mail of bulk, list, or junk precedence is received, reminding them to update their mailing list subscriptions with the new address.
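The precedence check behind the reminder feature might look roughly like the following. Holland's bouncer is written in Perl and its internals are not described here, so this Python fragment is only a guess at the decision logic; the function name and the sample messages are invented.

```python
from email import message_from_string

# Precedence values that the summary says trigger a reminder.
REMINDER_PRECEDENCES = {"bulk", "list", "junk"}


def needs_reminder(raw_message):
    """Return True when the message carries a Precedence header of
    bulk, list, or junk (compared case-insensitively)."""
    msg = message_from_string(raw_message)
    precedence = (msg.get("Precedence") or "").strip().lower()
    return precedence in REMINDER_PRECEDENCES


if __name__ == "__main__":
    # Hypothetical mailing-list message sent to a retired address.
    sample = (
        "From: list-owner@example.org\n"
        "To: old-address@example.com\n"
        "Precedence: Bulk\n"
        "\n"
        "Weekly digest\n"
    )
    print(needs_reminder(sample))
```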

Holland concluded by discussing alternatives to the chosen implementation and future directions. Those interested in obtaining the bouncer should contact Holland at <holland@pobox.com>.

INVITED TALKS TRACK

So Now You Are the Project Manager
William E. Howell, Glaxo Wellcome Inc.

Summary by Bruce Alan Wynn

Many technical experts find themselves gaining responsibility for planning and implementing successively larger projects until one day they realize that they have become a project manager.

In this presentation, Howell offered helpful advice on how you can succeed in this new role without the benefit of formal training in project management.

Howell's first suggestion is to find a mentor, someone who has successfully managed projects for some time. Learn from that mentor not only what the steps are in managing a project, but also the reasons why those are the right steps.

But, as Howell points out, a mentor is not always available. What do you do then? Howell presented a few tips on what you can do if you can't find a mentor.

For copies of the presentation slides, contact Marie Sands at <mms31901@glaxowellcome.com>; please include both your email and postal addresses.

When UNIX Met Air Traffic Control
Jim Reid, RTFM Ltd.

Summary by Mike Wei

Every once in a while we see reports of mishaps of the rapidly aging air traffic control (ATC) system in the United States. We have also seen reports that some developing countries have ATC systems "several generations newer" than the US system. For most of the flying public, the ATC system is something near a total mystery on which our lives depend. As a pilot and a system administrator, I hope I can lift the shroud of mystery a little bit and help explain the ATC system Reid talked about, how UNIX handles such a mission-critical system, and how this system helps air traffic control.

The primary purpose of air traffic control is traffic separation, although it occasionally helps pilots navigate out of trouble. Government aviation authorities publish extensive and comprehensive regulations on how aircraft should operate in the air and on the ground. Air traffic control is a massively complex system of computers, radar, controllers, and pilots that ensures proper traffic separation and flow. Human participants (i.e., controllers and pilots) are as essential as the computer and radar systems.

Naturally, air traffic congestion happens near major airports, called "terminal areas." In busy terminal areas, computer-connected radar systems give controllers a realtime picture of the traffic in the sky. Each aircraft has a device called a transponder that encodes its identity in its radar replies, so the controllers know which aircraft is which on the computer screen. Computer software, along with traffic controllers, ensures proper separation and traffic flow by vectoring planes within the airspace to their destinations.

Outside terminal areas, large planes usually don't fly anywhere they want. They follow certain routes, like highways in the sky. En route traffic control centers control traffic along those routes. Traffic separation is usually ensured by altitude separation or fairly large horizontal separation. Some en route centers have radar to help track the traffic. For areas without radar coverage, en route centers rely on pilot position reports to track the traffic and usually give very large separation margins.

This system worked fairly well for many years, until air travel reached record levels. Two things happened. First, some terminal areas became so congested that, during some parts of the day, the airspace just couldn't hold any more traffic. Second, traffic among some terminal areas reached such a level that the en route airspace between them became almost as congested as the terminal areas.

A new kind of system was developed to address the new problems. This "slot allocation system" tries to predict what the sky will look like in the future, based on the flight plans filed by airlines. Based on the computer prediction, we can allocate "slots" in the sky for a particular flight, from one terminal area to another, including the en route airspace in between. Every airline flight is required to file a flight plan, including departure time, estimated time en route, cruising airspeed, planned route, destination, and alternate destination. With the flight plan, an airplane's position in the sky is fairly predictable.

This slot allocation system is very much like TCP congestion control in computer networking: when the network is congested, the best way to operate is to stop feeding new packets into it for a while. For the same reason it's much better to delay some departures than to let planes take off and wait in the sky if the computer system predicts congestion sometime in the future.
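The "delay on the ground rather than in the air" rule can be illustrated with a toy allocator. This is not how TACT works internally (the talk summary gives no such detail); it is a deliberately simple Python model in which time is divided into numbered slots, one shared piece of airspace has a fixed capacity per slot, and every name is hypothetical.

```python
def allocate_slots(requests, capacity):
    """Assign each flight the earliest slot at or after its requested
    slot that still has room.

    requests: list of (flight, requested_slot) pairs
    capacity: maximum number of flights allowed in any one slot
    Returns a dict mapping flight -> assigned slot.
    """
    load = {}      # slot number -> flights already assigned to it
    assigned = {}
    # Serve flights in order of requested slot; sorted() is stable,
    # so ties keep their filing order.
    for flight, wanted in sorted(requests, key=lambda r: r[1]):
        slot = wanted
        while load.get(slot, 0) >= capacity:
            slot += 1  # hold the flight on the ground for one more slot
        load[slot] = load.get(slot, 0) + 1
        assigned[flight] = slot
    return assigned


if __name__ == "__main__":
    # Hypothetical flight plans: three flights want slot 0, one wants slot 1.
    plans = [("BA1", 0), ("AF2", 0), ("LH3", 0), ("KL4", 1)]
    for flight, slot in allocate_slots(plans, capacity=2).items():
        print(flight, "departs in slot", slot)
```

As with congestion control, the excess demand is absorbed at the source: the third flight asking for slot 0 simply departs one slot later instead of circling.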

The Western European airspace, according to Reid, is the busiest airspace in the world. Instead of a single controlling authority, like the US Federal Aviation Administration, each country has its own aviation authority. Before "Eurocontrol," the agency Reid worked at last year, each country managed its airspace separately, and an airline had to file a flight plan for each country it had to fly over along its route. This led to a chaotic situation when traffic volume increased. According to Reid, there was also a problem of ATC nepotism (i.e., a country favoring its own airlines when congestion occurred).

The Eurocontrol agency has three UNIX-based systems that serve Western Europe. IFPS is a centralized flight plan submission and distribution system, TACT is the realtime slot allocation system, and RPL is the repeat flight plan system.

IFPS provides a single point of contact for all the flight plans in Western Europe. It eliminates the inconvenience of filing multiple flight plans. This is basically a mission-critical data entry/retrieval system.

The TACT system provides slot allocation based on the flight plan information in the IFPS system. It provides slots that satisfy separation standards in the airspace above Western Europe. It controls when an airplane can take off and which slots in the sky it can fly through to its destination. It keeps a "mental picture" of all the air traffic in the sky at every moment for some time into the future.

RPL is the repeat flight plan system. Airlines tend to have the same flights repeatedly, and this system simplifies filing those flight plans. The RPL system is connected with the IFPS system and feeds it with those repeat flight plans.

This must be an awesomely impressive system with equally impressive complexity. According to Reid, it actually works. Since its adoption, the system has never failed. Furthermore, the increase in traffic delay is much less than the increase in traffic volume. Kudos to our European computer professionals!

The slot allocation system does not provide the actual traffic separation. Realtime traffic separation must be based on actual position data obtained from radar or pilot position reports, rather than projected position data based on flight plans. However, this slot allocation system is an invaluable aid to realtime traffic separation because it avoids congestion in the first place.

Using UNIX for such a mission-critical role is quite pioneering for an ATC system. Most ATC systems in the US are still mainframe-based. The system is built on multiprocessor HP T90 servers, and the code is written in Ada.

Like most mission-critical systems, the operation of these UNIX systems has its idiosyncrasies. According to Reid, system operation suffers from organizational and procedural inefficiencies. However, some of them may well be the necessary price to pay for such a mission-critical system. The whole system is highly redundant; almost all equipment has a spare. Maintenance downtime is limited to one hour a month. Change control on the system is the strictest I've ever heard of. For new code releases, there is a test environment fed with real data, and a dedicated test group that does nothing but testing. Any change to the production systems must be documented as a change request and approved by a change board, which meets once a week. Any kind of change, including fixing the sticky bit on /tmp, needs change board approval. Reid said that it took the SAs six weeks to fix the /tmp permissions on six machines because each one needed a change request and only one change a week is allowed on the production system. To minimize the chance of system failure, all nonessential services on the system are turned off, including NFS, NIS, and all other SA life-saving tools. This does add pain to an SA's daily life.

This kind of process sounds bureaucratic, and it's a far cry from a typical UNIX SA's habits. However, for this kind of system, it might be right to be overly conservative. At least when Reid flew to the LISA conference this year, he knew nothing bad would likely happen to Eurocontrol due to a system administrator's mistake.

Enterprise Backup and Recovery: Do You Need a Commercial Utility?
W. Curtis Preston, Collective Technologies

Summary by Bruce Alan Wynn

Nearly every system administrator has been asked to back up filesystems. Even those who haven't have probably been asked to recover a missing file that was inadvertently deleted or corrupted. How can a system administrator determine the best solution for a backup strategy?

In this presentation, Preston presented an overview of standard utilities available on UNIX operating systems: which ones are common, which ones are OS-specific. He then explained the capabilities and limitations of each. In many cases, claims Preston, these "standard" utilities are sufficient for a site's backup needs.

For sites where these tools are insufficient, Preston discussed many of the features available in commercial backup products. Because some features require special hardware, Preston described some of the current tape robots and media available. Once again, he reviewed the capabilities and limitations of each.

Copies of Preston's presentation are available upon request; Preston can be reached at <curtis@colltech.com>.

A Technologist Looks at Management
Steve Johnson, Transmeta Corp.

Summary by Bruce Alan Wynn

Employees often view their management structure as a bad yet necessary thing. Johnson has worked in the technical arena for years, but has also had the opportunity to manage research and development teams in a number of companies. In this presentation, he offered his insight into methods that will smooth the relationship between employees and managers.

Johnson began by postulating that both employees and managers have a picture of what the manager-employee relationship should look like, but it is seldom a shared picture. He further postulated that a great deal of the disconnect is a result of personality and communication styles rather than job title.

Johnson loosely categorized people as either thinkers, feelers, or act-ers. A thinker focuses on analyzing and understanding; a feeler focuses on meeting the needs of others; an act-er focuses on activity and accomplishment.

These differences in values, combined with our tendency to presume others think as we do, cause a breakdown in communication that leads to many of the traditional employee-manager relationship problems.

After making this point, Johnson suggested that technical people who are given the opportunity to move into management first examine closely what the job entails: it's not about power and authority; it's about meeting business needs. He suggested a number of publications for additional information on this topic.

Steve Johnson can be reached at <scj@transmeta.com>.

IPv6 Deployment on the 6bone
Bob Fink, Lawrence Berkeley National Laboratory

Summary by Mike Wei

We all know that IPv6 is the future of the Internet; there's simply no alternative to support the explosive growth of the Internet. However, despite years of talking, we see little IPv6 deployment. According to Fink, the adoption and deployment of IPv6 is currently well under way, and it's heading in the right direction.

An experimental IPv6 network, named 6bone, was created to link up early IPv6 adopters. It also serves as a test bed to gain operational experiences with IPv6. Because most of the machines on the 6bone also run regular IPv4, it provides an environment to gain experience in IPv4 to v6 transition.

The 6bone is truly a global network that links up 29 countries. Most of the long haul links are actually IPv6 traffic tunnelled through the existing IPv4 Internet. This strategy allows 6bone to expand anywhere that has an Internet connection for almost no cost. On the 6bone network, there are some "islands" of network that run IPv6 natively on top of the physical network.

An important milestone was achieved in IPv6 deployment when Cisco, along with other major router companies, committed to IPv6. According to Fink, IPv6 will be supported by routers in the very near future, if it's not already supported. In addition, we will start to see IPv6 support in major OS releases.

A typical host on the 6bone runs two IP stacks, the traditional v4 stack and the IPv6 stack. The IPv6 stack can run natively on top of the MAC layer if the local network supports v6, or it can tunnel through IPv4. The v6 stack will be automatically used if the machine talks to another v6 host. An important component of the 6bone network will be the new DNS that supports IPv6 addresses. The new DNS supports the AAAA record (quad-A record, because a v6 address is four times the length of a v4 address). If a v6 host queries the new DNS server for another v6 host, an AAAA record will be returned. Because the new DNS simply maps a fully qualified domain name to an IP address (v4 or v6), the DNS server itself doesn't have to sit on a v6 network. It will be perfectly normal for a dual-stack v6/v4 host to query a DNS server on the v4 network, get a v6 address, and then talk to the v6 host in IPv6.

The key to the success of IPv6 deployment is smooth transition. The transition should be so smooth that a regular user should never know when IPv6 has arrived. Given the fact that the IPv4 network is so far reaching throughout the world, IPv6 and v4 will coexist for a very long time; the transition to IPv6 from v4 will be gradual. Routers will be the first to have IPv6 capabilities. Just like the 6bone, an IPv6 backbone can be built by tunnelling v6 traffic through the existing v4 network or by running v6 natively on the physical network when two neighboring routers both support v6. Because v6 is just another network layer protocol, it can run side by side with IPv4 on the same physical wire without conflict, just as IP and IPX can run together on the same Ethernet. This means that we do not have to make a choice between v6 and v4; we can simply run both of them during the transition period. IP hosts will gradually become IPv6 capable as new OS versions support it. During the transition, those IPv6 hosts will have dual IP stacks so they can talk to both v4 and v6 hosts. Nobody knows how long this coexistence will last, but it will surely last for years. When the majority of the hosts on the Internet are doing v6, some of the hosts might choose to be v6 only. One by one, the v4 hosts will fade away from the Internet.

Will that ever happen? The answer is yes. In the next decade, IPv4 addresses will be so hard to obtain that IPv6 will be a very viable and attractive choice. We haven't seen that yet, but based on the current Internet growth, it will happen.

The IPv6 addressing scheme is another topic Fink talked about in the seminar. IPv6 has a 128-bit address space, which allows thousands of addresses per square foot if evenly spread on the earth's surface. How to make use of this address space in a highly scalable way is a big challenge. IPv4 suffers from an explosion in the number of routing entries, a problem that arose years before the exhaustion of IPv4 addresses. To address this problem and to allow decades of future expansion, IPv6 uses an aggregation-based addressing scheme.

  3 bits   13 bits   32 bits     16 bits         64 bits
  001      TLA       NLA         SLA             Interface ID
  [------ public topology ----]  site topology   local machine

The best analogy for this aggregation-based addressing is the telephone number system. We have ten-digit phone numbers in the US and Canada, with a three-digit area code, a three-digit exchange code, and the last four digits for individual telephone lines.

The first three bits are 001. In the great tradition of TCP/IP, other combinations are reserved for future use, in case one day we have an interplanetary communication need that requires a different addressing scheme. The 13-bit TLAs are top-level aggregators, designed to be given to long-haul providers and big telcos that run backbone service. The 32-bit NLAs are next-level aggregators for various levels of ISPs. This field can be further subdivided into several levels of NLAs. The 16-bit SLAs are for site topologies. (It's like getting a class A IPv4 address and using 16 bits for the network address.) The machine interface ID is 64 bits.
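The field widths described above (3, 13, 32, 16, and 64 bits) can be checked by pulling an address apart. The following Python sketch uses only the standard ipaddress module; the sample address is arbitrary, and the function and field names are made up for illustration.

```python
import ipaddress

# Field widths as described in the talk, most significant field first.
FIELDS = [("FP", 3), ("TLA", 13), ("NLA", 32), ("SLA", 16), ("InterfaceID", 64)]


def split_aggregatable(addr):
    """Split a 128-bit IPv6 address into the aggregation fields."""
    value = int(ipaddress.IPv6Address(addr))
    fields = {}
    shift = 128
    for name, width in FIELDS:
        shift -= width
        fields[name] = (value >> shift) & ((1 << width) - 1)
    return fields


if __name__ == "__main__":
    parts = split_aggregatable("3ffe:b00:c18:1:290:27ff:fe17:fc0f")
    for name, width in FIELDS:
        print(f"{name:12s} {width:3d} bits  0x{parts[name]:x}")
    # A 13-bit TLA field bounds the number of top-level routes.
    print("maximum TLA values:", 1 << 13)
```

Because the TLA field is 13 bits wide, there can be at most 1 << 13 = 8192 distinct top-level aggregators, whatever the rest of the address contains.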

An important feature of IPv6 is autoconfiguration, in which a host can figure out its own IPv6 address automatically. The 64-bit interface ID is designed so that the host can use its data link layer interface address as the host portion of the IPv6 address. Ethernet uses a 48-bit address, and it seems adequate for globally unique addresses. Reserving 64 bits for the local machine should accommodate any future address method used by future physical networks.
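The summary does not say how a 48-bit Ethernet address is stretched to 64 bits. One common method, taken from the IPv6-over-Ethernet specifications rather than from the talk, is the modified EUI-64 format: insert the bytes ff:fe in the middle of the MAC address and flip the universal/local bit. A Python sketch under that assumption (the function name is invented):

```python
def mac_to_interface_id(mac):
    """Derive a 64-bit interface ID from a 48-bit MAC address using the
    modified EUI-64 rule (an assumption; the talk does not name a method)."""
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected a 48-bit MAC address")
    octets[0] ^= 0x02                      # flip the universal/local bit
    eui64 = octets[:3] + b"\xff\xfe" + octets[3:]
    return int.from_bytes(eui64, "big")


if __name__ == "__main__":
    iid = mac_to_interface_id("00:90:27:17:fc:0f")
    print(f"{iid:016x}")
```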

Aggregation-based addressing is a big departure from current IPv4 addressing. Although IPv4 has three classes of addresses, it's not a hierarchical addressing scheme. In IPv4 (at least before the CIDR days), all network addresses were created equal, which means each could be routed independently to wherever it happened to be located. This caused the routing entry explosion problem when the Internet grew. Classless Inter-Domain Routing (CIDR) was introduced as a stopgap measure to address this urgent problem by introducing some hierarchy in the IPv4 address space. IPv6 is designed from the beginning with a hierarchical scheme. By limiting the number of bits for each aggregator, there is an upper limit to the number of routing entries that a router needs to handle. For example, a router at a long-haul provider needs only to look at the 13-bit TLA portion of the address, limiting the possible number of routing entries to 2^13 (8,192).

Another advantage of a hierarchical addressing system is that address allocation can be delegated in a hierarchical manner. The success of DNS teaches us the important lesson that delegation of address allocation authority is a key to scalability.

There's a price to pay to use a hierarchical addressing system. When a site changes its providers, all the IP addresses need to be changed. We already experience the same kind of issue in IPv4 when we use CIDR address blocks. IPv6 tries to make address changes as painless as possible by having hosts autoconfigure themselves. The host will use its MAC layer address as the lower portion of its IPv6 address and use the Neighbor Discovery protocol to find out the upper portion of the address (routing prefixes). The whole site can be renumbered by simply rebooting all the hosts without any human intervention.
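Renumbering then amounts to keeping the lower 64 bits of each address and swapping in the new routing prefix. The sketch below shows only that arithmetic, in Python with the standard ipaddress module; it does not model how a host learns the prefix, the prefixes are arbitrary examples, and the /64 restriction is an assumption of the sketch.

```python
import ipaddress


def renumber(addr, new_prefix):
    """Return addr with its upper 64 bits replaced by new_prefix.

    new_prefix must be a /64 network in this sketch; the lower 64 bits
    (the interface ID) are kept unchanged.
    """
    net = ipaddress.IPv6Network(new_prefix)
    if net.prefixlen != 64:
        raise ValueError("this sketch assumes a /64 routing prefix")
    interface_id = int(ipaddress.IPv6Address(addr)) & ((1 << 64) - 1)
    return ipaddress.IPv6Address(int(net.network_address) | interface_id)


if __name__ == "__main__":
    old = "3ffe:b00:c18:1:290:27ff:fe17:fc0f"
    print(renumber(old, "2001:db8:0:42::/64"))
```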

There are still lots of problems to be discovered and addressed in IPv6. That's exactly what the 6bone is built for. IPv6 is the future of the Internet, and the transition to IPv6 will start in the near future.

More information on the 6bone can be found at <http://www.6bone.net>.

Joint Session

Panel: Is System Administration a Dead-End Career?
Moderator: Celeste Stokely, Stokely Consulting
Panelists: Ruth Milner, NRAO; Hal Pomeranz, Deer Run Associates; Wendy Nather, Swiss Bank Warburg; Bill Howell, Glaxo Wellcome Inc.

Summary by Carolyn M. Hennings

Ruth Milner opened the discussion by responding to the question with, "It depends." She went on to explain that it is necessary for everyone to define "system administration" and "dead-end career" to answer this question for themselves. In some organizations, "system administration" leaves no room for growth. However, Ruth pointed out that if people enjoy what they do, then maybe it should not be considered a "dead-end."

Hal Pomeranz outlined the typical career progression for system administrators. He described the first three years in the career field as a time of learning while receiving significant direction from more senior administrators. During the third through fifth years of practicing system administration, Hal suggested that even more learning takes place as the individual works with a greater degree of autonomy. Hal observed that people with more than five years of experience are not learning as much as they were, but are more focused on producing results as well as mentoring and directing others. Hal commented that many organizations move these senior people into management positions and wondered how technical tracks might work.

Wendy Nather discussed the question from the angle of recruiting. Those hiring system administrators are looking for people who have dealt with a large number of problems as well as a variety of problems. She pointed out that being a system administrator is a good springboard to other career paths. Wendy outlined some of the characteristics of good system administrators that are beneficial in other career areas: a positive attitude, social skills, open-mindedness, and flexibility.

Bill Howell examined the financial prospects for system administrators. He commented that there will always be a need for system administrators. However, industry may be unable and unwilling to continue to pay high salaries for them, and salary increases may begin to be limited to "cost of living" increases. Bill suggested that growth in personal income and increases in standard of living are the results of career advancement. If salaries do become more restricted in the future, system administration may become a dead-end career.

Celeste then opened up the floor for questions and discussion. One participant asked about other career options if one was not interested in pursuing the managerial or consultant path. The panel suggested that specializing in an area such as network or security administration would be appropriate. Discussion ranged among topics such as motivation for changing positions, how the size of the managed environment affects opportunities and working relationships, the impact of Windows NT on UNIX administrators' careers, how an administrator's relationship with management changes with career advancement, and the importance of promoting system administration as a profession.

BOFs

Summaries by Carolyn M. Hennings

Keeping a Local SAGE Group Active

This BOF at LISA '96 and SANS '97 inspired me to start a group in Chicago. Chigrp's initial meeting was in early October, and I was anxious to announce our existence. General suggestions for getting a group started and keeping one alive were shared by attendees. If you want more information on how to start a group, see <http://www.usenix.org/sage/locals/>.

Documentation Sucks!

As system administrators, we all know how important documentation is, but we hate to write it. This BOF explored some of the reasons we don't like to write documentation, elements of good documentation, and what we can do personally to improve our efforts in this area. About 50 people attended the BOF. Some professional technical writers participated in the BOF and were interested in the approach sys admins were taking in their struggle to write documentation.


First posted: 2nd February 1998 efc
Last changed: 2nd February 1998 efc