The following paper was originally presented at the Ninth System Administration Conference (LISA '95), Monterey, California, September 18-22, 1995. It was published by the USENIX Association in the Conference Proceedings of the Ninth System Administration Conference.

For more information about the USENIX Association contact:
1. Phone: 510 528-8649
2. FAX: 510 548-5738
3. Email: office@usenix.org
4. WWW URL: https://www.usenix.org

Morgan Stanley's Aurora System: Designing a Next Generation Global Production Unix Environment

Xev Gittler, W. Phillip Moore and J. Rambhaskar - Morgan Stanley

ABSTRACT

The challenge: to come up with a distributed systems environment that would allow Morgan Stanley to centrally manage tens of thousands of systems spread out over more than 30 offices on virtually every continent on the globe, in a fully production fashion. The solution: the Aurora System.

History

Morgan Stanley is a global investment banking company. The day-to-day business activity of the firm depends in large part on the stability, reliability and functionality of its technology. The firm trades on most exchanges around the world, and as such, there are very few hours in a week when trading activity is not occurring somewhere on our networks.

Until November 1993, Unix computing activities within Morgan Stanley were divided into two distinct groups. One business unit within Morgan Stanley, the Fixed Income Division, maintained its own computer management staff, called the Fixed Income Research (FIR) group. The Fixed Income Division required significant computing capacity, and, more importantly, recognized and encouraged the growth and use of equipment to its fullest potential. FIR introduced Unix systems in 1987 and by 1993 had approximately 1500 systems, 90% Sun and 10% IBM. FIR had nearly complete authority over all computing related decisions.

In contrast, Information Systems (IS) was responsible for the vast majority of Unix systems throughout Morgan Stanley, reporting to many different business units with different needs and desires. In some instances, IS was allowed to suggest technological solutions to problems and implement those decisions. In many cases, however, it was presented with systems to manage with no initial consultation, and any suggestions, especially those that required any additional expenses, were overridden by the business unit. IS introduced Unix in 1990 and by 1993 had approximately 3500 systems, all Suns. IS and FIR rarely cooperated.

In November 1993, FIR, IS and all other computing groups merged into the Information Technology (IT) department, consolidating all computing related activities under a single organization with significant decision making authority. After the merger, in mid-1994, the top distributed systems engineers from both IS and FIR formed the Core Infrastructure Group (CIG). The mandate given to this group was to create a common distributed systems technology platform for the firm. The CIG had the unique opportunity to dream up the ideal Unix operating environment for our needs and make it happen.
Design Goals

Up until (and through) the merger, our network expanded via the installation of new workstations at a rapid pace. As a result, the system components that we had in place were starting to show signs of strain. For example, our file system distribution was becoming unwieldy, taking too long to complete, and systems that should have been identical were no longer in sync. Additionally, replicated copies of the system data accounted for almost 40% of all used disk space, which was far too much overhead.

The pre-Aurora environment was functional, and under ordinary circumstances we could have made it last a few more years without a major overhaul. However, because the systems would undergo a major upheaval in any event as a result of the merger, we decided that we might as well go whole hog and design the perfect system. In addition to remedying shortcomings of the pre-Aurora systems, our perfect environment needed to address the following issues:

Scale - A major criterion for the environment the CIG developed was its ability to scale. We define scale somewhat differently than most, based on our diverse physical requirements in our various locations. Morgan Stanley is not set up in a traditional campus setting. We have over 5000 systems distributed globally in over 30 locations. Our largest offices have over 1000 systems, while the smallest offices have fewer than five. The bandwidth between offices varies from 56 Kb/s to hundreds of Mb/s. The largest offices have hundreds of support staff, while the smallest have no support staff. Based on this infrastructure, any system that we designed not only had to scale upwards in the traditional sense, but it also had to scale down to support the smallest sites as well. This requirement means it must fit on small systems, use little bandwidth and not depend on many different servers to support the environment.

System Usage - Given the nature of our business, even minutes of down-time are unacceptable. Our systems must be operational 24 hours a day, 7 days a week. Except for a few hours a week, trading is always going on. Additionally, we have a fairly static user environment. Users log on to particular workstations, and typically stay logged on until the system reboots. Running jobs and screen configurations are painful for the users to restart. As a result, system reboots occur only on the order of once every few months. Because our environment must run with zero down time, in designing Aurora we tried to avoid products that only ran in university or research labs. Every part of the environment that we use must either be commercially supported, or we must support it internally. It is important to realize that we were not designing a research project. Project Andrew, for example, which was a research project, took three or four years of determined effort to stabilize after the initial rollout. We did not have this luxury.

Global Usage - Many of our users travel extensively between offices for their work. We designed the system based on the concept that wherever a user went, when she sat down at an available workstation and logged in, it would be her environment. If a trader in New York jumped on a plane to Singapore and logged in, all his files, the programs that he normally ran and the data he normally accessed would appear with no operator intervention and minimal performance degradation.
There actually are regulatory and contractual restrictions on what data and programs users may access, and from where; however, there are mechanisms other than visibility for enforcing these restrictions.

In order to manage a global environment of this size with the small number of people we have in our operations staff, we had to provide a single operating environment worldwide. The sacrifice that we made in doing so was to place restrictions on developers and business units about what they could place into the environment themselves, and what they had to go through the operations staff to do. However, with proper hooks to allow for customization, we felt that this would not be an onerous burden in relation to the economies of scale that this scheme would provide us.

Design Focus

Rather than merely merging the existing FIR and IS systems, the CIG decided to provide a necessary level of interoperability for the short term, and focus on creating a new system from scratch, using the best proven technology available. We used the opportunity to throw out many of the bad features and unused functionality of the old system. We did not provide backwards compatibility by default; we provided it only if it was proven that it was required functionality that was not provided in the new system. We would take the best of the old systems and the best that the marketplace had to offer. For no particular reason, we named the system Aurora.

In designing each aspect of the environment, we always kept in mind the 4 Rs: Redundancy, Reliability, Recoverability and Reproducibility. In addition, before we built something ourselves, we surveyed the market to determine if a product was available that met our needs. Unfortunately, in far too many cases, the products either were not close enough to our needs, or required significant local customization. If a vendor's product provided functionality that was close to meeting our requirements, we attempted to use Morgan Stanley's size, buying power and clout to encourage the vendor to modify the product to meet our needs. Finally, it was important that the system we designed was abstract enough to support multiple hardware architectures, yet still provided an interface specific enough that developers could create both system and user applications that run on all Aurora platforms.

System Design

There are four key components to Aurora:

1. Global Look and Feel (Global Desk)
2. Global File System (AFS)
3. Global Configuration Database (DSDB)
4. Global Homogeneous OS Configuration

Global Desk

Choosing the Window Manager

The change that was most visible to users was the replacement of the window manager on the desktop. Before Aurora, different groups were using different window managers, including olvwm, mwm, twm, tvtwm, fvwm and others. In order to avoid having to modify the user's window manager again for a long time, we peered into the future and decided to go with the Common Desktop Environment (CDE), which appeared to have enough vendor acceptance behind it to win the desktop manager wars. However, CDE, as it was implemented at the time, did not have sufficient functionality for our environment. In particular, it had the following drawbacks:

+ No support for multiple displays. In our information intensive business, our users need to be able to see as much information as possible. As a result, a large percentage of the desktop workstations have between 2 and 4 displays.

+ Inadequate workspace management. CDE did not have a graphical workspace manager.
We have users that actually make use of up to 60 workspaces. Navigating through workspaces is done via toggles that move either one workspace forward or one back. While this may be sufficient with a small number of workspaces, it is not adequate for a large number of workspaces.

In order to address these shortcomings, we partnered with TriTeal, the company mainly responsible for the development of CDE. At our encouragement, they added the following major features to CDE that we felt were required (along with many other minor modifications and bug fixes):

+ Support for multiple displays, including support for multiple control panels
+ A graphical workspace manager, similar to the workspace manager found in olvwm
+ Modified workspace buttons, so that if a user has 60 workspaces, he doesn't need 60 buttons
+ Enhancements to assist non-ICCCM-compliant applications

One of our key requirements with TriTeal was that anything they developed that was not explicitly Morgan Stanley specific would be rolled back into their product. We did not want to be stuck with a consulting special that would cause us to be unable to upgrade to new versions, and we did not want to have to support the product ourselves.

X Window Server

Until recently, we were using X11 compiled from MIT sources, rather than relying on the vendors to provide us with a working version. We did this because the vendors generally lagged behind MIT by a few years, and when they did implement a server, it was slower and buggier than MIT's. However, in the last few years, the vendors' implementations have gotten much better, to the point where they provide significant benefits over the MIT server. In particular, they provide real customer support, as well as support for all devices they sell and for Display PostScript. As such, for Aurora we have decided to use the vendors' native X11R5 servers.

Rollout

Because this was such a visible change, we decided to split the rollout of Global Desk from the rest of Aurora and implement it in advance of the rest of the Aurora changes. In doing so, we believed that when we converted users from their old IS or FIR system to Aurora, they would not even notice a change. We also took advantage of the window system rollout to upgrade all users to a new standard profile system.

Profile System

The profile system was one of the pieces that we reused from the pre-Aurora environment. It allows us to set up users with standard sets of configuration files, and gives us the ability to easily update users' profiles and ensure that their profiles are correct. By giving the users hooks to allow local customizations, we provide commonality and standardization while reducing the incentive for users to try to get around the system.

The profile system works as follows: There are a number of base configuration files, like base.profile, base.vuewmrc, base.envfile, etc. These files are common for all users. In addition, there are model files. These are typically set up by business function. The files are located in a central directory, and are set up via symlinks into the users' home directories. This allows us to make global changes easily, without editing files in each user's home directory. For example, the Governments Trading Desk model might set up paths, environment variables and window system configuration to include all the programs that they use. The Sysadmin model, on the other hand, might have all the etc directories in its path.
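As a rough illustration, the following sketch mimics the symlinking step performed when a user is first set up. The central directory, file names and model handling shown here are hypothetical stand-ins; the real installation tool (described below) does considerably more.

#!/bin/sh
# Hypothetical sketch of the symlinking performed by a profile-installation
# tool; paths and file names are illustrative, not the real ones.
CENTRAL=/ms/dist/aurora/profiles        # assumed central profile directory
MODEL=`cat $HOME/.model`                # e.g., "govt_desk" or "sysadmin"

for f in profile vuewmrc envfile
do
    # Keep whatever the user had, so it can be restored later, then link in
    # the common base file.
    [ -f $HOME/.$f ] && mv $HOME/.$f $HOME/.$f.preaurora
    rm -f $HOME/.$f
    ln -s $CENTRAL/base.$f $HOME/.$f
done

# The base files, in turn, source the per-model files (here assumed to live
# under $CENTRAL/models/$MODEL) and any overrides in the user's own
# $HOME/.custom directory.
echo "profiles installed for model $MODEL"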
In addition to the common files and the model files, there is a .custom directory that may contain optional override files.

There are three commands available: installprofiles, checkprofiles and restoreprofiles. A user is set up with the installprofiles command. This saves all old profile information (which can later be recovered using restoreprofiles), and installs symlinks or copies of files from the central installation directory. The files that are installed contain references to other files in a model directory, and to files in the user's $HOME/.custom directory. The model of the user is determined by the contents of the $HOME/.model file.

In addition to static files that are copied or symlinked from the common location, there are action files, which specify programs to run to populate other files in the user's home directory. For instance, the .printer file, which contains the name of the default printer, is generated by picking a printer based on the location of the user as obtained from the corporate database. However, if that file already exists, the program assumes the user does not want it changed, and leaves it alone.

The checkprofiles command compares the contents of the user's configuration files to the contents of the common configuration files. When a user calls the help desk with a problem, one of the first things the support personnel do is run checkprofiles. If the contents differ, the user can reinstall the standard profiles and check for the problem again. If the problem is gone, we have significantly reduced the problem set. In some instances, we may even tell the user that it is his own problem, and he must fix it himself.

This model provides a powerful resource for easily making global or departmental changes without having to run around and modify files in every home directory.

Global Filesystem Project

Limitations of the Pre-Aurora Environment

Our pre-Aurora distributed filesystem environment is built entirely with NFS. We have gone to great lengths with the freeware automounter, amd, to try to maximize the functionality provided by the underlying NFS technology, but it has, at best, several limitations:

+ Redundancy/Replication. Replication is obtained by using locally developed scripts which use rdist and track to force remote copies of replicated data to remain in sync with a master copy. Redundancy is obtained by configuring amd to mount key filesystems from one of potentially many sources. Regardless, when an NFS mounted filesystem hangs, it hangs. If you are paging off of an NFS mount, there is no transparent failover mechanism to allow use of available backup filesystems - you core dump or hang.

+ Building-wide shared namespace. NFS does not perform well enough to use reliably over low speed WAN links, so practically speaking, a building-wide namespace is the best we can do.

+ Network intensive. NFS caching is minimal, held only in core memory, and offers no consistency guarantees. Heavy use of filesystems over the network results in heavy use of the network.

The above limitations had wide ranging implications on how the pre-Aurora environment was built, such as how many servers were necessary, how many copies of data were required, etc.

Design Goals

The features we wanted to have in the Aurora network filesystem included the following:

+ Better Redundancy and Automated Replication
+ Global access to shared files
+ More efficient use of the network
+ Better Security mechanisms

File System Selection

There are not too many options to choose from to meet the above requirements.
NFS over TCP and CacheFS are solutions which do not meet all of our requirements and are not available on all the platforms we need to support. DFS would appear to be a possibly better choice, but the technology is not yet mature enough for use in a production environment. AFS is currently the only production, supported distributed filesystem technology available that meets a significant portion of our requirements. Some of the features of AFS which are key reasons why it was chosen are:

+ Local Disk Cache
+ Guaranteed Cache Consistency
+ Logical Volume Management
+ Automated Data Replication
+ Transparently Available Redundant Data
+ Superior Performance over WAN links

AFS is not the perfect solution for use as a Global Filesystem, however. There are a number of problems we have encountered, most of which we have been able to work around, and some of which have introduced new constraints on the design of the environment:

+ One Global Cell. The Ubik protocol used by the AFS database servers does not scale well enough to allow us to have one global cell covering more than 30 interconnected offices. This restricts us to having one cell per building.

+ Inter-cell Data Distribution. AFS provides a mechanism for replicating data within a cell, but there is no mechanism for distributing replicated data between different cells. We have had to develop a distribution mechanism from scratch internally.

+ Kerberos Support. We already used Kerberos on our systems. The version that we use, however, was different than the version required by AFS. As a result, we had to provide bridging mechanisms between the two systems.

+ Sparse File Support. There is none. Sparse files in non-replicated volumes work by accident, but sparse files in replicated volumes do not. This precludes use of AFS by a large class of internal data files, which will have to remain in NFS until we have true sparse file support in AFS, or eventually DFS.

+ Byte-Range Locking. File level locking is supported, but not byte-range locking. This prevents many PC-based applications from using AFS for user data files, since use of byte-range locking is so prevalent in that environment.

+ Backup Systems. The AFS backup system is very weak, and third party support for AFS backups was non-existent. Early attempts to restore huge amounts of data were very disappointing. We partnered with another vendor (Boxhill) and had them write a module (vosasm) that interfaces with Legato's Networker to provide volume-level backups. However, the technology is still far from optimal.

+ No Per File Permissions. The file permission semantics change significantly between UFS and AFS. While there are more options for directory permissions, there are no per file permissions.

+ Significant Departure from UFS Semantics. Checking write() for errors is no longer good enough; you have to check close() as well. This change requires coding changes to bulletproof some applications, many of which we do not have control over (third party applications). The behavior of mmap()ed files is also significantly changed, so migration of data and user applications to AFS is not trivial in some cases.
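As a small illustration of the permission-semantics point above, AFS access control is expressed entirely through per-directory ACLs manipulated with the standard fs commands, and individual files simply inherit the directory's ACL; the path and user names below are hypothetical.

% fs listacl /ms/group/eqtrading
Access list for /ms/group/eqtrading is
Normal rights:
  system:administrators rlidwka
  wpm rlidwka

% fs setacl /ms/group/eqtrading jram rlidwk

There is no per-file equivalent of setacl; the UNIX mode bits on a file play only a limited, secondary role.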
Global Virtual Cell

The goal of a single, globally consistent view of a shared filesystem is still possible, even within the limitations of the technology provided by AFS. Although the granularity of individual cells is at the building level, this is hidden in the user's view of the filesystem. Users will almost always access data through our canonical name space, without being aware that some files come from the local cell, while others come from foreign cells. Inter-cell access is provided by mount points for each cell, collected under /ms/.global; see Figure 1. The two character names refer to our locations; for example, tk is Tokyo, ln is London, etc.

% fs lsm /ms/.global/*
`/ms/.global/bk.a' is a mount point for volume `#a.bk.ms.com:ms.cell'
`/ms/.global/ln.a' is a mount point for volume `#a.ln.ms.com:ms.cell'
`/ms/.global/ex.a' is a mount point for volume `#a.ex.ms.com:ms.cell'
`/ms/.global/mg.a' is a mount point for volume `#a.mg.ms.com:ms.cell'
`/ms/.global/tk.a' is a mount point for volume `#a.tk.ms.com:ms.cell'
`/ms/.global/sa.a' is a mount point for volume `#a.sa.ms.com:ms.cell'

Figure 1: Mount points

The pathnames used by users, applications, etc., reference a canonical pathname space. The top level view is shown in Figure 2. The four directories dev, dist, group and user make up the canonical namespace.

% ls -al /ms
total 17
drwxr-x     2 afsadmin 2048 Dec 19  1994 .
drwxr-xr-x 16 root      512 Jul 18 03:08 ..
drwxr-xr-x  2 afsadmin 2048 May 13 00:28 .global
drwxr-xr-x  2 afsadmin 2048 Jan  5  1995 .local
drwxr-xr-x 68 afsadmin 4096 Jul 17 13:06 dev
drwxr-xr-x  2 afsadmin 2048 Jul 10 14:31 dist
drwxr-xr-x  4 afsadmin 2048 Jun 22 18:02 group
drwxr-xr-x 29 afsadmin 2048 Jul  5 10:53 user

Figure 2: Top level view

A user home directory, e.g., /ms/user/w/wpm, is just a pointer into the global namespace to the actual location of the volume for that user; see Figure 3. Thus, wpm's home directory becomes transparently relocatable from cell to cell, without requiring changes to the canonical pathname to his home directory.

% ls -al /ms/user/w/wpm
lrwxr-xr-x 1 afsadmin 27 Feb 10 10:56 /ms/user/w/wpm -> /ms/.global/sa.a/user/w/wpm

Figure 3: Global namespace pointer

The same approach has been taken for dev (development volumes, e.g., source code) and group (shared group volumes), and together these three classes of data cover all the non-replicated RW data we maintain in AFS. The final class of data, distributed replicated data, is available under /ms/dist, which consists of mount points for volumes assumed to be obtained from the local cell. Our internally developed volume distribution system automates the task of actually replicating the data between cells.

Real-time/Batch Auditing

There are no built-in mechanisms for auditing the state of AFS fileservers. In order to support AFS in a production environment we have had to spend a significant amount of time developing software to audit the state of AFS, both in real-time (monitoring AFS server process error logs) and in batch.
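As a flavor of the batch side, a check along the following lines will flag any AFS server process that the bosserver does not report as healthy. bos status is the stock AFS command; the server names and mail alias are hypothetical.

#!/bin/sh
# Hypothetical sketch of a batch audit pass over one cell's AFS servers.
CELL=a.sa.ms.com                              # cell name taken from Figure 1
for server in afsdb1 afsdb2 afsfs1 afsfs2     # illustrative server names
do
    # bos status reports each server process (bnode); filter out the healthy
    # lines and mail anything that remains.
    out=`bos status $server -cell $CELL 2>&1 | \
         grep -v 'currently running normally' | grep -v 'Auxiliary status'`
    [ -n "$out" ] && echo "$out" | mail -s "AFS audit: $server" afs-admins
done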
Volume Management System (VMS)

In a multi-cell environment such as ours, we are required to replicate data between cells. We developed a system to automate the distribution of data, using incremental vos dump/restore technology, and a configuration database to maintain timestamp information and master/slave volume relationships.

Canonical vs. Distributed Volumes

We have introduced the concept of a canonical volume, which is the master source volume for the distributed copies maintained in each cell globally. Changes are made to the canonical volume, and then incrementally dumped and restored to the distributed volumes in each cell. The distribution mechanism works by dumping the backup copy of the canonical volume to a file, and forking multiple vos restore commands in parallel to all the cells.

Incremental Propagation

Incremental propagation is accomplished by managing the timestamps in a database external to the filesystem. The act of updating backup volumes or moving distributed volumes from server to server will invalidate the Last Update timestamps on the volumes themselves. Thus, when a volume is released for distribution, the timestamp from the backup copy of the canonical volume is recorded in the database. Once the distributed volume in a given cell has been successfully updated, this same timestamp is recorded in a separate table in the database for this volume in the given cell. Upon subsequent releases of a given volume, the timestamp for each separate distributed volume is the time from which an incremental dump of the canonical volume must be performed to bring it up to date.

Authenticated Access to Privileged Commands

In our pre-Aurora NFS-based environment, it was easy to delegate permission to distribute and maintain a portion of our distribution tree to non-root users, as rdist requires no special root privileges to distribute files from server to server. AFS, however, requires special privileges to dump and restore volumes. VMS uses Kerberos mutual authentication to determine the identity of the user, and then performs various restricted operations on the user's behalf, using the user identity as the key to look up authorized operations. This mechanism allows development groups to create new AFS volumes for development of a new release of an application without the need for system administrator intervention.

Our goal is to automate all the normal operational procedures for AFS which require special administrative privileges. Most of the processes which are automated have multiple steps and are very error prone when performed by human beings, even experts. VMS automates these steps, reducing operator error and eliminating the need to give junior administrators special privileges.
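The core of the distribution step described above reduces to standard vos commands. A much-simplified sketch follows; the volume name, servers, partition, and the way the last-update timestamp is obtained from the VMS database are all hypothetical.

#!/bin/sh
# Hypothetical sketch of one incremental release of a canonical volume.
VOL=dist.perl5                      # illustrative canonical volume name
SINCE="07/01/1995 00:00"            # per-cell timestamp kept in the VMS database
DUMP=/var/tmp/$VOL.dump
HOMECELL=a.sa.ms.com

# Dump only what changed in the canonical volume's backup copy since $SINCE.
vos dump -id $VOL.backup -time "$SINCE" -file $DUMP -cell $HOMECELL

# Restore the increment into each remote cell in parallel; the fileserver and
# partition in each cell would really come from the configuration database.
for cell in a.ln.ms.com a.tk.ms.com a.ex.ms.com
do
    vos restore fs1.$cell /vicepa $VOL -file $DUMP \
        -overwrite incremental -cell $cell &
done
wait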
Growing Pains

Although the implementation of AFS for our Global File System is well on its way to success, it has not been without its problems.

WAN Issues

Filesystems across WAN links have historically been forbidden because of the ease of inducing heavy use of the WAN and the poor behavior of the filesystem technology (NFS). Use of AFS over the WAN is significantly better, given the behavior of the RX protocol under high retransmission scenarios. But 10 MB of data over a 64 Kb/s link is still a heavy load regardless of the efficiency of the transfer mechanism. We have to be very careful to ensure that access to non-replicated globally available data is minimized, since we have market data critical to our traders flowing over the WAN. Use of technology such as Cisco's custom queueing helps to minimize the impact of the problem, but a loaded WAN link still needs to be avoided if possible. This will prove to be a challenge as production usage of non-replicated data increases. Tools for analyzing WAN access and pinpointing the culprits during a saturation condition have not yet been deployed.

Training and Support

AFS is a radically new technology from the point of view of support personnel. Learning to manage the filesystem, debug and analyze problems, etc., requires understanding a new set of technology and tools. Training has been and continues to be a challenge, as we struggle to get existing administrators up to speed on both the vendor provided technology and the system we have developed internally to implement the global filesystem.

Current Status of the Global Filesystem

We currently have 26 buildings (i.e., potential cells) globally with UNIX hosts of some kind, and by the end of 1996 will have AFS available in all of them. The size of these installations ranges from 3 or 4 hosts to over 2000, and thus the server infrastructure varies from site to site. In the larger cells, we have installed dedicated AFS file and database servers. In medium sized cells, we have dedicated servers performing both file and database server functionality. In smaller sites, existing servers are simply having additional disk space installed for AFS service, and these servers will share AFS duties with other CPU and database functions.

We currently have approximately 20 GB of replicated data in AFS, and over 200 GB of read/write user data (i.e., source code, user home directories, and group directories). We expect the amount of replicated data to grow steadily, but non-replicated data growth could potentially be explosive, with as much as a terabyte on line by the end of the year.

Distributed Systems Database (DSDB)

Design Goals

As anyone who has attempted to manage a large set of systems centrally knows, there is an absolute need for a central repository of configuration information. This includes the traditional user and host information, but also extends to machine information, configuration files, software information and a legion of other information. In addition, the configuration database must provide significant dependency checking, so that, for example, a user cannot be deleted until all groups and mail groups that he owns are gone, any projects that he has responsibility for are reassigned, etc. The Global Configuration Database is part of virtually every aspect of the system.

Before attempting to build a system from scratch we surveyed the marketplace for possible solutions. Unfortunately, most products address environments in which there is only minimal central management, with many departments wanting their own autonomy. As mentioned earlier, we solved this problem using political, rather than technical, means. We felt that allowing multiple administrative domains introduces the possibility, even likelihood, of local configuration changes. Our goal was to maximize homogeneity and minimize local changes; these products are designed to optimize for local customization. We also felt that these products' focus on allowing distributed autonomy meant that the functionality to allow total central administration suffered. Additionally, we wanted whatever system we used to tie in seamlessly to the other databases that were scattered around the company, such as the Human Resources and Inventory databases. Since we had already built databases like this in our pre-Aurora systems, we decided that we had the experience and knowledge to do a better job of building this database ourselves.

The goal of the system that we built, the Distributed Systems Database (DSDB), was to create a configuration database such that, in the event that we lost every building at Morgan Stanley, we could rebuild the entire physical environment with a backup of the DSDB and a lot of money. Note that this covers only system configuration information, not user data.
All system configuration information is stored primarily in DSDB, and only derived on the system itself. Additionally, the database that was designed is a configuration database only; there is no real-time access component to it. For example, the source for the NIS maps is maintained in DSDB, but a separate process dumps the information from DSDB to NIS, and NIS handles the real-time lookup queries.

In the pre-Aurora environment, we had two databases that contained the information in the NIS maps and machine configuration information. However, as we grew, we discovered some major problems with them:

+ They were two separate, non-interacting systems.

+ There was very little consistency checking. The checking that was done was typically only on the input side. When someone deleted a user, there were no checks to see if that user was in other groups, mailgroups, etc. As a result, there was a lot of old cruft lying around.

+ Because of the size of our maps, changes took over an hour to dump from the database and propagate worldwide.

+ They were not designed to scale as much as we were scaling our environment.

+ One of the databases was based on a lightweight, locally developed database engine that was very flexible, but in which complex queries took a long time.

DSDB was designed to replace both these databases and provide significant added functionality. DSDB was designed to:

+ Provide extremely strong consistency checking. You cannot add objects that do not fall into pre-specified criteria (for instance, you cannot add a host entry unless the network already exists) and you cannot delete an object that is referenced by anything else (you cannot delete a host for which there is an fstab entry in the database).

+ Interact with other internal databases. Other organizations had databases that were primary sources of information, such as the Human Resources database and the Inventory database.

+ Ensure information exists in only one place. When users change their name, we no longer have to update it in dozens of places. They submit the appropriate form to the human resources department, and a short time later, the GECOS field in the password file reflects the change.

+ Keep complete historical information. Information in the database is never deleted, it is aged. Using this, we can get a snapshot of our environment at any given point in time, and trace who did what to the environment.

+ Provide incremental propagation. We can request all changes from the database since a given time. This allowed us to implement incremental propagation, which means that we now propagate all NIS changes worldwide within minutes, instead of days.

+ Keep all data world readable (internally). There is no private data in the database. Based on the nature of the data that we were storing, we felt that this allowed us to easily avoid complex security issues.

In addition to the standard NIS information, we keep virtually all configuration information in this database. Some examples:

+ fstab file information
+ processes to start when a machine is rebooted
+ machine configuration information, such as the devices on the system, the names of the mounted file systems, etc.

Primary vs. Secondary Information

There are two types of information kept in DSDB, primary and secondary. Primary information is that information for which DSDB is the primary source, such as the data in the password and host files.
Secondary information is that information for which DSDB is simply a repository for ease of searching and reporting, such as the amount of memory on a machine, the type of the graphics card, or even the user's real name (for which the Human Resources database is the primary source). Primary information is typically entered manually as the user, machine or service is added. Secondary information is usually loaded in an automated, batch fashion. Nightly audits run to collect information from all machines and upload this information to the database. This information allows us to create reports of system characteristics and usage.

There are a number of programs that run to get information out of the database. One such program is the NIS updater, which distributes the NIS maps incrementally. Another is the program that updates the fstab file on a system after it has rebooted.

Architecture

Because the majority of our pre-existing internal databases are based on Sybase, we based DSDB on Sybase as well. All updates to the information are done through stored procedures. These procedures are responsible for the strong consistency checking. The history mechanism that we use provides us with a means to do incremental propagation. For the NIS maps, every NIS server is a master. When a server retrieves information from the database, it stores a timestamp in the YP_LAST_MODIFIED key of the map. The next time it queries the database, it asks for all changes since that timestamp. The Sybase procedures that do this are tuned to make this a very fast operation.
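A rough sketch of that update cycle follows. Reading YP_LAST_MODIFIED with makedbm is standard NIS practice; the Sybase server name, login and stored procedure are hypothetical, and the final merge of the returned changes back into the dbm map is left out for brevity.

#!/bin/sh
# Hypothetical sketch of the incremental NIS update described above.
DOMAIN=`domainname`
MAP=/var/yp/$DOMAIN/passwd.byname

# Every NIS map records the time of its last rebuild under the special
# YP_LAST_MODIFIED key; makedbm -u dumps the map as plain text.
LAST=`makedbm -u $MAP | awk '$1 == "YP_LAST_MODIFIED" { print $2 }'`

# Ask DSDB only for the entries that changed since $LAST.  The stored
# procedure and login are illustrative, not the real ones.
isql -S DSDB -U nis_reader -P "$NIS_PASSWD" <<EOF > /tmp/passwd.delta
exec sp_passwd_changes_since $LAST
go
EOF

# The real updater folds /tmp/passwd.delta into the map with makedbm and
# stamps the new YP_LAST_MODIFIED value; that merge step is omitted here.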
Operating System Configuration

Design Goals

The primary goal of the OS Configuration portion of Aurora is to design an operating system configuration which allows us to easily manage tens of thousands of hosts spread around the globe. The OS configuration is only concerned with system data, not with the data associated with, for example, user home directories, exported NFS partitions, or Sybase data partitions. We are addressing the files that are required to run the core system - those files under root, /usr, /var, etc.

Homogeneous Support for Heterogeneous Hardware

We want to maximize the uniformity of the machine configurations to simplify administration; however, we also require support for multiple hardware platforms, simply as a means of leveraging the right hardware for the right job to meet the varying needs of a wide variety of applications. From a hardware and operating system point of view, the system must support heterogeneous hardware platforms, yet the operating environment must be as uniform as possible across all architectures. Homogeneity is not easy to accomplish, and doing so in a heterogeneous environment is extremely hard. However, we have shown in our previous environments that creating an environment like this reaps large benefits, including economies of scale and the ability to allow users to use the best tool for the job.

Rapid, Automated Installation and Reconfiguration

Hosts should be easily and rapidly reconfigurable. An installation mechanism which takes two hours to dump data from a CDROM, and then requires an administrator to update several locally maintained files, either from a backup system or from memory, is simply not practical when installing hundreds of machines. Furthermore, if a trader's workstation dies, minutes make a real difference. Our users require us to replace user desktops in under 10 minutes. Server OS reconfiguration should be just as fast, not accounting for possible restoration of data such as Sybase partitions.

Installation of systems is often done by people without the root password, such as electricians. When a machine is removed from the vendor's shipping material, the installer should be able to plug it in, turn it on and wait until he gets the login prompt. The only work a system administrator should do is define the Ethernet address in the host configuration database.

Control-Alt-Delete

When there is a problem with a user's desktop, it is not always appropriate to take the time to discover what the problem is. Often, time is of the essence, and the user just wants to get back on-line as quickly as possible. We allow users to do the Unix equivalent of Control-Alt-Delete. If there is a configuration or disk problem, the user on a Sun, for example, simply types L1-A and then boot net. The machine is completely scrubbed, reinstalled and rebooted in under 10 minutes, with no intervention by an administrator and no root access required. In fact, in Aurora it takes less time to reinstall a system than it does to fsck its root partition.

In actual practice we discourage use of this feature by end users. The system administrator needs to make the decision that the problem requires a reboot, and then decide whether discovering the root cause of the problem is more important than getting the machine back up again quickly.

Support for Machine Models

All customization of machine configuration will be handled via the configuration database, by associating the customizations for a particular type of machine with a model. For example, a desktop model would be used for most user workstations with graphical displays attached to them. However, the model for a generic headless CPU server in a comm room may differ only slightly from desktop; the former will perhaps support use of mirrored or striped filesystems, something we don't currently support on the desktop. A sybase model may differ from the generic server model by configuring special directories in /var and installing a special version of the kernel (or configuring a special loadable kernel module for use). An afsfileserver model will have a much larger number of locally copied replicated files, in order to make the machine as independent as possible of the primary service it is providing.

Upgrade to New Technologies

We used SunOS 4.1.3 and AIX 3.2.5 in the pre-Aurora environments, and have taken these operating systems to the limit in many ways. The vendors have begun putting more effort into their new operating systems, and the most modern hardware is only supported on the most recent versions of the operating systems. We chose to implement the OS configuration portion of Aurora entirely on Solaris 2.x and AIX 4.x in order to take advantage of the advanced features available in these OS releases.

Building from Scratch

We could have saved ourselves some time by making use of the binary compatibility modes available when upgrading the operating systems. We made a conscious choice not to do this. Rather, every single program, binary and script was carefully examined, and its existence in Aurora had to be justified. This allowed us to bring forward only those applications, scripts, and methodologies which are required in the new environment, leaving behind a lot of historical infrastructure which is no longer necessary or simply outdated.

Dataless AFS Client Design

Many of the above design goals are met by implementing a dataless AFS client.
In this design, local copies of replicated files (e.g., /usr/vice/etc/afsd) are present only in order to bring up afsd; from that point on, everything is accessed from AFS. Every file kept local to the machine has to justify its existence based on this criterion. Most of the traditional pathnames for entire directories and most configuration files are merely symlinks into AFS. Prior to starting afsd, /ms is a directory which contains symlinks to /localfs, where the real copies of locally maintained files reside. Once afsd is started, the AFS namespace overlays this, and files are no longer accessed from /localfs. Minimizing the contents of /localfs is one of the keys to minimizing the installation and reconfiguration time of a host. The amount of local data in the current Solaris 2.4 model is less than 20 MB.

At boot time, all local files are checked against correct copies of the files in a central AFS repository. If files differ, new files are copied to the local disk, and a reboot occurs. This allows us to always keep local files up to date.

One of the problems we had to solve with the AFS model was AFS cache initialization time. We use a fixed number of files for afsd start-up, and the creation of 400+ files in a large cache normally takes as much as 30 minutes to complete. This time is reduced to less than a minute by enabling asynchronous I/O during the afsd start-up.

Summary

As of this writing, the environment has reached the following milestones:

AFS - We have rolled out AFS to both the old and new environments. AFS is currently available on about 50% of our machines worldwide. By 12/95, AFS will be available on all systems.

Global Desk - The implementation of Global Desk is complete. Currently, Global Desk is rolled out to 25% of our systems. We expect 100% installation by Q1 96. The rollout of Global Desk is slowed because each user must be changed over and converted to the profile system.

DSDB - DSDB is running for Aurora and in parallel on our old systems, supplying all NIS information. It will be rolled out to all systems by 12/95. New functionality is being continuously added.

OS Configuration - There are currently about 50 pure Aurora desktop systems, based on Solaris 2.x. Work to support AIX 4.x is underway. Additionally, we have about a dozen production servers running under Aurora. We expect desktop rollout of the Aurora OS Configuration by Q1 96.

Acknowledgments

The following people were involved in the initial design of the Aurora System: David Birnbaum, Marc Donner, Chris Edmonds, Xev Gittler, Douglas P. Kingston, W. Phillip Moore, David Nochlin and J. Rambhaskar. Richard Campbell, Bruce Howard and Mike Lewis were also involved in the implementation. Finally, thanks to Ben Fried, David Nochlin, Paul O'Donnell and Rebecca Schore, who all proofread the paper and made valuable comments.

Author Information

Xev Gittler has been a Unix Systems Administrator/Architect for over 10 years. He received his B.A. in Medieval and Renaissance Studies in 1987. He started administering systems in universities, moved to R&D labs and finally to Wall Street, along the way managing sites ranging from 3 to 5,000 machines. He co-founded and runs the New York System Administrators Group (NYSA), a local-area group affiliated with SAGE. He has no outside interests other than his wife and his child. He can be reached at xev@morgan.com.
W. Phillip Moore received a B.S. in both Physics and Math from the University of Oregon in 1986, and then entered the Ph.D. program in Physics at The Ohio State University, from where he escaped with an M.S. in 1988, after publicly confessing he was in reality an engineer. He then fled to Osaka, Japan, where he spent 4 years as a UNIX/Network Systems Administrator for Matsushita Electric Works. In 1992, he joined Morgan Stanley Japan in Tokyo, worked in the Distributed Systems group, and became a founding member of the Core Infrastructure Group. Phil now works in Morgan Stanley's New York office and heads the Global Filesystem Project. He can be reached at wpm@morgan.com.

J. Rambhaskar received his B.E. degree in Mechanical Engineering from Shivaji University, India in 1990. During the summer of 1991, he joined the Systems Group, College of Engineering, Ohio University as a System Administrator. He joined Morgan Stanley as a System Administrator in October 1993, and then joined Morgan Stanley's Core Infrastructure Group in October 1994. Since then he has been involved in the OS Configuration and Global Filesystem projects. He can be reached at jram@morgan.com.

References

James Gettys, Project Athena, pp. 72-77, USENIX Conference Proceedings, 1984.

John H. Howard, An Overview of the Andrew File System, pp. 23-26, USENIX Conference Proceedings, 1988.

Michael Leon Kazar, Synchronization and Caching Issues in the Andrew File System, pp. 27-36, USENIX Conference Proceedings, 1988.

Dan E. Geer, Jr., Lessons Learned from Project Athena, pp. 221-247, in Distributed Computing: Implementation and Management Strategy, edited by Raman Khanna, PTR Prentice Hall, 1994.

John Leong, Project Andrew, pp. 203-220, in Distributed Computing: Implementation and Management Strategy, edited by Raman Khanna, PTR Prentice Hall, 1994.

Appendix - Details of OS Configuration Design

This section provides the details of how we designed the OS configuration for Solaris 2.x. In particular, it explains the installation process, the run time file system layout, how we update local system files, and how the start-up scripts work.

Installations

The Solaris 2.x JumpStart installation procedure is more manageable than the SunOS 4.x suninstall; it provides pre-configuration information, etc. However, JumpStart does not provide complete automatic configuration. In addition, the overhead on the installation server is high, requiring 200+ MB of UFS disk space, and it does not support client builds across a router.

The design requirements for client installation specified that the installation time should be under seven minutes, the effort to install a client should be the absolute minimum, and the resources required on the installation server should be minimal. Our goal was to allow someone to simply plug the hardware into a power source and the network, and turn it on.

The steps involved in the base Solaris 2.x JumpStart (netinstall) are very simple:

1. The client sends a RARP request, gets its IP information, and then tftpboots inetboot.
2. With the IP information it then contacts the bootparams server for root file system information, etc. This root file system entry in bootparams is KVM specific.
3. It NFS mounts root and loads up the kernel.
4. The kernel remounts root by contacting bootparams again and starts up init.

To achieve our design requirements, we did the following.
1. Fixed rarpd to dynamically assign IP information if no information is available for that host.
2. Fixed inetboot to autodetect KVM values so that architecture information is not required in bootparams.
3. Patched the NFS kernel module to detect KVM values so that it can do the right thing when it queries bootparams to remount root.
4. Truncated the /export/install (NFS install tree) to a very small distribution, approximately 20 MB per architecture.
5. Separated the installation server functionality into 2 parts:
   a. local network services such as rarp, tftp and bootparams
   b. the NFS root filesystem.
6. Dropped pkgadd for installing client packages, as it was too time consuming. Instead, we created a prototype of the root partition and we use cpio to install it on the new filesystem.

Using this method, a single installation server can install an arbitrary number of clients. The only limitation is the bandwidth between the client and server, and the number of simultaneous installs. The actual installation process is as follows:

1. The machine is powered up, and a network boot is started.
2. A rarp request is sent out to the network.
3. The dynamic rarpd provides a hostname and IP address.
4. The machine then tftps the inetboot kernel and runs it.
5. inetboot detects the KVM type and NFS mounts the proper NFS root associated with the KVM type.
6. The machine loads /kernel/unix and all the required kernel modules and starts up init.
7. init runs /sbin/rcS and /sbin/rc2.
8. /sbin/rc2 starts up AFS and basic NIS services.
9. After the basic services are started up, the model script is run, which formats the boot disk (sd0), sets up the AFS cache and uses cpio to copy files to the local disk. The machine then reboots from the local disk.

Figure 4 shows a typical bootparam entry. wsmodel is the model name for a machine. cellname is the AFS cell name to use for installations; it need not be the local cell. sun4m and sun4c are the NFS roots, depending on the KVM.

exmktaaa  root=saaa2:/export/fid/bootnet/sparc.Solaris_2.4
          sun4c=saaa2:/export/fid/bootnet/sparc.Solaris_2.4.sun4c
          sun4m=saaa2:/export/fid/bootnet/sparc.Solaris_2.4.sun4m
          wsmodel=localhost:aurora_install
          cellname=localhost:a.sa.ms.com

Figure 4: Typical bootparam entry

Boottime and Runtime Layout of the File System

The contents of the root file system on an Aurora workstation are minimal; it only contains files that are required to bring up AFS. The remainder of the contents of the root file system are symlinks back into /ms (which is our AFS mount point). Since /ms is not available before starting AFS, we have a UFS /ms which has symlinks pointing to /localfs.

% ls -l /usr
lrwxrwxrwx 1 root other 19 May 24 11:47 /usr -> ./ms/dist/sunos.5.4

Figure 5: Example symlink during startup

% ls -al /ms
total 10
drwxr-xr-x  2 root     other 2048 Jun  5 21:41 .
drwxrwxr-x 20 afsadmin sux   2048 Jul 27 15:53 ..
lrwxr-xr-x  1 root     other   23 May  9 14:21 dist -> ../localfs/root/ms/dist

Figure 6: Contents of /ms before AFS is running

% ls -l /ms
total 34
drwxr-xr-x  2 afsadmin root 2048 Dec 19  1994 .
drwxr-xr-x 23 root     root 1024 Jul 26 17:18 ..
drwxr-xr-x  2 afsadmin root 2048 May 13 00:28 .global
drwxr-xr-x  2 afsadmin root 2048 Jan  5  1995 .local
drwxr-xr-x 67 afsadmin root 4096 Jul 25 18:44 dev
drwxr-xr-x  2 afsadmin root 2048 Jul 28 14:30 dist
drwxr-xr-x  4 afsadmin root 2048 Jun 22 18:02 group
drwxr-xr-x 29 afsadmin root 2048 Jul  5 10:53 user

Figure 7: Contents of /ms after AFS is running
When AFS starts up, these symlinks are hidden from the system; see Figure 5 for an example. Figure 6 illustrates the contents of /ms before AFS is running; /localfs contains files that will be overlaid when AFS is started up. Figure 7 shows the contents of /ms after AFS is running.

Updating Local File Systems

A master copy of all files that are required on the local disk at boot time is kept in an accessible AFS directory. Every time a machine reboots, a script is run to check whether any files in the repository differ from the files on the local disk, whether new files have been added, or whether old files have been removed. In addition to the common repository, there is also a provision for overriding files on particular machines. The update script takes about thirty seconds to run and makes sure that all files in UFS are identical to the master copies. If any changes are applied to the UFS as a result of this script, the machine is rebooted. This allows us to update kernels and other locally required files and ensure that they are always up to date.

Start-up Scripts (/etc/rc*)

Start-up scripts needed before AFS is running are maintained in UFS. The rest of the start-up scripts are symlinked back into an AFS directory. For example, the SunOS default /sbin/rc2 executes scripts from /etc/rc2.d. In Aurora, /sbin/rc2 runs scripts from /etc/rc2.preafs.d and then from /etc/rc2.d, where /etc/rc2.d is a symlink back into AFS. This means maintaining scripts across all machines becomes trivial.

Machine-specific start scripts are run via start, a locally written script which reads a central file and evaluates whether a particular process should be started on a particular machine. By using this mechanism, we avoid the need for different rc scripts on different machines.
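To make the idea concrete, a wrapper of this kind might look roughly like the following; the configuration file location, its format and the service names are hypothetical, since the real start script is internal to Morgan Stanley.

#!/bin/sh
# Hypothetical sketch of a "start"-style wrapper.
# Each line of the central file names a service, a hostname pattern and the
# command to run, e.g.:
#     sybase    dbserv*     /etc/init.d/sybase start
CONF=/ms/dist/aurora/etc/start.conf     # illustrative central config path
HOST=`hostname`
SERVICE=$1

while read svc pattern cmd
do
    # Skip blank lines and comments.
    case $svc in ''|\#*) continue ;; esac
    [ "$svc" = "$SERVICE" ] || continue
    # Start the service only if this host matches the pattern for it.
    case $HOST in
        $pattern) echo "start: starting $SERVICE on $HOST"; eval "$cmd" ;;
    esac
done < $CONF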