The following paper was originally presented at the Ninth System Administration Conference (LISA '95), Monterey, California, September 18-22, 1995, and was published by the USENIX Association in the Conference Proceedings of the Ninth System Administration Conference. For more information about the USENIX Association: Phone: 510 528-8649; FAX: 510 548-5738; Email: office@usenix.org; WWW: https://www.usenix.org

Metrics for Management

Christine Hogan - Synopsys, Inc.

ABSTRACT

We were recently asked by management to produce an interactive metric. This metric should somehow measure the performance of the machine as the user perceives it, or the interactive response time of a machine. The metric could then be used to identify unusual behavior, or machines with potential performance problems, in our network. This paper first describes how we set about trying to pin down such an intangible quality of the system and how we produced graphs that satisfied management requirements. We also discuss how further use can be made of the metric results to provide data for dealing with user-reported interactive response problems. Finally, we relate why this metric is not the tool for analyzing system performance that it may superficially appear to be.

Introduction

Metrics are being used to produce charts and graphs of many areas of our work. For instance, at Synopsys, the number of calls opened and closed in a given day or month are charted, as are the number resolved within a given time period. There are also metrics that monitor more specific areas of our work, such as new hire installs, and more general things, such as customer satisfaction ratings. More metrics are constantly being devised by management, and put into place by the systems staff.

One such metric was the ``interactive metric'', which was intended to measure the interactive response time of a machine and highlight any problems that might require further investigation. While it is possible to monitor a number of different components of the system that could influence system response time [1, 2, 3], there are currently no tools available to monitor the performance of the system as the user perceives it. We were asked to develop such a tool. The interactive metric was meant to imitate a user as closely as possible, and measure how long it took to perform a ``typical'' action.

In this paper, we first provide some background information on the particular installation in which this metric was to be installed and describe some of the issues that arose in its design. Then we present the actual design of the metric. Thereafter we focus on the issues that arose in interpreting and presenting the data that the metric gathered. These issues, in fact, were the key ones when it came to presenting the results to management and determining to what extent we, as system administrators, could find them useful. The paper concludes with a discussion of the limitations of the metric, in particular from a system administrator's perspective, and describes some possible extensions that could enhance its current utility.
Background

In this section we present a description of the site at which the metric is running. Also included are some tables that relate machine names to architectures, fileservers, subnets and the role of the machine. This data provides some insight into the graphs produced later in the paper. The motivation and intentions behind the development of the metric are also briefly mentioned.

The Site

The site at which this metric was developed is in Mountain View, California, in the head office of a company with a number of sales offices throughout the US, Europe and Asia. Most of these sales offices are on the company WAN. The Mountain View campus network has a services backbone, to which the shared servers are connected [4]. The servers are often dual-homed, with the other interface being on a departmental subnet. Each department has its own subnet, with a number of fileservers and compute servers, as well as desktop machines. Desktop machines are a mixture of X-terminals and workstations. The workstations generally have local swap, remote root, and shared, read-only usr partitions. There are also a large number of Macintoshes and PCs at this site. However, the metric is not run on those machines.

The metric was initially tested on a small subset of machines on a few of the subnets. For each subnet that was included in the testing phase a fileserver was selected, along with a number of machines that were known or suspected to be slow, and machines that represented some cross-section of the architectures at the site. The fileserver that was selected for a given subnet was one of the fileservers typically used by the machines on that subnet. Table 1 groups the testing machines into subnets and shows the fileservers used, while Table 2 gives the architecture of the fileservers. During the initial phases of testing and designing the metric it was also run in a number of WAN-connected sales offices, which are not shown in Tables 1 and 2. The sales machines were dropped since the complaints that were being received from people in the sales offices related to WAN latency issues, and the metric is not a suitable tool for monitoring or graphing the performance of a WAN connection.

  Machine     Subnet   Fileserver    Architecture        Purpose
  mingus      72       anachronism   Sun 4/40            Desktop
  gaea        64       anachronism   Solbourne S4100dx   Many services
  underdog    72       anachronism   Sun SS4             Desktop
  kency       72       anachronism   Sun SS10            X-terminal server
  amnesia     100      anachronism   Sun SS20            NAC administration
  fili        100      dempsey       Sun SS10            CPU server
  mahogany    74       dempsey       Sun IPX             Build machine
  redwood     74       dempsey       Sun IPX             Build machine
  mordor      124      dempsey       Sun SS10/512        Porting
  canary      68       anachronism   Sun SS4             Desktop
  paris       68       mammoth       Sun 4/40            Desktop
  goofus2     68       mammoth       Sun 4/50            Desktop
  orac        68       mammoth       Sun 4/40            Desktop
  millstone   92       almanac       SS20/61             Xterm Server
  droid       92       almanac       SS10                LISP testing
  mercury     92       almanac       Sun 4/60            Desktop
  sark-92     92       almanac       Sun SS20            CPU server
  jose        92       almanac       Sun 4/60            Desktop

  Table 1: Subnet numbers and fileservers of tested machines

  Server        Subnets    Architecture        Users
  anachronism   72, 100    Network Appliance   NCS
  dempsey       100, 116   Auspex NS6000       Porting Center
  almanac       92, 100    Network Appliance   Design Verification
  mammoth       98, 100    Solbourne S6/904    Product Engineering

  Table 2: Architecture of the fileservers

Motivation

The idea for such a metric arose from the observation that sometimes a user will complain that the system is slow, or the network is down.
There was a suspicion that ``the network was slow'' was becoming the modern, hi-tech equivalent of ``the dog ate my homework''. However, without any form of metric or supporting data, it was impossible to try to refute those statements, or to defend the state of the system and network as a whole. Therefore, there was a desire to try to quantify the experience of the ``average'' user, and to have some form of data to use as the basis for discussion.

An analogy that is frequently used at Synopsys [Footnote: The inspiration for which was Eric Berglund's, with extensions from Arnold de Leon.] is the comparison of a network with the freeway system in Los Angeles. At any one time the network, or freeway system, is neither up nor down. Some segments of it may be down, or extremely congested, but others will be in perfect working order. In fact, this analogy extends to the metric. The metric is the equivalent of taking sample trips along certain routes at different times during the day, and measuring the time taken, to get a feeling for ``delay''. We recently discovered that such sampling is one of the methods actually employed by civil engineers in the study of traffic routes and delays.

The idea behind the metric was to develop a tool to measure ``performance'' at a high level, as the user sees it, rather than breaking the system down into a series of components, none of which in itself means anything to the user. The metric was not intended to be an all-purpose instant system and network analysis tool for the system administrator. However, we, the system administration staff, thought it would be nice if it did give us some useful feedback in return for the time and effort expended. We will examine to what extent our goals and our hopes were met.

              Total        Edit 1       Edit 2     Compile     Run
  mingus      53.090136    28.652537    6.077702   18.288256   0.059256
  gaea        373.525000   332.572501   7.661372   33.211638   0.067266
  underdog    40.528716    28.717238    5.917166   5.866199    0.029712
  kencyr      43.236301    28.745429    5.941636   8.513442    0.034764
  amnesia     38.046358    28.390941    6.440619   3.173970    0.033065
  canary      39.296980    28.499463    5.820487   4.952673    0.021444
  orac        52.123264    28.618801    6.015613   17.426540   0.067794

  Table 3: Initial results of the metric, using pty

Designing the Metric

There were a number of issues involved in designing the metric. Not only did it necessitate deciding upon a typical set of actions that would be sufficiently non-intrusive, but it also involved determining a typical environment in which these commands would be run. Of particular interest were the differences in performance for different groups of users, who were on distinct subnets and using separate fileservers.

Design Issues

User-perceived system performance cannot simply be measured by a single number. A user's view of the performance of a system is formed on the basis of how long it takes to perform a particular job, and thus two different users on the same system may have differing views on the performance of that system [1]. Thus, an essential part of designing a metric that would be of some use to us involved determining what a typical user's job involved. Since a significant portion of our user community is involved in software development, it was decided that the typical action that we would model would be the edit, compile, run cycle. While this series of actions is not typical of all sections of our user community, the only way that we could make comparisons between different networks and fileservers was by running the same metric on all groups of machines. Thus, this set of actions was decided upon, while acknowledging that it limited the usefulness of the results for some sections of the community.
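A minimal sketch of the shape of such a timed cycle, in Perl, might look like the following. The project path and commands are placeholders, and the edit step here is a simple non-interactive append; the actual metric drives vi through a pseudo-terminal, as described in the next section.

    #!/usr/bin/perl
    # Illustrative sketch only -- not the metric described in this paper.
    # Times each phase of a hypothetical edit/compile/run cycle and reports
    # the results in seconds, similar in spirit to Tables 3 and 4.
    use strict;
    use warnings;
    use Time::HiRes qw(gettimeofday tv_interval);

    my $project = '/tmp/metric-project';    # placeholder work area on the fileserver

    # Run a command and return the elapsed wall-clock time in seconds.
    sub timed_phase {
        my ($label, @cmd) = @_;
        my $t0 = [gettimeofday];
        system(@cmd) == 0 or warn "$label failed: $?\n";
        my $elapsed = tv_interval($t0);
        printf "%-8s %10.6f\n", $label, $elapsed;
        return $elapsed;
    }

    my $total = 0;
    # "Edit": a trivial non-interactive append stands in for the vi session
    # that the real metric drives through a pty.
    $total += timed_phase('edit',    'sh', '-c', "echo '/* touch */' >> $project/main.c");
    $total += timed_phase('compile', 'make', '-C', $project);
    $total += timed_phase('run',     "$project/a.out");
    printf "%-8s %10.6f\n", 'total', $total;

Each phase is reported separately as well as in total, which is the breakdown shown in Tables 3 and 4.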
The first phase also involved performing some sanity checks on the metric using data gathered from an initial set of metric runs. This initial examination of the data was intended to verify that the metric was producing at least superficially believable results. It also enabled us to alter it somewhat to emphasize any differences in performance that existed.

Simulating an Interactive Session

To simulate an interactive session, we first tried using Dan Bernstein's pty package [5] both to provide ptys for vi and to simulate the speed at which a user types. While the results showed some variation from machine to machine, it was not significant, and we did not feel that it represented the differences that we perceived when using the machines. Partial results from this run are shown in Table 3. The metric performed two edits, a make of a small project and a short execution run. The results here are correlated with the tables from the ``Background'' section, and it is worth noting that these results provided a sanity check on the metric by displaying longer times for machines that seemed to give poor interactive response. Even ignoring the results produced by gaea, which suffered from some problems during that week, the relative ordering of the other machines at least made sense.

            Total      Edit 1     Edit 2
  kencyr    0.837238   0.413107   0.419813
  gaea      2.391545   1.100123   1.275356
  mingus    1.064980   0.500288   0.563270
  canary    0.525883   0.257806   0.267521
  amnesia   0.417002   0.200812   0.215840
  orac      1.377794   0.716232   0.660213

  Table 4: Initial results from the metric using chat2.pl

While the relative performances made sense, the absolute differences in the numbers did not. These machines had been selected to show a wide variation. It seemed that the time was dominated by the speed at which the pty program was playing back keystrokes, and that actually simulating user input was therefore probably not what we wanted, if we were looking for widely dispersed numbers.

The metric was altered to use Randal Schwartz's chat2.pl package [6] to simulate user interaction. This method of providing input to interactive programs can send the input to the program as quickly as possible, without simulating a delay between keystrokes. Results from an initial run with this altered metric are shown in Table 4. This initial run was executed in a different week from that in Table 3, but measured the same sequence of commands. These results show a much clearer separation between the machines, and were considered to be more representative of the differences between the machines. Thus the metric was altered to use chat2.pl instead of pty, in order to emphasize the differences between the machines.
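The difference between the two approaches can be sketched as follows; the helper functions and the ed-driven session are illustrative stand-ins, not the chat2.pl calls used by the metric itself.

    #!/usr/bin/perl
    # Illustrative sketch only: two ways of feeding input to an interactive
    # program, mirroring the pty-versus-chat2.pl change described above.
    use strict;
    use warnings;
    use Time::HiRes qw(sleep);    # sub-second sleep for the keystroke delay

    # Replay input one character at a time, pausing between "keystrokes".
    # With this style the elapsed time is dominated by the replay speed.
    sub send_typed {
        my ($fh, $text, $delay) = @_;    # $delay in seconds, e.g., 0.1
        for my $ch (split //, $text) {
            print {$fh} $ch;
            sleep($delay);
        }
    }

    # Send the whole input immediately, so the elapsed time reflects how
    # quickly the machine (editor, NFS server, etc.) can absorb it.
    sub send_instant {
        my ($fh, $text) = @_;
        print {$fh} $text;
    }

    # Example: drive a simple line-oriented editor session through a pipe.
    # (The real metric drives vi through a pseudo-terminal instead.)
    open(my $ed, '|-', 'ed', '-s', '/tmp/metric-scratch.txt')
        or die "cannot start ed: $!";
    send_instant($ed, "a\nhello from the metric\n.\nw\nq\n");
    close($ed);

With the instant style, the measured time reflects how quickly the editor, and the machine and fileserver underneath it, absorb the input, rather than the replay speed of the driver.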
The Execution Framework

This section briefly describes how, based on the configuration described in the ``Background'' section, we set the metric up to run on a number of subnets, using fileservers that we could argue were typically used by people in the group to which each subnet belonged. The metric is currently being run on each of ten subnets that correspond to the primary business units of the company in Mountain View. For each subnet a number of machines were selected to give us a sample of about three machines of each class of machine present on that subnet. For example, if a subnet had Sun SPARCstation 20s (SS20s) that act as compute servers, Sun SPARCstation 10s (SS10s) that act as X-terminal servers, Sun SPARCstation 5s (SS5s) that are on desktops, Sun SPARCstation IPCs (IPCs) that are on desktops, and SS20s that act as servers for the desktop workstations, there would be five classes of systems on that network, and a sample of each class would be selected. In addition, machines that we expected to be especially heavily or lightly loaded were selected.

Each subnet is used by a particular business unit, which uses a number of fileservers. One of these fileservers contains home directories, whereas work areas containing product can be housed on a number of different fileservers. Thus the decision was made that the fileserver that would be used by the metric on a given subnet would be the one that contains the home directories of the people on that subnet. The metric therefore may not in fact be using the machine that a given user accesses while working on a product. However, it is guaranteed to access a fileserver that every user in the group accesses a number of times during the day. Therefore the chosen fileserver can be said to influence the users' perceptions of system performance, in some way.

Analyzing the Results

In the ``Simulating an Interactive Session'' section, we showed the results as running averages over the course of a week. While this was useful in the initial stages of attempting to produce a metric that gave believable results with significant variations, it yields no further information. The next stage was to produce graphs that were satisfactory from a management perspective, which involved considerable dialogue and a number of revisions in what graphs were produced. We thereafter examined the data produced to see if we could extract any useful information from the users' or the system administrators' perspectives. We discovered that all three goals were quite distinct from each other and involved taking different views of the same data.

Graphs for Management

Initially we produced some graphs that showed the average results on an hour-by-hour basis for each of the internal groups that we were monitoring, where each group was represented by a set of machines on the same subnet, using the same fileserver. Along with these graphs, we also produced individual graphs for each of the three best and three worst machines. For example, the graph produced for the Network and Computer Services (NCS) group is shown in Figure 1, and the results from the machine jose are shown in Figure 2. The numbers on the horizontal axis are hours since 0:00 on Sunday, and each graph represents a week. The numbers on the vertical axis represent the elapsed wall-clock time taken to run the metric, in seconds.

  [picture NCS.ps not available]
  Figure 1: Average results for NCS

  [picture jose.ps not available]
  Figure 2: Average results for jose

Other than noting that the systems in NCS are lightly loaded at weekends, we also noticed that there was generally a trend in the values produced on each machine. What management were interested in were variations from the trend. Since trends are machine-based, we normalized the results from each machine, and plotted the set of machines that were being monitored in a particular group on the same graph. [Footnote: Due to the intentional clustering of the data around a single area - the ``normal'' line - these graphs are best viewed in color.] These graphs represent deviations of the machines from their normal behavior.
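The exact normalization used is not reproduced here; one plausible form, sketched below, expresses each sample as a fraction of that machine's own mean, so that every machine clusters around 1.0 and departures from its usual behavior stand out regardless of its absolute speed.

    #!/usr/bin/perl
    # Illustrative sketch: per-machine normalization of metric samples, so
    # machines with different baseline speeds can share one graph.  The
    # normalization shown (sample / machine mean) is an assumption; the
    # paper does not describe the exact formula used.
    use strict;
    use warnings;
    use List::Util qw(sum);

    # %samples maps a machine name to its hourly metric times in seconds.
    my %samples = (
        jose    => [1.21, 1.18, 1.35, 2.90, 1.25],   # made-up numbers
        mercury => [0.80, 0.79, 0.83, 0.81, 1.60],
    );

    for my $machine (sort keys %samples) {
        my @times = @{ $samples{$machine} };
        my $mean  = sum(@times) / @times;
        # A value of 1.0 is "normal" for this machine; 2.0 means the run
        # took twice as long as usual, however fast the machine is.
        my @normalized = map { $_ / $mean } @times;
        printf "%-10s mean=%.3f  normalized: %s\n",
            $machine, $mean, join(' ', map { sprintf '%.2f', $_ } @normalized);
    }

A z-score (subtracting each machine's mean and dividing by its standard deviation) would serve the same purpose if the spread of each machine's results also needed to be factored out.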
Another representation of the data that was perceived to be interesting was a stock-market style high, low and average chart, on a day-by-day basis, based on the normalized results, as described above. This form of graph would highlight machines with a large variation in performance, which could be indicative of a problem.
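A sketch of the reduction behind such a chart follows; the sample data and the assumption of 24 hourly samples per day are illustrative.

    #!/usr/bin/perl
    # Illustrative sketch: collapse normalized hourly samples into daily
    # (low, high, average) triples, the raw material for a stock-market
    # style chart.  Day boundaries and data are illustrative assumptions.
    use strict;
    use warnings;
    use List::Util qw(min max sum);

    # One normalized sample per hour, starting at 0:00 on Sunday.
    my @hourly = map { 1 + rand(0.4) } 1 .. 72;    # three made-up days
    $hourly[30] = 2.7;                             # an artificial Monday spike

    my $day = 0;
    while (my @chunk = splice(@hourly, 0, 24)) {
        printf "day %d  low=%.2f  high=%.2f  avg=%.2f\n",
            $day++, min(@chunk), max(@chunk), sum(@chunk) / @chunk;
    }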
Relating to the Users

The metric is also being used to gather trend data for a given architecture on a given network. It is thought that this data will be useful as a reference point when discussing performance issues with a user. It should be possible to monitor a machine about which a complaint has been received, compare the data it generates with that of other machines of its class on that subnet, and either state that there is a problem with it, or that it seems to be behaving normally for a machine of its class.

Information for the System Administrator

One way of representing the data was to produce a graph for each group that compared the different classes of machines in that group against each other. As expected, the SS20s were shown to be considerably faster than the SS5s or IPXs in these graphs. However, we were able to make a couple of interesting observations from these graphs. Firstly, our SS10s that are used as X-terminal servers yield worse performance than the SS5s that are on people's desks. We also noted that there seemed to be a base ``best'' performance for each class of machines, which seems to be primarily dominated by NFS response time, but is also a function of the class of machine. This ``best'' performance time was universal across our subnets with the exception of the Product Engineering network, on which we were using an Auspex [Footnote: HP IV processor; Storage processor III; File processor II; old-style I/O processor; running 1.6.1M1] file server, rather than a Network Appliance [Footnote: FAServer 1400, running 2.1.3] file server. The machines using the Auspex exhibited marginally worse results across the board for this metric.

However, these were brief, once-off observations. Of more use and interest would be the use of the metric as a diagnostic or analytical tool. It has been said [3] that all system performance issues are basically resource contention issues and that the biggest challenge is figuring out which kernel subsystem is really in trouble. In a large site, however, the primary challenge is discovering which individual servers or networks are suffering performance trouble. System administrators may never log into a user's desktop machine. This machine may, however, be central to the user's perception of system performance. The graphs that are produced for management from this metric do not aid the system administrator in the task of identifying individual machines that have problems, because they are based on the usual behavior of each given machine, with no consideration given to absolute performance values. Unusual behavior may, however, be detected from graphs relating the performance of one machine to others of its type, such as the graphs that are being produced for generating concrete data for dialogue between users and system administrators. The utility of both of these sets of graphs for system administrators will be discussed further in the next section.

Another potential use of the metric for system administrators is in the area of performance tuning and reconfiguration. Machines at a large site, such as Synopsys, tend to be clones of a standard model. Before changing that hardware model, some testing and benchmarking is performed. This principle can also be applied to performance tuning. A single system could be reconfigured and the performance of the new configuration monitored over a period, with the results being compared with its previous results and those of other machines in that class during the same time period. This data could then be used to justify a decision on whether or not to similarly reconfigure the other machines in that class.

Limitations

While the metric has produced some information that management found interesting, and other information that can be useful in dealing with performance issues that can arise with users, we feel it is of limited use, as it stands, in producing information that a system administrator can use. The metric can currently be used to detect abnormal behavior of a machine. However, if all of the machines of a particular type are identically misconfigured, and yielding less than the performance of which they are capable, that will not be detected. Even if abnormally bad performance is detected, the metric gives no indication of how the performance of the system may be improved.

Also, the numbers that it produces cannot be used directly to say that a given machine is performing well or badly. The metric must be calibrated within the particular environment before the results are interpreted. Grouping machines into classes, and comparing the results of a given machine to those of others in its class during the same time period, is a step in the right direction. However, there are many other things that can affect the performance of a system which are not taken into account in these graphs. For example, this metric does not indicate the amount of memory or swap space in a system; it gives no indication of whether a machine was otherwise active or idle at the time; and there is not, currently, any way of correlating the results to network load, or to NFS performance of the server during the given time period.

Another data-point that we found interesting was that the SS4s produced better results for the metric than SS5s. We generally use SS5s rather than SS4s as desktop machines at our site, because it was felt that the lack of expansion possibilities, and the slightly different operating system, outweighed any advantages that the SS4s might have. Thus, this result demonstrates that better performance is not everything when it comes to choosing a machine for the standard desktop model.

One area that we feel is a major limitation of the metric is the lack of any way to test X performance, which is fundamental to a user's perception of system performance in our environment. A correspondingly significant limitation is the inability to run this metric on our other platforms, such as Macintoshes and PCs.

Extensions

The metric could be extended in a number of ways to provide more information for the system administrator. In considering possible extensions, we consider the reasons behind the aforementioned limitations. We also consider the approaches taken in monitoring and tuning the performance of a small number of systems. We then propose ways of extending the metric to provide additional information without adding unduly to the logging overhead.

Finding the Source of a Problem

While it can be argued that this metric aids the system administrator in locating machines in a large network that may be suffering from performance problems, the graphs represent information at too high a level to even give a feel for what component of the system might be at fault. Tracking down a performance problem on a single machine is discussed in the literature [1, 3]. The testing that would be performed on a sick system could be automated and incorporated into the metric. However, the inclusion of the results of this testing in the logs would result in more data than could reasonably be stored without filling up the filesystem. On the other hand, it is possible to set thresholds in the metric, so that if the time elapsed from the start of a run to the end (or to any clearly defined intermediate point) is greater than the threshold, the additional information is logged. This form of additional logging could be a useful extension to the metric, because it would give some context for the spikes on the graph.
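A minimal sketch of such threshold-triggered logging follows; the threshold, diagnostic commands and log location are illustrative choices rather than those of the actual metric.

    #!/usr/bin/perl
    # Illustrative sketch: log extra diagnostic output only when a metric
    # run exceeds a threshold, so the logs stay small in the common case.
    # Threshold, commands and log location are illustrative assumptions.
    use strict;
    use warnings;
    use Sys::Hostname;

    my $threshold = 5.0;                      # seconds; would be calibrated per class
    my $logfile   = '/var/tmp/metric-extra.log';

    # Called with the elapsed time of a run (or of an intermediate phase).
    sub log_context_if_slow {
        my ($elapsed) = @_;
        return if $elapsed <= $threshold;

        open(my $log, '>>', $logfile) or die "cannot append to $logfile: $!";
        printf {$log} "%s %s run took %.3fs (threshold %.1fs)\n",
            scalar localtime, hostname(), $elapsed, $threshold;
        # Capture a snapshot from standard tools to give context for the spike.
        for my $cmd ('uptime', 'vmstat 1 3', 'nfsstat -c') {
            print {$log} "--- $cmd ---\n", scalar qx($cmd);
        }
        close($log);
    }

    log_context_if_slow(7.2);    # example invocation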
We could also consider the relationships between this metric and other tools that monitor the performance of a particular aspect of a machine or network, such as NFS [7, 8, 2]. We would like to consider how the results of these tools could be correlated with those of the metric to demonstrate what influence that aspect is having on the overall user-perceived performance. We feel that this would be a useful and interesting extension, but it has not been implemented as yet.

Real-time Detection of Problems

An extension that was proposed within Synopsys was that notification of a problem could be sent to the administrator of that machine by email or pager, in real-time. This notification would allow the administrators to monitor the state of the machines as the problems were occurring. In addition, it was proposed that notification could be sent to the administrators when the script failed to complete for one reason or another. The most common cause of failure is lack of disk space on the fileserver that a given machine is using, or the server going down. I experimented briefly with automatic notification, and discovered that with even a fraction of the machines that are in the current operational run, the amount of notification is so large as to be overwhelming. I am of the opinion that this is not a useful extension, as it stands. It could possibly be made useful, however, through employing syslog or by linking the metric into a real-time monitoring system such as Netlabs. We have not experimented in this direction, but either of these approaches would conquer the flooding effect.

Using Alternate Sequences

It was mentioned in the ``Design Issues'' section that the user action that we model is the edit, compile, run cycle. It has also been suggested that we may want to monitor file transfers over the various WAN connections, or perhaps the amount of time it takes to perform a particular sequence of queries on one of our databases. The metric is written in such a way that, providing you are comfortable with perl 5 and chat2.pl, it is trivial to drop in any sequence of transactions that could be performed by a user at a standard shell prompt. It does not support timing X applications, however.
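As a rough illustration of that pluggability (not the metric's actual interface), a sequence can be represented as a list of named actions that a harness times one by one:

    #!/usr/bin/perl
    # Illustrative sketch: a pluggable sequence of timed transactions.  The
    # transactions shown (and the harness itself) are assumptions, meant
    # only to show how an alternate sequence could be dropped in.
    use strict;
    use warnings;
    use Time::HiRes qw(gettimeofday tv_interval);

    sub run_sequence {
        my (@transactions) = @_;
        my $total = 0;
        for my $t (@transactions) {
            my $t0 = [gettimeofday];
            $t->{action}->();
            my $elapsed = tv_interval($t0);
            $total += $elapsed;
            printf "%-12s %10.6f\n", $t->{name}, $elapsed;
        }
        printf "%-12s %10.6f\n", 'total', $total;
    }

    # The default edit/compile/run cycle could be swapped for, say, a file
    # transfer or a database query sequence simply by passing different
    # name/action pairs.
    run_sequence(
        { name => 'copy',   action => sub { system('cp', '/etc/hosts', '/tmp/metric-copy') } },
        { name => 'grep',   action => sub { system('grep', '-c', 'localhost', '/tmp/metric-copy') } },
        { name => 'remove', action => sub { unlink '/tmp/metric-copy' } },
    );

Swapping in a WAN file transfer or a database query sequence would then be a matter of supplying a different list of transactions.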
Conclusions

The metric produces some useful ways of visually demonstrating normal and abnormal behavior of machines in a network to management and the user community. It gives a high-level, generalized overview of how the machines, or the ``network'', in a particular business unit are performing. The approach taken is similar to the civil engineering approach of taking sample trips to measure delay on particular routes, or in a particular network of roads.

The metric does not, however, produce data that in itself is useful to system administrators in finding trouble spots in a large network of machines. It could be extended to be somewhat more useful in this regard, but it would take a considerable amount of work. The amount of time that would be spent extending the metric in this way must be balanced against the usefulness of the results and the intentions behind running the metric. In our case, we decided that it was not worth the effort, since the metric was not intended to produce information that would be directly acted upon by the system administration staff, but rather to provide a general overview of the performance of the network as a whole.

Acknowledgments

The encouragement and advice that I received from Aoife and Paul were instrumental in the production of this paper, and for that I am profoundly grateful. I owe thanks, and perhaps a few drinks, pizza, or something interesting to read, to my proof-readers, Aoife, Jeff and Becky. My thanks, also, to all the unwitting participants of my experiments - the user community of Synopsys, Inc. - for their patience in putting up with the additional load on their machines. And equally, to Eric who believed in the metric, and to Randy whose fault it was in the first place. Finally, not forgetting Arnold, who not only proof-read the paper and avoided implementing the metric, but also takes some of the blame for inventing it in the first place.

Author Information

Christine Hogan is the security officer at Synopsys, Inc., in Mountain View, California. She holds a B.A. in Mathematics and an M.Sc. in Computer Science, in the area of Distributed Systems, from Trinity College Dublin, Ireland. She has worked as a system administrator for six years, primarily in Ireland and Italy. She can be reached via electronic mail as chogan@maths.tcd.ie.

References

[1] Mike Loukides. System Performance Tuning. Nutshell Handbook. O'Reilly and Associates, Inc., 1990.
[2] Gary L. Schaps and Peter Bishop. A Practical Approach to NFS Response Time Monitoring. In Proceedings of the 7th System Administration Conference (LISA VII). USENIX, November 1993.
[3] Marc Staveley. Performance Monitoring and Tuning. In Invited Talks Track of the 8th System Administration Conference (LISA VIII). USENIX, September 1994.
[4] Arnold de Leon. From Thinnet to 10baseT, from Sys Admin to Network Manager. In Proceedings of the 9th System Administration Conference (LISA IX). USENIX, September 1995.
[5] Daniel J. Bernstein. pty documentation and man pages, 1992. Available by anonymous ftp from mojo.eng.umd.edu in /pub/misc/pty-4.0.tar.gz.
[6] Randal L. Schwartz. The chat2.pl package. Posted to comp.sources.unix, June 1991.
[7] Matt Blaze. NFS Tracing by Passive Network Monitoring. In Proceedings of the USENIX Winter Conference. USENIX, January 1992.
[8] David A. Curry and Jeffrey C. Mogul. nfsstat man page, 1993.