Getting More Work Out Of Work Tracking Systems

             Elizabeth D. Zwicky - Silicon Graphics

                            ABSTRACT

     This paper discusses work initially done for SRI
International's Information, Telecommunication, and Automation
Division (ITAD).  ITAD's computer facility staff originally
implemented a work tracking system to avoid the embarrassment of
discovering that some important user problem had been brought to
their attention and then entirely forgotten.  Over the years, the
system also began to address more complex tasks, and is now used
to deal with some problems before users report them, to better
communicate with the users about the amount of work being done,
and to get those minor housekeeping chores that kept accumulating
done at last. This paper explains how.

                      Work Tracking Systems

     Work tracking systems, also commonly known as trouble ticket
systems, are systems that keep track of user problems for a group
of people. These have a family resemblance to the systems used to
track bugs in software (for instance, they generally allow you to
assign responsibility to a person, set priorities, and open and
close entries), but differ in some basic assumptions (for
instance, bug tracking systems generally assume that a bug can be
assigned to a particular product, that it only needs to be fixed
once, and that the database of bugs will be searched by naive
users, while work tracking systems generally assume that the
problem may have multiple causes, that it may recur in other
places, and that the database will be searched only by the people
fixing things). There are both commercial and public domain
systems available, at various levels of complexity and
specialization for system administration. At the most complex
end, problems are submitted through a special program with a
graphical user interface and tracked in a relational database. At
the simplest end, problems are submitted through electronic mail
and tracked with some pseudo-database.

                     ITAD's Tracking System

     ITAD's tracking system, internally called the ``action
queue'', is very much at the simpler (and cheaper) end of the
scale. It was developed at ITAD but modeled on the University of
Colorado's QueueMH. ITAD's system is discussed in [7] and
Colorado's in [2]. Users submit problem reports in e-mail, and
the people dealing with them use modified MH commands to set
priorities, list outstanding problems, update reports, and close
them when the problems have been fixed. This is well suited for
ITAD's environment, for several reasons: ITAD considers it
important for users to be able to submit problems directly into
the work tracking system; MH is already the standard mail system;
and there is a need to support users on a wide range of
platforms, making it difficult to port beautiful graphical user
interfaces. Some of ITAD's users are also strongly resistant to
learning new programs, however beautiful, and since they all
already knew how to use some mail system, and were accustomed to
sending electronic mail requests to ``action'', maintaining this
interface kept them happy.

                         Why Get Fancy?

     In late 1992, ITAD's computer facility staff became
embroiled in yet another discussion of exactly what it is that
the computer facility does and why it takes so long. With four
full-time system administrators and only a few problems coming in
per day, people didn't understand why problems weren't fixed
immediately.  The system administrators didn't understand why
nobody had yet snapped under the strain, since in addition to the
few hundred items that were in the queue, they were dealing with
all the maintenance and upgrades that weren't direct user
requests, plus all the telephone requests, questions in the hall,
and notes taped to doors.

     The computer facility developed a new purpose in life:
putting absolutely everything that needed to be done into the
action queue. When this initiative started, there were a few
hundred items in the queue, and 10 to 20 of them were resolved
each week. Currently, there are about 1,200 items in the queue,
and approximately 150 are resolved each week.  (At the queue's
peak, there were more than 2,000 queue items.)  Even the most
skeptical of users is capable of figuring out that this means
that the computer facility does a lot of stuff, and has even more
stuff left to do.

     The change took several months, and required new software
and new procedures.

                        The Software Side

     Aside from the action queue software itself, there is a set
of programs I call ``complainers''. Each one of these programs
detects problems in some particular area; for instance, one of
them compares the nameserver's hostname-to-address mappings with
its address-to-hostname mappings to make certain they match.
These programs output messages, separated by blank lines. A
separate program, imaginatively named ``complain'', takes
messages separated by blank lines as input, compares their
subject lines to those of messages currently in the queue, and
submits them if they are not already present. This makes certain
that any given problem is reported only once. complain also
introduces a 5 second delay between messages to avoid overrunning
the mail queues.  (This is a common problem with programs that
are automated mail senders, since they tend to be lightweight
compared to the mail system. A good, up-to-date sendmail
configuration will keep you from drowning the receiving machine,
but not from using up all the memory on the sending machine.)
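
     As a minimal sketch of the duplicate suppression described
above, written in Python rather than the Perl of the originals,
the following assumes that the first line of each message is its
subject and that the caller supplies the subjects already in the
queue. The names split_messages, subject_of, and the submit
callback are illustrative assumptions, not ITAD's code.

```python
import time
from typing import Callable, Iterable, List


def split_messages(text: str) -> List[str]:
    """Split complainer output into messages separated by blank lines."""
    messages, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            messages.append("\n".join(current))
            current = []
    if current:
        messages.append("\n".join(current))
    return messages


def subject_of(message: str) -> str:
    """Assumption: the first line of a message carries its subject."""
    first = message.splitlines()[0]
    if first.lower().startswith("subject:"):
        first = first[len("subject:"):]
    return first.strip()


def complain(
    text: str,
    queued_subjects: Iterable[str],
    submit: Callable[[str], None],
    delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Submit messages whose subjects are not yet queued; return those subjects."""
    seen = set(queued_subjects)
    sent = []
    for message in split_messages(text):
        subject = subject_of(message)
        if subject in seen:
            continue  # already reported, so stay quiet
        if sent:
            sleep(delay)  # pause between submissions to spare the mail queue
        submit(message)
        seen.add(subject)
        sent.append(subject)
    return sent
```

     A caller would pass the combined complainer output, the
subjects currently in the queue, and a function that mails one
message to the queue.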

     These programs are run from cron, via a small script which
has multiple names. It looks for a directory in
/usr/facility/complainers that is named after the script, runs
all the programs in it, and pipes the output to complain. It is
run every evening as ``complain-daily'' and once a week as
``complain-weekly''.
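
     The dispatch-by-name arrangement can be sketched roughly as
follows, in Python. The function names find_complainers and
collect_output are hypothetical, and the sketch assumes that
every executable file in the directory is a complainer.

```python
import os
import subprocess
from typing import List

BASE_DIR = "/usr/facility/complainers"  # directory named in the text


def find_complainers(script_name: str, base_dir: str = BASE_DIR) -> List[str]:
    """Return the executable files in base_dir/<script name>, sorted by name."""
    directory = os.path.join(base_dir, os.path.basename(script_name))
    if not os.path.isdir(directory):
        return []
    paths = [os.path.join(directory, name) for name in sorted(os.listdir(directory))]
    return [p for p in paths if os.path.isfile(p) and os.access(p, os.X_OK)]


def collect_output(paths: List[str]) -> str:
    """Run each complainer and join the non-empty outputs with blank lines."""
    outputs = []
    for path in paths:
        result = subprocess.run([path], capture_output=True, text=True)
        if result.stdout.strip():
            outputs.append(result.stdout.strip())
    return "\n\n".join(outputs)
```

     Invoked under the name complain-daily, such a script would
call find_complainers with its own name (sys.argv[0]) and hand
the collected output to complain.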

     Currently, the daily complainers are bad-aliases, bad-time,
dumps, todays-errors, and user-problems. The weekly complainers
are hosts-not-hosts, missing-man-pages, multi-mount, and
no-modules.

bad-aliases

     bad-aliases runs through the aliases file, looking for
aliases that include files that don't exist, refer to local users
that don't exist, or forward mail to remote hosts that return
``User unknown'' for the user we're forwarding to. For local
lists, it reads through the files; for remote hosts, it uses a
local program called mailaddr that contacts the remote host and
does an SMTP VRFY. It would be nice to weed out forwards to non-
existent hosts, but there is no easy way to distinguish between a
host that doesn't exist and a host that isn't reachable at the
moment. bad-aliases also reads the comments in our files looking
for a date, and complains if the date is more than 6 months old;
we use this for mailing list aging.
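
     The local checks can be sketched in Python as below. This
is an illustration under stated assumptions: one alias per line
with no continuation lines, and the sets of local users and
existing files supplied by the caller. The remote VRFY check and
the date-based aging are left out, and check_aliases is a
hypothetical name.

```python
from typing import List, Set

INCLUDE = ":include:"


def check_aliases(text: str, local_users: Set[str], existing_files: Set[str]) -> List[str]:
    """Report aliases that include missing files or name unknown local users."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        alias, targets = line.split(":", 1)
        entries.append((alias.strip(), [t.strip() for t in targets.split(",") if t.strip()]))

    alias_names = {alias for alias, _ in entries}
    complaints = []
    for alias, targets in entries:
        for target in targets:
            if target.startswith(INCLUDE):
                path = target[len(INCLUDE):]
                if path not in existing_files:
                    complaints.append(f"{alias}: included file {path} does not exist")
            elif "@" in target or target.startswith(("/", "|")):
                continue  # remote address, file, or program: not checked here
            elif target not in local_users and target not in alias_names:
                complaints.append(f"{alias}: local user {target} does not exist")
    return complaints
```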

bad-time

     Theoretically, all of ITAD's machines run ntp or otherwise
synchronize clocks. In practice, ntpd occasionally dies, and some
users have root, modify rc.local in odd ways, and fail to notice
that the time is wrong until they call up complaining that make
is behaving oddly (because they're 8 minutes off the time their
file server has). bad-time checks that all hosts are within 5
minutes of the time on the host running the complainers. It
ignores hosts in a different time zone from the host it's running
on. bad-time is ITAD-specific only in the code that identifies
the hosts to run on and the specific message it outputs;
otherwise it only needs to be able to run date on the remote
hosts. Any site that has a simple way to get all remote hosts to
provide the time in number of seconds since the epoch, instead of
in human-readable form, would probably be much better served to
rewrite the program, since that would be simpler and more
reliable than bad-time's method of parsing the human-readable
date. (It would be a service to the universe if somebody managed
to make it standard for date to provide a format code for ``I
don't want a beautiful date, just give me the seconds''.  I have
known sites that simply installed a stupiddate program for this
purpose.)
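
     For a site that can collect seconds since the epoch from
each host, the comparison itself is short. The Python sketch
below assumes the times have already been gathered into a
dictionary; skewed_hosts is a hypothetical name, and the
five-minute threshold is the one given above.

```python
from typing import Dict, List

MAX_SKEW_SECONDS = 5 * 60  # the five-minute threshold from the text


def skewed_hosts(
    reference: int,
    host_times: Dict[str, int],
    max_skew: int = MAX_SKEW_SECONDS,
) -> List[str]:
    """Report hosts whose clock differs from the reference by more than max_skew."""
    complaints = []
    for host in sorted(host_times):
        skew = host_times[host] - reference
        if abs(skew) > max_skew:
            complaints.append(f"{host}: clock differs from reference by {skew:+d} seconds")
    return complaints
```

     Because both values are plain integers, no date parsing is
involved.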

dumps

     dumps is another doublecheck. ITAD's backup system sends
mail when it's working and when it encounters errors, but this
complainer independently compares the dump frequency field in
fstab with the data in /etc/dumpdates. This will catch, among
other things, disks that have been assigned to a dump run that is
never performed. The basic concept behind dumps is applicable to
any backup system that uses dump as its transport agent and runs
on
machines with a dump frequency field in fstab, but code changes
will be necessary to adapt to the values other sites use in the
fstab dump frequency field. It could be adapted to any situation
where the information about what should have been backed up and
what was backed up is available to a program other than the
backup system.
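
     A rough Python sketch of the comparison follows. It assumes
the dump frequency is a number of days between dumps (ITAD's
own values differ, as noted above) and that /etc/dumpdates lines
have the usual "device level date" layout. The function names
are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def parse_dumpdates(text: str) -> Dict[str, datetime]:
    """Map each device to its most recent dump time, at any level."""
    latest: Dict[str, datetime] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        device, _level, stamp = line.split(None, 2)
        when = datetime.strptime(stamp.strip(), "%a %b %d %H:%M:%S %Y")
        if device not in latest or when > latest[device]:
            latest[device] = when
    return latest


def overdue_dumps(
    frequency_days: Dict[str, int],
    latest: Dict[str, datetime],
    now: datetime,
) -> List[str]:
    """Report devices that should be dumped but have no recent enough dump."""
    complaints = []
    for device in sorted(frequency_days):
        days = frequency_days[device]
        if days <= 0:
            continue  # assumption: zero means the device is not dumped
        last = latest.get(device)
        if last is None:
            complaints.append(f"{device}: never dumped")
        elif now - last > timedelta(days=days):
            complaints.append(f"{device}: last dumped {last:%Y-%m-%d}")
    return complaints
```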

todays-errors

     todays-errors is one of the most complex and useful
complainers.  It reads through today's messages in the syslog or
equivalent message log on each of ITAD's machines, sorting the
messages into three classes: messages it knows are ignorable,
messages it knows are bad, and messages it doesn't recognize.
Ignorable messages include normal boot messages, informational
messages from programs that insist on logging their startup or
automated restart (routed, mrouted, and named for instance), and
messages about missing services that occur while the servers are
down for backup. Messages which are recognized as specific
problems with specific advice provided include disk errors, lpr
complaining about missing printers, and memory errors. In
addition, it counts the number of reboots encountered, and sends
a tailored message if it exceeds 3. Messages that are not
recognized are sorted by the program that logged the message, and
one action request is queued for each program.
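
     The three-way classification might be sketched in Python as
below. Every pattern and piece of advice here is invented for
illustration, since the real rules are specific to ITAD, and the
sketch assumes the conventional "date host program[pid]: text"
syslog layout for a single host's log.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

SYSLOG_LINE = re.compile(
    r"^\w{3}\s+\d+\s[\d:]{8}\s(?P<host>\S+)\s"
    r"(?P<program>[^\s:\[]+)(?:\[\d+\])?:\s(?P<text>.*)$"
)

# Hypothetical rules, matched against "program: text".
IGNORABLE = [re.compile(r"^(routed|mrouted|named): (starting|restarted)")]
KNOWN_BAD = [
    (re.compile(r"^vmunix: .*disk error", re.I), "disk error: check the drive before it fails"),
    (re.compile(r"^lpd?: .*unknown printer", re.I), "printer is missing from printcap"),
]
REBOOT = re.compile(r"^vmunix: .*\bbooting\b", re.I)


def classify(lines: Iterable[str], max_reboots: int = 3) -> Tuple[List[Tuple[str, str]], Dict[str, List[str]]]:
    """Return (lines with advice, unrecognized lines grouped by program)."""
    advice: List[Tuple[str, str]] = []
    unrecognized: Dict[str, List[str]] = defaultdict(list)
    reboots = 0
    for line in lines:
        match = SYSLOG_LINE.match(line)
        if match is None:
            unrecognized["(unparsed)"].append(line)
            continue
        entry = f"{match['program']}: {match['text']}"
        if REBOOT.search(entry):
            reboots += 1
            continue
        if any(pattern.search(entry) for pattern in IGNORABLE):
            continue
        hint = next((h for p, h in KNOWN_BAD if p.search(entry)), None)
        if hint is not None:
            advice.append((line, hint))
        else:
            unrecognized[match["program"]].append(line)
    if reboots > max_reboots:
        advice.append((f"{reboots} reboots today", "machine is rebooting unusually often"))
    return advice, dict(unrecognized)
```

     Each key of the second result would become one action
request, matching the one-request-per-program behavior described
above.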

     This finds a number of problems that might not otherwise
have been noticed, ranging from dying SCSI disks, which are now
usually replaced before they completely fall apart and the user
has no machine, to a user who believed that the right way to
debug his X environment was to power-cycle his machine every time
it hung. It has also discovered a number of things that were
perhaps better left concealed; both the routed and the sendmail
version ITAD currently runs have bugs that cause them to log
incorrect error messages, for instance. Then there are the
mysteries of the
universe, like the machine that logs ``Sun 4/600'' once every few
months. That's the complete line. The machine is not a 4/600.

     todays-errors is highly ITAD-specific, in that not only the
code for figuring out which machines to run on, but also all of
the regular expressions and tricks it uses to classify lines, are
hard-coded into the program.

user-problems

     user-problems compares /etc/passwd to what it can find of
reality.  For instance, ITAD has a centrally maintained on-line
phone database, which user-problems compares to the phone number
and location shown in /etc/passwd.  user-problems also checks to
make certain that users with valid passwords have valid home
directories, that their mail is forwarded somewhere, and that
they have calendars; conversely, it attempts to assure that users
with ``*'' passwords do not have valid home directories, do not
receive mail locally, and do not have calendars. (ITAD uses ``*''
to indicate users whose accounts have been removed, and ``**''
for administrative accounts that the program should ignore.)
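
     The home directory half of this check can be sketched in
Python as follows. The sketch covers only the password-field
convention and home directories, takes the set of existing
directories from the caller, and uses the hypothetical name
check_accounts; the mail, calendar, and phone database checks
are omitted.

```python
from typing import List, Set


def check_accounts(passwd_text: str, existing_dirs: Set[str]) -> List[str]:
    """Compare password fields with home directories, per the * and ** convention."""
    complaints = []
    for line in passwd_text.splitlines():
        fields = line.split(":")
        if len(fields) != 7:
            continue  # not a well-formed passwd entry
        user, password, home = fields[0], fields[1], fields[5]
        if password == "**":
            continue  # administrative account: ignored
        has_home = home in existing_dirs
        if password == "*" and has_home:
            complaints.append(f"{user}: removed account still has home directory {home}")
        elif password != "*" and not has_home:
            complaints.append(f"{user}: active account has no home directory {home}")
    return complaints
```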

     This serves partly as a check to make certain that people
have installed accounts using the programs for that purpose,
instead of by hand, and partly as a check on removing accounts,
which we prefer not to automate.  While it mostly catches botched
installs, it also regularly catches people who have changed
offices or phone numbers, inadvertent corruption in the online
database, and other forms of bit- or brain-rot.

     user-problems is extremely ITAD-specific, since it knows a
great deal of detail about where things are, how they are
formatted, and what different kinds of users should have. For
instance, it enforces a specific GECOS format, including an
expiration date, and cross-checks that against a particular file
parsed by column position; it also insists that all users be on
one of two specific mailing lists. Other sites are unlikely to
find it useful except conceptually.

hosts-not-hosts

     hosts-not-hosts compares /etc/hosts to the nameservice
databases, and also compares forward and reverse data within the
nameserver.  When it first ran, it found a surprising number of
hosts registered in only one place, or registered differently in
A and PTR records. Once these were removed, people learned not to
do that any more, and hosts-not-hosts rarely if ever finds any
differences at this point, although there is still no automated
program for adding hosts that would avoid these problems. It also
compares the NIS ethers file to the hosts in the nameserver. ITAD
considers the nameserver the primary source of host information,
not the hosts file, and it maintains an ethers file for all
hosts, regardless of whether they will ever need to use RARP.
This is partly to aid in diskless booting of hosts that normally
run dataless but don't at the time have functional operating
systems locally, and partly to aid in the identification of hosts
from packet traces on the network.

     hosts-not-hosts is fairly general; it requires a hosts file
containing only local hosts, tries to run named.xfer on the zone
the host it's running on is in, and tries to ypcat the ethers
map. If those conditions can be met, only a few strings need to
be changed.
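
     The forward and reverse comparison reduces to checking two
mappings against each other. The Python sketch below assumes the
zone data has already been loaded into dictionaries with one
address per name, which ignores multi-homed hosts, and the name
mapping_mismatches is hypothetical.

```python
from typing import Dict, List


def mapping_mismatches(forward: Dict[str, str], reverse: Dict[str, str]) -> List[str]:
    """Report names and addresses whose forward and reverse data disagree."""
    complaints = []
    for name in sorted(forward):
        address = forward[name]
        back = reverse.get(address)
        if back is None:
            complaints.append(f"{name} -> {address} has no reverse mapping")
        elif back != name:
            complaints.append(f"{name} -> {address}, but {address} -> {back}")
    for address in sorted(reverse):
        name = reverse[address]
        there = forward.get(name)
        if there is None:
            complaints.append(f"{address} -> {name} has no forward mapping")
        elif there != address:
            complaints.append(f"{address} -> {name}, but {name} -> {there}")
    return complaints
```

     The same function could compare /etc/hosts with the
nameserver by passing the two name-to-address tables in turn.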

missing-man-pages

     missing-man-pages compares programs in /usr/local/bin to
manual pages in /usr/local/man, and complains when programs do
not have matching manual pages. With programs like Perl and GNU
Emacs, which want to install a version-numbered link, this
requires you to link the manual pages as well; I haven't yet
figured out whether I think this is a good thing or not. While it
might possibly be helpful to users, there's no evidence that they
ever notice, and it clutters /usr/local/man with links. The
complainer has increased the likelihood that programs will have
manual pages (although in practice not the likelihood that the
program's installer will put them in place originally), but not
to a certainty.  Since it walks through all the binaries anyway,
it also complains about compressed binaries (the complaint says
neutrally that they are compressed and thus useless; some people
choose to uncompress them, others to remove them). It should work
at any site with minor and obvious changes to the particular
binary and manual page directories.
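
     A minimal Python sketch of the comparison follows. It works
on lists of file names rather than walking the directories,
assumes manual pages are named program.section, and treats .Z
and .gz as the compressed suffixes; audit_programs is a
hypothetical name.

```python
from typing import Iterable, List, Tuple

COMPRESSED_SUFFIXES = (".Z", ".gz")  # assumption: suffixes that mark compressed files


def audit_programs(
    programs: Iterable[str],
    man_files: Iterable[str],
) -> Tuple[List[str], List[str]]:
    """Return (programs lacking a manual page, compressed binaries)."""
    documented = {name.rsplit(".", 1)[0] for name in man_files}
    missing, compressed = [], []
    for program in sorted(programs):
        if program.endswith(COMPRESSED_SUFFIXES):
            compressed.append(program)
        elif program not in documented:
            missing.append(program)
    return missing, compressed
```

     A version-numbered name such as perl4.036 is reported as
missing unless a matching manual page or link exists, which is
the behavior discussed above.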

multi-mount

     ITAD's environment has an unusually large number of
networks, and machines move from one to another with unusual
frequency (particularly when you realize that machines are rarely
physically moved). This sometimes results in machines receiving
NFS service from multiple servers in situations where they could
be using a single server. This cannot be entirely fixed by using
automounts, since some of the file systems involved are not
interchangeable, and furthermore many of the users complained
that even though automount would normally give them the closest
server, it was not guaranteed to be deterministic; they wanted
the server closest in network terms, not currently fastest in
responding, and they wanted it to be the same every time. multi-
mount detects cases in which the user's email, the user's home
directory, and the /usr/local mounted on the user's home machine
are not all three on the same NFS file server. This is unlikely
to even be an interesting question at most sites, but except for
the code that determines the host that holds the user's home
directory, it's perfectly portable.
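
     Once the three servers for each user are known, the test
itself is a one-line comparison. This Python sketch assumes the
caller has already resolved them into a dictionary per user; the
keys "mail", "home", and "local" and the name split_service are
hypothetical.

```python
from typing import Dict, List


def split_service(users: Dict[str, Dict[str, str]]) -> List[str]:
    """Report users whose mail, home, and /usr/local come from different servers."""
    complaints = []
    for user in sorted(users):
        servers = users[user]
        if len(set(servers.values())) > 1:
            detail = ", ".join(f"{what} on {servers[what]}" for what in sorted(servers))
            complaints.append(f"{user}: {detail}")
    return complaints
```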

no-modules

     no-modules detects software packages installed in their
appropriate central location which do not also have files for use
with the Modules package, and which do not have ``.nomodule''
files in their top directory. Modules, described in [1], allows
users to add programs to their environment easily. With directory
names changed, it should work for most sites using Modules.

Why Complainers and Not Fixers

     People usually ask why ITAD uses complainers and not fixers.
In fact, there are fixers for some problems, but many problems
are not amenable to automated solutions. A number of the
complainers are backups to automated systems that are supposed to
have fixed or avoided the problem already, but which occasionally
fail or are circumvented. Others involve situations where either
of two information sources might be correct (for instance, the
password file or the phone database might have the correct phone
number) and there is no automated way to distinguish.  Some of
them clearly are very risky to attempt to fix automatically; for
instance, an invalid password entry with an existing home
directory should not automatically be fixed by removing the home
directory. Occasionally there is a complainer instead of a fixer
simply because the complainers are much faster and easier to
write, and serve to keep the facility in line while people look
for the time to figure out how to fix or avoid the problem.

     Complainers also have more applicability for the environment
at Silicon Graphics, where the hosts have considerable autonomy,
and it's appropriate for system administrators to check for
problems, but not for them to actually make changes to other
people's systems.

ITAD's Experiences With Complainers

     In general, the complainers are an improvement.  Problems
are detected earlier, more often fixed before users complain
about them, and things are generally tidier and more
comprehensible. On the other hand, maintaining the complainers
can be difficult, and has required some changes to the facility
solely to make the complainers work (for instance, the use of *
and ** passwords). Furthermore, it has proven to be difficult to
get people to deal with complaints consistently and correctly.
People have a tendency to remove items that they know are not
problematic each time they occur, rather than trying to make the
complainers stop complaining. People often fail to check for
earlier occurrences of errors in order to merge new complaints
with them, possibly because it's an action that's extremely
rarely required except for items from the complainers. Real
people rarely send in new complaints about the same problem with
different subject lines from the original (and when they do, the
two complaints are almost never correctly merged together).

     There is also an unfortunate tendency to treat the
complainers as humans. I've seen many people remove a complaint
with an explanation of why the complainer's interpretation of the
problem is wrong. (For instance, a complaint about a missing
manual page may be removed with the notation that the manual page
exists under another name.) Oddly enough, the software does not
generally listen to such explanations.  The instructions that it
provides are not always correct either, and people have been
known to take phrases like ``this probably means that ntp is not
running'' as gospel (as per the famous saying ``garbage in,
gospel out''), leading them into trying to figure out what the
ntp problem is when the clocks were not off at all and the real
problem was the complainer's failure to compensate for times that
cross an hour boundary. This is despite the fact that the
message clearly lists two times less than five minutes apart,
with the notation underneath that they are more than five minutes
apart.

Underpinnings of the Complainers

     Many of the complainers rely on the ability to find all
hosts of particular types, which ITAD provides by maintaining a
file called Hosts.Status listing various information about our
hosts including hardware type, OS type, and a classification into
server, dataless client, diskless client, and standalone. This
file is built automatically from the nameserver database, using
HINFO and TXT records to keep the information, and contains a
field marking whether the host was ``up'' or ``down'' (in
reality, whether it responded to ping) last time the file was
built. Hosts.Status is built automatically every night, and is
used by a number of programs besides the complainers; it actually
predates the complainers by several years.

Software Other Than the Complainers

     The complainers are not the only programs that insert items
into the action queue. Any program that can send mail can insert
items into the queue, and ITAD tries to configure things as much
as possible so that anything that needs administrative assistance
sends mail directly into the queue. In particular Wietse Venema's
TCP wrappers [9] mail information about refused TCP/IP
connections directly into the queue. This simplifies detection of
break-in attempts, at the cost of severely reducing our patience
with people who insist on trying failed connections multiple
times. Other security programs also send mail directly into the
action queue; this allows us to deal with problems which should
be looked at, but would be far too frequent and annoying to page
a specific security person about.

Changes to the Underlying Trouble-Ticket System

     The trouble-ticket system that was perfectly adequate with
200 items in the queue was not perfectly adequate with 2,000
queue items. We ended up making several after-market
modifications. The first was to dramatically enhance the
available reporting options, so that as well as getting a list
sorted by submission date of the items assigned to you, you could
get a list sorted by priority, and within priority have user-
submitted tickets before auto-generated tickets, and within that
have them ordered by the time since they were last responded to.
This corresponds roughly to order of importance. Another report
lists goals and the extent to which they were being met. For
instance, one goal was to have each ticket assigned to a group
and given a priority; another goal was to have no tickets over a
year old.
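
     The nested ordering can be expressed as a single sort key.
The Python sketch below assumes that a lower priority number is
more urgent and that, within a group, the ticket that has waited
longest for a response comes first; neither detail is stated
above. Ticket and report_order are hypothetical names.

```python
from datetime import datetime
from typing import Iterable, List, NamedTuple


class Ticket(NamedTuple):
    subject: str
    priority: int            # assumption: 1 is the most urgent
    auto_generated: bool     # True for tickets filed by a complainer
    last_response: datetime  # when the ticket was last responded to


def report_order(tickets: Iterable[Ticket]) -> List[Ticket]:
    """Sort by priority, then user-submitted first, then longest-waiting first."""
    return sorted(tickets, key=lambda t: (t.priority, t.auto_generated, t.last_response))
```

     False sorts before True in Python, so user-submitted
tickets precede auto-generated ones within a priority.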

     There is also a staggeringly unpopular program called whine
that sends you electronic mail telling you how many of your
action items haven't been replied to recently enough for their
priority level. This mechanism works beautifully for people who
occasionally have a few overdue items. If you have people who
routinely have several hundred, forget it; anyone with any sense
simply puts mail filtering in. People who occasionally have
several hundred are responsible for the bulk reply and bulk
remove features that were added to the trouble ticket system (and
are supposed to be used only on autogenerated items, as it's hard
to come up with a bulk reply that works reasonably for human-
generated items).
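
     The counting behind such a reminder might look like the
Python sketch below. The per-priority reply limits are invented
for illustration, items are reduced to (owner, priority, time of
last reply) tuples, and overdue_counts is a hypothetical name.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, Tuple

# Hypothetical limits: days allowed without a reply at each priority.
REPLY_LIMIT_DAYS = {1: 1, 2: 7, 3: 30}

Item = Tuple[str, int, datetime]  # (owner, priority, time of last reply)


def overdue_counts(
    items: Iterable[Item],
    now: datetime,
    limits: Dict[int, int] = REPLY_LIMIT_DAYS,
) -> Dict[str, int]:
    """Count, per owner, the items not replied to within their priority's limit."""
    counts: Dict[str, int] = {}
    for owner, priority, last_reply in items:
        limit = limits.get(priority)
        if limit is not None and now - last_reply > timedelta(days=limit):
            counts[owner] = counts.get(owner, 0) + 1
    return counts
```

     One message per owner with a nonzero count would then be
mailed out.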

                         The Human Side

     ITAD made two fundamental changes in its procedures. First,
everybody became fanatical about making certain that all requests
from users went into the queue, no matter how the user submitted
the request. Doing this is very difficult; it's hard to remember,
it increases the time that short requests take, and it requires
that you figure out how to restate the user's question so that
it's immediately comprehensible, easy to type, and not insulting
to the user. (It also required yet another change to the trouble
ticket system to allow tickets to be submitted in a closed
state.) On the other hand, it makes all those requests visible.
Those little questions can eat away a day, leaving you feeling
like you got nothing done whatsoever; the documentation proves
that something did happen.  When one of the ITAD users complained
that the computer facility wasn't responsive enough to his
requests, the queue logs proved he submitted more requests than
anybody else in the division - as many, in fact, as the next
several people added together - and this made management
considerably less receptive to his complaints.

     Second, everybody entered the things that ought to be done,
even though users hadn't requested them. This exercise was
enlightening, if not in the end terribly encouraging. Taking the
time to consider exactly what needed to be done, independent of
current crises, was an enjoyable and fruitful exercise. On the
other hand, it eventually became clear that the volume of crises
was such that a lot of these projects were never going to get
done, no matter how useful they would be. This was not only
inherently depressing, but also a major disincentive to put the
queue items in, since they merely sat around for a year or so
before getting removed. The final compromise position is that the
facility staff now enters into the queue only those projects that
they consider absolutely critical, not the ones that they
consider merely highly desirable. This makes only a moderate
difference in the number of them that actually get achieved, but
a significant difference in the usability of the statistics in
staffing arguments.

                             Results

     This enterprise did have most of the benefits that were
hoped for:
 o Problems are detected earlier; more pro-active things happen.
 o Projects of the system staff have the same weight and
   visibility as problems reported by users.
 o It is easier to communicate the entire appalling scope of the
   problem to non-staff members.
 o When the users try to play ``My queue is bigger than yours''
   they always lose.
In addition, it turns out to have several unexpected benefits:
 o People find removing items from the action queue inherently
   rewarding; trivial jobs that have been waiting for years
   actually get done.
 o The action queue is handled by lots of people; since problems
   are now obvious to people besides the ones that already know
   how to fix them, more processes are documented.
 o Making complainers has required that things be configured
   more consistently - programs need to be able to figure things
   out that people could guess at before.
 o Developing complainers leads naturally to developing fixers.
 o People who had previously claimed that it was extremely
   important for them to be told every day what was in the queue
   and what its priority was decided that maybe they really
   didn't care all that much after all. Similarly, a number of
   people who had wanted to see every closed action item
   realized finally that this information was not useful to
   them. This noticeably reduced arguments about the titles of
   items and the exact relative priorities of things with low
   absolute priorities. (Are you going to do that when hell
   freezes over enough to skate on, or not until you can build
   ice fishing huts?)

                             Futures

     As originally written, the complainers rely heavily on
ITAD's environment.  Furthermore, they have no options, no
configuration files, and a depressingly small number of comments;
changing what they do involves changing the Perl code. This is
acceptable for small ones, where the code is rapidly
comprehensible, but not for the larger ones, where the intent
becomes buried and it is difficult to maintain the rule sets.
Every complainer that needs to talk to multiple hosts contains
the code for identifying hosts that meet specific criteria and
running programs on them, allowing the complainers to either have
common bugs that therefore have to be fixed multiple times, or
even more fascinatingly to all have different bugs.

     I am currently engaged in rewriting the complainers for a
less centralized environment (notably, one where someone who is
not me needs to be able to maintain them). The rewrite involves a
single program which takes a rule file to specify how to find the
host list, which ones to run on, and what conditions map to what
complaints, looking at files or the output of single command
lines. This master complainer will not handle every condition;
it's appropriate mostly for checking for full disks, error
messages in logs, and the like. It won't replace either bad-
aliases or user-problems, so the architecture of having numerous
small programs and the occasional intruding monolith will be
maintained.

                        Related Software

     There are a number of other packages aimed at monitoring
networked systems to proactively detect problems, including [3],
[8], [5], and [4]. These systems are in general considerably more
beautiful than this one, but are also in general aimed at
immediate notification (via graphs, beeping computers, or beeping
beepers) and are not integrated with trouble ticket systems. This
system, while unbeautiful, is highly flexible and requires little
or nothing from the machines that are being monitored.  Trouble
ticket systems that have been documented include [8] and [6] in
addition to the previously mentioned [7] and [2].

                       Author Information

     Elizabeth D. Zwicky is a senior system administrator at
Silicon Graphics, where she supports the company's internal
system administrators. She prefers to program in languages that
begin with a ``P'' and work for companies that have three-letter
abbreviations containing the letter ``S''. Reach her via US Mail
at Silicon Graphics, Mail Stop 730, 2011 N. Shoreline Boulevard,
Mountain View, CA 94043-1389. Reach her electronically as
zwicky@corp.sgi.com.

                           References

 [1] John L. Furlani, ``Modules: Providing a Flexible User
     Environment'', Proceedings of the Fifth USENIX Large
     Installation Systems Administration Conference, 1991
 [2] Tinsley Galyean, Trent Hein, and Evi Nemeth, ``Trouble-MH: A
     Work-Queue Management Package for a >3 Ring Circus'',
     Proceedings of the Fourth USENIX Large Installation Systems
     Administration Workshop, 1990
 [3] Stephen E. Hansen and E. Todd Atkins, ``Automated System
     Monitoring and Notification with Swatch'', Proceedings of
     the Seventh USENIX Systems Administration Conference (LISA
     VII), 1993
 [4] Darren Hardy and Herb M. Morreale, ``buzzerd: Automated
     Systems Monitoring with Notification in a Network
     Environment'', Proceedings of the Sixth USENIX Systems
     Administration Conference (LISA VI), 1992
 [5] Richard W. Kint, ``SCRAPE (System Configuration, Resource
     And Process Exception) Monitor'', Proceedings of the Fifth
     USENIX Large Installation Systems Administration Conference,
     1991
 [6] David Koblas and Paul M. Moriarty, ``PITS: A Request
     Management System'', Proceedings of the Sixth USENIX Systems
     Administration Conference (LISA VI), 1992
 [7] Bryan McDonald, ``QMH: A Problem Tracking System'',
     Proceedings of the World Conference on System Administration
     and Security, 1992
 [8] James M. Sharp, ``Request: A Tool for Training New Sys
     Admins and Managing Old Ones'', Proceedings of the Sixth
     USENIX Systems Administration Conference (LISA VI), 1992
 [8] Carl Shipley and Chinyow Wang, ``Monitoring Activity on a
     Large Unix Network with perl and Syslogd'', Proceedings of
     the Fifth USENIX Large Installation Systems Administration
     Conference, 1991
 [9] Wietse Venema, ``TCP Wrapper: Network Monitoring, Access
     Control and Booby Traps'', Proceedings of the Third USENIX
     UNIX Security Symposium, 1992