Pp. 119-130 of the Proceedings
Thresh is a simple SNMP monitor, written in Scotty (Tcl with the Tnm extensions), which uses the UNIX file system hierarchy for configuration and data storage. Thresh compares SNMP variables to per-device thresholds or values, and issues notifications if any current SNMP variable values are unacceptable (or unexpected). Thresh can be used by itself, or as a complement to other network management and monitoring tools. Thresh can be thought of as fitting in between tools that generate immediate emergency alerts (such as Big Brother) and trending and history tools (such as Cricket).
Virtually every computing system and network has some kind of monitoring mechanism in place these days, but in some cases the monitoring system consists only of users phoning and asking if there is something wrong with the network. For those interested in something a little more advanced or automatic, there are quite a few software packages available that do some form of monitoring. They range from the very simple (ping tests, etc.) to the very complex (large commercial packages that map networks, configure devices, and make your lunch), with various alternatives in between.
In the realm of ``simple'' monitoring software, packages can generally be divided into two types:
A couple of years ago it occurred to me that there was another class of monitoring that didn't seem to be very well addressed - low to medium priority tracking of certain parameters and their values on computing systems and network devices. For example, you might want to track configuration changes, system or device reboots, or network interface status or change time. These are things that you may want to know, but which don't necessarily indicate an immediate problem and which may not be worth waking anyone up to investigate.
Thresh was created to provide this kind of monitoring. It tracks SNMP variables, compares them to threshold, pre-set, or last-observed values, and reports (typically via email) unexpected changes or out of range values.
I've become convinced that, ugly as it may sometimes seem, the Simple Network Management Protocol (SNMP) should be the basis for just about every monitoring system. Virtually every network connected device either comes with an SNMP agent or can run one with only a modest amount of effort. Most SNMP agents can provide a vast amount of (usually) useful information, and most agents for general purpose computers can be extended to provide just about any data that you might want to have.
Some monitoring systems (such as Big Brother and Spong) rely, to a greater or lesser extent, on separate client agents, running on each system that needs to be monitored, with a system-specific reporting protocol between the agent and the management station. This approach can be somewhat limiting (it's hard to use on things like networking equipment for example), can result in some duplication of services (if you need an SNMP agent for other purposes), and limits both what you can do (they're often not extensible), and where (as your firewall may not be able to pass the particular protocol implemented by the software). Thresh avoids these kinds of problems by using only SNMP for communication. [Note 1]
It's often useful to be able to track certain SNMP variables, but you don't always need to know about changes immediately - it's often good enough to hear about them the next time you read your mail. For example, the system.sysUpTime.0 SNMP variable gets reset every time a system or device gets rebooted - if you get notified every time that variable resets, you'll know if you've got a device reliability problem (or an extension cord that people keep tripping over). Similarly, you can generate disk capacity threshold warnings, network bandwidth warnings, network interface up/down notices, and so on.
With many monitoring systems, this kind of low-priority information can be hard to generate. Many of the most common freely available packages tend to have only a small number of notification or alert mechanisms, and are built with the idea that you're monitoring vital services and networks - sometimes everything is assumed to be an emergency.
Thresh was created to address this kind of medium-level monitoring need.
There are a number of commercial monitoring systems available. Some, such as Spectrum, HP OpenView, and NetCool, are widely deployed and very well respected, and most of them can do (or can be made to do) most or all of what thresh does. However, the commercial packages tend to require a much larger monitoring infrastructure, and a much larger commitment of time and money to implement, operate, and maintain. For small to medium sized sites, small, simple monitoring tools, like thresh, are often the best choice.
Thresh was implemented in Tcl, using the Tnm network management extensions provided by the Scotty/Tkined software. The Tnm extensions are a toolkit of Tcl procedures that make it easy to perform SNMP operations.
Tcl was chosen because it is well suited to this kind of task, and the Tnm extensions provided just the right functionality. At the time thresh was first contemplated, the SNMP modules for Perl were relatively primitive when compared to Tcl and Tnm.
Thresh has benefitted from the use of a scripting language, and the string and array manipulation routines provided by Tcl. Tcl allows the use of an interactive ``shell'' for testing and development. Being a traditionalist, I tended to develop incrementally, using the well-known edit, run, repeat cycle, rather than a more ``modern'' approach to program development. Thresh is currently about 700 lines of Tcl.
In retrospect, I'm still glad to have chosen Tcl in preference to Perl, C, awk, or Visual Basic.
One of the design and implementation goals for thresh was simplicity. That goal has been addressed in the following ways:
Thresh's configuration is ``data-directed'' [Note 3] - it is configured using a hierarchy of directories and configuration files that is intended to reflect organizational structure and DNS naming conventions.
The default configuration assumption is that each directory in the
configuration hierarchy represents an element of the DNS name of the
devices being monitored. For example, the sub-directory named
com/whizbang/admin/printer1 would usually contain the thresh
configuration for the device with the DNS name
printer1.admin.whizbang.com. The name configuration directive
makes it easy to override this default behaviour, by setting the DNS
domain or node name associated with a particular directory.
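The default directory-to-name mapping can be sketched in a few lines. This is a Python illustration rather than thresh's own Tcl code, and the function name dns_name_for is hypothetical. It assumes the path is given relative to the top of the configuration hierarchy and that no name directive overrides the default.

```python
from pathlib import PurePosixPath


def dns_name_for(relative_dir: str) -> str:
    """Map a directory below the top of the hierarchy to its default DNS name.

    Assumes no `name` setting overrides the default behaviour.
    """
    parts = PurePosixPath(relative_dir).parts
    # Each deeper directory is a more specific DNS label, so the path
    # components appear in the DNS name in reverse order.
    return ".".join(reversed(parts))


print(dns_name_for("com/whizbang/admin/printer1"))
```

The top of the hierarchy itself (an empty relative path) maps to the empty name.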
Each directory may contain a DEFAULTS file, which sets the
various configuration variables (such as name,
notifier, delay, community, etc.) which
control thresh's behaviour. Thresh configuration variable names are
case sensitive, and all include only lower case letters. Settings in a
DEFAULTS file are in effect for that node and those lower in
the hierarchy, unless overridden by a lower DEFAULTS file. [Note 4]
Thresh configuration variables can also be set on the command line, in
which case they override any other settings for the given variables.
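The precedence rules above amount to layering three sources of settings. The following Python sketch is only an illustration of that ordering (thresh itself is Tcl). The function name and the sample values are hypothetical, and it deliberately ignores the special handling of name, which is extended at each directory level.

```python
def effective_settings(builtin, defaults_chain, command_line):
    """Combine settings in increasing order of precedence.

    builtin        -- dict of built-in default values
    defaults_chain -- list of dicts, one per DEFAULTS file, ordered from the
                      top of the hierarchy down to the current directory
    command_line   -- dict of varname=value settings from the command line
    """
    settings = dict(builtin)
    for level in defaults_chain:
        # A lower DEFAULTS file overrides anything set higher up.
        settings.update(level)
    # Command line settings override everything else.
    settings.update(command_line)
    return settings


result = effective_settings(
    {"timeout": "5", "verbose": "false"},                          # hypothetical built-ins
    [{"community": "hello", "timeout": "20"}, {"community": "secret"}],
    {"verbose": "true"},
)
print(result)
```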
Figure 1 shows a sample DEFAULTS file.
verbose = true
name = mydomain.net
community = hello
mib = /usr/local/mibs/ascend.mib
# big network, long timeout
timeout = 20
notifier = threshmail jsellens
syslog = local1.info
Thresh's configuration variables include:
Setting ignore = *, for example, ends up ignoring the data hierarchy below a given point.
All other files in a directory are expected to contain a list of SNMP variables to monitor for that particular device, with comparison indicators and expected or threshold values. Thresh currently lacks a file inclusion mechanism, but the use of symbolic links makes it easier to manage the configurations for multiple, similar devices. A sample configuration file is shown in Figure 2.
# this is a comment
S system.sysDescr.0
S system.sysContact.0
I system.sysUpTime.0
C interfaces.ifTable.ifEntry.ifDescr.5
C interfaces.ifTable.ifEntry.ifAdminStatus.5
C interfaces.ifTable.ifEntry.ifOperStatus.5
G ucdavis.memory.memTotalReal.0 90000
G enterprises.ucdavis.memory.memAvailReal.0 3000
L loadTable.laEntry.laLoad.1 1.20
L loadTable.laEntry.laLoad.2 1.50
L loadTable.laEntry.laLoad.3 2.00
V snmp.snmpInPkts.0
The SNMP variable names used in a configuration file can be any string that Scotty will recognize as a particular MIB variable. They can be fully-qualified names, such as iso.org.dod.internet.mgmt.mib-2.system.sysUpTime.0, unique substrings with common prefix elements removed, as in system.sysUpTime.0, or SNMP object identifiers (OIDs), such as 1.3.6.1.2.1.1.3.0. OIDs aren't always the best choice, as they are typically somewhat more cryptic for the casual reader.
The first letter on each configuration line, C, G,
I, L, S, or V, indicates the
comparison to be made:
The final field on some lines is the threshold value to compare against - a threshold value is required for G and L, optional for S, and not allowed for C, I and V. If a value for an S comparison is not provided in a configuration file, the first value for that SNMP variable retrieved from the device is saved and used as the ``normal'' value for the variable.
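To make the line format and the value rules concrete, here is a small parser sketch in Python. It is an illustration only, not thresh's Tcl code, and parse_data_line is a hypothetical name. It assumes that comments occupy whole lines and that everything after the variable name is the value.

```python
NEEDS_VALUE = {"G", "L"}      # threshold value required
OPTIONAL_VALUE = {"S"}        # value optional
NO_VALUE = {"C", "I", "V"}    # value not allowed


def parse_data_line(line):
    """Return (type, variable, value) for a data line, or None for
    blank lines and comment lines. value is None when absent."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    fields = stripped.split(None, 2)
    if len(fields) < 2:
        raise ValueError("expected a type letter and a variable: %r" % line)
    kind, variable = fields[0], fields[1]
    value = fields[2] if len(fields) == 3 else None
    if kind not in NEEDS_VALUE | OPTIONAL_VALUE | NO_VALUE:
        raise ValueError("unknown comparison type: %r" % kind)
    if kind in NEEDS_VALUE and value is None:
        raise ValueError("%s requires a threshold value" % kind)
    if kind in NO_VALUE and value is not None:
        raise ValueError("%s does not take a value" % kind)
    return kind, variable, value


print(parse_data_line("L loadTable.laEntry.laLoad.1 1.20"))
```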
The configuration mechanism has proven to be quite flexible and easy enough to deal with, though some form of file inclusion mechanism would make some configurations simpler to create and maintain.
Thresh provides a flexible mechanism for notifications. For each device described in the configuration hierarchy, if thresh determines that something needs to be reported, it formats a message and pipes it into whatever ``notifier'' program has been specified by the configuration variables. Thresh also keeps a copy of the message for internal reference when it is next run. If thresh observes the same problems (i.e., an identical notification message) the next time it is run, it will only send another notification if the specified frequency time has passed. If it sees a different set of problems when next run, it will report the complete current set of problems, regardless of whether or not the normal delay time has passed.
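The repeat-suppression rule just described can be written as a short decision function. This Python sketch is an illustration under assumptions, not the actual thresh logic: the function name is hypothetical, elapsed time is passed in as minutes, and a notification is assumed to be allowed once the elapsed time reaches the frequency value.

```python
def should_notify(message, last_message, minutes_since_last, frequency_minutes):
    """Decide whether to send a notification for one device.

    message            -- the notification text for this run ("" if nothing to report)
    last_message       -- the text saved from the previous notification, or None
    minutes_since_last -- minutes since that previous notification was sent
    frequency_minutes  -- minimum delay before repeating an identical message
    """
    if not message:
        return False            # nothing needs to be reported
    if message != last_message:
        return True             # a different set of problems: report at once
    # Identical to last time: only repeat after the delay has passed.
    return minutes_since_last >= frequency_minutes


print(should_notify("disk nearly full", "disk nearly full", 10, 60))
```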
If the syslog variable is set, thresh will also generate
syslog messages for SNMP variables that are out of range, in
addition to the normal notifications. Messages formatted for
syslog are deliberately terse, and include the node name,
comparison type, SNMP variable, the normal or threshold value, and the
Status, History and Logging
For each device in the hierarchy, thresh will create a configuration subdirectory named .thresh, as a convenient location to store the data it generates. Thresh maintains status, history and logging information, both for internal use and to make it possible to review past changes in state. It tracks the previous values of the SNMP variables (in .thresh/status_*), the last notification message sent (in .thresh/last_complaint), the date and time of last contact with the device (in .thresh/last_response), and a log of when variables were found to be outside their ``normal'' ranges (in .thresh/log_*).
The status and log files are named for the configuration files that they are related to, with a prefix of status_ or log_ added to the configuration file's name.
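As a quick illustration of this naming scheme, the sketch below derives the state file locations for a given configuration file. It is Python rather than thresh's Tcl, the function name is hypothetical, and the configuration file name "basic" in the example is made up.

```python
from pathlib import PurePosixPath


def state_files(config_file):
    """Return the .thresh file paths associated with one configuration file."""
    path = PurePosixPath(config_file)
    state_dir = path.parent / ".thresh"
    return {
        # Per-configuration-file status and log, named with a prefix.
        "status": str(state_dir / ("status_" + path.name)),
        "log": str(state_dir / ("log_" + path.name)),
        # Per-device files, shared by all configuration files in the directory.
        "last_complaint": str(state_dir / "last_complaint"),
        "last_response": str(state_dir / "last_response"),
    }


print(state_files("com/whizbang/admin/printer1/basic")["status"])
```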
Currently, there are no really interesting ways to access and use this data - you're more or less stuck with using some paginator program to view the files. I should note that there is no built-in mechanism for log file rotation.
Thresh can be used effectively as a standalone, isolated
monitoring tool, but it can also be integrated with other logging or
reporting systems. Thresh's core functionality is polling SNMP
variables, comparing against pre-determined thresholds, and generating
messages for distribution. Integration with other tools can be
accomplished in two ways:
Thresh can be configured to record out of range values to syslog, which provides an easy interface to any existing syslog watcher.
Custom Notifiers
By setting the notifier configuration variable, thresh's alert messages can be trivially piped through arbitrary custom processes that can record, mail, or dispatch them as appropriate.
The message format is fairly consistent, simple, and relatively easy to parse. Additionally, the internal thresh code which actually generates the messages would be easy to change if a specific output format was required. [Note 5]
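A custom notifier only needs to read the message from its standard input. The following Python sketch shows one hypothetical notifier core that records each message under a timestamp header. The function name and the header layout are assumptions for illustration, and the message is treated as opaque text rather than parsed.

```python
def record_notification(source, sink, timestamp):
    """Copy one notification message from source to sink under a header line.

    source and sink are file-like objects; timestamp is any string.
    Returns the number of message lines recorded (0 if the input was empty).
    """
    message = source.read()
    if not message.strip():
        return 0                      # nothing arrived, so record nothing
    sink.write("--- %s ---\n" % timestamp)
    # Make sure the recorded message ends with a newline.
    sink.write(message if message.endswith("\n") else message + "\n")
    return len(message.splitlines())
```

In a real notifier script, source would be sys.stdin and sink a file opened in append mode, and the script's path would be given as the value of the notifier variable.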
Thresh is not arbitrarily scalable to huge numbers of devices and variables being monitored. Beyond a certain point, you will start to run into problems such as:
Installation of thresh is very straightforward.
The last step is admittedly somewhat more complicated and time-consuming than the others, but that is pretty much unavoidable. The distribution will include some sample configurations.
There are a number of enhancements to thresh that should probably be made:
Thresh seems to serve to illustrate a few useful lessons:
Some of these also illustrate problems, the most obvious being that you probably need more than one monitoring tool, since no one tool is likely to do everything that you need done.
Generally, thresh has proven useful and fits nicely into an effective monitoring toolkit.
Thresh is ``freely'' available through https://thresh.sourceforge.net/ or https://www.generalconcepts.com/resources/.
John Sellens is the General Manager for Certainty Solutions in Canada, based in Toronto. (Certainty Solutions was previously known as GNAC - Global Networking and Computing.) Prior to joining Certainty Solutions, he was Director of Network Engineering at UUNET Canada, and was a system administrator at the University of Waterloo for 11 years. He has a master's degree in Computer Science from the University of Waterloo, is a Chartered Accountant, and is a semi-regular contributor to ;login:. John, Joanne, and their delightful children live in Unionville, Ontario. Contact him at firstname.lastname@example.org.
J. Case, M. Fedor, M. Schoffstall, and J. Davin, A Simple Network Management Protocol (SNMP), Network Working Group, May 1990. RFC 1157, STD 15.
J. Schönwälder and H. Langendörfer, ``Tcl Extensions for Network Management Applications,'' in Tcl/Tk Workshop, pp. 279-288, USENIX and Unisys, Inc., Toronto, Canada, July 6-8, 1995.
John K. Ousterhout, ``Tcl: An Embeddable Command Language,'' in USENIX Conference Proceedings, pp. 133-146, USENIX, Washington, D.C., January 22-26, 1990.
Sean MacGuire and Robert-Andre Croteau, Big Brother FAQ. https://www.bb4.com/
Jeff R. Allen, ``Driving by the Rear-View Mirror: Managing a Network with Cricket,'' in First Conference on Network Administration (NETA '99), pp. 1-10, USENIX, Santa Clara, California, April 7-10, 1999.
Vikas Aggarwal, NOCOL - Network Operation Center On-Line. https://www.netplex-tech.com/software/nocol/
Stephen L. Johnson, Spong - Systems and Network Monitoring. https://spong.sourceforge.net/
Tobias Oetiker, ``MRTG - The Multi Router Traffic Grapher,'' in Twelfth Systems Administration Conference (LISA '98), p. 141, USENIX, Boston, Massachusetts, December 6-11, 1998.
Aprisma Technologies Inc., Spectrum Network Monitoring and Management System. https://www.aprisma.com/
Hewlett-Packard Company, OpenView Monitoring and Management Software. https://www.openview.hp.com/
Micromuse Inc., Netcool Monitoring and Reporting Suite. https://www.micromuse.com/
University of California, Davis, UCD-SNMP distribution. https://ucd-snmp.ucdavis.edu/
thresh - a data-directed SNMP threshold poller
thresh [ varname=value ... ]
thresh is a data-directed SNMP threshold poller, and uses the file system for configuration, status, and logging. Each host or device to be monitored is configured in a separate directory, using files listing SNMP variables, values, and a comparison indicator. In normal operation, thresh starts scanning a data hierarchy (as described in threshdata (5)) at a particular directory (set by the topdir variable), reading DEFAULTS files, variable files, querying hosts and devices, and recording and reporting the results. thresh variables, as described in threshvars (5), and set in DEFAULTS files or on the command line, change thresh's default behaviour and notification mechanisms. Any varname=value settings on the command line override both the built-in defaults and the settings in any DEFAULTS files encountered during processing. thresh would typically be called periodically by cron (8).
In normal use:
% thresh
To use a non-default start directory:
% thresh topdir=/some/other/place
To traverse the data hierarchy and provide information on what would normally be queried:
% thresh walkonly=true
To do almost nothing:
% thresh 'ignore=*'
thresh is written in Tcl (n), using scotty (1) and the Tnm (n) network management extensions.
thresh uses just about any file and directory names. The name .thresh is reserved for naming the subdirectories used by thresh to store status and logging information. Any files matching .thresh/log_* are log files, which will grow without bound, and which you should arrange to rotate, archive, or truncate periodically.
thresh currently only works with SNMP V1. The logging to syslog (3) should be internalized in some way, and not depend on logger (1). There should be some mechanism for "including" one file from another, to reduce the dependence on symbolic links for sharing files. thresh is unlikely to scale to handle arbitrarily large networks.
threshdata - thresh data hierarchy description
The thresh(1) SNMP poller uses a configuration hierarchy to direct its actions, maintain its status information, and store its logs. Each directory in the configuration hierarchy (under the topdir directory) is assumed to relate to a network host or device, or to an intermediate name in a DNS naming hierarchy.
By default, the topdir directory is assumed to refer to a device named "" - the empty string. Each directory below topdir normally adds one more element on the left hand end of a DNS name. For example, below topdir, the directory org/usenix/conference is related to the DNS sub-domain "conference.usenix.org", and the directory org/usenix/conference/ts1 is related to the device "ts1.conference.usenix.org". This naming relation can be overridden by the use of the name variable.
Each directory may contain a DEFAULTS file, which contains variable settings (see threshvars (5)) that apply to that directory, and to all directories below that point, unless overridden on the command line or by other, lower, DEFAULTS files. Any other files found in a directory (other than those ignored by the baseignore and ignore variables) are assumed to contain a list of SNMP variables to monitor. thresh uses sub-directories named .thresh to store status and log information.
thresh data files consist of zero or more lines in the following format:
<ws>type<ws>variable-or-OID<ws>value<ws>
<ws># comment ... <ws>
where "<ws>" indicates white space. The data file elements are defined as follows:
type A single capital letter indicating the comparison to be made in determining "normal".
C Changeable - the variable's value may change, but should be reported each time it changes. This is useful for semi-static data, or for monitoring things such as device interface status changes.
G Greater than - the variable is reported if its current value is less than or equal to the specified value.
I Increasing - the variable is reported if its current value is less than its previous value. This is handy for watching for reset times, such as the "system.sysUpTime.0" variable resetting when a device reboots.
L Less than - the variable is reported if its current value is greater than or equal to the specified value.
S Static - the variable is reported if its current value is not equal to the specified value. If no value is specified, then it is compared against the first-retrieved value of the variable. This is useful for monitoring things that should never change, such as "system.sysName.0".
V Variable - the value can be anything, but it is queried and tracked to allow for later investigation or review.
variable-or-OID An SNMP variable name (or name fragment that scotty (1) can interpret) or SNMP OID (e.g. 1.3.6.1.2.1.1.3.0) to be monitored.
value A value for the comparison. Required for G and L, optional for S, and not allowed for C, I, and V.
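The comparison types defined above can be summarized as one decision function. This is a Python sketch for illustration, not the Tcl code in thresh, and is_reportable is a hypothetical name. It assumes that values for G, I, and L have already been reduced to plain numbers, that for S the caller supplies either the configured value or the first-retrieved value, and that a C or I variable with no previous value (a first run) is not reported.

```python
def is_reportable(kind, current, previous=None, threshold=None):
    """Return True if the current value should be reported as not "normal"."""
    if kind == "C":   # Changeable: report each change
        return previous is not None and current != previous
    if kind == "G":   # normal is greater than the threshold
        return float(current) <= float(threshold)
    if kind == "I":   # normal is non-decreasing
        return previous is not None and float(current) < float(previous)
    if kind == "L":   # normal is less than the threshold
        return float(current) >= float(threshold)
    if kind == "S":   # Static: must equal the expected value
        return current != threshold
    if kind == "V":   # Variable: tracked, never reported
        return False
    raise ValueError("unknown comparison type: %r" % kind)


# Available memory of 2500 against a "G ... 3000" line is out of range.
print(is_reportable("G", "2500", threshold="3000"))
```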
threshmail - mail notifier for thresh messages
threshmail recipient ...
threshmail expects notification messages from thresh (1) on its standard input, which it appropriately reformats into a mail message, and sends to every recipient given on the command line. threshmail would typically be set as the notifier in a thresh DEFAULTS file.
threshvars - configuration variables understood by thresh
thresh (1) understands and observes certain configuration variables. Those variables can be provided on the command line or in files named DEFAULTS within the thresh data hierarchy.
Variable names are case sensitive and are expected to be in lower case letters.
DEFAULTS files consist of zero or more lines in the following format:
<ws>VARNAME<ws>=<ws>VALUE<ws>
<ws># comment ... <ws>
where "<ws>" indicates white space. Values can contain embedded blanks.
Variables set by a DEFAULTS file apply at that level of the data hierarchy and below, unless overridden on the command line or in a DEFAULTS file further down the tree.
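The DEFAULTS line format is simple enough to parse in a few lines. The sketch below is a Python illustration (thresh itself is Tcl), with a hypothetical function name. It assumes that comments occupy whole lines, and it keeps every value as a string.

```python
def parse_defaults(text):
    """Parse the text of a DEFAULTS file into a dict of variable settings."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                      # skip blank lines and comments
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError("expected VARNAME = VALUE: %r" % raw)
        # Surrounding white space is dropped; embedded blanks are kept.
        settings[name.strip()] = value.strip()
    return settings


sample = """\
verbose = true
# big network, long timeout
timeout = 20
notifier = threshmail jsellens
"""
print(parse_defaults(sample))
```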
debug Generate debugging output.
Boolean. Default: true
verbose Generate informational messages.
Boolean. Default: true
walkonly Walk the data tree, describing the hierarchy, but not querying, reporting, or logging.
Boolean. Default: false
topdir The top of the data hierarchy.
name The DNS name or partial name of the device or hierarchy represented by the current directory in the data hierarchy. Gets extended by the name of each directory as thresh descends down the hierarchy, but can be overridden in a DEFAULTS file.
baseignore The base list of "glob" patterns of file and directory names to ignore in the data hierarchy. If you override this variable, make sure that you end up ignoring the .thresh status directories.
Default: . .. .* CVS RCS README README.* DEFAULTS core *.core
ignore The extended list of "glob" patterns of file and directory names to ignore in the data hierarchy. Having two variables makes it easy to augment the default list of names to ignore without overriding the base list. Note that setting ignore = *
will cause the hierarchy rooted at that location to be excluded from all processing.
prune If true, do not process further down this hierarchy if the current node is unreachable. This is essentially the equivalent of setting ignore = *
if the current node is unreachable. This is useful, for example, for limiting the error messages that are generated if a gateway router is unreachable.
Boolean. Default: false
community The SNMP V1 read community string to use to query hosts and devices.
mib Specifies a file name that contains an SNMP MIB that will immediately be read and compiled into the running program.
timeout How long to wait for a response to an SNMP get request, in seconds.
retries Number of times to retransmit an SNMP get request during the timeout interval.
notifier Pipe notification messages to this program, often a mailer or mail interface.
frequency Minimum time before sending another identical notification message, in minutes.
describe Whether or not to include a variable's MIB description field in notification messages.
Boolean. Default: true
msgformat A printf-style format string used to print notification messages, with the (cryptically named) variables smnpvar, message, complabel, compval, newlabel, newval, desc. This could use a little more sophistication.
Default: %s: %s\n %s %s\n %s %s%s\n
log Write out of spec entries to a log file.
Boolean. Default: true
logger A command like logger (1) that writes to syslog (3).
syslog A syslog facility.level pair, as accepted by the logger (1) command, such as "local1.info". If set, out of spec entries are piped to the logger command.
syslogtag Tag to use on syslog'd entries.
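The interaction of the baseignore and ignore variables can be illustrated with a short sketch. This is Python, using the standard fnmatch module as a stand-in for Tcl's glob-style matching, so edge cases may differ from what thresh actually does. The function name is hypothetical, and the base list is the default given above.

```python
from fnmatch import fnmatchcase

BASEIGNORE = ". .. .* CVS RCS README README.* DEFAULTS core *.core"


def is_ignored(name, baseignore=BASEIGNORE, ignore=""):
    """Return True if a file or directory name matches any ignore pattern."""
    # Both variables are white space separated lists of glob patterns;
    # the effective list is simply the two lists combined.
    patterns = baseignore.split() + ignore.split()
    return any(fnmatchcase(name, pattern) for pattern in patterns)


print(is_ignored(".thresh"))             # matched by the ".*" base pattern
print(is_ignored("printer1"))            # an ordinary device directory
print(is_ignored("printer1", ignore="*"))
```

Keeping the base list separate means that adding a site-specific pattern to ignore cannot accidentally stop the .thresh status directories from being skipped.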