;login: The Magazine of USENIX & SAGE

 

system and network monitoring


by John Sellens

John Sellens is Associate Director, Technical Services, with GNAC in Toronto. He is also proud to be husband to one and father to two.

<jsellens@gnac.com>

As computers have gotten smaller and networks have gotten bigger, most of us have found ourselves worrying about more and more machines and network devices. In the old days, the typical installation of a small number of central servers, a larger number of ASCII terminals, and a few point-to-point serial or network links meant that a lot of system monitoring could be handled by periodic manual inspection or a few shell scripts, cron jobs, and mail messages.

These days, when there seems to be at least one server for each possible function; when everyone has a machine with what used to be thought of as major processing power on their desk; when networks are bigger, more complicated, and smarter; and when everything from modems to printers to soda machines is network-connected, local monitoring just isn't enough. Virtually every site needs some form of distributed or network-based monitoring mechanism, if only to have some ability to keep track of the worst problems.

In this article I'll discuss what monitoring is, along with where and why you might want to use it, typical components of a monitoring system, and some criteria against which to measure different monitoring systems and tools.

This is the first article in a planned series on system and network monitoring. Future articles will examine a variety of monitoring software packages, measure them against the evaluation criteria, and attempt to discuss the pros and cons of each and identify where the software might be most appropriately used. I'll primarily be examining open source and freely available software, but I'll also try to cover some commercial packages.

And, for the benefit of those who are unfamiliar with professional concert sound-reinforcement systems, I promise that I'll try to avoid attempting weak puns about asking for more SNMP in the monitors.

What Is Monitoring?

Monitoring is primarily intended to identify what has gone wrong or is about to go wrong. In general, monitoring systems can be thought of as having four components:

  • Data collection and/or generation
  • Data logging or storage
  • Analysis, comparison, or evaluation
  • Reporting and exception alerting

Basically, you collect some data points, stash them somewhere, compare them against established limits or failure indicators, and raise a flag if something's wrong.
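
As a rough illustration of those four steps, here is a minimal Python sketch. The names Check and run_checks are hypothetical, invented for this example, and the "greater than the limit means trouble" rule is an assumption; real systems support many other kinds of comparison.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Check:
    name: str                     # hypothetical label, e.g. "disk_pct_used"
    collect: Callable[[], float]  # data collection: returns one data point
    limit: float                  # established limit for this value


def run_checks(checks: List[Check],
               log: List[Tuple[str, float]],
               alert: Callable[[str], None]) -> None:
    """Run every check once: collect, store, compare, raise a flag."""
    for check in checks:
        value = check.collect()            # 1. collect a data point
        log.append((check.name, value))    # 2. stash it somewhere
        if value > check.limit:            # 3. compare against the limit
            alert(f"{check.name}: {value} exceeds limit {check.limit}")  # 4. flag
```

Passing the log and the alert function in as arguments keeps the four stages separate, so any one of them could be swapped for something more capable.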

That is a bit of a simplification, but the basic truth is there. Fortunately, most monitoring systems are a little more sophisticated than the bare-bones description above.

Data Collection

Data collection usually takes a few different forms, but most forms can be classified as some sort of probe. Some examples of common probes are:

  • ICMP pings to indicate network connectivity
  • Simple port probes (e.g., does a TCP/IP connection to port 80 succeed?)
  • SNMP queries to determine specific states or activity levels

(SNMP is, of course, the Simple Network Management Protocol — for an introduction to SNMP, see Elizabeth Zwicky's articles in recent issues of ;login:.)
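
The simple port probe in the list above might look something like the following Python sketch, which uses only the standard socket module. The function name tcp_port_open and the three-second default timeout are assumptions made for illustration.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and attempts the connection
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers refused connections, timeouts, and name-resolution failures
        return False
```

A probe like this says only that something accepted the connection. It does not say that the service behind the port is answering requests sensibly.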

Data points can also be generated and submitted by a system or network element to the monitoring system, through SNMP traps, mail, or some other form of network connection. An obvious example of this is the use of a centralized syslog host that receives syslog messages from various hosts on the network.

Data Logging

In many cases, the collected data is logged in a fairly basic way, often through syslog or some flat file. Some systems log every data point they receive or generate; some log only the "interesting" ones.

More sophisticated logging mechanisms can make it easier to identify trends or multiple failures that are due to a single cause (such as a loss of connectivity that's due to a router or communications link failure). Logging mechanisms such as relational databases can add some complexity, but can also make certain kinds of reporting easier and more effective.

Analysis

In most cases, monitoring analysis takes the form of an immediate, realtime, good/no-good decision. For example, failed pings would normally be assumed to indicate a machine or network connection that's down, and a disk that reports 99% full may call for some attention.

Some systems can correlate multiple failures that are due to a single cause (that communications link mentioned above), and some can react or escalate after multiple consecutive failures (e.g., call your boss if the Web server doesn't respond for the third time).
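
Escalation after repeated failures can be sketched as a per-target counter of consecutive failed probes. In this hypothetical Python example the class name FailureTracker, the default threshold of three, and the three result strings are all assumptions chosen for illustration.

```python
from collections import defaultdict


class FailureTracker:
    """Count consecutive probe failures per target and decide when to escalate."""

    def __init__(self, escalate_after: int = 3):
        self.escalate_after = escalate_after
        self.consecutive = defaultdict(int)  # target -> failures in a row

    def record(self, target: str, ok: bool) -> str:
        """Return "ok", "alert", or "escalate" for this probe result."""
        if ok:
            self.consecutive[target] = 0     # any success resets the count
            return "ok"
        self.consecutive[target] += 1
        if self.consecutive[target] >= self.escalate_after:
            return "escalate"                # e.g. time to call the boss
        return "alert"
```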

The other type of analysis that is sometimes overlooked is trend reporting -- if you can notice trends while they are happening, you'll have a better chance of adding more disk space to the /news partition before you run into major problems. I haven't (yet) seen this myself, but it would be nice, in a twisted sort of way, to get an automated message noting that disk use has been increasing, and that if current trends continue, the disk will be full in 42 hours.
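
One simple way to produce that kind of prediction is to fit a straight line to recent disk-use samples and see where it crosses 100%. The Python sketch below does this with an ordinary least-squares fit. The function name hours_until_full and the sample format are hypothetical, and a straight line is a strong assumption, since real disk use is rarely that well behaved.

```python
from typing import Optional, Sequence, Tuple


def hours_until_full(samples: Sequence[Tuple[float, float]]) -> Optional[float]:
    """Estimate hours from the last sample until the disk reaches 100%.

    samples: (hours_elapsed, percent_used) pairs, oldest first.
    Returns None if there is too little data or usage is not growing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    denom = sum((t - mean_t) ** 2 for t, _ in samples)
    if denom == 0:
        return None                      # all samples taken at the same time
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / denom
    if slope <= 0:
        return None                      # flat or shrinking: no prediction
    intercept = mean_u - slope * mean_t
    t_full = (100.0 - intercept) / slope  # time at which the line hits 100%
    return max(0.0, t_full - samples[-1][0])
```

For example, three hourly samples of 56%, 57%, and 58% grow at one percentage point per hour, so the line reaches 100% 42 hours after the last sample.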

In general, better analysis gets you better results, but you'll likely pay for it in increased complexity and increased cost.

Reporting

Probably the first type of reporting that people think of in relation to monitoring systems is alpha or numeric pager messages (which of course always come at the worst possible time). But there is far more to a proper reporting system.

Reporting is generally concerned with three types of information:

  • Exceptions — problems that should be reported in some form of "alert" for investigation, action, and resolution.
  • History — specific data for specific time periods, for such uses as traffic-level and outage reporting, as well as usage or capacity-based billing.
  • Trends — aggregated (typically) data used for trend analysis and capacity planning.

In general, two styles of exception reporting are used with monitoring systems: report everything, or report only the problems. And, furthering that distinction, do you report events only as they happen, or do you report on the current state, identifying all "unresolved" issues?

How should exceptions be reported? A number of mechanisms can be used individually or in combination.

  • One-time messages, pager, email, fax, and the like. These are most useful at the time of the first identification of a given problem — getting paged every few minutes about the very problem that you're working to solve can be somewhat distracting. But in the absence of 100% reliable communication, an effective method might keep sending alerts until they are acknowledged.
  • Full-screen ASCII or Web-based status screen. These are often seen in "showplace" Network Operations Centers (NOCs) and list outstanding alerts in order of priority, or in forward or reverse chronological order.
  • Query lists that can be reviewed periodically. This is perhaps most often seen in trouble-ticket systems.
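
The "keep sending alerts until they are acknowledged" idea from the first item can be sketched as follows. The class AlertRepeater, its method names, and the five-minute repeat interval are hypothetical choices for this example. The send function is supplied by the caller and could page, mail, or fax.

```python
from typing import Callable, Dict


class AlertRepeater:
    """Resend each open alert at a fixed interval until someone acknowledges it."""

    def __init__(self, send: Callable[[str], None], repeat_every: float = 300.0):
        self.send = send
        self.repeat_every = repeat_every       # seconds between repeats
        self.pending: Dict[str, dict] = {}     # alert id -> state

    def raise_alert(self, alert_id: str, message: str, now: float) -> None:
        """Send a new alert once; repeated reports of a known problem are ignored."""
        if alert_id not in self.pending:
            self.send(message)
            self.pending[alert_id] = {"message": message, "last_sent": now,
                                      "acked": False}

    def acknowledge(self, alert_id: str) -> None:
        """Stop the repeats; the problem stays open until resolved."""
        if alert_id in self.pending:
            self.pending[alert_id]["acked"] = True

    def resolve(self, alert_id: str) -> None:
        """Forget the alert entirely once the problem is fixed."""
        self.pending.pop(alert_id, None)

    def tick(self, now: float) -> None:
        """Call periodically: resend unacknowledged alerts that are due."""
        for state in self.pending.values():
            if not state["acked"] and now - state["last_sent"] >= self.repeat_every:
                self.send(state["message"])
                state["last_sent"] = now
```

Keeping acknowledged alerts in the pending table, rather than deleting them, means a problem you are already working on does not page you again as a "new" one.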

The key, regardless of reporting mechanism, is that exceptions must be reported on a timely basis, and that there should be some form of tracking mechanism so that unresolved problems are less likely to get lost or moved to the bottom of the to-do list.

Historical-data reporting is often built specifically for a particular intended purpose, with reports tailored to provide the needed information in the most effective manner possible. This is often used for client usage-based billings, or periodic outage or performance reports. Historical data is often needed only for a limited time, after which it can be archived or deleted.

Trend reporting is often best understood when visualized in some graphical format, making it easy to see where things are headed. The data used for trend reporting can often be aggregated as time passes. For example, you may want to have complete detailed data collected every five minutes for network bandwidth usage for the past week, but you may only want something like monthly average, maximum, and minimum peak usage for last year. This means that a lot less data needs to be stored and analyzed to generate trend reports.
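
The aggregation step might be sketched like this in Python: detailed samples go in, and one average, maximum, and minimum per calendar month come out. The function name monthly_summary and the (timestamp, value) sample format are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, Tuple


def monthly_summary(
    samples: Iterable[Tuple[datetime, float]]
) -> Dict[Tuple[int, int], Dict[str, float]]:
    """Reduce detailed samples to average, maximum, and minimum per month."""
    buckets = defaultdict(list)
    for timestamp, value in samples:
        buckets[(timestamp.year, timestamp.month)].append(value)
    return {
        month: {"avg": sum(values) / len(values),
                "max": max(values),
                "min": min(values)}
        for month, values in buckets.items()
    }
```

At one sample every five minutes, a 30-day month holds 8,640 data points per measured value. After aggregation it holds three.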

Where and Why Would You Use Monitoring?

Quite simply, you should use some form of system and network monitoring any time you have a system or network that someone cares about or relies on. Your particular circumstances will dictate what level of monitoring you will need and what style of monitoring system or systems will best meet your needs.

If you operate a nontrivial network, you'll probably want to monitor your routers, switches, and communication links for such things as bandwidth utilization, CPU and memory utilization, and interface-state changes. If you operate a collection of servers or a Web-hosting farm, you'll want to monitor things such as uptime, service availability, disk space and memory utilization, print queues, and network connectivity.

A properly functioning and configured monitoring system, with appropriate alert-dispatch mechanisms, is one more tool to help you identify and correct problems before your users and customers identify them for you.

Even in the absence of utilization-based billing and formal service-level agreements, historical and trend data can be very useful, providing evidence of the superior quality and availability of your systems and helping to provide additional data in support of your requests for additional equipment or personnel.

How Are Monitoring Systems Built?

Most monitoring systems are built from components, either real or virtual. Conceptually, most systems can be characterized as having the four components discussed above.

Monitoring systems come with some number of probes, either hard-coded into a larger program or implemented as separate programs communicating through some API or data-interchange format. There may be as few as two or three different probes or as many as 100 or more included with a particular monitoring system. Probes typically deal with a single type of query, sometimes restricted to dealing with equipment from a particular vendor.

There is often some form of "trap manager." An SNMP "trap" is a message sent by a device to alert a management or monitoring system of a change in state or an error condition, such as a failed component or a network interface going up or down.

Some form of configuration file or language is used to indicate which probes to direct at which devices, how often, and what the acceptable limits are for the values returned by those probes. There would also need to be some configuration information for use by the trap manager (if one exists), dictating what actions to take when traps are received, based on the type of trap, originating system, or other factors. And the configuration information would also need to indicate how to send alerts for the various types of exceptions as they are identified.
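
To make that concrete, here is one hypothetical shape such configuration could take, written as plain Python data, with a small helper that fills in defaults. Every name in it (the hosts, the probe names, and keys such as interval_s, max, and alert_via) is invented for illustration and is not taken from any particular monitoring package.

```python
from typing import Dict, Iterator, List, Tuple

# Settings that apply to every probe unless an entry overrides them.
DEFAULTS = {"interval_s": 300, "alert_via": ["email"]}

# Which probes to direct at which devices, how often, and with what limits.
CONFIG: Dict[str, List[dict]] = {
    "router1.example.com": [
        {"probe": "ping"},
        {"probe": "snmp_if_util", "interval_s": 60, "max": 80.0,
         "alert_via": ["pager"]},
    ],
    "www.example.com": [
        {"probe": "tcp_port", "port": 80},
        {"probe": "disk_pct_used", "max": 90.0},
    ],
}


def effective_checks(config: Dict[str, List[dict]],
                     defaults: dict) -> Iterator[Tuple[str, dict]]:
    """Yield (device, settings) for each configured probe, defaults filled in."""
    for device, probes in config.items():
        for entry in probes:
            # per-probe keys win over the defaults
            yield device, {**defaults, **entry}
```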

Beyond that, there is some form of logging, and some interface to tools for dispatching alerts (e.g., by mail, or calls to a system that sends messages to pagers), and various reporting tools. The reporting tools can range from rudimentary or nonexistent to sophisticated graphical interfaces with tools for filtering alerts prior to display, problem assignment, and so on.

In operation, a typical monitoring system becomes a series of interconnected "event loops," periodically probing, recording, dispatching, and resolving alerts, and logging data points for historical or trend reporting.

Evaluation Criteria

In future articles in this series I'll review various monitoring systems and software packages, and I'm going to use some or all of the following criteria as the basis for describing and evaluating those systems.

Size and Complexity

Monitoring systems range from very small (simpleminded ping tests that generate mail messages on failure) to very large (multiple probe machines, report generators, display and dispatch engines, and a high-availability database engine). Different organizations will need (and be able to cope with) different-sized monitoring systems.

Scalability

While closely tied to size and complexity, scalability is important if your network contains more than a small number of devices, or if you expect ongoing growth. A process or mechanism that works well against 20 or 30 machines can fail miserably in all sorts of interesting and painful ways when used with 200 or 2,000 machines.

Reliability

A monitoring system that is prone to failure, or that misreports failures or problems in other devices, can be worse than no monitoring at all.

Cost

Monitoring-system implementations can range from almost free (an hour's worth of work) to several hundreds of thousands of dollars (hardware, software, consulting, maintenance). My personal knee-jerk preference is for freely available software, but there are many instances where commercial monitoring products are the only reasonable way to go.

Number and Type of Probes

Can the monitoring system handle typical devices and, more specifically, your devices? Can it be easily extended to handle other devices that only you have?

Configuration Complexity and Flexibility

A configuration format that can be machine-generated (from, say, an existing database), that can express a hierarchy of connectivity (so that you don't try to contact all the machines behind a gateway if the gateway is down), and that has reasonable defaults and per-device customization is a very good thing to have. And if it's understandable and easy to maintain too, then so much the better.
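
The hierarchy-of-connectivity idea can be sketched with a table that maps each device to the device it sits behind. In this hypothetical Python example the function name check_with_dependencies and the three status strings are assumptions, and the table is assumed to be loop-free.

```python
from typing import Callable, Dict, Optional


def check_with_dependencies(parents: Dict[str, Optional[str]],
                            probe: Callable[[str], bool]) -> Dict[str, str]:
    """Probe each device unless something upstream of it is already down.

    parents: device -> the device it sits behind (None if nothing upstream).
    probe:   returns True if the device answers.
    Returns device -> "up", "down", or "skipped".
    """
    status: Dict[str, str] = {}

    def visit(device: str) -> str:
        if device in status:
            return status[device]
        parent = parents.get(device)
        if parent is not None and visit(parent) != "up":
            status[device] = "skipped"   # unreachable anyway: don't probe it
        else:
            status[device] = "up" if probe(device) else "down"
        return status[device]

    for device in parents:
        visit(device)
    return status
```

With a table like this, a failed gateway produces one "down" result for the gateway itself instead of a flood of failures for every machine behind it.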

Exception-Reporting Style

Does the monitoring system always display full status information for every device or service being monitored (which doesn't scale well), does it display only those devices that have unresolved exceptions, or does it offer the user a choice?

Exception-Reporting Tools

How are exceptions reported? Are there pager, email, fax, and Web interfaces? Is there a command-line interface, or a curses-based text-terminal display? Is it easy to add additional exception-reporting interfaces?

Logging and Data Storage

How is data logged — via syslog, flat files, a database? Are there mechanisms for data aggregation, de-duplication, and pruning?

Reporting Mechanisms

Are there suitable and flexible reporting mechanisms for historical and trend data? Is there a defined interface or tools to allow custom reporting?

What Next?

That pretty much covers my introduction to system and network monitoring. In the next article in this series, I'll review one of the many popular and easily available monitoring packages. Let me know your favorite, and I'll attempt to explain how, even though it may be the best one for you and it suits your 10base2-based network to a T-connector, it probably won't fit anyone else's situation (except that of the software's author, of course).

 

Last changed: 20 nov. 2000 ah