Nagios: Advanced Topics
As noted sysadmin B. Knowles said, "if you liked it then you shoulda put a [monito]ring on it." John Sellens was back on Monday with another morning session -- this one focused on using Nagios to monitor just about anything. Nagios is a host and service monitor with a long history, and a longer list of uses. One example of how Nagios can be used in unexpected ways involves a rotating amber beacon plugged into an IP-enabled PDU. When an exception occurred, Nagios would execute a command that turned on the power to that particular socket on the PDU, giving a visual indication that something had gone wrong.
Of course, Nagios isn't a toy, but that example shows how flexible it can be. This flexibility comes from the split between the Nagios Core, which is basically a polling and reporting engine, and the user interface. The checks and notifications are arbitrary external commands that Nagios runs. While a rich variety of plugins is available, rolling your own is easy. If you can extract the data you want and parse it appropriately, the check is practically written. All you need is to return the correct exit code for the state (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN) and an optional string.
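To make the exit-code convention concrete, here is a minimal sketch of a home-grown check in Python. It is not from John's slides; the disk-usage metric, the thresholds, and the names evaluate() and main() are all assumptions chosen for illustration.

```python
import shutil

# Exit codes Nagios expects from a plugin.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(used_pct, warn=80.0, crit=90.0):
    """Map a disk-usage percentage to an (exit_code, status_line) pair.

    The 80/90 thresholds are illustrative defaults, not Nagios settings.
    """
    if used_pct >= crit:
        code = CRITICAL
    elif used_pct >= warn:
        code = WARNING
    else:
        code = OK
    return code, f"DISK {LABELS[code]} - {used_pct:.1f}% used"


def main(path="/"):
    """Check one filesystem, print the status line, return the exit code."""
    try:
        usage = shutil.disk_usage(path)
        code, line = evaluate(100.0 * usage.used / usage.total)
    except Exception as exc:
        # Anything we cannot measure is reported as UNKNOWN, not as OK.
        code, line = UNKNOWN, f"DISK UNKNOWN - {exc}"
    print(line)
    return code

# Installed as a plugin, the script would finish with: raise SystemExit(main())
```

The single line printed to standard output is the "optional string" that shows up in the web interface and in notifications; the return value of main() is what Nagios would read as the process exit status.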
Like any training with the word "advanced" in the title, the opening portion served to ensure everyone was at a certain base level of familiarity. John covered some of the compile-time options. Is "--enable-embedded-perl" really necessary? John argues that it's better to keep the core and plugins truly separate, though "--enable-event-broker" is a recommended option. He also offered post-install tips: remember to set up the named pipe for external commands by running 'make install-commandmode'.
The next topic was configuring Nagios through its configuration files. The main nagios.cfg allows for the definition of a configuration directory with the cfg_dir directive. Like /etc/cron.d/ or any other *.d/ directory, any number of configuration files (ending in .cfg) can be placed in this directory and will be loaded by Nagios at start time. This makes managing Nagios configuration much easier because each host and service can have a separate configuration file.
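One-file-per-host layouts also lend themselves to generation from an inventory. The following is a rough sketch, assuming a "generic-host" template already exists in your configuration (as it does in the sample files shipped with Nagios); the function names and the example address are hypothetical.

```python
from pathlib import Path


def render_host(host_name, address, template="generic-host"):
    """Return a Nagios host object definition as text.

    The template name is an assumption; use whichever host template
    your own configuration defines.
    """
    return (
        "define host {\n"
        f"    use        {template}\n"
        f"    host_name  {host_name}\n"
        f"    address    {address}\n"
        "}\n"
    )


def write_host_file(cfg_dir, host_name, address):
    """Write <cfg_dir>/<host_name>.cfg and return its path."""
    directory = Path(cfg_dir)
    directory.mkdir(parents=True, exist_ok=True)
    target = directory / f"{host_name}.cfg"
    target.write_text(render_host(host_name, address))
    return target
```

Calling write_host_file("/tmp/nagios-demo/hosts", "web01", "192.0.2.10") would produce a web01.cfg that Nagios picks up on its next start, provided nagios.cfg contains a cfg_dir line pointing at that directory.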
Configuration is very flexible. It's possible to have services not send alerts outside of business hours for low-criticality systems, or to send initial alerts to a different person depending on who is on-call that week. Services can be defined as dependent on others so that your Nagios server doesn't waste time checking services that rely on something already known to be down. Unfortunately, with large installations, configuration can be very complicated as well. Passing the -v argument to the nagios binary, followed by the path to the main configuration file, checks the configuration for errors before use.
Where Nagios really gets interesting is in large-scale deployments. The Nagios server that my group uses currently runs almost 30,000 active checks from a single server. As you can imagine, the performance is a little on the sub-optimal side. Fortunately there are ways to scale up Nagios installs. Perhaps the easiest is to simply have several servers. These can also watch each other so you solve the "who's watching the watcher?" problem.
Tools like DNX (Distributed Nagios eXecutor) can spread the load by passing some of the check-running off to satellite nodes. These nodes can come and go (or die) as they please and the Nagios service will continue to run. Using parent/child and host/service dependencies helps eliminate checks when a target is down. The "use_large_installation_tweaks" option in nagios.cfg reduces the number of fork() calls each check requires and makes cleaning up memory used by child processes more efficient.
There's much more information in John's slides, and even more in the Nagios documentation. Now if only we could all agree how to pronounce it.