Computer System Performance Problem Detection Using Time Series Models
Peter Hoogenboom and Jay Lepreau
University of Utah
Computer systems require monitoring to detect performance anomalies
such as runaway processes, but problem detection and diagnosis is a
complex task requiring skilled attention. Although human attention
was never ideal for this task, as networks of computers grow larger
and their interactions more complex, it falls far short. Existing
computer-aided management systems require the administrator to
specify fixed "trouble" thresholds manually. In this paper we report on an
expert system that automatically sets thresholds, and detects and
diagnoses performance problems on a network of Unix computers. Key to
the success and scalability of this system are the time series models
we developed to model the variations in workload on each host.
Analysis of the load average records of 50 machines yielded models
that show, for workstations with simulated problem injection, false
positive and false negative rates of less than 1%. Even the server
machines most difficult to model gave average false positive and false
negative rates of only 6% and 32%, respectively. Observed values
exceeding the expected range for a
particular host cause the expert system to focus on that machine.
There it applies tools with finer resolution and more discrimination,
including per-command profiles gleaned from process accounting
records. It makes one of 18 specific diagnoses and notifies the
administrator, and optionally the user [a].
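
The idea of an automatically set, per-host threshold can be illustrated with a minimal sketch. The abstract does not specify the form of the time series models, so the code below swaps in a deliberately simple stand-in: an hour-of-day mean and standard deviation of the load average, with the expected range taken as the mean plus or minus k standard deviations. The function names, the (hour, load) sample layout, and the default k = 3 are all assumptions made for illustration and are not taken from the paper.

```python
from collections import defaultdict
from statistics import mean, stdev


def fit_hourly_ranges(samples, k=3.0):
    """Build an expected load range per hour of day for one host.

    samples: iterable of (hour_of_day, load_average) pairs (assumed layout).
    Returns {hour: (low, high)} with low clamped at 0, since a load
    average cannot be negative.
    """
    by_hour = defaultdict(list)
    for hour, load in samples:
        by_hour[hour].append(load)

    ranges = {}
    for hour, loads in by_hour.items():
        mu = mean(loads)
        # stdev needs at least two points; fall back to zero spread.
        sigma = stdev(loads) if len(loads) > 1 else 0.0
        ranges[hour] = (max(0.0, mu - k * sigma), mu + k * sigma)
    return ranges


def is_anomalous(ranges, hour, load):
    """True when load lies outside the expected range for that hour.

    Hours with no history are never flagged (an illustrative choice).
    """
    if hour not in ranges:
        return False
    low, high = ranges[hour]
    return load < low or load > high


def error_rates(flags, truth):
    """Return (false_positive_rate, false_negative_rate).

    flags: detector output per observation; truth: whether a problem
    was actually present (for example, an injected one).
    """
    pairs = list(zip(flags, truth))
    normal = sum(1 for _, t in pairs if not t)
    problems = sum(1 for _, t in pairs if t)
    false_pos = sum(1 for f, t in pairs if f and not t)
    false_neg = sum(1 for f, t in pairs if not f and t)
    fp_rate = false_pos / normal if normal else 0.0
    fn_rate = false_neg / problems if problems else 0.0
    return fp_rate, fn_rate
```

In this sketch, a host whose observed load average falls outside its own range for the current hour would be the one handed to finer-grained tools, and error_rates shows how false positive and false negative rates of the kind quoted above could be computed from runs with known injected problems.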