Computer System Performance Problem Detection Using Time Series Models

Peter Hoogenboom and Jay Lepreau
University of Utah


Computer systems require monitoring to detect performance anomalies such as runaway processes, but problem detection and diagnosis is a complex task requiring skilled attention. Human attention was never ideal for this task, and it falls far short as networks of computers grow larger and their interactions more complex. Existing computer-aided management systems require the administrator to specify fixed "trouble" thresholds manually. In this paper we report on an expert system that automatically sets thresholds and then detects and diagnoses performance problems on a network of Unix computers. Key to the success and scalability of this system are the time series models we developed to capture the variations in workload on each host. Analysis of the load average records of 50 machines yielded models that, for workstations with simulated problem injection, show false positive and false negative rates of less than 1%. Even the server machines most difficult to model gave average false positive/negative rates of only 6%/32%. Observed values exceeding the expected range for a particular host cause the expert system to focus on that machine. There it applies tools with finer resolution and more discrimination, including per-command profiles gleaned from process accounting records. It makes one of 18 specific diagnoses and notifies the administrator, and optionally the user [a].
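
The idea of a per-host expected range, as opposed to a fixed threshold, can be illustrated with a minimal sketch. The abstract does not specify the form of the time series models, so the code below assumes a much simpler stand-in: a mean and standard deviation of the load average for each hour of the day, with an observation flagged when it exceeds the mean by more than k standard deviations. The function names, the parameters k and floor, and the synthetic history are all hypothetical and are not taken from the system described here.

```python
from collections import defaultdict
from statistics import mean, stdev


def fit_hourly_model(samples):
    """Fit a per-hour-of-day mean and standard deviation for one host.

    samples: iterable of (hour_of_day, load_average) pairs.
    Returns a dict mapping hour -> (mean, stdev).
    This is an illustrative stand-in, not the paper's model.
    """
    by_hour = defaultdict(list)
    for hour, load in samples:
        by_hour[hour].append(load)

    model = {}
    for hour, loads in by_hour.items():
        # stdev needs at least two points; fall back to 0.0 otherwise.
        spread = stdev(loads) if len(loads) > 1 else 0.0
        model[hour] = (mean(loads), spread)
    return model


def is_anomalous(model, hour, load, k=3.0, floor=0.5):
    """Return True when load exceeds the expected upper bound for that hour.

    k     : assumed width of the band, in standard deviations.
    floor : assumed minimum margin, so that a very steady history
            does not flag tiny fluctuations.
    """
    if hour not in model:
        return False  # no history for this hour, so no basis to flag
    mu, sigma = model[hour]
    threshold = mu + max(k * sigma, floor)
    return load > threshold


# Synthetic two-week history: busier during working hours, quiet otherwise.
history = []
for day in range(14):
    for hour in range(24):
        base = 1.0 if 9 <= hour < 17 else 0.2
        history.append((hour, base + 0.1 * (day % 3)))

model = fit_hourly_model(history)

print(is_anomalous(model, 3, 1.1))   # True: high for 3 a.m. on this host
print(is_anomalous(model, 14, 1.1))  # False: ordinary for mid-afternoon
print(is_anomalous(model, 14, 4.0))  # True: high even for mid-afternoon
```

The same load average of 1.1 is flagged at 3 a.m. but not at 2 p.m., which is the behavior a single fixed threshold cannot provide. In the system described above, such a flag would only trigger closer examination of the host; it would not by itself constitute a diagnosis.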

