Next: Using Rank Correlation for Up: Analyzing System Logs: A Previous: Introduction


Ranking Messages by Exceptionality

Given a system log of a computer system, we generate a summarized, ranked view of this log. By displaying a much shorter view of the log, with messages ordered by their importance to the user, this view helps administrators and support personnel identify and diagnose problems in the computer system more effectively. A tool based on the described methods is being used to aid support personnel in the IBM xSeries support center.

To generate the ranked log view from the original log of a computer system, we first group the messages in the original log into mutually exclusive sets that correspond to message types. A message type is characterized by a base string that generates all the messages of this type, possibly with different parameters. Grouping into types is trivial if the original log specifies the source and a unique identification of each message, as in the Windows Event Log. Identifying message types without this information is a challenge that we do not address in this paper (see [10] and [7] for some approaches to this problem). We henceforth refer to messages of the same type as instances of the same message, though the string parameters may differ between instances.

In the ranked log view, a single line is displayed for each message type that appeared in the original log. This line lists the number of message instances, the largest common string pattern of the message instances, and the time range in which the message instances appeared. A rank is assigned to each message type, and the lines are sorted in order of rank.

Our ranking method is based on the premise that a message in a system log is more important to the user if it has more instances in the log than is expected for this computer system. This reflects the observation that although many computer systems may have some problems reported in their system logs, it would usually not be the same problem in all systems.
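As a concrete sketch, the grouping and summarization step might look as follows in Python. The input format (records as `(type_id, timestamp, text)` tuples) and the use of a longest common prefix as the shared string pattern are simplifying assumptions for illustration, not the paper's exact implementation:

```python
import os.path
from collections import defaultdict

def ranked_view_rows(records):
    """Build one summary row per message type.

    `records` is assumed to be an iterable of (type_id, timestamp, text)
    tuples, where type_id uniquely identifies the message type (e.g. the
    source plus event ID in a Windows Event Log).
    """
    groups = defaultdict(list)
    for type_id, ts, text in records:
        groups[type_id].append((ts, text))
    rows = []
    for type_id, instances in groups.items():
        times = [ts for ts, _ in instances]
        texts = [text for _, text in instances]
        rows.append({
            "type": type_id,
            "count": len(instances),
            # Crude stand-in for the largest common string pattern:
            "pattern": os.path.commonprefix(texts),
            "first": min(times),   # start of the time range
            "last": max(times),    # end of the time range
        })
    return rows
```

Each row carries exactly the fields shown in a ranked-log-view line: instance count, common pattern, and time range; ranking the rows is a separate step.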
To formalize this notion, we represent system log $i$ by a vector $\vec{c_i}=(c_i[1],\ldots,c_i[n])$, where $n$ is the number of possible message types and $c_i[m]$ is the number of instances of message $m$ in system log $i$. Also, let $P=\{p_1,\ldots,p_n\}$ be a set of cumulative distribution functions $p_m:\mathbb{N}\rightarrow[0,1]$, where $p_m(c)$ is the probability that message $m$ appears $c$ or fewer times in a system log. If the probability of getting more than $c_i[m]$ instances of message type $m$ is low, then message $m$ appears more often than expected, and should therefore be ranked higher. The ranking of messages should thus approximate an ascending ordering of $(p_1(c_i[1]),\ldots,p_n(c_i[n]))$. Given a large enough dataset of system logs from actual computer systems, we can estimate $P$ by the empirical distribution $\hat{P}=\{\hat{p}_1,\ldots,\hat{p}_n\}$ of the number of instances of each message type in each system. We define the score of message type $m$ in log $i$ to be $\hat{p}_m(c_i[m])$, and use this score to rank the messages within the log. The messages that are top-ranked by this method usually indicate important problems in the system, as illustrated by the ranked log view in the table below, which was generated from one of the samples in our dataset.

Table: A ranked log view of an actual system (with shortened messages). Bold font indicates hardware problems.

Rank  Times  Source                Message
1     10     ViperW2K              The device, \Device\Tape1, has a bad block.
2     4      Oracle.cq1            Audit trail: ACTION : 'CONNECT' DATABASE USER: '/' PRIVILEGE : SYSOPER ...
3     1      SAPCQ1_20             SAP Basis System: Run-time error "TIME_OUT" occurred
4     1014   SAPCQ120              SAP Basis System: Transaction Canceled 00 158 ( )
5     1      MRxSmb                Delayed Write Failed ...may be caused by a failure of your computer hardware ...
6     8      ql2300                The device, \Device\Scsi\ql23002, did not respond within the timeout period.
7     54     DnsApi                The system failed to register pointer (PTR) resource records (RRs) for network adapter ...
8     1      Kerberos              The kerberos subsystem encountered a PAC verification failure. ...
9     1      Windows Update Agent  Installation Failure: Windows failed to install the following update with error ...
10    1      NETLOGON              The Netlogon service could not read a mailslot message from The system ...


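To make the scoring concrete, here is a minimal Python sketch of the empirical estimate: the score of message type $m$ in a target log is the fraction of dataset systems whose count for $m$ is at most the target's count, i.e. the empirical CDF value $\hat{p}_m(c_i[m])$. Sorting so that the highest scores come first mirrors the table above, where rank 1 is the most exceptional message. The function names are illustrative, not from the paper:

```python
def empirical_scores(count_matrix, target):
    """Empirical-CDF score hat{p}_m(target[m]) for every message type m.

    `count_matrix` holds one count vector c_i per system in the dataset;
    `target` is the count vector of the log being ranked.
    """
    n_systems = len(count_matrix)
    return [sum(1 for row in count_matrix if row[m] <= c) / n_systems
            for m, c in enumerate(target)]

def ranked_types(count_matrix, target):
    """Message-type indices, most exceptional (highest score) first."""
    scores = empirical_scores(count_matrix, target)
    return sorted(range(len(target)), key=lambda m: scores[m], reverse=True)
```

A count that no other system in the dataset exceeds gets score 1.0 and floats to the top of the ranked view.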
Estimating $P$ by the empirical distribution over the entire population rests on the implicit assumption that the computer systems in our dataset are homogeneous enough to be treated as generated from the same distribution. In actuality, different computer systems are used for very different purposes, and each purpose dictates a use-model that results in a different message distribution. For example, a computer system that serves as a file server is probably more likely to issue `File Not Found' messages than a personal workstation, while a personal workstation might issue more system-restart messages.

To improve the accuracy of our estimation of $P$, we group the computer systems in our dataset into sets of systems with a similar use-model, and estimate $P$ separately for each set. We group the systems using k-means clustering [1] on the system log dataset. To generate the ranked log view for a given system, we first find the cluster it belongs to, and then rank its log messages based on the estimation of $P$ for that cluster. In the following section, we present a new feature construction scheme for the system log dataset, which achieves significantly better clustering than the original feature set.
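The per-cluster estimation described above can be sketched as follows, using a plain k-means on the raw count vectors (the paper applies k-means [1], but with the improved features of the next section; all function and parameter names here are illustrative). The dataset is clustered, and the empirical estimate of $P$ for a given system is then restricted to the systems sharing its cluster:

```python
import random

def dist2(u, v):
    """Squared Euclidean distance between two count vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(vectors, k, iters=50, seed=0):
    """Plain k-means; returns the cluster index of each vector."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    assign = [0] * len(vectors)
    for _ in range(iters):
        assign = [min(range(k), key=lambda j: dist2(v, centers[j]))
                  for v in vectors]
        for j in range(k):
            members = [v for v, a in zip(vectors, assign) if a == j]
            if members:  # keep the old center if the cluster emptied
                centers[j] = [sum(col) / len(members)
                              for col in zip(*members)]
    return assign

def cluster_peers(vectors, k, i):
    """Systems in the same cluster as system i.

    The empirical distribution hat{P} used to rank system i's messages
    would be estimated from these peers only, rather than from the
    whole population.
    """
    assign = kmeans(vectors, k)
    return [v for v, a in zip(vectors, assign) if a == assign[i]]
```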
2007-03-12