Using Rank Correlation for Feature Construction
In the original featureset of our dataset, system log $i$ is represented as the message count vector $c_i$ defined above. There are $m$ message types in our dataset, hence this results in an $n \times m$ matrix for $n$ system logs. This matrix is very sparse; only about 0.6% of its entries are nonzero. The high dimensionality of the data and its sparseness make k-means clustering impractical for this representation.
Since our objective is message ranking, we propose a new feature construction scheme for the system-log dataset that measures the difference in the ranking of messages between system logs. Two well-known rank correlation measures can be used to achieve this: the Spearman rank correlation [12] and Kendall's tau rank correlation [12].
Let $x$ and $y$ be vectors of dimension $m$. Let $r^x$ and $r^y$ be the vectors of ranks for $x$ and $y$, i.e. $r^x_i = k$ if $x_i$ is the $k$'th largest number in $x$, and similarly for $r^y$.
The two correlation measures are defined as follows:

Definition 1 (Spearman Rank Correlation)
Let $d_i = r^x_i - r^y_i$. The Spearman rank correlation between $x$ and $y$ is defined by:

$$\rho_s(x, y) = 1 - \frac{6 \sum_{i=1}^{m} d_i^2}{m(m^2 - 1)} \qquad (1)$$
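As a sanity check, the Spearman correlation of Equation (1) can be computed directly from the rank vectors. Below is a minimal Python sketch, assuming all values are distinct (no tied ranks); the function names are illustrative, not part of the paper's method:

```python
def ranks(x):
    # Rank 1 for the largest entry, as in the definition above;
    # assumes all entries are distinct (no ties).
    order = sorted(range(len(x)), key=lambda i: x[i], reverse=True)
    r = [0] * len(x)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    # Eq. (1): rho = 1 - 6 * sum(d_i^2) / (m (m^2 - 1)), d_i = r^x_i - r^y_i
    m = len(x)
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (m * (m * m - 1))
```

Identically ordered vectors give $+1$ and reversed orderings give $-1$, as expected of a correlation measure.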
Definition 2 (Kendall's Tau Rank Correlation)
Let $P$ be the number of pairs $i < j$ such that both $x_i > x_j$ and $y_i > y_j$, or both $x_i < x_j$ and $y_i < y_j$ (concordant pairs). Kendall's tau rank correlation between $x$ and $y$ is defined by:

$$\tau(x, y) = \frac{4P}{m(m - 1)} - 1 \qquad (2)$$
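Equation (2) translates directly into code by counting concordant pairs. A minimal sketch under the same no-ties assumption (names are illustrative):

```python
from itertools import combinations

def kendall_tau(x, y):
    # Eq. (2): tau = 4P / (m(m - 1)) - 1, where P counts pairs ordered
    # the same way by x and y (concordant pairs); assumes no ties.
    m = len(x)
    P = sum(1 for i, j in combinations(range(m), 2)
            if (x[i] - x[j]) * (y[i] - y[j]) > 0)
    return 4.0 * P / (m * (m - 1)) - 1.0
```

When all $m(m-1)/2$ pairs are concordant the formula gives $+1$; when none are, it gives $-1$.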
We first define a new Spearman-based featureset. In this featureset, system log $i$ is represented by the vector $(\rho_s(c_i, c_1), \ldots, \rho_s(c_i, c_n))$, where $n$ is the number of samples in our dataset. A Kendall's-tau-based featureset can be generated in an analogous way. The resulting $n \times n$ matrix is a sample correlation matrix, and the new featureset has $n$ dimensions instead of the much larger $m$ (the number of different message types in the dataset). It is generally expected that in a typical dataset $m$ would be much larger than $n$, as it is in our own dataset, because of the diversity of possible messages in a computer system.
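The construction above can be sketched as follows: each of the $n$ message-count vectors is mapped to its vector of Spearman correlations with all $n$ logs. This is a hypothetical illustration with names of our own choosing, assuming distinct counts so that ranks are unambiguous:

```python
def spearman(x, y):
    # Spearman correlation via ranks; assumes distinct values (no ties).
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i], reverse=True)
        r = [0] * len(v)
        for k, i in enumerate(order, start=1):
            r[i] = k
        return r
    m = len(x)
    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (m * (m * m - 1))

def spearman_features(counts):
    # counts: n message-count vectors of length m.
    # Returns the n x n matrix whose row i is the new feature
    # vector for system log i.
    n = len(counts)
    return [[spearman(counts[i], counts[j]) for j in range(n)]
            for i in range(n)]
```

Note that the output matrix is symmetric with unit diagonal, as a sample correlation matrix must be, and its size depends only on $n$, not on $m$.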
It is interesting to note that using either the Spearman-based featureset or the Kendall's-tau-based featureset generates a kernel similarity matrix for the original dataset. This opens the possibility of using these correlation measures in kernel-based algorithms [9]. We prove that both sample correlation matrices are kernel matrices, using the fact that a kernel matrix is a Positive Semi-Definite (PSD) matrix. A matrix $A$ is PSD if $v^T A v \geq 0$ for any nonzero vector $v$ [9].
We first prove that the Pearson sample correlation matrix [12] is PSD, and then conclude that so are the Spearman rank correlation matrix and the Kendall's-tau sample correlation matrix.
Definition 3 (Pearson Sample Correlation)
Let $x$ and $y$ be vectors of the same dimension. The Pearson correlation coefficient is defined by:

$$\rho_p(x, y) = \frac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}(x)\,\mathrm{var}(y)}} \qquad (3)$$

where $\mathrm{cov}$ is the sample covariance function and $\mathrm{var}$ is the sample variance function.
Theorem 1
A Pearson sample correlation matrix is PSD.

Proof.
Let $X$ be a matrix in which each row is a sample of dimension $m$. Let $D$ be a diagonal matrix such that entry $D_{ii}$ is the variance of row $i$ in $X$. Assume, without loss of generality, that the mean of each sample in $X$ is zero. Then the Pearson correlation matrix $C$ can be written in matrix form as:

$$C = \frac{1}{m-1}\, D^{-\frac{1}{2}} X X^T D^{-\frac{1}{2}}$$

For any nonzero vector $v$, the expression $v^T C v$ can be written as:

$$v^T C v = \frac{1}{m-1}\, v^T D^{-\frac{1}{2}} X X^T D^{-\frac{1}{2}} v = \frac{1}{m-1} \left\| X^T D^{-\frac{1}{2}} v \right\|^2 \qquad (4)$$

The rightmost element is a square term scaled by a positive constant, hence it is greater than or equal to zero. Therefore $C$ is PSD.
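The claim of Theorem 1 is easy to check numerically: build a Pearson correlation matrix from arbitrary data and evaluate the quadratic form $v^T C v$. A small self-contained sketch in pure Python (names are ours); after centering, the $(i,j)$ entry reduces to $\langle x_i, x_j \rangle / (\|x_i\|\,\|x_j\|)$, since the $1/(m-1)$ factors in covariance and variance cancel:

```python
import random

def pearson_matrix(rows):
    # Pearson sample correlation matrix of the rows: center each row,
    # then C[i][j] = <x_i, x_j> / (||x_i|| * ||x_j||), which matches
    # D^{-1/2} X X^T D^{-1/2} up to the cancelling 1/(m-1) factors.
    centered = [[v - sum(r) / len(r) for v in r] for r in rows]
    dot = lambda a, b: sum(p * q for p, q in zip(a, b))
    norms = [dot(r, r) ** 0.5 for r in centered]
    return [[dot(a, b) / (na * nb) for b, nb in zip(centered, norms)]
            for a, na in zip(centered, norms)]

def quad_form(C, v):
    # v^T C v; by Theorem 1 this is a squared norm, hence >= 0.
    n = len(v)
    return sum(v[i] * C[i][j] * v[j] for i in range(n) for j in range(n))
```

Evaluating `quad_form` on random vectors never yields a (more than numerically) negative value, consistent with the theorem.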
Theorem 2
A Spearman rank correlation matrix is PSD.
Proof.
Spearman correlation is Pearson correlation applied to ranks [4]. Therefore, the Spearman rank correlation matrix is PSD.
Theorem 3
A Kendall's-tau correlation matrix is PSD.

Proof.
For a vector $x$ of dimension $m$, let $h(x)$ of dimension $m^2$ be defined by:

$$h(x)_{ij} = \mathrm{sign}(x_i - x_j) \qquad (5)$$

Then $\sum_{i,j} h(x)_{ij} = 0$, and it can be easily verified that Kendall's tau correlation of $x$ and $y$ is the Pearson correlation of $h(x)$ and $h(y)$: both equal $c \cdot \langle h(x), h(y) \rangle$ for $c$ a constant that depends only on $m$. Hence, Kendall's tau correlation matrix is also a Pearson correlation matrix, and so it is PSD.
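The identity used in this proof can also be verified numerically: mapping each vector to the sign vector $h(x)$ of Equation (5) and taking Pearson correlation reproduces Kendall's tau. A sketch under the no-ties assumption (illustrative names):

```python
def kendall_tau(x, y):
    # Eq. (2): tau = 4P / (m(m - 1)) - 1, P = number of concordant pairs.
    m = len(x)
    P = sum(1 for i in range(m) for j in range(i + 1, m)
            if (x[i] - x[j]) * (y[i] - y[j]) > 0)
    return 4.0 * P / (m * (m - 1)) - 1.0

def sign_vector(x):
    # h(x)_{ij} = sign(x_i - x_j), Eq. (5); sums to zero by antisymmetry.
    sgn = lambda t: (t > 0) - (t < 0)
    m = len(x)
    return [sgn(x[i] - x[j]) for i in range(m) for j in range(m)]

def pearson(a, b):
    # Plain Pearson correlation coefficient, Eq. (3).
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    ca, cb = [v - ma for v in a], [v - mb for v in b]
    num = sum(p * q for p, q in zip(ca, cb))
    den = (sum(p * p for p in ca) * sum(q * q for q in cb)) ** 0.5
    return num / den
```

With no ties, $h(x)$ has zero mean and squared norm $m(m-1)$, so the Pearson correlation of the sign vectors equals $\langle h(x), h(y) \rangle / (m(m-1))$, which matches Equation (2) exactly.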