

Using Rank Correlation for Feature Construction

In the original feature-set of our dataset, system log $ i$ is represented by the message count vector $ \vec{c_i}$ defined above. Our dataset contains about $ 3,\!000$ system logs and $ 15,\!000$ message types, so this representation yields a $ 3,\!000\times 15,\!000$ matrix. This matrix is very sparse; only about 0.6% of its entries are non-zero. The high dimensionality and sparseness of the data make k-means clustering impractical for this representation. Since our objective is message ranking, we propose a new feature construction scheme for the system-log dataset that measures the difference in the ranking of messages between system logs. Two well-known rank correlation measures can be used to achieve this: the Spearman rank correlation [12] and Kendall's tau rank correlation [12]. Let $ \vec{x}$ and $ \vec{y}$ be vectors of dimension $ N$. Let $ \vec{r_x}$ and $ \vec{r_y}$ be the corresponding rank vectors, i.e., $ r_x\!\left[i\right] = k$ if $ x\!\left[i\right]$ is the $ k$'th largest entry of $ \vec{x}$, and similarly for $ \vec{r_y}$. The two correlation measures are defined as follows:

Definition 1 (Spearman Rank Correlation)   Let $ \vec{d} \stackrel{def}{=}\vec{r_x} - \vec{r_y}$. The Spearman rank correlation between $ \vec{x}$ and $ \vec{y}$ is defined by:

$\displaystyle \rho(\vec{x},\vec{y}) \stackrel{def}{=}1-\frac{6\Vert\:\vec{d}\:\Vert^2}{N(N^2-1)}$ (1)
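For illustration, the following minimal Python sketch (using NumPy; the helper names rank_vector and spearman_rho are ours, not part of the paper) computes rank vectors under the convention above and evaluates Equation (1):

```python
import numpy as np

def rank_vector(x):
    """Rank vector as defined above: r_x[i] = k if x[i] is the k-th largest
    entry of x. Ties are broken by position, an illustrative choice since the
    definition assumes distinct values."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(-x)                   # indices from largest to smallest
    ranks = np.empty(len(x), dtype=int)
    ranks[order] = np.arange(1, len(x) + 1)  # the k-th largest entry gets rank k
    return ranks

def spearman_rho(x, y):
    """Spearman rank correlation per Equation (1)."""
    d = rank_vector(x) - rank_vector(y)
    n = len(d)
    return 1.0 - 6.0 * np.dot(d, d) / (n * (n ** 2 - 1))

print(spearman_rho([3, 10, 7, 1], [2, 9, 4, 8]))   # -> 0.4 for this pair
```

scipy.stats.spearmanr computes the same quantity (with a more careful treatment of ties) and could be used instead.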

Definition 2 (Kendall's Tau Rank Correlation)   Let $ P(\vec{x},\vec{y})$ be the number of pairs $ i,j$ such that both $ r_x\!\left[i\right]>r_x\!\left[j\right]$ and $ r_y\!\left[i\right]>r_y\!\left[j\right]$. Kendall's tau rank correlation between $ \vec{x}$ and $ \vec{y}$ is defined by:

$\displaystyle \tau(\vec{x},\vec{y}) \stackrel{def}{=}\frac{4P(\vec{x},\vec{y})}{N(N-1)}-1$ (2)
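Analogously, a sketch of Equation (2), reusing rank_vector from the previous sketch; the quadratic-time pair count mirrors the definition and is not how one would compute Kendall's tau at scale:

```python
def kendall_tau(x, y):
    """Kendall's tau per Equation (2): count the pairs (i, j) that are ordered
    the same way in both rankings, then rescale to the range [-1, 1]."""
    rx, ry = rank_vector(x), rank_vector(y)
    n = len(rx)
    p = sum(1 for i in range(n) for j in range(n)
            if rx[i] > rx[j] and ry[i] > ry[j])
    return 4.0 * p / (n * (n - 1)) - 1.0

print(kendall_tau([3, 10, 7, 1], [2, 9, 4, 8]))   # -> 0.333... for this pair
```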

We first define a new Spearman-based feature-set. In this feature-set, system log $ i$ is represented by the vector $ (\rho(\vec{c_i},\vec{c_1}), \ldots, \rho(\vec{c_i},\vec{c_k}))$, where $ k$ is the number of samples in our dataset. A Kendall's-tau-based feature-set is generated in an analogous way. The resulting matrix is a sample correlation matrix, and the new feature-set has $ k$ dimensions instead of the much larger $ n$ (the number of different message types in the dataset). In a typical dataset, $ n$ is expected to be much larger than $ k$, as it is in ours, because of the diversity of possible messages in a computer system.

It is interesting to note that both the Spearman-based and the Kendall's-tau-based feature-sets yield a kernel similarity matrix for the original dataset. This opens the possibility of using these correlation measures in kernel-based algorithms [9]. We prove that both sample correlation matrices are kernel matrices, using the fact that a kernel matrix is a Positive Semi-Definite (PSD) matrix: a matrix $ A$ is PSD if $ \vec{x}'A\vec{x}\geq 0$ for every vector $ \vec{x}$ [9]. We first prove that the Pearson sample correlation matrix [12] is PSD, and then conclude that the Spearman rank correlation matrix and the Kendall's-tau sample correlation matrix are PSD as well.
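Before turning to the PSD results, here is a minimal sketch of the feature construction just described, reusing spearman_rho from above (the names count_matrix and correlation_features, and the plain double loop, are our illustrative choices):

```python
import numpy as np

def correlation_features(count_matrix, corr=spearman_rho):
    """Map a k x n matrix of message counts (row i is the count vector c_i) to
    the k x k feature matrix whose (i, j) entry is corr(c_i, c_j)."""
    k = len(count_matrix)
    features = np.empty((k, k))
    for i in range(k):
        for j in range(i, k):                 # both measures are symmetric
            features[i, j] = features[j, i] = corr(count_matrix[i], count_matrix[j])
    return features

# Row i of the result is the new feature vector for system log i.
counts = np.array([[5, 0, 2, 1], [4, 1, 2, 0], [0, 7, 1, 3]])
print(correlation_features(counts))
```

Passing corr=kendall_tau gives the Kendall's-tau-based feature-set instead; Definition 3 and the theorems below establish that both resulting matrices are valid kernel matrices.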

Definition 3 (Pearson Sample Correlation)   Let $ \vec{x}$ and $ \vec{y}$ be vectors of the same dimension. The Pearson correlation coefficient is defined by:

$\displaystyle r(\vec{x},\vec{y}) = \frac{cov(\vec{x},\vec{y})}{\sqrt{var(\vec{x})\cdot var(\vec{y})}}$ (3)

where $ cov$ is the sample covariance function and $ var$ is the sample variance function.
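As a quick illustration, a direct transcription of Equation (3) in Python with NumPy (pearson_r is our name); it agrees with NumPy's built-in np.corrcoef up to floating-point error:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation per Equation (3): the sample covariance divided by
    the square root of the product of the sample variances."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / np.sqrt(x.var() * y.var())

x, y = [3.0, 10.0, 7.0, 1.0], [2.0, 9.0, 4.0, 8.0]
print(np.isclose(pearson_r(x, y), np.corrcoef(x, y)[0, 1]))   # True
```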

Theorem 1   A Pearson sample correlation matrix is PSD.

Proof. Let $ X$ be a matrix in which each row is a sample. Let $ S$ be a diagonal matrix such that entry $ \left(i,i\right)$ is the variance of row $ i$ in $ X$. Assume, without loss of generality, that the mean of each row of $ X$ is zero. Then the Pearson correlation matrix can be written in matrix form (up to a positive scalar factor, which does not affect positive semi-definiteness) as:

$\displaystyle R = S^{-\frac{1}{2}} X X' S^{-\frac{1}{2}}$    

For any vector $ \vec{x}$, the expression $ \vec{x}'R\vec{x}$ can be written as:

$\displaystyle \vec{x}'R\vec{x} =\vec{x}'\left( S^{-\frac12} X X' S^{-\frac12}\right)\vec{x} = \left\Vert\: X' S^{-\frac12}\vec{x}\:\right\Vert^2$ (4)

The rightmost expression is a squared norm, hence non-negative. Therefore $ R$ is PSD. $ \qedsymbol$
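As an illustrative numerical check of Theorem 1 (not a substitute for the proof), the eigenvalues of a Pearson sample correlation matrix computed from random data are non-negative up to round-off:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))       # 5 samples (rows), 20 features each
R = np.corrcoef(X)                 # 5 x 5 Pearson sample correlation matrix
print(np.linalg.eigvalsh(R).min() >= -1e-10)   # True: PSD up to round-off
```

The same check applied to the feature matrix produced by correlation_features above behaves the same way, as Theorem 2 (and, with kendall_tau, Theorem 3) predicts.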

Theorem 2   A Spearman rank correlation matrix is PSD.

Proof. Spearman correlation is Pearson correlation applied to ranks [4]. Therefore, the Spearman rank correlation matrix is PSD. $ \qedsymbol$

Theorem 3   A Kendall's-tau correlation matrix is PSD.

Proof. For a vector $ \vec{x}$ of dimension $ N$, let $ \vec{x_K}$ of dimension $ N^2$ be defined by:

$\displaystyle x_K\!\left[j+(i-1)\cdot N\right] = \left\{ \begin{array}{ll} 1 & \textrm{if } r_x\!\left[i\right]>r_x\!\left[j\right] \\ 0 & \textrm{otherwise} \end{array} \right.$ (5)

Then $ P(\vec{x},\vec{y}) = \sum_{k=1}^{N^2}x_K\!\left[k\right]\cdot y_K\!\left[k\right]$, and it can easily be verified that Kendall's tau correlation of $ \vec{x}$ and $ \vec{y}$ is the Pearson correlation of $ c\cdot\vec{x_K}$ and $ c\cdot\vec{y_K}$, for $ c$ a constant that depends on $ N$. Hence, Kendall's tau correlation matrix is also a Pearson correlation matrix, and so it is PSD. $ \qedsymbol$
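The construction in this proof can be checked numerically; the sketch below (pair_indicator is our illustrative name) builds $ \vec{x_K}$ from Equation (5), reusing rank_vector from above, and recovers $ P(\vec{x},\vec{y})$ as an inner product:

```python
import numpy as np

def pair_indicator(x):
    """Flatten the N x N indicator of Equation (5): entry (i, j) is 1 when
    r_x[i] > r_x[j], and 0 otherwise."""
    r = rank_vector(x)
    return (r[:, None] > r[None, :]).astype(float).ravel()

x, y = [3, 10, 7, 1], [2, 9, 4, 8]
p = np.dot(pair_indicator(x), pair_indicator(y))   # P(x, y) as an inner product
n = len(x)
print(p, 4.0 * p / (n * (n - 1)) - 1.0)            # pair count and the resulting tau
```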

