Defining Similarity

Next: Role Classification Up: Model Previous: Model

Defining Similarity

We use connection behavior as a basis for host grouping, because that information is easily available by just monitoring the network. To group hosts, we need to define similarity in a way that captures the extent to which pairs of hosts establish connections to the same set of other hosts. We start by defining similarity between hosts as a function of the number of common hosts with which they communicate. Intuitively, hosts that share the same logical role communicate with similar sets of hosts. A connection is a pair consisting of a source host address and a destination host address. The connection set of a host, C(h), is the set, $\{ a \mid a \in I$ and there is a connection between h and a}. If $h_1 \in C(h_2)$ , then $h_2 \in C(h_1)$ . We define the relation neighbor(h₁, h₂) to be true if and only if h₁ = h₂ or $h_1 \in C(h_2)$ . For later use, we extend the definition of neighbor to groups by defining neighbor(G₁, G₂) to be true if and only if there exists a host $h_1 \in G_1$ that is a neighbor of another host $h_2 \in G_2$ . We can use the notion of a connection set to provide a simple definition of similarity:

$\begin{displaymath} \mbox{\emph{similarity}}(h_1, h_2) = \vert C(h_1) \cap C(h_2)\vert \end{displaymath}$

(1)

That is to say that similarity(h₁, h₂) is equal to the number of neighbors that h₁ and h₂ have in common. We are now in a position to specify the requirements of a grouping algorithm. Given a set of hosts, I and a similarity threshold, S_min, it must find a partitioning, P, of I that is maximal with respect to avg_similarity and S_min, i.e.,

P respects avg_similarity. This constraint guarantees that each host is within the group with which it has the strongest average similarity.
For all $h \in G$ and $G \in P$ , avg_similarity(h, G) $\geq S_{min}$ . This requirement guarantees that each host in a group is sufficiently closely related to every other host in the group, thus ensuring that groups are not too large.
There is no other partitioning P of I that meets the first two requirements and has a larger average group size. This guarantees that groups are not too small.

This specification is independent of the definition of avg_similarity. For some networks, such as the one represented in Figure 1, the above definition of avg_similarity yields excellent results. However, for others a slightly more complex definition works better. We present such a definition in Section 4.2.

Next: Role Classification Up: Model Previous: Model

Godfrey Tan 2003-04-01