Representing correlated failures

Consider a system composed of a set $\cal H$ of hosts each of which is capable of holding certain objects. These hosts can fail (for example, by crashing) and, to keep these objects available, they need to be replicated. A simple replication strategy is to determine the maximum number

of hosts that can fail at any time, and then maintain more than

replicas of each object.

However, using more than

replicas may lead to excessive replication when host failures are correlated. As a simple example, consider three hosts $\{h_1, h_2, h_3\}$ where the failures of

and

are correlated while

fails independent of the other hosts. If

fails, then the probability of

failing is high. As a result, one might set

and thereby require

replicas. However, if we place replicas on

and

, the object's availability may be acceptably high with just two replicas.

To better address issues of optimal replication in the face of correlated failures, we have defined an abstraction that we call a core [15]. A core is a minimal set of hosts such that, in any execution, at least one host in the core does not fail. In the above example, both $\{h_1, h_3\}$ and $\{h_2, h_3\}$ are cores. $\{h_1, h_2\}$ would not be a core since the probability of both failing is too high and $\{h_1, h_2, h_3\}$ would not be a core since it is not minimal. Using this terminology, a central problem of informed replication is the identification of cores based on the correlation of failures.

An Internet catastrophe causes hosts to fail in a correlated manner because all hosts running the targeted software are vulnerable. Operating systems and Web servers are examples of software commonly exploited by Internet pathogens [27,36]. Hence we characterize a host's vulnerabilities by the software they run. We associate with each host a set of attributes, where each attribute is a canonical name of a software package or system that the host runs; in Section 3.2 below, we discuss the tradeoffs of representing software packages at different granularities. We call the combined representation of all attributes of a host the configuration of the host. An example of a configuration is $\{$ Windows, IIS, IE $\}$ , where Windows is a canonical name for an operating system, IIS for a Web server package, and IE for a Web browser. Agreeing on canonical names for attribute values is essential to ensure that dependencies of host failures are appropriately captured.

An Internet pathogen can be characterized by the set of attributes

that it targets. Any host that has none of the attributes in

is not susceptible to the pathogen. A core is a minimal set

of hosts such that, for each pathogen, there is a host

that is not susceptible to the pathogen. Internet pathogens often target a single (possibly cross-platform) vulnerability, and the ones that target multiple vulnerabilities target the same operating system. Assuming that any attribute is susceptible to attack, we can re-define a core using attributes: a core is a minimal set

of processes such that no attribute is common to all hosts in

. In Section 5.4, we relax this assumption and show how to extend our results to tolerate pathogens that can exploit multiple vulnerabilities.

To illustrate these concepts, consider the system described in Example 3.1. In this system, hosts are characterized by six attributes which we classify for clarity into operating system, Web server, and Web browser.

Example 3.1

Attributes: Operating System = {Unix, Windows};

Web Server = {Apache, IIS};

Web Browser = {IE, Netscape}.

Hosts: $H_1 = \{\mbox{\emph{Unix, Apache, Netscape}} \}$ ;

$H_2 = \{\mbox{\emph{Windows, IIS, IE}} \}$ ;

$H_3 = \{\mbox{\emph{Windows, IIS, Netscape}} \}$ ;

$H_4 = \{\mbox{\emph{Windows, Apache, IE}} \}$ .

Cores $= \{ \{H_1, H_2\}, \{H_1, H_3, H_4\} \}$ .

The smaller core $\{H_1, H_2\}$ might appear to be the better choice since it requires less replication. Choosing the smallest core, however, can have an adverse effect on individual hosts if many hosts use this core for placing replicas. To represent this effect, we define load to be the amount of storage a host provides to other hosts. In environments where some configurations are rare, hosts with the rare configurations may occur in a large percentage of the smallest cores. Thus, hosts with rare configurations may have a significantly higher load than the other hosts. Indeed, having a rare configuration can increase a host's load even if the smallest core is not selected. For example, in Example 3.1,

is the only host that has a flavor of Unix as its operating system. Consequently,

is present in both cores.

To make our argument more concrete, consider the worms summarized in Table 1, which are well-known worms unleashed in the past three years. For each worm, given two hosts with one not running Windows or not running a specific server such as a Web server or a database, at least one survives the attack. With even a very modest amount of heterogeneity, our method of constructing cores includes such pairs of hosts.

Attributes:	Operating System = {Unix, Windows};
	Web Server = {Apache, IIS};
	Web Browser = {IE, Netscape}.

Hosts:	$H_1 = \{\mbox{\emph{Unix, Apache, Netscape}} \}$ ;
	$H_2 = \{\mbox{\emph{Windows, IIS, IE}} \}$ ;
	$H_3 = \{\mbox{\emph{Windows, IIS, Netscape}} \}$ ;
	$H_4 = \{\mbox{\emph{Windows, Apache, IE}} \}$ .