

Pinpoint + SSM = Self-Healing

Pinpoint is a framework for detecting likely failures in componentized systems. To detect failures, a Pinpoint server dynamically builds a model of the ``good behavior'' of the system. This model is based on both the system's past behavior and its majority behavior, under the assumption that most of the time, most of the system is behaving correctly.

When part of the system deviates from this model of good behavior, Pinpoint interprets the deviation as a possible failure. Once a component has generated a threshold number of these anomalies, Pinpoint triggers a restart of that component.

To detect failures in bricks, Pinpoint monitors each brick's vital statistics. Each brick sends its own statistics to the Pinpoint server at one-second intervals. Statistics are divided into activity and state statistics.
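
The exact contents of these reports are not spelled out here; the sketch below shows one plausible shape for a once-per-second report. The class and field names are our own illustrations, not identifiers from the SSM or Pinpoint implementation.

    // Hypothetical sketch of a brick's once-per-second statistics report.
    // Field names are illustrative; they are not taken from the SSM/Pinpoint code.
    import java.util.Map;

    public final class BrickStatsReport {
        public final String brickId;
        public final long timestampMillis;
        // Activity statistics: rates, e.g. writes processed in the last second.
        public final Map<String, Long> activityStats;
        // State statistics: sizes, e.g. inbox length or memory in use.
        public final Map<String, Long> stateStats;

        public BrickStatsReport(String brickId, long timestampMillis,
                                Map<String, Long> activityStats,
                                Map<String, Long> stateStats) {
            this.brickId = brickId;
            this.timestampMillis = timestampMillis;
            this.activityStats = activityStats;
            this.stateStats = stateStats;
        }
    }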

Activity statistics, e.g., the number of processed writes, represent the rate at which a brick is performing some activity. When Pinpoint receives an activity statistic, it compares that statistic against the corresponding statistics of all the other bricks, looking for highly deviant rates. Because we want to be able to run SSM on a relatively small number of nodes, we calculate the median absolute deviation of the activity statistics. This metric is robust to outliers even in small populations, and lets us identify deviant activity statistics with a low false-positive rate.
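
To make the deviation test concrete, the following sketch flags bricks whose rate lies more than a chosen multiple of the median absolute deviation away from the population median. The threshold value and method names are our own assumptions; the paper does not specify them.

    import java.util.Arrays;

    public final class MadDetector {
        // Median of a copied-and-sorted array.
        private static double median(double[] values) {
            double[] sorted = values.clone();
            Arrays.sort(sorted);
            int n = sorted.length;
            return (n % 2 == 1) ? sorted[n / 2]
                                : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
        }

        // Flags bricks whose activity rate deviates from the median of all bricks
        // by more than `threshold` times the median absolute deviation (MAD).
        public static boolean[] flagDeviant(double[] rates, double threshold) {
            double med = median(rates);
            double[] absDev = new double[rates.length];
            for (int i = 0; i < rates.length; i++) {
                absDev[i] = Math.abs(rates[i] - med);
            }
            double mad = median(absDev);
            boolean[] deviant = new boolean[rates.length];
            for (int i = 0; i < rates.length; i++) {
                // If MAD is zero (most bricks report identical rates),
                // flag any brick that deviates at all.
                deviant[i] = (mad == 0) ? absDev[i] > 0 : absDev[i] / mad > threshold;
            }
            return deviant;
        }

        public static void main(String[] args) {
            // Example: writes/sec reported by six bricks; the last one has stalled.
            double[] writesPerSecond = {105, 98, 101, 97, 103, 3};
            System.out.println(Arrays.toString(flagDeviant(writesPerSecond, 5.0)));
        }
    }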

State statistics represent the size of some piece of state, such as the size of the message inbox. In SSM, these state statistics often vary in periodic patterns; e.g., in normal behavior, the MemoryUsed statistic grows until the garbage collector is triggered to free memory, and the pattern repeats. Unfortunately, we do not know the period of this pattern a priori; in fact, we cannot even assume that the period is regular.

To discover the patterns in the behavior of state statistics, we use the Tarzan algorithm for analyzing time series [23]. For each state statistic of a brick, we keep an N-length history, or time series, of the statistic. We discretize this time series into a binary string. To discover anomalies, Tarzan counts the relative frequencies of all substrings shorter than k within these binary strings. If a brick's discretized time series has a surprisingly high or low frequency of some substring compared to the other bricks' time series, we mark the brick as potentially faulty. This algorithm can be implemented in linear time and linear space, though we have found that our simpler non-linear implementation performs sufficiently well.
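
The sketch below illustrates the idea in the same simplified, non-linear spirit: a rise/fall discretization into a binary string, substring-frequency counting up to length k, and a surprise score against a peer brick's frequencies. The discretization rule and the scoring function here are our assumptions; Tarzan itself uses suffix trees and a Markov-model estimate of expected substring frequencies.

    import java.util.HashMap;
    import java.util.Map;

    public final class TarzanSketch {
        // Discretize a time series into a binary string: '1' if the value rose
        // since the previous sample, '0' otherwise. (One possible encoding; the
        // paper does not pin down the exact discretization used.)
        public static String discretize(double[] series) {
            StringBuilder sb = new StringBuilder();
            for (int i = 1; i < series.length; i++) {
                sb.append(series[i] > series[i - 1] ? '1' : '0');
            }
            return sb.toString();
        }

        // Relative frequencies of all substrings of length 1..k.
        public static Map<String, Double> substringFreqs(String s, int k) {
            Map<String, Integer> counts = new HashMap<>();
            int total = 0;
            for (int len = 1; len <= k; len++) {
                for (int i = 0; i + len <= s.length(); i++) {
                    counts.merge(s.substring(i, i + len), 1, Integer::sum);
                    total++;
                }
            }
            Map<String, Double> freqs = new HashMap<>();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                freqs.put(e.getKey(), e.getValue() / (double) total);
            }
            return freqs;
        }

        // Surprise score: sum over substrings of the absolute difference between
        // this brick's substring frequencies and a reference (peer) distribution.
        public static double surprise(Map<String, Double> brick,
                                      Map<String, Double> reference) {
            double score = 0.0;
            for (Map.Entry<String, Double> e : brick.entrySet()) {
                score += Math.abs(e.getValue() - reference.getOrDefault(e.getKey(), 0.0));
            }
            return score;
        }
    }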

Once a brick has been flagged as potentially faulty by three or more activity and state statistics, we conclude that the brick has indeed failed in some way, and Pinpoint restarts it. In the current implementation, a script is executed to restart the appropriate brick, though a more robust implementation might make use of hardware-based leases that forcibly reboot the machine when they expire [10].

Because restarting a brick cures only transient failures, Pinpoint also tracks how often each brick is restarted. If a brick is restarted more than a threshold number of times in a given period, which usually indicates a persistent fault, Pinpoint can take the brick off-line and notify an administrator.
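
Putting the two preceding paragraphs together, a per-brick recovery policy might look like the sketch below. The three-statistic trigger comes from the text; the restart-count threshold, time window, and restart/offline hooks are placeholders, not values from the paper.

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Hypothetical per-brick recovery policy: restart on enough anomalous
    // statistics, escalate to an operator if restarts recur too often.
    public final class RecoveryPolicy {
        private static final int ANOMALOUS_STATS_THRESHOLD = 3;   // from the text
        private static final int MAX_RESTARTS_IN_WINDOW = 3;      // assumed value
        private static final long WINDOW_MILLIS = 10 * 60 * 1000; // assumed window

        private final String brickId;
        private final Deque<Long> restartTimes = new ArrayDeque<>();

        public RecoveryPolicy(String brickId) {
            this.brickId = brickId;
        }

        public void onAnomalyReport(int anomalousStatCount) {
            if (anomalousStatCount < ANOMALOUS_STATS_THRESHOLD) {
                return; // not enough independent evidence of a failure
            }
            long now = System.currentTimeMillis();
            // Drop restarts that fall outside the sliding window.
            while (!restartTimes.isEmpty()
                    && now - restartTimes.peekFirst() > WINDOW_MILLIS) {
                restartTimes.removeFirst();
            }
            if (restartTimes.size() >= MAX_RESTARTS_IN_WINDOW) {
                takeOfflineAndNotify(); // likely a persistent fault
            } else {
                restartTimes.addLast(now);
                restartBrick();         // e.g. invoke the restart script
            }
        }

        private void restartBrick() {
            System.out.println("restarting brick " + brickId);
        }

        private void takeOfflineAndNotify() {
            System.out.println("taking brick " + brickId + " offline; notifying admin");
        }
    }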

