It's OK to Make Mistakes

Redundancy and the interchangeability of components make recovery fast, simple, and unintrusive: a brick is recovered by rebooting it, with no attempt to preserve its pre-crash state and no coordination with other bricks.

As a result, the monitoring system that detects failures is allowed to make mistakes: rebooting a healthy brick by mistake costs little. In other systems, by contrast, false positives usually reduce performance, lower throughput, or cause incorrect behavior. Because false positives are not a problem in SSM, generic methods such as statistical-anomaly-based failure detection can be made quite aggressive to avoid missing real faults.
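The trade-off can be illustrated with a minimal sketch, which is not SSM's actual detector. It assumes each brick reports a single latency figure and flags any brick whose value lies more than a chosen number of median absolute deviations from the median. The names find_suspects and recover, the reboot callback, the metric, and the thresholds are all hypothetical and chosen only for illustration.

```python
from statistics import median


def find_suspects(latencies, threshold):
    """Return ids of bricks whose latency is a statistical outlier.

    Hypothetical sketch: the score is |value - median| / MAD, and a brick
    is suspect when its score exceeds `threshold`. A lower threshold is
    more aggressive: fewer missed faults, more false positives.
    """
    values = list(latencies.values())
    mid = median(values)
    mad = median([abs(v - mid) for v in values])
    if mad == 0:
        # All typical bricks agree exactly; anything different is suspect.
        return [b for b, v in latencies.items() if v != mid]
    return [b for b, v in latencies.items() if abs(v - mid) / mad > threshold]


def recover(suspects, reboot):
    """Reboot every suspect brick; no state is saved and no peers are consulted."""
    for brick in suspects:
        reboot(brick)


# Assumed sample readings (milliseconds); b5 is the genuinely faulty brick.
latencies = {"b1": 10.0, "b2": 11.0, "b3": 9.0, "b4": 10.5, "b5": 50.0}

conservative = find_suspects(latencies, threshold=3.0)
aggressive = find_suspects(latencies, threshold=2.0)

rebooted = []
recover(aggressive, rebooted.append)  # stand-in for a real reboot command
```

With these assumed readings, the conservative threshold flags only b5, while the aggressive one also flags the healthy brick b3. Under the recovery model described above, that false positive costs one unnecessary reboot and nothing else, which is why the threshold can be tuned toward catching every real fault.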


Benjamin Chan-Bin Ling 2004-03-04