Several papers and patents describe components of semi-automatic problem diagnosis systems developed for mainframe computer systems [14,31,22], other highly reliable systems [1,17,32], and the telephone system [35,21]. These systems continuously monitored system health. Events affecting system health were fed into an expert diagnosis system based on a decision tree. The expert system used the incoming events to walk down the tree, narrowing the set of problems that might be present.
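To make the decision-tree walk concrete, the following is a minimal sketch of how observed events might narrow a set of candidate problems. It is not taken from any of the cited systems; the `Node` structure, the `diagnose` function, and all event and problem names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One node of a hypothetical diagnosis tree."""
    event: Optional[str] = None          # event tested here; None marks a leaf
    if_seen: Optional["Node"] = None     # branch taken when the event was observed
    if_absent: Optional["Node"] = None   # branch taken when it was not
    candidates: frozenset = frozenset()  # problems still possible at a leaf


def leaf(*problems: str) -> Node:
    """Build a leaf holding the remaining candidate problems."""
    return Node(candidates=frozenset(problems))


def diagnose(root: Node, observed: set) -> frozenset:
    """Walk from the root to a leaf, branching on whether each event was observed."""
    node = root
    while node.event is not None:
        node = node.if_seen if node.event in observed else node.if_absent
    return node.candidates


# Illustrative tree; the event and problem names are assumptions.
TREE = Node(
    event="link_errors",
    if_seen=Node(
        event="crc_errors",
        if_seen=leaf("bad_cable"),
        if_absent=leaf("duplex_mismatch", "faulty_nic"),
    ),
    if_absent=leaf(),
)
```

Calling `diagnose(TREE, {"link_errors"})` follows the "seen" branch at the root and the "absent" branch at the second node, leaving two candidate problems. A production knowledge base would be far larger and, as noted below, much harder to construct than to traverse.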
The hardest part of building such a system was defining the set of events to monitor and building the expert system's knowledge base (the decision tree). Some literature describes, at an abstract level, how such knowledge-base rule-sets can be created for a specific system from probabilistic data about events and problems [35,34]. In practice, these knowledge bases were presumably built from experience gathered in the field.
In some ways, the work described in this paper is similar to this older work. We also use continuous monitoring, maintain a rule-set, and use thresholds to trigger further diagnosis steps, including various kinds of active tests.
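As a rough illustration of thresholds triggering further diagnosis steps, the sketch below checks a monitored sample against a small rule-set and runs the active test attached to each rule whose threshold is exceeded. The `Rule` type, the metric names, the threshold values, and the two test functions are all hypothetical; they stand in for whatever a real rule-set would contain.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    """A hypothetical rule: when `metric` exceeds `threshold`, run `active_test`."""
    metric: str
    threshold: float
    active_test: Callable[[], str]


def probe_path() -> str:
    # Placeholder for an active test such as probing the network path.
    return "path probe requested"


def check_dns() -> str:
    # Placeholder for an active test such as querying a name server.
    return "dns check requested"


def evaluate(rules: List[Rule], sample: Dict[str, float]) -> List[str]:
    """Run the active test of every rule whose metric exceeds its threshold."""
    results = []
    for rule in rules:
        value = sample.get(rule.metric)
        if value is not None and value > rule.threshold:
            results.append(rule.active_test())
    return results


# Illustrative rule-set; the metric names and thresholds are assumptions.
RULES = [
    Rule("retransmit_rate", 0.05, probe_path),
    Rule("dns_timeouts", 3, check_dns),
]
```

In this sketch a metric that is missing from the sample never fires its rule, so passive monitoring stays cheap and active tests run only when a threshold is crossed.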
Our work differs from this older work in that it provides novel guiding principles and a certain structure for the problem of designing rule-sets, thresholds, causality in related diagnosis procedures, and active tests. The four auto-diagnosis techniques that we present were designed based on our knowledge of common field problems and of how the occurrence of such problems affects the dynamics of modern layered operating systems. The technique of protocol augmentation directly targets problems that arise from inadequate, incompletely specified, or poorly implemented open network protocols. Such problems are much more widespread and important in today's network-centric open computing infrastructures than in older environments, where communication was based on closed proprietary protocols.