Next: Related work Up: Auto-diagnosis of field problems Previous: Implementation in other operating

Performance and experience

In this section we briefly discuss the performance of the NetApp Auto-diagnosis System and our experience with its effectiveness in simplifying the debugging of field problems.

The continuous monitoring subsystem of ONTAP consumes very few resources. Its CPU overhead is less than 0.25%, even on the slowest systems that we ship, and its memory footprint is less than 400KB on a typical system. The time that the netdiag command takes depends on the configuration of the system and the load on it. On our slowest filer, configured with the maximum allowable number of network interfaces and saturated with client load, netdiag takes no more than 15 seconds to execute. On most systems, it takes less than 5 seconds.
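The overhead figures above can be sanity-checked with a simple model of a periodic sampler. The sampling interval and per-sample cost below are illustrative assumptions chosen to show how a sub-0.25% CPU budget is achievable; they are not ONTAP's actual parameters.

```python
# Back-of-the-envelope model of continuous-monitoring CPU overhead.
# Both constants are hypothetical, for illustration only.

SAMPLE_INTERVAL_S = 1.0    # hypothetical: counters sampled once per second
PER_SAMPLE_COST_S = 0.002  # hypothetical: 2 ms of CPU per sampling pass

def cpu_overhead(per_sample_cost_s: float, interval_s: float) -> float:
    """Fraction of one CPU consumed by a periodic sampling task."""
    return per_sample_cost_s / interval_s

overhead = cpu_overhead(PER_SAMPLE_COST_S, SAMPLE_INTERVAL_S)
print(f"{overhead:.2%}")  # 0.20% -- under the 0.25% budget reported above
```

The point of the model is simply that overhead scales with the ratio of per-pass work to sampling interval, so a monitor that does a few milliseconds of counter reads per second stays well under the reported budget.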

Specifically, on an F760-class filer (600 MHz 21164 Alpha, 2GB RAM) configured with 4 network interfaces and under full client load, the CPU usage of the auto-diagnosis continuous monitoring code is less than 0.1%. On this system, netdiag takes approximately 4 seconds to execute.

The version of ONTAP that contains the NetApp Auto-diagnosis System has only recently been made available to customers, and has not yet shipped in volume. We have therefore not yet been able to see how well the auto-diagnosis subsystem deals with real-life problems in the field. Instead, we have relied on a laboratory study, in which we simulated a sample of field problem cases from our customer support call record database and measured the effectiveness of the auto-diagnosis system in solving them. For each case, we re-created the specific problem situation in the laboratory and checked whether the auto-diagnosis logic identified the cause.
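The study described above can be pictured as a simple replay loop: each sampled support case is re-created, the diagnosis logic is run, and its answer is scored against the root cause recorded by support staff. The case format and the `diagnose` hook below are hypothetical stand-ins for the actual test rig.

```python
# Hypothetical sketch of the replay study. SupportCase, run_study, and the
# cause labels are illustrative names, not part of ONTAP.

from dataclasses import dataclass
from typing import Callable

@dataclass
class SupportCase:
    case_id: str
    root_cause: str  # cause recorded by support staff for this call

def run_study(cases: list[SupportCase],
              diagnose: Callable[[SupportCase], str]) -> float:
    """Return the fraction of cases whose recorded cause is auto-detected."""
    if not cases:
        return 0.0
    hits = sum(1 for c in cases if diagnose(c) == c.root_cause)
    return hits / len(cases)

# Toy usage: a diagnoser that only recognizes duplex mismatches.
cases = [SupportCase("c1", "duplex-mismatch"),
         SupportCase("c2", "bad-cable")]
rate = run_study(cases, lambda c: "duplex-mismatch")
print(f"success rate: {rate:.0%}")  # 50%
```

The success rates reported below are this fraction, computed over the networking calls in each month's sample.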

We first looked at a sample of 961 calls that came in during the month of September 1999. This set did not include calls corresponding to hardware or software faults, nor calls in which the customer asked for general information about the product. All other types of calls were considered. September 1999 was the first month whose call data we had not included in our analysis of historical call records while designing ONTAP's auto-diagnosis logic.

Of these 961 calls, 84 had something to do with the networking code and its interactions with the rest of ONTAP. Auto-diagnosis, when simulated on these cases, was able to auto-detect the problem cause for all but 12 of these calls, a success rate of 84.5%. The average time that the netdiag command took to diagnose a problem was approximately 2.5 seconds. We did not attempt to quantify the secondary effect that this dramatic reduction in average problem diagnosis time would have on the customer's level of satisfaction.

Of the 12 calls that auto-diagnosis failed to solve, 7 were related to transient problems with external networking hardware, 1 was due to a NIC that exhibited very occasional errors and needed re-seating, and 4 were problems for which we did not have appropriate auto-diagnosis logic.

For the 877 calls not related to networking, we performed a static manual analysis to determine which of these problems could be auto-diagnosed by the complete ONTAP auto-diagnosis system. This analysis was performed against a design description of the auto-diagnosis logic for the other subsystems of ONTAP. Our study indicates that about three quarters of these problems could indeed be addressed by auto-diagnosis. Another 124 of these calls (about 14%) corresponded to problems whose diagnosis could be sped up significantly by the partial auto-diagnosis information that the diagnosis system provided.

We repeated this simulation and analysis for calls that came in during October 1999: 1023 cases, 97 networking and 926 other. Simulation of the networking cases indicated that auto-diagnosis could solve 88% of them. All but 5 of the networking problems that could not be auto-diagnosed were related to misconfigured clients; the rest were problems for which we had not yet developed appropriate auto-diagnosis logic. Static manual analysis of the non-networking cases indicated a success rate of about 70%.

We also considered 500 randomly chosen samples from the customer call data for November 1999 through February 2000, and repeated the above analysis and simulation for these 500 calls. The results for this sample were very similar to those for September and October 1999.

In summary, our historical call data indicates that our auto-diagnosis system should succeed in automatically addressing many problems that currently require human intervention. This should lead to a large reduction in the cost of handling customer calls, through a significant reduction in the number of calls per installed system. We were unable to directly quantify how much simpler auto-diagnosis makes the problem diagnosis process; the only (relatively weak) metric that we could quantify was the turnaround time for a problem, with and without auto-diagnosis. This metric was at least three orders of magnitude lower with auto-diagnosis.

Gaurav Banga