
Why are field problems hard to debug?

When a field problem occurs with an appliance system due to any of the reasons described above (other than hardware and software faults), it is often hard to debug. Consider a filer customer who observes performance that is substantially lower than the filer's rated performance. The reason for this poor performance may be a misconfiguration somewhere in the client-to-filer distributed system, i.e., in the client, in the filer, or in the network fabric. Alternatively, the filer may simply be overloaded: this particular environment may have an atypical load, and the filer may have a lower capacity for this workload than for the standard SFS workload.

Because the end effect of all of these potential causes is usually the same, i.e., poor file access performance as seen from the client system, it is not easy to discern the exact cause of the problem. The person debugging the problem is forced to perform a sanity check of every component of the client-to-filer distributed system to ensure that each one is functioning correctly. For the filer, this means verifying all filer subsystems by invoking the various statistics commands and analyzing their output for aberrations.
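
The subsystem check described above can be illustrated with a minimal sketch. It assumes the output of the statistics commands has already been parsed into named counters; the counter names, the expected ranges, and the helper `find_aberrations` are all hypothetical and do not correspond to any actual filer command.

```python
# Hypothetical sketch: the counter names and ranges below are invented
# for illustration and do not correspond to any real filer command output.

def find_aberrations(stats, expected_ranges):
    """Return the counters whose values fall outside their expected range."""
    aberrations = {}
    for name, (low, high) in expected_ranges.items():
        value = stats.get(name)
        if value is None:
            # A missing counter is itself worth flagging for follow-up.
            aberrations[name] = "missing"
        elif not (low <= value <= high):
            aberrations[name] = value
    return aberrations


# Made-up expected ranges for three subsystems.
EXPECTED_RANGES = {
    "cpu_busy_percent": (0, 90),
    "network_error_rate": (0.0, 0.01),
    "disk_busy_percent": (0, 80),
}

# Made-up counter values, standing in for parsed statistics output.
sample_stats = {
    "cpu_busy_percent": 35,
    "network_error_rate": 0.08,
    "disk_busy_percent": 97,
}

suspects = find_aberrations(sample_stats, EXPECTED_RANGES)
print(suspects)
```

Even in this toy form, someone has to know which counters matter and what range counts as normal for each one, which is where the expertise comes in.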

This process is time-consuming, tedious, and error-prone. As explained earlier, it requires a fair amount of expertise and a certain debugging "instinct" that comes from experience. The task is further complicated by the fact that the person debugging the field problem, being a member of the filer vendor's organization, often has no direct access to the system being debugged. In that case, the various statistics commands are executed by the customer, who communicates with the support person via email or phone. This indirection slows the debugging process and leads to long downtime. Combined with the high expectations of appliance-like simplicity that most appliance customers have, it makes the debugging experience frustrating for both parties involved: the customer and the support person.

The discussion above also applies fully to general-purpose systems; in fact, appliances are usually considerably easier to debug than general-purpose systems. However, debugging field problems with appliances is certainly not as simple, or "appliance-like", as we would like. In the next section, we present a new problem diagnosis methodology that attempts to bring appliance-like simplicity to the debugging of field problems with appliance systems.

Gaurav Banga