
Reasons to Avoid a Node


Similar to other research on peer-to-peer systems, we initially assumed that churn, the act of nodes joining and leaving the system, would be the underlying cause of staleness-related failures. However, as the stability results show, failures occur at a much greater rate than churn. To investigate the root causes, we gather the logs from 4 of the redirectors and examine what causes nodes to switch from viable to avoided. Our counts also take time into account, so a long node failure receives more weight. We present each reason category with a non-negligible percentage in Table 2. The underlying causes are broadly similar across nodes: avoidance is dominated by DNS-related problems and by nodes that are down for long periods, followed by missed ACKs. Even simple overload, in the form of late ACKs, is a significant driver of avoidance. Finally, the HTTP fetch helper process can detect TCP-level or application-level connectivity problems.
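The time-weighted counting described above can be illustrated with a small sketch. The text does not say exactly how time enters the counts, so this assumes one plausible reading: each avoidance episode contributes its duration to the tally for its reason. The log format, the reason names, and the function `weighted_reason_shares` are hypothetical, and the sample intervals are made up for illustration only.

```python
from collections import defaultdict


def weighted_reason_shares(intervals):
    """Return {reason: percent of total avoided time}.

    `intervals` is an iterable of (reason, start_seconds, end_seconds)
    tuples. This record layout is an assumption, not the redirectors'
    actual log format.
    """
    totals = defaultdict(float)
    for reason, start, end in intervals:
        if end < start:
            raise ValueError("interval ends before it starts")
        # Weight each episode by how long the node stayed avoided,
        # so a long failure counts for more than a brief one.
        totals[reason] += end - start

    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {reason: 100.0 * t / grand_total for reason, t in totals.items()}


# Made-up sample data: two short DNS episodes, one late-ACK episode,
# and one long outage.
example = [
    ("dns", 0, 60),
    ("late_ack", 100, 130),
    ("node_down", 200, 500),
    ("dns", 600, 630),
]
shares = weighted_reason_shares(example)
for reason, pct in sorted(shares.items()):
    print(f"{reason:10s} {pct:5.1f}%")
```

In this sample the single long outage dominates the shares even though DNS episodes occur twice as often, which is the effect the weighting is meant to capture.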


Table 2: Average Percentage of Reasons to Avoid a Node

Site    Fetch    Miss ACKs    Node Down    Late ACKs    DNS
pr-1      6.2         18.3         29.6         13.6    32.1
ny-1      4.7         16.1         31.7         14.0    33.9
uw-1     10.4         16.8         30.0         12.8    29.7
st-1      5.0         14.7         27.2         15.4    34.3
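As a quick check on how similar the four redirectors are, the table values can be averaged per reason. The sketch below copies the numbers from the table and takes an unweighted mean across the four sites; the unweighted mean is our simplification for illustration, not a figure reported in the text. DNS averages 32.5% and Node Down roughly 29.6%, so together they account for more than half of the avoidance at every site.

```python
# Percentages copied from the table above (one row per redirector).
TABLE = {
    "pr-1": {"Fetch": 6.2, "Miss ACKs": 18.3, "Node Down": 29.6,
             "Late ACKs": 13.6, "DNS": 32.1},
    "ny-1": {"Fetch": 4.7, "Miss ACKs": 16.1, "Node Down": 31.7,
             "Late ACKs": 14.0, "DNS": 33.9},
    "uw-1": {"Fetch": 10.4, "Miss ACKs": 16.8, "Node Down": 30.0,
             "Late ACKs": 12.8, "DNS": 29.7},
    "st-1": {"Fetch": 5.0, "Miss ACKs": 14.7, "Node Down": 27.2,
             "Late ACKs": 15.4, "DNS": 34.3},
}


def column_means(table):
    """Unweighted mean of each reason across all sites."""
    reasons = list(next(iter(table.values())).keys())
    return {
        reason: sum(row[reason] for row in table.values()) / len(table)
        for reason in reasons
    }


means = column_means(TABLE)

# Print reasons from most to least common on average.
for reason, mean in sorted(means.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{reason:10s} {mean:6.2f}")
```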



In terms of design, these measurements show that a UDP-only heartbeat mechanism will significantly underperform our more sophisticated detection. The multiple schemes are not only useful individually, but also complementary. Avoidance varies not only across nodes, but also within a single node over a span of multiple days. The data for the ny-1 node, calculated on a daily basis, is shown in Figure 12.

Figure 12: Daily counts of avoidance on the ny-1 proxy



Vivek Pai
2004-05-04