
Node Stability


The distributed node health monitoring system employed by CoDeeN, described in Section 2.2, provides data about the dynamics of the system and insight into the suitability of our monitoring choices. If the system were extremely stable, with few status changes, an active monitoring facility would add little value while increasing overhead. Conversely, if most failures were short, avoidance would be pointless, since the health data would be too stale to be useful. The rate of status changes can also guide upper bounds on peer group size, since larger groups require more frequent monitoring to keep staleness tolerable.
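The staleness concern above can be made concrete as a freshness check on heartbeat timestamps. This is a hypothetical sketch, not CoDeeN's implementation: the `PeerView` class and the 60-second staleness bound are illustrative assumptions.

```python
import time

# Hypothetical sketch of staleness-aware peer avoidance (not CoDeeN's
# actual code): each peer's last "alive" report is timestamped, and a
# peer is used only if its health report is fresher than MAX_STALENESS.
MAX_STALENESS = 60.0  # assumed bound; larger groups need tighter bounds


class PeerView:
    def __init__(self):
        self.last_heartbeat = {}  # peer -> timestamp of last alive report

    def record_heartbeat(self, peer, now=None):
        self.last_heartbeat[peer] = time.time() if now is None else now

    def usable_peers(self, now=None):
        now = time.time() if now is None else now
        return [p for p, t in self.last_heartbeat.items()
                if now - t <= MAX_STALENESS]


view = PeerView()
view.record_heartbeat("pr-1", now=0.0)
view.record_heartbeat("ny-1", now=50.0)
print(view.usable_peers(now=100.0))  # only ny-1 is fresh enough
```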

Our measurements confirm our earlier hypothesis about the importance of a monitoring-and-avoidance approach: the system exhibits fairly dynamic liveness behavior, and most failure time is spent in long failures, so avoiding bad peers is both essential and effective. Figure 9 depicts the stability of the CoDeeN system with 40 proxies, from the local views of four of our CoDeeN redirectors. We consider the system stable if the status of all 40 nodes is unchanged between two monitoring intervals, excluding cases where the observer is partitioned and sees no other proxies alive. The $x$-axis is the stable period length in seconds, and the $y$-axis is the cumulative percentage of total time. As we can see, these four proxies have very similar views. For about 8% of the time, the system's liveness status changes every 30 seconds (our measurement interval). Table 1 shows the 50$^{th}$ and 90$^{th}$ percentiles of the stable periods. For 50% of the time, the liveness status of the system changes at least once every 6-7 minutes; for 90% of the time, the longest stable period is about 20-30 minutes. In general, the system is quite dynamic - more so than one would expect from infrequent joins and departures.
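The stability definition above can be expressed directly: given one liveness snapshot per 30-second interval, a stable period is a maximal run of identical snapshots, and partitioned observations are excluded. The following is our reconstruction for illustration, not the measurement code used for the paper.

```python
# Sketch (our reconstruction, not the authors' code) of extracting stable
# periods from a sequence of liveness snapshots taken every 30 seconds.
# A stable period ends whenever any node's status differs from the previous
# snapshot; snapshots where the observer sees no peers alive are skipped.
INTERVAL = 30  # seconds between monitoring snapshots


def stable_periods(snapshots):
    """snapshots: list of frozensets of alive node ids, one per interval."""
    periods = []
    run = 0
    prev = None
    for snap in snapshots:
        if not snap:          # observer partitioned: exclude, end current run
            if run:
                periods.append(run * INTERVAL)
            run, prev = 0, None
            continue
        if prev is not None and snap == prev:
            run += 1
        else:
            if run:
                periods.append(run * INTERVAL)
            run = 1
        prev = snap
    if run:
        periods.append(run * INTERVAL)
    return periods


snaps = [frozenset({1, 2}), frozenset({1, 2}), frozenset({1}),
         frozenset({1}), frozenset({1})]
print(stable_periods(snaps))  # two stable periods: 60 s, then 90 s
```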

Figure: System Stability View from Individual Proxies
\begin{figure}
\begin{center}
\epsfig {file=figs/liveness_dist-all.eps,width=3.25in,height=2in}\vspace{-.125in}\vspace{-.15in}\end{center}
\end{figure}

Figure: System stability for smaller groups
\begin{figure}
\centering
\subfigure[Divided into 2 Groups]{\epsfig{file=figs/l...eps,width=1.5in,height=1.5in,clip=}}
\vspace{-.125in}\vspace{-.15in}
\end{figure}

The tradeoff between peer group size and stability is an open area for research, and our data suggests, quite naturally, that stability increases as group size shrinks. The converse, that large groups become less stable, implies that large-scale peer-to-peer systems will need to sacrifice latency (via multiple hops) for stability. To measure the stability of smaller groups, we divide the 40 proxies into 2 groups of 20, and then into 4 groups of 10, and measure group-wide stability. The results are shown in Figure 10 and also in Table 1. As we can see, with smaller groups, stability improves, with longer stable periods at both the 50$^{th}$ and 90$^{th}$ percentiles.
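One way to perform this group-wise measurement is to project each full-system snapshot onto a group's members, so that a group's stable run ends only when one of its own members changes status. A minimal sketch under that assumption, with hypothetical helper names:

```python
# Sketch (assumed structure, not the paper's code) of measuring group-wide
# stability: restrict each full-system liveness snapshot to one group, then
# count maximal runs of identical (projected) snapshots.
def project(snapshots, group):
    g = frozenset(group)
    return [snap & g for snap in snapshots]


def stable_runs(snapshots):
    runs, run = [], 1
    for prev, cur in zip(snapshots, snapshots[1:]):
        if cur == prev:
            run += 1
        else:
            runs.append(run)
            run = 1
    runs.append(run)
    return runs  # run lengths, in monitoring intervals


# Node 5 flaps, so the full view is unstable, but group {0, 1, 2} is not.
snaps = [frozenset({0, 1, 2, 5}), frozenset({0, 1, 2}),
         frozenset({0, 1, 2, 5}), frozenset({0, 1, 2})]
print(stable_runs(snaps))                      # full view: [1, 1, 1, 1]
print(stable_runs(project(snaps, [0, 1, 2])))  # group view: [4]
```

This mirrors why smaller groups show longer stable periods: a flapping node outside the group no longer ends the group's stable runs.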


Table: System Stable Time Period (Seconds)
\begin{table}
\begin{center}
\begin{tabular}{l|rr|rr|rr}
 & \multicolumn{2}{c|}{40-node} & \multicolumn{2}{c|}{2 $\times$ 20-node} & \multicolumn{2}{c}{4 $\times$ 10-node} \\
 & 50\% & 90\% & 50\% & 90\% & 50\% & 90\% \\
\hline
pr-1 & 445 & 2224 & 1345 & 6069 & 3267 & 22752 \\
ny-1 & 512 & 3451 & 1837 & 10020 & 4804 & 25099 \\
uw-1 & 431 & 2085 & 1279 & 5324 & 3071 & 19579 \\
st-1 & 381 & 2052 & 1256 & 5436 & 3008 & 14334 \\
\end{tabular}
\end{center}
\end{table}



Figure: Node Failure Duration Distribution. Failures that span a system-wide downtime are excluded from this measurement, so it includes only individual node failures. Also, due to the node-monitoring interval, it may take up to 40 seconds for a node to be probed by other nodes, so shorter failures may be missed.
\begin{figure}
\centering
\subfigure[CDF by \char93\ of Occurrences]{\epsfig{file=fi...eps,width=1.5in,height=1.5in,clip=}}
\vspace{-.125in}\vspace{-.15in}
\end{figure}

The effectiveness of monitoring-based avoidance depends on node failure durations. To investigate this issue, we calculate node failure durations both as seen by each individual node and as seen across all nodes combined. The distributions of these values are shown in Figure 11, where ``Individual'' represents the distribution seen by each node, and ``System-Wide'' counts a node as failed only if all nodes see it as failed. Examining the durations of individual failure intervals, shown in Figure 11a, we see that most failures are short, lasting less than 100 seconds; only about 10% of all failures last 1000 seconds or more. Figure 11b shows the failures in terms of their contribution to the total amount of time spent in failures. Here, we see that short failures are relatively insignificant: failures under 100 seconds represent only 2% of the total time, and even those under 1000 seconds account for only 30%. These measurements suggest that node monitoring can successfully avoid the most problematic nodes.
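The two views in Figure 11 correspond to weighting failures by count versus by duration. A small illustrative calculation (the durations below are made up, not the paper's data) shows how a population dominated by short failures can still spend most of its downtime in a few long ones:

```python
# Illustrative calculation (not the paper's data pipeline) of the two CDFs
# in Figure 11: the fraction of failures shorter than a threshold, versus
# the fraction of total downtime those short failures account for.
def cdf_by_count(durations, threshold):
    return sum(d < threshold for d in durations) / len(durations)


def cdf_by_time(durations, threshold):
    return sum(d for d in durations if d < threshold) / sum(durations)


# Hypothetical durations (seconds): many short failures, a few long ones.
durations = [40] * 90 + [5000] * 10
print(cdf_by_count(durations, 100))           # 0.9: most failures are short
print(round(cdf_by_time(durations, 100), 3))  # yet they are little downtime
```

Here 90% of failures are short, but they contribute under 7% of total failure time, which is the shape of the argument for avoidance: monitoring latency causes short failures to be missed, yet avoiding the long ones covers most of the downtime.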



Vivek Pai
2004-05-04