
UDP Heartbeat


As part of its tests to avoid unhealthy peers, CoDeeN uses UDP heartbeats as a simple gauge of liveness. UDP has low overhead and can be used when socket exhaustion prevents TCP-based communication. Because UDP is unreliable, we send only small amounts of non-critical information over it, and we infer packet loss from missing acknowledgments (ACKs).

Each proxy sends a heartbeat message once per second to one of its peers, which then responds with information about its local state. The piggybacked load information includes the peer's average load, system CPU time, file descriptor availability, proxy and node uptimes, average hourly traffic, and DNS timing/failure statistics. Even at our current size of over 100 nodes, this heartbeat traffic is acceptably small. For larger deployments, we can reduce the heartbeat frequency, or we can divide the proxies into smaller groups that exchange only aggregate information across groups.
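The exchange described above can be sketched as a single heartbeat and ACK over a loopback UDP socket. This is only an illustration: the JSON encoding, the field names (`seq`, `load_avg`, `free_fds`, `proxy_uptime_s`), and the functions `serve_one_heartbeat`, `send_heartbeat`, and `demo` are assumptions made for this sketch, not CoDeeN's actual wire format or code. For simplicity the sender gives up after its timeout and does not record ACKs that arrive afterwards.

```python
import json
import socket
import threading
import time


def serve_one_heartbeat(sock, status):
    """Answer one heartbeat with an ACK that piggybacks local status."""
    data, addr = sock.recvfrom(2048)
    request = json.loads(data.decode("utf-8"))
    # Echo the sequence number so the sender can match ACK to heartbeat.
    ack = {"type": "ack", "seq": request["seq"], "status": status}
    sock.sendto(json.dumps(ack).encode("utf-8"), addr)


def send_heartbeat(peer_addr, seq, timeout=3.0):
    """Send one heartbeat; return (status, rtt_seconds), or None if no
    matching ACK arrives before the timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        message = {"type": "heartbeat", "seq": seq}
        start = time.monotonic()
        sock.sendto(json.dumps(message).encode("utf-8"), peer_addr)
        try:
            data, _ = sock.recvfrom(2048)
        except socket.timeout:
            return None
        reply = json.loads(data.decode("utf-8"))
        if reply.get("type") != "ack" or reply.get("seq") != seq:
            return None
        return reply["status"], time.monotonic() - start


def demo():
    """Run one heartbeat/ACK round trip between two local sockets."""
    peer = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    peer.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    peer.settimeout(5.0)         # keep the responder from blocking forever
    # Hypothetical status values standing in for the piggybacked load data.
    status = {"load_avg": 0.42, "free_fds": 900, "proxy_uptime_s": 3600}
    responder = threading.Thread(target=serve_one_heartbeat, args=(peer, status))
    responder.start()
    result = send_heartbeat(peer.getsockname(), seq=1)
    responder.join()
    peer.close()
    return result
```

Calling `demo()` returns the peer's status dictionary together with the measured round-trip time. A real deployment would pick a different peer each second and keep per-peer history rather than discarding the result.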

Heartbeat acknowledgments can get delayed or lost, giving some insight into the current network/node state. We consider acknowledgments received within 3 seconds to be acceptable, while any arriving beyond that are considered "late". The typical inter-node RTT on CoDeeN is less than 100ms, so not receiving an ACK in 3 seconds is abnormal. We maintain information about these late ACKs to distinguish between overloaded peers/links and failed peers/links, for which ACKs are never received.
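The three outcomes for a heartbeat (acceptable, late, never acknowledged) can be sketched as a small classifier. The 3-second threshold comes from the text; the names `AckState` and `classify_ack`, and the convention that a missing ACK is represented by `None`, are assumptions made for this illustration.

```python
from enum import Enum
from typing import Optional

LATE_THRESHOLD_S = 3.0  # ACKs arriving within 3 seconds are acceptable


class AckState(Enum):
    ON_TIME = "on_time"  # suggests a healthy peer and link
    LATE = "late"        # suggests an overloaded peer or link
    MISSING = "missing"  # suggests a failed peer or link


def classify_ack(sent_at: float, received_at: Optional[float]) -> AckState:
    """Classify one heartbeat by when (or whether) its ACK arrived.

    Times are in seconds on the same clock. received_at is None when no
    ACK has been seen by the time the heartbeat is evaluated (assumption:
    the caller decides how long to wait before declaring it missing).
    """
    if received_at is None:
        return AckState.MISSING
    delay = received_at - sent_at
    return AckState.ON_TIME if delay <= LATE_THRESHOLD_S else AckState.LATE
```

For instance, an ACK that arrives 80 ms after its heartbeat is classified as `ON_TIME`, one that arrives 4.5 seconds afterwards as `LATE`, and one that never arrives as `MISSING`.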

Several policies determine when missing ACKs are deemed problematic. Any node that does not respond to the most recent heartbeat is avoided, since it may have just recently died. Using a 5% loss rate as a limit, and understanding the short-term nature of network congestion, we avoid any node missing 2 or more ACKs in the past 32, since that implies a loss rate of at least 6% (2/32 = 6.25%). However, we consider viable any node that responds to the most recent 12 heartbeats, since a node with a 5% packet loss rate has only a roughly 54% chance (0.95^12) of 12 consecutive successes, and so the node is likely to be usable.
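The arithmetic and the three policies above can be checked with a short sketch. The constants are taken from the text; the function names, the boolean history representation, and the treatment of a node with no history as not viable are assumptions made for this illustration. The sketch also assumes the 12-consecutive-success rule takes precedence over the 2-in-32 rule, which is one reading of the "however" in the text, and it leaves to the caller whether a late ACK counts as received.

```python
HISTORY = 32  # number of recent heartbeats considered
STREAK = 12   # consecutive successes that mark a node as viable


def implied_loss_rate(misses, window=HISTORY):
    """Loss rate implied by `misses` missing ACKs out of `window` heartbeats."""
    return misses / window


def streak_probability(loss_rate=0.05, streak=STREAK):
    """Chance of `streak` consecutive successes at the given loss rate."""
    return (1.0 - loss_rate) ** streak


def is_viable(history):
    """Apply the three policies to a peer's ACK history.

    history: sequence of bools, oldest first, True meaning the heartbeat
    was acknowledged.
    """
    recent = list(history)[-HISTORY:]
    if not recent:
        return False  # assumption: a node with no history is not trusted
    if not recent[-1]:
        return False  # most recent heartbeat unanswered: may have just died
    if len(recent) >= STREAK and all(recent[-STREAK:]):
        return True   # 12 consecutive successes: likely usable
    return recent.count(False) < 2  # avoid at 2 or more misses in 32


print(implied_loss_rate(2))            # 0.0625
print(round(streak_probability(), 3))  # 0.54
```

Under this reading, a node that missed two heartbeats early in the window but has answered the last 12 is considered viable again, while a node with two misses inside the last 12 heartbeats is avoided.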

By coupling the history of ACKs with their piggybacked local status information, each instance in CoDeeN independently assesses the health of other nodes. This information is used by the redirector to determine which nodes are viable candidates for handling forwarded requests. Additionally, the UDP heartbeat facility has a mechanism by which a node can request a summary of the peer's health assessment. This mechanism is not used in normal operation, but is used for our central reporting system to observe overall trends. For example, by querying all CoDeeN nodes, we can determine which nodes are being avoided and which are viable.



Vivek Pai
2004-05-04