Local Monitoring

Local monitoring gathers information about the CoDeeN instance's state and its host environment, to assess resource contention as well as external service availability. Resource contention arises from competition from other processes on a node, as well as incomplete resource isolation. External services, such as DNS, can become unavailable for reasons not related to PlanetLab.

We believe that the monitoring mechanisms we employ on PlanetLab may be useful in other contexts, particularly for home users joining large peer-to-peer projects. Most PlanetLab nodes host a small number of active experiments/projects at any given time. PlanetLab uses vservers, which provide a view-isolated environment with a private root filesystem and security context, but no other resource isolation. While this system falls short of true virtual machines, it is better than what can be expected on other non-dedicated systems, such as multi-tasking home systems. External factors can also affect service health. For example, a site's DNS server failure can disrupt the CoDeeN instance, and most such problems appear to be external to PlanetLab [17].

The local monitor examines the service's primary resources, such as free file descriptors/sockets, CPU cycles, and the DNS resolver service. Non-critical information includes system load averages, node and proxy uptimes, traffic rates (classified by origin and request type), and free disk space. Some failure modes were determined by experience: when other experiments consumed all available sockets, not only could the local node not tell that others were unable to contact it, but incoming requests appeared to be queued indefinitely inside the kernel rather than reporting failure to the requester.

Values available from the operating system/utilities include node uptime, system load averages (both via "/proc"), and system CPU usage (via "vmstat"). Uptime is read at startup and updated inside CoDeeN, while load averages are read every 30 seconds. Processor time spent inside the OS is queried every 30 seconds, and the 3-minute maximum is kept. Keeping the maximum over 3 minutes reduces fluctuations, and at 100 nodes the 3-minute window exceeds the gap between successive heartbeats (described below) from any other node. We avoid any node reporting more than 95% system CPU time, since we have found that it correlates with kernel/scheduler problems. While some applications do spend much time in the OS, few spend more than 90%, and 95% generally seems failure-induced.
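The sampling policy above can be illustrated with a short sketch. This is not CoDeeN's code: the class name, the function name, and the constants are hypothetical, chosen only to mirror the figures in the text (30-second samples, a 3-minute window, a 95% limit). Obtaining the system CPU percentage itself (for example by parsing "vmstat" output) is left out.

```python
from collections import deque

# Values taken from the text; the names are illustrative assumptions.
SAMPLE_INTERVAL_S = 30      # system CPU is queried every 30 seconds
WINDOW_S = 180              # the 3-minute maximum is kept
SYSTEM_CPU_LIMIT = 95.0     # nodes above this are avoided


class SystemCpuWindow:
    """Keep the most recent samples and expose their maximum."""

    def __init__(self, window_s=WINDOW_S, interval_s=SAMPLE_INTERVAL_S):
        # 180 s / 30 s = 6 samples; older samples fall out automatically.
        self._samples = deque(maxlen=window_s // interval_s)

    def add(self, system_cpu_percent):
        self._samples.append(float(system_cpu_percent))

    def maximum(self):
        return max(self._samples) if self._samples else 0.0

    def should_avoid(self, limit=SYSTEM_CPU_LIMIT):
        # "More than 95%" is read here as a strict comparison.
        return self.maximum() > limit


def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg text."""
    fields = text.split()
    return float(fields[0]), float(fields[1]), float(fields[2])
```

A caller would invoke `add()` once per sampling interval and report `maximum()` in its heartbeat; because the window holds six samples, a single spike stays visible for three minutes before it ages out.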

Other values, such as free descriptors and DNS resolver performance, are obtained via simple tests. We create and destroy 50 unconnected sockets every 2 seconds to test the availability of space in the global file table. At our current traffic levels, 50 sockets are generally sufficient to handle two seconds of service on a single node. Any failures over the past 32 attempts are reported, which causes peers to throttle traffic for roughly one minute to any node likely to fail. Similarly, a separate program periodically calls gethostbyname() to exercise the node's DNS resolver. To measure comparable values across nodes, and to reduce off-site lookup traffic, only other (cacheable) PlanetLab node names are queried. Lookups requiring more than 5 seconds are deemed failed, since resolvers default to retrying at 5 seconds. We have observed DNS failures caused by misconfigured "/etc/resolv.conf" files, periodic heavyweight processes running on the name servers, and heavy DNS traffic from other sources.
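The two tests described above might look roughly like the following sketch. It makes several assumptions that the text does not confirm: that the 50 sockets are held open simultaneously before being closed, that TCP sockets are used, and that a slow lookup is classified after it returns rather than interrupted. All function, class, and parameter names are hypothetical. Note that 32 attempts at one every 2 seconds span 64 seconds, which matches the "roughly one minute" of throttling.

```python
import socket
import time
from collections import deque

# Figures from the text; names are illustrative assumptions.
PROBE_SOCKETS = 50     # sockets created per probe
HISTORY_LEN = 32       # attempts whose failures are reported
DNS_LIMIT_S = 5.0      # lookups slower than this count as failed


def probe_sockets(count=PROBE_SOCKETS, make_socket=None):
    """Return True if `count` unconnected sockets can be open at once."""
    if make_socket is None:
        make_socket = lambda: socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    opened = []
    try:
        for _ in range(count):
            opened.append(make_socket())
        return True
    except OSError:
        # Typically descriptor or file-table exhaustion.
        return False
    finally:
        for s in opened:
            s.close()


class ProbeHistory:
    """Count failures among the most recent probe attempts."""

    def __init__(self, length=HISTORY_LEN):
        self._results = deque(maxlen=length)

    def record(self, ok):
        self._results.append(bool(ok))

    def failures(self):
        return sum(1 for ok in self._results if not ok)


def timed_lookup(hostname, resolve=socket.gethostbyname,
                 clock=time.monotonic, limit_s=DNS_LIMIT_S):
    """Resolve `hostname`; return (ok, elapsed_seconds).

    The lookup is not interrupted at the limit; it is only judged afterwards.
    """
    start = clock()
    try:
        resolve(hostname)
        resolved = True
    except OSError:  # socket.gaierror and socket.herror are subclasses
        resolved = False
    elapsed = clock() - start
    return resolved and elapsed <= limit_s, elapsed
```

The `make_socket`, `resolve`, and `clock` parameters exist only so the sketch can be exercised without touching the network; a real monitor would call `probe_sockets()` every 2 seconds, feed the result to `ProbeHistory.record()`, and publish `failures()` to its peers.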

Vivek Pai
2004-05-04