Performance optimization of network servers has been an important research area, with much of the work focused on improving throughput. Some work addressed coarse-grained blocking: Flash (13), for example, demonstrated how to avoid some disk-related blocking using non-blocking system calls. Much of the evaluation of disk I/O overheads has focused on Web proxies (10). Some of the most aggressive designs have used raw disk, eliminated standard interfaces, and abandoned reliable metadata in order to gain performance (18). In comparison, we have shown that no kernel or filesystem changes are necessary to achieve much better latency, and that these techniques can be retrofitted to legacy servers at low cost.
More recently, much attention has been paid to latency measurement and improvement. Rajamony & Elnozahy (15) measure client-perceived response time by instrumenting the documents being measured. Bent and Voelker explore similar measurements, but focus on how optimization techniques affect download times (3). Improvement techniques have been largely limited to connection scheduling, with most of the attention focused on the SRPT policy (4,5). Our work examines the root cause of the blocking, and our solutions eliminate the need for application-level connection scheduling. Our new servers use the existing scheduling within the operating system, and the results suggest that removing the obstacles yields automatic improvement under existing service and fairness policies.
Synchronization-related locking has been a major concern in parallel programming research. Rajwar et al. (16) proposed transactional lock-free support for multi-threaded systems. The causes of locking in our study span a broader range and arise in a different application domain. While head-of-line blocking is a well-known phenomenon in the network scheduling context, e.g. Puente et al. (14) and Jurczyk et al. (8) studied various blocking issues in network environments, we demonstrate that this phenomenon also arises in network server applications and has severe effects on user-perceived latency.
This paper also takes a different approach to fairness than other work, and the difference may be important in some contexts. The SEDA approach (21) tried to schedule requests based on size, but presents no measurements of the effectiveness of the scheduling itself. Interestingly, despite the four orders of magnitude variation in SPECweb's file sizes, SEDA handles 80% of requests in 200-1000ms, with a median of over 500ms, more than 10 times slower than Flash or Apache. This uniformly slow response time gives it a high score on the Jain fairness index (7) when fairness is evaluated on a per-client basis. At the per-request level, however, we believe that shorter responses should be served faster, and that our service inversion metric is therefore more useful. With coast-to-coast latencies in the continental US on the order of 100ms, and with news sites (Yahoo, CNN, etc.) routinely serving over 100 embedded objects per page, SEDA's server-induced latency would be a noticeable problem for real Web use.
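For reference, the Jain fairness index (7) over per-client allocations $x_1, \ldots, x_n$ (e.g. per-client throughputs) is:

```latex
J(x_1, \ldots, x_n) = \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n \sum_{i=1}^{n} x_i^2}
```

The index ranges from $1/n$ (one client receives everything) to $1$ (all clients receive equal allocations). Because uniformly slow service still yields nearly equal per-client allocations, a server can score near 1 on this index while exhibiting substantial per-request service inversion.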
At the other extreme, Bansal & Harchol-Balter (2) investigate the unfairness of the SRPT scheduling policy under heavy-tailed workloads and conclude that the unfairness of their approach is barely noticeable. By addressing the latency issues directly rather than scheduling around them, our approach removes the need for the application to explicitly schedule connections. Network scheduling can still be used, particularly for traffic shaping, prioritization, etc.
Finally, we should mention that the FreeBSD SMPng effort, released as FreeBSD 5, has completely rewritten much of the locking in FreeBSD, using finer-grained locks to improve performance on multiprocessor machines. Additionally, some of the filesystem locking appears to have gained read-shared locks in addition to exclusive locks. These locks could reduce the chance of lock convoys in pathname resolution, eliminating some of the blocking we observed. While we would like to evaluate this system's behavior, we have been unable to operate it under sufficient load without kernel panics or significant performance degradation.