Much of the performance-related research in network servers has focused on improving throughput, with less attention paid to latency (13,6). In an environment with large numbers of users accessing the Web over slow links, the focus on throughput was understandable, since perceived latency was dominated by wide area network (WAN) delays. Additionally, early servers were often unable to handle high request rates, so throughput research directly affected service availability. The development of popular throughput-centric benchmarks, such as SPECWeb96 (19) and WebStone (12), also gave developers extra incentive to improve throughput.
Several trends are reducing non-server latencies, thereby increasing the relative contribution of server-induced latency. Improvements in server-side network connectivity reduce server-side network delays, while growing broadband usage reduces client-side network delays. Content distribution networks, which replicate content geographically, reduce the distance between the client and the desired data, reducing round-trip latency. With latencies between most major cities in the mainland US on the order of tens of milliseconds, server-induced latency could be a significant portion of end-user perceived latency. Some recent work addresses the issue of measuring end-user latency (15,3), with optimization approaches mostly focusing on scheduling (20,5,9,21).
However, comparatively little is understood about trends in network server latencies, or how system components affect them. Current research generally assumes that server latency is largely caused by queuing delays, that it is inherent to the system, and that scheduling techniques are the preferred way to address it. Unfortunately, these assumptions are not explicitly tested, complicating attempts to address latency systematically. Based on these observations, our goal is to understand the root causes of network server latency and address them, so that server latency can be improved. A better understanding of latency's origins can also help other research, such as work on Quality-of-Service (QoS) or scheduling policies.
By instrumenting the kernel, we find that Web servers can incur latency while blocked in filesystem-related system calls, even when the needed data is in physical memory. As a result, requests that could have been served from main memory are forced to wait unnecessarily for disk-bound requests. While this batching behavior may have little impact on throughput, it can significantly affect latency. It causes head-of-line blocking in the OS and manifests itself as other problems, such as a degradation of the kernel's service policies that are designed to ensure fairness. By examining individual request latencies, we find that this blocking reduces the fairness of response ordering, a phenomenon we call service inversion, in which short requests are often served with much higher latencies than much larger requests. We also find that this phenomenon increases with load, and that it is responsible for most of the growth in server latency under load.
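To make the head-of-line blocking effect concrete, the following toy model may help. It is only a sketch under stated assumptions, not a model of any server measured in this paper: all requests arrive at time zero, the sizes and service times are invented, the two-queue variant is just one hypothetical way of keeping cached requests from waiting on disk, and the pair count in `count_inversions` is a naive illustration rather than the metric introduced in Section 4.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    size_kb: int       # response size (invented for illustration)
    service_ms: float  # time needed once the server starts on the request
    cached: bool       # True if the data is already in physical memory


def fifo_latencies(requests):
    """Single blocking worker: each request waits for all earlier ones."""
    clock = 0.0
    latencies = {}
    for req in requests:
        clock += req.service_ms
        latencies[req.name] = clock
    return latencies


def nonblocking_latencies(requests):
    """Hypothetical alternative: disk-bound requests are handed to a
    separate helper, so cached requests never wait behind them."""
    main_clock = 0.0
    disk_clock = 0.0
    latencies = {}
    for req in requests:
        if req.cached:
            main_clock += req.service_ms
            latencies[req.name] = main_clock
        else:
            disk_clock += req.service_ms
            latencies[req.name] = disk_clock
    return latencies


def count_inversions(requests, latencies):
    """Naive count of pairs in which the smaller request finished later
    than the larger one (not the paper's metric)."""
    count = 0
    for small in requests:
        for large in requests:
            if (small.size_kb < large.size_kb
                    and latencies[small.name] > latencies[large.name]):
                count += 1
    return count


# One disk-bound request at the head of the queue, followed by cached ones.
REQUESTS = [
    Request("big_disk", 500, 20.0, cached=False),
    Request("small_a", 2, 1.0, cached=True),
    Request("small_b", 4, 1.0, cached=True),
    Request("medium", 50, 3.0, cached=True),
]

blocking = fifo_latencies(REQUESTS)
separate = nonblocking_latencies(REQUESTS)
print(blocking["small_a"], count_inversions(REQUESTS, blocking))  # 21.0 3
print(separate["small_a"], count_inversions(REQUESTS, separate))  # 1.0 0
```

In this toy setting the 2 KB cached request completes after 21 ms when it queues behind the disk-bound request, but after 1 ms when the two kinds of work are separated, while total work (and hence throughput) is unchanged.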
By addressing the blocking issues in both the application and the kernel, we improve response time by more than an order of magnitude and demonstrate a qualitatively different latency profile. The resulting servers also exhibit much lower service inversion and better fairness. Their latency profiles generally scale with processor speed, since cached requests are no longer bound by disk-related issues. In comparison, experiments using the original servers show that only server throughput, and not server latency, improves with increases in processor speed. We believe that our solution is more portable than redesigning kernel locking, and that our findings also apply to Web proxies, where more disk activity is required and the working sets generally exceed physical memory.
The rest of the paper is organized as follows. In Section 2, we present the servers used throughout this paper, the test environment, the workloads, and our methodology. In Section 3, we identify the latency problems and explain their causes. We introduce a new metric to quantify the effects in Section 4. In Section 5, we discuss how we address these problems, describe the resulting servers, present the experimental results on the new servers, and examine latency scalability with processor speed. We discuss related work in Section 6 and conclude in Section 7.