FeatureUSENIX

 

speedier squid:

A Case Study of an Internet Server Performance Problem

by Jeffrey Mogul
<mogul@pa.dec.com>

Jeffrey Mogul, of Compaq's Western Research Lab, currently works on Internet and operating system performance issues. He was program chair for the Winter 1994 USENIX Conference,and is a co-author of the HTTP/1.1 specification.

It's not always easy to discover what limits the performance of an Internet server application. Is the network overloaded? Is a Web page too complicated? Is the operating system inefficient? Is the application doing something the wrong way?

People make guesses about why a server runs slowly, and their intuition is often very wrong. Internet servers, such as Web servers and proxies, are a particularly tricky domain for intuition, because their performance depends on so many different aspects of computer system design. We may not even have a good feel for how fast a particular server ought to be, let alone why it runs slower than we would like.

This article is a case study of one such performance problem and what I did to solve it. Of course, I'll discuss the solution, but my main goal is to show how one can combine an advanced performance-measurement tool and some understanding of computer performance to rapidly discover and solve a performance bottleneck.

What Is Squid?

Squid[7] is a Web (HTTP) proxy server. It was based on software from the earlier Harvest research project[4] and is being developed and maintained by a community of Internet users, under the supervision of the NSF-funded National Laboratory for Applied Network Research. The Squid source code is freely available and runs on a wide variety of UNIX-like systems, making it a convenient platform for research; as a result, Squid is one of the most widely used proxy-server applications, even in nonresearch environments.

The primary goal of the Squid project is to provide a Web caching service. Squid servers are typically used as proxy caches, storing previously retrieved Web objects for later reuse. However, Squid can also be used as a noncaching proxy, typically as part of a security "firewall" between an organization and the Internet. For example, the Palo Alto firewall between the Internet and Digital Equipment Corporation's internal networks has used Squid as both a caching proxy and a noncaching proxy. (Although Digital has merged with Compaq, most or all of the users of this proxy are in the ".dec.com" domain.) My experiments were performed at this firewall.

When Squid is used as a caching proxy, its performance is greatly complicated by its use of disk I/O for storing, finding, and retrieving cache entries. In some contexts, such as the Digital Palo Alto firewall, where external network connectivity is reasonably fast, the disk I/O overhead of caching turns out to be much higher than the latency improvement from cache hits (which could be the subject of a future case study!). Thus, we typically run the Squid server in its noncaching mode.

A noncaching proxy makes little use of the filesystem, but it still faces performance bottlenecks. As the HTTP request rate increases, so does CPU utilization. Since our proxy often handles 2.5 million requests on a busy day, with peak rates much higher than the average, it can't afford to spend much CPU time per request. More important, we want it to scale well: The CPU time per request should not increase significantly as the request rate increases. For example, if the cost per request is proportional to the request rate, then the system's cost will increase in proportion to the square of the request rate (called O(N2) behavior).

The use of Squid as a noncaching proxy stresses many aspects of the system, including those that also affect Squid caching-proxy performance. Thus, although this article concentrates on noncaching performance, the results are still applicable to a caching Squid system.

For these tests, I used Squid version 1.1.20. Later versions have been released, but this is relatively close to the current "stable" version of the software.

Lab Tests versus Live Tests

Many Internet performance tests are conducted in laboratory environments. For example, the SPECweb96 benchmark, used for measuring Web server performance, assumes that the clients are connected via a LAN. This assumption may be necessary for a feasible benchmark test (because it is expensive and difficult to perform repeatable measurements at high request rates in the real Internet), but it can seriously skew the results.

In lab tests of Web servers, connections typically last for a few tens of milliseconds. However, HTTP connections in the real Internet often take hundreds of milliseconds, or even several seconds, because of much longer round-trip times, higher packet loss rates, and lower link bandwidths.

Consider a server handling 100 requests/sec. If the mean request duration is, say, 25 msec., then on average the server will be handling 100 x 0.025 = 2.5 connections at once. But if the mean request duration is 2 sec. (not atypical for the real Internet), then the same request rate means that the server will be handling, on average, 200 connections at once. If the server application uses algorithms that scale badly with the number of connections, or if it uses operating system features that scale badly, a request load that is easily handled in the lab may overwhelm the system in live use.

The measurements in this article were made on our live proxy system, in order to reflect real-world performance issues accurately. This makes it harder to obtain repeatable results and so requires some attention to experimental procedures. For example, trials must last long enough to avoid short-term fluctuations in load. But since the request load varies by a factor of ten between "prime time" (usually late morning on weekdays) and the middle of the night, it's also important to decide whether one wants to measure performance over a wide range of offered loads, or during a period of relatively constant load. I used one-hour prime-time trials when I wanted to measure peak-load performance, and full-day trials when I wanted to measure performance over a wide range of loads.

Tools in action

Performance measurements are only as good as the tools one uses to obtain them. I'll briefly explain the tools that I used.

Measuring Load and CPU Utilization

Since my primary concern was how much CPU time the Squid system uses and how this scales with request rate, I needed to measure CPU utilization over short periods of time and then correlate these measurements with the request rate.

Most Web servers, including Squid, can record per-request information in a log. We already had software to analyze the Squid request logs (offline, so as not to add load to the server) and compute request rates over various intervals. I chose a 15-minute interval, which yields 96 samples over a 24-hour period.

Our servers run shell scripts every 15 minutes to log various system statistics (including output from netstat, ps, vmstat, etc.). Although vmstat does display CPU utilization (by fraction spent in user mode, kernel mode, and idle time), in order to get 15-minute totals for these fractions it proved simpler to write a short program that provides the total number of clock ticks spent in each mode since the system was booted. A postprocessing script computes the differences between subsequent samples, avoiding any errors introduced by clock skew in the sampling process.

With measurements for both the request load and the CPU utilizations over identical 15-minute sampling intervals, I could then plot the relationship between request rate and CPU utilization. For example, see Figure 4. I could also run linear regressions to quantify the relationship. For example, see Table 2.

DCPI for Procedure Profiling

The CPU-utilization measurements show how much CPU time is being spent at various request rates, but they don't show where it is being spent (aside from the coarse breakdown of kernel-mode time versus user-mode time). To localize bottlenecks more precisely, one needs a procedure profile. Profiling software has been around for many years, going back to early versions of UNIX[6]. However, most profiling tools have several drawbacks:

  • They cover only one application at a time (and don't include kernel functions).
  • They count instructions executed, not cycles spent during execution.
  • They require recompilation or post-processing of application binaries.
  • They may add significant execution overhead.

A system called DCPI[1, 5] (for "Digital Continuous Profiling Infrastructure") has none of these drawbacks. DCPI, which is available for Compaq's Digital UNIX and Windows NT on Alpha processors, consists of a dynamically loaded kernel device driver and a set of other programs. It is capable of profiling the entire system, including the kernel, shared libraries, and all applications, while adding an overhead of only 1% to 3%. DCPI counts cycles, not instructions, so it accounts for the cost of delays such as cache misses and TLB misses.

This makes DCPI an effective tool for profiling an Internet server, since it quickly and easily reveals the functions that account for CPU consumption, no matter where they lie. Table 1 is a DCPI profile of the unmodified Squid 1.1.20 software running on a modified version of Digital UNIX V4.0B, on a 500MHz AlphaStation 500 system (21164A processor, SPECInt95 = 15.0) with 512MB of RAM. The table includes all procedures that account for at least 0.5% of the non-idle CPU time. The kernel modifications, described in another paper[3], improved the performance of the kernel's select() system call, although Table 1 shows that the various kernel procedures that implement select() still account for about 25% of the non-idle CPU time.

However, the single most expensive procedure, in this profile, is the comm_select() procedure in the Squid program. In fact, no other nonlibrary procedure in Squid accounts for more than 0.69% of the nonidle CPU time, and all of the other Squid procedures together account for less than 5% of the non-idle CPU time. This suggests that comm_select() is the best target for further analysis.

DCPI for detailed analysis

Suppose we want to compute the number of CPU cycles required to execute a particular piece of code. We could start by counting instructions. However, most high-performance CPUs are able to issue multiple instructions in a single cycle, although resource limits and operation latencies usually prevent the CPU from filling all issue slots on every instruction.

More important, modern CPUs experience "dynamic stalls," where they must stop issuing instructions for a short period. For example, if the CPU encounters a conditional branch instruction and has incorrectly guessed whether the branch will be taken (a "branch mispredict"), its instruction pipeline must be refilled before it can continue at full speed.

The lengthiest dynamic stalls usually occur when the memory system is unable to deliver data or instructions fast enough. While caches can help, most real applications don't fit in on-chip caches. Main memory latencies are often measured in hundreds of nanoseconds, while fast CPUs have cycle times of 2 nanoseconds or less, so one cache miss can cost more than several hundred instructions. This is especially problematic for operating-system code and similar programs, such as Squid, that to do relatively little arithmetic and that manage complex data structures.

DCPI's procedure-level profile can include the number of data-cache (D-cache) misses encountered by a procedure (not shown in Table 1). DCPI found that comm_select() encountered far more D-cache misses than any other procedure. Perhaps something in comm_select() is causing lots of dynamic stalls?

DCPI includes a tool called dcpicalc that uses the information in a profile, together with an accurate model of the CPU, to deduce which instructions in a procedure are stalling, and why. The full dcpicalc output for comm_select() is too lengthy to include, so I will show some short excerpts.

A common measure of CPU efficiency for a particular pieces of code is the average "cycles per instruction" (CPI). For example, a dual-issue machine with no stalls can, at best, execute at 0.5 CPI. Dcpicalc found, by analyzing the comm_select() profile against a model of the Alpha 21164A CPU, that the best-case execution could have averaged 0.71 CPI, but the actual execution averaged 2.69 CPI.

Dcpicalc then analyzed each profiled instruction to deduce how often it stalled, and why. This analysis is complex and based somewhat on heuristics, but generally gives plausible results. It found that 71.4% of the cycles spent in comm_select() went to D-cache misses, and only 3.7% went to other dynamic stalls. What is causing those D-cache misses?


 *** Best-case 4/6 = 0.67CPI, Actual 48/6 = 8.00CPI

 *** (8% execution without dynamic stalls)
015798 1011: ldlt5, 156(t2)11.0cy
01579c 1013: andt0, 0x1f, a201.0cy
0157a0 1013: biszero, 0x1, a311.0cy
0157a4 1013: slla3, a2, a211.0cy
0157a8 1011: cmplet5, t6, t5143.5cy
0157ac 1011: beqt5, 0x1200157c000

Figure 1. DCPI analysis of the worst basic block in comm_select().


Figure 1 shows excerpts from dcpicalc's annotation of the worst basic block in the comm_select() profile. (A basic block is a set of instructions always executed as a group.) Dcpicalc starts by saying that this block could have executed at 0.67 CPI in the absence of any dynamic stalls, but actually executed at 8 CPI. It then gives, for each instruction:

  • the instruction address

  • the source-code line number

  • the opcode and operands

  • the best-case number of cycles (which might be zero, for instructions eligible for multiple-issuing)

  • The estimated actual number of cycles (which also might be zero).

Note that the cmple instruction takes, on average, 43.5 cycles. Dcpicalc, in cryptic annotations not shown in Figure 1, also tells us that the most likely cause of this delay is a D-cache miss, that the instruction is probably waiting for data loaded from memory by the ldl instruction, and that this instruction was generated from source line 1011.

Dcpicalc shows similarly lengthy delays at source lines 1010 and 1016. Now we know the location of the problem (or at least a problem): lots of D-cache misses at source lines 1010, 1011, and 1016 in comm_select(). Figure 2 shows that neighborhood of the source code.



1008   for (i = 0; i < maxfd; i++) {

1009      /* Check each open socket for a handler. */

>1010      if (fd_table[i].read_handler) {

>1011        if (fd_table[i].stall_until <= squid_curtime) {

1012          nfds++;

1013          FD_SET( i, &readfds);

1014        }

1015      }

>1016      if (fd_table[i].write_handler) {

1017         nfds++;

1018         FD_SET(i, &writefds);

1019      }

1020   }

Figure 2: Source-code excerpt from comm_select()


From this code excerpt, it's pretty clear what is happening. Comm_select() is building up two bitmaps for use as input to the select() system call, by iterating over an array of structures, one entry per file descriptor. The loop looks at three fields in each array entry, and is getting a D-cache miss on many (or perhaps all) of these references. fd_table is declared so that the three fields used in one iteration of this loop are not adjacent, and might even lie on separate cache lines.

Because the number of active file descriptors is roughly proportional to event rate, and the cost of executing this loop is proportional to the number of file descriptors, the cost per event is proportional to the event rate.This causes O(N2) behavior, which prevents efficient scaling.

Statistics from our proxy server show that the maxfd variable, the upper limit of this loop, can reach values of 2380 or more, so each complete execution of this loop incurs as many as 2380 x 3 = 7140 D-cache misses.No wonder comm_select() takes so much time.

Fixing the Problems

Since I'm basically lazy, my first attempt to eliminate these D-cache misses was to simply change the order of the fields in the declaration of fd_table, so that the three fields referenced on each loop iteration were adjacent, and then recompile Squid.This had the advantage that it shouldn't require any debugging, and because the Alpha 21164A has 64-byte cache lines, it should result in just one cache miss per iteration.

That modification reduced the cost of comm_select() to 14% of the nonidle CPU cycles (from 30%).The average CPI for this procedure decreased from 2.69 to 2.43, mostly by reducing the number of cache misses on source line 1016.The discrepancy between the small change in CPI and the somewhat larger improvement in total time may reflect variation in Squid's request rate, which was significantly higher during the second trial.

Making the three interesting fields adjacent in each array element does improve the D-cache miss rate, but because the array elements contain many other fields that are not accessed during this loop, the D-cache is still filled mostly with useless information.I tried another small modification, creating a separate array to hold just these three fields.I reasoned that this would make full use of the D-cache during the loop.The result was a dramatic decrease in the CPI of comm_select(), to 0.75, but only a small decrease in the non-idle CPU cycle utilization of comm_select(), to 11%.

This was frustrating.I had been able to eliminate most of the delays directly associated with D-cache misses, but the procedure was still taking more of the CPU than anything except the kernel's select() implementation.I finally admitted that I would have to try an algorithmic change, at the risk of more extensive debugging.

The goal of the loop in question is to create two bitmaps for use by select(), based on the state of the entries in the fd_table. I changed the code to keep the bitmaps as long-lived data structures, updating them whenever the fields in fd_table are updated.Because select() destroys its input bitmaps, one actually has to make a copy of each bitmap before calling select().Even accounting for the cost of this copy, the size of the data referenced before each call to select() decreases by a factor of about 40 (because the Alpha uses 64-bit pointers; on a 32-bit CPU, the decrease is by a factor of 24).

One complication is that the loop compares the stall_until field against the current time, so the readfds bitmap cannot simply be updated just when the read_handler field changes.However, these timestamps have a granularity of one second, and the comm_select() function already iterates over the entire fd_table once per second (to check for other timeouts), so it adds almost no overhead to also do the stall_until comparisons once per second.

With this change made (and debugged), the mean CPI for comm_select() declined again, to 0.61, and its fraction of the nonidle CPU cycles declined to just 3.7%.

Further work with dcpicalc showed that most of the remaining cycles in comm_select() were being spent on the second line of the code excerpted (in a heavily abstracted form) in Figure 3.This loop is going through the bitmaps returned by select(), looking for any "set" bits, and calling the corresponding handlers.The FD_ISSET() macro, which extracts a single bit from a large bitmap, executes numerous instructions, and so this loop executes many instructions per bit.



for (fd = 0; fd < maxfd; fd++) {

 if (!FD_ISSET(fd, &readfds) && !FD_ISSET(fd, &writefds))

  continue;

 if (FD_ISSET(fd, &readfds)) {

  if (fd_table[fd].read_handler) {

     fd_table[fd].read_handler(fd, fd_table[fd].read_data);

  }

 }

 if (FD_ISSET(fd, &writefds)) {

  if (fd_table[fd].write_handler) {

     fd_table[fd].write_handler(fd, fd_table[fd].write_data);

  }

 }

}

Figure 3. Another source-code excerpt from comm_select()


The bitmaps typically have only a few bits set, and so it is much more efficient to treat each of them as an array of words, then iterate over this array looking for nonzero words (those with at least one bit set). This effectively parallelizes the search, and uses fewer instructions per step, so the revised loop executes (on average) a small fraction of an instruction per bit. Once it discovers a nonzero word, a simple loop then finds the nonzero bits in this word, calculates the corresponding bitmap index, and invokes the corresponding handler.

With this change, comm_select() now uses only 0.54% of the nonidle CPU cycles.Its CPI increases somewhat, to 1.18, because the last change eliminated lots of "useless" instructions but still examines the same bitmaps (and so still has about the same set of memory references). In spite of this dramatic improvement, comm_select() still has O(N2) behavior; the size of the bitmaps it manipulates remains proportional to the request rate.However, the changes reduce the cost by such a large constant factor as to make it unlikely that the comm_select() would ever dominate system costs at any request rate.

Measuring the Results

Having succeeded in reducing the profiled cost of comm_select() from 30% of the nonidle cycles to just 0.54%, my last task was to verify that this actually improves overall performance.

Recall that I have samples, taken every 15 minutes, of both the request rate and the CPU utilization.On separate days, I ran trials of the original Squid proxy and the modified software; each 24-hour period yielded 96 samples.Figure 4 shows both sets of samples.Idle-time measurements are shown using circles, and user-mode CPU time measurements are shown using squares.Samples from the trial with the original software are distinguished by open marks; samples from the trial with the modified software are shown using filled marks.

figure4_2
Figure 4. CPU costs as a function of request rate

Figure 4 includes lines showing linear regressions for each sample set.Table 2 summarizes these regressions.They show that the modifications make a 45% reduction in the slope of the regression for user-mode time; that is, the modified Squid expends much less user-mode CPU time per request than the original version.The slope of the regression for kernel-mode time has also improved slightly, apparently because the reduction in user-mode D-cache misses has improved the cache performance for other parts of the system.Overall, the slope of the idle-time regression has improved by 30%.In other words, for a given request rate, the modified software leaves more idle time and hence should support a higher request rate.

The X-intercepts of the idle-time regression lines imply that the modified version of Squid, on this hardware, can support a 45% higher request rate than the original.(The X-intercept is the projected point where the idle time drops to zero.)However, this projection is misleading; the system can actually sustain much higher request rates.The peak one-second rate actually logged during the trial with the original version was about 107 requests/sec, and was about 139 requests/sec with the modified version.

There's a simple reason for the discrepancy between the measured peak rates and the projected X-intercepts.When the system is lightly loaded, events arrive rarely enough that select() typically returns with just one ready descriptor.At increased loads, when the system has no idle time and the software falls behind, select() will return multiple events per call.In this region, each call to select() is amortized over multiple events, so its cost per event is reduced.Because select() and/or comm_select() account for the majority of the CPU-time costs, this "batching" effect allows the system to handle increasing load even after running out of idle time.Ultimately, however, the more efficient software can handle a higher peak load, and provides less latency at lower loads, than the original version.

Next Steps

The results of this case study suggest further steps to take.First, my patches to comm_select() will be integrated into a future release of Squid.

After application of these patches, and the kernel modifications to improve the performance of select()[3], the profiles still show that select() consumes a lot of CPU time, with user-level malloc() not far behind.A forthcoming release of Squid maintains its own pools of allocated memory for frequently-allocated data types, and so should put a lot less pressure on the malloc() functions, but it might still be worthwhile to study this cost in more detail.

Reducing the cost of select() while preserving its programming interface, beyond the kernel modifications we have already tried, may prove difficult.In another paper [2], Gaurav Banga, Peter Druschel, and I propose a new system call interface to replace select() for this kind of application; preliminary results from simulated-WAN testing suggest that this interface eliminates virtually all of the select()-related costs, and scales to arbitrary numbers of file descriptors.

Conclusions

I set out to discover something interesting about how Squid's CPU utilization scales with request rate, and why.By using the appropriate performance-debugging tools, I quickly discovered the cause of a significant inefficiency.After a few false starts and some additional tool use, I devised a relatively simple change to the Squid code.This eliminated most of the user-mode CPU time spent in the Squid program.More important, it increased the efficiency of the entire system, as measured by tests on a live system.

Of course, not every program has a single procedure whose performance can be improved by a factor of 55 through a simple programming change.But many programs, especially Internet servers in complex environments, have performance characteristics that are not necessarily obvious from their source code.Further, the influence of memory-system stalls is both especially critical on modern CPUs and especially difficult to discern from the source.One can gain tremendous leverage over these problems by using advanced performance tools, such as DCPI, that provide whole-system performance views and expose memory-system issues.

References

[1]Jennifer M. Anderson, Lance M. Berc, Jeffrey Dean, Sanjay Ghemawat, Monika R. Henzinger, Shun-Tak A. Leung, Richard L. Sites, Mark T. Vandevoorde, Carl A. Waldspurger, and William E. Weihl. "Continuous Profiling: Where Have All the Cycles Gone?" ACM Transactions on Computer Systems 14(4):357-390, November, 1997.

[2]Gaurav Banga, Peter Druschel, and Jeffrey C. Mogul. "Better operating system features for faster network servers." Proceedings of the Workshop on Internet Server Performance, pp. 69-79.Madison, WI, June, 1998. Available as <http://www.cs.rice.edu/~gaurav/papers/wisp98.ps>; revised version in ACM SIGMETRICS Performance Evaluation Review 26(3), Dec. 1998.

[3]Gaurav Banga and Jeffrey C. Mogul. "Scalable kernel performance for Internet servers under realistic loads." Proceedings of the USENIX 1998 Annual Technical Conference, pp. 1-12.New Orleans, LA, June, 1998.

[4]Anawat Chankhunthod, Peter B. Danzig, Chuck Neerdaels, Michael F. Schwartz, and Kurt J. Worrell. "A Hierarchical Internet Object Cache." Proceedings of the 1996 USENIX Technical Conference, pp. 153-163.San Diego, CA, January, 1996.

[5]Digital Equipment Corporation. DIGITAL Continuous Profiling Infrastructure Project. <http://www.research.digital.com/SRC/dcpi/>. You can obtain DCPI software and documentation from this URL.

[6]John Lions. Lions' Commentary on UNIX 6th Edition with Source Code. Peer-to-Peer Communications, San Jose, CA, 1996. Shows that, at least as far back as 1976, UNIX included a prof() system call.

[7]National Laboratory for Applied Network Research. Squid Internet Object Cache. <http://squid.nlanr.net/Squid/>. You can obtain Squid software and documentation from this URL.


I couldn't have done this work without help from Kathy Richardson and Carlos Maltzahn, who set up the proxy software; Gaurav Banga, whose work on select() performance got me started on this problem; the DCPI team, who produced a wonderful tool; and Duane Wessels and the Squid team, for providing the Squid software. Jeff Dean and Gaurav helped with proofreading.


?Need help? Use our Contacts page.
23 Mar. 1999 jr
Last changed: 8 Mar. 2000 mc
Issue index
;login: index
SAGE home