The following paper was originally published in the Proceedings of the USENIX 2nd Symposium on Operating Systems Design and Implementation (OSDI '96), Seattle, Washington, October 28-31, 1996.

For more information about the USENIX Association contact:
1. Phone: 510 528-8649
2. FAX: 510 548-5738
3. Email: office@usenix.org
4. WWW URL: https://www.usenix.org

Studies of Windows NT Performance using Dynamic Execution Traces

Sharon E. Perl and Richard L. Sites
Digital Systems Research Center
Palo Alto, CA
{perl,sites}@pa.dec.com

Abstract

We studied two aspects of the performance of Windows NT(tm): processor bandwidth requirements for memory accesses in a uniprocessor system running commercial and benchmark applications, and locking behavior of a commercial database on a small-scale multiprocessor. Our studies are based on full dynamic execution traces of the systems, which include all instructions executed by the operating system and applications over periods of a few seconds (enough time to allow for significant computation). The traces were obtained on Alpha PCs, using a new software tool called PatchWrx that takes advantage of the Alpha architecture's PAL-code layer to implement efficient, comprehensive system tracing. Because the Alpha version of Windows NT uses substantially the same code base as other versions, and therefore executes nearly the same sequence of calls, basic blocks, and data structure accesses, we believe our conclusions are relevant for non-Alpha systems as well. This paper describes our performance studies and interesting aspects of PatchWrx.

We conclude from our studies that processor bandwidth can be a first-order bottleneck to achieving good performance. This is particularly apparent when studying commercial benchmarks. Operating system code and data structures contribute disproportionately to the memory access load. We also found that operating system software lock contention was a factor preventing the database benchmark from scaling up on the small multiprocessor, and that the cache coherence protocol employed by the machine introduced more cache interference than necessary.
1 Introduction

This work was triggered by two performance puzzles (circa 1995) related to Microsoft's SQL Server running on Alpha [SW95] PCs under the Windows NT operating system: how could we speed up the uniprocessor version, and how could we get closer to linear scaling for the multiprocessor version?

To answer these questions we found that we needed to look at the detailed behavior of the system under load. We created a tool for obtaining complete traces of the instruction and data streams executed by the processor in all operating system and application code. We proceeded to examine these traces and to use them to run simulations that revealed interesting properties of the then-current system. The results of the simulations also had implications for future processor design.

The first puzzle we studied is how the processor bandwidth requirements of applications (the bandwidth required to service on-chip instruction and data cache misses from off-chip caches or memory) limit the achievable execution speed of the applications. This study was motivated by a discussion with some colleagues about using a prefetching strategy to improve performance. Their studies showed that prefetching would not help in the particular situation because there wasn't enough processor-chip pin bandwidth to support the workload. When pin bandwidth is a bottleneck, some common techniques for trying to improve performance do not help. These include multiple instruction issue, code scheduling, prefetching, and improved external cache latency. Pin bandwidth puts a ceiling on how fast an application can run. We discovered that pin bandwidth is indeed a bottleneck for interesting commercial applications, as well as for data accesses in one of the SPEC benchmarks that we studied.

The second puzzle we studied is how lock contention for a multiprocessor application limits the scalability of the application. Detailed execution traces revealed patterns of locking that may not have been expected by the operating system designers, application designers, or hardware designers responsible for the cache coherence protocol. Lock contention and the related cache coherence overhead prevent the application from scaling up beyond a small number of processors.

We chose these two aspects to study because they seem to be important factors in overall workload performance.

The contributions of this work are threefold: we provide evidence of processor pin bandwidth limitations and locking problems for commercial applications; we introduce a new tool for obtaining full traces of a system that allows us to study such problems; and we are making available some of the traces upon which our studies are based for other researchers to use.[1] Also, the understanding we gained from this work led to significant improvements to the hardware and software involved, so the results presented here do not fully apply to currently-shipping hardware and software.

[1] Our uniprocessor traces are publicly available on CD-ROM to full-time faculty members. Contact one of the authors to obtain a copy.

In the next section we describe some of the more interesting aspects of the tracing tool, called PatchWrx. Section 3 describes our studies of pin bandwidth requirements for four different applications on two different Alpha processors. Section 4 describes studies of the locking behavior of one of these applications, the Microsoft SQL Server database, on a small multiprocessor. Section 5 concludes.
2 PatchWrx

To understand the traces that are the input to our performance studies it is helpful to understand the properties of PatchWrx, the tool used to produce them. In this section we give an overview of PatchWrx, and then describe some of the highlights of its design and implementation.

2.1 Overview

PatchWrx is a software-only technique for capturing full time-stamped traces of the dynamic instruction stream (i-stream) and data stream (d-stream) of operating system and user code.

The goal of the PatchWrx effort is to capture traces of every single instruction executed throughout many seconds of a real operating system running some complex workload on a non-microprogrammed multiprocessor. We wanted to build a software-only solution, rather than requiring one-of-a-kind hardware that cannot easily be applied in a customer computing environment. We wanted to gather traces with less than a factor of ten performance degradation, so that nothing in the operating system broke due to timeouts or excessive latencies. Finally, we wanted to work with arbitrary binaries for which we did not have access to the source code.

The technique we have adopted rewrites binary executable images, inserting patches that record in a log the target of every change of control flow (branch, call, return, system call, system return, interrupt, and return-from-interrupt), and some base register values for d-stream memory accesses. A reconstruction program working from the log and the binary program images reproduces the trace of the full i-stream and d-stream that was executed. Our logs are typically about 5-10 times smaller than the resulting traces.

With the entire operating system patched for just branches (not loads/stores) and logging on, everything runs at about 1/4 of normal speed until the log buffer fills up. Then logging is turned off and the run speed is about 1/2 of normal speed. This is sufficiently fast that our personal machines have run patched all the time for over a year. With loads and stores patched in the operating system and applications, the worst slowdown we've seen is about 1/8 of normal speed. Patched images are 30-50% larger than the originals.

PatchWrx is an offshoot of the ATUM work in tracing using microcode [ASH86], and work with binary translation [SCK+93]. Our approach is similar to other inline tracing efforts, but differs significantly in at least one dimension. Most published studies are of user code only [EKKL90, LB94], or are done on a single processor [BKW90, CB93], or require rebuilding source code [SJF92], or trace only cache misses, not all instructions [CHRG95, TGH92]. None use Windows NT. The excellent Shade paper [CK94] summarizes about thirty previous tools. Using that paper's classification, PatchWrx, like ATUM, traces executables, user and system code, multiple domains, multiple processors, signals, dynamic linking, and bugs, with performance similar to Shade.

We chose to produce traces rather than do on-the-fly data analysis [SE94] because of the difficulty of recreating complex execution environments months after the original investigation. With a detailed trace, questions asked a year or more later can still be answered. Somewhat like an electron microscope for computing, the PatchWrx approach is for studying a small amount of execution in excruciating detail, rather than summarizing long-running executions.

All of our experiments have been performed under Windows NT, version 3.5. The uniprocessor experiments were run on an Alpha AXP 150 with 128 MB of main memory. The multiprocessor experiments were run on a four-processor AlphaServer 2100, with 190 MHz processors and 256 MB of memory.

2.2 Trace Contents

The final output of PatchWrx is a trace containing the sequence of instructions executed by the operating system and all applications from the time logging was enabled up until the log buffer (from which the trace is reconstructed) filled up.

Each instruction in a trace is tagged with its program counter value, and if it is a load or store instruction, the virtual memory address that is its source or target. In a multiprocessor trace, the instruction is also tagged with its processor number.

Some versions of our traces also contain timestamps on all branching instructions except for two-way branches. These are used to help line up the i-streams from different processors in the multiprocessor traces, and to compute lock holding times in our second study, described in Section 4.
2.3 Log Entries

While a patched system is running, a log is collected recording the information necessary to eventually reproduce a full system trace. A log is a sequence of entries describing branching events or data references. Figure 1 summarizes the different kinds of log entries.

Figure 1: Formats of PatchWrx log entries. Each entry is 64 bits.

As in most computer designs, the Alpha architecture has two basic forms of branching instructions: a jump to an address in a register, and a PC-relative two-way conditional branch. For each jump instruction, we record one log entry. For two-way branches, we accumulate a bit-vector recording the outcomes of up to 60 two-way branches in a single log entry, one bit per branch. This gives us a compact encoding of the exact flow within a subroutine, taking only about 10% of the log entries.
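As a concrete illustration, the C sketch below shows one way such a bit-vector accumulator could work. It is an editorial reconstruction in present-day C, not PatchWrx source: the type tag ENTRY_BRVEC, the log_append primitive, and the field layout are assumptions; only the 64-bit entry size, the 60-branch capacity, and the flush-before-interrupt behavior (see Section 2.7) come from the text.

    #include <stdint.h>

    #define ENTRY_BRVEC 0x1ULL   /* hypothetical 4-bit entry type tag */
    #define MAX_BR_BITS 60

    static uint64_t br_bits;     /* accumulated outcomes, LSB first  */
    static int      br_count;    /* number of valid bits so far      */

    extern void log_append(uint64_t entry);  /* stores one 8-byte entry */

    /* Flush the partial vector: done when all 60 bits are full, and
     * also just before an interrupt is logged, so reconstruction
     * knows exactly how many branch bits precede the interrupt. */
    void flush_brvec(void)
    {
        if (br_count == 0)
            return;
        log_append((ENTRY_BRVEC << 60) | br_bits);
        br_bits = 0;
        br_count = 0;
    }

    /* Conceptually invoked by the patch at every two-way branch. */
    void log_cond_branch(int taken)
    {
        if (taken)
            br_bits |= 1ULL << br_count;
        if (++br_count == MAX_BR_BITS)
            flush_brvec();
    }

Packing 60 outcomes into one entry is what makes conditional branches only about 10% of the log despite being the most frequent control-flow events.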
When an interrupt or page fault occurs during logging, the address of the first instruction of the interrupt or fault handler is put in the log, along with the address of the first instruction not executed because of the interrupt (the instruction that would normally be executed immediately after the interrupt handler returns). This information is used during trace reconstruction to determine exactly where the interrupt occurred in the i-stream.

Additional log entries record information about memory load and store instructions, and process context switches.

In the Alpha implementation of PatchWrx, the log is recorded in a 45 MB portion of physical main memory that is reserved at boot time, and is therefore invisible to the operating system. The log buffer holds about 5.9 million eight-byte log entries, which is enough for 5-20 seconds of real time. There is so much information in a single reconstructed trace that we have not been motivated to try stitching multiple traces together [CHRG95, AH90]; a single reconstructed trace contains about 650 MB of dynamic i-stream with instruction and data addresses.

Recording the log in main memory is much faster than recording it on disk or tape. Recording in physical memory instead of virtual memory allows us to log all levels of the operating system, including the page-fault handler, without generating recursive page faults. It also allows us to trace across multiple threads running in multiple address spaces, without needing to write a log entry to one address space while executing in a different address space.

2.4 PAL Subroutines for Logging

To implement the logging code, we use the Alpha architecture's PALcall instruction, which traps to one of a set of Privileged Architecture Library (PAL) specialized subroutines without disturbing any programmer-visible state, such as registers. These subroutines have access to physical main memory and to internal hardware registers, and they run with interrupts turned off. We extended the PAL-code for Alpha NT with eight additional subroutines, and we modified some of the existing subroutines, as summarized in Table 1. Other architectures may have supervisor call or trap instructions that, in conjunction with modified operating system kernel interrupt routines, could be used to get a similar effect.

    PAL routine              Action/Recorded info.
    PalReset                 (Set aside log memory)
    InterruptStackDispatch   next addr., interrupt target
    SoftwareInterrupt        next addr., interrupt target
    DispatchMmFault          next addr., page fault target
    UNALIGNED                next addr., align. fault target
    RFE                      return from exception target
    CALLSYS                  sys. call target
    RETSYS                   return from sys. call target
    SWPCTX                   new process ID

    pwrdent                  read log entry from buffer
    pwctrl                   init. log, turn logging on/off
    pwbsr                    branch entry
    pwjsr                    jump/call/return entry
    pwldst                   load/store base register entry
    pwbrt                    cond. branch taken
    pwbrf                    cond. branch fall-through
    pwpeek                   (for debugging only)

Table 1: Logging-related PAL subroutines. The first set are modifications to existing PAL subroutines. The second set (starting with pwrdent) are new PAL subroutines for PatchWrx.

2.5 Collecting Data Addresses

It is possible to capture data addresses by patching all load and store instructions, but this fills up the log buffer quite quickly, so we would like to avoid it. We observe that many pieces of code use multiple references off the same base register. Since the i-stream reconstruction recovers the actual instructions executed, we will have these address specifications in the dynamic i-stream. Thus, for a sequence of references over which changes to the base register value can be computed from the i-stream, we need only record the base register's value once. The effect is that only one out of every 5-10 load or store instructions is actually patched, and that for loops with constant strides through memory, only the initial base-register value outside the loop is traced.

Our patching and reconstruction algorithms are somewhat simplistic, causing the reconstruction to reach some loads and stores without knowing the base register value (as with certain interrupts). When this happens during reconstruction, we make up a random synthetic value for the base register, then track any incremental changes from there. Even in code with no load/store patches at all, this is surprisingly useful. The effect seen in the d-stream is that the bases of arrays and structures and stacks are random, but the relative access patterns within each aggregate are accurately reflected.
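The sketch below illustrates the kind of bookkeeping a reconstructor can use for this. It is a simplified illustration, not the actual PatchWrx reconstructor: the data structures, the function names, and the synthetic-base formula are invented; the behavior they model (seed a register from an occasional pwldst entry, update it from i-stream arithmetic, fall back to a random synthetic base otherwise) follows the text above.

    #include <stdint.h>
    #include <stdlib.h>

    static uint64_t regs[32];    /* mirror of the 32 integer registers */
    static int      known[32];   /* 1 if regs[r] holds a real or
                                    synthetic tracked value            */

    /* A pwldst-style log entry supplies the true base value. */
    void on_ldst_log_entry(int reg, uint64_t value)
    {
        regs[reg] = value;
        known[reg] = 1;
    }

    /* i-stream "lda dst, disp(src)" lets us track increments. */
    void on_lda(int dst, int src, int16_t disp)
    {
        regs[dst] = regs[src] + disp;
        known[dst] = known[src];
    }

    /* Effective address for a load/store "disp(base)" recovered
     * from the i-stream. */
    uint64_t dstream_address(int base, int16_t disp)
    {
        if (!known[base]) {
            /* No logged value reachable: invent a random synthetic
             * base and track increments from here, so the relative
             * access pattern within the aggregate stays accurate. */
            regs[base] = ((uint64_t)rand() << 20) & ~0xfffULL;
            known[base] = 1;
        }
        return regs[base] + disp;
    }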
2.6 Handling Multiprocessors

When PatchWrx is running on multiple processors, the log entries for all processors are merged into a single log buffer. This allows us to see the dynamics of the interactions between processors. Writing a single merged log requires doing a multiprocessor-atomic update of the next-log-entry pointer, and requires encoding the processor number in each log entry.

Rather than interleaving single entries, we allocate chunks of 128 entries to each processor. This cuts down the frequency of atomic updates by two orders of magnitude and lets us encode the processor number once per chunk of entries, rather than in every entry. With log entries generated on each processor at the rate of about one per microsecond, a group of four chunks represents about 100 microseconds of real time on a four-processor system.
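A minimal sketch of this allocation scheme follows, using C11 atomics as a stand-in for the multiprocessor-atomic update (PatchWrx does this inside PAL-code with interrupts off); the chunk-header format and helper names are invented for illustration.

    #include <stdint.h>
    #include <stdatomic.h>

    #define CHUNK_ENTRIES 128

    extern uint64_t logbuf[];           /* reserved physical memory   */
    static _Atomic uint64_t next_free;  /* next free entry index      */

    typedef struct { uint64_t *next, *limit; } cpu_log_t;
    static cpu_log_t cpu_log[4];

    /* One atomic update claims 128 entries at once: roughly 100x
     * fewer atomic operations than claiming entry-by-entry, and the
     * processor number need be encoded only once, in a chunk header. */
    static void claim_chunk(int cpu)
    {
        uint64_t base = atomic_fetch_add(&next_free, CHUNK_ENTRIES);
        cpu_log[cpu].next  = &logbuf[base];
        cpu_log[cpu].limit = &logbuf[base + CHUNK_ENTRIES];
        *cpu_log[cpu].next++ = (uint64_t)cpu << 56;  /* chunk header */
    }

    void log_entry(int cpu, uint64_t entry)
    {
        if (cpu_log[cpu].next == cpu_log[cpu].limit)
            claim_chunk(cpu);
        *cpu_log[cpu].next++ = entry;
    }

Per-chunk rather than per-entry allocation also gives the reconstructor the chunk-ordering property exploited in Section 2.7: chunks appear in the buffer in the order in which they were started.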
2.7 Trace Reconstruction from Logs

Reconstruction of a full trace given a log and a set of binaries is mostly straightforward. As described above, some special care and techniques are applied for obtaining data addresses from a limited set of base register values recorded in a log. Two additional tricky issues involve handling interrupts and merging multiprocessor traces.

The first issue concerns where to insert interrupts in a reconstructed i-stream. During reconstruction, one potential place to divert the i-stream to the interrupt handler is after all jump entries that precede the interrupt have been consumed, when the instruction that matches the not-executed address in the next log entry is encountered. If the not-executed instruction is inside a loop, the interrupt must be delivered during the right iteration. Loop iterations are controlled by jumps or conditional branches. So it is necessary to consume not only all the preceding jump entries in the trace, but also exactly the right number of conditional branch bits before delivering the interrupt. For this reason, we flush the partially accumulated taken/fall-through vector into the log buffer just before recording the trace entries for an interrupt. This allows a perfect reconstruction of where to deliver an interrupt.

The second issue concerns merging traces from multiple processors. This requires special care because the time stamps within the entries come from four different cycle clocks (oscillators) on a four-processor system. These clocks are not synchronized with each other, and we observe drift of up to 100 parts per million (100 microseconds per second) within logs. All we really know about the clocks is that the first time stamp in chunk N was created after the first time stamp in chunk N-1 and before the first one in chunk N+1. By carefully applying running inequalities between hundreds of chunks, we can map the drifting time bases into a single absolute time from the beginning of the log, within a window about two microseconds wide. We use this derived absolute time base to interleave the instructions in the reconstructed merged i-stream. The effect is that an instruction on one processor that stores to a shared variable is close (within dozens of instructions at a 4x tracing slowdown) in the final i-stream to an instruction on another processor that reads the shared variable and in fact sees the new value. Although we cannot guarantee that the store precedes the load in the final i-stream, we find this close enough for understanding multiprocessor dynamics, including cache interactions.
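The paper does not spell out the inequality-propagation algorithm, so the following is only a simplified illustration of the idea: treat the chunk allocation order as a crude global timeline and fit one linear map (offset plus drift) per processor to it, which is enough to absorb oscillator drift of the magnitude reported above. All names and the least-squares approach are assumptions of this sketch.

    #include <stdint.h>

    #define NCPUS 4

    typedef struct {
        int    cpu;
        double local_t0;    /* first time stamp, that CPU's clock */
        double global_t0;   /* derived absolute time (output)     */
    } chunk_t;

    /* chunks[] is in allocation order: chunk i was started after
     * chunk i-1 and before chunk i+1, whichever CPUs they belong to. */
    void fit_time_bases(chunk_t *c, int n, double usec_per_chunk,
                        double a[NCPUS], double b[NCPUS])
    {
        double sx[NCPUS] = {0}, sy[NCPUS] = {0};
        double sxx[NCPUS] = {0}, sxy[NCPUS] = {0};
        int    m[NCPUS] = {0};

        for (int i = 0; i < n; i++) {
            double x = c[i].local_t0;         /* local clock reading */
            double y = i * usec_per_chunk;    /* crude global time   */
            int    p = c[i].cpu;
            sx[p] += x; sy[p] += y; sxx[p] += x * x; sxy[p] += x * y;
            m[p]++;
        }
        for (int p = 0; p < NCPUS; p++) {     /* global = a + b*local */
            if (m[p] < 2) { a[p] = 0.0; b[p] = 1.0; continue; }
            double d = m[p] * sxx[p] - sx[p] * sx[p];
            b[p] = (m[p] * sxy[p] - sx[p] * sy[p]) / d;
            a[p] = (sy[p] - b[p] * sx[p]) / m[p];
        }
        for (int i = 0; i < n; i++)
            c[i].global_t0 = a[c[i].cpu] + b[c[i].cpu] * c[i].local_t0;
    }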
2.8 I-stream Distortion

Because the patches introduce extra instructions and extra memory references on an instrumented system, there is necessarily some distortion in the timing of the reconstructed i-stream. There is little distortion of the i-stream itself, since the reconstruction excludes the patches. The paths through subroutines and the sequence of subroutine calls are the same in the reconstructed i-stream as they would have been in an uninstrumented system. The reconstructed and uninstrumented i-streams differ in four subtle ways, however.

First, the patched images are bigger than the originals, so shared-library images loaded into consecutive memory locations end up at somewhat different addresses. This can have a slight effect on instruction cache (i-cache) simulations for large caches based on the traced instruction addresses. Patching expands images by multiples of 64 kilobytes, so the larger images have no effect on small cache simulations; the low 16 bits of patched and original i-stream addresses are identical. There can be a similar slight effect on data cache (d-cache) and translation buffer simulations, as patches can also cause data to be moved.

Second, the patched images can take i-stream page faults on the pages containing the patches. While the patches themselves never show up in the reconstructed i-stream, the extra page fault traps and processing do. This appears to be only a slight distortion.

Third, timer interrupts happen four times more frequently in the traced code than in the uninstrumented code, due to the 4x tracing slowdown. In Windows NT 3.5, this increases the total number of instructions executed by about 1 percent. We have chosen to ignore this in our studies, but one could mechanically remove three of every four timer interrupts to further reduce the distortion.

Finally, external interrupts (disk, network, etc.) happen approximately four times sooner in the traced code than in the uninstrumented code (again due to the 4x tracing slowdown). To the extent that they are a consequence of the workload being traced, there are not more interrupts per million instructions executed; each one just occurs relatively sooner in the i-stream. Running an interrupt routine sooner than it would have run in the uninstrumented code can have a subtle effect on the memory access patterns of a processor, but there appears to be no interesting distortion when aggregated over several seconds.

3 Pin Bandwidth Study

Our first study using PatchWrx traces is a comparison of processor pin bandwidth requirements for a few different applications running under Windows NT on two Alpha processors. As shown in Figure 2, we are interested in the bandwidth between the processor chip (with on-chip i- and d-caches) and the board-level cache. This memory traffic is due to i- and d-cache misses. The question is: how many bytes per second need to cross the pins of a processor chip when running a given workload at full speed? The answer is a function of the workload, including user and system activity, the properties of the processor chip, the clock speed, and the desired clock cycles per instruction (CPI). For our study, "full speed" is 1 CPI. The answers are interesting because, as we'll see, processor pin bandwidth is a first-order bottleneck for important applications.

Figure 2: Processor pin bandwidth.

3.1 Configuration

We looked at four applications:

1. Microsoft SQL Server, October 1994 beta version, a commercial database server, running the TPC-B [Gra91] benchmark. The trace contains 64.5 million instructions, spanning several hundred transactions.

2. GEM compiler [B+92] back-end, from Digital's commercially available optimizing C compiler, compiling a 3000-line C program. The trace contains 29 million instructions.

3. tomcatv, from the SPEC92 floating point benchmarks [SPE]. The trace contains 47 million instructions.

4. ear, also from the SPEC92 benchmarks. The trace contains 83.84 million instructions.

We chose these applications because they give a variety of data points, from heavy memory system usage to light usage. At the time we started the experiments, the SPEC95 benchmarks were not yet available. It would be interesting to do the same experiments for the gcc SPEC95 benchmark, to see how it compares to the GEM results.

We took traces of these applications running on a uniprocessor Windows NT 3.5 system. The traces varied as to how much load/store patching was done. The reasons for not just doing full load/store patching relate to current PatchWrx limitations on image sizes and the slowdowns introduced by load/store patching. As shown in Table 2, none of the uniprocessor traces had loads and stores patched in the operating system. For the SQL Server and ear uniprocessor traces no loads or stores were patched in the application images. For the GEM compiler and tomcatv traces, load and store instructions were patched in the application images. None of the i-stream references in these traces are synthetic, but the d-stream synthetic references vary from 8% to 73%.

    Application   #instr.   System Ld/st   Appl. Ld/st   Synthetic
                            patched        patched       data addr.
    SQL Server    64M       n              n             73%
    GEM comp.     29M       n              Y             8%
    Tomcatv       47M       n              Y             24%
    Ear           83M       n              n             14%

Table 2: Trace characteristics.

We used these traces to simulate the caching behavior of different processor chips under the assumption that the trace executes at 1 CPI. The simulation computes the number of bytes that would need to cross the pins to the board-level cache. These numbers were broken down by the source of the traffic: instruction reference misses, and data reference misses of a variety of kinds.

We looked at the behavior of each application on two different Alpha processors: the Alpha 21064 at 200 MHz, and the Alpha 21164 at 400 MHz.

The Alpha 21064 runs at 200 MHz and has 8 kilobytes each of direct-mapped write-through i-cache and d-cache with a 32-byte cache line size. Each instruction cache miss and data read (load) miss causes 32 bytes to cross the pins. Writes go into a group of four write buffers. A write to a different address causes 32 bytes to cross the pins to flush one of the dirty write buffers. The actual pin bandwidth on this processor is realistically about 300 MB/sec. The maximum bandwidth is about 600 MB/sec.

The Alpha 21164 runs at 400 MHz and has a single 96-kilobyte 3-way associative write-back combined i-cache and d-cache with a 64-byte cache line size. Each i-cache miss and load miss causes 64 bytes to cross the pins. This processor has a write-back cache: writes that hit in the cache do not cause any bytes to cross the pins but do make a cache line dirty. Writes and reads that cause a dirty cache line to be flushed ("victim writes") cause 16-64 bytes to cross the pins, depending on what portion of the line has been modified; dirty bits are kept for each 16-byte subblock. Writes that miss also cause a cache line to be allocated and filled from memory, so they cause 64 bytes to cross the pins; some of these reads will be unnecessary overhead if all 64 bytes are eventually overwritten ("redundant reads" as compared to "useful reads"). The realistic pin bandwidth on this machine is about 750 MB/sec, and the maximum bandwidth is about 1.6 GB/sec.
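To make the units concrete, the small program below redoes the simulation's core arithmetic for one component of the traffic, using numbers taken from the results in Section 3.2.1 (the hedged assumption here is only the packaging as code; the figures themselves are the paper's). At 1 CPI a 200 MHz Alpha fetches 800 MB/sec of instruction bytes; if three of every four instruction bytes miss on-chip, as reported for SQL Server on the 21064, i-miss traffic alone is 600 MB/sec, twice the roughly 300 MB/sec the pins realistically deliver.

    #include <stdio.h>

    int main(void)
    {
        double clock_hz        = 200e6;  /* Alpha 21064              */
        double cpi             = 1.0;    /* target "full speed"      */
        double bytes_per_instr = 4.0;    /* fixed Alpha instr. size  */
        double imiss_fraction  = 0.75;   /* SQL Server: 3 of 4 bytes */
        double realistic_pins  = 300e6;  /* bytes/sec                */

        double instr_per_sec = clock_hz / cpi;
        double imiss_bw = instr_per_sec * bytes_per_instr * imiss_fraction;

        printf("i-miss traffic: %.0f MB/s (pins overcommitted %.1fx)\n",
               imiss_bw / 1e6, imiss_bw / realistic_pins);
        return 0;
    }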
3.2 Results

For each of the applications on each of the processors we looked at a plot of the processor pin bandwidth requirements over time. This shows the dynamics of the traffic in addition to giving us upper bounds on the requirements.

3.2.1 SQL Server

Figure 3 shows the simulation results for pin bandwidth requirements on the Alpha 21064 and 21164 processors for the SQL Server workload. The x-axis marks time in chunks of 0.5 million instructions, and the y-axis shows the number of gigabytes per second of traffic across the pins for a given instruction trace.[2]

[2] The bandwidth pictures show only the first 25 million instructions of each trace; very little changes in any of the traces beyond that.

Figure 3: Pin bandwidth requirements of SQL Server running on Alpha 21064 and 21164 processors.

The left-hand graph shows the 21064 results. The traffic is broken down into i-cache misses (imiss), d-cache read misses (dmiss), and write buffer flushes (dwrite). At 1 CPI, the pins are overcommitted by a factor of two just on instruction cache misses alone. Three bytes of instruction miss in the cache for every four bytes executed. The required bandwidth for all traffic is about 1.2 GB/sec, or about four times the bandwidth that the processor realistically can deliver. In fact, the uninstrumented workload achieves about 4.3 CPI, giving us confidence that the cache simulations are valid. The simulation is intended to give lower bounds on CPIs based on pin bandwidth requirements; the extra 0.3 CPI in practice is due to non-zero latencies that are not included in the simulation.

The right-hand graph shows the 21164 results. Here we split up the d-cache write traffic into victim writes (dvict), and useful and redundant reads triggered by write misses (dread). The required pin bandwidth here is as high as 1.7 GB/sec, so the pins are overcommitted by more than a factor of 2 at 1 CPI. The pins are still often overcommitted just on i-cache misses alone, although the i-cache miss traffic periodically dips down below the 750 MB/sec that the processor typically delivers.
3.2.2 Compiler Back-end

The second trace for which we simulated the pin traffic is from the GEM compiler run. The results are shown in Figure 4. The left-hand graph shows that the required pin bandwidth on the Alpha 21064 is about 600 MB/sec. This is about half as much as the SQL Server workload on this machine, with a somewhat lower ratio of i-cache to d-cache traffic. The pins are still overcommitted by a factor of two according to the simulation. The uninstrumented workload on this machine runs at about 2.5 CPI.

The right-hand graph shows the results for this workload on the Alpha 21164 400 MHz processor. Note that the i-stream here mostly fits into the 96-kilobyte cache, with about 200 MB per second of traffic left over. The total pin bandwidth requirements are generally under about 500 MB/sec, with occasional spikes over 800 MB/sec. Thus, pin bandwidth is mostly not a bottleneck for GEM on the 21164.

Figure 4: Pin bandwidth requirements of GEM compiler running on Alpha 21064 and 21164 processors.

3.2.3 SPEC Benchmarks

For comparison with the commercial workloads, we chose two of the SPEC92 floating point benchmarks to examine.

Figure 5 shows the simulation results for the tomcatv trace. On the 21064, the i-stream essentially fits into the 8-kilobyte i-cache. The program repeatedly goes through phases where the pins are overcommitted by factors of 1.5 to 3 or more, and then are okay. The actual program runs at between 1 and 1.5 CPI on this machine. On the 21164, more of the workload falls within the bandwidth limitations, with the periodic lower plateaus well within the capacity of the processor pins, but there are still periodic benchmark array access bursts that are factors of two or more higher than the available bandwidth. Note that the instruction stream mostly fits in the cache, and so doesn't contribute much to the bandwidth requirements. Also, the demands for data bandwidth fall into a predictable pattern. Perhaps prefetching during times when bandwidth is underutilized could help for this workload.

Figure 5: Pin bandwidth requirements of tomcatv running on Alpha 21064 and 21164 processors.

Figure 6 shows the simulation results for the ear trace running on the same two processors. We see that even on the 21064, the pin bandwidth is essentially sufficient, with required bandwidth about 200 MB/sec. The actual workload runs at 1 CPI or less (because of the dual instruction issue capabilities of the processor). The requirements for ear on the 21164 400 MHz processor are well within the range of the processor. A large part of the workload fits in the caches in this case.

Figure 6: Pin bandwidth requirements of ear running on Alpha 21064 and 21164 processors.
3.3 Memory Maps

Looking at the patterns of memory accesses over time helps illuminate the results of the pin bandwidth simulations shown above. Figures 7-11 show the memory footprints for the first 25 million instructions of each trace.[3] The x-axis shows time from left to right, in millions of instructions executed. The y-axis shows virtual addresses accessed, modulo 4 MB (chosen because it is an interesting board cache size). There is a dot in the picture for each instruction address and for each load/store data address.

[3] Except for the ear trace, which has about 21 million instructions shown in this version.

Figure 7 shows the memory footprint for the user-only part of the tomcatv benchmark. The bottom solid line shows the instruction fetches for the benchmark, all in a tight loop on a single page of code. The leftmost group of six upward-sloping lines are data references sweeping forward through six half-megabyte arrays. These are followed by some faster forward sweeps through different arrays, four backward sweeps, and then the data pattern repeats. This program has very predictable branching patterns and very predictable data accesses.

Figure 7: tomcatv memory access plot showing the distribution of user-space addresses over time. Looking at any vertical slice we see all the addresses (mod 4 MB) that were accessed within that time slice.

In Figure 8, we have added the operating system part of the trace. The periodic vertical tics about one sixth of the way up from the bottom of the picture are the i-stream for the timer interrupt routine (about four times too often because of the tracing slowdown). The other dots that correlate with these are the data references made in the timer routine. The occasional vertical lines of high-density references are other operating-system activity, including the thread scheduler and network traffic. This operating system code has less predictable branching patterns and data accesses.

Figure 8: tomcatv memory accesses, user and system.

Figure 9 shows, on the same scale, a trace of both system and user addresses for the ear benchmark. The memory usage is relatively light, with few strong patterns other than the horizontal lines due to timer interrupts and some accesses to application data structures, and the vertical lines due to operating system activity.

Figure 9: ear memory accesses, user and system.

Figure 10 shows, on the same scale, a trace of the GEM compiler benchmark. The horizontal lines represent frequently used subroutines and frequently used data. The visually dominant diagonal line appears to be the buffer of program text being compiled; the compiler is sweeping forward through it (modulo 4 MB). The timer interrupts are still there, but a bit harder to see. Most of the rest of the dots are data references to linked-list data structures used by the operating system. This workload has less predictable memory access patterns than tomcatv.

Figure 10: GEM memory accesses, user and system.

Finally, Figure 11 shows, on the same scale, a trace of the SQL Server benchmark. The trace looks somewhat similar to white noise: there is very little correlation between addresses. This application falsifies the premise of a cache, that accessing a location is a good predictor that the same or a nearby location will be accessed in the near future.

Figure 11: SQL Server memory accesses, user and system.

These memory maps highlight the differences between the memory requirements of the workloads. There seems to be at least some rough correlation between the density of the memory maps and the pin bandwidth requirements.

3.4 Cache Demand in More Detail

To understand the processor bandwidth results better, we looked at the incremental cache hit curves for cache sizes from 1KB to 16MB. Except for the smallest cache, the incremental hit rate for a given size cache is the number of additional hits that occur in the given cache and miss in smaller caches. Portions of the incremental hit curve for sizes bigger than the processor's on-chip caches represent misses that produce the off-chip bandwidth demand.

In this section, we used direct-mapped combined i- and d-stream caches and a line size of 64 bytes, and ran the simulations over the initial 25M instructions of each trace.
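The definition above reduces to a simple difference of cumulative hit counts; the helper below makes that explicit. The array layout and size list are illustrative assumptions; hits[] would come from one cache simulation per size over the same trace.

    #define NSIZES 15   /* 1KB, 2KB, ..., 16MB (powers of two) */

    /* hits[i]: hits/1000 instructions in a direct-mapped cache of
     * size 1KB << i.  inc[i]: the extra hits that size i provides
     * over the next smaller size; for i above the on-chip cache
     * size, these are exactly the off-chip bandwidth demand. */
    void incremental_hits(const double hits[NSIZES], double inc[NSIZES])
    {
        inc[0] = hits[0];                    /* smallest cache: all hits */
        for (int i = 1; i < NSIZES; i++)
            inc[i] = hits[i] - hits[i - 1];
    }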
From the bandwidth charts above, we see a marked difference in i-stream miss rates across the uniprocessor workloads. Figure 12 shows the incremental i-cache hit curves for three of them (we exclude ear from this section's discussion because its memory demands are uninteresting). For all three curves, the 1KB points are off the chart at 845-957 hits/1000 instructions. This is to be expected: if straight-line code is executed with no repetition from 64-byte (16-instruction) cache lines, we would expect 15/16 hits, or 937 hits/1000 instructions. The short vertical bars mark the 8K and 96K cache sizes.

Figure 12: Incremental I-cache Hit Curves.

The tomcatv i-stream fits almost entirely in 8KB, with a little tail that hits only in main memory. The GEM compiler i-stream extends out to 32KB before it drops off, and the SQL i-stream has significant incremental hits all the way out to a 1MB cache. This large footprint is why the SQL i-stream puts such a bandwidth load on the processor pins. This result is roughly consistent with the i-cache miss rates measured by [MDO94] for TPC-A and TPC-C (although they had a larger percentage of time in operating system code than we observed). The GEM footprint produces a large bandwidth load with an 8KB on-chip cache but a small load with a 96KB cache. The tomcatv footprint presents very little off-chip bandwidth load.

Figure 13 shows the incremental d-cache hit curves. All three d-streams show a tail of references that miss even in 16MB and go all the way to main memory. The tomcatv curve has peaks at 8KB (76 misses per 1000 instructions) and 4MB that correspond to the sizes of frequently accessed arrays in that program. The 4MB peak is why the tomcatv d-stream puts a large periodic bandwidth load on the processor pins.

Figure 13: Incremental D-cache Hit Curves.

In Figure 14, we have broken down the SQL incremental i-cache hits by system and user code. The system hits are per 1000 system instructions, and the user hits per 1000 user instructions. The SQL system code is about 25% of the total i-stream. The system code not only has a bigger cache footprint, it has a larger tail of hits in main memory.

Figure 14: SQL User vs. System I-stream.

The SQL d-stream breakdown, Figure 15, shows system vs. user and also real vs. synthetic d-stream addresses. The d-stream has a high proportion of synthetic addresses, so it is not very precise. The major difference between the real and synthetic addresses is a larger proportion of synthetic addresses that miss all the way out to main memory. For both the real and synthetic d-stream addresses, the system code has a larger footprint than the user code.

Figure 15: SQL User vs. System D-stream.

Both the GEM compiler and tomcatv i-stream breakdowns (not shown here) reveal the system i-stream to have a much larger footprint than the user i-stream. Similarly, the d-stream breakdowns show the system d-stream to have a much larger footprint than the user one (although tomcatv is close, due to the large array accesses).

In addition to observing larger footprints in system and SQL Server code, the incremental cache hit curve allows us to compute the relationship between the CPI of a workload and the performance of various levels of the memory hierarchy. If we multiply each point of the incremental cache hit curves by the time it takes to service that hit in a real memory subsystem, we get the total time spent waiting on memory accesses for that workload.

Using realistic times for early Alpha computer systems, the SQL i-stream and d-stream hits beyond 96KB each total about 1.8 CPI. Combined with a 100%-hits execution rate of about 1 CPI, these total about 4.6 CPI. This is roughly consistent with the measurements in [CB94] of 4.3 CPI, 20-25% operating system time, and 39%+36% i-stream plus d-stream stall time, on the TP1 database workload on an older AlphaSystem 7000 computer.
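In code form, the accounting just described is a weighted sum; the function below is a sketch of it (the parameter names and the idea of passing per-level service times as an array are ours, not the paper's). With the SQL numbers above, the i- and d-stream terms beyond 96KB contribute about 1.8 CPI each, so base CPI of 1 plus 1.8 plus 1.8 gives roughly 4.6 CPI.

    /* CPI of a workload from its incremental hit curves: base CPI
     * (all hits on-chip) plus, for each memory level, the hits per
     * instruction served at that level times that level's service
     * time in cycles.  Latencies are placeholders, not measured
     * Alpha values. */
    double workload_cpi(const double inc_hits_per_instr[],
                        const double level_cycles[],
                        int nlevels, double base_cpi)
    {
        double cpi = base_cpi;
        for (int i = 0; i < nlevels; i++)
            cpi += inc_hits_per_instr[i] * level_cycles[i];
        return cpi;
    }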
            21064   21164
    SQL     3.13    1.87
    GEM     1.20    0.28
    tom     0.04    0.06
    ear     0.05    0.02

Table 3: Average bytes per instruction of pin traffic due to the i-streams of each of the benchmarks on the two processors studied.

3.5 Discussion

We conclude from these experiments that processor pin bandwidth is a bottleneck for some applications. SQL Server needs about two to four times the bandwidth of tomcatv, and is limited to about 4 CPI on at least one current processor. According to another experiment that we ran, a future processor similar to the Alpha 21164 but running at 500 MHz would need at least 2 GB/sec of pin bandwidth to run the SQL Server benchmark at 1 CPI.

Both SQL Server and GEM have higher pin bandwidth requirements due to i-stream traffic than either of the SPEC benchmarks, as summarized in Table 3. This i-stream traffic contributes significantly to the pin bandwidth bottleneck for SQL Server. Therefore, chip designers should be looking at more than SPEC benchmarks if they want their future chips to run commercial workloads well.

Our experiments also suggest that it is increasingly important for algorithm designers to pay attention to memory structure and cache parameters if they want their code to run fast. One example of such an effort is the AlphaSort work [NBC+94], where careful attention to these memory details paid off handsomely.

4 Multiprocessor Study

Our second study examines the locking behavior of SQL Server running the TPC-B benchmark on a four-processor AlphaServer 2100 symmetric multiprocessor. The traces were taken for SQL Server version 6.00.85a (March 1995 beta release), running Windows NT 3.5, with 12 clients running transactions against the database. We chose this number of clients to minimize the idle time of the system, which ended up being about 2-3%.

The trace contains about 69 million instructions. In contrast to the uniprocessor traces, operating system loads and stores are patched. For the application code, only loads and stores to lock variables are patched.

We use the trace in two ways. First, we look at a picture showing the timeline of software lock acquisition and release activity. Then, we simulate the cache coherence protocol employed on the machine to understand how the locking activity affects communication among the processors.

4.1 Software Lock Activity

Figure 16 shows the timeline of locking activity for a 20-millisecond portion of the SQL Server trace, encompassing portions of several transactions on the four processors. The picture shows groups of four lines (with backgrounds alternately shaded white and light gray), one line per processor, with shaded bars representing the elapsed times that locks are held. Time runs from left to right and top to bottom. Each group of four horizontal lines represents 2 milliseconds of elapsed time.

Figure 16: SQL Server lock activity. Time runs left-to-right and top-to-bottom.

In this picture, we distinguish three cases: the thick black bars are the operating system's dispatcher lock (KiDispatcherLock), the thick gray bars are all other lock holding times, and the thin black lines are times spent spinning while waiting to acquire a lock.[4]

[4] We usually look at this picture in color, with different colors and line widths for different locks, making more of the locking behavior apparent at once.

In the rectangle at 0.030 seconds, for example, processor 0 starts out holding no locks, processor 1 holds the dispatcher lock, processor 2 is spinning on the dispatcher lock, and processor 3 holds no locks. Processor 1 then releases the dispatcher lock and processor 2 acquires it. Four other locks are used, then processors 1 and 2 use the dispatcher lock again. Finally, processor 1 uses another lock and processor 0 uses a lock.

Lock acquisitions and releases are not reported directly in the trace, but rather must be detected by pattern matching against the common sequences of instructions used for this purpose. Therefore, the picture showing lock acquisition and release points is not perfect. However, we have taken some care to ensure that the picture is reasonably self-consistent, and consistent with the code.
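The paper does not list the instruction sequences it matches. As a hedged illustration of the idea: Alpha mutual exclusion is built on load-locked/store-conditional pairs (LDL_L/STL_C), so a simplified matcher can flag a successful acquire when a store-conditional to the same address as a pending load-locked appears on the same processor. The decoding, structures, and report_acquire callback below are all schematic assumptions; a real matcher would check the exact kernel and SQL Server code sequences.

    #include <stdint.h>

    typedef enum { OP_OTHER, OP_LDL_L, OP_STL_C } op_t;

    typedef struct { op_t op; uint64_t ea; uint64_t pc; int cpu; } instr_t;

    static uint64_t pending_ll[4];  /* per-CPU load-locked address */
    static int      have_ll[4];

    extern void report_acquire(int cpu, uint64_t lockaddr, uint64_t pc);

    void scan(const instr_t *in, long n)
    {
        for (long i = 0; i < n; i++) {
            int p = in[i].cpu;
            if (in[i].op == OP_LDL_L) {
                pending_ll[p] = in[i].ea;
                have_ll[p] = 1;
            } else if (in[i].op == OP_STL_C && have_ll[p] &&
                       in[i].ea == pending_ll[p]) {
                /* Candidate lock acquire at this trace position. */
                report_acquire(p, in[i].ea, in[i].pc);
                have_ll[p] = 0;
            }
        }
    }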
While this picture shows just a small slice of the trace, it reveals several interesting behaviors. The dispatcher lock is held for relatively long stretches of time (200-900 instructions). Also, it is held for a large percentage of the time: about 45% of the total elapsed time. This suggests that the workload could not scale up beyond eight processors, which would cause this lock to be held nearly 100% of the time.

More than 16% of the time is spent spinning while waiting for a lock, usually the dispatcher lock. This spin time would go up substantially as the lock-held time approaches 100%. Note the convoy effect in the second rectangle, at about 0.041 seconds: processors line up spinning for the dispatcher lock.

During the long stretch where processor 0 holds the dispatcher lock, a disk interrupt arrives and the processor services it while still holding the lock. Other processors are waiting to get the lock during that time.

All of the evidence above suggests that the dispatcher lock is a bottleneck. Because of the convoy effect, it would be nearly impossible to get beyond about six processors' worth of work from this code.

The picture reveals that in addition to being held for a long time overall, the dispatcher lock is held frequently. This is because the SQL Server user code goes through the dispatcher lock to block when one of its own locks is unavailable. This type of lock usage was perhaps not anticipated by the Windows NT implementers. Showing the picture to the Windows NT kernel and SQL Server implementers at the same time brought out assumptions that each group was making about the other's code.

The full-color version of this picture makes other locking patterns visible. For example, there is a repeated pattern of nine lock acquisitions (some of which are the same lock repeatedly acquired and released) in the same order when manipulating the operating system page tables. Examining the code may suggest ways to make the associated operation more efficient.

4.2 Interference from Cache Coherence

The software lock activity above, used for achieving mutual exclusion, is best tuned by changing the software. We also looked at multiprocessor hardware cache interference caused by writes on one processor affecting caches on other processors.

We looked in detail at the behavior of the cache coherence protocol with four AlphaServer 2100 processors. A pure invalidate protocol is used: a write by one processor to a cache line invalidates all other cached copies of the line in other processors. We found that 40% of the board-level cache misses in the 4-processor SQL trace were due to interference from other processors, that is, invalidates due to writes of shared variables.

Most of these misses were in fact to about 24 cache lines containing software locks. (Note that the time lost on servicing these misses is smaller than the time lost above spinning on held locks.) Some of the invalidations came from false sharing. The programmers who wrote the code assumed 32-byte cache lines and took care to put locks in different 32-byte lines than other variables. However, this particular machine has 64-byte lines, so a lock and another unrelated shared variable could end up in the same line and cause unnecessary invalidations. The ability to track shared sub-blocks in the hardware would help this problem. A better cache coherence protocol for this workload would be a hardware design that writes through for shared blocks, having recipients update or invalidate depending on how recently they had touched the line. Such a design was used in the Alpha Demonstration Unit [TCS93].
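The false-sharing hazard described above can be shown in miniature; the layouts below are our own illustration (the field names are invented). In the first struct, the lock and an unrelated hot variable occupy different 32-byte blocks, as the original programmers intended, but share one 64-byte line, so every write to counter invalidates the lock's line in other processors' caches. Padding and aligning to the actual 64-byte line size keeps them apart.

    #include <stdint.h>

    struct bad_layout {                /* safe for 32-byte lines only */
        uint32_t lock;                 /* bytes 0-3                   */
        char     pad[28];
        uint32_t counter;              /* byte 32: a different 32-byte
                                          block, but the same 64-byte
                                          line                        */
    };

    struct good_layout {
        _Alignas(64) uint32_t lock;    /* starts its own 64-byte line */
        char     pad[60];
        uint32_t counter;              /* offset 64: the next line    */
    };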
4.3 Discussion

The Windows NT 3.5 operating system code holds the dispatcher lock for so long because it is doing a full context switch while holding the lock. It would be better to make the decision about what to dispatch next while holding the lock, release the lock, and then perform the actual context switch without delaying the other processors. Allowing interrupts while holding the dispatcher lock also lengthens the path and slows down other processors. It would be better to turn off interrupts before acquiring the lock and turn them back on after releasing the lock.

The application code that triggers too-frequent use of the dispatcher lock could adopt other locking strategies, such as trying a handful of times to acquire an application lock before blocking, or doing user-level dispatching between multiple transactions.
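The first of these alternative strategies is sketched below, in present-day C for clarity. The primitives, the try count, and the block_on_lock fallback are placeholders for whatever the application runtime provides; the point is only that the kernel block path (which is what drags SQL Server through the dispatcher lock) is paid for only after a bounded spin fails.

    #include <stdatomic.h>

    #define SPIN_TRIES 8   /* "a handful": a tuning parameter */

    extern void block_on_lock(atomic_int *lock);  /* kernel-assisted wait */

    void app_lock_acquire(atomic_int *lock)
    {
        for (int i = 0; i < SPIN_TRIES; i++) {
            int expected = 0;
            if (atomic_compare_exchange_weak(lock, &expected, 1))
                return;            /* got the lock without blocking */
        }
        block_on_lock(lock);       /* only now pay for a context
                                      switch and the dispatcher lock */
    }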
Examining the dynamic locking behavior of the system reveals properties that are more difficult or impossible to see from statistical information. We can see clearly the convoy effect for locks that are in high demand, and we can see patterns of locking that may suggest inefficiencies in the code. We can also identify cases where locks are probably not a performance problem, as is true for most of the locks in our trace.

Having the traces available for simulating cache coherence protocols helped us to understand the causes of scaling problems in our multiprocessor system, and can help the designers of future systems to avoid these problems.

5 Conclusions

These studies have demonstrated that full, dynamic execution traces are useful for designing future systems, as well as for diagnosing difficult performance problems of current systems.

Obtaining such traces is practical using PatchWrx. The overhead is low enough on the uniprocessor systems (about a factor of 2 slowdown) that we always ran the systems with instrumentation in place. Overhead on multiprocessor systems was reasonable enough to get useful traces. PatchWrx traces are in the range of about 50-100 million instructions, which was enough to observe interesting dynamic behavior in the system while still being practical to collect.

The specific versions of software that we studied are inevitably already out of date with respect to currently available versions, and some of the specific problems revealed in our studies may have been fixed by now. In fact, the Microsoft SQL Server 95 and Windows NT 3.51 and 4.0 products contain performance improvements[5] that resulted from our studies [Gra95]. However, the same techniques and tools can be used to look for performance problems in the new versions, and the traces still provide valuable input for testing future chip designs.

[5] Improvements include: some collapsing of the 36 memory allocation calls per TPC-B transaction, each of which calls EnterCriticalSection; replacing a slow locking sequence macro expansion with a more efficient one, worth 5-8% all by itself; removing unnecessary calls to floating point initialization routines; avoiding calls to OS services for CPU time and clock time; compiler improvements in the setjmp/longjmp implementation; minimizing context switches; and increasing cache locality.

We find that real pieces of large software are bigger than any practical on-chip caches, and therefore use much more processor pin bandwidth. Pin bandwidth can be a bottleneck to running commercial workloads at the desired speed. For the near future at least, computer design is memory design, and software design is memory design. Pin bandwidth is an important consideration (as much as memory latency) because it sets a lower bound on the number of clock cycles per instruction needed to execute a workload.

Although this paper emphasizes the effect of bandwidth on the performance of these processors, one shouldn't ignore the effect of latency. If there is insufficient bandwidth to perform a given workload, the workload will run proportionally slowly, and there is nothing software can do to compensate for this. Systems that have sufficient bandwidth but high memory-access latency can possibly perform well, but only for software that is designed to hide the high latency. Both high bandwidth and low latency are necessary for peak performance on most commercial software.

6 Acknowledgments

A number of people at Digital have contributed to this effort. Benn Schreiber, Tom van Baak, Miche Baker-Harvey, and Joe Notarangelo gave us immeasurable help in getting the NT PAL-code modifications running. Rich Witek wrote the original single-processor boot PAL-code. Mike Burrows contributed expertise on binary modification techniques. Wim Colgate provided the key insight into booting NT multiprocessor systems. Dave Hunter generously provided time on a four-processor AlphaServer 2100 system. Chuck Thacker, Dave Conroy, Bill Weihl, Greg Nelson, Amitabh Srivastava, and Alan Eustace provided continuing encouragement. Our summer interns, Cliff Mercer and David Martin, struggled with early software and each provided key insights for increasing performance of the workloads studied. David Martin was responsible for the cache coherence protocol simulations. We also thank Jim Gray and David Cutler at Microsoft.

The observations of Richard Draves and the anonymous OSDI referees substantially improved the paper.

References

[AH90] Anant Agarwal and M. Huffman. Blocking: Exploiting spatial locality for trace compaction. Performance Evaluation Review, 18(1):48-57, May 1990.

[ASH86] Anant Agarwal, Richard L. Sites, and Mark Horowitz. ATUM: A new technique for capturing address traces using microcode. In Proc. of the 13th International Symposium on Computer Architecture, pages 119-127, June 1986.

[B+92] David S. Blickstein et al. The GEM optimizing compiler system. Digital Technical Journal, 4(4), 1992. Also available as https://ftp.digital.com/pub/Digital/DECinfo/DTJ/axp-gem.ps.

[BKW90] Anita K. Borg, R. E. Kessler, and David W. Wall. Generation and analysis of very long address traces. In Proc. of the 17th Annual Symposium on Computer Architecture, pages 270-279, May 1990.

[CB93] Brad J. Chen and B. Bershad. The impact of operating system structure on memory system performance. In Proc. of the 14th International Symposium on Operating Systems Principles, pages 120-133, December 1993.

[CB94] Zarka Cvetanovic and Dileep Bhandarkar. Characterization of Alpha AXP performance using TP and SPEC workloads. In Proc. of the 21st Annual Symposium on Computer Architecture, pages 60-70, April 1994.

[CHRG95] John Chapin, Stephen A. Herrod, Mendel Rosenblum, and Anoop Gupta. Memory system performance of UNIX on CC-NUMA multiprocessors. In ACM SIGMETRICS, pages 1-13, May 1995.

[CK94] Robert F. Cmelik and David Keppel. Shade: A fast instruction-set simulator for execution profiling. In ACM SIGMETRICS, pages 128-137, May 1994.

[EKKL90] Susan J. Eggers, David Keppel, Eric J. Koldinger, and Henry M. Levy. Techniques for efficient inline tracing on a shared-memory multiprocessor. In ACM SIGMETRICS, pages 37-47, May 1990.

[Gra91] Jim Gray, editor. The Benchmark Handbook, chapter 2, pages 79-117. Morgan Kaufmann, San Mateo, California, 1991.

[Gra95] Jim Gray, December 1995. Private communication.
[LB94] James R. Larus and Thomas Ball. Rewriting executable files to measure program behavior. Software - Practice and Experience, 24(2):197-218, February 1994.

[MDO94] Ann Marie Grizzaffi Maynard, Colette M. Donnelly, and Bret R. Olszewski. Contrasting characteristics and cache performance of technical and multi-user commercial workloads. In Proc. of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 145-155, October 1994.

[NBC+94] Chris Nyberg, Tom Barclay, Zarka Cvetanovic, Jim Gray, and David B. Lomet. AlphaSort: A RISC machine sort. In SIGMOD Conference, pages 233-242, 1994.

[SCK+93] Richard L. Sites, Anton Chernoff, Matthew B. Kirk, Maurice P. Marks, and Scott G. Robinson. Binary translation. Communications of the ACM, 36(2):69-81, February 1993.

[SE94] Amitabh Srivastava and Alan Eustace. ATOM: A system for building customized program analysis tools. In Proc. of the ACM Conference on Programming Language Design and Implementation, pages 196-205, 1994.

[SJF92] Craig B. Stunkel, Bob Janssens, and W. Kent Fuchs. Address tracing of parallel systems via TRAPEDS. Microprocessors and Microsystems, 16(5):249-261, 1992.

[SPE] SPEC CPU92 benchmarks. World Wide Web URL: https://www.specbench.org/osg/cpu92/.

[SW95] Richard L. Sites and Richard T. Witek. Alpha AXP Architecture Reference Manual. Digital Press, Newton, MA, 2nd edition, 1995.

[TCS93] Charles P. Thacker, David G. Conroy, and Lawrence C. Stewart. The Alpha Demonstration Unit: A high-performance multiprocessor. Communications of the ACM, 36(2):55-67, February 1993.

[TGH92] J. Torrellas, A. Gupta, and John Hennessy. Characterizing the cache performance and synchronization behavior of a multiprocessor operating system. In Proc. of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 162-174, October 1992.