The following paper was originally published in the Proceedings of the USENIX 2nd Symposium on Operating Systems Design and Implementation (OSDI '96), Seattle, Washington, October 28-31, 1996.

For more information about the USENIX Association contact:
1. Phone: 510 528-8649
2. FAX: 510 548-5738
3. Email: office@usenix.org
4. WWW URL: https://www.usenix.org

Studies of Windows NT Performance using Dynamic Execution Traces

Sharon E. Perl and Richard L. Sites
Digital Systems Research Center
Palo Alto, CA
{perl,sites}@pa.dec.com

Abstract

We studied two aspects of the performance of Windows NT(tm): processor bandwidth requirements for memory accesses in a uniprocessor system running commercial and benchmark applications, and locking behavior of a commercial database on a small-scale multiprocessor. Our studies are based on full dynamic execution traces of the systems, which include all instructions executed by the operating system and applications over periods of a few seconds (enough time to allow for significant computation). The traces were obtained on Alpha PCs, using a new software tool called PatchWrx that takes advantage of the Alpha architecture's PAL-code layer to implement efficient, comprehensive system tracing. Because the Alpha version of Windows NT uses substantially the same code base as other versions, and therefore executes nearly the same sequence of calls, basic blocks, and data structure accesses, we believe our conclusions are relevant for non-Alpha systems as well. This paper describes our performance studies and interesting aspects of PatchWrx.

We conclude from our studies that processor bandwidth can be a first-order bottleneck to achieving good performance. This is particularly apparent when studying commercial benchmarks. Operating system code and data structures contribute disproportionately to the memory access load. We also found that operating system software lock contention was a factor preventing the database benchmark from scaling up on the small multiprocessor, and that the cache coherence protocol employed by the machine introduced more cache interference than necessary.
1 Introduction

This work was triggered by two performance puzzles (circa 1995) related to Microsoft's SQL Server running on Alpha [SW95] PCs under the Windows NT operating system: how could we speed up the uniprocessor version, and how could we get closer to linear scaling for the multiprocessor version?

To answer these questions we found that we needed to look at the detailed behavior of the system under load. We created a tool for obtaining complete traces of the instruction and data streams executed by the processor in all operating system and application code. We proceeded to examine these traces and to use them to run simulations that revealed interesting properties of the then-current system. The results of the simulations also had implications for future processor design.

The first puzzle we studied is how the processor bandwidth requirements of applications (the bandwidth required to service on-chip instruction and data cache misses from off-chip caches or memory) limit the achievable execution speed of the applications. This study was motivated by a discussion with some colleagues about using a prefetching strategy to improve performance. Their studies showed that prefetching would not help in the particular situation because there wasn't enough processor-chip pin bandwidth to support the workload. When pin bandwidth is a bottleneck, some common techniques for trying to improve performance do not help. These include multiple instruction issue, code scheduling, prefetching, and improved external cache latency. Pin bandwidth puts a ceiling on how fast an application can run. We discovered that pin bandwidth is indeed a bottleneck for interesting commercial applications, as well as for data accesses in one of the SPEC benchmarks that we studied.

The second puzzle we studied is how lock contention for a multiprocessor application limits the scalability of the application. Detailed execution traces revealed patterns of locking that may not have been expected by the operating system designers, application designers, or hardware designers responsible for the cache coherence protocol. Lock contention and the related cache coherence overhead prevent the application from scaling up beyond a small number of processors.

We chose these two aspects to study because they seem to be important factors in overall workload performance.

The contributions of this work are threefold: we provide evidence of processor pin bandwidth limitations and locking problems for commercial applications; we introduce a new tool for obtaining full traces of a system that allows us to study such problems; and we are making available some of the traces upon which our studies are based for other researchers to use.[1] Also, the understanding we gained from this work led to significant improvements to the hardware and software involved, so the results presented here do not fully apply to currently-shipping hardware and software.

[1] Our uniprocessor traces are publicly available on CD-ROM to full-time faculty members. Contact one of the authors to obtain a copy.

In the next section we describe some of the more interesting aspects of the tracing tool, called PatchWrx. Section 3 describes our studies of pin bandwidth requirements for four different applications on two different Alpha processors. Section 4 describes studies of the locking behavior of one of these applications, the Microsoft SQL Server database, on a small multiprocessor. Section 5 concludes.
2 PatchWrx

To understand the traces that are the input to our performance studies it is helpful to understand the properties of PatchWrx, the tool used to produce them. In this section we give an overview of PatchWrx, and then describe some of the highlights of its design and implementation.

2.1 Overview

PatchWrx is a software-only technique for capturing full time-stamped traces of the dynamic instruction stream (i-stream) and data stream (d-stream) of operating system and user code.

The goal of the PatchWrx effort is to capture traces of every single instruction executed throughout many seconds of a real operating system running some complex workload on a non-microprogrammed multiprocessor. We wanted to build a software-only solution, rather than requiring one-of-a-kind hardware that cannot easily be applied in a customer computing environment. We wanted to gather traces with less than a factor of ten performance degradation, so that nothing in the operating system broke due to timeouts or excessive latencies. Finally, we wanted to work with arbitrary binaries for which we did not have access to the source code.

The technique we have adopted rewrites binary executable images, inserting patches that record in a log the target of every change of control flow (branch, call, return, system call, system return, interrupt, and return-from-interrupt), and some base register values for d-stream memory accesses. A reconstruction program working from the log and the binary program images reproduces the trace of the full i-stream and d-stream that was executed. Our logs are typically about 5-10 times smaller than the resulting traces.

With the entire operating system patched for just branches (not loads/stores) and logging on, everything runs at about 1/4 of normal speed until the log buffer fills up. Then logging is turned off and the run speed is about 1/2 of normal speed. This is sufficiently fast that our personal machines have run patched all the time for over a year. With loads and stores patched in the operating system and applications, the worst slowdown we've seen is about 1/8 of normal speed. Patched images are 30-50% larger than the originals.

PatchWrx is an offshoot of the ATUM work in tracing using microcode [ASH86], and work with binary translation [SCK+93]. Our approach is similar to other inline tracing efforts, but differs significantly in at least one dimension. Most published studies are of user code only [EKKL90, LB94], or are done on a single processor [BKW90, CB93], or require rebuilding source code [SJF92], or trace only cache misses, not all instructions [CHRG95, TGH92]. None use Windows NT. The excellent Shade paper [CK94] summarizes about thirty previous tools. Using that paper's classification, PatchWrx, like ATUM, traces executables, user and system code, multiple domains, multiple processors, signals, dynamic linking, and bugs, with performance similar to Shade.

We chose to produce traces rather than do on-the-fly data analysis [SE94] because of the difficulty of recreating complex execution environments months after the original investigation. With a detailed trace, questions asked a year or more later can still be answered. Somewhat like an electron microscope for computing, the PatchWrx approach is for studying a small amount of execution in excruciating detail, rather than summarizing long-running executions.

All of our experiments have been performed under Windows NT, version 3.5. The uniprocessor experiments were run on an Alpha AXP 150 with 128 MB of main memory. The multiprocessor experiments were run on a four-processor AlphaServer 2100, with 190 MHz processors and 256 MB of memory.

2.2 Trace Contents

The final output of PatchWrx is a trace containing the sequence of instructions executed by the operating system and all applications from the time logging was enabled up until the log buffer (from which the trace is reconstructed) filled up.

Each instruction in a trace is tagged with its program counter value, and if it is a load or store instruction, the virtual memory address that is its source or target. In a multiprocessor trace, the instruction is also tagged with its processor number.

Some versions of our traces also contain timestamps on all branching instructions except for two-way branches. These are used to help line up the i-streams from different processors in the multiprocessor traces, and to compute lock holding times in our second study, described in Section 4.
2.3 Log Entries

While a patched system is running, a log is collected recording the information necessary to eventually reproduce a full system trace. A log is a sequence of entries describing branching events or data references. Figure 1 summarizes the different kinds of log entries.

Figure 1: Formats of PatchWrx log entries. Each entry is 64 bits.

As in most computer designs, the Alpha architecture has two basic forms of branching instructions: a jump to an address in a register, and a PC-relative two-way conditional branch. For each jump instruction, we record one log entry. For two-way branches, we accumulate a bit-vector recording the outcomes of up to 60 two-way branches in a single log entry, one bit per branch. This gives us a compact encoding of the exact flow within a subroutine, taking only about 10% of the log entries.
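As a concrete illustration, the C sketch below shows one way such a bit-vector accumulator could work. It is an editorial reconstruction in present-day C, not PatchWrx source: the type tag ENTRY_BRVEC, the log_append primitive, and the field layout are assumptions; only the 64-bit entry size, the 60-branch capacity, and the flush-before-interrupt behavior (see Section 2.7) come from the text.

    #include <stdint.h>

    #define ENTRY_BRVEC 0x1ULL   /* hypothetical 4-bit entry type tag */
    #define MAX_BR_BITS 60

    static uint64_t br_bits;     /* accumulated outcomes, LSB first  */
    static int      br_count;    /* number of valid bits so far      */

    extern void log_append(uint64_t entry);  /* stores one 8-byte entry */

    /* Flush the partial vector: done when all 60 bits are full, and
     * also just before an interrupt is logged, so reconstruction
     * knows exactly how many branch bits precede the interrupt. */
    void flush_brvec(void)
    {
        if (br_count == 0)
            return;
        log_append((ENTRY_BRVEC << 60) | br_bits);
        br_bits = 0;
        br_count = 0;
    }

    /* Conceptually invoked by the patch at every two-way branch. */
    void log_cond_branch(int taken)
    {
        if (taken)
            br_bits |= 1ULL << br_count;
        if (++br_count == MAX_BR_BITS)
            flush_brvec();
    }

Packing 60 outcomes into one entry is what makes conditional branches only about 10% of the log despite being the most frequent control-flow events.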
When an interrupt or page fault occurs during logging, the address of the first instruction of the interrupt or fault handler is put in the log, along with the address of the first instruction not executed because of the interrupt (the instruction that would normally be executed immediately after the interrupt handler returns). This information is used during trace reconstruction to determine exactly where the interrupt occurred in the i-stream.

Additional log entries record information about memory load and store instructions, and process context switches.

In the Alpha implementation of PatchWrx, the log is recorded in a 45 MB portion of physical main memory that is reserved at boot time, and is therefore invisible to the operating system. The log buffer holds about 5.9 million eight-byte log entries, which is enough for 5-20 seconds of real time. There is so much information in a single reconstructed trace that we have not been motivated to try stitching multiple traces together [CHRG95, AH90]; a single reconstructed trace contains about 650 MB of dynamic i-stream with instruction and data addresses.

Recording the log in main memory is much faster than recording it on disk or tape. Recording in physical memory instead of virtual memory allows us to log all levels of the operating system, including the page-fault handler, without generating recursive page faults. It also allows us to trace across multiple threads running in multiple address spaces, without needing to write a log entry to one address space while executing in a different address space.

2.4 PAL Subroutines for Logging

To implement the logging code, we use the Alpha architecture's PALcall instruction, which traps to one of a set of Privileged Architecture Library (PAL) specialized subroutines without disturbing any programmer-visible state, such as registers. These subroutines have access to physical main memory and to internal hardware registers, and they run with interrupts turned off. We extended the PAL-code for Alpha NT with eight additional subroutines, and we modified some of the existing subroutines, as summarized in Table 1. Other architectures may have supervisor call or trap instructions that, in conjunction with modified operating system kernel interrupt routines, could be used to get a similar effect.

    PAL routine              Action/Recorded info.
    PalReset                 (Set aside log memory)
    InterruptStackDispatch   next addr., interrupt target
    SoftwareInterrupt        next addr., interrupt target
    DispatchMmFault          next addr., page fault target
    UNALIGNED                next addr., align. fault target
    RFE                      return from exception target
    CALLSYS                  sys. call target
    RETSYS                   return from sys. call target
    SWPCTX                   new process ID

    pwrdent                  read log entry from buffer
    pwctrl                   init. log, turn logging on/off
    pwbsr                    branch entry
    pwjsr                    jump/call/return entry
    pwldst                   load/store base register entry
    pwbrt                    cond. branch taken
    pwbrf                    cond. branch fall-through
    pwpeek                   (for debugging only)

Table 1: Logging-related PAL subroutines. The first set are modifications to existing PAL subroutines. The second set (starting with pwrdent) are new PAL subroutines for PatchWrx.

2.5 Collecting Data Addresses

It is possible to capture data addresses by patching all load and store instructions, but this fills up the log buffer quite quickly, so we would like to avoid it. We observe that many pieces of code use multiple references off the same base register. Since the i-stream reconstruction recovers the actual instructions executed, we will have these address specifications in the dynamic i-stream. Thus, for a sequence of references over which changes to the base register value can be computed from the i-stream, we need only record the base register's value once. The effect is that only one out of every 5-10 load or store instructions is actually patched, and that for loops with constant strides through memory, only the initial base-register value outside the loop is traced.

Our patching and reconstruction algorithms are somewhat simplistic, causing the reconstruction to reach some loads and stores without knowing the base register value (as with certain interrupts). When this happens during reconstruction, we make up a random synthetic value for the base register, then track any incremental changes from there. Even in code with no load/store patches at all, this is surprisingly useful. The effect seen in the d-stream is that the bases of arrays and structures and stacks are random, but the relative access patterns within each aggregate are accurately reflected.
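The sketch below illustrates the kind of bookkeeping a reconstructor can use for this. It is a simplified illustration, not the actual PatchWrx reconstructor: the data structures, the function names, and the synthetic-base formula are invented; the behavior they model (seed a register from an occasional pwldst entry, update it from i-stream arithmetic, fall back to a random synthetic base otherwise) follows the text above.

    #include <stdint.h>
    #include <stdlib.h>

    static uint64_t regs[32];    /* mirror of the 32 integer registers */
    static int      known[32];   /* 1 if regs[r] holds a real or
                                    synthetic tracked value            */

    /* A pwldst-style log entry supplies the true base value. */
    void on_ldst_log_entry(int reg, uint64_t value)
    {
        regs[reg] = value;
        known[reg] = 1;
    }

    /* i-stream "lda dst, disp(src)" lets us track increments. */
    void on_lda(int dst, int src, int16_t disp)
    {
        regs[dst] = regs[src] + disp;
        known[dst] = known[src];
    }

    /* Effective address for a load/store "disp(base)" recovered
     * from the i-stream. */
    uint64_t dstream_address(int base, int16_t disp)
    {
        if (!known[base]) {
            /* No logged value reachable: invent a random synthetic
             * base and track increments from here, so the relative
             * access pattern within the aggregate stays accurate. */
            regs[base] = ((uint64_t)rand() << 20) & ~0xfffULL;
            known[base] = 1;
        }
        return regs[base] + disp;
    }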
2.6 Handling Multiprocessors

When PatchWrx is running on multiple processors, the log entries for all processors are merged into a single log buffer. This allows us to see the dynamics of the interactions between processors. Writing a single merged log requires doing a multiprocessor-atomic update of the next-log-entry pointer, and requires encoding the processor number in each log entry.

Rather than interleaving single entries, we allocate chunks of 128 entries to each processor. This cuts down the frequency of atomic updates by two orders of magnitude and lets us encode the processor number once per chunk of entries, rather than in every entry. With log entries generated on each processor at the rate of about one per microsecond, a group of four chunks represents about 100 microseconds of real time on a four-processor system.
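A minimal sketch of this allocation scheme follows, using C11 atomics as a stand-in for the multiprocessor-atomic update (PatchWrx does this inside PAL-code with interrupts off); the chunk-header format and helper names are invented for illustration.

    #include <stdint.h>
    #include <stdatomic.h>

    #define CHUNK_ENTRIES 128

    extern uint64_t logbuf[];           /* reserved physical memory   */
    static _Atomic uint64_t next_free;  /* next free entry index      */

    typedef struct { uint64_t *next, *limit; } cpu_log_t;
    static cpu_log_t cpu_log[4];

    /* One atomic update claims 128 entries at once: roughly 100x
     * fewer atomic operations than claiming entry-by-entry, and the
     * processor number need be encoded only once, in a chunk header. */
    static void claim_chunk(int cpu)
    {
        uint64_t base = atomic_fetch_add(&next_free, CHUNK_ENTRIES);
        cpu_log[cpu].next  = &logbuf[base];
        cpu_log[cpu].limit = &logbuf[base + CHUNK_ENTRIES];
        *cpu_log[cpu].next++ = (uint64_t)cpu << 56;  /* chunk header */
    }

    void log_entry(int cpu, uint64_t entry)
    {
        if (cpu_log[cpu].next == cpu_log[cpu].limit)
            claim_chunk(cpu);
        *cpu_log[cpu].next++ = entry;
    }

Per-chunk rather than per-entry allocation also gives the reconstructor the chunk-ordering property exploited in Section 2.7: chunks appear in the buffer in the order in which they were started.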
2.7 Trace Reconstruction from Logs

Reconstruction of a full trace given a log and a set of binaries is mostly straightforward. As described above, some special care and techniques are applied for obtaining data addresses from a limited set of base register values recorded in a log. Two additional tricky issues involve handling interrupts and merging multiprocessor traces.

The first issue concerns where to insert interrupts in a reconstructed i-stream. During reconstruction, one potential place to divert the i-stream to the interrupt handler is after all jump entries that precede the interrupt have been consumed, when the instruction that matches the not-executed address in the next log entry is encountered. If the not-executed instruction is inside a loop, the interrupt must be delivered during the right iteration. Loop iterations are controlled by jumps or conditional branches. So it is necessary to consume not only all the preceding jump entries in the trace, but also exactly the right number of conditional branch bits before delivering the interrupt. For this reason, we flush the partially accumulated taken/fall-through vector into the log buffer just before recording the trace entries for an interrupt. This allows a perfect reconstruction of where to deliver an interrupt.

The second issue concerns merging traces from multiple processors. This requires special care because the time stamps within the entries come from four different cycle clocks (oscillators) on a four-processor system. These clocks are not synchronized with each other, and we observe drift of up to 100 parts per million (100 microseconds per second) within logs. All we really know about the clocks is that the first time stamp in chunk N was created after the first time stamp in chunk N-1 and before the first one in chunk N+1. By carefully applying running inequalities between hundreds of chunks, we can map the drifting time bases into a single absolute time from the beginning of the log, within a window about two microseconds wide. We use this derived absolute time base to interleave the instructions in the reconstructed merged i-stream. The effect is that an instruction on one processor that stores to a shared variable is close (within dozens of instructions at a 4x tracing slowdown) in the final i-stream to an instruction on another processor that reads the shared variable and in fact sees the new value. Although we cannot guarantee that the store precedes the load in the final i-stream, we find this close enough for understanding multiprocessor dynamics, including cache interactions.
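The paper does not spell out the inequality-propagation algorithm, so the following is only a simplified illustration of the idea: treat the chunk allocation order as a crude global timeline and fit one linear map (offset plus drift) per processor to it, which is enough to absorb oscillator drift of the magnitude reported above. All names and the least-squares approach are assumptions of this sketch.

    #include <stdint.h>

    #define NCPUS 4

    typedef struct {
        int    cpu;
        double local_t0;    /* first time stamp, that CPU's clock */
        double global_t0;   /* derived absolute time (output)     */
    } chunk_t;

    /* chunks[] is in allocation order: chunk i was started after
     * chunk i-1 and before chunk i+1, whichever CPUs they belong to. */
    void fit_time_bases(chunk_t *c, int n, double usec_per_chunk,
                        double a[NCPUS], double b[NCPUS])
    {
        double sx[NCPUS] = {0}, sy[NCPUS] = {0};
        double sxx[NCPUS] = {0}, sxy[NCPUS] = {0};
        int    m[NCPUS] = {0};

        for (int i = 0; i < n; i++) {
            double x = c[i].local_t0;         /* local clock reading */
            double y = i * usec_per_chunk;    /* crude global time   */
            int    p = c[i].cpu;
            sx[p] += x; sy[p] += y; sxx[p] += x * x; sxy[p] += x * y;
            m[p]++;
        }
        for (int p = 0; p < NCPUS; p++) {     /* global = a + b*local */
            if (m[p] < 2) { a[p] = 0.0; b[p] = 1.0; continue; }
            double d = m[p] * sxx[p] - sx[p] * sx[p];
            b[p] = (m[p] * sxy[p] - sx[p] * sy[p]) / d;
            a[p] = (sy[p] - b[p] * sx[p]) / m[p];
        }
        for (int i = 0; i < n; i++)
            c[i].global_t0 = a[c[i].cpu] + b[c[i].cpu] * c[i].local_t0;
    }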
2.8 I-stream Distortion

Because the patches introduce extra instructions and extra memory references on an instrumented system, there is necessarily some distortion in the timing of the reconstructed i-stream. There is little distortion of the i-stream itself, since the reconstruction excludes the patches. The paths through subroutines and the sequence of subroutine calls are the same in the reconstructed i-stream as they would have been in an uninstrumented system. The reconstructed and uninstrumented i-streams differ in four subtle ways, however.

First, the patched images are bigger than the originals, so shared-library images loaded into consecutive memory locations end up at somewhat different addresses. This can have a slight effect on instruction cache (i-cache) simulations for large caches based on the traced instruction addresses. Patching expands images by multiples of 64 kilobytes, so the larger images have no effect on small cache simulations; the low 16 bits of patched and original i-stream addresses are identical. There can be a similar slight effect on data cache (d-cache) and translation buffer simulations, as patches can also cause data to be moved.

Second, the patched images can take i-stream page faults on the pages containing the patches. While the patches themselves never show up in the reconstructed i-stream, the extra page fault traps and processing do. This appears to be only a slight distortion.

Third, timer interrupts happen four times more frequently in the traced code than in the uninstrumented code, due to the 4x tracing slowdown. In Windows NT 3.5, this increases the total number of instructions executed by about 1 percent. We have chosen to ignore this in our studies, but one could mechanically remove three of every four timer interrupts to further reduce the distortion.

Finally, external interrupts (disk, network, etc.) happen approximately four times sooner in the traced code than in the uninstrumented code (again due to the 4x tracing slowdown). To the extent that they are a consequence of the workload being traced, there are not more interrupts per million instructions executed; each one just occurs relatively sooner in the i-stream. Running an interrupt routine sooner than it would have run in the uninstrumented code can have a subtle effect on the memory access patterns of a processor, but there appears to be no interesting distortion when aggregated over several seconds.

3 Pin Bandwidth Study

Our first study using PatchWrx traces is a comparison of processor pin bandwidth requirements for a few different applications running under Windows NT on two Alpha processors. As shown in Figure 2, we are interested in the bandwidth between the processor chip (with on-chip i- and d-caches) and the board-level cache. This memory traffic is due to i- and d-cache misses. The question is: how many bytes per second need to cross the pins of a processor chip when running a given workload at full speed? The answer is a function of the workload, including user and system activity, the properties of the processor chip, the clock speed, and the desired clock cycles per instruction (CPI). For our study, "full speed" is 1 CPI. The answers are interesting because, as we'll see, processor pin bandwidth is a first-order bottleneck for important applications.

Figure 2: Processor pin bandwidth.

3.1 Configuration

We looked at four applications:

1. Microsoft SQL Server, October 1994 beta version, a commercial database server, running the TPC-B [Gra91] benchmark. The trace contains 64.5 million instructions, spanning several hundred transactions.

2. GEM compiler [B+92] back-end, from Digital's commercially available optimizing C compiler, compiling a 3000-line C program. The trace contains 29 million instructions.

3. tomcatv, from the SPEC92 floating point benchmarks [SPE]. The trace contains 47 million instructions.

4. ear, also from the SPEC92 benchmarks. The trace contains 83.84 million instructions.

We chose these applications because they give a variety of data points, from heavy memory system usage to light usage. At the time we started the experiments, the SPEC95 benchmarks were not yet available. It would be interesting to do the same experiments for the gcc SPEC95 benchmark, to see how it compares to the GEM results.

We took traces of these applications running on a uniprocessor Windows NT 3.5 system. The traces varied as to how much load/store patching was done. The reasons for not just doing full load/store patching relate to current PatchWrx limitations on image sizes and the slowdowns introduced by load/store patching. As shown in Table 2, none of the uniprocessor traces had loads and stores patched in the operating system. For the SQL Server and ear uniprocessor traces no loads or stores were patched in the application images. For the GEM compiler and tomcatv traces, load and store instructions were patched in the application images. None of the i-stream references in these traces are synthetic, but the d-stream synthetic references vary from 8% to 73%.

    Application   #instr.   System Ld/st   Appl. Ld/st   Synthetic
                            patched        patched       data addr.
    SQL Server    64M       n              n             73%
    GEM comp.     29M       n              Y             8%
    Tomcatv       47M       n              Y             24%
    Ear           83M       n              n             14%

Table 2: Trace characteristics.

We used these traces to simulate the caching behavior of different processor chips under the assumption that the trace executes at 1 CPI. The simulation computes the number of bytes that would need to cross the pins to the board-level cache. These numbers were broken down by the source of the traffic: instruction reference misses, and data reference misses of a variety of kinds.

We looked at the behavior of each application on two different Alpha processors: the Alpha 21064 at 200 MHz, and the Alpha 21164 at 400 MHz.

The Alpha 21064 runs at 200 MHz and has 8 kilobytes each of direct-mapped write-through i-cache and d-cache with a 32-byte cache line size. Each instruction cache miss and data read (load) miss causes 32 bytes to cross the pins. Writes go into a group of four write buffers. A write to a different address causes 32 bytes to cross the pins to flush one of the dirty write buffers. The actual pin bandwidth on this processor is realistically about 300 MB/sec. The maximum bandwidth is about 600 MB/sec.

The Alpha 21164 runs at 400 MHz and has a single 96-kilobyte 3-way associative write-back combined i-cache and d-cache with a 64-byte cache line size. Each i-cache miss and load miss causes 64 bytes to cross the pins. This processor has a write-back cache: writes that hit in the cache do not cause any bytes to cross the pins but do make a cache line dirty. Writes and reads that cause a dirty cache line to be flushed ("victim writes") cause 16-64 bytes to cross the pins, depending on what portion of the line has been modified; dirty bits are kept for each 16-byte subblock. Writes that miss also cause a cache line to be allocated and filled from memory, so they cause 64 bytes to cross the pins; some of these reads will be unnecessary overhead if all 64 bytes are eventually overwritten ("redundant reads" as compared to "useful reads"). The realistic pin bandwidth on this machine is about 750 MB/sec, and the maximum bandwidth is about 1.6 GB/sec.
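To make the units concrete, the small program below redoes the simulation's core arithmetic for one component of the traffic, using numbers taken from the results in Section 3.2.1 (the hedged assumption here is only the packaging as code; the figures themselves are the paper's). At 1 CPI a 200 MHz Alpha fetches 800 MB/sec of instruction bytes; if three of every four instruction bytes miss on-chip, as reported for SQL Server on the 21064, i-miss traffic alone is 600 MB/sec, twice the roughly 300 MB/sec the pins realistically deliver.

    #include <stdio.h>

    int main(void)
    {
        double clock_hz        = 200e6;  /* Alpha 21064              */
        double cpi             = 1.0;    /* target "full speed"      */
        double bytes_per_instr = 4.0;    /* fixed Alpha instr. size  */
        double imiss_fraction  = 0.75;   /* SQL Server: 3 of 4 bytes */
        double realistic_pins  = 300e6;  /* bytes/sec                */

        double instr_per_sec = clock_hz / cpi;
        double imiss_bw = instr_per_sec * bytes_per_instr * imiss_fraction;

        printf("i-miss traffic: %.0f MB/s (pins overcommitted %.1fx)\n",
               imiss_bw / 1e6, imiss_bw / realistic_pins);
        return 0;
    }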
3.2 Results

For each of the applications on each of the processors we looked at a plot of the processor pin bandwidth requirements over time. This shows the dynamics of the traffic in addition to giving us upper bounds on the requirements.

3.2.1 SQL Server

Figure 3 shows the simulation results for pin bandwidth requirements on the Alpha 21064 and 21164 processors for the SQL Server workload. The x-axis marks time in chunks of 0.5 million instructions, and the y-axis shows the number of gigabytes per second of traffic across the pins for a given instruction trace.[2]

[2] The bandwidth pictures show only the first 25 million instructions of each trace; very little changes in any of the traces beyond that.

Figure 3: Pin bandwidth requirements of SQL Server running on Alpha 21064 and 21164 processors.

The left-hand graph shows the 21064 results. The traffic is broken down into i-cache misses (imiss), d-cache read misses (dmiss), and write buffer flushes (dwrite). At 1 CPI, the pins are overcommitted by a factor of two just on instruction cache misses alone. Three bytes of instruction miss in the cache for every four bytes executed. The required bandwidth for all traffic is about 1.2 GB/sec, or about four times the bandwidth that the processor realistically can deliver. In fact, the uninstrumented workload achieves about 4.3 CPI, giving us confidence that the cache simulations are valid. The simulation is intended to give lower bounds on CPIs based on pin bandwidth requirements; the extra 0.3 CPI in practice is due to non-zero latencies that are not included in the simulation.

The right-hand graph shows the 21164 results. Here we split up the d-cache write traffic into victim writes (dvict), and useful and redundant reads triggered by write misses (dread). The required pin bandwidth here is as high as 1.7 GB/sec, so the pins are overcommitted by more than a factor of 2 at 1 CPI. The pins are still often overcommitted just on i-cache misses alone, although the i-cache miss traffic periodically dips down below the 750 MB/sec that the processor typically delivers.
3.2.2 Compiler Back-end

The second trace for which we simulated the pin traffic is from the GEM compiler run. The results are shown in Figure 4. The left-hand graph shows that the required pin bandwidth on the Alpha 21064 is about 600 MB/sec. This is about half as much as the SQL Server workload on this machine, with a somewhat lower ratio of i-cache to d-cache traffic. The pins are still overcommitted by a factor of two according to the simulation. The uninstrumented workload on this machine runs at about 2.5 CPI.

The right-hand graph shows the results for this workload on the Alpha 21164 400 MHz processor. Note that the i-stream here mostly fits into the 96-kilobyte cache, with about 200 MB per second of traffic left over. The total pin bandwidth requirements are generally under about 500 MB/sec, with occasional spikes over 800 MB/sec. Thus, pin bandwidth is mostly not a bottleneck for GEM on the 21164.

Figure 4: Pin bandwidth requirements of GEM compiler running on Alpha 21064 and 21164 processors.

3.2.3 SPEC Benchmarks

For comparison with the commercial workloads, we chose two of the SPEC92 floating point benchmarks to examine.

Figure 5 shows the simulation results for the tomcatv trace. On the 21064, the i-stream essentially fits into the 8-kilobyte i-cache. The program repeatedly goes through phases where the pins are overcommitted by factors of 1.5 to 3 or more, and then are okay. The actual program runs at between 1 and 1.5 CPI on this machine. On the 21164, more of the workload falls within the bandwidth limitations, with the periodic lower plateaus well within the capacity of the processor pins, but there are still periodic benchmark array access bursts that are factors of two or more higher than the available bandwidth. Note that the instruction stream mostly fits in the cache, and so doesn't contribute much to the bandwidth requirements. Also, the demands for data bandwidth fall into a predictable pattern. Perhaps prefetching during times when bandwidth is underutilized could help for this workload.

Figure 5: Pin bandwidth requirements of tomcatv running on Alpha 21064 and 21164 processors.

Figure 6 shows the simulation results for the ear trace running on the same two processors. We see that even on the 21064, the pin bandwidth is essentially sufficient, with required bandwidth about 200 MB/sec. The actual workload runs at 1 CPI or less (because of the dual instruction issue capabilities of the processor). The requirements for ear on the 21164 400 MHz processor are well within the range of the processor. A large part of the workload fits in the caches in this case.

Figure 6: Pin bandwidth requirements of ear running on Alpha 21064 and 21164 processors.
3.3 Memory Maps

Looking at the patterns of memory accesses over time helps illuminate the results of the pin bandwidth simulations shown above. Figures 7-11 show the memory footprints for the first 25 million instructions of each trace.[3] The x-axis shows time from left to right, in millions of instructions executed. The y-axis shows virtual addresses accessed, modulo 4 MB (chosen because it is an interesting board cache size). There is a dot in the picture for each instruction address and for each load/store data address.

[3] Except for the ear trace, which has about 21 million instructions shown in this version.

Figure 7 shows the memory footprint for the user-only part of the tomcatv benchmark. The bottom solid line shows the instruction fetches for the benchmark, all in a tight loop on a single page of code. The leftmost group of six upward-sloping lines are data references sweeping forward through six half-megabyte arrays. These are followed by some faster forward sweeps through different arrays, four backward sweeps, and then the data pattern repeats. This program has very predictable branching patterns and very predictable data accesses.

Figure 7: tomcatv memory access plot showing the distribution of user-space addresses over time. Looking at any vertical slice we see all the addresses (mod 4 MB) that were accessed within that time slice.

In Figure 8, we have added the operating system part of the trace. The periodic vertical tics about one sixth of the way up from the bottom of the picture are the i-stream for the timer interrupt routine (about four times too often because of the tracing slowdown). The other dots that correlate with these are the data references made in the timer routine. The occasional vertical lines of high-density references are other operating-system activity, including the thread scheduler and network traffic. This operating system code has less predictable branching patterns and data accesses.

Figure 8: tomcatv memory accesses, user and system.

Figure 9 shows, on the same scale, a trace of both system and user addresses for the ear benchmark. The memory usage is relatively light, with few strong patterns other than the horizontal lines due to timer interrupts and some accesses to application data structures, and the vertical lines due to operating system activity.

Figure 9: ear memory accesses, user and system.

Figure 10 shows, on the same scale, a trace of the GEM compiler benchmark. The horizontal lines represent frequently used subroutines and frequently used data. The visually dominant diagonal line appears to be the buffer of program text being compiled; the compiler is sweeping forward through it (modulo 4 MB). The timer interrupts are still there, but a bit harder to see. Most of the rest of the dots are data references to linked-list data structures used by the operating system. This workload has less predictable memory access patterns than tomcatv.

Figure 10: GEM memory accesses, user and system.

Finally, Figure 11 shows, on the same scale, a trace of the SQL Server benchmark. The trace looks somewhat similar to white noise: there is very little correlation between addresses. This application falsifies the premise of a cache, that accessing a location is a good predictor that the same or a nearby location will be accessed in the near future.

Figure 11: SQL Server memory accesses, user and system.

These memory maps highlight the differences between the memory requirements of the workloads. There seems to be at least some rough correlation between the density of the memory maps and the pin bandwidth requirements.

3.4 Cache Demand in More Detail

To understand the processor bandwidth results better, we looked at the incremental cache hit curves for cache sizes from 1KB to 16MB. Except for the smallest cache, the incremental hit rate for a given size cache is the number of additional hits that occur in the given cache and miss in smaller caches. Portions of the incremental hit curve for sizes bigger than the processor's on-chip caches represent misses that produce the off-chip bandwidth demand.

In this section, we used direct-mapped combined i- and d-stream caches and a line size of 64 bytes, and ran the simulations over the initial 25M instructions of each trace.
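The definition above reduces to a simple difference of cumulative hit counts; the helper below makes that explicit. The array layout and size list are illustrative assumptions; hits[] would come from one cache simulation per size over the same trace.

    #define NSIZES 15   /* 1KB, 2KB, ..., 16MB (powers of two) */

    /* hits[i]: hits/1000 instructions in a direct-mapped cache of
     * size 1KB << i.  inc[i]: the extra hits that size i provides
     * over the next smaller size; for i above the on-chip cache
     * size, these are exactly the off-chip bandwidth demand. */
    void incremental_hits(const double hits[NSIZES], double inc[NSIZES])
    {
        inc[0] = hits[0];                    /* smallest cache: all hits */
        for (int i = 1; i < NSIZES; i++)
            inc[i] = hits[i] - hits[i - 1];
    }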
From the bandwidth charts above, we see a marked difference in i-stream miss rates across the uniprocessor workloads. Figure 12 shows the incremental i-cache hit curves for three of them (we exclude ear from this section's discussion because its memory demands are uninteresting). For all three curves, the 1KB points are off the chart at 845-957 hits/1000 instructions. This is to be expected: if straight-line code is executed with no repetition from 64-byte (16-instruction) cache lines, we would expect 15/16 hits, or 937 hits/1000 instructions. The short vertical bars mark the 8K and 96K cache sizes.

Figure 12: Incremental I-cache Hit Curves.

The tomcatv i-stream fits almost entirely in 8KB, with a little tail that hits only in main memory. The GEM compiler i-stream extends out to 32KB before it drops off, and the SQL i-stream has significant incremental hits all the way out to a 1MB cache. This large footprint is why the SQL i-stream puts such a bandwidth load on the processor pins. This result is roughly consistent with the i-cache miss rates measured by [MDO94] for TPC-A and TPC-C (although they had a larger percentage of time in operating system code than we observed). The GEM footprint produces a large bandwidth load with an 8KB on-chip cache but a small load with a 96KB cache. The tomcatv footprint presents very little off-chip bandwidth load.

Figure 13 shows the incremental d-cache hit curves. All three d-streams show a tail of references that miss even in 16MB and go all the way to main memory. The tomcatv curve has peaks at 8KB (76 misses per 1000 instructions) and 4MB that correspond to the sizes of frequently accessed arrays in that program. The 4MB peak is why the tomcatv d-stream puts a large periodic bandwidth load on the processor pins.

Figure 13: Incremental D-cache Hit Curves.

In Figure 14, we have broken down the SQL incremental i-cache hits by system and user code. The system hits are per 1000 system instructions, and the user hits per 1000 user instructions. The SQL system code is about 25% of the total i-stream. The system code not only has a bigger cache footprint, it has a larger tail of hits in main memory.

Figure 14: SQL User vs. System I-stream.

The SQL d-stream breakdown, Figure 15, shows system vs. user and also real vs. synthetic d-stream addresses. The d-stream has a high proportion of synthetic addresses, so it is not very precise. The major difference between the real and synthetic addresses is a larger proportion of synthetic addresses that miss all the way out to main memory. For both the real and synthetic d-stream addresses, the system code has a larger footprint than the user code.

Figure 15: SQL User vs. System D-stream.

Both the GEM compiler and tomcatv i-stream breakdowns (not shown here) reveal the system i-stream to have a much larger footprint than the user i-stream. Similarly, the d-stream breakdowns show the system d-stream to have a much larger footprint than the user one (although tomcatv is close, due to the large array accesses).

In addition to observing larger footprints in system and SQL Server code, the incremental cache hit curve allows us to compute the relationship between the CPI of a workload and the performance of various levels of the memory hierarchy. If we multiply each point of the incremental cache hit curves by the time it takes to service that hit in a real memory subsystem, we get the total time spent waiting on memory accesses for that workload.

Using realistic times for early Alpha computer systems, the SQL i-stream and d-stream hits beyond 96KB each total about 1.8 CPI. Combined with a 100%-hits execution rate of about 1 CPI, these total about 4.6 CPI. This is roughly consistent with the measurements in [CB94] of 4.3 CPI, 20-25% operating system time, and 39%+36% i-stream plus d-stream stall time, on the TP1 database workload on an older AlphaSystem 7000 computer.
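In code form, the accounting just described is a weighted sum; the function below is a sketch of it (the parameter names and the idea of passing per-level service times as an array are ours, not the paper's). With the SQL numbers above, the i- and d-stream terms beyond 96KB contribute about 1.8 CPI each, so base CPI of 1 plus 1.8 plus 1.8 gives roughly 4.6 CPI.

    /* CPI of a workload from its incremental hit curves: base CPI
     * (all hits on-chip) plus, for each memory level, the hits per
     * instruction served at that level times that level's service
     * time in cycles.  Latencies are placeholders, not measured
     * Alpha values. */
    double workload_cpi(const double inc_hits_per_instr[],
                        const double level_cycles[],
                        int nlevels, double base_cpi)
    {
        double cpi = base_cpi;
        for (int i = 0; i < nlevels; i++)
            cpi += inc_hits_per_instr[i] * level_cycles[i];
        return cpi;
    }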
            21064   21164
    SQL     3.13    1.87
    GEM     1.20    0.28
    tom     0.04    0.06
    ear     0.05    0.02

Table 3: Average bytes per instruction of pin traffic due to the i-streams of each of the benchmarks on the two processors studied.

3.5 Discussion

We conclude from these experiments that processor pin bandwidth is a bottleneck for some applications. SQL Server needs about two to four times the bandwidth of tomcatv, and is limited to about 4 CPI on at least one current processor. According to another experiment that we ran, a future processor similar to the Alpha 21164 but running at 500 MHz would need at least 2 GB/sec of pin bandwidth to run the SQL Server benchmark at 1 CPI.

Both SQL Server and GEM have higher pin bandwidth requirements due to i-stream traffic than either of the SPEC benchmarks, as summarized in Table 3. This i-stream traffic contributes significantly to the pin bandwidth bottleneck for SQL Server. Therefore, chip designers should be looking at more than SPEC benchmarks if they want their future chips to run commercial workloads well.

Our experiments also suggest that it is increasingly important for algorithm designers to pay attention to memory structure and cache parameters if they want their code to run fast. One example of such an effort is the AlphaSort work [NBC+94], where careful attention to these memory details paid off handsomely.

4 Multiprocessor Study

Our second study examines the locking behavior of SQL Server running the TPC-B benchmark on a four-processor AlphaServer 2100 symmetric multiprocessor. The traces were taken for SQL Server version 6.00.85a (March 1995 beta release), running Windows NT 3.5, with 12 clients running transactions against the database. We chose this number of clients to minimize the idle time of the system, which ended up being about 2-3%.

The trace contains about 69 million instructions. In contrast to the uniprocessor traces, operating system loads and stores are patched. For the application code, only loads and stores to lock variables are patched.

We use the trace in two ways. First, we look at a picture showing the timeline of software lock acquisition and release activity. Then, we simulate the cache coherence protocol employed on the machine to understand how the locking activity affects communication among the processors.

4.1 Software Lock Activity

Figure 16 shows the timeline of locking activity for a 20-millisecond portion of the SQL Server trace, encompassing portions of several transactions on the four processors. The picture shows groups of four lines (with backgrounds alternately shaded white and light gray), one line per processor, with shaded bars representing the elapsed times that locks are held. Time runs from left to right and top to bottom. Each group of four horizontal lines represents 2 milliseconds of elapsed time.

Figure 16: SQL Server lock activity. Time runs left-to-right and top-to-bottom.

In this picture, we distinguish three cases: the thick black bars are the operating system's dispatcher lock (KiDispatcherLock), the thick gray bars are all other lock holding times, and the thin black lines are times spent spinning while waiting to acquire a lock.[4]

[4] We usually look at this picture in color, with different colors and line widths for different locks, making more of the locking behavior apparent at once.

In the rectangle at 0.030 seconds, for example, processor 0 starts out holding no locks, processor 1 holds the dispatcher lock, processor 2 is spinning on the dispatcher lock, and processor 3 holds no locks. Processor 1 then releases the dispatcher lock and processor 2 acquires it. Four other locks are used, then processors 1 and 2 use the dispatcher lock again. Finally, processor 1 uses another lock and processor 0 uses a lock.

Lock acquisitions and releases are not reported directly in the trace, but rather must be detected by pattern matching against the common sequences of instructions used for this purpose. Therefore, the picture showing lock acquisition and release points is not perfect. However, we have taken some care to ensure that the picture is reasonably self-consistent, and consistent with the code.
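The paper does not list the instruction sequences it matches. As a hedged illustration of the idea: Alpha mutual exclusion is built on load-locked/store-conditional pairs (LDL_L/STL_C), so a simplified matcher can flag a successful acquire when a store-conditional to the same address as a pending load-locked appears on the same processor. The decoding, structures, and report_acquire callback below are all schematic assumptions; a real matcher would check the exact kernel and SQL Server code sequences.

    #include <stdint.h>

    typedef enum { OP_OTHER, OP_LDL_L, OP_STL_C } op_t;

    typedef struct { op_t op; uint64_t ea; uint64_t pc; int cpu; } instr_t;

    static uint64_t pending_ll[4];  /* per-CPU load-locked address */
    static int      have_ll[4];

    extern void report_acquire(int cpu, uint64_t lockaddr, uint64_t pc);

    void scan(const instr_t *in, long n)
    {
        for (long i = 0; i < n; i++) {
            int p = in[i].cpu;
            if (in[i].op == OP_LDL_L) {
                pending_ll[p] = in[i].ea;
                have_ll[p] = 1;
            } else if (in[i].op == OP_STL_C && have_ll[p] &&
                       in[i].ea == pending_ll[p]) {
                /* Candidate lock acquire at this trace position. */
                report_acquire(p, in[i].ea, in[i].pc);
                have_ll[p] = 0;
            }
        }
    }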
While this picture shows just a small slice of the trace, it reveals several interesting behaviors. The dispatcher lock is held for relatively long stretches of time (200-900 instructions). Also, it is held for a large percentage of the time: about 45% of the total elapsed time. This suggests that the workload could not scale up beyond eight processors, which would cause this lock to be held nearly 100% of the time.

More than 16% of the time is spent spinning while waiting for a lock, usually the dispatcher lock. This spin time would go up substantially as the lock-held time approaches 100%. Note the convoy effect in the second rectangle, at about 0.041 seconds: processors line up spinning for the dispatcher lock.

During the long stretch where processor 0 holds the dispatcher lock, a disk interrupt arrives and the processor services it while still holding the lock. Other processors are waiting to get the lock during that time.

All of the evidence above suggests that the dispatcher lock is a bottleneck. Because of the convoy effect, it would be nearly impossible to get beyond about six processors' worth of work from this code.

The picture reveals that in addition to being held for a long time overall, the dispatcher lock is held frequently. This is because the SQL Server user code goes through the dispatcher lock to block when one of its own locks is unavailable. This type of lock usage was perhaps not anticipated by the Windows NT implementers. Showing the picture to the Windows NT kernel and SQL Server implementers at the same time brought out assumptions that each group was making about the other's code.

The full-color version of this picture makes other locking patterns visible. For example, there is a repeated pattern of nine lock acquisitions (some of which are the same lock repeatedly acquired and released) in the same order when manipulating the operating system page tables. Examining the code may suggest ways to make the associated operation more efficient.

4.2 Interference from Cache Coherence

The software lock activity above, used for achieving mutual exclusion, is best tuned by changing the software. We also looked at multiprocessor hardware cache interference caused by writes on one processor affecting caches on other processors.

We looked in detail at the behavior of the cache coherence protocol with four AlphaServer 2100 processors. A pure invalidate protocol is used: a write by one processor to a cache line invalidates all other cached copies of the line in other processors. We found that 40% of the board-level cache misses in the 4-processor SQL trace were due to interference from other processors, that is, invalidates due to writes of shared variables.

Most of these misses were in fact to about 24 cache lines containing software locks. (Note that the time lost on servicing these misses is smaller than the time lost above spinning on held locks.) Some of the invalidations came from false sharing. The programmers who wrote the code assumed 32-byte cache lines and took care to put locks in different 32-byte lines than other variables. However, this particular machine has 64-byte lines, so a lock and another unrelated shared variable could end up in the same line and cause unnecessary invalidations. The ability to track shared sub-blocks in the hardware would help this problem. A better cache coherence protocol for this workload would be a hardware design that writes through for shared blocks, having recipients update or invalidate depending on how recently they had touched the line. Such a design was used in the Alpha Demonstration Unit [TCS93].
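The false-sharing hazard described above can be shown in miniature; the layouts below are our own illustration (the field names are invented). In the first struct, the lock and an unrelated hot variable occupy different 32-byte blocks, as the original programmers intended, but share one 64-byte line, so every write to counter invalidates the lock's line in other processors' caches. Padding and aligning to the actual 64-byte line size keeps them apart.

    #include <stdint.h>

    struct bad_layout {                /* safe for 32-byte lines only */
        uint32_t lock;                 /* bytes 0-3                   */
        char     pad[28];
        uint32_t counter;              /* byte 32: a different 32-byte
                                          block, but the same 64-byte
                                          line                        */
    };

    struct good_layout {
        _Alignas(64) uint32_t lock;    /* starts its own 64-byte line */
        char     pad[60];
        uint32_t counter;              /* offset 64: the next line    */
    };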
4.3 Discussion

The Windows NT 3.5 operating system code holds the dispatcher lock for so long because it is doing a full context switch while holding the lock. It would be better to make the decision about what to dispatch next while holding the lock, release the lock, and then perform the actual context switch without delaying the other processors. Allowing interrupts while holding the dispatcher lock also lengthens the path and slows down other processors. It would be better to turn off interrupts before acquiring the lock and turn them back on after releasing the lock.

The application code that triggers too-frequent use of the dispatcher lock could adopt other locking strategies, such as trying a handful of times to acquire an application lock before blocking, or doing user-level dispatching between multiple transactions.
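The first of these alternative strategies is sketched below, in present-day C for clarity. The primitives, the try count, and the block_on_lock fallback are placeholders for whatever the application runtime provides; the point is only that the kernel block path (which is what drags SQL Server through the dispatcher lock) is paid for only after a bounded spin fails.

    #include <stdatomic.h>

    #define SPIN_TRIES 8   /* "a handful": a tuning parameter */

    extern void block_on_lock(atomic_int *lock);  /* kernel-assisted wait */

    void app_lock_acquire(atomic_int *lock)
    {
        for (int i = 0; i < SPIN_TRIES; i++) {
            int expected = 0;
            if (atomic_compare_exchange_weak(lock, &expected, 1))
                return;            /* got the lock without blocking */
        }
        block_on_lock(lock);       /* only now pay for a context
                                      switch and the dispatcher lock */
    }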
Examining the dynamic locking behavior of the system reveals properties that are more difficult or impossible to see from statistical information. We can see clearly the convoy effect for locks that are in high demand, and we can see patterns of locking that may suggest inefficiencies in the code. We can also identify cases where locks are probably not a performance problem, as is true for most of the locks in our trace.

Having the traces available for simulating cache coherence protocols helped us to understand the causes of scaling problems in our multiprocessor system, and can help the designers of future systems to avoid these problems.

5 Conclusions

These studies have demonstrated that full, dynamic execution traces are useful for designing future systems, as well as for diagnosing difficult performance problems of current systems.

Obtaining such traces is practical using PatchWrx. The overhead is low enough on the uniprocessor systems (about a factor of 2 slowdown) that we always ran the systems with instrumentation in place. Overhead on multiprocessor systems was reasonable enough to get useful traces. PatchWrx traces are in the range of about 50-100 million instructions, which was enough to observe interesting dynamic behavior in the system while still being practical to collect.

The specific versions of software that we studied are inevitably already out of date with respect to currently available versions, and some of the specific problems revealed in our studies may have been fixed by now. In fact, the Microsoft SQL Server 95 and Windows NT 3.51 and 4.0 products contain performance improvements[5] that resulted from our studies [Gra95]. However, the same techniques and tools can be used to look for performance problems in the new versions, and the traces still provide valuable input for testing future chip designs.

[5] Improvements include: some collapsing of the 36 memory allocation calls per TPC-B transaction, each of which calls EnterCriticalSection; replacing a slow locking sequence macro expansion with a more efficient one, worth 5-8% all by itself; removing unnecessary calls to floating point initialization routines; avoiding calls to OS services for CPU time and clock time; compiler improvements in the setjmp/longjmp implementation; minimizing context switches; and increasing cache locality.

We find that real pieces of large software are bigger than any practical on-chip caches, and therefore use much more processor pin bandwidth. Pin bandwidth can be a bottleneck to running commercial workloads at the desired speed. For the near future at least, computer design is memory design, and software design is memory design. Pin bandwidth is an important consideration (as much as memory latency) because it sets a lower bound on the number of clock cycles per instruction needed to execute a workload.

Although this paper emphasizes the effect of bandwidth on the performance of these processors, one shouldn't ignore the effect of latency. If there is insufficient bandwidth to perform a given workload, the workload will run proportionally slowly, and there is nothing software can do to compensate for this. Systems that have sufficient bandwidth but high memory-access latency can possibly perform well, but only for software that is designed to hide the high latency. Both high bandwidth and low latency are necessary for peak performance on most commercial software.

6 Acknowledgments

A number of people at Digital have contributed to this effort. Benn Schreiber, Tom van Baak, Miche Baker-Harvey, and Joe Notarangelo gave us immeasurable help in getting the NT PAL-code modifications running. Rich Witek wrote the original single-processor boot PAL-code. Mike Burrows contributed expertise on binary modification techniques. Wim Colgate provided the key insight into booting NT multiprocessor systems. Dave Hunter generously provided time on a four-processor AlphaServer 2100 system. Chuck Thacker, Dave Conroy, Bill Weihl, Greg Nelson, Amitabh Srivastava, and Alan Eustace provided continuing encouragement. Our summer interns, Cliff Mercer and David Martin, struggled with early software and each provided key insights for increasing performance of the workloads studied. David Martin was responsible for the cache coherence protocol simulations. We also thank Jim Gray and David Cutler at Microsoft.

The observations of Richard Draves and the anonymous OSDI referees substantially improved the paper.

References

[AH90] Anant Agarwal and M. Huffman. Blocking: Exploiting spatial locality for trace compaction. Performance Evaluation Review, 18(1):48-57, May 1990.

[ASH86] Anant Agarwal, Richard L. Sites, and Mark Horowitz. ATUM: A new technique for capturing address traces using microcode. In Proc. of the 13th International Symposium on Computer Architecture, pages 119-127, June 1986.

[B+92] David S. Blickstein et al. The GEM optimizing compiler system. Digital Technical Journal, 4(4), 1992. Also available as https://ftp.digital.com/pub/Digital/DECinfo/DTJ/axp-gem.ps.

[BKW90] Anita K. Borg, R. E. Kessler, and David W. Wall. Generation and analysis of very long address traces. In Proc. of the 17th Annual Symposium on Computer Architecture, pages 270-279, May 1990.

[CB93] Brad J. Chen and B. Bershad. The impact of operating system structure on memory system performance. In Proc. of the 14th International Symposium on Operating Systems Principles, pages 120-133, December 1993.

[CB94] Zarka Cvetanovic and Dileep Bhandarkar. Characterization of Alpha AXP performance using TP and SPEC workloads. In Proc. of the 21st Annual Symposium on Computer Architecture, pages 60-70, April 1994.

[CHRG95] John Chapin, Stephen A. Herrod, Mendel Rosenblum, and Anoop Gupta. Memory system performance of UNIX on CC-NUMA multiprocessors. In ACM SIGMETRICS, pages 1-13, May 1995.

[CK94] Robert F. Cmelik and David Keppel. Shade: A fast instruction-set simulator for execution profiling. In ACM SIGMETRICS, pages 128-137, May 1994.

[EKKL90] Susan J. Eggers, David Keppel, Eric J. Koldinger, and Henry M. Levy. Techniques for efficient inline tracing on a shared-memory multiprocessor. In ACM SIGMETRICS, pages 37-47, May 1990.

[Gra91] Jim Gray, editor. The Benchmark Handbook, chapter 2, pages 79-117. Morgan Kaufmann, San Mateo, California, 1991.

[Gra95] Jim Gray, December 1995. Private communication.
[LB94] James R. Larus and Thomas Ball. Rewriting executable files to measure program behavior. Software - Practice and Experience, 24(2):197-218, February 1994.

[MDO94] Ann Marie Grizzaffi Maynard, Colette M. Donnelly, and Bret R. Olszewski. Contrasting characteristics and cache performance of technical and multi-user commercial workloads. In Proc. of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 145-155, October 1994.

[NBC+94] Chris Nyberg, Tom Barclay, Zarka Cvetanovic, Jim Gray, and David B. Lomet. AlphaSort: A RISC machine sort. In SIGMOD Conference, pages 233-242, 1994.

[SCK+93] Richard L. Sites, Anton Chernoff, Matthew B. Kirk, Maurice P. Marks, and Scott G. Robinson. Binary translation. Communications of the ACM, 36(2):69-81, February 1993.

[SE94] Amitabh Srivastava and Alan Eustace. ATOM: A system for building customized program analysis tools. In Proc. of the ACM Conference on Programming Language Design and Implementation, pages 196-205, 1994.

[SJF92] Craig B. Stunkel, Bob Janssens, and W. Kent Fuchs. Address tracing of parallel systems via TRAPEDS. Microprocessors and Microsystems, 16(5):249-261, 1992.

[SPE] SPEC CPU92 benchmarks. World Wide Web URL: https://www.specbench.org/osg/cpu92/.

[SW95] Richard L. Sites and Richard T. Witek. Alpha AXP Architecture Reference Manual. Digital Press, Newton, MA, 2nd edition, 1995.

[TCS93] Charles P. Thacker, David G. Conroy, and Lawrence C. Stewart. The Alpha Demonstration Unit: A high-performance multiprocessor. Communications of the ACM, 36(2):55-67, February 1993.

[TGH92] J. Torrellas, A. Gupta, and John Hennessy. Characterizing the cache performance and synchronization behavior of a multiprocessor operating system. In Proc. of the 5th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 162-174, October 1992.