################################################
	   #                                              #
	   # ##   ## ###### ####### ##    ## ## ##     ## #
	   # ##   ## ##  ## ##      ###   ## ##  ##   ##  #
	   # ##   ## ##     ##      ####  ## ##   ## ##   #
	   # ##   ## ###### ######  ## ## ## ##    ###    #
	   # ##   ##     ## ##      ##  #### ##   ## ##   #
	   # ##   ## ##  ## ##      ##   ### ##  ##   ##  #
	   # ####### ###### ####### ##    ## ## ##     ## #
	   #                                              #
	   ################################################


	 The following paper was originally published in the
      Proceedings of the USENIX 1996 Annual Technical Conference
		San Diego,  California,  January 1996


	For more information about USENIX Association contact:

		   1. Phone:	510 528-8649
		   2. FAX:	510 548-5738
		   3. Email:	office@usenix.org
		   4. WWW URL:  https://www.usenix.org


              lmbench: Portable tools for performance analysis

                                 Larry McVoy
                           Silicon Graphics, Inc.

                                Carl Staelin
                        Hewlett-Packard Laboratories

                                  Abstract

   lmbench  is  a  micro-benchmark  suite  designed to focus attention on the
basic building blocks of many common system applications, such as  databases,
simulations,  software development, and networking.  In almost all cases, the
individual tests are the result of analysis and  isolation  of  a  customer's
actual  performance  problem.  These tools can be, and currently are, used to
compare different system implementations from different vendors.  In  several
cases,  the  benchmarks  have  uncovered  previously  unknown bugs and design
flaws.  The results have shown a strong  correlation  between  memory  system
performance and overall performance.  lmbench includes an extensible database
of results from systems current as of late 1995.

1.  Introduction

   lmbench provides a suite of benchmarks that attempt to  measure  the  most
commonly  found  performance  bottlenecks  in a wide range of system applica-
tions.  These bottlenecks have been identified, isolated, and reproduced in a
set  of small micro-benchmarks, which measure system latency and bandwidth of
data movement among the processor and memory, network, file system, and disk.
The intent is to produce numbers that real applications can reproduce, rather
than the frequently quoted and somewhat less reproducible  marketing  perfor-
mance numbers.

   The  benchmarks  focus on latency and bandwidth because performance issues
are usually caused by latency problems, bandwidth problems, or some  combina-
tion  of the two.  Each benchmark exists because it captures some unique per-
formance problem present in one or more important applications.  For example,
the  TCP latency benchmark is an accurate predictor of the Oracle distributed
lock manager's performance, the memory latency benchmark gives a strong indi-
cation  of Verilog simulation performance, and the file system latency bench-
mark models a critical path in software development.

   lmbench was developed to identify and evaluate system performance  bottle-
necks  present  in  many machines in 1993-1995.  It is entirely possible that
computer architectures will have changed and advanced enough in the next  few
years to render parts of this benchmark suite obsolete or irrelevant.

   lmbench  is  already in widespread use at many sites by both end users and
system designers.  In some cases, lmbench has provided the data necessary  to
discover and correct critical performance problems that might have gone unno-
ticed.  lmbench uncovered a problem in Sun's memory management software  that
made  all  pages map to the same location in the cache, effectively turning a
512 kilobyte (K) cache into a 4K cache.

   lmbench measures only a system's ability to transfer data between  proces-
sor,  cache,  memory,  network, and disk.  It does not measure other parts of
the system, such as the  graphics  subsystem,  nor  is  it  a  MIPS,  MFLOPS,
throughput,  saturation,  stress, graphics, or multiprocessor test suite.  It
is frequently run on multiprocessor (MP) systems to compare their performance
against uniprocessor systems, but it does not take advantage of any multipro-
cessor features.

   The benchmarks are written using standard, portable system interfaces  and
facilities commonly used by applications, so lmbench is portable and compara-
ble over a wide set of Unix systems.  lmbench has been run on AIX, BSDI,  HP-
UX,  IRIX,  Linux,  FreeBSD,  NetBSD, OSF/1, Solaris, and SunOS.  Part of the
suite has been run on Windows/NT as well.

   lmbench is freely distributed under the Free Software Foundation's General
Public License [Stallman89], with the additional restriction that results may
be reported only if the benchmarks are unmodified.

2.  Prior work

   Benchmarking and performance analysis is not a new  endeavor.   There  are
too many other benchmark suites to list all of them here.  We compare lmbench
to a set of similar benchmarks.

+  I/O (disk) benchmarks: IOstone [Park90] wants to be an I/O benchmark,  but
actually  measures  the  memory subsystem; all of the tests fit easily in the
cache.  IObench [Wolman89] is a systematic file system  and  disk  benchmark,
but it is complicated and unwieldy.  In [McVoy91] we reviewed many I/O bench-
marks and found them all lacking because they took too long to run  and  were
too  complex a solution to a fairly simple problem.  We wrote a small, simple
I/O benchmark, lmdd that measures sequential and random I/O far  faster  than
either  IOstone  or IObench.  As part of [McVoy91] the results from lmdd were
checked against IObench (as well as some other Sun internal I/O  benchmarks).
lmdd  proved  to be more accurate than any of the other benchmarks.  At least
one disk vendor routinely uses lmdd to do performance  testing  of  its  disk
drives.

Chen  and  Patterson [Chen93, Chen94] measure I/O performance under a variety
of workloads that are automatically varied to test the range of the  system's
performance.   Our  efforts  differ in that we are more interested in the CPU
overhead of a single request, rather than the capacity of  the  system  as  a
whole.

+   Berkeley  Software Distribution's microbench suite: The BSD effort gener-
ated an extensive set of test benchmarks to do regression testing (both qual-
ity and performance) of the BSD releases.  We did not use this as a basis for
our work (although we used ideas) for  the  following  reasons:  (a)  missing
tests -- such as memory latency, (b) too many tests, the results tended to be
obscured under a mountain of numbers, and (c) wrong copyright  --  we  wanted
the Free Software Foundation's General Public License.

+   Ousterhout's  Operating System benchmark: [Ousterhout90] proposes several
system benchmarks to measure system call latency, context  switch  time,  and
file  system  performance.   We  used the same ideas as a basis for our work,
while trying to go farther.  We measured a more complete set  of  primitives,
including  some hardware measurements; went into greater depth on some of the
tests, such as context switching; and went  to  great  lengths  to  make  the
benchmark portable and extensible.

+   Networking  benchmarks: Netperf measures networking bandwidth and latency
and was written  by  Rick  Jones  of  Hewlett-Packard.   lmbench  includes  a
smaller, less complex benchmark that produces similar results.

ttcp  is  a  widely used benchmark in the Internet community.  Our version of
the same benchmark routinely delivers bandwidth numbers that are within 2% of
the numbers quoted by ttcp.

+   McCalpin's  stream  benchmark: [McCalpin95] has memory bandwidth measure-
ments and results for a large number of high-end systems.   We  did  not  use
these  because  we  discovered  them only after we had results using our ver-
sions.  We will probably include McCalpin's  benchmarks  in  lmbench  in  the
future.

   In  summary,  we  rolled our own because we wanted simple, portable bench-
marks that accurately measured a wide variety of operations that we  consider
crucial to performance on today's systems.  While portions of other benchmark
suites include similar work, none includes all of it, few  are  as  portable,
and almost all are far more complex.  Less filling, tastes great.

3.  Benchmarking notes

3.1.  Sizing the benchmarks

   The  proper  sizing  of  various benchmark parameters is crucial to ensure
that the benchmark is measuring the right component  of  system  performance.
For  example,  memory-to-memory  copy speeds are dramatically affected by the
location of the data: if the size parameter is too small so the data is in  a
cache,  then  the  performance may be as much as ten times faster than if the
data is in memory.  On the other hand, if the memory size  parameter  is  too
big  so  the data is paged to disk, then performance may be slowed to such an
extent that the benchmark seems to `never finish.'

   lmbench takes the following approach to the cache and memory size issues:

+  All of the benchmarks that could be affected by cache size are  run  in  a
loop, with increasing sizes (typically powers of two) until some maximum size
is reached.  The results may then be plotted to see where  the  benchmark  no
longer fits in the cache.

+   The  benchmark verifies that there is sufficient memory to run all of the
benchmarks in main memory.  A small test program allocates as much memory  as
it  can,  clears the memory, and then strides through that memory a page at a
time, timing each  reference.   If  any  reference  takes  more  than  a  few
microseconds, the page is no longer in memory.  The test program starts small
and works forward until either enough memory is seen as present or the memory
limit is reached.

3.2.  Compile time issues

   The GNU C compiler, gcc, is the compiler we chose because it gave the most
reproducible results across platforms.  When gcc was not present, we used the
vendor-supplied cc.  All of the benchmarks were compiled with optimization -O
except the benchmarks that calculate  clock  speed  and  the  context  switch
times,  which  must be compiled without optimization in order to produce cor-
rect results.  No other optimization flags were  enabled  because  we  wanted
results that would be commonly seen by application writers.

   All  of  the benchmarks were linked using the default manner of the target
system.  For most if not all systems, the binaries were linked  using  shared
libraries.

3.3.  Multiprocessor issues

   All  of  the  multiprocessor systems ran the benchmarks in the same way as
the uniprocessor systems.  Some systems allow users to  pin  processes  to  a
particular CPU, which sometimes results in better cache reuse.  We do not pin
processes because it defeats the MP scheduler.  In certain cases, this  deci-
sion yields interesting results discussed later.

3.4.  Timing issues

+   Clock  resolution: The benchmarks measure the elapsed time by reading the
system clock via the gettimeofday interface.  On some systems this  interface
has  a  resolution  of  10  milliseconds, a long time relative to many of the
benchmarks which have results measured in tens to hundreds  of  microseconds.
To  compensate for the coarse clock resolution, the benchmarks are hand-tuned
to measure many operations within a single time  interval  lasting  for  many
clock  ticks.   Typically, this is done by executing the operation in a small
loop, sometimes unrolled if the  operation  is  exceedingly  fast,  and  then
dividing the loop time by the loop count.

+   Caching: If the benchmark expects the data to be in the cache, the bench-
mark is typically run several times; only the last result is recorded.

If the benchmark does not want to measure cache performance it sets the  size
parameter larger than the cache.  For example, the bcopy benchmark by default
copies 8 megabytes to 8 megabytes, which  largely  defeats  any  second-level
cache  in  use today.  (Note that the benchmarks are not trying to defeat the
file or process page cache, only the hardware caches.)

+  Variability: The results of some  benchmarks,  most  notably  the  context
switch  benchmark, had a tendency to vary quite a bit, up to 30%.  We suspect
that the operating system is not using the same set of  physical  pages  each
time  a process is created and we are seeing the effects of collisions in the
external caches.  We compensate by running the benchmark in a loop and taking
the  minimum  result.  Users interested in the most accurate data are advised
to verify the results on their own platforms.

   Many of the results included in the database were  donated  by  users  and
were  not  created  by the authors.  Good benchmarking practice suggests that
one should run the benchmarks as the only user of a  machine,  without  other
resource intensive or unpredictable processes or daemons.

3.5.  Using the lmbench database

   lmbench  includes a database of results that is useful for comparison pur-
poses.  It is quite easy to build the source, run the benchmark, and  produce
a  table  of  results that includes the run.  All of the tables in this paper
were produced from the database included in  lmbench.   This  paper  is  also
included  with  lmbench and may be reproduced incorporating new results.  For
more information, consult the file lmbench-HOWTO in the lmbench distribution.

4.  Systems tested

   lmbench  has  been run on a wide variety of platforms. This paper includes
results from a representative subset of machines and operating systems.  Com-
parisons  between similar hardware running different operating systems can be
very illuminating, and we have included a few examples in our results.

   The systems are briefly characterized in Table 1.  Please  note  that  the
list  prices  are  very  approximate  as  is  the  year of introduction.  The
SPECInt92 numbers are a little suspect since some vendors have  been  ``opti-
mizing''  for certain parts of SPEC.  We try and quote the original SPECInt92
numbers where we can.

Name         Vender        Multi Operating                        SPEC List
used         & model       or Uni    System    CPU         MhzYearInt92 price
------------------------------------------------------------------------------
IBM PowerPC  IBM 43P       Uni   AIX 3.?       MPC604      133 '95  176   15k
IBM Power2   IBM 990       Uni   AIX 4.?       Power2       71 '93  126  110k
FreeBSD/i586 ASUS P55TP4XE Uni   FreeBSD 2.1   Pentium     133 '95  190    3k
HP K210      HP 9000/859   MP    HP-UX B.10.01 PA 7200     120 '95  167   35k
SGI ChallengeSGI Challenge MP    IRIX 6.2-     R4400       200 '94  140   80k
SGI Indigo2  SGI Indigo2   Uni   IRIX 5.3      R4400       200 '94  135   15k
Linux/Alpha  DEC Cabriolet Uni   Linux 1.3.38  Alpha 21064A275 '94  189    9k
Linux/i586   Triton/EDO RAMUni   Linux 1.3.28  Pentium     120 '95  155    5k
Linux/i686   Intel Alder   Uni   Linux 1.3.37  Pentium Pro 200 '95~ 320    7k
DEC Alpha@150DEC 3000/500  Uni   OSF1 3.0      Alpha 21064 150 '93   84   35k
DEC Alpha@300DEC 8400 5/300MP    OSF1 3.2      Alpha 21164 300 '95  341? 250k
Sun Ultra1   Sun Ultra1    Uni   SunOS 5.5     UltraSPARC  167 '95  250   21k
Sun SC1000   Sun SC1000    MP    SunOS 5.5-    SuperSPARC   50 '92   65   35k
Solaris/i686 Intel Alder   Uni   SunOS 5.5.1   Pentium Pro 133 '95~ 215    5k
Unixware/i686Intel Aurora  Uni   Unixware 5.4.2Pentium Pro 200 '95~ 320    7k

                 Table 1.  System descriptions.

4.1.  Reading the result tables

   Throughout the rest of this paper, we present tables of results  for  many
of  the  benchmarks.  All of the tables are sorted, from best to worst.  Some
tables have multiple columns of results and those tables are sorted  on  only
one of the columns.  The sorted column's heading will be in bold.

5.  Bandwidth benchmarks

   By  bandwidth,  we  mean  the rate at which a particular facility can move
data.  We attempt to measure the data movement ability of a number of differ-
ent  facilities:  library  bcopy, hand-unrolled bcopy, direct-memory read and
write (no copying), pipes, TCP sockets, the  read  interface,  and  the  mmap
interface.

5.1.  Memory bandwidth

   Data  movement  is fundamental to any operating system.  In the past, per-
formance was frequently measured in MFLOPS because floating point units  were
slow  enough  that microprocessor systems were rarely limited by memory band-
width.  Today, floating point units are usually much faster than memory band-
width,  so  many  current  MFLOP  ratings can not be maintained using memory-
resident data; they are ``cache only'' ratings.

   We measure the ability to copy, read, and write data over a varying set of
sizes.   There are too many results to report all of them here, so we concen-
trate on large memory transfers.

   We measure copy bandwidth two ways.  The first is the  user-level  library
bcopy  interface.   The  second is a hand-unrolled loop that loads and stores
aligned 8-byte words.  In both cases, we took care to ensure that the  source
and  destination  locations would not map to the same lines if the any of the
caches were direct-mapped.  In order to test  memory  bandwidth  rather  than
cache  bandwidth,  both  benchmarks copy an 8M1 area to another 8M area.  (As
secondary caches reach 16M, these benchmarks  will  have  to  be  resized  to
reduce caching effects.)

   The copy results actually represent one-half to one-third  of  the  memory
bandwidth  used to obtain those results since we are reading and writing mem-
ory.  If the cache line size is larger than the word stored, then the written
cache line will typically be read before it is written.  The actual amount of
memory bandwidth used varies because some architectures have special instruc-
tions specifically designed for the bcopy function.  Those architectures will
move twice as much memory as reported by this benchmark; less advanced archi-
tectures  move  three  times as much memory: the memory read, the memory read
because it is about to be overwritten, and the memory written.

   The bcopy results  reported  in  Table  2  may  be  correlated  with  John
McCalpin's stream [McCalpin95] benchmark results in the following manner: the
stream benchmark reports all of the memory moved whereas the bcopy  benchmark
reports the bytes copied.  So our numbers should be approximately one-half to
one-third of his numbers.

   Memory reading is measured by an unrolled loop that sums up  a  series  of
integers.   On  most  (perhaps  all)  systems  measured the integer size is 4
bytes.  The loop is unrolled such that most compilers generate code that uses
a constant offset with the load, resulting in a load and an add for each word
of memory.  The add is an integer add that completes in one cycle on  all  of
the processors.  Given that today's processor typically cycles at 10 or fewer
nanoseconds (ns) and that memory is typically 200-1,000 ns  per  cache  line,
the  results  reported  here should be dominated by the memory subsystem, not
the processor add unit.

   The memory contents are added up because  almost  all  C  compilers  would
optimize out the whole loop when optimization was turned on, and would gener-
ate far too many instructions without optimization.  The solution is  to  add
up  the  data  and pass the result as an unused argument to the ``finish tim-
ing'' function.

   Memory reads represent about one-third to one-half of the bcopy work,  and
we  expect  that  pure  reads should run at roughly twice the speed of bcopy.
Exceptions to this rule should be studied, for exceptions indicate a  bug  in
the benchmarks, a problem in bcopy, or some unusual hardware.

   Memory writing is measured by an unrolled loop that stores a value into an
integer (typically a 4 byte integer) and then increments  the  pointer.   The
processor cost of each memory operation is approximately the same as the cost
in the read case.

   The numbers reported in Table 2 are not the raw  hardware  speed  in  some
cases.  The Power22 is capable of up to 800M/sec read rates [McCalpin95]  and
HP PA RISC (and other prefetching) systems also do better if higher levels of
code optimization used and/or the code is hand tuned.

-----------
  1 Some  of the PCs had less than 16M of available memory; those
machines copied 4M.
  2 Someone  described  this  machine  as a $1,000 processor on a
$99,000 memory subsystem.
-----------

                             |      Bcopy      |    Memory
               System        | unrolled   libc | read   write
               --------------+-----------------+--------------
               IBM Power2    |      242    171 |  205    364
               Sun Ultra1    |       85    167 |  129    152
               DEC Alpha@300 |       85     80 |  120    123
               HP K210       |       78     57 |  117    126
               Unixware/i686 |       65     58 |  235     88
               Solaris/i686  |       52     48 |  159     71
               DEC Alpha@150 |       46     45 |   79     91
               Linux/i686    |       42     56 |  208     56
               FreeBSD/i586  |       39     42 |   73     83
               Linux/Alpha   |       39     39 |   73     71
               Linux/i586    |       38     42 |   74     75
               SGI Challenge |       35     36 |   65     67
               SGI Indigo2   |       31     32 |   69     66
               IBM PowerPC   |       21     21 |   63     26
               Sun SC1000    |       17     15 |   38     31

                      Table 2.  Memory bandwidth (MB/s)

   The Sun libc bcopy in Table 2 is better because they use a  hardware  spe-
cific  bcopy  routine  that uses instructions new in SPARC V9 that were added
specifically for memory movement.

   The Pentium Pro read rate in Table 2 is much higher than  the  write  rate
because, according to Intel, the write transaction turns into a read followed
by a write to maintain cache consistency for MP systems.

5.2.  IPC bandwidth

   Interprocess communication bandwidth is frequently  a  performance  issue.
Many  Unix  applications  are  composed  of  several  processes communicating
through pipes or TCP sockets.  Examples include the groff documentation  sys-
tem  that  prepared  this paper, the X Window System, remote file access, and
World Wide Web servers.

   Unix pipes are an interprocess communication mechanism  implemented  as  a
one-way  byte  stream. Each end of the stream has an associated file descrip-
tor; one is the write descriptor and the  other  the  read  descriptor.   TCP
sockets  are  similar  to  pipes  except they are bidirectional and can cross
machine boundaries.

   Pipe bandwidth is measured by creating  two  processes,  a  writer  and  a
reader,  which  transfer 50M of data in 64K transfers.  The transfer size was
chosen so that the overhead of system calls and context switching  would  not
dominate  the  benchmark  time.   The reader prints the timing results, which
guarantees that all data has been moved before the timing is finished.

   TCP bandwidth is measured similarly, except the data is transferred in  1M
page  aligned  transfers instead of 64K transfers.  If the TCP implementation
supports it, the send and receive socket buffers are enlarged to 1M,  instead
of  the default 4-60K.  We have found that setting the transfer size equal to
the socket buffer size produces the greatest throughput over the most  imple-
mentations.

                   System          Libc bcopy   pipe   TCP
                   ----------------------------------------
                   HP K210                 57     93   34
                   Linux/i686              56     89   18
                   IBM Power2             171     84   10
                   Linux/Alpha             39     73    9
                   Unixware/i686           58     68   -1
                   Sun Ultra1             167     61   51
                   DEC Alpha@300           80     46   11
                   Solaris/i686            48     38   20
                   DEC Alpha@150           45     35    9
                   SGI Indigo2             32     34   22
                   Linux/i586              42     34    7
                   IBM PowerPC             21     30   17
                   FreeBSD/i586            42     23   13
                   SGI Challenge           36     17   31
                   Sun SC1000              15      9   11

                Table 3.  Pipe and local TCP bandwidth (MB/s)

   bcopy  is  important to this test because the pipe write/read is typically
implemented as a bcopy into the kernel from the writer and then a bcopy  from
the kernel to the reader.  Ideally, these results would be approximately one-
half of the bcopy results.  It is possible for the kernel bcopy to be  faster
than  the  C library bcopy since the kernel may have access to bcopy hardware
unavailable to the C library.

   It is interesting to compare pipes with TCP because the TCP  benchmark  is
identical to the pipe benchmark except for the transport mechanism.  Ideally,
the TCP bandwidth would be as good as the pipe bandwidth.  It is  not  widely
known  that  the  majority of the TCP cost is in the bcopy, the checksum, and
the network interface driver.  The checksum and  the  driver  may  be  safely
eliminated  in  the loopback case and if the costs have been eliminated, then
TCP should be just as fast as pipes.  From the pipe and TCP results in  Table
3, it is easy to see that Solaris and HP-UX have done this optimization.

   Bcopy  rates  in  Table  3  can  be lower than pipe rates because the pipe
transfers are done in 64K buffers, a size that  frequently  fits  in  caches,
while  the  bcopy  is  typically  an 8M-to-8M copy, which does not fit in the
cache.

   In Table 3, the SGI Indigo2, a uniprocessor, does better than the  SGI  MP
on pipe bandwidth because of caching effects - in the UP case, both processes
share the cache; on the MP, each process is communicating  with  a  different
cache.

   All  of  the  TCP  results in Table 3 are in loopback mode -- that is both
ends of the socket are on the same machine.  It was impossible to get  remote
networking  results  for  all  the  machines  included in this paper.  We are
interested in receiving more results for identical machines with a  dedicated
network connecting them.  The results we have for over the wire TCP bandwidth
are shown below.

                System               Network    TCP bandwidth
                ----------------------------------------------
                SGI PowerChallenge   hippi               79.3
                Sun Ultra1           100baseT             9.5
                HP 9000/735          fddi                 8.8
                FreeBSD/i586         100baseT             7.9
                SGI Indigo2          10baseT               .9
                HP 9000/735          10baseT               .9
                Linux/i586@90Mhz     10baseT               .7

                    Table 4.  Remote TCP bandwidth (MB/s)

   The SGI using 100MB/s Hippi is by far the fastest in  Table  4.   The  SGI
Hippi interface has hardware support for TCP checksums and the IRIX operating
system uses virtual memory tricks to avoid copying data as much as  possible.
For larger transfers, SGI Hippi has reached 92MB/s over TCP.

   100baseT  is  looking  quite competitive when compared to FDDI in Table 4,
even though FDDI has packets that are almost three times larger.   We  wonder
how long it will be before we see gigabit ethernet interfaces.

5.3.  Cached I/O bandwidth

   Experience  has  shown  us that reusing data in the file system page cache
can be a performance issue.  This section measures that operation through two
interfaces,  read  and  mmap.   The benchmark here is not an I/O benchmark in
that no disk activity is involved.  We wanted  to  measure  the  overhead  of
reusing  data, an overhead that is CPU intensive, rather than disk intensive.

   The read interface copies data from the kernel's file  system  page  cache
into  the  process's buffer, using 64K buffers.  The transfer size was chosen
to minimize the kernel entry overhead while remaining realistically sized.

   The difference between the bcopy and the read benchmarks is  the  cost  of
the  file  and  virtual  memory  system overhead.  In most systems, the bcopy
speed should be faster than the read  speed.   The  exceptions  usually  have
hardware  specifically  designed for the bcopy function and that hardware may
be available only to the operating system.

   The read benchmark is implemented by rereading a file  (typically  8M)  in
64K  buffers.  Each buffer is summed as a series of integers in the user pro-
cess.  The summing is done for two reasons: for an apples-to-apples  compari-
son  the  memory-mapped  benchmark  needs to touch all the data, and the file
system can sometimes transfer data into memory faster than the processor  can
read  the data.  For example, SGI's XFS can move data into memory at rates in
excess of 500M per second, but it can move data into the cache at only 68M/s.
The  intent is to measure performance delivered to the application,  not  DMA
performance to memory.

                              | Libc    File | Memory   File
                System        | bcopy   read |  read    mmap
                --------------+--------------+---------------
                IBM Power2    |   171    187 |    205   106
                HP K210       |    57     88 |    117    52
                Sun Ultra1    |   167     85 |    129   101
                DEC Alpha@300 |    80     67 |    120    78
                Unixware/i686 |    58     62 |    235   200
                Solaris/i686  |    48     52 |    159    94
                DEC Alpha@150 |    45     40 |     79    50
                Linux/i686    |    56     40 |    208    36
                IBM PowerPC   |    21     40 |     63    51
                SGI Challenge |    36     36 |     65    56
                SGI Indigo2   |    32     32 |     69    44
                FreeBSD/i586  |    42     30 |     73    53
                Linux/Alpha   |    39     24 |     73    18
                Linux/i586    |    42     23 |     74     9
                Sun SC1000    |    15     20 |     38    28

                 Table 5.  File vs. memory bandwidth (MB/s)

   The mmap interface provides a way to access the kernel's file cache  with-
out  copying  the  data.   The  mmap  benchmark is implemented by mapping the
entire file (typically 8M) into the process's address  space.   The  file  is
then summed to force the data into the cache.

   In  Table  5, a good system will have File read as fast as (or even faster
than) Libc bcopy because as the file system overhead goes to zero,  the  file
reread  case  is virtually the same as the library bcopy case.  However, file
reread can be faster because the kernel may have access to bcopy assist hard-
ware  not  available to the C library.  Ideally, File mmap performance should
approach Memory read performance,  but  mmap  is  often  dramatically  worse.
Judging  by the results, this looks to be a potential area for operating sys-
tem improvements.

   In Table 5 the Power2 does better on file reread  than  bcopy  because  it
takes  full  advantage  of  the memory subsystem from inside the kernel.  The
mmap reread is probably slower because of the  lower  clock  rate;  the  page
faults start to show up as a significant cost.

   It  is  surprising that the Sun Ultra1 was able to bcopy at the high rates
shown in Table 2 but did not show those rates for file reread in Table 5.  HP
has  the  opposite  problem,  they get file reread faster than bcopy, perhaps
because the kernel bcopy has access to hardware support.

   The Unixware system has outstanding mmap reread rates, better than systems
of  substantially higher cost.  Linux needs to do some work on the mmap code.

6.  Latency measurements

   Latency is an often-overlooked  area  of  performance  problems,  possibly
because  resolving  latency  issues  is frequently much harder than resolving
bandwidth issues.  For example, memory bandwidth may be increased  by  making
wider  cache lines and increasing memory ``width'' and interleave, but memory
latency can be improved only by shortening paths or  increasing  (successful)
prefetching.   The  first  step toward improving latency is understanding the
current latencies in a system.

   The latency measurements included in this suite are memory latency,  basic
operating  system  entry  cost, signal handling cost, process creation times,
context switching, interprocess communication, file system latency, and  disk
latency.

6.1.  Memory read latency background

   In  this  section,  we  expend considerable effort to define the different
memory latencies and to explain and justify our benchmark.  The background is
a bit tedious but important, since we believe the memory latency measurements
to be one of the most thought-provoking and useful measurements in lmbench.

   The most basic latency measurement is memory latency  since  most  of  the
other  latency measurements can be expressed in terms of memory latency.  For
example, context switches require saving the current process state and  load-
ing  the  state of the next process.  However, memory latency is rarely accu-
rately measured and frequently misunderstood.

   Memory read latency has many definitions; the most common,  in  increasing
time  order,  are  memory  chip cycle time, processor-pins-to-memory-and-back
time, load-in-a-vacuum time, and back-to-back-load time.

+  Memory chip cycle latency: Memory chips are rated in nanoseconds;  typical
speeds are around 60ns.  A general overview on DRAM architecture may be found
in  [Hennessy96].   The  specific  information  we  describe  here  is   from
[Toshiba94]  and  pertains  to the THM361020AS-60 module and TC514400AJS DRAM
used in SGI workstations.  The 60ns time is the time from  RAS  assertion  to
the  when  the  data  will be available on the DRAM pins (assuming CAS access
time requirements were met).  While it is possible to get data out of a  DRAM
in  60ns,  that  is  not all of the time involved.  There is a precharge time
that must occur after every access.  [Toshiba94] quotes 110ns as  the  random
read  or  write  cycle time and this time is more representative of the cycle
time.

+  Pin-to-pin latency: This number represents the time needed for the  memory
request  to travel from the processor's pins to the memory subsystem and back
again.  Many vendors have used the pin-to-pin definition of memory latency in
their reports.  For example, [Fenwick95] while describing the DEC 8400 quotes
memory latencies of 265ns; a careful reading of that paper shows  that  these
are  pin-to-pin  numbers.   In  spite  of  the historical precedent in vendor
reports, this definition of memory latency is  misleading  since  it  ignores
actual  delays  seen when a load instruction is immediately followed by a use
of the data being loaded.  The number of additional cycles inside the proces-
sor  can  be  significant  and  grows  more  significant  with today's highly
pipelined architectures.

   It is worth noting that the pin-to-pin numbers include the amount of  time
it  takes  to charge the lines going to the SIMMs, a time that increases with
the  (potential)  number  of  SIMMs  in  a  system.   More  SIMMs  mean  more
capacitance  which  requires  in longer charge times.  This is one reason why
personal computers frequently have better memory latencies than workstations:
the PCs typically have less memory capacity.

+   Load-in-a-vacuum latency: A load in a vacuum is the time that the proces-
sor will wait for one load that must be fetched from  main  memory  (i.e.,  a
cache  miss).   The  ``vacuum''  means that there is no other activity on the
system bus, including no other loads.  While this number is  frequently  used
as  the  memory  latency,  it is not very useful.  It is basically a ``not to
exceed'' number important only for marketing reasons.  Some architects  point
out that since most processors implement nonblocking loads (the load does not
cause a stall until the data is used), the perceived load latency may be much
less  that  the  real  latency.   When pressed, however, most will admit that
cache misses occur in bursts, resulting in perceived latencies  of  at  least
the load-in-a-vacuum latency.

+  Back-to-back-load latency: Back-to-back-load latency is the time that each
load takes, assuming that the instructions before and after are  also  cache-
missing loads.  Back-to-back loads may take longer than loads in a vacuum for
the following reason: many systems implement something known as critical word
first, which means that the subblock of the cache line that contains the word
being loaded is delivered to the processor before the entire cache  line  has
been brought into the cache.  If another load occurs quickly enough after the
processor gets restarted from the current load, the  second  load  may  stall
because the cache is still busy filling the cache line for the previous load.
On some systems, such as the current implementation of UltraSPARC,  the  dif-
ference between back to back and load in a vacuum is about 35%.

   lmbench measures back-to-back-load latency because it is the only measure-
ment that may be easily measured from software and because we feel that it is
what  most  software  developers consider to be memory latency.  Consider the
following C code fragment:

     p = head;
     while (p->p next)
          p = p->p next;

On a DEC Alpha, the loop part turns into three  instructions,  including  the
load.  A 300 Mhz processor has a 3.33ns cycle time, so the loop could execute
in slightly less than 10ns.  However, the load itself takes 400ns  on  a  300
Mhz DEC 8400.  In other words, the instructions cost 10ns but the load stalls
for 400.  Another way to look at it is that 400/3.3,  or  121,  nondependent,
nonloading  instructions  following the load would be needed to hide the load
latency.  Because superscalar processors typically  execute  multiple  opera-
tions  per  clock  cycle, they need even more useful operations between cache
misses to keep the processor from stalling.

   This benchmark  illuminates  the  tradeoffs  in  processor  cache  design.
Architects like large cache lines, up to 64 bytes or so, because the prefetch
effect of gathering a whole line increases hit rate given reasonable  spatial
locality.   Small  stride  sizes  have  high spatial locality and should have
higher performance, but large stride sizes have poor spatial locality causing
the system to prefetch useless data.  So the benchmark provides the following
insight into negative effects of large line prefetch:

+  Multi-cycle fill operations are typically atomic events at the caches, and
sometimes block other cache accesses until they complete.

+  Caches are typically single-ported. Having a large line prefetch of unused
data causes extra bandwidth demands at the cache,  and  can  cause  increased
access latency for normal cache accesses.

   In  summary,  we believe that processors are so fast that the average load
latency for cache misses will be closer to the back-to-back-load number  than
to  the load-in-a-vacuum number.  We are hopeful that the industry will stan-
dardize on this definition of memory latency.

6.2.  Memory read latency

   The entire memory hierarchy can be measured, including on-board data cache
latency  and  size,  external  data  cache  latency and size, and main memory
latency.  Instruction caches are not measured.  TLB miss latency can also  be
measured,  as  in [Saavedra92], but we stopped at main memory.  Measuring TLB
miss time is problematic because different systems map different  amounts  of
memory with their TLB hardware.

   The  benchmark  varies  two  parameters, array size and array stride.  For
each size, a list of pointers is created for all of  the  different  strides.
Then the list is walked thus:

     mov  r4,(r4)  # C code: p = *p;

The  time  to  do  about  1,000,000  loads  (the  list wraps) is measured and
reported.  The time reported is pure latency time and may be zero even though
the  load  instruction does not execute in zero time.  Zero is defined as one
clock cycle; in other words, the time reported is only memory  latency  time,
as  it  does  not include the instruction execution time.  It is assumed that
all processors can do a load instruction in one processor cycle (not counting
stalls).   In other words, if the processor cache load time is 60ns on a 20ns
processor, the load latency reported would be 40ns, the  additional  20ns  is
for  the load instruction itself.3 Processors that can manage to get the load
address out to the address pins before the end of the  load  cycle  get  some
free time in this benchmark (we don't know of any processors that do that).

   This benchmark has been validated by logic analyzer measurements on an SGI
Indy by Ron Minnich while he was at the Maryland Supercomputer Research  Cen-
ter.

   Results  from the memory latency benchmark are plotted as a series of data
sets as shown in Figure 1.  Each data set represents a stride size, with  the
array  size  varying  from  512 bytes up to 8M or more.  The curves contain a
series of horizontal plateaus, where each plateau represents a level  in  the
memory hierarchy.  The point where each plateau ends and the line rises marks
the end of that portion of the memory hierarchy (e.g., external cache).  Most
machines  have  similar  memory  hierarchies: on-board cache, external cache,
-----------
  3 In  retrospect,  this was a bad idea because we calculate the
clock rate to get the instruction execution time.  If  the  clock
rate is off, so is the load time.
-----------

			  [ Not included ]
                          Figure 1.  Memory latency

main memory, and main memory plus TLB miss costs.  There are variations: some
processors are missing a cache, while others add another cache to the hierar-
chy.  For example, the Alpha 8400 has two on-board caches,  one  8K  and  the
other 96K.

   The  cache line size can be derived by comparing curves and noticing which
strides are faster than main memory times.  The smallest stride that  is  the
same  as  main  memory  speed is likely to be the cache line size because the
strides that are faster than memory are getting more than one hit  per  cache
line.

   Figure 1 shows memory latencies on a nicely made machine, a DEC Alpha.  We
use this machine as the example because it shows the latencies and  sizes  of
the  on-chip  level 1 and motherboard level 2 caches, and because it has good
all-around numbers, especially considering it can support a 4M level 2 cache.
The  on-board cache is 213 bytes or 8K, while the external cache is 219 bytes
or 512K.

   Table 6 shows the cache size, cache latency, and main  memory  latency  as
extracted  from  the  memory  latency  graphs.   The graphs and the tools for
extracting the data are included with lmbench.  It is worthwhile to plot  all
of  the graphs and examine them since the table is missing some details, such
as the DEC Alpha 8400 processor's second 96K on-chip cache.

   We sorted Table 6 on level 2 cache latency  because  we  think  that  many
applications will fit in the level 2 cache.  The HP and IBM systems have only
one level of cache so we count that as both level 1 and level 2.   Those  two
systems  have  remarkable cache performance for caches of that size.  In both
cases, the cache delivers data in one clock cycle after the load instruction.

                                 Level 1        Level 2
                                  cache          cache       Memory
        System          Clk.   lat.   size   lat.     size   latency
        -------------------------------------------------------------
        HP K210            8      8   256K     --       --      349
        IBM Power2        14     13   256K     --       --      260
        Unixware/i686      5      5     8K     25     256K      175
        Linux/i686         5     10     8K     30     256K      179
        Sun Ultra1         6      6    16K     42     512K      270
        Linux/Alpha        3      6     8K     46      96K      357
        Solaris/i686       7     14     8K   ? 48     256K      281
        SGI Indigo2        5      8    16K     64       2M     1170
        SGI Challenge      5      8    16K     64       4M     1189
        DEC Alpha@300      3      3     8K     66       4M      400
        DEC Alpha@150      6     12     8K     67     512K      291
        FreeBSD/i586       7      7     8K     95     512K      182
        Linux/i586         8      8     8K    107     256K      150
        Sun SC1000        20     20     8K    140       1M     1236
        IBM PowerPC        7      6    16K    164   ? 512K      394

                   Table 6.  Cache and memory latency (ns)

   HP  systems usually focus on large caches as close as possible to the pro-
cessor.  A older HP multiprocessor system, the 9000/890, has a 4M, split I&D,
2  way set associative cache, accessible in one clock (16ns).  That system is
primarily a database server.

   The IBM focus is on low latency, high bandwidth memory.   The  IBM  memory
subsystem  is  good  because all of memory is close to the processor, but has
the weakness that it is extremely difficult to evolve the design to a  multi-
processor system.

   The  586 and PowerPC motherboards have quite poor second level caches, the
caches are not substantially better than main memory.

   The Pentium Pro and Sun Ultra second level caches are of medium  speed  at
5-6  clocks latency each.  5-6 clocks seems fast until it is compared against
the HP and IBM one cycle latency caches of similar  size.   Given  the  tight
integration  of  the  Pentium Pro level 2 cache, it is surprising that it has
such high latencies.

   The 300Mhz DEC Alpha has a rather high 22  clock  latency  to  the  second
level cache which is probably one of the reasons that they needed a 96K level
1.5 cache.  SGI and DEC have used large second level  caches  to  hide  their
long latency from main memory.

6.3.  Operating system entry

   Entry  into  the  operating system is required for many system facilities.
When calculating the cost of a facility, it is useful to know  how  expensive
it is to perform a nontrivial entry into the operating system.

   We measure nontrivial entry into the system by repeatedly writing one word
to /dev/null, a pseudo device driver that does nothing but discard the  data.
This particular entry point was chosen because it has never been optimized in
any system that we have measured.  Other entry points, typically  getpid  and
gettimeofday,  are heavily used, heavily optimized, and sometimes implemented
as user-level library routines rather than system  calls.   A  write  to  the
/dev/null  driver  will go through the system call table to write, verify the
user area as readable, look up the file descriptor to get the vnode, call the
vnode's write function, and then return.

                         System          system call
                         ----------------------------
                         Linux/Alpha               2
                         Linux/i586                2
                         Linux/i686                3
                         Unixware/i686             4
                         Sun Ultra1                5
                         FreeBSD/i586              6
                         Solaris/i686              7
                         DEC Alpha@300             9
                         Sun SC1000                9
                         HP K210                  10
                         SGI Indigo2              11
                         DEC Alpha@150            11
                         IBM PowerPC              12
                         IBM Power2               16
                         SGI Challenge            24

              Table 7.  Simple system call time (microseconds)

   Linux  is  the  clear  winner  in  the  system call time.  The reasons are
twofold: Linux is a uniprocessor operating system, without any  MP  overhead,
and  Linux is a small operating system, without all of the ``features'' accu-
mulated by the commercial offers.

   Unixware and Solaris are doing quite well, given that they are both fairly
large,  commercially  oriented operating systems with a large accumulation of
``features.''

6.4.  Signal handling cost

   Signals in Unix are a way to tell another  process  to  handle  an  event.
They are to processes as interrupts are to the CPU.

   Signal  handling is often critical to layered systems.  Some applications,
such as databases, software development environments, and threading libraries
provide an operating system-like layer on top of the operating system, making
signal handling a critical path in many of these applications.

   lmbench measure both signal installation and  signal  dispatching  in  two
separate  loops,  within the context of one process.  It measures signal han-
dling by installing a signal handler and then repeatedly sending  itself  the
signal.

                   System          sigaction   sig handler
                   ----------------------------------------
                   SGI Indigo2             4             7
                   SGI Challenge           4             9
                   HP K210                 4            13
                   FreeBSD/i586            4            21
                   Linux/i686              4            22
                   Unixware/i686           6            25
                   IBM Power2             10            27
                   Solaris/i686            9            45
                   IBM PowerPC            10            52
                   Linux/i586              7            52
                   DEC Alpha@300           6            59
                   Linux/Alpha            13           138

                    Table 8.  Signal times (microseconds)

   Table  8  shows the signal handling costs.  Note that there are no context
switches in this benchmark; the signal goes to the same process  that  gener-
ated  the  signal.   In  real applications, the signals usually go to another
process, which implies that the true cost of sending that signal is the  sig-
nal  overhead  plus the context switch overhead.  We wanted to measure signal
and context switch overheads  separately  since  context  switch  times  vary
widely among operating systems.

   SGI  does  very well on signal processing, especially since their hardware
is of an older generation than many of the others.

   The Linux/Alpha signal handling numbers are so poor that we  suspect  that
this  is a bug, especially given that the Linux/x86 numbers are quite reason-
able.

6.5.  Process creation costs

   Process benchmarks are used to measure the basic process primitives,  such
as  creating  a new process, running a different program, and context switch-
ing.  Process creation benchmarks are of particular interest  in  distributed
systems since many remote operations include the creation of a remote process
to shepherd the remote operation to completion.  Context switching is  impor-
tant for the same reasons.

+   Simple  process  creation.   The Unix process creation primitive is fork,
which creates a (virtually) exact copy of the calling  process.   Unlike  VMS
and  some  other  operating systems, Unix starts any new process with a fork.
Consequently, fork and/or execve should be fast  and  ``light,''  facts  that
many have been ignoring for some time.

   lmbench measures simple process creation by creating a process and immedi-
ately exiting the child process.  The parent process waits for the child pro-
cess to exit.  The benchmark is intended to measure the overhead for creating
a new thread of control, so it includes the fork and the exit time.

   The benchmark also includes a wait system call in the parent  and  context
switches  from  the  parent to the child and back again.   Given that context
switches of this sort are on the order of 20 microseconds and a  system  call
is  on  the order of 5 microseconds, and that the entire benchmark time is on
the order of a millisecond or more,  the  extra  overhead  is  insignificant.
Note  that even this relatively simple task is very expensive and is measured
in milliseconds while most of the other operations we consider  are  measured
in microseconds.

+  New process creation.  The preceding benchmark did not create a new appli-
cation; it created a copy of the old application.   This  benchmark  measures
the  cost  of  creating  a  new  process and changing that process into a new
application, which.  forms the basis of every Unix command line interface, or
shell.  lmbench measures this facility by forking a new child and having that
child execute a new program -- in this  case,  a  tiny  program  that  prints
``hello world'' and exits.

   The  startup  cost  is  especially  noticeable on (some) systems that have
shared libraries.  Shared libraries can introduce a substantial (tens of mil-
liseconds) startup cost.

                           |  fork  | fork, exec |  fork, exec
             System        | & exit |     & exit | sh -c & exit
             --------------+--------+------------+--------------
             Linux/Alpha   |    0.7 |         3  |          12
             Linux/i686    |    0.4 |         5  |          14
             Linux/i586    |    0.9 |         5  |          16
             Unixware/i686 |    0.9 |         5  |          10
             DEC Alpha@300 |    2.0 |         6  |          16
             IBM PowerPC   |    2.9 |         8  |          50
             SGI Indigo2   |    3.1 |         8  |          19
             IBM Power2    |    1.2 |         8  |          16
             FreeBSD/i586  |    2.0 |        11  |          19
             HP K210       |    3.1 |        11  |          20
             DEC Alpha@150 |    4.6 |        13  |          39
             SGI Challenge |    4.0 |        14  |          24
             Sun Ultra1    |    3.7 |        20  |          37
             Solaris/i686  |    4.5 |        22  |          46
             Sun SC1000    |   14.0 |        69  |         281

               Table 9.  Process creation time (milliseconds)

+   Complicated  new  process  creation.  When programs start other programs,
they frequently use one of three standard interfaces: popen,  system,  and/or
execlp.   The  first two interfaces start a new process by invoking the stan-
dard command interpreter, /bin/sh, to start the process.   Starting  programs
this way guarantees that the shell will look for the requested application in
all of the places that the user would look -- in other words, the shell  uses
the  user's $PATH variable as a list of places to find the application.  exe-
clp is a C library routine which also looks for the program using the  user's
$PATH variable.

   Since this is a common way of starting applications, we felt it was useful
to show the costs of the generality.

   We measure this by starting /bin/sh to start the same tiny program we  ran
in the last case.  In Table 9 the cost of asking the shell to go look for the
program is quite large, frequently ten times as expensive as just creating  a
new process, and four times as expensive as explicitly naming the location of
the new program.

   The results that stand out in Table 9 are the poor Sun  Ultra  1  results.
Given  that  the processor is one of the fastest, the problem is likely to be
software.  There is room for substantial improvement in the  Solaris  process
creation code.

6.6.  Context switching

   Context  switch  time is defined here as the time needed to save the state
of one process and restore the state of another process.

   Context switches are frequently in the critical performance path  of  dis-
tributed  applications.  For example, the multiprocessor versions of the IRIX
operating system use processes to move data  through  the  networking  stack.
This  means  that the processing time for each new packet arriving at an idle
system includes the time needed to switch in the networking process.

   Typical context switch benchmarks measure just the minimal context  switch
time  --  the time to switch between two processes that are doing nothing but
context switching.  We feel that this is misleading because  there  are  fre-
quently  more than two active processes, and they usually have a larger work-
ing set (cache footprint) than the benchmark processes.

   Other benchmarks frequently include the cost of the system calls needed to
force  the context switches.  For example, Ousterhout's context switch bench-
mark measures context switch time plus a read and a write on a pipe.  In many
of  the systems measured by lmbench, the pipe overhead varies between 30% and
300% of the context switch time, so we were careful to factor  out  the  pipe
overhead.

+   Number  of  processes.   The context switch benchmark is implemented as a
ring of two to twenty processes that are connected with Unix pipes.  A  token
is  passed  from process to process, forcing context switches.  The benchmark
measures the time needed to pass the token two thousand times from process to
process.   Each  transfer of the token has two costs: the context switch, and
the overhead of passing the token.  In order to calculate  just  the  context
switching  time,  the  benchmark first measures the cost of passing the token
through a ring of pipes in a single process.  This overhead time  is  defined
as  the cost of passing the token and is not included in the reported context
switch time.

+  Size of processes.  In order to  measure  more  realistic  context  switch
times,  we add an artificial variable size ``cache footprint'' to the switch-
ing processes.  The cost of the context switch  then  includes  the  cost  of
restoring  user-level state (cache footprint).  The cache footprint is imple-
mented  by having the process allocate an array of data4 and sum the array as
a series of integers after receiving the token but before passing  the  token
to  the  next  process.   Since  most  systems will cache data across context
switches, the working set for the benchmark is slightly larger than the  num-
ber of processes times the array size.

-----------
  4 All arrays are at the same virtual address in all  processes.
-----------

   It  is  worthwhile  to  point  out  that the overhead mentioned above also
includes the cost of accessing the data, in the same way as the actual bench-
mark.    However,  because  the overhead is measured in a single process, the
cost is typically the cost with ``hot'' caches.  In the Figure 2,  each  size
is plotted as a line, with context switch times on the Y axis, number of pro-
cesses on the X axis, and the process size as the data set.  The process size
and the hot cache overhead costs for the pipe read/writes and any data access
is what is labeled as size=0KB overhead=10.  The size is in kilobytes and the
overhead is in microseconds.

   The  context  switch time does not include anything other than the context
switch, provided that all the benchmark processes fit in the cache.   If  the
total  size  of all of the benchmark processes is larger than the cache size,
the cost of each context switch will include cache misses.  We are trying  to
show  realistic context switch times as a function of both size and number of
processes.

		       [ Not included ]
                       Figure 2.  Context switch times

   Results for an Intel Pentium Pro system running Linux at 167 MHz are shown
in  Figure 2.  The data points on the figure are labeled with the working set
due to the sum of data in all of the processes.  The actual  working  set  is
larger,  as  it  includes the process and kernel overhead as well.  One would
expect the context switch times to stay constant until  the  working  set  is
approximately  the  size  of  the second level cache.  The Intel system has a
256K second level cache, and the context switch times  stay  almost  constant
until about 256K (marked as .25M in the graph).

+   Cache  issues The context switch benchmark is a deliberate measurement of
the effectiveness of the caches across  process  context  switches.   If  the
cache  does not include the process identifier (PID, also sometimes called an
address space identifier) as part of the address,  then  the  cache  must  be
flushed  on every context switch.  If the cache does not map the same virtual
addresses from different processes to different cache lines, then  the  cache
will appear to be flushed on every context switch.

   If  the  caches  do  not  cache  across context switches there would be no
grouping at the lower left corner of  Figure  2,  instead,  the  graph  would
appear  as  a  series of straight, horizontal, parallel lines.  The number of
processes will not matter, the two process case will be just as  bad  as  the
twenty  process  case  since  the  cache  would  not be useful across context
switches.

                                | 2 processes | 8 processes
                  System        | 0KB   32KB  | 0KB   32KB
                  --------------+-------------+-------------
                  Linux/i686    |   6      18 |   7    101
                  Linux/i586    |  10     163 |  13    215
                  Linux/Alpha   |  11      70 |  13     78
                  IBM Power2    |  13      16 |  18     43
                  Sun Ultra1    |  14      31 |  20    102
                  DEC Alpha@300 |  14      17 |  22     41
                  IBM PowerPC   |  16      87 |  26    144
                  HP K210       |  17      17 |  18     99
                  Unixware/i686 |  17      17 |  18     72
                  FreeBSD/i586  |  27      34 |  33    102
                  Solaris/i686  |  36      54 |  43    118
                  SGI Indigo2   |  40      47 |  38    104
                  DEC Alpha@150 |  53      68 |  59    134
                  SGI Challenge |  63      80 |  69     93
                  Sun SC1000    | 107     142 | 104    197

                Table 10.  Context switch time (microseconds)

   We picked four points on the graph and extracted those  values  for  Table
10.  The complete set of values, as well as tools to graph them, are included
with lmbench.

   Note that multiprocessor context switch times are frequently  more  expen-
sive  than uniprocessor context switch times.  This is because multiprocessor
operating systems tend to have very complicated scheduling code.  We  believe
that multiprocessor context switch times can be, and should be, within 10% of
the uniprocessor times.

   Linux does quite well on context switching, especially on the more  recent
architectures.  By comparing the Linux 2 0K processes to the Linux 2 32K pro-
cesses, it is apparent that there is  something  wrong  with  the  Linux/i586
case.   If  we  look back to Table 6, we can find at least part of the cause.
The second level cache latency for  the  i586  is  substantially  worse  than
either the i686 or the Alpha.

   Given  the poor second level cache behavior of the PowerPC, it is surpris-
ing that it does so well on context switches,  especially  the  larger  sized
cases.

   The Sun Ultra1 context switches quite well in part because of enhancements
to the register window handling in SPARC V9.

6.7.  Interprocess communication latencies

   Interprocess communication latency is important  because  many  operations
are  control messages to another process (frequently on another system).  The
time to tell the remote process to do something is pure overhead and is  fre-
quently  in  the  critical  path  of  important functions such as distributed
applications (e.g., databases, network servers).

   The interprocess communication latency benchmarks typically have the  fol-
lowing  form:  pass a small message (a byte or so) back and forth between two
processes.  The reported results are always the microseconds needed to do one
round  trip.   For  one way timing, about half the round trip is right.  How-
ever, the CPU cycles tend to be somewhat asymmetric for one  trip:  receiving
is typically more expensive than sending.

+   Pipe  latency.   Unix  pipes  are an interprocess communication mechanism
implemented as a one-way byte stream.  Each end of the stream has an  associ-
ated  file  descriptor;  one  is  the write descriptor and the other the read
descriptor.

   Pipes are frequently used as a local IPC mechanism.  Because of  the  sim-
plicity  of  pipes,  they  are  frequently the fastest portable communication
mechanism.

   Pipe latency is measured by creating a pair of pipes, forking a child pro-
cess,  and passing a word back and forth.  This benchmark is identical to the
two-process, zero-sized context switch benchmark,  except  that  it  includes
both  the context switching time and the pipe overhead in the results.  Table
11 shows the round trip latency from process A to process B and back to  pro-
cess A.

   The time can be broken down to two context switches plus four system calls
plus the pipe overhead.  The context switch component is  two  of  the  small
processes  in  Table  10.   This benchmark is identical to the context switch
benchmark in [Ousterhout90].

+  TCP and RPC/TCP latency.  TCP sockets may be  viewed  as  an  interprocess
communication  mechanism  similar  to  pipes  with the added feature that TCP

                        System          Pipe latency
                        -----------------------------
                        Linux/i686                26
                        Linux/i586                33
                        Linux/Alpha               34
                        Sun Ultra1                62
                        IBM PowerPC               65
                        Unixware/i686             70
                        DEC Alpha@300             71
                        HP K210                   78
                        IBM Power2                91
                        Solaris/i686             101
                        FreeBSD/i586             104
                        SGI Indigo2              131
                        DEC Alpha@150            179
                        SGI Challenge            251
                        Sun SC1000               278

                   Table 11.  Pipe latency (microseconds)

sockets work across machine boundaries.

   TCP and RPC/TCP connections are frequently used in low-bandwidth, latency-
sensitive applications.  The default Oracle distributed lock manager uses TCP
sockets, and the locks per second available from this service are  accurately
modeled by the TCP latency test.

                        System          TCP   RPC/TCP
                        ------------------------------
                        Linux/i686      216       346
                        Sun Ultra1      162       346
                        DEC Alpha@300   267       371
                        FreeBSD/i586    256       440
                        Solaris/i686    305       528
                        Linux/Alpha     429       602
                        HP K210         146       606
                        SGI Indigo2     278       641
                        IBM Power2      332       649
                        IBM PowerPC     299       698
                        Linux/i586      467       713
                        DEC Alpha@150   485       788
                        SGI Challenge   546       900
                        Sun SC1000      855      1386

                    Table 12.  TCP latency (microseconds)

   Sun's  RPC  is  layered  either  over  TCP  or over UDP.  The RPC layer is
responsible for managing connections (the port  mapper),  managing  different
byte  orders  and  word sizes (XDR), and implementing a remote procedure call
abstraction.  Table 12 shows the same benchmark  with  and  without  the  RPC
layer to show the cost of the RPC implementation.

   TCP  latency is measured by having a server process that waits for connec-
tions and a client process that connects to the server.   The  two  processes
then  exchange  a  word  between them in a loop.  The latency reported is one
round-trip time.  The measurements in Table 12 are local or loopback measure-
ments,  since  our  intent is to show the overhead of the software.  The same
benchmark may be, and frequently is, used to measure host-to-host latency.

   Note that the RPC layer frequently adds hundreds of microseconds of  addi-
tional  latency.   The  problem is not the external data representation (XDR)
layer -- the data being passed back and forth is a byte, so there is  no  XDR
to  be  done.   There is no justification for the extra cost; it is simply an
expensive implementation.  DCE RPC is worse.

                        System          UDP   RPC/UDP
                        ------------------------------
                        Linux/i686       93       180
                        Sun Ultra1      197       267
                        Linux/Alpha     180       317
                        DEC Alpha@300   259       358
                        Linux/i586      187       366
                        FreeBSD/i586    212       375
                        Solaris/i686    348       454
                        IBM Power2      254       531
                        IBM PowerPC     206       536
                        HP K210         152       543
                        SGI Indigo2     313       671
                        DEC Alpha@150   489       834
                        SGI Challenge   678       893
                        Sun SC1000      739      1101

                    Table 13.  UDP latency (microseconds)

+  UDP and RPC/UDP latency.  UDP sockets are an alternative to  TCP  sockets.
They  differ  in  that  UDP  sockets  are  unreliable messages that leave the
retransmission issues to the application.  UDP sockets have a few advantages,
however.   They preserve message boundaries, whereas TCP does not; and a sin-
gle UDP socket may send messages to any number of other sockets, whereas  TCP
sends data to only one place.

   UDP  and RPC/UDP messages are commonly used in many client/server applica-
tions.  NFS is probably the most  widely  used  RPC/UDP  application  in  the
world.

   Like  TCP latency, UDP latency is measured by having a server process that
waits for connections and a client process that connects to the server.   The
two  processes  then  exchange  a  word  between them in a loop.  The latency
reported is round-trip time.  The measurements in Table 13 are local or loop-
back  measurements, since our intent is to show the overhead of the software.
Again, note that the RPC library can add hundreds of  microseconds  of  extra
latency.

                                              TCP       UDP
            System               Network    latency   latency
            ------------------------------------------------------
            Sun Ultra1           100baseT       280       308
            FreeBSD/i586         100baseT       365       304
            HP 9000/735          fddi           425       441
            SGI Indigo2          10baseT        543       602
            HP 9000/735          10baseT        592       603
            SGI PowerChallenge   hippi         1068      1099
            Linux/i586@90Mhz     10baseT       2954      1912

                 Table 14.  Remote latencies (microseconds)

+  Network latency.  We have a few results for over the wire latency included
in Table 14.  As might be expected, the most heavily used network  interfaces
(i.e., ethernet) have the lowest latencies.  The times shown include the time
on the wire,  which  is  about  130  microseconds  for  10Mbit  ethernet,  13
microseconds for 100Mbit ethernet and FDDI, and less than 10 microseconds for
Hippi.

+  TCP connection latency.  TCP is a connection-based, reliable, byte-stream-
oriented  protocol.  As part of this reliability, a connection must be estab-
lished before any data can be transferred.  The connection is accomplished by
a ``three-way handshake,'' an exchange of packets when the client attempts to
connect to the server.

   Unlike UDP, where no connection  is  established,  TCP  sends  packets  at
startup  time.   If  an application creates a TCP connection to send one mes-
sage, then the startup time can be a substantial fraction of the  total  con-
nection  and transfer costs.  The benchmark shows that the connection cost is
approximately half of the cost.

   Connection cost is measured by having a server, registered using the  port
mapper,  waiting for connections.  The client figures out where the server is
registered and then repeatedly times a connect system  call  to  the  server.
The  socket  is closed after each connect.  Twenty connects are completed and
the fastest of them is used as the result.  The time  measured  will  include
two  of  the  three  packets that make up the three way TCP handshake, so the
cost is actually greater than the times listed.

   Table 15 shows that if the need is to send a quick message to another pro-
cess, given that most packets get through, a UDP message will cost a send and
a reply (if positive acknowledgments are needed, which they are in  order  to
have an apples-to-apples comparison with TCP).  If the transmission medium is
10Mbit Ethernet, the time on the wire will be approximately  65  microseconds
each way, or 130 microseconds total.  To do the same thing with a short-lived
TCP connection would cost 896 microseconds of wire time alone.

   The comparison is not meant to disparage TCP; TCP is  a  useful  protocol.
Nor  is the point to suggest that all messages should be UDP.  In many cases,
the difference between 130 microseconds and 900 microseconds is insignificant
compared  with  other  aspects  of  application performance.  However, if the
application is very latency sensitive and the  transmission  medium  is  slow
(such  as  serial link or a message through many routers), then a UDP message
may prove cheaper.

                       System          TCP connection
                       -------------------------------
                       HP K210                    238
                       Linux/i686                 263
                       IBM Power2                 339
                       FreeBSD/i586               418
                       Linux/i586                 606
                       SGI Indigo2                667
                       SGI Challenge              716
                       Sun SC1000                 852
                       Solaris/i686              1230
                       DEC Alpha@150             3047

                Table 15.  TCP connect latency (microseconds)

6.8.  File system latency

   File system latency is defined as the time required to create or delete  a
zero  length  file.  We define it this way because in many file systems, such
as the BSD fast file system, the directory operations are done  synchronously
in  order  to  maintain  on-disk integrity.  Since the file data is typically
cached and sent to disk at some later date, the file  creation  and  deletion
become  the  bottleneck  seen by an application.  This bottleneck is substan-
tial: to do a synchronous update to a disk is a matter of tens  of  millisec-
onds.  In many cases, this bottleneck is much more of a perceived performance
issue than processor speed.

   The benchmark creates 1,000 zero-sized files and then deletes  them.   All
the  files  are  created  in one directory and their names are short, such as
"a", "b", "c", ... "aa", "ab", ....

                  System            FS     Create   Delete
                  -----------------------------------------
                  Linux/i686      EXT2FS      751       45
                  HP K210          HFS        579       67
                  Linux/i586      EXT2FS    1,114       95
                  Linux/Alpha     EXT2FS      834      115
                  Unixware/i686    UFS        450      369
                  SGI Challenge    XFS      3,508    4,016
                  DEC Alpha@300   ADVFS     4,255    4,184
                  Solaris/i686     UFS     23,809    7,246
                  Sun Ultra1       UFS     18,181    8,333
                  Sun SC1000       UFS     25,000   11,111
                  FreeBSD/i586     UFS     28,571   11,235
                  SGI Indigo2      EFS     11,904   11,494
                  DEC Alpha@150     ?      38,461   12,345
                  IBM PowerPC      JFS     12,658   12,658
                  IBM Power2       JFS     13,333   12,820

                Table 16.  File system latency (microseconds)

   The create and delete latencies are shown in Table 16.  Notice that  Linux
does  extremely well here, 2 to 3 orders of magnitude faster than the slowest
systems.   However,  Linux  does  not  guarantee  anything  about  the   disk
integrity;  the directory operations are done in memory.  Other fast systems,
such as SGI's XFS, use a log to guarantee the  file  system  integrity.   The
slower systems, all those with ~10 millisecond file latencies, are using syn-
chronous writes to guarantee the file system integrity.  Unless Unixware  has
modified  UFS substantially, they must be running in an unsafe mode since the
FreeBSD UFS is much slower and both file systems are basically the 4BSD  fast
file system.

6.9.  Disk latency

   Included with lmbench is a small benchmarking program useful for measuring
disk and file I/O.  lmdd, which is patterned after the Unix utility dd,  mea-
sures both sequential and random I/O, optionally generates patterns on output
and checks them on input, supports flushing the data from the buffer cache on
systems that support msync, and has a very flexible user interface.  Many I/O
benchmarks can be trivially replaced with a perl script wrapped around  lmdd.

   While  we  could  have generated both sequential and random I/O results as
part of this paper, we did not because those benchmarks  are  heavily  influ-
enced  by the performance of the disk drives used in the test.  We intention-
ally measure only the system overhead of a SCSI command since  that  overhead
may become a bottleneck in large database configurations.

   Some  important  applications, such as transaction processing, are limited
by random disk IO latency.  Administrators can increase the  number  of  disk
operations  per  second  by  buying  more disks, until the processor overhead
becomes the bottleneck.  The lmdd  benchmark measures the processor  overhead
associated with each disk operation, and it can provide an upper bound on the
number of disk operations the processor can support.  It is designed for SCSI
disks,  and  it  assumes  that most disks have 32-128K read-ahead buffers and
that they can read ahead faster than the processor can request the chunks  of
data.5

   The benchmark simulates a large number of disks by reading 512byte  trans-
fers  sequentially from the raw disk device (raw disks are unbuffered and are
not read ahead by Unix).  Since the disk can read ahead faster than the  sys-
tem can request data, the benchmark is doing small transfers of data from the
disk's track buffer.  Another way to look at this is that  the  benchmark  is
doing  memory-to-memory  transfers  across a SCSI channel.  It is possible to
generate loads of more than 1,000 SCSI operations/second  on  a  single  SCSI
disk.  For comparison, disks under database load typically run at 20-80 oper-
ations per second.

-----------
  5 This may not always be true: a processor could be fast enough
to make the requests faster than the rotating disk.  If  we  take
6M/second  to  be disk speed, and divide that by 512 (the minimum
transfer size), that is 12,288 IOs/second, or 81 microseconds/IO.
We don't know of any processor/OS/IO controller combinations that
can do an IO in 81 microseconds.
-----------

                        System          Disk latency
                        -----------------------------
                        SGI Challenge            920
                        SGI Indigo2              984
                        HP K210                 1103
                        DEC Alpha@150           1436
                        Sun SC1000              1466
                        Sun Ultra1              2242

                 Table 17.  SCSI I/O overhead (microseconds)

   The resulting overhead number represents a lower bound on the overhead  of
a disk I/O.  The real overhead numbers will be higher on SCSI systems because
most SCSI controllers will not disconnect if the  request  can  be  satisfied
immediately.   During  the  benchmark, the processor simply sends the request
and transfers the data, while during normal  operation,  the  processor  will
send  the  request,  disconnect, get interrupted, reconnect, and transfer the
data.

   This technique can be used to discover how many drives a system  can  sup-
port  before  the system becomes CPU-limited because it can produce the over-
head load of a fully configured system with just a few disks.

7.  Future work

   There are several known improvements and extensions that could be made  to
lmbench.

+   Memory  latency.   The current benchmark measures clean-read latency.  By
clean, we mean that the cache lines being replaced are highly  likely  to  be
unmodified,  so  there  is  no  associated write-back cost.  We would like to
extend the benchmark to measure dirty-read latency, as well as write latency.
Other changes include making the benchmark impervious to sequential prefetch-
ing and measuring TLB miss cost.

+  MP benchmarks.  None of the benchmarks in lmbench is designed  to  measure
any  multiprocessor features directly.  At a minimum, we could measure cache-
to-cache latency as well as cache-to-cache bandwidth.

+  Static vs. dynamic processes.  In the process creation section, we  allude
to the cost of starting up processes that use shared libraries.  When we fig-
ure out how to create statically linked processes on all or most systems,  we
could quantify these costs exactly.

+   McCalpin's stream benchmark.  We will probably incorporate part or all of
this benchmark into lmbench.

+  Automatic sizing.  We have enough technology that we could  determine  the
size  of the external cache and autosize the memory used such that the exter-
nal cache had no effect.

+  More detailed papers.  There are  several  areas  that  could  yield  some
interesting  papers.   The  memory  latency  section  could  use  an in-depth

treatment, and the context switching section could turn into  an  interesting
discussion of caching technology.

8.  Conclusion

   lmbench  is  a  useful, portable micro-benchmark suite designed to measure
important aspects of system performance.   We have found that a  good  memory
subsystem is at least as important as the processor speed.  As processors get
faster and faster, more and more of the system design  effort  will  need  to
move to the cache and memory subsystems.

9.  Acknowledgments

   Many people have provided invaluable help and insight into both the bench-
marks themselves and the paper.  The USENIX reviewers were  especially  help-
ful.   We  thank all of them and especially thank: Ken Okin (SUN), Kevin Nor-
moyle (SUN), Satya Nishtala (SUN), Greg Chesson  (SGI),  John  Mashey  (SGI),
Neal   Nuckolls  (SGI),  John  McCalpin  (Univ.  of  Delaware),  Ron  Minnich
(Sarnoff),  Chris  Ruemmler  (HP),  Tom  Rokicki   (HP),   and   John   Weitz
(Digidesign).

   We  would also like to thank all of the people that have run the benchmark
and contributed their results; none of this would have been possible  without
their assistance.

   Our  thanks to all of the free software community for tools that were used
during this project.  lmbench is currently developed on Linux,  a  copylefted
Unix written by Linus Torvalds and his band of happy hackers.  This paper and
all of the lmbench documentation was produced using the groff suite of  tools
written  by  James Clark.  Finally, all of the data processing of the results
is done with perl written by Larry Wall.

   Sun Microsystems, and in particular Paul Borrill,  supported  the  initial
development of this project.  Silicon Graphics has supported ongoing develop-
ment that turned into far more time then we ever imagined.  We  are  grateful
to both of these companies for their financial support.

10.  Obtaining the benchmarks

   The            benchmarks            are            available           at
https://reality.sgi.com/employees/lm engr/lmbench.tgz as well as  via  a  mail
server.   You  may  request the latest version of lmbench by sending email to
archives@slovax.engr.sgi.com with lmbench-current* as the subject.

References

[Chen94] P. M. Chen and D. A. Patterson, "A new approach to  I/O  performance
  evaluation  -  self-scaling  I/O  benchmarks,  predicted  I/O performance,"
  Transactions on Computer Systems, 12 (4), pp. 308-339, November 1994.

[Chen93] Peter M. Chen and David Patterson, "Storage  performance  -  metrics
  and  benchmarks,"  Proceedings  of  the IEEE, 81 (8), pp. 1151-1165, August
  1993.

[Fenwick95] David M. Fenwick, Denis J. Foley, William  B.  Gist,  Stephen  R.
  VanDoren, and Danial Wissell, "The AlphaServer 8000 series: high-end server
  platform development," Digital Technical Journal, 7 (1), pp. 43-65,  August
  1995.

[Hennessy96] John L. Hennessy and David A. Patterson, Computer Architecture A
  Quantitative Approach, 2nd Edition, Morgan Kaufman, 1996.

[McCalpin95] John D. McCalpin, "Memory bandwidth and machine balance in  cur-
  rent  high  performance  computers,"  IEEE  Technical Committee on Computer
  Architecture newsletter, to appear, December 1995.

[McVoy91] L. W. McVoy and S. R. Kleiman, Extent-like Performance from a  Unix
  File System, pp. 33-43, Proceedings USENIX Winter Conference, January 1991.

[Ousterhout90] John K.  Ousterhout,  Why  aren't  operating  systems  getting
  faster as fast as hardware?, pp. 247-256, Proceedings USENIX Summer Confer-
  ence, June 1990.

[Park90] Arvin Park and J. C.  Becker,  "IOStone:  a  synthetic  file  system
  benchmark," Computer Architecture News, 18 (2), pp. 45-52, June 1990.

[Saavedra95]  R.H.  Saavedra and A.J. Smith, "Measuring cache and TLB perfor-
  mance and their effect on benchmark runtimes," IEEE Transactions on Comput-
  ers, 44 (10), pp. 1223-1235, October 1995.

[Stallman89]  Free  Software  Foundation,  Richard  Stallman,  General Public
  License, 1989. Included with lmbench.

[Toshiba94]  Toshiba,  DRAM  Components  and  Modules,  pp.  A59-A77,C37-C42,
  Toshiba America Electronic Components, Inc., 1994.

[Wolman89]  Barry  L. Wolman and Thomas M. Olson, "IOBENCH: a system indepen-
  dent IO benchmark," Computer Architecture News, 17 (5), pp. 55-70,  Septem-
  ber 1989.

Biographical information

   Larry  McVoy  currently  works  Silicon Graphics in the Networking Systems
Division on high performance networked file systems and networking  architec-
ture.   His computer interests include hardware architecture, software imple-
mentation and architecture, performance issues,  and  free  (GPLed)  software
issues.   Previously at Sun, he was the architect for the SPARC Cluster prod-
uct line, redesigned and wrote an entire source management system  (now  pro-
ductized as TeamWare), implemented UFS clustering, and implemented all of the
Posix 1003.1 support in SunOS 4.1.  Concurrent with Sun work, he lectured  at
Stanford  University  on Operating Systems.  Before Sun, he worked on the ETA
systems supercomputer Unix port.  He may be  reached  by  electronically  via
lm@sgi.com or by phone at (415) 933-1804.

   Carl  Staelin  works  for  Hewlett-Packard  Laboratories  in  the External
Research Program.  His research interests include network information infras-
tructures  and  high  performance  storage systems.  He worked for HP at U.C.
Berkeley on the 4.4BSD LFS port,  the  HighLight  hierarchical  storage  file
system,  the Mariposa distributed database, and the NOW project.  He received
his PhD in Computer Science from Princeton University in 1991 in high perfor-
mance   file   system   design.    He   may  be  reached  electronically  via
staelin@hpl.hp.com.