Memory Behavior of an X11 Window System

                                J. Bradley Chen
                          School of Computer Science
                          Carnegie Mellon University


                                   ABSTRACT

We used memory reference traces from a DEC Ultrix system running the X11 window
system from MIT Project Athena and several freely available X11 applications to
measure  different  aspects  of  memory  system  behavior and performance.  Our
measurements show that memory behavior for X11  workloads  differs  in  several
important  ways  from  workloads  more  traditionally used in cache performance
studies.  User instruction cache behavior  is  a  major  component  in  overall
memory  system  delays, with significant competition within and between address
spaces.  User TLB miss rates are up to  a  factor  of  two  higher  than  other
ill-behaved  integer  workloads.  Write-buffer stalls, data cache behavior, and
uncached memory reads can be problematic for microbenchmarks, but they are  not
an issue for the realistic applications we tested.
  ______________________________
  This  research  was  sponsored in part by the Avionics Laboratory, Wright Research
and Development Center, Aeronautical Systems Division (AFSC), U. S. Air  Force,
Wright-Patterson AFB, OH 45433-6543 under Contract F33615-90-C-1465, ARPA Order
No. 7597, and by an equipment grant from Digital Equipment Corporation.
  The views and conclusions contained in this document are those of the authors
and  should  not  be  interpreted as representing the official policies, either
expressed or implied, of Digital Equipment Corporation or the U.S. Government.


1. Introduction
  We have used memory reference traces from DEC Ultrix, the X11  window  system
from  MIT  Project  Athena,  and  freely  available X11 applications to explore
several aspects of memory system behavior and performance  for  X11  workloads.
We  measured  behavior  within  the  system,  server  and  client,  as  well as
interaction between address spaces.
  Our  analysis  shows  that  memory  behavior  for   X11   workloads   differs
substantially   from   that  of  traditional  workloads,  particularly  in  the
instruction cache and TLB.  Competition within  and  between  contexts  in  the
instruction  cache  has significant performance impact.  This cache competition
appears difficult to avoid in a direct mapped  cache,  suggesting  that  higher
associativity may be required.  TLB designs that do not accommodate the demands
of large interactive systems may also become performance problems.
  X11 workloads, as compared to the SPECmarks [23] and other  more  traditional
workloads   for  behavioral  studies  of  memory  systems,  differ  in  several
fundamental ways:

   - Large program text.  Even the largest SPECmarks are small compared to
     X11.  At 688 Kbytes, gcc stands out among the SPECmarks for its large
     text segment.  X11 servers commonly have as much as 1.8 megabytes of
     text, more than twice that of gcc.  X11 clients also tend to have
     large code.  The two real-world X11 clients used in this study, gs
     and splot, have text sizes of 946 Kbytes and 278 Kbytes respectively.
     User text size for gs with the X11 server is over four times that of
     gcc.  (Text sizes are given for Ultrix DECstation executables.  gcc
     is as built from the SPECmark distribution.  The X11 server size is
     for /usr/bin/Xws on an Ultrix workstation, which includes a number of
     DEC extensions.  The tracing experiments used a smaller server, at
     958 Kbytes; see Section 3 for details.)

   - Three interacting  contexts.    Typically,  batch-oriented  workloads
     involve  two contexts: the user application and the kernel.  For many
     of  these  workloads  (scientific  workloads  are  the  most   common
     examples)  kernel  activity  is negligible.  In contrast, activity in
     most X11 workloads is split among three contexts: the client, the X11
     server, and the operating system, with significant activity occurring
     in all three contexts.  The result is additional resource competition
     that does not happen in the two-context and single-context case.

   - Mandatory and potentially frequent context switches.  When multi-task
     workloads are used in memory system studies, they are usually created
     by   taking  unrelated  batch-oriented  workloads  and  running  them
     simultaneously.  Context switches for these multi-task workloads  can
     often  be scheduled arbitrarily.  An intelligent scheduler may try to
     make  switches  infrequent,  as  a  strategy  for  minimizing   cache
     competition.     In  contrast,  scheduler  policy  is  irrelevant  in
     client-server systems.  Context switches are  largely  determined  by
     client  behavior  and  inter-process  communication  implementations.
     Depending on the client, context switches may be frequent.

   - A large community of users.  The X11  server  and  clients  are  used
     daily  and  repeatedly  by  a  large  contingent  of  the workstation
     computing community.  Many of these users make rare use  of  programs
     such as those in the SPECmarks.

  Performance for benchmarks is typically measured in terms of throughput, with
program execution times reduced to units  such  as  MIPS  or  MFLOPS.    A  key
distinction  between  interactive  workloads and more traditional benchmarks is
their sensitivity to latency, which is the time  required  for  the  system  to
respond  to  a given input event.  Analysis of memory system components such as
caches  and  write  buffers  is  common  practice  for  throughput   benchmarks
[7, 8, 10].  However, interactive programs and client-server systems have
received relatively little attention in recent research [3, 20].  This is
unfortunate  in  that,  for  many  computer  users,  quick  response  time  for
latency-critical interactive applications is more important than the throughput
of batch jobs.  Because of the size and complexity of server-based systems such
as X11, few detailed measurements of their behavior have been made.   We  think
the  problem  deserves  more  attention,  as  memory  system  delays can have a
significant impact on latency for interactive workloads.


1.1. Related Work
  This research focused primarily on measuring the behavior and performance  of
realistic  X11  client  workloads  from  the  perspective of the memory system.
Several prior studies measure X11 behavior, although they differ  substantially
in that they consider behavior at higher levels of abstraction.  Researchers at
the Microelectronics Computation and Technology Corporation built a tool called
XSCOPE  to  measure  X11  performance  and  localize performance problems [22].
XSCOPE provides information about X11 request, reply, error, and event packets.
Their experience in designing XSCOPE indicated some problems with the syntax of
the X11 protocol.
  Simple measures of performance, such as operations per second, are often used
when  characterizing  new  graphics hardware.  Researchers at DEC WRL have done
significant work in achieving good X11 performance, both with simple bit-mapped
framebuffers [16]  and  more  complicated hardware [17].  They also demonstrate
software algorithms that permit effective use of the hardware.   They  consider
memory  reference behavior, but strictly as related to frame buffer references;
application performance is beyond the scope of their work.  Researchers at Hewlett
Packard  used  a  technique  called  Direct  Hardware  Accesses  (DHA) in their
Starbase/X11 Merge system to enable high performance when Starbase applications
access the display [5, 6].
  Several  other  projects consider memory system performance in a more general
context, independent of X11 applications.  MTOOL [11] compares  execution  time
of  program  segments  to  predicted time for a perfect memory system.  A large
difference between the predicted and the measured  times  suggests  a  possible
memory  system  performance  problem.    MTOOL  has  been  applied primarily to
detecting memory bottlenecks in FORTRAN programs.  It is not appropriate for
measuring operating system behavior, and thus not for measuring X11 workloads.
MTOOL has also been adapted to work with shared-memory
multiprocessor  programs [12].    Another project, MemSpy [15], is based on the
Tango [9] simulation and tracing system.  To date, Tango is  designed  for  use
with parallel applications and multiprocessor systems, and has not been applied
to multiprogrammed uniprocessor workloads or measurements of  operating  system
activity.    The  measurements  for  this paper concern aggregate memory system
behavior.  In contrast, both MTOOL and MemSpy identify performance problems  as
specific segments of code (also data for MemSpy) within a workload.
  The  remainder  of  the paper is organized as follows.  The next two sections
describe the experiments, first giving details on the  tracing  and  simulation
systems, then a qualitative and quantitative characterization of the workloads.
Next, in the analysis section, we analyze memory delays for X11 workloads from
three  points  of  view:  memory penalties by subsystem, cache effects, and TLB
behavior.  The paper closes with a brief review of our major conclusions.

2. Tracing and Simulation
  The experiments for this study ran on a DECstation 5000/200, using an address
tracing system developed at Carnegie Mellon University and DEC WRL [4, 7].  The
tracing system uses object code rewriting [4, 24],  in  which  original  object
code  is augmented with instrumentation instructions such that an address trace
is generated as a side effect of program  execution.    Traces  are  accurately
interleaved  both  within a single context and across user and system contexts.
Traced addresses are corrected to reflect those of the  original  and  not  the
traced  instruction stream.  The Ultrix kernel, the X11 server, and X11 clients
were all instrumented and traced.
  The DECstation 5000/200 uses a 25 MHz MIPS R3000 CPU with MIPS R3010 floating
point  unit  and  MIPS  R3220  memory  buffer.    The  DS5000/200 uses a simple
eight-bit frame buffer interface, in which the frame buffer appears  as  memory
and is written directly by the processor.  No special-purpose graphics hardware
is used.  The X11 server runs as a user process.    Communication  between  X11
server and clients occurs through the socket interface provided by Ultrix.  The
frame buffer is mapped directly into the address space of the X11  server,  and
accesses  to  the frame buffer bypass the cache.  This could potentially induce
penalties for frame buffer reads.  Fortunately such reads are relatively  rare.
Frame  buffer  writes  pass  through  the  write-buffer so their performance is
unaffected.  A discussion of  effective  software  support  of  the  DECstation
5000/200 frame buffer can be found in [16].
  We  used  a  DECstation  5000/200 memory system simulator, along with several
other simple tools, to process trace generated by  the  test  workloads.    The
parameters  for  the  memory system simulation are given in Table 2-1.  We omit
several  million  instructions  from  the  start  and  end  of  each  simulator
experiment  to  eliminate  startup  and  shutdown  effects.  The tracing system
extracts page table information from the running kernel to provide  virtual  to
physical  page  mappings  in  the  simulator.   The Ultrix page allocation code
attempts to  assign  physical  pages  such  that  virtual  page  orderings  are
preserved in the physical cache.  Kernel text is not mapped.
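
  As a rough illustration of what such a simulator does with an address
trace, the following sketch models a direct-mapped cache with the
instruction-cache parameters of Table 2-1 (64 KB, 16 byte line, 15 cycle miss
penalty).  It is not the simulator used for these experiments; the class and
function names are hypothetical, and the write buffer, TLB, and data-cache
write policy are deliberately left out.

```python
# Sketch of a direct-mapped cache model using the instruction-cache
# parameters from Table 2-1. All names here are hypothetical; this is
# not the simulator used in the paper.

CACHE_BYTES = 64 * 1024   # 64 KB cache
LINE_BYTES = 16           # 16 byte line
MISS_PENALTY = 15         # stall cycles charged per miss


class DirectMappedCache:
    def __init__(self, cache_bytes=CACHE_BYTES, line_bytes=LINE_BYTES):
        self.line_bytes = line_bytes
        self.num_lines = cache_bytes // line_bytes
        self.tags = [None] * self.num_lines   # one tag per cache line
        self.misses = 0

    def access(self, paddr):
        """Reference one physical address; return True on a hit."""
        block = paddr // self.line_bytes      # memory block number
        index = block % self.num_lines        # the only line it may occupy
        tag = block // self.num_lines
        if self.tags[index] == tag:
            return True
        self.tags[index] = tag                # displace the previous occupant
        self.misses += 1
        return False


def stall_cycles(trace, cache=None):
    """Total miss-penalty cycles for a sequence of physical addresses."""
    if cache is None:
        cache = DirectMappedCache()
    for paddr in trace:
        cache.access(paddr)
    return cache.misses * MISS_PENALTY
```

  Two physical addresses that differ by a multiple of the cache size map to the
same line in this model, so alternating between them misses on every
reference.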

3. Workloads
  We  used  the  standard  X11R5  distribution  from  MIT  Athena (available by
anonymous ftp from export.lcs.mit.edu).  The PEX extension (PEX is the PHIGS
extension  to  X11,  used for three-dimensional graphics.) was omitted from the
X11 server.  Otherwise, we used the default server configuration.   The  Ultrix
system  was  version 4.2 revision 96.  Table 3-1 describes the X11 clients used
in this study.  All programs are written in C, and compiled with version 2.1 of
the DEC/MIPS C compiler.
  X11perf  is  a  client  in  the  X11R5 distribution.  It measures the time to
repeat a given server operation some number of times, and is commonly used as a
gauge  of  X11  server performance.  All the microbenchmarks for this paper are
runs of x11perf with different input  parameters.    Splot  is  a  program  for
generating plots for PostScript and X11.  It is available by anonymous ftp from

 ----------------------------------------------------------------------------
 |                                                                          |
 |       instruction cache: 64 KB, direct-mapped, physical, 16              |
 |       byte line, 15 cycle miss penalty.                                  |
 |                                                                          |
 |       data cache: 64 KB, direct-mapped, physical, 4 byte line,           |
 |       write allocate, 15 cycle read miss penalty, read miss fetches      |
 |       16 aligned bytes.                                                  |
 |                                                                          |
 |       write buffer: six entries, page-mode writes complete               |
 |       in one cycle, non page-mode writes complete in five cycles;        |
 |       CPU reads have priority for memory access, but wait for            |
 |       writes that have already started.  4 KB page size for              |
 |       page-mode writes.                                                  |
 |                                                                          |
 |       translation buffer: 64 entries, 56 random/8 wired                  |
 |       entries, trap to software on TLB miss.  Each TLB entry             |
 |       maps a 4 KB page.                                                  |
 |                                                                          |
 |       page mapping: Deterministic.  The physical page used               |
 |       to back a given virtual page is determined by the virtual          |
 |       page number and its address space identifier.  The                 |
 |       deterministic strategy prevents conflicts within any 64 KB         |
 |       (cache size) range of virtual addresses.                           |
 |                                                                          |
 |       kernel memory:  All kernel text and most kernel data               |
 |       is in unmapped, cached physical memory.                            |
 |                                                                          |
 ----------------------------------------------------------------------------

               Table 2-1:  Memory system simulation parameters.


        --------------------------------------------------------------
        |                 |                                |         |
        |    workload     |          Description           |  time   |
        |                 |                                |         |
        |----------------------------------------------------------  |
        |                 |                                |         |
        | micro           |                                |         |
        | benchmarks      |                                |         |
        |                 |                                |         |
        |----------------------------------------------------------  |
        |                 |                                |         |
        |   destroy       | window destruction, using      |   2.6   |
        |                 | x11perf -repeat 5 -reps 10 \   |         |
        |                 |  -subs 10 100 -destroy         |         |
        |                 |                                |         |
        |----------------------------------------------------------  |
        |                 |                                |         |
        |   resize        | window resize, using           |   2.5   |
        |                 | x11perf -repeat 2 -reps 5 \    |         |
        |                 |  -subs 10 100 -resize          |         |
        |                 |                                |         |
        |----------------------------------------------------------  |
        |                 |                                |         |
        |   circulate     | window circulate               |   2.8   |
        |                 | operations, using              |         |
        |                 | x11perf -repeat 2 -reps 5 \    |         |
        |                 |  -subs 10 100 -circulate       |         |
        |                 |                                |         |
        |----------------------------------------------------------  |
        |                 |                                |         |
        |   ftext         | text painting, using           |   2.4   |
        |                 | x11perf -repeat 5 -reps \      |         |
        |                 |  500 -ftext                    |         |
        |                 |                                |         |
        |----------------------------------------------------------  |
        |                 |                                |         |
        |   copy          | bitmap copy, using             | 11.4    |
        |                 | x11perf -repeat 5 -reps \      |         |
        |                 |  250 -copywinwin100            |         |
        |                 |                                |         |
        |----------------------------------------------------------  |
        |                 |                                |         |
        |   scroll        | window scrolling, using        | 23.3    |
        |                 | x11perf -repeat 2 -reps \      |         |
        |                 |  250 -scroll500                |         |
        |                 |                                |         |
        |----------------------------------------------------------  |
        |                 |                                |         |
        | X11 clients     |                                |         |
        |                 |                                |         |
        |----------------------------------------------------------  |
        |                 |                                |         |
        |   splot         | Splot is run four times        | 12.4    |
        |                 | on four different input        |         |
        |                 | files.  Total size of splot    |         |
        |                 | input is 94 Kbytes.            |         |
        |                 |                                |         |
        |----------------------------------------------------------  |
        |                 |                                |         |
        |   gs            | Ghostscript is used to         | 25.9    |
        |                 | preview a twenty page          |         |
        |                 | conference paper.  Input       |         |
        |                 | file size is 251 Kbytes.       |         |
        |                 |                                |         |
        |----------------------------------------------------------  |
        |                 |                                |         |
        | Other workloads |                                |         |
        |                 |                                |         |
        |----------------------------------------------------------  |
        |                 |                                |         |
        |   gcc           | The GNU C compiler converts a  |   3.7   |
        |                 | 17K (preprocessed) source file |         |
        |                 | into optimized Sun-3 assembly  |         |
        |                 | code.  Not an X11 client.      |         |
        |                 |                                |         |
        |----------------------------------------------------------  |
        |                 |                                |         |
        |   compress      | Data compression using         |   1.3   |
        |                 | Lempel-Ziv encoding.  A 100K   |         |
        |                 | file is compressed then        |         |
        |                 | uncompressed.                  |         |
        |                 |                                |         |
        --------------------------------------------------------------

          Table 3-1:  Experimental workloads.  Times are in seconds.


labrea.stanford.edu.  Ghostscript, by Aladdin Enterprises, is an X11  previewer
for Adobe Systems' PostScript language.  It is distributed under the GNU General
Public License and is available by anonymous ftp from athena-dist.mit.edu.
Version 2.6.1 of gs was used.


               Table 3-2:  Instruction and Data Reference Counts

      This  table  shows  event  counts  for  each workload, along with the
    percentage contribution from the system (%sys), X11 server  (%Xs),  and
    X11  client (%Xc).  The first column for each workload shows the number
    of idle instructions executed during that workload.  All  other  counts
    and percentages are for non-idle events.  All counts are in thousands.


  Two  workloads  which are not X11 clients, gcc and compress, are included for
comparison between X11 clients and  other  workloads.    Among  common  integer
benchmarks,  compress  presents  an  above average demand on the memory system,
while gcc's demands are extreme.  More details on the memory system behavior of
these workloads can be found in [7].
  Table  3-2  gives  reference  counts for the experimental workloads.  The low
percentage of kernel instructions for  the  microbenchmarks  demonstrates  that
many types of X11 operations require relatively little kernel activity.  Higher
levels of kernel activity in splot and gs are attributed (at  least  partially)
to the file I/O that these workloads require.

4. Experiments and Analysis


4.1. Memory Cycles Per Instruction
  We use data from our simulator to calculate memory cycles per instruction
(MCPI), which is the total number of CPU stall cycles due to the memory system
divided  by  the total number of instructions executed.  MCPI is one of several
components of cycles per instruction (CPI), a metric commonly used to  evaluate
computer  systems [14].  Other components of CPI that are not reflected in MCPI
include one cycle per instruction for instruction execution by  the  processor,
cycles  during interlocked multiply, divide, and floating point operations, and
no-ops inserted by the  compiler  for  load  and  branch  delays.    The  other
components of CPI remain relatively constant as processor cycle time decreases.
In contrast, MCPI is a function of the  ratio  of  memory  speed  to  processor
speed,  and  is  less  dependent on processor architecture.  MCPI will dominate
overall CPI if current trends in processor and memory speed continue.    MCPI+1
is  a  lower  bound  and  a good estimate of overall CPI for workloads (such as
operating systems) in which multiplies, divides and  floating-point  operations
are rare.
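
  The definition above reduces to simple arithmetic.  The sketch below computes
MCPI, the MCPI+1 lower bound on CPI, and a per-component breakdown of the kind
plotted in the figures.  The function names and the example counts are
hypothetical and are not measurements from this study.

```python
# Sketch of the MCPI arithmetic. Function names and example counts are
# hypothetical; the counts are not measurements from the paper.

def mcpi(stall_cycles, instructions):
    """Memory cycles per instruction: memory stall cycles / instructions."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return stall_cycles / instructions


def cpi_lower_bound(stall_cycles, instructions):
    """MCPI + 1: one execution cycle per instruction plus memory stalls."""
    return 1.0 + mcpi(stall_cycles, instructions)


def mcpi_by_component(component_stalls, instructions):
    """Split MCPI into per-component contributions (i-cache, d-cache, ...)."""
    return {name: mcpi(cycles, instructions)
            for name, cycles in component_stalls.items()}


# Made-up stall counts, chosen only to illustrate the calculation.
example_stalls = {"icache": 600_000, "dcache": 300_000, "wbuffer": 100_000}
example_instructions = 2_000_000
total_stalls = sum(example_stalls.values())

print(mcpi(total_stalls, example_instructions))             # 0.5
print(cpi_lower_bound(total_stalls, example_instructions))  # 1.5
```

  Because the components are each divided by the same instruction count, their
MCPI contributions sum to the total MCPI.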
  Figure  4-1 illustrates MCPI for the experimental workloads.  For comparison,
MCPI measurements for gcc and compress are also shown.  Each bar is  shaded  to
denote  different  MCPI  components.  For  visual  clarity, the system and user
contributions are separated by a vertical bar.    Data  and  instruction  cache
misses  in  user  and  system mode are only partially responsible for the total
MCPI.  Other components include CPU write-stalls  and  uncached  memory  reads.
CPU write-stalls are reflected in the wbuffer component, which shows the average
per-instruction penalty from writes to a full write-buffer,  and  reads  during
the  completion  of a five-cycle write.  TLB misses are not included explicitly
in the baseline MCPI.  Their cost appears as additional  instructions  executed
and additional data references, included in the total counts.
  As  can  be seen from Table 3-2, operating system overhead is low for the six
microbenchmarks.  This is reflected in Figure 4-1 as low system MCPI.  With the
X11  server  accessing  the  frame  buffer directly, kernel activity during the
microbenchmarks is dominated by TLB faults and socket  communication,  both  of
which  are  relatively  inexpensive  as  compared  to  the activity of disk I/O
intensive workloads.  In contrast, user-level MCPI varies significantly  across
the  microbenchmarks,  and is dependent on server activity.  The worst behavior
by far occurs for copy and scroll, which incur substantial write-buffer and
uncached-read penalties due to a high density of frame-buffer references.


                          Figure 4-1:  Baseline MCPI.

      Each  horizontal  bar  represents total MCPI for a given experimental
    workload, broken down between system/user and various components of the
    memory  system.    Contribution  from  system  and  user  activity  are
    separated by a vertical line.  User activity includes  X11  server  and
    X11  client.   The number at the right of each bar is the MCPI for that
    workload.  Startup and  shutdown  effects  were  excluded  by  omitting
    several   million  instructions  at  the  beginning  and  end  of  each
    simulation experiment.  MCPI for gcc  and  compress  are  included  for
    comparison with workloads that are not X11 clients.

  Comparing  system  behavior  for the microbenchmarks to that of splot and gs,
there is substantial additional system overhead for the  realistic  benchmarks.
This  reflects greater variation in system activity due to the addition of file
I/O, and is consistent with behavior  observed  for  system-intensive  and  I/O
intensive applications such as gcc.
  Turning  to  user-level overhead, Figure 4-1 shows that many X11 clients have
significant user instruction cache MCPI contributions,  sometimes  higher  than
system i-cache MCPI contributions.  This is unusual for integer workloads (gcc
is exceptional in this respect) [7].  In the next section we discuss how
competition  within  and between address spaces contributes to poor instruction
cache behavior, for both system and user.
  Penalties from the write-buffer and uncached memory reads appear  problematic
in  microbenchmarks  but  aren't  significant in more realistic workloads.  For
frame-buffer  writes,  the  X11  server  benefits  from  the   combination   of
write-buffer  and  writes  through  the uncached segment.  Together they permit
frame buffer writes to proceed at top speed without disturbing the contents  of
the data cache.


4.2. Cache Effects
  We consider two types of cache misses:

   - Inter-Context  Competition  occurs  when  references from two or more
     active  address  spaces   displace   each   other   in   the   cache.
     Client-server  systems  such  as  X11  introduce  a user level server
     context, in addition to the  application  and  system  context  of  a
     workload  such  as  gcc.   The additional server context could induce
     more competition in the cache.

   - Self-Interference misses occur when two active  instructions  in  the
     same  address  space  collide  in the cache.  The Ultrix page-mapping
     algorithm ensures that self-interference misses will not occur within
     an  address space if a program's text is smaller than the cache size.
     Many of the SPECmarks have text smaller than the cache, but the  text
     of  X11  server,  gs  and splot are all much larger.  With localities
     spread throughout  large  text,  self-interference  misses  are  more
     likely to occur.

We discuss inter-context competition first.


4.2.1. Inter-Context Competition
  An  X11  workload  is composed of three contexts: X11 client, X11 server, and
the operating system.   Inter-context  competition  occurs  when  two  or  more
contexts  compete  for memory resources (such as cache lines), causing degraded
performance.  To simulate a system without competition, we ran experiments with
memory reference trace from only one source (e.g., using trace from the kernel
only), then summed events over all three runs.  The sum represents the behavior
of a hypothetical machine where each context has its own private memory system,
hence a system where all competition has been eliminated.  Figure 4-2  compares
MCPI for several workloads with and without competition.
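
  The method can be sketched as follows, assuming a trace of (context,
physical address) pairs and a direct-mapped cache with the Table 2-1
instruction-cache geometry.  The function names and the toy trace are
hypothetical, and the sketch counts misses only rather than full MCPI.

```python
# Sketch of the with/without-competition comparison. Names and the toy
# trace are hypothetical; only cache misses are counted.
from collections import defaultdict


def count_misses(addresses, cache_bytes=64 * 1024, line_bytes=16):
    """Misses for a direct-mapped cache over a sequence of addresses."""
    num_lines = cache_bytes // line_bytes
    tags = {}                      # cache line index -> resident tag
    misses = 0
    for addr in addresses:
        block = addr // line_bytes
        index, tag = block % num_lines, block // num_lines
        if tags.get(index) != tag:
            tags[index] = tag
            misses += 1
    return misses


def misses_with_competition(trace):
    """Base system: all contexts share one cache, in trace order."""
    return count_misses(addr for _context, addr in trace)


def misses_without_competition(trace):
    """One private cache per context; sum the misses over the runs."""
    per_context = defaultdict(list)
    for context, addr in trace:
        per_context[context].append(addr)
    return sum(count_misses(addrs) for addrs in per_context.values())


# Toy trace: a kernel routine and a server routine whose physical
# addresses are exactly one cache size apart, referenced alternately.
toy_trace = [("kernel", 0x00000), ("server", 0x10000)] * 4
print(misses_with_competition(toy_trace))     # 8
print(misses_without_competition(toy_trace))  # 2
```

  In the toy trace every reference misses in the shared cache, while each
private cache takes only its initial cold miss.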


         Figure 4-2:  MCPI with and without inter-context competition.

      This  figure  shows the effect of inter-context competition on memory
    system performance.  Each horizontal bar represents total  MCPI  for  a
    given  experimental  run,  as in Figure 4-1.  We show two bars for each
    workload, the upper corresponding to the base  system,  and  the  lower
    representing  a  system in which competition between address spaces has
    been eliminated by giving each address space a private  memory  system.
    This  figure  shows  that inter-context competition has major impact on
    instruction cache behavior for realistic X11 workloads.

  Figure 4-2 shows that, for realistic X11 workloads, inter-context competition
has  a  large  impact  on  overall MCPI.  For both splot and gs, competition is
responsible for a large proportion of system  i-cache  misses.    Additionally,
eliminating  competition  from splot causes a drastic reduction in user i-cache
miss rates.  The instruction cache requirements of the destroy microbenchmark are
modest,  so  eliminating  competition  has  little effect.  Similarly, compress
shows little benefit when competition is eliminated.
  We compared missing instruction addresses in the  X11  server  to  counts  of
kernel  instruction  references  for  a  run  of  splot  to  identify  probable
inter-context conflicts.  An example of such a conflict was between the  Ultrix
general  exception  handler,  exception(), and the X11 server memory allocation
routine, malloc().  Both routines are called frequently,  and  are  located  in
such  a  way that they overlap on a 4 Kbyte memory page.  Kernel text pages are
not mapped, so exception() will always be located in  the  same  place  in  the
cache.    User  text  is  mapped, so the location of malloc() in the cache will
depend on  the  virtual-to-physical  page  assignment.    In  Ultrix  the  page
assignment  is  a  function  of  virtual page number and process ID.  Given the
page-mapping  policy  and  cache  parameters  for  this  memory   system,   the
probability  that  the  above conflict will occur is 1/16.  Many such conflicts
were identified for the splot run.  The large working sets  of  X11  workloads,
combined with frequent mandatory context switches, make competition misses
more likely.
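
  One plausible reading of the 1/16 figure is that a 64 KB direct-mapped cache
holds sixteen 4 KB page-sized regions, so a user page placed uniformly at
random among them lands on the region holding a given kernel page one time in
sixteen.  The uniform-placement assumption and the function names below are
ours, not taken from the Ultrix sources.

```python
# Sketch of the 1/16 conflict probability, assuming the user page is
# equally likely to land in any page-sized region of the cache.
# Function names are hypothetical.
from fractions import Fraction

CACHE_BYTES = 64 * 1024   # instruction cache size from Table 2-1
PAGE_BYTES = 4 * 1024     # page size from Table 2-1


def page_slots(cache_bytes=CACHE_BYTES, page_bytes=PAGE_BYTES):
    """Number of page-sized regions in a direct-mapped cache."""
    return cache_bytes // page_bytes


def conflict_probability(cache_bytes=CACHE_BYTES, page_bytes=PAGE_BYTES):
    """Chance that a uniformly placed page overlaps one fixed page."""
    return Fraction(1, page_slots(cache_bytes, page_bytes))


print(page_slots())            # 16
print(conflict_probability())  # 1/16
```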
  These  results  confirm  earlier  research  on   cache   competition,   which
demonstrated  significant  penalties  for  warming up the cache after a context
switch [19].    The  earlier  study  found  competition  to  be  important   in
multitasking  and compute-bound workloads, but a non-issue in the client-server
workload they tested.  However,  the  earlier  study  did  not  include  system
behavior,   and  it  used  a  synthetic  client-server  workload  dominated  by
communication, a workload more similar to the  microbenchmarks  used  for  this
study  than  the  realistic workloads.  Our work complements the prior study by
including system effects in the memory reference stream, and  by  demonstrating
that   inter-context   competition  can  also  have  impact  for  client-server
workloads.


4.2.2. Self-Interference misses
  We  measured  the  impact  of  self-interference  misses  by  replacing   the
direct-mapped  caches  in  the simulator with two-way set associative caches of
the same size.  Figure 4-3 illustrates the combined effects of competition  and
associativity   on   MCPI  for  splot  and  gs.    To  isolate  the  impact  of
self-interference misses from that of competition misses,  compare  runs  where
all  competition  misses  are eliminated (bench-nocomp and bench-a+nocomp).  In
this comparison, all misses eliminated  by  associativity  are  from  conflicts
within  an  address  space.   For both gs and splot, associativity eliminates a
significant number of self-interference misses.
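
  The mechanism can be seen in a toy simulation.  The sketch below is not the
simulator used for these measurements.  It is a minimal LRU set-associative
cache model, and the cache size, line size, and two-address trace are
assumptions chosen so that two hypothetical routines in one address space map
to the same cache location.

```python
from collections import OrderedDict


def count_misses(addresses, cache_bytes, line_bytes, ways):
    """Count misses in a set-associative cache with LRU replacement.

    ways=1 models a direct-mapped cache.
    """
    num_sets = cache_bytes // (line_bytes * ways)
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        index = line % num_sets
        tag = line // num_sets
        lines = sets[index]
        if tag in lines:
            lines.move_to_end(tag)         # mark as most recently used
        else:
            misses += 1
            if len(lines) >= ways:
                lines.popitem(last=False)  # evict least recently used
            lines[tag] = True
    return misses


CACHE = 64 * 1024   # ASSUMPTION: 64 Kbyte cache
LINE = 16           # ASSUMPTION: 16-byte lines

# Two hypothetical routines exactly one cache size apart, called alternately.
trace = [0x1000, 0x1000 + CACHE] * 100

direct_mapped = count_misses(trace, CACHE, LINE, ways=1)
two_way = count_misses(trace, CACHE, LINE, ways=2)

if __name__ == "__main__":
    print(direct_mapped, two_way)
```

  In the direct-mapped case every reference evicts the other routine's line, so
all 200 references miss.  With two ways both lines stay resident and only the
two initial references miss.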


       Figure 4-3:  Inter-Context Competition, Associativity, and MCPI.

      This figure shows the effect of competition and self-interference
    misses on memory system performance.  Each horizontal bar represents
    total MCPI for a given experimental run, as in Figure 4-1.  We show four
    bars for each workload: the base system (bench-base), without competition
    (bench-nocomp),   with   associativity    and    without    competition
    (bench-a+nocomp),  and  with  associativity (bench-assoc).  This figure
    shows  that  both  competition  and   self-interference   misses   have
    significant   impact  on  memory  system  behavior  for  realistic  X11
    workloads, particularly in the instruction cache.

  Figure  4-3  also  shows  the  effect  of  associativity   on   inter-context
competition.    The two-way set associative caches eliminate some inter-context
competition, but not all (compare bench-assoc and bench-a+nocomp).


4.2.3. Summary
  For  the  two  realistic  X11  workloads  we  consider,  both   inter-context
competition  and  self-interference  misses  have  significant impact on memory
system behavior, with the most significant effects occurring in the instruction
cache.   Strategies have been described [18] for avoiding text conflicts within
an address space, but it is difficult to envision a practical  software  system
to  avoid competition between address spaces.  The problem could potentially be
addressed in hardware with cache associativity, although  this  could  have  an
impact  on machine cycle time.  Many current systems rely on good luck to avoid
inter-context competition.  As the gap between CPU and memory speed grows,  and
as  users  demand  improved performance for interactive and multi-address space
systems,  more  aggressive  cache  designs  may  be  required.    The  lack  of
client/server  workloads  in  standard  benchmark  suites  could  lead hardware
developers to believe that inter-context  competition  is  a  non-issue.    Our
measurements suggest it deserves more attention.


4.3. TLB Behavior
  In  this section we consider the TLB behavior of the realistic X11 workloads.
Table 4-1 shows TLB miss data for splot,  gs,  and  gcc.    Compared  to  other
integer  workloads, X11 applications have poor TLB behavior.  Both splot and gs
show significantly higher miss rates than gcc, which  is  relatively  demanding
among  integer workloads [7].  Three phenomena contribute to increased TLB miss
rates:

   - The X11 server needs over 200 page mappings  to  address  the  entire
     frame  buffer.    Any operation that paints a significant part of the
     screen will tend to flush the TLB.

   - The X11 server, gs and splot all have relatively large program  text.
     This  increases  the likelihood that localities will be spread across
     multiple text pages, which  in  turn  increases  the  demand  on  TLB
     resources.

   - X11 applications involve two interacting user contexts, as opposed to
     one context for gcc.  Multiple contexts mean more  fragmentation  and
     increased competition for limited TLB resources.
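
  The first point can be checked with simple arithmetic.  The sketch below
assumes a 1024x864 display at 8 bits per pixel and a 64-entry TLB.  Neither
figure is stated in this section, so both should be read as assumptions.  The
4 Kbyte page size is taken from the text.

```python
PAGE_SIZE = 4 * 1024   # 4 Kbyte pages, as stated in the text
TLB_ENTRIES = 64       # ASSUMPTION: 64-entry fully associative TLB


def pages_to_map(width, height, bytes_per_pixel, page_size=PAGE_SIZE):
    """Pages needed to map a frame buffer, rounded up to whole pages."""
    total_bytes = width * height * bytes_per_pixel
    return -(-total_bytes // page_size)   # integer ceiling division


# ASSUMPTION: 1024x864 display, one byte per pixel.
fb_pages = pages_to_map(1024, 864, 1)

if __name__ == "__main__":
    print(fb_pages, fb_pages > TLB_ENTRIES)
```

  Under these assumptions the frame buffer spans 216 pages, consistent with
the "over 200" figure and several times the assumed TLB capacity, so a
full-screen paint would cycle through every TLB entry more than once.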


                 Table 4-1:  TLB Misses per 1000 instructions.

      System TLB misses include misses to both user and system segments.

  During the run of splot, 280,000 user TLB misses occurred.  There were about
680,000 during gs.  Estimating 20 cycles to service a TLB miss [21], the penalty
for  TLB  faults is less than 0.04 CPI for both workloads.  Thus, the impact on
overall performance is not significant for the memory system we simulated.
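
  The arithmetic behind this estimate is sketched below.  The miss counts, the
20-cycle penalty, and the 0.04 CPI bound come from the text.  The instruction
counts for the runs are not given in this section, so the sketch only derives
the minimum trace length that the stated bound implies.  The function names are
hypothetical.

```python
MISS_PENALTY_CYCLES = 20   # per-miss estimate used in the text
CPI_BOUND = 0.04           # upper bound on TLB CPI stated in the text


def tlb_cpi(misses, instructions, penalty=MISS_PENALTY_CYCLES):
    """TLB stall cycles per instruction."""
    return misses * penalty / instructions


def min_instructions(misses, cpi_bound=CPI_BOUND,
                     penalty=MISS_PENALTY_CYCLES):
    """Smallest instruction count for which the CPI bound holds."""
    return misses * penalty / cpi_bound


splot_stall_cycles = 280_000 * MISS_PENALTY_CYCLES
gs_stall_cycles = 680_000 * MISS_PENALTY_CYCLES

if __name__ == "__main__":
    print(splot_stall_cycles, min_instructions(280_000))
    print(gs_stall_cycles, min_instructions(680_000))
```

  The stall totals are 5.6 million cycles for splot and 13.6 million for gs.
For the bound to hold, the runs must contain at least roughly 140 million and
340 million instructions respectively.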
  The impact of the TLB on overall performance is dependent on the  performance
balance of memory system components.  Processors such as the DEC Alpha 21064
[2] rely on fewer TLB entries with larger and oversized pages to achieve good
TLB behavior.  If software systems such as X11 don't make good use of these new
features, miss rates will go up, increasing the impact of the  TLB  on  overall
performance.    Also,  the  penalty  for  a single TLB miss could increase.  An
earlier study used 100 cycles as an estimate of the  TLB  miss  penalty  for  a
futuristic  machine [8].  As the balance of TLB to cache resources changes, TLB
performance could become an important issue.  X11 workloads  require  more  TLB
resources  than  popular  benchmarks  such  as  gcc.    If computer systems are
designed to optimize the performance of the popular benchmarks, systems such as
X11 can be expected to suffer.
  Our  measurements show degraded TLB behavior for X11 workloads as compared to
other integer codes.  Researchers at the University of  Michigan [21]  measured
similar  TLB behavior for the Mach 3.0 operating system [1, 13], another system
where a user-level server contributes significantly to overall activity.    The
two  independent  studies  suggest a broader conclusion, that page behavior for
user-level client/server systems induces substantially elevated TLB miss rates.
  As a final note, competition also affects TLB behavior.  In our measurements,
competition  accounted  for 60% of user TLB misses in splot and 30% of user TLB
misses in gs.  Note that additional  associativity  can't  help  here,  as  the
DECstation 5000/200 TLB is already fully associative.

5. Conclusions
  Memory  system  behavior  for  X11  workloads  differs significantly from the
batch-mode programs typically used  in  memory  system  studies.    With  large
program  text,  a  large  mapped frame buffer, and multiple competing contexts,
they can present a far greater load on instruction caches and TLBs than typical
throughput benchmarks.
  Cache associativity and TLB size are sensitive issues for hardware designers.
For many machines, increasing these parameters  would  have  direct  impact  on
machine cycle time, the principal metric driving performance improvements in
microprocessors.  As machine cycle time dominates performance for many  current
benchmarks,   there  is  a  potential  conflict  between  high  throughput  for
benchmarks,  and  low  latency  for  large  client/server  systems.     Optimal
throughput and optimal latency may not be possible in the same machine.
  Our measurements are specific to X11 clients and server, but similar behavior
can be expected in other client/server systems.    Instruction  cache  and  TLB
behavior  is  aggravated  by  large  user  text, frequent and mandatory context
switches, and multiple active contexts.  Any system with these characteristics
will  probably  have  similar  behavior.  In systems with multiple heavy-weight
servers, contention for memory system resources will be even more intense.

6. Acknowledgements
  Thanks to Norm Jouppi who first suggested looking at X11.  Brian Bershad  and
Doug Tygar made useful comments on the text.  Joel McCormack answered questions
about X11.  David Wall provided the initial version of epoxie.  The  design  of
the  tracing system is based on prior work by Anita Borg, and her contributions
continued  throughout  this  project.    Thanks  also  to   Digital   Equipment
Corporation for their generous support.

References

1.  Michael J. Accetta, Robert V. Baron, William Bolosky, David B. Golub,
Richard F. Rashid, Avadis Tevanian, Jr., and Michael W. Young.  Mach: A New
Kernel Foundation for Unix Development.  Proceedings of the Summer 1986 USENIX
Conference, July, 1986, pp. 93-113.

2.  Digital Equipment Corporation.  Digital's 21064 Microprocessor.  Data
sheet.

3.  Brian N. Bershad, Richard P. Draves, and Alessandro Forin.  Using
Microbenchmarks to Evaluate System Performance.  The Proceedings of the Third
Workshop on Workstation Operating Systems, April, 1992, pp. 148-153.

4.  Anita Borg, R.E. Kessler, Georgia Lazana, and David Wall.  Long Address
Traces from RISC Machines: Generation and Analysis.  WRL Research Report 89/14,
Digital Equipment Corporation Western Research Laboratory, 1989.

5.  J.R. Boyton, S.L. Chakrabarti, S.P. Hiebert, J.J. Lang, J.R. Owen, K.A.
Marchington, P.R. Robinson, M.H. Stroyan, J.A. Waitz.  "Sharing access to
display resources in the Starbase/X11 Merge".  Hewlett Packard Journal 40, 6
(December 1989), 20-32.

6.  Kenneth H. Bronstein, David J. Sweetser, and William R. Yoder.  "System
design for compatibility of a high-performance graphics library and the X
Window System".  Hewlett Packard Journal 40, 6 (December 1989), 6-10.

7.  J. Bradley Chen and Brian N. Bershad.  The Impact of Operating System
Structure on Memory System Performance.  Proceedings of the 14th ACM Symposium
on Operating System Principles, December, 1993.

8.  J. Bradley Chen, Anita Borg, and Norman P. Jouppi.  A Simulation Based
Study of TLB Performance.  The Proceedings of the 19th Annual International
Symposium on Computer Architecture, May, 1992, pp. 114-123.

9.  H. Davis, S.R. Goldschmidt, and J. Hennessy.  Tango: A Multiprocessor
Simulation and Tracing System.  Proceedings of the International Conference on
Parallel Processing, August, 1991, pp. 99-107.

10.  Jeffrey D. Gee, Mark D. Hill, Dionisios N. Pnevmatikatos, and Alan Jay
Smith.  Cache Performance of the SPEC Benchmark Suite.  University of
Wisconsin-Madison, 1991.

11.  Aaron Goldberg and John Hennessy.  MTOOL: A Method for Detecting Memory
Bottlenecks.  WRL Technical Note TN-17, Digital Equipment Corporation Western
Research Laboratory, 1990.

12.  Aaron J. Goldberg and John L. Hennessy.  "MTOOL: An Integrated System for
Performance Debugging Shared Memory Multiprocessor Applications".  IEEE
Transactions on Parallel Processing 4, 1 (January 1993), 28-40.

13.  David Golub, Randall Dean, Alessandro Forin and Richard Rashid.  UNIX as
an Application Program.  Proceedings of the Summer 1990 USENIX Conference,
June, 1990, pp. 87-95.

14.  John L. Hennessy and David A. Patterson.  Computer Architecture: A
Quantitative Approach.  Morgan Kaufmann, Palo Alto, CA, 1990.

15.  Margaret Martonosi, Anoop Gupta, and Thomas Anderson.  MemSpy, Analyzing
Memory System Bottlenecks in Programs.  Proceedings of the 1992 ACM SIGMETRICS
Conference on Measurement and Modeling of Computer Systems, June, 1992, pp.
1-12.

16.  Joel McCormack.  Writing Fast X Servers for Dumb Color Frame Buffers.  WRL
Research Report 91/1, Digital Equipment Corporation Western Research
Laboratory, 1991.

17.  Joel McCormack and Bob McNamara.  A Smart Frame Buffer.  WRL Research
Report 93/1, Digital Equipment Corporation Western Research Laboratory, 1993.

18.  Scott McFarling.  Program Optimization for Instruction Caches.  The
Proceedings of the Third International Conference on Architectural Support for
Programming Languages and Operating Systems, April, 1989, pp. 183-191.

19.  Jeffrey C. Mogul and Anita Borg.  The Effect of Context Switches on Cache
Performance.  The Proceedings of the Fourth International Conference on
Architectural Support for Programming Languages and Operating Systems, April,
1991, pp. 75-84.

20.  Jeffrey C. Mogul.  SPECmarks are leading us astray.  The Third Workshop on
Workstation Operating Systems, April, 1992, pp. 160-161.

21.  David Nagle, Richard Uhlig, Tim Stanley, Stuart Sechrest, Trevor Mudge and
Richard Brown.  Design Tradeoffs for Software-Managed TLBs.  Proceedings of the
20th Annual International Symposium on Computer Architecture, May, 1993, pp.
27-38.

22.  J.L. Peterson.  XSCOPE: A Debugging and Performance Tool for X11.
Proceedings of the IFIP 11th World Computer Congress, September, 1989, pp.
49-54.

23.  SPEC Benchmark Suite Release 1.0.  System Performance Evaluation
Cooperative, 1989.

24.  David W. Wall.  Systems for Late Code Modification.  In Code Generation
--- Concepts, Tools, Techniques, Springer-Verlag, 1992, pp. 275-293.


J.  Bradley  Chen is presently completing a PhD in Computer Science at Carnegie
Mellon University.  His interests include operating  systems,  memory  systems,
and  issues  relating  to the design and performance of large software systems.
He received BS and MS degrees in 1987 from Stanford University.