Software Prefetching and Caching
                          for
               Translation Lookaside Buffers


Kavita Bala* M. Frans Kaashoek* William E.  Weihl*
        MIT Laboratory for Computer Science
              Cambridge, MA 02139, USA

[Footnote:E-mail: {kaybee, kaashoek, weihl}@lcs.mit.edu. World Wide
Web URL: https://www.psg.lcs.mit.edu/. Prof. Weihl is currently
supported by DEC while on sabbatical at DEC SRC. This research was
supported in part by the Advanced Research Projects Agency under
Contract N00014-94-1-0985, by grants from AT&T and IBM, and by an
equipment grant from DEC. The views and conclusions contained in this
document are those of the authors and should not be interpreted as
representing the official policies, either expressed or implied, of
the U.S. government.]


                           Abstract

     A number of interacting trends in operating system structure,
processor architecture, and memory systems are increasing both the
rate of translation lookaside buffer (TLB) misses and the cost of
servicing a miss.  This paper presents two novel software schemes,
implemented under Mach 3.0, to decrease both the number and the cost
of kernel TLB misses (i.e., misses on kernel data structures,
including user page tables). The first scheme is a new use of
prefetching for TLB entries on the IPC path, and the second scheme is
a new use of software caching of TLB entries for hierarchical page
table organizations.

     For a range of applications, prefetching decreases the number of
kernel TLB misses by 40% to 50%, and caching decreases TLB penalties
by providing a fast path for over 90% of the misses. Our caching
scheme also decreases the number of nested TLB traps due to the page
table hierarchy, reducing the number of kernel TLB miss traps for
applications by 20% to 40%. Prefetching and caching, when used alone,
each improve application performance by up to 3.5%; when used
together, they improve application performance by up to 3%. On
synthetic benchmarks that involve frequent communication among several
different address spaces (and thus put more pressure on the TLB),
prefetching improves overall performance by about 6%, caching improves
overall performance by about 10%, and the two used together improve
overall performance by about 12%.

     Our techniques are very effective in reducing kernel TLB
penalties, which currently range from 1% to 5% of application runtime
for the benchmarks studied. Since processor speeds continue to
increase relative to memory speeds, our schemes should be even more
effective in improving application performance in future
architectures.


1 Introduction


     A number of interacting trends in operating system structure,
processor architecture, and memory systems are increasing both the
rate of translation lookaside buffer (TLB) misses and the relative
cost of servicing a miss.  Microkernel-based operating systems achieve
modularity and flexibility by providing OS functionality in user-level
server processes. Frequent communication between client processes, the
kernel, and server processes results in more TLB misses than in
monolithic- kernel systems [17]. Furthermore, many microkernel-based
systems use virtual memory rather than physical memory for most OS
data structures. This increases the number of pages in the active
working set that require TLB entries.
     Several modern RISC architectures (e.g., the MIPS R2000/3000, the
DEC Alpha, and the HP PA-RISC) handle TLB misses in software.  This
simplifies the hardware considerably, and provides greater flexibility
to the operating system.  However, the penalty for a TLB miss
increases.  This penalty becomes even more significant as CPU speeds
increase relative to memory speeds, since the cost of accessing page
table entries is growing [18].
     As researchers continue to improve inter-process communication
(IPC) performance [4,15], TLB penalties will become an increasing
fraction of the IPC cost [3,11].  In addition, recent commercial
standards, such as OLE [9] and OpenDoc [21], place an increasing
emphasis on inter-application communication. The net effect of all
these factors is that the impact of TLB misses on overall system
performance is increasing [1,7].
     While TLB penalties have been recognized as a problem, proposed
solutions typically require expensive hardware [17,16]. We present two
schemes to address this problem that rely only on software mechanisms;
no additional hardware is required. Our techniques reduce both the
number and cost of TLB misses.
     The first technique involves prefetching TLB entries during
IPC. This well-known technique has not been applied before in the
context of TLB management.  After an IPC, many page table accesses
that cause TLB misses are highly predictable. By prefetching these
entries in the kernel during the IPC, many misses can be avoided.
     The second technique introduces a software cache for the TLB,
called the software TLB (STLB). On several architectures, handling
some types of TLB traps is quite expensive (on the order of several
hundred cycles).  By introducing a large software cache for TLB
entries and checking it early in the TLB miss trap handler, this
approach reduces the time for servicing expensive misses. In addition,
in systems with hierarchical page table organizations, page tables
themselves are mapped.  A hit in the STLB avoids further references to
the page tables.  Preventing cascaded misses reduces the total number
of TLB misses significantly. To the best of our knowledge, this paper
is the first to present a technique to reduce nested TLB traps in
hierarchical page table organizations.  A scheme similar to the STLB
was simulated with a different organization of the hardware TLB and
the page tables [12], and is discussed with related work in Section 7.
     We have implemented our prefetching and caching techniques in the
Mach 3.0 kernel running on a MIPS R3000-based DECstation 5000/240.
Prefetching TLB entries benefited the entire range of applications we
examined, decreasing the number of kernel TLB misses (i.e., misses on
kernel data structures, user page tables, and kernel page tables) by
40% to 50%.  This improves overall application performance by up to
3.5%. The penalties of kernel TLB misses for these applications range
from 1% to 5% [16]; benchmarks in other studies have higher TLB
penalties [12].  Thus, prefetching eliminates a significant fraction
of kernel TLB penalties.  The software TLB achieves high hit rates for
kernel TLB misses that range from 90% to nearly 100%. It also
decreases the number of kernel TLB misses by eliminating nested TLB
traps. As with prefetching, this improves performance by up to 3.5%.
     Using our two techniques together improves application
performance by up to 3%. On synthetic benchmarks that involve frequent
communication among several different address spaces (and thus put
more pressure on the TLB), prefetching improves overall performance by
about 6%, caching improves overall performance by about 10%, and the
two used together improve overall performance by about 12%. Thus, for
systems that use more frequent communication, we expect our techniques
to provide increasing benefits, and provide even greater benefits when
used together.
     Although currently the impact of TLB miss handling on overall
application performance is small, TLB miss handling penalties are
expected to have a larger impact on overall application performance on
newer architectures.  Our proposed techniques reduce both the number
of misses and the cost of each miss, and we expect both techniques to
provide increasing benefits on future architectures.
     Our experiments assume a microkernel-based operating system.
However, our techniques can be applied to other operating system
organizations, such as large single address space systems [6] and
systems with software segmentation [24].  Furthermore, with the
current emphasis on application-controlled resource management [2,10],
our prefetching techniques could become even more effective, since 
the prefetching strategy can be tailored for individual applications.
Prefetching can also be integrated with other VM functions such as
prefetching cache entries.
     Section 2 presents some background material on page table
organizations and VM management.  Section 3 describes our prefetching
scheme, implementation details, and results.  In Section 4, we present
our software caching scheme, implementation details, and results. In
Section 5, results for the integration of our two schemes are
presented. In Section 6, we discuss the benefits of our schemes in the
context of client-server database applications. We also present a
model for TLB penalties on faster architectures. In Section 7 we
discuss related work, and we summarize our conclusions in Section 8.


2 Background


     Architectures such as the DEC Alpha, the HP PA-RISC, and the MIPS
R2000/3000 provide software-managed TLBs.  This provides greater
flexibility to the operating system and eliminates hardware complexity.
For example, different operating systems can implement different page
table organizations for the same architecture.  Some common page table
organizations are inverted page tables and forward-mapped page
tables. For example, HP-UX implements an inverted page table for the
PA-RISC [12], while Mach 3.0 implements a 3-level page table
hierarchy for the MIPS R2000/3000 [23].


[Graphic]
----------------------------------------------------------------------------
Figure 1: Mach 3.0 Page Table Hierarchy. Mach 3.0 implements a 3-level
page table hierarchy on the DECstation 5000/240.


     Figure 1 depicts the page table hierarchy implemented by Mach 3.0
for an R3000-based machine. There are four types of page table entries
(PTEs): L1U, L2, L1K, and L3. L1U PTEs map the pages of user address
spaces. L2 PTEs map the user page tables.  L1K PTEs map kernel data
structures.  L3 PTEs map the pages containing L2 and L1K PTEs.  A page
pinned down in physical memory at the root of the hierarchy holds L3
PTEs [17].
     There are five types of TLB misses: L1U, L1K, L2, and L3 misses,
as well as TLB-invalid misses. A TLB-invalid miss takes place on user
addresses that are marked invalid by the virtual memory system. These
misses are handled identically by the VM system both in Mach and in
our modified versions of Mach.
       ___________________________________
       | Type of miss | Penalty (cycles) |
       ___________________________________
       | L1U          |    10 or 30-40   |
       | L1K          |         512      |
       | L2           |         555      |
       | L3           |         407      |
       | TLB-invalid  |         338      |
       ___________________________________

Table 1: Average TLB Miss Penalties.  Average penalties, measured in
CPU cycles, for TLB misses under Mach 3.0 on a 40 MHz R3000-based
DECstation. These averages were measured using the IOASIC counter. The
cycles counts for the misses exhibited fairly high variability.


     Table 1 gives the the average cost of each miss type, in CPU
cycles, for a 40MHz R3000-based architecture running Mach 3.0. The
cycle counts reported in the table were measured using the IOASIC
counter on the 25 MHz system bus. The counter is a free-running 32 bit
counter that is incremented once per TURBOchannel clock cycle (40 ns
for the DECstation 5000/240). The time to service TLB traps exhibited
fairly high variability, on the order of a couple hundred cycles. We
also observed that the trap time was related to the time between
consecutive traps. We believe that trap times are significantly
affected by cache misses.
     Since L1U misses are the most frequent, the architecture provides
a special trap handler, making them fast. The L1U special trap handler
executes 9 instructions and one memory load. If the memory load hits
in the cache the handler executes in 10 cycles; otherwise, the L1U
miss handler executes in about 30-40 cycles.  All other misses are
slow, since they are serviced by a generic trap handler that handles
all traps except for L1U TLB misses.
     Our prefetching and caching schemes decrease the number of L1K,
L2, and L3 misses, from now on referred to as the kernel TLB misses.
Kernel TLB misses typically account for greater than 50% of TLB
penalties [17].  Since L1U miss handling is extremely fast, and
TLB-invalid misses require the intervention of the VM system, our
schemes do not service these types of TLB misses.


3 Prefetching TLB Entries


     One software approach to decrease TLB overheads is to prefetch
page table entries.  Successful prefetching reduces the number of TLB
misses and thus eliminates the overhead of invoking TLB miss handlers.
As a beneficial side-effect, prefetching decreases the pollution of
caches caused by executing trap handlers.  When prefetching TLB
entries, three issues must be resolved: where to place the prefetching
code, what entries to prefetch, and how many entries to prefetch at a
time.
     Prefetching can be done on IPCs, or context switches between
different protection domains or even between different threads running
in a single large address space.  Communication between concurrently
executing protection domains significantly increases the number of TLB
misses in the system, since several protection domains must be mapped
at the same time.  In this situation, prefetching on the communication
path can be useful.  Communication takes place very frequently between
clients and servers in Mach; therefore, we decided to prefetch entries
on the IPC path.
     A process is typically structured such that its stack segment and
code segment are allocated at opposite ends of the address space,
while the data segment is allocated somewhere in the middle of the
process virtual address space.  This means that more than one L2 TLB
entry is required to map the process. Therefore, prefetching the L2
entries mapping the process stack, code, and data segments on IPCs can
be beneficial. For communicating processes, prefetching TLB entries
that map message buffers and those that map IPC data structures is
useful. Therefore, we decided to prefetch L1K entries mapping IPC data
structures. We also prefetch L2 entries mapping message buffers and
their associated L3 entries, and L2 entries mapping the process stack
and code segments.
     There is a trade-off between the benefits of aggressive
prefetching and the overheads added to the system. Prefetching entries
that are not subsequently used can evict useful entries from the TLB
and thus increase the total number of TLB misses. Therefore,
aggressive prefetching can result in performance degradation. In our
experiments, the remaining TLB misses were not frequent enough to
warrant the overhead added by prefetching.  Therefore, more aggressive
prefetching was not worthwhile.


3.1 Implementation

     Our implementation maintains TLB entries to be prefetched in a
separate Prefetch TLB (PTLB). The PTLB is a table maintained in
unmapped, cached physical memory, to avoid a cascade of TLB misses
while prefetching. PTLB entries are kept consistent with the kernel
page tables efficiently, by invalidating them when the kernel
invalidates page table entries.  The PTLB can hold up to 4K entries
and stores L1K, L2, and L3 TLB entries. TLB entries are prefetched on
the IPC path between different protection domains. On the first IPC
call between a pair of protection domains, TLB misses are handled
through the generic trap handler. The hardware TLB is then probed on
the IPC path and the TLB entries are entered into the PTLB. On
subsequent IPC calls, a hashed lookup is done into the PTLB to
prefetch the entries. Thus, the PTLB does not add overhead to the TLB
trap handler, but only to the IPC path; however, this overhead is not
significant (adding only about 25-60 cycles to a path of length
3500-4000 cycles).
     Trace data collected during experiments indicated that
prefetching L1K entries mapping kernel data structures eliminates
about 20% of kernel TLB misses.  Prefetching L2 entries mapping the
code and stack segments of processes eliminates about 15%, while
prefetching L2 entries mapping server message buffers and their
associated L3 entries eliminates another 15%. We decided to
dynamically prefetch these entries on the IPC path between processes.
     TLB misses on L2 entries for the data segment of processes were
also frequent.  Mach supports sparse memory allocation, and therefore
permits allocation of memory in non-contiguous locations of the
virtual address space of a process [5]. While this is a useful
feature of the VM system, it is not easy to dynamically determine the
location of the data segment of a process. Therefore, we did not
prefetch L2 entries mapping the data segment of a process.


3.2 Experimental Environment

     Our experimental platform consists of Mach 3.0 running on an
R3000-based DECstation 5000/240. MK82 is the Mach 3.0 microkernel used
in our experiments with the user-level server, UX41, to provide Unix
functionality. Xcfbpmax is the X11R5 server used in our experiments.
Thus, our experiments consist of three user-level processes: the Unix
server, the X server, and the client application, all communicating
through the kernel using IPC.
     The R3000 has a 64-entry fully associative TLB. The hardware
supports using the TLB in two partitions.  The upper partition of 56
entries supports a random replacement policy and Mach uses it to hold
L1U, L1K, and L3 entries.  Mach uses the lower partition of 8 entries
to hold L2 entries with a FIFO policy.
     The benchmarks studied were mpeg_play, jpeg_play, video_play,
which are X applications, and ousterhout, IOzone, and mab (modified
Andrew benchmark), which are file-system-oriented applications. Table 2
lists some relevant statistics for each benchmark. For a full
description of the benchmarks refer to [16].

______________________________________________
|              | Kernel_TLB_misses_(thousands)|
|Application   |Total | L1K  | L2    | L3     |
|_____________________________________________|
|mpeg_play     |245.5 | 46.8 | 116.0 |  82.7  |
|jpeg_play_    | 25.6 |  6.2 |  10.0 |   9.4  |
|video_play    |211.0 | 41.5 |  98.4 |  71.1  |
|IOzone        |  5.9 |  2.9 |   0.1 |   2.9  |
|ousterhout    | 43.0 | 11.2 |  16.3 |  15.5  |
|mab           | 89.6 | 33.3 |  22.4 |  33.9  |
|_____________________________________________|

Table 2: Baseline application statistics. This figure gives the
breakdown of the different types of kernel TLB misses under unmodified
Mach 3.0. The counts are in thousands.


     Network traffic, Mach's random page replacement policy [7], and
the "aging" of the kernel [22], result in variability in application
runtimes. The machine was taken off the network to eliminate
variability due to network traffic. As the system stays up for a long
time, the number of L1K misses increases.  This happens because the
kernel starts allocating kernel data structures from mapped virtual
memory. To eliminate this effect, all timings were taken with freshly
booted kernels.  However, our schemes will give increasing benefits as
the number of L1K misses in the system increases with the aging of the
kernel.  In order to obtain accurate timings, 50 data points were
collected for each benchmark.  This decreased the variability in the
average time to acceptable amounts.

     We measured average application runtimes (X~ ) and standard
deviations (sigma), for all our experiments.  For all the data presented
in this paper, we found that for jpeg_play sigma/X~ is less than 0.3%,
for video_play it is less than 0.8%, for mpeg_play sigma/X~ is less than
1.0%, for mab it is less than 1.2%, for IOzone it is less than 1.5%,
and for ousterhout it is less than 4.0%.  Since TLB penalties are a
relatively small percentage of overall application runtime, it is
important that sigma/X~ be small.


3.3 Prefetching Results

[graphic]
Figure 2: Kernel TLB misses: Mach, Mach+PTLB. Kernel misses under
unmodified Mach are shown in black; misses under Mach+PTLB are shown in
gray.


__________________________________________
|             |     Kernel_TLB_Misses     |
|_________________________________________|
|             |      |   Mach+  |         |
|Application  | Mach |   PTLB   | Removed |
|_________________________________________|
|mpeg_play    |245.4 |   116.1  |  52.7%  |
|jpeg_play    | 25.6 |    14.9  |  42.0%  |
|video_play   |211.0 |   103.9  |  50.8%  |
|IOzone       |  5.9 |     5.7  |   4.1%  |
|ousterhout   | 43.0 |    22.2  |  48.5%  |
|mab          | 89.6 |    67.0  |  25.2%  |
|_________________________________________|

Table 3: Kernel TLB miss counts: Mach, Mach+PTLB. Total kernel TLB
miss counts, in thousands, and the percentage of kernel TLB misses
eliminated by prefetching.


     Figure 2 shows the decrease in the number of kernel TLB misses
for each of the three types of misses --- L1K, L2, and L3. Table 3 gives
the counts of kernel TLB misses under unmodified Mach and Mach+PTLB and
the percentage of kernel TLB misses eliminated by prefetching.
     Prefetching TLB entries on the IPC path eliminates 40% to 50% of
the kernel TLB misses for all benchmarks except IOzone and
mab. Neither of these benchmarks is IPC-intensive, and therefore they
derive little benefit from the PTLB.

[graphic]
---------------------------------------------------------------
Figure 3: Prefetching costs and benefits.  The measured timing
contributions of kernel TLB misses in unmodified Mach and
Mach+PTLB. The solid gray bars show the measured overhead added on the
IPC path due to prefetching.

     Figure 3 shows kernel TLB miss penalties under unmodified Mach in
black. The striped bars represent kernel TLB miss penalties under
Mach+PTLB, and the solid gray bars indicate the prefetching overheads
added to the IPC path. Prefetching an entry into the TLB takes about
60 cycles, while probing the TLB and not prefetching an entry because
it already exists takes about 25 cycles (this case occurs about twice
as often as successful prefetching).
______________________________________________
|              | Kernel          |            |
|              | TLB Penalty     |            |
|              | (million_cycles)|            |
|_____________________________________________|
|              |        |  Mach+ |            |
|Application   |  Mach  |  PTLB  |   Speedup  |
|_____________________________________________|
|mpeg_play     | 124.6  |  48.2  |     1.69%  |
|jpeg_play     |  13.0  |   6.0  |     0.91%  |
|video_play    | 108.0  |  41.6  |     3.52%  |
|IOzone        |   3.4  |   2.7  |     0.19%  |
|ousterhout    |  28.6  |  17.4  |     0.77%  |
|mab           |   2.0  |  29.8  |     0.96%  |
|_____________________________________________|

Table 4: Kernel TLB penalties: Mach, Mach+PTLB. The TLB penalties are
reported in millions of CPU cycles. The last column gives the
percentage of application time saved by prefetching.


     Table 4 presents the measured kernel TLB penalties, in millions
of CPU cycles, for applications under Mach and Mach+PTLB. The last
column presents the application speedup due to prefetching; speedup is
measured as the decrease in application runtime as a percentage of the
application runtime under unmodified Mach. Prefetching results in
speedups of up to 3.5% of overall application run time.  Kernel TLB
misses typically constitute greater than 50% of the overall TLB
penalties in an application. [Footnote 1: The remaining penalties are
due to L1U and TLB-invalid misses. Even though L1U misses are highly
optimized, their frequency is so high that they significantly
contribute to overall TLB penalties.]  Kernel TLB miss penalties
constitute 1% to 5% of the overall application time for the benchmarks
studied [17]. Thus, prefetching decreases a significant fraction of
the kernel TLB penalties.
     In summary, prefetching benefits all applications and decreases
TLB overheads significantly, usually eliminating about 50% of the
kernel TLB miss penalties.  Although TLB penalties are currently only
a few percent of overall application runtime, prefetching will provide
increasing benefits as TLB penalties increase in the future.


4 Software Cache of TLB Entries


     Architectures with software managed TLBs incur large penalties in
TLB miss handling. This is due not only to the use of generic trap
handlers but also to cascaded TLB misses resulting from hierarchical
page table organizations. We address this problem by providing a large
second-level software cache of TLB entries (STLB).
     L1K, L2, and L3 entries that have been evicted from the hardware
TLB are stored in the STLB. The STLB code
branches off the generic trap handler path at the beginning and does a
quick lookup in the STLB. On an STLB hit, the entry is inserted into
the hardware TLB. On an STLB miss, the code branches back to the
generic trap handler path.
     Thus, the first benefit of the STLB is that it provides a fast trap
path for TLB misses, and on an STLB hit this avoids the overhead (such
as saving and restoring registers) of the generic trap handler. The
second benefit of the STLB is that it eliminates cascaded TLB misses.
Cascaded TLB misses occur because the page tables are hierarchically
organized with the lower levels in mapped memory. The TLB trap handler
code tries to resolve a TLB miss by looking in the next higher level
in the page table hierarchy. However, this lookup can itself cause a
TLB miss (up to a depth of three in Mach's page table organization),
resulting in a cascade of TLB misses.  The STLB provides a flat space
of TLB entries in unmapped physical memory, and thus prevents such
cascaded TLB misses.
     As with any cache, the STLB can be organized in many different
ways --- direct-mapped, direct-mapped with a victim cache [14], n-way
associative, or fully associative [18]. The organization and the size
of the STLB affect its hit rate.


4.1 Implementation

     Like the PTLB, the STLB resides in unmapped, cached physical
memory, and therefore does not occupy page table entries in the
hardware TLB. Since the STLB code branches off the generic trap
handler path, it adds some overhead to other traps. However, this code
has been optimized so that it adds only 4 cycles to system calls and
interrupts, which are by far the most frequent kinds of
traps. [Footnote 2: The STLB code adds only 11 cycles to the other
traps, such as Bus Error, Address Error, and Floating Point
exceptions, which are relatively infrequent.]
     We studied the following organizations of the STLB: direct-mapped
with 1K entries (DM 1K), direct-mapped with 4K entries (DM 4K), 2-way
set-associative with 1K and 4K entries, and 4-way set-associative with
1K and 4K entries.  The direct-mapped organizations index into the
STLB and check only one entry to find a match with the faulting
address, whereas the set-associative organizations check more than one
entry of the STLB in software to find a match.


4.2 STLB results

                     _______________________
                    |   Penalty_(cycles)   |
        ____________|_____STLB_____|_______|
       |Miss_Type   |  Hit | Miss  |  Mach |
       |L1K         | 105  |  582  |  512  |
       |L2,_Path_1  | 114  |  625  |  555  |
       |L2,_Path_2  | 160  |  625  |  555  |
       |L3          | 105  |  408  |  338  |
       |___________________________________|

Table 5: Average STLB penalties. An STLB miss adds about 70 cycles to
the TLB trap path. The first L2 path through the STLB is more frequent
than the second L2 path, and has been optimized to take less time.


     Preliminary experiments with direct-mapped organizations showed
high hit rates, comparable to the set-associative organizations, but
with lower access times because sequential checks in software for
matches are avoided in a direct-mapped organization. Therefore, we
present only the results for the direct-mapped configurations of the
STLB. Table 5 presents the average time, in CPU cycles, that the
direct-mapped STLB takes to service TLB misses. [Footnote 3: The two
different L2 paths correspond to L2 misses that take place when in
user space, and L2 misses that take place when in kernel space.] An
STLB hit takes about 115-160 cycles for L2 hits and about 100 cycles
for L1K and L3 hits.  The difference in timings is because Mach 3.0
replaces L2 entries using a FIFO policy, and a random replacement
policy is followed for L1K and L3 entries.  An STLB miss adds about 70
cycles to the trap path. Thus, on an STLB hit, this scheme decreases
the overhead of the generic trap handler by providing fast access to a
large number of page table entries.

_________________________________________________________________
|              | Mach |        DM_1K       |         DM_4K       |
|Application   |Total |  Total  | Hit rate |  Total   | Hit rate |
|________________________________________________________________|
|mpeg_play     |245.4 |  197.5  |   82.5%  |   173.8  |   97.2%  | 
|jpeg_play     | 25.6 |   20.2  |   84.4%  |    17.4  |   99.8%  | 
|video_play    |211.0 |  202.4  |   70.9%  |   163.3  |  100.0%  | 
|IOzone        |  5.9 |    4.4  |   98.5%  |     4.3  |   99.3%  | 
|ousterhout    | 43.0 |   33.2  |   85.9%  |    29.0  |   99.3%  | 
|mab           | 89.6 |   88.0  |   70.6%  |    63.9  |   92.5%  | 
|________________________________________________________________|
 
Table 6: TLB miss counts and hit rates under Mach and STLB. Total
kernel misses in thousands under unmodified Mach 3.0, and the two
direct-mapped STLB configurations, DM 1K and DM 4K. Hit rates are
included for the STLB configurations.  DM 1K is the direct-mapped STLB
with 1K entries, while DM 4K is the direct-mapped STLB with 4K
entries.


______________________________________________
|              | Nested_Misses_(thousands)   |
|Application   | Mach   |   DM 1K |   DM 4K  |
|____________________________________________|
|mpeg_play     |  72.3  |   13.1  |     3.3  |
|jpeg_play     |   7.9  |    1.0  |     0.0  |
|video_play    |  61.4  |   35.8  |     0.0  |
|IOzone        |   1.7  |    0.0  |     0.0  |
|ousterhout    |  13.1  |    2.9  |     0.0  |
|mab           |  26.4  |   12.6  |     0.2  |
|____________________________________________|
Table 7: Nested TLB misses.  Nested TLB misses in thousands under
unmodified Mach 3.0, and the two direct-mapped STLB configurations, DM
1K and DM 4K. The nested misses are caused because of the page table
hierarchy.

     The experimental environment and benchmarks for the STLB are the
same as those for the PTLB which are described in Section 3.2.  Table
6 presents the number of kernel TLB misses for the two STLB
configurations and the associated STLB hit rates. The number of kernel
TLB misses under unmodified Mach 3.0 has been included for
comparison. The hit rates achieved by DM 1K range from 70% to 99%
while the hit rates achieved by DM 4K are close to 100%. The larger
STLB (DM 4K) services more TLB misses than the smaller DM 1K
STLB. This configuration also results in an overall decrease in the
number of kernel TLB misses. This is because the cascade of TLB misses
to higher levels of the page table hierarchy is eliminated when a TLB
miss hits in the STLB; and as Table 7 shows, under Mach 3.0 nearly 30%
of the kernel TLB misses for each application are such nested
misses. The larger STLB eliminates almost all of these cascaded
misses.

[graphic]
Figure 4: Total TLB penalties for different STLB
configurations. This figure gives the kernel TLB penalties, in seconds,
with the different STLB configurations and compares them with TLB
penalties under Mach.  The overhead added by the STLB is represented
by the solid gray bars. The overhead is the same for both STLB
configurations, since it depends on the number of system calls and is
independent of the STLB size.

     Figure 4 presents the total TLB penalties under the different
STLB configurations and unmodified Mach. The gray bars represent the
overhead imposed by the STLB on system calls. As is evident from the
figure, DM 4K reduces TLB penalties significantly.
___________________________________________________________________
|          |        TLB penalty             |                      |
|__________________________________________________________________|
|          |       (million cycles)         |        Speedup       |   
|__________________________________________________________________|
|Application | Mach  |   DM 1K  |    DM 4K  |    DM 1K |    DM 4K  |
|__________________________________________________________________|
|mpeg_play   | 124.6 |    41.7  |     21.4  |   0.27%  |    1.73%  |
|jpeg_play   |  13.0 |     4.2  |      1.9  |   0.15%  |    0.21%  | 
|video_play  | 108.0 |    60.7  |     15.7  |   2.18%  |    3.53%  | 
|IOzone      |   3.4 |     0.7  |      0.6  |   0.85%  |    0.59%  | 
|ousterhout  |  24.0 |     7.9  |      3.9  |   0.76%  |    1.53%  | 
|mab         |  52.0 |    31.6  |     13.2  |   0.76%  |    0.80%  | 
___________________________________________________________________|
Table 8: Kernel TLB penalties: Mach, STLB. Kernel TLB penalties under
Mach 3.0, and the two direct-mapped STLB configurations, DM 1K and DM
4K, in millions of CPU cycles. The last two columns present the
percentage of application time saved by the STLB configurations.


     Table 8 presents TLB penalties under unmodified Mach and the two
STLB configurations, and the percentage of application time saved by
using the STLB. The speedups obtained by using the larger STLB, DM 4K,
go up to 3.5%, a significant fraction of application kernel TLB
penalties.  DM 1K does not perform as well for all applications, due
to its lower hit rate.

5 Prefetching with the Software TLB


     The direct-mapped organization of the PTLB and the STLB make
integrating the two schemes fairly straightforward.  Using the STLB
decreases the number of cascaded TLB traps. Therefore, prefetching L3
entries is not very useful. In fact, it has the negative effect of
evicting useful entries from the TLB. Therefore, only L1K and L2
entries are prefetched. Since the DM 4K organization of the STLB
performed well, the PTLB+STLB implementation uses a direct-mapped
STLB, with 4K entries.


__________________________________________________________
|              |     Kernel_TLB_Misses        |           |
|              |        | PTLB+    |  Percent |    Hit    |
|Application   | Mach   |   STLB   |  Removed |   Rate    |
|_________________________________________________________|
|mpeg_play     | 245.4  |   128.6  |   47.6%  |    96.2%  | 
|jpeg_play     |  25.6  |    13.8  |   46.2%  |    99.7%  | 
|video_play    | 211.0  |   111.8  |   47.0%  |   100.0%  |  
|IOzone        |   5.9  |     4.1  |   31.2%  |    99.3%  | 
|ousterhout    |  43.0  |    21.1  |   50.9%  |    99.1%  | 
|mab           |  89.6  |    52.8  |   41.1%  |    90.9%  | 
__________________________________________________________|

Table 9: Kernel TLB miss counts: Mach, PTLB+STLB. Total kernel TLB
miss counts, in thousands, under Mach and PTLB+STLB.  The third column
presents the percentage of kernel TLB misses eliminated by the
software TLB and prefetching, and the last column gives the hit rate
for the STLB.


______________________________________________
|             |       Kernel      |           |
|             |     TLB Penalty   |           |
|             | (million_cycles)  |           |
|             |        |    PTLB+ |           |
|Application  | Mach   |    STLB  |   Speedup |
|_____________________________________________|
|mpeg play    | 124.6  |    18.4  |    1.09%  |
|jpeg play    |  13.0  |     1.6  |    0.27%  |
|video play   | 108.0  |    11.4  |    3.04%  |
|IOzone       |   3.4  |     0.6  |    0.99%  |
|ousterhout   |  24.0  |     3.4  |    1.65%  |
|mab          |  52.0  |    13.6  |    0.25%  |
|_____________________________________________|

Table 10: Kernel TLB penalties: Mach, PTLB+STLB. The kernel TLB
penalties are measured in millions of CPU cycles. The last column
gives the percentage of application time saved by using the integrated
scheme with prefetching and the STLB.

[graphic]
Figure 5: Costs and benefits of PTLB+STLB. The measured timing
contributions of kernel TLB misses in unmodified Mach and
PTLB+STLB. The solid gray bars give the measured overhead added to the
IPC path due to prefetching, and the trap handler path due to the
STLB.


     Table 9 gives the counts of kernel TLB misses under unmodified
Mach and PTLB+STLB and the percentage of kernel TLB misses eliminated.
The STLB hit rates are given in the last column.  The hit rates are
lower than those obtained for the STLB configuration DM 4K in Table 6.
This is because the total number of kernel TLB misses is decreased due
to prefetching.  Figure 5 shows kernel TLB miss penalties under
unmodified Mach in black.  The striped bars represent the kernel TLB
miss penalties under the PTLB+STLB, and the solid gray bars indicate
the overheads added by prefetching and the STLB.
     Table 10 presents the measured kernel TLB penalties, in millions
of CPU cycles, for applications under Mach and the PTLB+STLB. The last
column gives the percentage of overall application time saved due to
the integrated scheme.  As seen in Figure 5 and Table 10, the
integrated scheme performs well, improving application performance by
up to 3% of overall application runtime. However, it does not do
better than the prefetching scheme alone or the STLB scheme alone for
these applications. This is because while prefetching and the STLB are
both very effective at reducing TLB penalties, they eliminate
opportunities to reduce TLB costs from each other. However,
integrating the two schemes adds the overhead of both
schemes. Therefore, it does not provide greater benefits than the two
schemes do independently for these benchmarks.


6 Discussion


6.1 Multiple Clients and Servers

     All the benchmarks presented above involve a single client
application communicating though IPC to the UX server and the X
server. To study the benefits of our schemes with multiple clients and
servers and more frequent IPC, we ran a small synthetic database
application that consists of several clients making requests in a loop
to a database server. The database server receives a request from each
client to query a fixed number of records randomly distributed in the
memory-resident database file. The server touches each record
requested and returns to the requesting client. This application is
IPC-intensive and tests TLB behavior with several communicating
processes. Since it includes several clients, it puts significantly
more pressure on the TLB than the benchmarks presented earlier.
     We ran the benchmark with three clients communicating with one
server.  Prefetching eliminates about 30% to 35% of the kernel TLB
misses.  This improves overall application performance by 6%.  The
STLB exhibits a hit rate of about 90% and eliminates about 20% of the
kernel TLB misses by avoiding cascaded TLB misses. Using DM 4K, this
improves overall application performance by 10%. The integrated
PTLB+STLB scheme achieves an STLB hit rate of about 86%. It eliminates
about 40% of kernel TLB misses, and improves application performance
by about 12%.
     Thus, as the number of clients and servers increases, TLB
penalties represent a larger fraction of the runtime of IPC- intensive
applications. Our prefetching and caching schemes both provide
increasing benefits under these circumstances.  This benchmark also
demonstrates the usefulness of the integrated scheme.  The PTLB+STLB
provides greater benefits than either technique used alone.


6.2 Model for Future Architectures

     To study the benefits of our schemes on future architectures we
modelled trap time as a function of the caching behavior of the TLB
miss handler code.  We use the following notation: i denotes the
number of instructions executed by the trap handler, m denotes the
number of cache misses during trap execution, and t denotes the time
to service a cache miss.  Therefore we have, Trap time = i + m x t.
     The time to service a cache miss on the DECstation 5000/240 is t
= 24 cycles [8]. On a machine with a clock speed of 200 MHz and the
same memory system, the time to service a cache miss will be (200/40)
x 24 = 120 cycles.  On such a machine, TLB penalties will become:
  STLB fail       = 29 + 2 x 120 = 269 cycles
  STLB  L2; path1 = 61 + 2 x 120 = 301 cycles
  STLB  L2; path2 = 76 + 3 x 120 = 436 cycles
  STLB L1K; L3    = 52 + 2 x 120 = 292 cycles 

and for unmodified Mach,
  L1K   = 167 + 14 x 120 = 1847 cycles
  L2    = 207 + 14 x 120 = 1887 cycles
  KPT   = 161 + 10 x 120 = 1361 cycles

     Thus, TLB miss penalties increase by approximately 1500 cycles,
from 500 cycles to nearly 2000 cycles.  STLB penalties increase only
by about 300 cycles, from 100 cycles to 400 cycles.  Referring to
Table 5, an L1K miss that hits in the STLB on the current architecture
eliminates approximately (512 - 105) = 407 cycles.  Based on the above
model for a 200MHz architecture, an L1K hit in the STLB would
eliminate (1847 - 292) = 1555 cycles.  Prefetching entirely eliminates
the penalty of the TLB miss.  Thus, prefetching and the STLB should
decrease TLB miss handling penalties significantly on faster, newer
architectures, and should have an even greater impact on overall
application performance.


7 Related Work


     Prefetching and caching are two well-known techniques that have
been widely applied in computer systems [19,20], but have not been
applied to software management of TLBs. Most of the previous work has
focussed on hardware caches.
     Huck and Hays [12] have explored an idea similar to the STLB in
the context of the HP PA-RISC. Their Hashed Page Table (HPT) is a page
table organization suited for large address spaces. The faulting
virtual address is used to index into the HPT and a match is attempted
between the faulting address and the HPT entry. If no match is found,
a fault is taken to the OS. Collision entries are then looked up in
software. This scheme replaced an earlier one in the HP-UX operating
system, which used an inverted page table organization. Huck and Hays
simulated both hardware and software implementations of the HPT.
     Our work, which has been implemented on a different architecture,
complements their work in a number of ways.  Our entirely
software-based implementation is in the context of
hierarchically-organized forward-mapped page tables and it studies the
effect of nested TLB traps.  We obtained measurements from a running
system for a range of applications and found that a direct-mapped
organization of the STLB was very effective.  Our experiments showed
that a 2-way set-associative software STLB adds more overhead, as
compared to a simple direct-mapped organization.  Further, our work
studies the benefit of prefetching TLB entries in the context of a
microkernel-based OS environment with several communicating processes,
where TLB performance is more critical.
     Another approach to decreasing TLB miss handling costs is to
provide variable-sized pages that use a single TLB entry to map large
contiguous regions of memory. This scheme is used by the PowerPC
architecture [13]. Prefetching of TLB entries is still useful for
dynamically allocated data structures. As operating systems move to
modular organizations supporting finer grained processes interacting
with each other, the number of L2 TLB misses will increase. Thus,
although each process could be mapped by a single TLB entry with a
larger page size, the number of such entries is still large.  Since
the TLB is a shared resource, contention for the TLB will be high.
Therefore, although each entry maps more pages of an address space,
prefetching such entries should continue to be useful.  Also, in
systems supporting garbage collection, ensuring that all important
data structures are contiguous in physical memory is not easy.  Thus,
our schemes can complement the variable-sized pages solution in
reducing TLB miss penalties.


8 Conclusions


     This paper presents two entirely software-based solutions to
decrease TLB penalties.  The first scheme, prefetching TLB entries on
the IPC path, decreases kernel TLB miss penalties by about 50%. The
second scheme uses a large software cache to provide fast path access
to entries on a TLB miss.  The hit rates achieved are high, ranging
from 90% to nearly 100%.
     Both prefetching and software caching perform well for all
applications.  The integration of the two schemes also performs well,
though it provides less benefits than any one scheme independently for
the benchmarks studied.  For a synthetic benchmark that involves
several processes communicating frequently, however, using the two
techniques together provides greater benefits than either technique
alone.
     Although TLB penalties are currently only a few percent of
overall application runtime, they will probably increase in the
future, both because processor speeds are increasing more rapidly than
memory speeds, and because of changes in
operating system structure that are resulting in more frequent
communication. Analysis and experiments indicate that our techniques
should improve application runtime by more than 10% in the future.  9
Acknowledgments


     Thanks to Wilson Hsieh and Carl Waldspurger for their
contribution to early ideas in this work. Special thanks to Carl for
his invaluable suggestions, advice, time and support. Thanks also to
Don Yeung for our fruitful discussions, Anthony Joseph for his help
with X and Rich Uhlig, Alessandro Forin and Mary Thompson for their
help with Mach.  Thanks to Anthony, Debby Wallach, Don, Eric Brewer,
Ulana Legedza, and Wilson for their useful comments on the
drafts. Thanks to Paul Leach and the anonymous referees for their
helpful comments.


References


  [1] T. Anderson, H. Levy, B. Bershad, and E. Lazowska. The
interaction of architecture and operating system design. In
Proceedings
      of the 4th Conference on Architectural Support for Programming
Languages and Systems, pages 108-119, Santa Clara, CA, April
      1991.

  [2] B. Bershad, C. Chambers, S. Eggers, C. Maeda D., McNamee,
P. Pardyak, S. Savage, and E. Sire. SPIN - an extensible microkernel
      for application-specific operating system services. Technical
Report TR94-03-03, University of Washington, February 1994.

  [3] B.N. Bershad. The increasing irrelevance of IPC performance for
microkernel-based operating systems. In USENIX Workshop on
      Microkernels and Other Kernel Architectures, pages 205-211,
Seattle, WA, April 1992.

  [4] B.N. Bershad, T.E. Anderson, E.D. Lazowska, and H.M. Levy.
Lightweight remote procedure call.  In Proceedings of the 12th
      Symposium on Operating Systems Principles, pages 102-113,
Litchfield Park, AZ, December 1989.

  [5] J. Boykin, D. Kirschen, A. Langerman, and
S. LoVerso. Programming under Mach. Addison-Wesley Publishing Company,
1993.

  [6] J.S. Chase, H.M. Levy, E.D. Lazowska, and M. Baker-Harvey.
Lightweight shared objects in a 64-bit operating system.  In
      Proceedings of the Conference on Object-Oriented Programming
Systems, Languages and Applications, pages 397-413, Vancouver,
      Canada, October 1992.

  [7] B. Chen and B. Bershad. The impact of operating system structure
on memory system performance. In Proceedings of the 14th
      Symposium on Operating Systems Principles, pages 120-133,
Asheville, NC, December 1993.

  [8] Digital Equipment Corporation. DECstation and DECsystem 5000
Model 240 Technical Overview. 1991.

  [9] Microsoft Corporation. Microsoft OLE programmer's
reference. Microsoft Press, 1993.

[10] D. Engler, M. F. Kaashoek, and J. O'Toole. The operating system
kernel as a secure programmable machine. In Proceedings of the
      6th European SIGOPS Workshop, Germany, September 1994.

[11] W. Hsieh, M. F. Kaashoek, and W. E. Weihl. The persistent
relevance of IPC performance: New techniques for reducing the IPC
      penalty. In Proceedings of the 4th Workshop on Workstation
Operating Systems, pages 186-190, Napa, CA, October 1993.

[12] J. Huck and J. Hays. Architectural support for translation table
management in large address space machines. In Proceedings of
      the 20th International Symposium on Computer Architecture, pages
39-50, San Diego, CA, May 1993.

[13] Motorola Inc. PowerPC 601: RISC Microprocessor User's
Manual. Prentice-Hall, Inc., 1993.

[14] N. Jouppi. Improving direct-mapped cache performance by the
addition of a small fully-associative cache and prefetch buffers.*
 * In
      Proceedings of the 17th International Symposium on Computer
Architecture, pages 364-373, Seattle, WA, May 1990.

[15] J. Liedtke.  Improving IPC by kernel design.  In Proceedings of
the 14th Symposium on Operating Systems Principles, pages
      175-188, Asheville, NC, December 1993.

[16] D. Nagle, R. Uhlig, T. Mudge, and S. Sechrest.  Optimal
allocation of on-chip memory for multiple-API operating systems.  In
      Proceedings of the 21st International Symposium on Computer
Architecture, pages 358-369, Chicago, IL, April 1994.

[17] D. Nagle, R. Uhlig, T. Stanley, S. Sechrest, T. Mudge, and
R. Brown. Design tradeoffs for software managed TLBs. In Proceedin*
 *gs
      of the 20th International Symposium on Computer Architecture,
pages 27-38, San Diego, CA, May 1993.

[18] D. Patterson and J. Hennessy. Computer architecture: a
quantitative approach. Morgan Kaufman Publishers, 1989.

[19] D. Patterson and J. Hennessy. Computer organization and design:
the hardware/software interface. Morgan Kaufman Publishers,
      1993.

[20] A. S. Tanenbaum. Modern Operating Systems. Prentice-Hall, Inc.,
1992.

[21] OpenDoc Design Team. OpenDoc technical summary. In Apple's World
Wide Developers Conference Technologies CD, San Jose,
      CA, April 1994.

[22] R. Uhlig. Private communication, 1993.

[23] R. Uhlig, D. Nagle, T. Mudge, and S. Sechrest. Software TLB
management in OSF/1 and Mach 3.0. Technical report, University
      of Michigan, 1993.

[24] R. Wahbe, S. Lucco, T. Anderson, and S. Graham. Efficient
software-based fault isolation. In Proceedings of the 14th Symposium
      on Operating Systems Principles, pages 203-216, Asheville, NC,
December 1993.