Software Prefetching and Caching for Translation Lookaside Buffers


Kavita Bala* M. Frans Kaashoek* William E. Weihl*
MIT Laboratory for Computer Science
Cambridge, MA 02139, USA

[Footnote: E-mail: {kaybee, kaashoek, weihl}@lcs.mit.edu.
World Wide Web URL: https://www.psg.lcs.mit.edu/.
Prof. Weihl is currently supported by DEC while on sabbatical at DEC SRC.]

Abstract

A number of interacting trends in operating system structure, processor architecture, and memory systems are increasing both the rate of translation lookaside buffer (TLB) misses and the cost of servicing a miss. This paper presents two novel software schemes, implemented under Mach 3.0, to decrease both the number and the cost of kernel TLB misses (i.e., misses on kernel data structures, including user page tables). The first scheme is a new use of prefetching for TLB entries on the IPC path, and the second scheme is a new use of software caching of TLB entries for hierarchical page table organizations.
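
To make the two ideas concrete, the following is a minimal Python sketch of a software cache of translations that a TLB miss handler consults before falling back to a slow page table walk, plus a prefetch step that warms it ahead of a communication. It is an illustration under assumptions, not the implementation described in this paper: the class `SoftwareTLBCache`, its direct-mapped layout, the `(asid, vpn)` key, and the dictionary standing in for the page table are all hypothetical.

```python
class SoftwareTLBCache:
    """Hypothetical direct-mapped software cache of address translations."""

    def __init__(self, num_slots=64):
        self.num_slots = num_slots
        self.slots = [None] * num_slots  # each slot holds ((asid, vpn), pfn) or None
        self.fast_hits = 0               # misses served from the software cache
        self.slow_walks = 0              # misses that needed the full lookup

    def _index(self, asid, vpn):
        # Simple hash of address space id and virtual page number (assumed).
        return (vpn ^ asid) % self.num_slots

    def handle_miss(self, asid, vpn, page_table):
        """Return the physical frame for (asid, vpn), trying the cache first."""
        i = self._index(asid, vpn)
        entry = self.slots[i]
        if entry is not None and entry[0] == (asid, vpn):
            self.fast_hits += 1
            return entry[1]
        # Slow path: stands in for walking a hierarchical page table.
        self.slow_walks += 1
        pfn = page_table[(asid, vpn)]
        self.slots[i] = ((asid, vpn), pfn)
        return pfn

    def prefetch(self, asid, vpns, page_table):
        """Load translations ahead of use, e.g. before switching to a receiver."""
        for vpn in vpns:
            self.slots[self._index(asid, vpn)] = ((asid, vpn), page_table[(asid, vpn)])


cache = SoftwareTLBCache(num_slots=8)
page_table = {(1, 0x10): 0xA0, (2, 0x10): 0xB0, (2, 0x13): 0xB3}
cache.handle_miss(1, 0x10, page_table)       # first miss takes the slow path
cache.handle_miss(1, 0x10, page_table)       # repeat miss takes the fast path
cache.prefetch(2, [0x13], page_table)        # warm an entry for address space 2
cache.handle_miss(2, 0x13, page_table)       # served without a slow walk
```

In this sketch the fast path avoids the page table lookup entirely, and prefetching turns what would have been a slow walk into a cache hit; how closely this mirrors the Mach 3.0 schemes is not something the abstract specifies.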

For a range of applications, prefetching decreases the number of kernel TLB misses by 40% to 50%, and caching decreases TLB penalties by providing a fast path for over 90% of the misses. Our caching scheme also decreases the number of nested TLB traps due to the page table hierarchy, reducing the number of kernel TLB miss traps for applications by 20% to 40%. Prefetching and caching, when used alone, each improve application performance by up to 3.5%; when used together, they improve application performance by up to 3%. On synthetic benchmarks that involve frequent communication among several different address spaces (and thus put more pressure on the TLB), prefetching improves overall performance by about 6%, caching improves overall performance by about 10%, and the two used together improve overall performance by about 12%.

Our techniques are very effective in reducing kernel TLB penalties, which currently range from 1% to 5% of application runtime for the benchmarks studied. Since processor speeds continue to increase relative to memory speeds, our schemes should be even more effective in improving application performance in future architectures.


