
Introduction

With the emergence of data-intensive applications such as multimedia and data-mining workloads, disk I/O performance plays an ever more critical role in overall application performance, owing to the growing gap between CPU speed and disk access latency. Improving the performance of these data-intensive applications requires that disk access delays be effectively masked or overlapped with computation. Many application programmers alleviate this gap by manually interleaving computation with synchronous disk I/O: the program issues a disk I/O call, waits for the I/O to complete, performs some computation, then issues another disk I/O request, and so on. Another way to mask disk I/O delay is asynchronous I/O. Although conceptually straightforward, asynchronous disk I/O tends to complicate program logic and may lengthen software development time. Moreover, an application written against one disk I/O subsystem's asynchronous interface may not work well on another with different latency and bandwidth characteristics. Yet another solution is to cache recently accessed disk blocks in main memory for future reuse. When there is a high degree of data reuse, file or disk caching can reduce both read latency and disk bandwidth requirements. For many media applications, however, caching is not effective, either because the working set is larger than the file buffer cache or because each disk block is used only once.
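The synchronous interleaving pattern described above can be sketched as follows. This is an illustrative Python sketch only (the block size and the `process` function are hypothetical stand-ins, not part of any system discussed in this paper); the point is that the CPU idles during each read and the disk idles during each computation step:

```python
# Illustrative sketch of manually interleaved synchronous disk I/O.
# BLOCK and process() are hypothetical placeholders for illustration.
import os
import tempfile

BLOCK = 4096

def process(block: bytes) -> int:
    # Stand-in for the application's per-block computation.
    return sum(block) % 256

def synchronous_pass(path: str) -> list:
    """Issue a read, wait for it, compute, and repeat."""
    results = []
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK)        # CPU stalls here on a cache miss
            if not block:
                break
            results.append(process(block))  # computation; disk sits idle
    return results

# Demo on a small temporary file: 16 KiB = 4 blocks.
fd, path = tempfile.mkstemp()
os.write(fd, bytes(range(256)) * 64)
os.close(fd)
print(len(synchronous_pass(path)))       # prints 4
os.remove(path)
```

Each iteration alternates a blocking read with computation, which is exactly the pattern that prefetching aims to overlap.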

A well-known solution to this problem, implemented in many UNIX systems including Linux, is to prefetch a disk block before it is needed [12]. Linux assumes that most applications access a file sequentially. Hence, whenever an application reads a data block $i$ from the disk, the file system prefetches blocks $i+1$, $i+2$, ..., $i+n$ for some small value of $n$. If the access pattern is not sequential, prefetching is suspended until the application's disk accesses exhibit a sequential pattern again. For most common applications, this simple sequential prefetching scheme works well. However, sequential access is not necessarily the dominant access pattern for some other important applications, such as volumetric data visualization, multi-dimensional FFT, or digital video playback (e.g., fast forward and rewind); these are popular applications in the scientific visualization and multimedia worlds.
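The sequential-readahead heuristic described above can be sketched as a small state machine. This is a simplified illustration under assumptions (a fixed window $n$, one remembered block number, and an immediate suspend/resume rule); it is not the actual Linux readahead implementation, which is considerably more elaborate:

```python
# Hedged sketch of sequential readahead: prefetch blocks i+1 .. i+n after a
# sequential read of block i; suspend on a non-sequential access.
class SequentialReadahead:
    def __init__(self, n: int = 4):
        self.n = n            # readahead window size (assumed fixed here)
        self.last = None      # block number of the previous read
        self.active = False   # whether prefetching is currently enabled

    def on_read(self, block: int) -> list:
        """Return the list of blocks to prefetch after reading `block`."""
        sequential = self.last is not None and block == self.last + 1
        self.last = block
        self.active = sequential   # suspend until sequentiality resumes
        if self.active:
            return list(range(block + 1, block + 1 + self.n))
        return []

ra = SequentialReadahead(n=3)
print(ra.on_read(10))   # first access, no pattern yet: []
print(ra.on_read(11))   # sequential: [12, 13, 14]
print(ra.on_read(50))   # jump, prefetching suspended: []
```

The sketch makes the paper's point concrete: a strided or random access pattern keeps the heuristic permanently suspended, so such applications get no benefit.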

Different applications may exhibit different access patterns, so a fixed prefetching policy is not adequate. We propose an Automatic Application-Specific File Prefetching (AASFP) mechanism that allows an application program to instruct the file system how to prefetch on its behalf. AASFP automatically generates the prefetch instructions from an application's source code; these instructions are instantiated as a program running inside either a local file system or a network file server. AASFP applies the idea of decoupled architecture [22], originally proposed to address the von Neumann bottleneck between processor and memory, to bridging the CPU-disk performance gap.

AASFP transforms a given application program into two threads: a computation thread and a prefetch thread. The computation thread is the original program, containing both computation and disk access instructions; the prefetch thread contains only those instructions of the original program that are related to disk I/O. At run time, the prefetch thread is started before the computation thread. Because the computation load of the prefetch thread is typically a small fraction of that of the original program, the prefetch thread can remain ahead of the computation thread throughout the application's entire lifetime. Consequently, most disk accesses in the computation thread are serviced directly from the file system buffer, which is populated beforehand by the I/O accesses of the prefetch thread. In other words, the prefetch thread brings in exactly the data blocks the computation thread requires, before they are needed. Of course, the prefetch thread cannot always maintain a sufficient lead over the computation thread. For example, the computation thread may need to access a disk address whose generation depends on user input or on input read from a file. In such cases, the two threads must synchronize with each other. Fortunately, such tight synchronization is relatively infrequent in I/O-intensive programs. The key advantage of AASFP is that it eliminates the need for manually programmed file prefetch hints by generating the prefetch thread from the original program using compile-time analysis. In addition to being more accurate about what to prefetch, AASFP also gives the file system more flexibility in deciding when to prefetch disk blocks, via the large look-ahead window opened up by the prefetch thread.
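The two-thread decoupling can be illustrated with a minimal user-level sketch. All names here are hypothetical: a dictionary stands in for the file system buffer cache, an in-memory function stands in for the disk, and the prefetch thread is written by hand, whereas in AASFP it is derived from the program by compile-time analysis:

```python
# Minimal sketch of the computation/prefetch decoupling (assumed names).
# The prefetch thread walks the same access sequence ahead of the
# computation thread and warms a shared cache standing in for the
# file-system buffer.
import threading
import queue

def run_decoupled(access_sequence, read_block, compute):
    cache = {}                 # stand-in for the file system buffer cache
    ready = queue.Queue()      # signals from prefetch to computation

    def prefetch_thread():
        # Only the I/O-related instructions of the "original program".
        for addr in access_sequence:
            cache[addr] = read_block(addr)   # populate the buffer ahead of use
            ready.put(addr)

    def computation_thread(out):
        # The "original program": same access sequence, plus computation.
        for addr in access_sequence:
            while addr not in cache:         # fell behind: synchronize
                ready.get()
            out.append(compute(cache[addr]))

    results = []
    p = threading.Thread(target=prefetch_thread)
    c = threading.Thread(target=computation_thread, args=(results,))
    p.start(); c.start()
    p.join(); c.join()
    return results

# Demo with an in-memory "disk": block i holds the value i * 2.
seq = [3, 1, 4, 1, 5]
print(run_decoupled(seq, read_block=lambda i: i * 2, compute=lambda v: v + 1))
```

When the prefetch thread stays ahead, the computation thread's `while` loop never spins, which mirrors the common case described above; the loop body is the infrequent tight synchronization.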

The rest of this paper is organized as follows. Section 2 reviews some of the related work. Section 3 describes the system architecture of AASFP. Section 4 shows how to generate the prefetch thread from a given program. Section 5 discusses the run-time system of AASFP. Section 6 evaluates the performance results of AASFP. Section 7 concludes this paper and points out potential future directions.


chuan-kai yang 2002-04-15