USENIX, The Advanced Computing Systems Association

5th USENIX Conference on File and Storage Technologies - Paper

Pp. 301–314 of the Proceedings

PRO: A Popularity-based Multi-threaded Reconstruction Optimization for RAID-Structured Storage Systems

Lei Tian†‡, Dan Feng†‡, Hong Jiang§, Ke Zhou†‡
Lingfang Zeng†‡, Jianxi Chen†‡, Zhikun Wang†‡, Zhenlei Song
† Huazhong University of Science and Technology
‡ Wuhan National Laboratory for Optoelectronics
§ University of Nebraska-Lincoln
E-mail: {dfeng, k.zhou, ltian}@hust.edu.cn, jiang@cse.unl.edu

Abstract

This paper proposes and evaluates a novel dynamic data reconstruction optimization algorithm, called popularity-based multi-threaded reconstruction optimization (PRO), which allows the reconstruction process in a RAID-structured storage system to rebuild the frequently accessed areas prior to rebuilding infrequently accessed areas to exploit access locality. This approach has the salient advantage of simultaneously decreasing reconstruction time and alleviating user and system performance degradation. It can also be easily adopted in various conventional reconstruction approaches. In particular, we optimize the disk-oriented reconstruction (DOR) approach with PRO. The PRO-powered DOR is shown to induce a much earlier onset of response-time improvement and sustain a longer time span of such improvement than the original DOR. Our benchmark studies on read-only web workloads have shown that the PRO-powered DOR algorithm consistently outperforms the original DOR algorithm in the failure-recovery process in terms of user response time, with a 3.6%∼23.9% performance improvement and up to 44.7% reconstruction time improvement simultaneously.

1  Introduction

Reliability and availability are among the foremost concerns of storage system administrators and directly affect the end users of such systems. Frequent or long downtimes and data losses are clearly intolerable to end users. Research has shown that 50 percent of companies losing critical systems for more than 10 days never recover, 43 percent of companies experiencing a disaster never reopen, and 29 percent of the remainder close within two years [1]. Even for companies that do survive long-term performance degradation, the resulting penalties can become too costly to ignore. For example, for data services at a large-scale data center, a typical Service Level Agreement specifies the percentage of transaction response times that may exceed a threshold (e.g., seconds), and the penalties for failing to comply can be very costly [2].

As a result, redundant storage systems, such as RAID [3], have been widely deployed to prevent catastrophic data losses. Various RAID schemes employ mirroring, parity-encoding, hot spare and other mechanisms [4] to tolerate the failures of disk drives. If a disk failure occurs, RAID will automatically switch to a recovery mode, often at a degraded performance level, and activate a background reconstruction process for data recovery. The reconstruction process will rebuild data units of the failed disk onto the replacement disk while RAID continues to serve I/O requests from clients. After all of the contents are rebuilt, RAID returns the system to the normal operating mode.

Advances in storage technology have significantly reduced the cost and improved the performance and capacity of storage devices, making it possible for system designers to build extremely large storage systems composed of tens of thousands or more disks with high I/O performance and data capacity. However, the reliability of individual disk drives has improved very slowly compared to the improvement in their capacity and cost, resulting in a very high overall failure rate in a system with tens of thousands of disk drives. To make matters worse, the time to rebuild a single disk has lengthened as increases in disk capacity far outpace increases in disk bandwidth, widening the “window of vulnerability” during which a subsequent disk failure (or a series of subsequent disk failures) causes data loss [5]. Furthermore, many potential applications of data replication or migration, such as on-line backup and capacity management, face challenges similar to those of the reconstruction process.

Therefore, the efficiency of the reconstruction algorithm directly and significantly affects the reliability and performance of RAID-structured storage systems. The goals of a reconstruction algorithm are (1) to minimize the reconstruction time to ensure system reliability and (2) to minimize the performance degradation to ensure user and system performance, simultaneously. Existing effective approaches, such as disk-oriented reconstruction (DOR) [6] and pipelined reconstruction (PR) [7], optimize the recovery workflow to improve the reconstruction process's parallelism. Building on stripe-oriented reconstruction (SOR) [8], both DOR and PR execute a set of reconstruction processes to rebuild data units by exploiting parallelism and pipelining. While they both exploit reconstruction parallelism, the problem of optimizing the reconstruction sequence remains unsolved. In a typical recovery mode, the reconstruction process rebuilds data units while the RAID continues to serve I/O requests from clients, and clients' accesses can severely degrade reconstruction efficiency because serving clients' I/O requests and reconstruction I/O requests simultaneously leads to frequent long seeks to and from multiple separate data areas. Optimizing the reconstruction sequence according to the users' workload characteristics, we believe, is key to improving existing and widely-used reconstruction approaches such as those mentioned above.

In this paper, we propose a Popularity-based multi-threaded Reconstruction Optimization (PRO) algorithm that exploits I/O workload characteristics to guide the reconstruction process and thus optimizes existing reconstruction approaches. The main idea behind our PRO algorithm is to reconstruct the high-popularity data units of a failed disk, i.e., the units most frequently accessed under the current workload, prior to reconstructing other units. Furthermore, by fully exploiting the access patterns, the PRO algorithm has the potential to recover many units ahead of users' accesses with high probability, thus avoiding the performance degradation caused by recovery.

To the best of our knowledge, little research effort has been directed towards integrating workload characteristics into the reconstruction algorithm. Most of the existing reconstruction algorithms perform a sequential reconstruction on the failed disk regardless of the workload characteristics. Taking into account the workload distribution (such as spatial locality, temporal locality, etc.), the PRO approach attains a flexible and pertinent reconstruction strategy that attempts to reduce reconstruction time while alleviating the performance degradation. PRO achieves this by inducing a much earlier onset of response-time performance improvement and sustaining a longer time span of such improvement while reducing reconstruction time during recovery.

We implement the PRO algorithm based on DOR, and our benchmark studies on read-only web workloads have shown that the PRO algorithm outperforms DOR by 3.6%∼23.9% in user response time and by up to 44.7% in reconstruction time, simultaneously. Because the PRO algorithm is specifically designed to generate the optimal reconstruction sequence based on workload characteristics, without any modification to the reconstruction workflow or data layout, it can be easily incorporated into most existing reconstruction approaches of RAID and achieve significant improvements in system reliability and response-time performance.

The rest of this paper is organized as follows. Related work and motivation are presented in Section 2. In Section 3, we describe the algorithm and implementation of the PRO approach. Performance evaluation and discussions are presented in Section 4. We conclude the paper in Section 5 by summarizing the main contributions of this paper and pointing out directions for future research.

2  Related Work and Motivation

In this section, we first present the background and related work on reconstruction issues. We then describe the ubiquitous popularity and locality properties of typical workloads and thereby explain the motivation for our research.

2.1  Existing Reconstruction Approaches

There has been substantial research reported in the literature on reliability and recovery mechanisms in RAID or RAID-structured storage systems. Since the advent of disk arrays and RAID, researchers have been working to improve reliability and availability of such systems by devising efficient reconstruction algorithms for the recovery process, with several notable and effective outcomes.

There are two general approaches to disk array reconstruction depending on the source of improvement. The first general approach improves performance by reorganizing the data layout of spare or parity data units during recovery.

Hou et al. [9] considered an approach to improving reconstruction time, called distributed sparing. Instead of using a dedicated spare disk which is unused in the normal mode, they distribute the spare space evenly across the disk array. Their approach can result in improved response times and reconstruction times because one extra disk is available for normal use and less data needs to be reconstructed on a failure. As an extension to the distributed sparing approach in large-scale distributed storage systems, Xin et al. [10] presented a fast recovery mechanism (FARM) to dramatically lower the probability of data loss in large-scale storage systems.

The second general approach improves performance by optimizing the reconstruction workflow. Because this approach need not modify the data organization of the RAID, it is more prevalent in RAID implementations; it is also the approach on which our proposed algorithm is based.

Disk-oriented reconstruction (DOR) and pipelined reconstruction (PR) are two representative and widely-used reconstruction approaches. The DOR algorithm was proposed by Holland [11] to address the deficiencies of both the single-stripe and parallel-stripe reconstruction algorithms (SOR). Instead of creating a set of reconstruction processes associated with stripes, the array controller creates a number of processes, each associated with one disk. The advantage of this approach is that it maintains one low-priority request in the queue for each disk at all times, and is thus able to absorb all of the array's bandwidth not consumed by the users. Although DOR outperforms SOR in reconstruction time, the improvement in reliability comes at the expense of user response-time performance.

To address the reliability issue of continuous-media servers, Lee and Lui [7] presented a track-based reconstruction algorithm to rebuild lost data in tracks. In addition, they presented a pipelined reconstruction (PR) algorithm to reduce the extra buffers required by the track-based reconstruction algorithm. Muntz and Lui [12] conducted a performance study on reconstruction algorithms using an analytical model. Their first enhancement, reconstruction with redirection of reads, uses the spare disk to service disk requests to the failed disk if the requested data has already been rebuilt to the spare disk. They also proposed to piggyback rebuild requests on a normal workload. If a data block on the failed disk is accessed and has not yet been rebuilt to the spare disk, the data block is regenerated by reading the corresponding surviving disks. It is then a simple matter of also writing this data block to the spare disk.

The basic head-following algorithm [11] attempts to minimize head positioning time by reconstructing data and parity in the region of the array currently being accessed by the users. The problem with this approach is that it leads to almost immediate deadlock of the reconstruction process. Since the workload causes the disk heads to be uncorrelated with respect to each other, head following causes each reconstruction process to fetch a reconstruction unit from a different parity stripe. The buffers fill with data units from different parity stripes and are hard to free, and so reconstruction deadlocks.

Assuming that the probability of a second disk failure (while the first failed disk is under repair) is very low, Kari et al. [13] presented a delayed repair method to satisfy response time requirements; it introduces a short delay between repair requests to limit the number of repair read (or write) requests that any user disk request might have to wait for. For RAID-structured storage systems, it may not be practical to adopt such a delayed repair method directly, because of the relatively high probability of a second disk failure.

To address the problem of performance degradation during recovery, Vin et al. [14] integrated the recovery process with the decompression of video streams, thereby distributing the reconstruction process across the clients to utilize the inherent redundancy in video streams. Drawing on a file system's semantic knowledge, Sivathanu et al. [15] proposed a live-block recovery approach in their D-GRAID, which changes the recovery process to first recover those blocks that are live (i.e., those that belong to allocated files or directories).

Of the above related work on disk array reconstruction or recovery based on reconstruction workflow optimizations, the DOR algorithm still dominates in applications and implementations, whereas the others can arguably be considered either rooted in DOR or extensions or variations of it. In general, while DOR absorbs the disk array's unused bandwidth, its variations attempt to take advantage of user accesses. Our proposed PRO algorithm improves over these approaches by not only preserving the inherent sequentiality of the reconstruction process, but also dynamically scheduling multiple reconstruction points and taking reconstruction priority into account. More importantly, the PRO algorithm makes use of the access locality in I/O workloads, which is ignored by the above reconstruction algorithms, and improves response-time performance and reliability simultaneously.

2.2  Popularity and Locality of Workloads

A good understanding of I/O workload characteristics can provide useful insight into the reasons behind DOR's inability to minimize applications' performance degradation, thus helping us improve the DOR algorithm and related algorithms to address this problem. In particular, workload characteristics such as temporal locality and spatial locality can be exploited to alleviate performance degradation while reducing reconstruction time. Experience with traditional computer system workloads has shown the importance of temporal locality to computer design [16].

In many application environments, 80% of accesses are directed to 20% of the data, a phenomenon that has long been known as Pareto's Principle or “The 80/20 Rule” [17]. Pareto's Principle points to the existence of temporal and spatial locality in various kinds of I/O workloads. The presence of access locality in I/O workloads is well established in the literature. Gomez and Santonja [17] studied three sets of I/O traces provided by Hewlett-Packard Labs and showed that some blocks are extremely hot and popular while other blocks are rarely or never accessed. In a media workload, Cherkasova and Gupta [18] found that 14%∼30% of the files accessed on the server accounted for 90% of the media sessions and 92%∼94% of the bytes transferred, and were viewed by 96%∼97% of the unique clients.

Studies have further indicated that access locality is even more pronounced in the workload characteristics of web servers [19], and have identified three distinct types of web access locality, namely static, spatial and temporal locality. Static locality refers to the observation that 10% of the files accessed on a web server typically account for 90% of the server requests and 90% of the bytes transferred [20]. Roselli et al. [21] found that for all workloads, file access patterns are bimodal in that most files tend to be mostly read or mostly written. The web workload has far more read bandwidth than any other workload but relatively little write bandwidth. Sikalinda et al. [22] found that in the web search workload almost all operations are reads (99.98% of the total number of operations). Since typical on-line transaction processing (OLTP-like) workloads are read-dominated [13], the load on the surviving disks typically increases by 50%∼100% in the presence of a disk failure. This severely degrades the performance observed by the users and dramatically lengthens the time required to recover the lost data and store it on a replacement disk.

Motivated by insightful observations made by other researchers and by our own research, we propose to integrate temporal locality and spatial locality into the reconstruction algorithms to improve their effectiveness. If the frequently accessed data units could be reconstructed prior to reconstructing all others, the effect of performance degradation can be potentially hidden from the users in their subsequent accesses to the same data units, especially during the recovery process.

3  Design and Implementation of PRO

Given the aforementioned shortcomings of existing parallel reconstruction approaches (such as DOR and PR), a solution must be sought to optimize the reconstruction sequence. Based on prior research by others and by us, we believe that the key to such a solution is to integrate workload characteristics into the reconstruction algorithm. By making an appropriate connection between the external workload and the internal reconstruction I/O activities, we propose a Popularity-based multi-threaded Reconstruction Optimization (PRO) algorithm to combine the locality of the workload with the sequentiality of the reconstruction process.

The main idea behind the PRO algorithm is to monitor and keep track of the dynamic popularity changes of data areas, thus directing the reconstruction process to rebuild highly popular data units of a failed disk prior to rebuilding other units. More specifically, PRO divides the data units on the replacement disk into multiple non-overlapping but consecutive data areas, called “hot zones”. Correspondingly, PRO employs multiple independent reconstruction threads, where each reconstruction thread is responsible for one of the hot zones and the priority of each thread is determined by the latest frequency of users' accesses to its hot zone. The PRO algorithm adopts a priority-based time-sharing scheduling algorithm to schedule the reconstruction threads. The scheduler periodically activates the thread with the highest priority by allocating a time slice to it; the thread then rebuilds the remaining data units of its hot zone until the time slice is used up.

Different from the strictly sequential sequence of most existing reconstruction approaches, PRO generates a workload-following reconstruction sequence to exploit the locality of workload characteristics.

In this section, we first outline the design principles behind PRO, which is followed by a detailed description of the PRO algorithm via an example, as well as an overview of PRO's implementation.

3.1  Design Principles

As Holland concluded [11], a reconstruction algorithm must preserve the inherent sequentiality of the reconstruction process, since a disk drive can service sequential accesses at many times the bandwidth of random accesses. This observation led to the development of the disk-oriented reconstruction (DOR) and pipelined reconstruction (PR) algorithms, and to the rejection of head-following approaches. On the other hand, to take full advantage of the access locality in the I/O workload, we believe that a reconstruction algorithm should rebuild high-popularity data units prior to rebuilding low-popularity or no-popularity data units on the failed disk. Such popularity-first rebuilding, however, entails multiple reconstruction points, and the frequent long seeks to and from the multiple separate popular data areas incur seek and rotation penalties.

To strike a good balance between the above two seemingly conflicting goals of maintaining sequentiality and exploiting access locality, the PRO algorithm effectively combines “global decentralization" with “local sequentiality" in the reconstruction process. Inspired by the design principles of classical time-sharing operating systems, PRO adopts the ideas of “divide and conquer" and “time-sharing scheduling" to achieve the above goals.


Figure 1: Reconstruction thread descriptor.


Figure 2: A snapshot of the PRO algorithm at work.

3.2  The Reconstruction Algorithm

To obtain an optimal reconstruction sequence, PRO tracks the popularity changes of data areas and generates a workload-following reconstruction sequence to combine locality with sequentiality.

A description of the PRO algorithm is given as follows.

First, PRO divides the data units on the replacement disk into multiple non-overlapping but consecutive data areas called “hot zones” (the algorithm for determining the appropriate number of hot zones is described in Section 3.4), with each zone being associated with a popularity measure determined by the latest frequency of users' accesses to it. Correspondingly, the entire reconstruction process is divided into multiple independent reconstruction threads, each responsible for rebuilding one of the hot zones. The priority of a reconstruction thread is dynamically adjusted according to the current popularity of its hot zone. Similar to a real thread in an operating system, each reconstruction thread has a reconstruction thread descriptor (see Figure 1, which shows its data structure and content) that includes properties such as status, priority, and time slice.
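To make the descriptor concrete, a minimal C sketch is given below. It follows the properties listed above (status, popularity-derived priority, time slice); the field names and layout are illustrative assumptions on our part, not the actual data structure of Figure 1.

    typedef enum { THREAD_ALIVE, THREAD_RUNNING, THREAD_SUSPENDED } thread_status_t;

    typedef struct recon_thread {
        thread_status_t status;      /* alive, running, or suspended */
        unsigned long popularity;    /* access counter; determines priority, reset each round */
        unsigned int time_slice;     /* data units left in the current time slice */
        unsigned long zone_start;    /* first data unit of this hot zone */
        unsigned long zone_end;      /* one past the last data unit of the zone */
        unsigned long next_unit;     /* next not-yet-rebuilt data unit in the zone */
        struct recon_thread *next;   /* linked list of live reconstruction threads */
    } recon_thread_t;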

Second, once all of the reconstruction threads are initialized, a reconstruction scheduler selects the reconstruction thread with the highest priority and allocates a time slice to it, which activates this thread to rebuild the remaining data units of its hot zone until the time slice is used up. If the thread runs out of its time slice, the reconstruction scheduler suspends it, reselects the reconstruction thread that has the current highest priority, and allocates one time slice to it. This process repeats until all the data units have been rebuilt. Figure 2 is a snapshot of the PRO algorithm at work. From Figure 2, one can see three reconstruction threads and their respective hot zones with different popularity. Because the hot zone of the middle thread has the highest popularity, the thread is switched to the running state and allocated a time slice by the reconstruction scheduler. At the same time, the other two threads are suspended due to their low popularity.

The PRO approach uses the reconstruction threads to track the popularity changes of the hot zones and always picks up the most popular data area to rebuild prior to rebuilding others, thus achieving the goal of fully exploiting access locality based on popularity (via global decentralization) during the reconstruction process.

A time slice in PRO is always associated with a number of the consecutive data units. In our implementation, a time slice is set to be 64, that is, 64 consecutive data units. With this approach, PRO achieves the goal of maintaining local sequentiality of the reconstruction process.

After the above step, the data units in the reconstruction queue are workload-following and locally sequential.

3.3  The PRO Structure and Procedures

The PRO architecture consists of three key components: the Access Monitor (AM), the Reconstruction Scheduler (RS) and the Reconstruction Executor (RE). AM is responsible for capturing and analyzing clients' access locality and adjusting the popularities of the corresponding hot zones. The responsibility of RS is to select data units of the hot zone with the highest popularity, i.e., the most frequently-accessed data units, generate the corresponding reconstruction jobs, and put them into a FIFO reconstruction job queue. The main function of RE is to fetch jobs from the reconstruction job queue and rebuild the corresponding units on the replacement disk. The operations of each of the three PRO components are detailed below.

Access Monitor (AM):

Repeat
1. AM receives the I/O request and determines its type and address information.
2. If the type is read and the address is located on the failed disk, increase the popularity of the corresponding hot zone (i.e., the hot zone containing this address).
Until (the failed disk has been reconstructed)
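A minimal C sketch of this monitoring loop body, using the descriptor type sketched in Section 3.2, might look as follows; zone_lookup() and addr_on_failed_disk() are hypothetical helpers we assume for illustration.

    /* Hypothetical helpers: map an address to its hot zone (NULL if none),
     * and test whether an address falls on the failed disk. */
    extern recon_thread_t *zone_lookup(unsigned long addr);
    extern int addr_on_failed_disk(unsigned long addr);

    void am_observe(int is_read, unsigned long addr)
    {
        if (is_read && addr_on_failed_disk(addr)) {
            recon_thread_t *z = zone_lookup(addr);
            if (z != NULL)
                z->popularity++;   /* step 2: bump this hot zone's popularity */
            /* if no zone covers the address, a new one is created (Section 3.4) */
        }
    }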


Reconstruction Scheduler (RS):

Repeat
1. Check the time slice of the reconstruction thread whose status is running.
2. If the thread has no time slice left, select the reconstruction thread with the highest popularity from all reconstruction threads, and allocate a time slice to it. At the same time, reset the popularity of all reconstruction threads to zero.
3. Set the status of the chosen reconstruction thread to the running state, and set the status of all other threads to the suspended state. Meanwhile, reduce the remaining time slice of the chosen reconstruction thread by one.
4. The chosen reconstruction thread rebuilds a remaining data unit in its hot zone, generating a corresponding reconstruction job for this reconstruction unit and queuing it at the tail of the reconstruction job queue.
5. If all data units in this hot zone are reconstructed, reclaim the chosen reconstruction thread and its hot zone.
Until (the failed disk has been reconstructed)
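The following C sketch mirrors the five numbered steps above, handling one rebuilt data unit per call, under the same assumptions as the earlier sketches; thread_list, enqueue_job(), and reclaim() are hypothetical helpers.

    #define SLICE_UNITS 64   /* one time slice covers 64 consecutive data units */

    extern recon_thread_t *thread_list;            /* live reconstruction threads */
    extern void enqueue_job(unsigned long unit);   /* FIFO reconstruction job queue */
    extern void reclaim(recon_thread_t *t);        /* free a thread and its hot zone */

    void rs_round(recon_thread_t **running)
    {
        recon_thread_t *t, *r = *running;

        if (r == NULL || r->time_slice == 0) {      /* steps 1-2 */
            recon_thread_t *best = thread_list;
            if (best == NULL)
                return;                             /* nothing left to schedule */
            for (t = thread_list; t != NULL; t = t->next)
                if (t->popularity > best->popularity)
                    best = t;                       /* highest popularity wins */
            for (t = thread_list; t != NULL; t = t->next) {
                t->popularity = 0;                  /* reset for the next round */
                t->status = THREAD_SUSPENDED;       /* step 3: suspend the others */
            }
            best->time_slice = SLICE_UNITS;
            best->status = THREAD_RUNNING;
            r = *running = best;
        }

        r->time_slice--;                            /* step 3 */
        enqueue_job(r->next_unit++);                /* step 4: queue one rebuild job */

        if (r->next_unit >= r->zone_end) {          /* step 5 */
            reclaim(r);
            *running = NULL;
        }
    }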


Reconstruction Executor (RE):
In fact, the functions of RE are the same as those of the DOR algorithm except for the following subtle difference: RE fetches the next job from the reconstruction job queue rather than taking the sequentially next data unit, as the DOR algorithm does. The disk array controller creates C processes, each associated with one disk. Each of the C−1 processes associated with a surviving disk executes the following loop:

Repeat
1. Fetch the next job from the reconstruction job queue and identify the unit on this disk that is needed for its reconstruction.
2. Issue a low-priority request to read the indicated unit into a buffer.
3. Wait for the read to complete.
4. Submit the unit's data to a centralized buffer manager for XOR, blocking the process if the buffer is full.
Until (all necessary units have been read)


The process associated with the replacement disk executes the following operations:

Repeat
1. Request, in sequence, the next full buffer from the buffer manager.
2. Issue a low-priority write of the buffer to the replacement disk.
3. Wait for the write to complete.
Until (the failed disk has been reconstructed)

3.4  Implementation Issues

How is the popularity data collected, stored, and updated?

Whenever the reconstruction process starts, the Access Monitor begins to track and evaluate the statistics of clients' accesses to the failed disk. The popularity data of each hot zone is stored in that zone's “popularity” counter. Collecting, storing, and updating popularity data even before the reconstruction process starts (i.e., during normal operation) would keep better track of the dynamic changes of popular data areas; this, however, would incur a 2% to 4% system performance loss (similar to the impact of I/O trace tools) even though the probability of a disk failure is relatively low.

Once a client accesses a hot zone, the value of its popularity counter is increased by one. Each time the reconstruction scheduler finishes a thread scheduling procedure, all the popularity counters of hot zones are reset to zeros to be ready for a new round of monitoring. This implies that short-term, rather than long-term, popularity history is captured by the PRO implementation. This is based on our belief that the former is more important than the latter during recovery because current or recent past popularity change tends to point to newly and rapidly accessed data areas that should be reconstructed more urgently.

How many zones are used?

In PRO, hot zones, together with their corresponding reconstruction threads, are dynamically created, updated and reused. At the beginning of the recovery process, there are no reconstruction threads or hot zones.

When a client accesses a data unit, say, numbered N, on the failed disk, a thread is created and initialized, along with its corresponding hot zone of default length L covering the range N to N+L. After initialization, the reconstruction thread switches to the alive status and waits for scheduling. If a client accesses a data unit belonging to this hot zone, the zone's popularity is increased by one; but if a client accesses a data unit not belonging to any of the existing hot zones, a new thread, along with its new hot zone, is created and initialized in the manner described above. Assuming that the data unit is numbered M, if M < N and N − M < L, the range of the newly created hot zone is set to be M to N (i.e., of length N − M) to ensure non-overlap between two adjacent hot zones; otherwise the range is set to be M to M+L. In other words, the default length L is the maximum hot zone length. When all data units in a hot zone are rebuilt, the hot zone and the corresponding reconstruction thread are reclaimed for reuse.
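Under the same assumptions as the earlier sketches, this zone-creation rule can be written as the following C sketch; find_next_zone_start() (returning the start N of the nearest existing hot zone above M, or ULONG_MAX if none) and alloc_thread() are hypothetical helpers.

    #include <limits.h>

    #define ZONE_LEN_MAX 1024UL   /* default (and maximum) hot zone length L */

    extern unsigned long find_next_zone_start(unsigned long m);
    extern recon_thread_t *alloc_thread(void);

    recon_thread_t *create_zone(unsigned long m)
    {
        recon_thread_t *t = alloc_thread();
        unsigned long n = find_next_zone_start(m);   /* N: nearest zone above M */

        t->zone_start = m;
        if (n != ULONG_MAX && n - m < ZONE_LEN_MAX)
            t->zone_end = n;                  /* truncated range M to N (length N-M) */
        else
            t->zone_end = m + ZONE_LEN_MAX;   /* full default length: M to M+L */
        t->next_unit = m;
        t->popularity = 1;                    /* count the triggering access */
        t->time_slice = 0;
        t->status = THREAD_ALIVE;             /* wait for scheduling */
        return t;
    }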

Clearly, the number and length of hot zones are not fixed but are dynamically adjusted by the workload. In our implementation, we set the maximum length of a hot zone to 1024 and the maximum number of hot zones (and thus of reconstruction threads) to 128; that is, the PRO algorithm can track the popularity changes of up to 128 data areas simultaneously.

How are threads scheduled?

The PRO algorithm adopts a priority-based time-sharing scheduling algorithm to schedule multiple reconstruction threads. It must be noted that the PRO algorithm's scheduler bases its selection decision mostly on a thread's priority (or its corresponding zone's popularity).


Table 1: The platform for performance evaluation.

Component       Description
CPU             Intel Pentium 4, 3 GHz
Memory          512 MB DDR SDRAM
SCSI HBA        LSI Logic 22320R
SCSI Disk       Seagate ST3146807LC
OS              NetBSD 2.1
RAID software   RAIDframe [26]

4  Performance Evaluations

This section presents the results of a comprehensive experimental study comparing the proposed PRO-powered DOR algorithm (PRO for short) with the original DOR algorithm. To the best of our knowledge, the DOR algorithm is arguably the most effective among existing reconstruction algorithms, in part because it is implemented in many software and hardware RAID products [23-25] and is the most widely studied in the literature. This performance study analyzes reconstruction performance in terms of user response time, reconstruction time, and algorithm complexity.


Table 2: The characteristics of three web traces.

Trace Name    Number of Requests   Popularity (share of total requests to the    Average Inter-Arrival
                                   20% most active area, by size)                Time (ms)
Web Trace 1   842,535              69.99%                                        3.74
Web Trace 2   999,999              70.12%                                        3.73
Web Trace 3   999,999              69.89%                                        5.71

4.1  Experimental Setup

We conducted our performance evaluation of the two algorithms on a platform of server-class hardware and software (see Table 1). The Seagate ST3146807LC disks rotate at 10,000 RPM, with an average seek time of 4.7 ms and a capacity of 147 GB each. We use two SCSI channels, each attaching five disks.

Because the NetBSD platform has no appropriate benchmark for replaying I/O traces, we implemented a block-level replay tool, RAIDmeter, to evaluate the performance of the two reconstruction algorithms, since most databases run not on a file system but directly on raw devices. The main function of RAIDmeter is to replay traces and evaluate I/O response times: it opens the block device, sends the designated read/write requests to the device according to the timestamp and request type of each I/O event in the trace, waits for the completion of the I/O requests, and gathers the corresponding results, such as response time and throughput. Since the NetBSD 2.1 operating system does not support read or write operations on files larger than 2GB, it is very inconvenient to benchmark disk arrays consisting of hard drives with hundreds of GB of capacity each. As a result, we had to limit the capacity of every disk to 5GB. In addition, RAIDmeter adopts the approach of randread [27] to overcome this capacity limitation and support I/O accesses to files of tens of GBs. We believe that RAIDmeter is the best block-level trace-replay tool available for the current NetBSD platform.
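As a rough illustration only, the core of a block-level replay loop in the spirit of RAIDmeter is sketched below; the event structure, the busy-wait pacing, and the omitted error handling are our own simplifying assumptions rather than RAIDmeter's actual code.

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <time.h>
    #include <unistd.h>

    typedef struct { double ts; off_t off; size_t len; int is_read; } event_t;

    static double now_sec(void)
    {
        struct timespec t;
        clock_gettime(CLOCK_MONOTONIC, &t);
        return t.tv_sec + t.tv_nsec / 1e9;
    }

    void replay(const char *dev, const event_t *ev, size_t n)
    {
        int fd = open(dev, O_RDWR);       /* the raw RAID block device */
        char *buf = malloc(1 << 20);      /* shared 1 MB transfer buffer */
        double start = now_sec();

        for (size_t i = 0; i < n; i++) {
            while (now_sec() - start < ev[i].ts)
                ;                         /* crude pacing: wait until issue time */
            double t0 = now_sec();
            if (ev[i].is_read)
                pread(fd, buf, ev[i].len, ev[i].off);
            else
                pwrite(fd, buf, ev[i].len, ev[i].off);
            printf("request %zu: %.3f ms\n", i, (now_sec() - t0) * 1e3);
        }
        free(buf);
        close(fd);
    }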


Table 3: A comparison of PRO and DOR reconstruction times (seconds) as a function of the number of disks.

RAID    Number     Web Trace 1                  Web Trace 2                  Web Trace 3
Level   of Disks   DOR      PRO      improved   DOR      PRO     improved   DOR     PRO     improved
RAID5   3          1123.8   666.5    40.7%      1058.3   585.2   44.7%      452.3   351.6   22.3%
RAID5   5          457.4    353.1    22.8%      374.5    304.1   18.8%      242.8   203.9   16.0%
RAID5   7          344.0    319.1    7.2%       325.7    278.5   14.5%      238.2   210.8   11.5%
RAID5   9          243.5    240.3    1.3%       231.5    215.3   7.0%       192.4   184.5   4.1%
RAID1   2          1208.1   1157.7   4.2%       938.0    870.1   7.3%       424.2   409.5   3.7%


Table 4: A comparison of PRO and DOR user response times (milliseconds) as a function of the number of disks.

RAID    Number     Web Trace 1               Web Trace 2               Web Trace 3
Level   of Disks   DOR    PRO    improved    DOR    PRO    improved    DOR    PRO    improved
RAID5   3          31.8   24.2   23.9%       28.5   23.9   16.0%       27.4   23.1   15.6%
RAID5   5          21.7   19.3   11.1%       21.0   18.7   11.0%       20.0   17.8   11.3%
RAID5   7          25.0   23.8   5.1%        22.5   21.4   4.5%        22.6   20.0   11.8%
RAID5   9          19.1   18.2   4.5%        19.6   17.3   11.5%       19.5   18.8   3.6%
RAID1   2          29.5   28.2   11.1%       21.4   20.6   11.0%       18.8   17.8   11.3%

4.2  Methodology and Workload

We evaluated our design by running trace-driven experiments with the web I/O traces obtained from the Storage Performance Council [28], because web workloads tend to reveal the nature of locality most pronouncedly. These traces were collected from a system running a web search engine, and they are all read-dominant with high locality. Because of the nearly absolute read domination (99.98%) in all of the web-search-engine I/O traces, we filtered out the write requests and fed only read events to our RAIDmeter benchmark. Owing to the relatively short reconstruction times in our current experimental setup, it was not necessary to use the full daily-level traces; thus, we retained only the beginning part of each trace, with a length appropriate for our reconstruction experiments. Table 2 shows the relevant information of the web-search-engine I/O traces.
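For illustration, the write-filtering preprocessing step might look like the sketch below; the comma-separated "ASU, LBA, size, opcode, timestamp" line format is our assumption about these Storage Performance Council traces, and the actual preprocessing may differ.

    #include <ctype.h>
    #include <stdio.h>

    /* Read trace lines on stdin; pass read events through, drop writes. */
    int main(void)
    {
        char line[256], op;
        unsigned asu, size;
        unsigned long lba;
        double ts;

        while (fgets(line, sizeof line, stdin) != NULL)
            if (sscanf(line, "%u,%lu,%u,%c,%lf", &asu, &lba, &size, &op, &ts) == 5
                    && tolower((unsigned char)op) == 'r')
                fputs(line, stdout);
        return 0;
    }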

4.3  Reconstruction Performance

We first conducted our performance evaluation of the two reconstruction algorithms on a RAID-5 disk array consisting of a variable number of disks plus one hot-spare disk, and on a RAID-1 disk array consisting of two disks plus one hot-spare disk, both with a stripe unit size of 64KB.

Tables 3 and 4 show the measured reconstruction times and user response times of DOR and PRO, respectively, and reveal the efficacy of the PRO algorithm in improving the array's reconstruction time and user response time simultaneously during recovery. Across all given workloads, RAID levels and disk numbers in our experiments, the PRO algorithm almost consistently outperforms the DOR algorithm in reconstruction time and user response time by up to 44.7% and 23.9%, respectively.

It is not uncommon for a second disk drive to fail shortly after a first disk failure in a very large-scale RAID-structured storage system; it is thus very important to shorten the reconstruction time so as to avoid a second disk failure during the recovery from the first, for the sake of the reliability and availability of the storage system.

Furthermore, for all of the traces, one can see that the reconstruction time improvement achieved by the PRO algorithm on a RAID5 disk array is more significant than that on a RAID1 disk array, and the smaller the number of disks in the RAID5 disk array, the greater the improvement obtained by PRO. The reason for the lower improvement for RAID1 is that the overhead of reconstructing data on the fly is lower for RAID1 than for RAID5. The reason for the decreased improvement at larger array sizes is that, for the same trace applied to all array sizes, the larger the array, the lower the I/O request intensity imposed on each individual disk, which in turn implies less locality for PRO to exploit. However, if the overall I/O traffic intensity increases proportionally with the array size, we expect PRO's improvement on reconstruction time to be more significant.

For completeness, we also conducted a comparative experiment on the reconstruction time without any workload. DOR and PRO consistently took nearly the same amount of time to rebuild all data units on the failed disk. This result verifies that PRO has no negative impact on the recovery process, relative to DOR.

Figure 3 illustrates the noticeable improvement of the PRO algorithm over the DOR algorithm. Two aspects of significant improvement in user response time are demonstrated in these figures as a result of PRO's effective exploitation of access locality. First, the onset of the PRO-led performance improvement is much earlier than that of DOR during the reconstruction period. In other words, users under PRO experience a much longer time span of improved performance during recovery than under DOR. This PRO-induced time span occupies on average a majority of the total reconstruction time, compared to the relatively small portion occupied by the DOR-induced time span. More importantly, during this time span the PRO algorithm outperforms DOR in user response time.

Second, the PRO algorithm reduces user response time to a lower level earlier than the DOR algorithm does, and preserves this level steadily until the recovery ends. For example, user response times in Figures 3(a) and 3(b) reach a stable level at 20%∼30% of the total reconstruction time under PRO, compared with 40%∼60% of the total reconstruction time under DOR. In other words, users under PRO recovery suffer increased response times for only a very short time at the very beginning of the recovery process, compared with DOR recovery.

To examine the impact of different stripe unit sizes, we conducted a performance evaluation on a RAID-5 disk array consisting of 5 disks and 1 hot-spare disk, with stripe unit sizes of 64KB and 16KB. We plot the measured reconstruction times in Figure 4(a) and the average response times in Figure 4(b). One can see that for the two stripe sizes examined, the PRO algorithm outperforms DOR in both reconstruction time and response time. Moreover, with the 16KB stripe unit, the reliability and performance improvements achieved by PRO are greater than with the 64KB stripe unit, except for the reconstruction time of web trace 3.

In summary, by exploiting access locality the PRO algorithm consistently outperforms the DOR algorithm both in reliability and in performance during recovery.


Figure 3: (a), (b), (c): A comparison of PRO and DOR user response times as a function of the respective traces: web traces 1, 2 and 3. In each figure, the bottom bars show the I/O intensity of the trace, and the two curves show the PRO and DOR user response time trends during recovery. The RAID-5 disk array consists of 3 disks and 1 hot-spare disk, with a stripe unit size of 64KB.


Figure 4: (a), (b): A comparison of PRO and DOR reconstruction time and response time as a function of the respective stripe unit: 64KB and 16KB. The RAID-5 disk array consists of 5 disks and 1 hot-spare disk.

4.4  Complexity Analysis

Space Complexity
Because the PRO approach is based on the implementation of the DOR algorithm, its memory requirements are almost the same as those of the latter. Compared with DOR, PRO requires extra memory only to store the reconstruction thread descriptors. In our implementation, each thread descriptor consumes 1KB of memory, and with the maximum number of threads set to 128 (as in our implementation), the extra memory needed by the PRO approach is about 128KB.

Time Complexity

Since the time for each thread-scheduling event is mostly spent selecting the highest-priority thread from all candidate threads, the computation overhead of the PRO algorithm is approximately O(n), where n is the total number of existing threads. If we use a priority queue, the computation complexity can be reduced to O(log n). In either case, the computation overhead is negligible on a modern processor compared with the enormous I/O latency of a disk.
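A hedged sketch of the priority-queue alternative follows, reusing the descriptor type sketched in Section 3.2: thread pointers are kept in a binary max-heap ordered by popularity, so removing the most popular thread costs O(log n) instead of an O(n) scan. This is only one possible realization; the paper's implementation corresponds to the simple linear scan.

    #include <stddef.h>

    typedef struct { recon_thread_t **a; size_t n; } heap_t;

    static void sift_down(heap_t *h, size_t i)
    {
        for (;;) {
            size_t l = 2 * i + 1, r = l + 1, m = i;
            if (l < h->n && h->a[l]->popularity > h->a[m]->popularity) m = l;
            if (r < h->n && h->a[r]->popularity > h->a[m]->popularity) m = r;
            if (m == i)
                return;                    /* heap property restored */
            recon_thread_t *tmp = h->a[i];
            h->a[i] = h->a[m];
            h->a[m] = tmp;
            i = m;
        }
    }

    /* Remove and return the most popular thread in O(log n). */
    recon_thread_t *heap_pop(heap_t *h)
    {
        if (h->n == 0)
            return NULL;
        recon_thread_t *top = h->a[0];
        h->a[0] = h->a[--h->n];
        sift_down(h, 0);
        return top;
    }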

Implementation Complexity

We briefly quantify the implementation complexity of PRO. Table 5 lists the lines of code, counted by the number of semicolons and braces, that were modified in or added to RAIDframe. From the table, one can see that very few additions and modifications to the reconstruction module are needed. Most of the complexity lies in the Reconstruction Scheduler, because this module is in fact the bridge between the Access Monitor and the Reconstruction Executor, and its functions are shared with those modules. Compared with the 39,691 lines of the RAIDframe implementation, the modifications for PRO account for only 1.7% of the total lines.


Table 5: Code complexity for the DOR modifications in RAIDframe.

Component   Mod. Lines   Add. Lines   Total Lines
AM          0            84           84
RS          0            590          590
RE          2            12           14
Total       2            686          688

5  Conclusion

The recovery mechanism of RAID becomes increasingly more critical to the reliability and availability of storage systems. System administrators demand a solution that can reconstruct the entire content of a failed disk as quickly as possible and, at the same time, alleviate performance degradation during recovery as much as possible. However, it is very difficult to achieve these two goals simultaneously.

In this paper, we developed and evaluated a novel dynamic reconstruction optimization algorithm for redundant disk arrays, called the Popularity-based multi-threaded Reconstruction Optimization (PRO) algorithm. The PRO algorithm exploits the access locality of I/O workloads, which is ubiquitous in real workloads, especially in the web server environment. The PRO algorithm allows the reconstruction process to rebuild the frequently-accessed areas prior to rebuilding infrequently-accessed areas. Our experimental analysis shows that the PRO algorithm results in a 3.6%∼23.9% user performance improvement during recovery over the DOR algorithm. Further, PRO leads to a much earlier onset of performance improvement and a longer time span of such improvement than DOR during recovery. More importantly, PRO can reduce the reconstruction time of DOR by up to 44.7%. By effectively exploiting access locality, the PRO algorithm accomplishes the goal of simultaneously improving reliability and user and system performance.

However, the PRO algorithm in its current form may not be suitable for all types of workloads; for example, it is unlikely to work well under write-dominated workloads. A good solution for write-dominated workloads remains one of our directions for future research on PRO. In addition, the current PRO only integrates access locality into the reconstruction algorithm, without distinguishing or predicting access patterns. We believe that PRO's effectiveness can be increased if access patterns are discovered, predicted, and incorporated into the reconstruction process, which is another direction for our future research.

Acknowledgments

We would like to thank our shepherd, Andrea C. Arpaci-Dusseau, and the anonymous reviewers for their helpful comments in reviewing this paper. We also thank Greg Oster (the RAIDframe maintainer in NetBSD operating system) at the University of Saskatchewan for his insightful suggestions, and thank Chao Jin, Xiongzi Ge and Qiang Zou for their help in writing this paper.

This work is sponsored by the National Basic Research Program of China (973 Program) under Grant No. 2004CB318201, the Key Basic Research Project under Grant No.2004CCA07400, Project CNGI-04-5-1D, Program for New Century Excellent Talents in University NCET-04-0693, the China NSF under Grant No.60273074, No.60503059, No. 60303032, No. 60603048, and the US NSF under Grant No. CCF-0621526.

References

[1]
D. Wenk. Is 'Good Enough' Storage Good Enough for Compliance? Disaster Recovery Journal. 2004.
[2]
Q. Zhu and Y. Zhou. Chameleon: A Self-Adaptive Energy-Efficient Performance-Aware RAID. In Proceedings of the Fourth Annual Austin Conference on Energy-Efficient Design (ACEED 2005), Poster Session, Austin, Texas, March 1-3, 2005.
[3]
D. Patterson, G. Gibson, and R. Katz. A Case for Redundant Arrays of Inexpensive Disks (RAID). In Proceedings of the Conference on Management of Data, 1988.
[4]
G. Gibson. Redundant Disk Arrays: Reliable, Parallel Secondary Storage. MIT Press, 1992.
[5]
Q. Xin, E. L. Miller, T. J. Schwarz, D. D. E. Long, S. A. Brandt, and W. Litwin. Reliability mechanisms for very large storage systems. In Proceedings of the 20th IEEE/11th NASA Goddard Conference on Mass Storage Systems and Technologies, pages 146-156, April 2003.
[6]
M. Holland, G. Gibson, and D. Siewiorek. Fast, On-Line Failure Recovery in Redundant Disk Arrays. In Proceedings of the International Symposium on Fault-Tolerant Computing, pages 422-43, 1993.
[7]
Jack Y. B. Lee and John C. S. Lui. Automatic Recovery from Disk Failure in Continuous-Media Servers. IEEE Transactions on Parallel and Distributed Systems, Vol. 13, No. 5, May 2002.
[8]
M. Holland, G.A. Gibson and D. Siewiorek. Architectures and Algorithms for On-Line Failure Recovery in Redundant Disk Arrays. Journal of Distributed and Parallel Databases, Vol. 2, No. 3, pages 295-335, July 1994.
[9]
R. Hou, J. Menon, and Y. Patt. Balancing I/O Response Time and Disk Rebuild Time in a RAID5 Disk Array. In Proceedings of the Hawaii International Conference on Systems Sciences, pages 70-79, 1993.
[10]
Q. Xin, E. L. Miller and T. J. Schwarz. Evaluation of Distributed Recovery in Large-Scale Storage Systems. In Proceedings of the 13th IEEE International Symposium on High Performance Distributed Computing, pages 172-181, June 2004.
[11]
M. Holland. On-Line Data Reconstruction in Redundant Disk Arrays. Carnegie Mellon Ph.D. Dissertation CMU-CS-94-164, April 1994.
[12]
R. Muntz and J. Lui. Performance Analysis of Disk Arrays Under Failure. In Proceedings of the 16th Conference on Very Large Data Bases, 1990.
[13]
H. H. Kari, H. K. Saikkonen, N. Park, and F. Lombardi. Analysis of repair algorithms for mirrored-disk systems. IEEE Transactions on Reliability, Vol. 46, No. 2, pages 193-200, 1997.
[14]
H. M. Vin, P. J. Shenoy, and S. Rao. Efficient Failure Recovery in Multi-Disk Multimedia Servers. In Proceedings of the Twenty-Fifth International Symposium on Fault-Tolerant Computing, pages 12-21, 1995.
[15]
M. Sivathanu, V. Prabhakaran, F. Popovici, T. E. Denehy, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Improving Storage System Availability with D-GRAID. In Proceedings of the 3rd USENIX Conference on File and Storage Technologies (FAST '04), San Francisco, CA, March 2004.
[16]
A. Smith. Cache Memories. Computing Surveys, Vol. 14, No. 3, pages 473-480, 1982.
[17]
M. E. Gomez and V. Santonja. Characterizing Temporal Locality in I/O Workload. In Proceedings of the 2002 International Symposium on Performance Evaluation of Computer and Telecommunication Systems (SPECTS'02), San Diego, USA, July 2002.
[18]
L. Cherkasova and M. Gupta. Analysis of Enterprise Media Server Workloads: Access Patterns, Locality, Content Evolution, and Rates of Change. IEEE/ACM Transactions on Networking, Vol. 12, No. 5, October 2004.
[19]
L. Cherkasova and G. Ciardo. Characterizing Temporal Locality and its Impact on Web Server Performance. Technical Report HPL-2000-82, Hewlett-Packard Laboratories, July 2000.
[20]
M. Arlitt and C. Williamson. Web server workload characterization: the search for invariants. In Proceedings of the ACM SIGMETRICS '96 Conference, Philadelphia, PA, May 1996.
[21]
D. Roselli, J. R. Lorch, and T. E. Anderson. A Comparison of File System Workloads. In Proceedings of 2000 USENIX Annual Technical Conference. San Diego, California, USA, June 2000.
[22]
P. G. Sikalinda, L. Walters and P. S. Kritzinger. A Storage System Workload Analyzer. Technical Report CS06-02-00, University of Cape Town, 2006.
[23]
The NetBSD Foundation. The NetBSD Guide. https://www.netbsd.org/guide/en/chap-rf.html.
[24]
The OpenBSD Foundation. The OpenBSD FAQ. https://www.se.openbsd.org/faq/faq11.html/raid.
[25]
The FreeBSD Foundation. The FreeBSD/i386 5.3-RELEASE Release Notes. https://www.freebsd.org/releases/5.3R/relnotes-i386.html.
[26]
W.V. Courtright II, G.A. Gibson, M. Holland and J. Zelenka. RAIDframe: Rapid Prototyping for Disk Arrays. In Proceedings of the 1996 Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), Vol. 24 No. 1, pages 268-269, May 1996.
[27]
The randread pkg. https://pkgsrc.se/benchmarks/randread
[28]
SPC Web Search Engine I/O Trace. https://traces.cs.umass.edu/storage/
