Scalable Performance of the Panasas Parallel File System

Brent Welch1, Marc Unangst1, Zainul Abbasi1, Garth Gibson12, Brian Mueller1,

Jason Small1, Jim Zelenka1, Bin Zhou1
1Panasas, Inc.  2Carnegie Mellon
{welch,mju,zabbasi,garth,bmueller,jsmall,jimz,bzhou}@panasas.com


Abstract

The Panasas file system uses parallel and redundant access to object storage devices (OSDs), per-file RAID, distributed metadata management, consistent client caching, file locking services, and internal cluster management to provide a scalable, fault tolerant, high performance distributed file system.  The clustered design of the storage system and the use of client-driven RAID provide scalable performance to many concurrent file system clients through parallel access to file data that is striped across OSD storage nodes.  RAID recovery is performed in parallel by the cluster of metadata managers, and declustered data placement yields scalable RAID rebuild rates as the storage system grows larger.  This paper presents performance measures of I/O, metadata, and recovery operations for storage clusters that range in size from 10 to 120 storage nodes, 1 to 12 metadata nodes, and with file system client counts ranging from 1 to 100 compute nodes.  Production installations are as large as 500 storage nodes, 50 metadata managers, and 5000 clients.

1           Introduction

Storage systems for high performance computing environments must be designed to scale in performance so that they can be configured to match the required load.  Clustering techniques are often used to provide scalability.  In a storage cluster, many nodes each control some storage, and the overall distributed file system assembles the cluster elements into one large, seamless storage system.  The storage cluster can be hosted on the same computers that perform data processing, or they can be a separate cluster that is devoted entirely to storage and accessible to the compute cluster via a network protocol.

The Panasas storage system is a specialized storage cluster, and this paper presents its design and a number of performance measurements to illustrate the scalability.  The Panasas system is a production system that provides file service to some of the largest compute clusters in the world, in scientific labs, in seismic data processing, in digital animation studios, in computational fluid dynamics, in semiconductor manufacturing, and in general purpose computing environments.  In these environments, hundreds or thousands of file system clients share data and generate very high aggregate I/O load on the file system.  The Panasas system is designed to support several thousand clients and storage capacities in excess of a petabyte.

The unique aspects of the Panasas system are its use of per-file, client-driven RAID, its parallel RAID rebuild, its treatment of different classes of metadata (block, file, system) and a commodity parts based blade hardware with integrated UPS.  Of course, the system has many other features (such as object storage, fault tolerance, caching and cache consistency, and a simplified management model) that are not unique, but are necessary for a scalable system implementation.

2           Panasas File System Background

This section makes a brief tour through the system to provide an overview for the following sections. The two overall themes to the system are object storage, which affects how the file system manages its data, and clustering of components, which allows the system to scale in performance and capacity. 

The storage cluster is divided into storage nodes and manager nodes at a ratio of about 10 storage nodes to 1 manager node, although that ratio is variable.  The storage nodes implement an object store, and are accessed directly from Panasas file system clients during I/O operations.  The manager nodes manage the overall storage cluster, implement the distributed file system semantics, handle recovery of storage node failures, and provide an exported view of the Panasas file system via NFS and CIFS.  Figure 1 gives a basic view of the system components.

2.1         Object Storage

An object is a container for data and attributes; it is analogous to the inode inside a traditional UNIX file system implementation.  Specialized storage nodes called Object Storage Devices (OSD) store objects in a local OSDFS file system.  The object interface addresses objects in a two-level (partition ID/object ID) namespace. The OSD wire protocol provides byte-oriented access to the data, attribute manipulation, creation and deletion of objects, and several other specialized operations [OSD04].  We use an iSCSI transport to carry OSD commands that are very similar to the OSDv2 standard currently in progress within SNIA and ANSI-T10 [SNIA].

The Panasas file system is layered over the object storage.  Each file is striped over two or more objects to provide redundancy and high bandwidth access. The file system semantics are implemented by metadata managers that mediate access to objects from clients of the file system.  The clients access the object storage using the iSCSI/OSD protocol for Read and Write operations.  The I/O operations proceed directly and in parallel to the storage nodes, bypassing the metadata managers. The clients interact with the out-of-band metadata managers via RPC to obtain access capabilities and location information for the objects that store files.  The performance of striped file access is presented later in the paper.

Object attributes are used to store file-level attributes, and directories are implemented with objects that store name to object ID mappings.  Thus the file system metadata is kept in the object store itself, rather than being kept in a separate database or some other form of storage on the metadata nodes.  Metadata operations are described and measured later in this paper.

2.2         System Software Components

The major software subsystems are the OSDFS object storage system, the Panasas file system metadata manager, the Panasas file system client, the NFS/CIFS gateway, and the overall cluster management system. 

·   The Panasas client is an installable kernel module that runs inside the Linux kernel. The kernel module implements the standard VFS interface, so that the client hosts can mount the file system and use a POSIX interface to the storage system.  We don’t require any patches to run inside the 2.4 or 2.6 Linux kernel, and have tested with over 200 Linux variants.

·   Each storage cluster node runs a common platform that is based on FreeBSD, with additional services to provide hardware monitoring, configuration management, and overall control.

·   The storage nodes use a specialized local file system (OSDFS) that implements the object storage primitives. They implement an iSCSI target and the OSD command set. The OSDFS object store and iSCSI target/OSD command processor are kernel modules. OSDFS is concerned with traditional block-level file system issues such as efficient disk arm utilization, media management (i.e., error handling), high throughput, as well as the OSD interface.

·   The cluster manager (SysMgr) maintains the global configuration, and it controls the other services and nodes in the storage cluster. There is an associated management application that provides both a command line interface (CLI) and an HTML interface (GUI). These are all user level applications that run on a subset of the manager nodes.  The cluster manager is concerned with membership in the storage cluster, fault detection, configuration management, and overall control for operations like software upgrade and system restart [Welch07].

·   The Panasas metadata manager (PanFS) implements the file system semantics and manages data striping across the object storage devices. This is a user level application that runs on every manager node.  The metadata manager is concerned with distributed file system issues such as secure multi-user access, maintaining consistent file- and object-level metadata, client cache coherency, and recovery from client, storage node, and metadata server crashes. Fault tolerance is based on a local transaction log that is replicated to a backup on a different manager node.

·   The NFS and CIFS services provide access to the file system for hosts that cannot use our Linux installable file system client. The NFS service is a tuned version of the standard FreeBSD NFS server that runs inside the kernel. The CIFS service is based on Samba and runs at user level.  In turn, these services use a local instance of the file system client, which runs inside the FreeBSD kernel.  These gateway services run on every manager node to provide a clustered NFS and CIFS service.

2.3         Commodity Hardware Platform

The storage cluster nodes are implemented as blades that are very compact computer systems made from commodity parts.  The blades are clustered together to provide a scalable platform.  Up to 11 blades fit into a 4U (7 inches) high shelf chassis that provides dual power supplies, a high capacity battery, and one or two 16-port GE switches.  The switches aggregate the GE ports from the blades into a 4 GE trunk.  The 2nd switch provides redundancy and is connected to a 2nd GE port on each blade.  The battery serves as a UPS and powers the shelf for a brief period of time (about five minutes) to provide orderly system shutdown in the event of a power failure.  Any number of blades can be combined to create very large storage systems.

The OSD StorageBlade module and metadata manager DirectorBlade module use the same form factor blade and fit into the same chassis slots.  The StorageBlade module contains a commodity processor, two disks, ECC memory, and dual GE NICs.  The DirectorBlade module has a faster processor, more memory, dual GE NICs, and a small private disk. In addition to metadata management, DirectorBlades also provide NFS and CIFS service, and their large memory is used as a data cache when serving these protocols.  Details of the different blades used in the performance experiments are given in Appendix I.

Any number of shelf chassis can be grouped into the same storage cluster.  A shelf typically has one or two DirectorBlade modules and 9 or 10 StorageBlade modules.  A shelf with 10 StorageBlade modules contains 5 to 15 TB of raw storage in 4U of rack space.  Customer installations range in size from 1 shelf to around 50 shelves, although there is no enforced limit on system size.

While the hardware is essentially a commodity PC (i.e., no ASICs), there are two aspects of the hardware that simplified our software design. The first is the integrated UPS in the shelf chassis that makes all of main memory NVRAM. The metadata managers do fast logging to memory and reflect that to a backup with low latency network protocols.  OSDFS buffers write data so it can efficiently manage block allocation. The UPS powers the system for several minutes to protect the system as it shuts down cleanly after a power failure. The metadata managers flush their logs to a local disk, and OSDFS flushes writes through to disk. The logging mechanism is described and measured in detail later in the paper. The system monitors the battery charge level, and will not allow a shelf chassis to enter service without an adequately charged battery to avoid data loss during back-to-back power failures.

The other important aspect of the hardware is that blades are a Field Replaceable Unit (FRU).  Instead of trying to repair a blade, if anything goes wrong with the hardware, the whole blade is replaced. We settled on a two-drive storage blade as a compromise between cost, performance, and reliability.  Having the blade as a failure domain simplifies our fault tolerance mechanisms, and it provides a simple maintenance model for system administrators. Reliability and data reconstruction are described and measured in detail later in the paper.

3           Storage Management

Traditional storage management tasks involve partitioning available storage space into LUNs (i.e., logical units that are one or more disks, or a subset of a RAID array), assigning LUN ownership to different hosts, configuring RAID parameters, creating file systems or databases on LUNs, and connecting clients to the correct server for their storage.  This can be a labor-intensive scenario.  We sought to provide a simplified model for storage management that would shield the storage administrator from these kinds of details and allow a single, part-time admin to manage systems that were hundreds of terabytes in size.

The Panasas storage system presents itself as a file system with a POSIX interface, and hides most of the complexities of storage management. Clients have a single mount point for the entire system.  The /etc/fstab file references the cluster manager, and from that the client learns the location of the metadata service instances. The administrator can add storage while the system is online, and new resources are automatically discovered.  To manage available storage, we introduced two basic storage concepts: a physical storage pool called a BladeSet, and a logical quota tree called a Volume. 

The BladeSet is a collection of StorageBlade modules in one or more shelves that comprise a RAID fault domain.  We mitigate the risk of large fault domains with the scalable rebuild performance described in Section 4.2. The BladeSet is a hard physical boundary for the volumes it contains. A BladeSet can be grown at any time, either by adding more StorageBlade modules, or by merging two existing BladeSets together.

The Volume is a directory hierarchy that has a quota constraint and is assigned to a particular BladeSet.  The quota can be changed at any time, and capacity is not allocated to the Volume until it is used, so multiple volumes compete for space within their BladeSet and grow on demand.  The files in those volumes are distributed among all the StorageBlade modules in the BladeSet.    

Volumes appear in the file system name space as directories. Clients have a single mount point for the whole storage system, and volumes are simply directories below the mount point.  There is no need to update client mounts when the admin creates, deletes, or renames volumes.

Each Volume is managed by a single metadata manager.  Dividing metadata management responsibility along volume boundaries (i.e., directory trees) was done primarily to keep the implementation simple.  We figured that administrators would introduce volumes (i.e., quota trees) for their own reasons, and this would provide an easy, natural boundary.  We were able to delay solving the multi-manager coordination problems created when a parent directory is controlled by a different metadata manager than a file being created, deleted, or renamed within it.  We also had a reasonable availability model for metadata manager crashes; well-defined subtrees would go offline rather than a random sampling of files.  The file system recovery check implementation is also simplified; each volume is checked independently (and in parallel when possible), and errors in one volume don’t affect availability of other volumes. Finally, clients bypass the metadata manager during read and write operations, so the metadata manager’s load is already an order of magnitude smaller than a traditional file server storing the same number of files. This reduces the importance of fine-grain metadata load balancing.  That said, uneven volume utilization can result in uneven metadata manager utilization. Our protocol allows the metadata manager to redirect the client to another manager to distribute load, and we plan to exploit this feature in the future to provide finer-grained load balancing.

While it is possible to have a very large system with one BladeSet and one Volume, and we have customers that take this approach, we felt it was important for administrators to be able to configure multiple storage pools and manage quota within them. Our initial model only had a single storage pool: a file would be partitioned into component objects, and those objects would be distributed uniformly over all available storage nodes.  Similarly, metadata management would be distributed by randomly assigning ownership of new files to available metadata managers. This is similar to the Ceph model [Weil06]. The attraction of this model is smooth load balancing among available resources.  There would be just one big file system, and capacity and metadata load would automatically balance.  Administrators wouldn’t need to worry about running out of space, and applications would get great performance from large storage systems.

There are two problems with a single storage pool: the fault and availability model, and performance isolation between different users. If there are ever enough faults to disable access to some files, then the result would be that a random sample of files throughout the storage system would be unavailable.  Even if the faults were transient, such as a node or service crash and restart, there will be periods of unavailability.  Instead of having the entire storage system in one big fault domain, we wanted the administrator to have the option of dividing a large system into multiple fault domains, and of having a well defined availability model in the face of faults.  In addition, with large installations the administrator can assign different projects or user groups to different storage pools.  This isolates the performance and capacity utilization among different groups.

Our storage management design reflects a compromise between the performance and capacity management benefits of a large storage pool, the backup and restore requirements of the administrator, and the complexity of the implementation.  In practice, our customers use BladeSets that range in size from a single shelf to more than 20 shelves, with the largest production Bladeset being about 50 shelves, or 500 StorageBlade modules and 50 DirectorBlade modules.  The most common sizes, however, range from 5 to 10 shelves.  While we encourage customers to introduce Volumes so the system can better exploit the DirectorBlade modules, we have customers that run large systems (e.g., 20 shelves) with a single Volume.

3.1         Automatic Capacity Balancing

Capacity imbalance occurs when expanding a BladeSet (i.e., adding new, empty storage nodes), merging two BladeSets, and replacing a storage node following a failure.  In the latter scenario, the imbalance is the result of our RAID rebuild, which uses spare capacity on every storage node rather than dedicating a specific “hot spare” node. This provides better throughput during rebuild (see section 4.2), but causes the system to have a new, empty storage node after the failed storage node is replaced. Our system automatically balances used capacity across storage nodes in a BladeSet using two mechanisms: passive balancing and active balancing.

Passive balancing changes the probability that a storage node will be used for a new component of a file, based on its available capacity. This takes effect when files are created, and when their stripe size is increased to include more storage nodes.  Active balancing is done by moving an existing component object from one storage node to another, and updating the storage map for the affected file.  During the transfer, the file is transparently marked read-only by the storage management layer, and the capacity balancer skips files that are being actively written. Capacity balancing is thus transparent to file system clients.

Capacity balancing can serve to balance I/O load across the storage pool.  We have validated this in large production systems.  Of course there can always be transient hot spots based on workload.  It is important to avoid long term hot spots, and we did learn from some mistakes.  The approach we take is to use a uniform random placement algorithm for initial data placement, and then preserve that during capacity balancing.  The system must strive for a uniform distribution of both objects and capacity. This is more subtle than it may appear, and we learned that biases in data migration and placement can cause hot spots.

Initial data placement is uniform random, with the components of a file landing on a subset of available storage nodes.  Each new file gets a new, randomized storage map.  However, the uniform random distribution is altered by passive balancing that biases the creation of new data onto emptier blades. On the surface, this seems reasonable. Unfortunately, if a single node in a large system has a large bias as the result of being replaced recently, then it can end up with a piece of every file created over a span of hours or a few days. In some workloads, recently created files may be hotter than files created several weeks or months ago. Our initial implementation allowed large biases, and we occasionally found this led to a long-term hot spot on a particular storage node. Our current system bounds the effect of passive balancing to be within a few percent of uniform random, which helps the system fine tune capacity when all nodes are nearly full, but does not cause a large bias that can lead to a hot spot.

Another bias we had was favoring large objects for active balancing because it is more efficient. There is per-file overhead to update its storage map, so it is more efficient to move a single 1 GB component object than to move 1000 1 MB component objects. However, consider a system that has relatively few large files that are widely striped, and lots of other small files.  When it is expanded from N to N+M storage nodes (e.g., grows from 50 to 60), should the system balance capacity by moving a few large objects, or by moving many small objects?  If the large files are hot, it is a mistake to bias toward them because the new storage nodes can get a disproportionate number of hot objects.  We found that selecting a uniform random sample of objects from the source blades was the best way to avoid bias and inadvertent hot spots, even if it means moving lots of small objects to balance capacity.

4           Object RAID and Reconstruction

We protect against loss of a data object or an entire storage node by striping files across objects stored on different storage nodes, using a fault-tolerant striping algorithm such as RAID-1 or RAID-5.  Small files are mirrored on two objects, and larger files are striped more widely to provide higher bandwidth and less capacity overhead from parity information. The per-file RAID layout means that parity information for different files is not mixed together, and easily allows different files to use different RAID schemes alongside each other. This property and the security mechanisms of the OSD protocol [Gobioff97] let us enforce access control over files even as clients access storage nodes directly. It also enables what is perhaps the most novel aspect of our system, client-driven RAID. That is, the clients are responsible for computing and writing parity. The OSD security mechanism also allows multiple metadata managers to manage objects on the same storage device without heavyweight coordination or interference from each other.

Client-driven, per-file RAID has four advantages for large-scale storage systems.  First, by having clients compute parity for their own data, the XOR power of the system scales up as the number of clients increases.  We measured XOR processing during streaming write bandwidth loads at 7% of the client’s CPU, with the rest going to the OSD/iSCSI/TCP/IP stack and other file system overhead.  Moving XOR computation out of the storage system into the client requires some additional work to handle failures. Clients are responsible for generating good data and good parity for it. Because the RAID equation is per-file, an errant client can only damage its own data. However, if a client fails during a write, the metadata manager will scrub parity to ensure the parity equation is correct.

The second advantage of client-driven RAID is that clients can perform an end-to-end data integrity check.  Data has to go through the disk subsystem, through the network interface on the storage nodes, through the network and routers, through the NIC on the client, and all of these transits can introduce errors with a very low probability.  Clients can choose to read parity as well as data, and verify parity as part of a read operation.  If errors are detected, the operation is retried.  If the error is persistent, an alert is raised and the read operation fails.  We have used this facility to track down flakey hardware components; we have found errors introduced by bad NICs, bad drive caches, and bad customer switch infrastructure.  While file systems like ZFS [ZFS] maintain block checksums within a local file system, which does not address errors introduced during the transit of information to a network client.  By checking parity across storage nodes within the client, the system can ensure end-to-end data integrity.  This is another novel property of per-file, client-driven RAID.

Third, per-file RAID protection lets the metadata managers rebuild files in parallel. Although parallel rebuild is theoretically possible in block-based RAID, it is rarely implemented. This is due to the fact that the disks are owned by a single RAID controller, even in dual-ported configurations.  Large storage systems have multiple RAID controllers that are not interconnected.  Since the SCSI Block command set does not provide fine-grained synchronization operations, it is difficult for multiple RAID controllers to coordinate a complicated operation such as an online rebuild without external communication. Even if they could, without connectivity to the disks in the affected parity group, other RAID controllers would be unable to assist. Even in a high-availability configuration, each disk is typically only attached to two different RAID controllers, which limits the potential speedup to 2x.

When a StorageBlade module fails, the metadata managers that own Volumes within that BladeSet determine what files are affected, and then they farm out file reconstruction work to every other metadata manager in the system.  Metadata managers rebuild their own files first, but if they finish early or do not own any Volumes in the affected Bladeset, they are free to aid other metadata managers.  Declustered parity groups [Holland92] spread out the I/O workload among all StorageBlade modules in the BladeSet.  The result is that larger storage clusters reconstruct lost data more quickly.  Scalable reconstruction performance is presented later in this paper.

The fourth advantage of per-file RAID is that unrecoverable faults can be constrained to individual files.  The most commonly encountered double-failure scenario with RAID-5 is an unrecoverable read error (i.e., grown media defect) during the reconstruction of a failed storage device.  The 2nd storage device is still healthy, but it has been unable to read a sector, which prevents rebuild of the sector lost from the first drive and potentially the entire stripe or LUN, depending on the design of the RAID controller. With block-based RAID, it is difficult or impossible to directly map any lost sectors back to higher-level file system data structures, so a full file system check and media scan will be required to locate and repair the damage. A more typical response is to fail the rebuild entirely.  RAID controllers monitor drives in an effort to scrub out media defects and avoid this bad scenario, and the Panasas system does media scrubbing, too.  However, with high capacity SATA drives, the chance of encountering a media defect on drive B while rebuilding drive A is still significant.  With per-file RAID-5, this sort of double failure means that only a single file is lost, and the specific file can be easily identified and reported to the administrator. While block-based RAID systems have been compelled to introduce RAID-6 (i.e., fault tolerant schemes that handle two failures), we have been able to deploy highly reliable RAID-5 systems with large, high performance storage pools.

4.1         RAID I/O Performance

This section shows I/O performance as a function of the size of the storage system, the number of clients, and the striping configuration.  Streaming I/O and random I/O performance are shown.

   Figure 2: IOzone Streaming Bandwidth MB/sec

Figure 2 charts iozone [Iozone] streaming bandwidth performance from a cluster of up to 100 clients against storage clusters of 1, 2, 4 and 8 shelves.  Each client ran two instances of iozone writing and reading a 4GB file with 64KB record size.  (Note that the X-axis is not linear; there is a jump from 160 I/O streams to 200.)  Appendix I summarizes the details of the hardware used in the experiments.

This is a complicated figure, but there are two basic results.  The first is that performance increases linearly as the size of the storage system increases.  The second is that write performance scales up and stays flat as the number of clients increases, while the read performance tails off as the number of clients increases.  The write performance curves demonstrate the performance scalability.  A one-shelf system delivered about 330 MB/sec, a two-shelf system delivered about 640 MB/sec, a four-shelf system delivered about 1280 MB/sec, and the eight-shelf system peaked around 2500 MB/sec. This corresponds to a scaling factor that is 95% of linear. In another experiment, a 30-shelf system achieved just over 10 GB/sec of read performance, for a per-shelf bandwidth of 330 MB/sec. 

These kinds of results depend on adequate network bandwidth between clients and the storage nodes.  They also require a 2-level RAID striping pattern for large files to avoid network congestion [Nagle04].  For a large file, the system allocates parity groups of 8 to 11 storage nodes until all available storage nodes have been used.  Approximately 1 GB of data (2000 stripes) is stored in each parity group before rotating to the next one. When all parity groups have been used, the file wraps around to the first group again.  The system automatically selects the size of the parity group so that an integral number of them fit onto the available storage nodes with the smallest unused remainder. The 2-level RAID pattern concentrates I/O on a small number of storage nodes, yet still lets large files expand to cover the complete set of storage nodes. Each file has its own mapping of parity groups to storage nodes, which diffuses load and reduces hot-spotting.

The difference between read and write scaling stems from the way OSDFS writes data.  It performs delayed block allocation for new data so it can be batched and written efficiently.  Thus new data and its associated metadata (i.e., indirect blocks) are streamed out to the next available free space, which results in highly efficient utilization of the disk arm.  Read operations, in contrast, must seek to get their data because the data sets are created to be too large to fit in any cache.  While OSDFS does object-aware read ahead, as the number of concurrent read streams increases, it becomes more difficult to optimize the workload because the amount of read-ahead buffering available for each stream shrinks.

Figure 3 charts iozone performance for mixed (i.e., read and write) random I/O operations against a 4 GB file with different transfer sizes and different numbers of clients. Each client has its own file, so the working set size increases with more clients.  Two parameters were varied: the amount of memory on the StorageBlade modules, and the I/O transfer size.  The vertical axis shows the throughput of the storage system, and the chart compares the different configurations as the number of clients increases from 1 to 6.  The results show that larger caches on the StorageBlade modules can significantly improve the performance of small block random I/O.

  Figure 3: Mixed Random I/O MB/sec

We tested two different hardware configurations: StorageBlade modules with 512 MB of memory (labeled as “.5GB $”) and with 2 GB of memory (labeled “2GB $”).  In each case the system had 9 StorageBlade modules, so the total memory on the StorageBlade modules was 4.5 GB and 18 GB, respectively. Two different transfer sizes are used: 64 KB matches the stripe unit size, and 4 KB is the underlying block size of OSDFS.  Obviously, the larger memory configuration is able to cache most or all of the working set with small numbers of clients.  As the number of clients increases such that the working set size greatly exceeds the cache, then the difference in cache size will matter less.  The throughput with 4 KB random I/O is very low with inadequate cache.  One client gets approximately 1.1 MB/sec, or about 280 4 KB ops/sec, and the rate with 4 clients drops to 700 KB/sec, or about 175 ops/sec.  The 4 KB and 64 KB writes in the mixed workload require four OSD operations to complete the RAID-5 update to the full stripe (two reads, two writes).  In addition, we observed extra I/O traffic between the client cache and the OSD due to read ahead and write gathering optimizations that are enabled by default to optimize streaming workloads.  The iozone test does 1 million I/Os from each client in the 4 KB block and 4 GB file case, so we elected not to run that with 5 and 6 clients in the 512 MB cache configuration simply because it ran too long.

4.2         RAID Rebuild Performance

RAID rebuild performance determines how quickly the system can recover data when a storage node is lost.  Short rebuild times reduce the window in which a second failure can cause data loss.  There are three techniques to reduce rebuild times: reducing the size of the RAID parity group, declustering the placement of parity group elements, and rebuilding files in parallel using multiple RAID engines. 

The rebuild bandwidth is the rate at which reconstructed data is written to the system when a storage node is being reconstructed.  The system must read N times as much as it writes, depending on the width of the RAID parity group, so the overall throughput of the storage system is several times higher than the rebuild rate.  A narrower RAID parity group requires fewer read and XOR operations to rebuild, so will result in a higher rebuild bandwidth.  However, it also results in higher capacity overhead for parity data, and can limit bandwidth during normal I/O.  Thus, selection of the RAID parity group size is a tradeoff between capacity overhead, on-line performance, and rebuild performance.

Understanding declustering is easier with a picture.  In Figure 4, each parity group has 4 elements, which are indicated by letters placed in each storage device.  They are distributed among 8 storage devices.  The ratio between the parity group size and the available storage devices is the declustering ratio, which in this example is ˝.  In the picture, capital letters represent those parity groups that all share the 2nd storage node.  If the 2nd storage device were to fail, the system would have to read the surviving members of its parity groups to rebuild the lost elements.  You can see that the other elements of those parity groups occupy about ˝ of each other storage device.

Figure 4: Declustered parity groups

 

For this simple example you can assume each parity element is the same size so all the devices are filled equally.  In a real system, the component objects will have various sizes depending on the overall file size, although each member of a parity group will be very close in size.  There will be thousands or millions of objects on each device, and the Panasas system uses active balancing to move component objects between storage nodes to level capacity.

Declustering means that rebuild requires reading a subset of each device, with the proportion being approximately the same as the declustering ratio. The total amount of data read is the same with and without declustering, but with declustering it is spread out over more devices.  When writing the reconstructed elements, two elements of the same parity group cannot be located on the same storage node.  Declustering leaves many storage devices available for the reconstructed parity element, and randomizing the placement of each file’s parity group lets the system spread out the write I/O over all the storage. Thus declustering RAID parity groups has the important property of taking a fixed a