Introduction

Storage arrays form the backbone of modern data centers by providing consolidated data access to multiple applications simultaneously. Deployments of consolidated storage using Storage Area Network (SAN) or Network-Attached Storage (NAS) hardware are increasing, motivated by easy access to data from anywhere at any time, ease of backup, flexibility in provisioning, and centralized administration. This trend is further fueled by the proliferation of virtualization technologies, which rely on shared storage to support features such as live migration of workloads across hosts.

A typical virtualized data center consists of multiple physical hosts, each running several virtual machines (VMs). Many VMs may compete for access to one or more logical units (LUNs) on a single storage array. The resulting contention at the array for resources such as controllers, caches, and disk arms leads to unpredictable IO completion times. Resource management mechanisms and policies are required to enable performance isolation, control service rates, and enforce service-level agreements.

In this paper, we target the problem of providing coarse-grained fairness to VMs, without assuming any support from the storage array itself. We also strive to remain work-conserving, so that the array is utilized efficiently. We focus on proportionate allocation of IO resources as a flexible building block for constructing higher-level policies. This problem is challenging for several reasons, including the need to treat the array as an unmodifiable black box, unpredictable array performance, uncertain available bandwidth, and the desire for a scalable decentralized solution.

Many existing approaches [13,16,14,25,28,27,21] allocate bandwidth among multiple applications running on a single host. In such systems, one centralized scheduler has complete control over all requests to the storage system. Other centralized schemes [19,30] attempt to control the queue length at the device to provide tight latency bounds. Although centralized schedulers are useful for host-level IO scheduling, in our virtualized environment we need an approach for coordinating IO scheduling across multiple independent hosts accessing a shared storage array.

More decentralized approaches, such as Triage [18], have been proposed, but still rely on centralized measurement and control. A central agent adjusts per-host bandwidth caps over successive time periods and communicates them to hosts. Throttling hosts using caps can lead to substantial inefficiency by under-utilizing array resources. In addition, host-level changes such as VMs becoming idle need to propagate to the central controller, which may cause a prohibitive increase in communication costs.

We instead map the problem of distributed storage access from multiple hosts to the problem of flow control in networks. In principle, fairly allocating storage bandwidth with high utilization is analogous to distributed hosts trying to estimate available network bandwidth and consuming it in a fair manner. The network is effectively a black box to the hosts, providing little or no information about its current state and the number of participants. Starting with this loose analogy, we designed PARDA, a new software system that enforces coarse-grained proportional-share fairness among hosts accessing a storage array, while still maintaining high array utilization.

PARDA uses the IO latency observed by each host as an indicator of load at the array, and applies a control equation to adjust the number of IOs issued per host, i.e., the host window size. We found that variability in IO latency, due to both request characteristics (e.g., degree of sequentiality, reads vs. writes, and IO size) and array internals (e.g., request scheduling, caching, and block placement), could be magnified by the independent control loops running at each host, resulting in undesirable divergent behavior.
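
To make the control loop concrete, the sketch below shows the general shape of such a latency-driven window adjustment. It is a minimal sketch: the smoothing constant, latency threshold, and exact form of the control equation are assumptions here (the actual equation is developed in Section 3), and names such as adjust_window are hypothetical.

    # Minimal sketch of a latency-driven window controller (assumed form;
    # PARDA's actual control equation is developed in Section 3).
    GAMMA = 0.2               # smoothing parameter in (0, 1] (assumed value)
    LATENCY_THRESHOLD = 25.0  # target latency, in ms (assumed value)

    def adjust_window(window, observed_latency, beta):
        """Recompute the host window size (outstanding IOs) once per period.

        window           -- current window size
        observed_latency -- latency indicator for the last period, in ms
        beta             -- per-host parameter proportional to the host's weight
        """
        # Smooth toward a latency-proportional target: shrink the window when
        # observed latency exceeds the threshold, grow it when latency is below.
        target = (LATENCY_THRESHOLD / observed_latency) * window + beta
        return max(1.0, (1 - GAMMA) * window + GAMMA * target)

Intuitively, when the array is overloaded and latencies rise above the threshold, every host's window contracts; when latencies fall below it, windows expand, keeping the array utilized.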

To handle such variability, we found that using the average latency observed across all hosts as an indicator of overall load produced stable results. Although this approach does require communication between hosts, we need only compute a simple average for a single metric, which can be accomplished using a lightweight, decentralized aggregation mechanism. PARDA also handles idle VMs and bursty workloads by adapting per-host weights based on long-term idling behavior, and by using a local scheduler at the host to handle short-term bursts. Integrating with a local proportional-share scheduler [10] enables fair end-to-end access to VMs in a distributed environment.
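
As an illustration of how little coordination this requires, the sketch below averages one published latency value per host. The publish and read_all_host_latencies helpers are hypothetical stand-ins for whatever lightweight aggregation mechanism a deployment uses.

    def cluster_average_latency(my_latency, publish, read_all_host_latencies):
        """Combine per-host latency observations into one cluster-wide signal.

        my_latency              -- this host's locally smoothed IO latency (ms)
        publish                 -- makes this host's value visible to all hosts
        read_all_host_latencies -- returns the latest value from every host
        Both helpers are hypothetical; for example, they could read and write
        per-host slots in a small shared location accessible to all hosts.
        """
        publish(my_latency)
        samples = read_all_host_latencies()
        return sum(samples) / len(samples)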

We implemented a complete PARDA prototype in the VMware ESX Server hypervisor [24]. For simplicity, we assume all hosts use the same PARDA protocol to ensure fairness, a reasonable assumption in most virtualized clusters. Since hosts run compatible hypervisors, PARDA can be incorporated into the virtualization layer, and remain transparent to the operating systems and applications running within VMs. We show that PARDA can maintain cluster-level latency close to a specified threshold, provide coarse-grained fairness to hosts in proportion to per-host weights, and provide end-to-end storage IO isolation to VMs or applications while handling diverse workloads.

The next section presents our system model and goals in more detail. Section 3 develops the analogy to network flow control, and introduces our core algorithm, along with extensions for handling bursty workloads. Storage-specific challenges that required extensions beyond network flow control are examined in Section 4. Section 5 evaluates our implementation using a variety of quantitative experiments. Related work is discussed in Section 6, while conclusions and directions for future work are presented in Section 7.
