Check out the new USENIX Web site.
FCW '11 Banner Tab
USENIX HotStorage '11 Banner


Tuesday, June 14, 2011
10:15 a.m.–noon

Don't Thrash: How to Cache Your Hash on Flash
Back to Program
Many large storage systems use approximate-membership-query (AMQ) data structures to deal with the massive amounts of data that they process. An AMQ data structure is a dictionary that trades off space for a false positive rate on membership queries. It is designed to fit into small, fast storage, and it is used to avoid I/Os on slow storage. The Bloom filter is a well-known example of an AMQ data structure. Bloom filters, however, do not scale outside of main memory. This paper describes the Cascade Filter, an AMQ data structure that scales beyond main memory, supporting over half a million insertions/deletions per second and over 500 lookups per second on a commodity flash-based SSD.

Onyx: A Prototype Phase Change Memory Storage Array
Back to Program
We describe a prototype high-performance solid-state drive based on first-generation phase-change memory (PCM) devices called Onyx. Onyx has a capacity of 10 GB and connects to the host system via PCIe. We describe the internal architecture of Onyx including the PCM memory modules we constructed and the FPGA-based controller that manages them. Onyx can perform a 4 KB random read in 38 μs and sustain 191K 4 KB read IO operations per second. A 4 KB write requires 179 μs. We describe our experience tuning the Onyx system to reduce the cost of wear-leveling and increase performance. We find that Onyx out-performs a state-of-the-art flash-based SSD for small writes (< 2 KB) by between 72 and 120% and for reads of all sizes. In addition, Onyx incurs 20-51% less CPU overhead per IOP for small requests. Combined, our results demonstrate that even first-generation PCM SSDs can out-perform flash-based arrays for the irregular (and frequently read-dominated) access patterns that define many of today's "killer" storage applications. Next generation PCM devices will widen the performance gap further and set the stage for PCM becoming a serious flash competitor in many applications.

SSD Characterization: From Energy Consumption's Perspective
Back to Program
In this work, we perform μsec time scale analysis on energy consumption behavior of the SSD Write operation and exploit this information to extract key technical characteristics of SSD internals: channel utilization policy, page allocation strategy, cluster size, channel switch delay, way switch delay, etc. We found that some SSDs adopt a multi-page cluster as a write unit instead of a page. We found that SSDs adopt significantly different ways of exploiting channel-level and way-level parallelism to maximize write throughput, which governs the peak current consumption. The X25M(Intel) emphasizes the performance aspect of SSDs and linearly increases the channel parallelism as the IO size increases. The MXP(Samsung) puts more emphasis on energy consumption aspect of SSD and controls the degree of parallelism to reduce the peak current consumption. Cluster size of the X25M and the MXP correspond to one and eight pages, respectively. The current consumed when writing a page to NAND flash varies significantly depending on the NAND model(17 mA- 35 mA).

Exploiting Heat-Accelerated Flash Memory Wear-Out Recovery to Enable Self-Healing SSDs
Back to Program
This paper proposes a self-healing solid-state drive (SSD) design strategy that exploits heat-accelerated recovery of NAND flash memory cell wear-out to improve SSD lifetime. The key is to make each NAND flash memory chip self-healable by stacking an extra heater die, and to employ system-level redundancy to ensure SSD data storage integrity when one memory chip is being self-heated for memory cell wear-out recovery. Based upon detailed thermal modeling and memory cell device modeling, we carried out simulations and performed detailed analysis. The results show that SSD lifetime can be improved by over five times at reasonable performance and energy consumption overhead.

1:00 p.m.–2:15 p.m.

Italian for Beginners: The Next Steps for SLO-Based Management
Back to Program
Literature is rife with compelling ideas on simplifying storage management using service-level objectives (SLOs). However, very few of these ideas have actually been productized, despite the fact that many of the original ideas came from industry and were developed more than a decade ago. While many good research ideas do not become products, in this case, we believe that there are important reasons why adoption has been slow. This paper investigates the reasons for slow adoption and discusses ideas that can truly simplify storage management using SLOs.

In Search of I/O-Optimal Recovery from Disk Failures
Back to Program
We address the problem of minimizing the I/O needed to recover from disk failures in erasure-coded storage systems. The principal result is an algorithm that finds the optimal I/O recovery from an arbitrary number of disk failures for any XOR-based erasure code. We also describe a family of codes with high-fault tolerance and low recovery I/O, e.g. one instance tolerates up to 11 failures and recovers a lost block in 4 I/Os. While we have determined I/O optimal recovery for any given code, it remains an open problem to identify codes with the best recovery properties. We describe our ongoing efforts toward characterizing space overhead versus recovery I/O tradeoffs and generating codes that realize these bounds.

ViDeDup: An Application-Aware Framework for Video De-duplication
Back to Program
Key to the compression-capability of a data de-duplication system is the definition of redundancy. Traditionally, two data items are considered redundant if their underlying bit-streams are identical. However, this notion of redundancy is too strict for many applications. For example, for a video storage platform, two videos encoded in different formats would be unique at the system level but redundant at the content level. Intuitively, introducing application-level intelligence in redundancy detection can yield improved data compression. We propose ViDeDup (Video De-Duplication), a novel framework for video de-duplication based on an application-level view of redundancy. The framework goes beyond duplicate data detection to similarity-detection, thereby providing application-level knobs for defining acceptable level of noise during replica detection. Our results show that by trading CPU for storage, a 45% reduction in storage space could be achieved, in comparison to 8% yielded by system level de-duplication for a dataset collected from video sharing sites on the Web. We also present tradeoff analysis for various tunable parameters of the system to optimally tune the system for performance, compression and quality.

2:30 p.m.–3:45 p.m.

Truly Non-blocking Writes
Back to Program
Writing data within user or operating system memory often encounters the classic read-before-write problem whereby the page written to must first be read from the backing store, effectively blocking the writing process before modifications are made. Unfortunately, the large gap between memory and storage access performance adversely affects workloads that require substantial read-before-write operations when accessing memory. In this paper, we present techniques that make writes to memory truly non-blocking. The basic approach involves absorbing writes immediately in temporary buffer pages and asynchronously merging the updates after reading in the on-disk version of the page. Doing so improves system performance by first, reducing blocking of processes and second, improving the parallelism of data retrieval from the backing store leading to better throughput for read-before-write operations. We analyze the potential benefits of our approach using full-system memory access traces for several benchmark workloads and present techniques that commodity operating systems can employ to implement non-blocking writes.

Exposing File System Mappings with MapFS
Back to Program
The conventional model of a file as a contiguous array of bytes hides information about the physical location of data from users. While this simplifying abstraction can be useful in some cases, it can also lead to suboptimal performance and unnecessary overhead. A growing number of applications – even those as basic as the Unix cp utility – can benefit from increased access to file system metadata. We present MapFS, a file system which allows applications to create, inspect, modify, and remove the mappings established between individual files and physical storage. MapFS gives users increased power and flexibility, facilitates true end-to-end application design, and optimizes many common file system tasks.

Stratified B-trees and Versioned Dictionaries
Back to Program
External-memory versioned dictionaries are fundamental to file systems, databases and many other algorithms. The ubiquitous data structure is the copy-on-write (CoW) B-tree. Unfortunately, it doesn't inherit the B-tree's optimality properties; it has poor space utilization, cannot offer fast updates, and relies on random IO to scale. We describe the 'stratified B-tree', which is the first versioned dictionary offering fast updates and an optimal tradeoff between space, query and update costs.

3:55 p.m.–4:30 p.m.

Using Storage Class Memory for Archives with DAWN, a Durable Array of Wimpy Nodes
Back to Program
The long life and low usage of archival data make cost considerations paramount. Today, most archival storage architectures depend on magnetic or optical media such as tape and disk because they have a low initial cost per byte. The high initial cost of storage class memories (SCMs) such as flash has been seen as prohibitive for archival use. Nevertheless, SCMs have many advantages for archival use, including physical robustness and low power usage. In this work, we argue that the advantages of SCMs outweigh their relatively high acquisition cost, and outline the use of SCM in a Pergamum and FAWN like architecture. Our extension to those architectures we call DAWN, a Durable Array of Wimpy Nodes. DAWN will make use of low-power system-on-chip technology paired with SCM to provide a simple, reliable, self-managing archival storage system.

Principles of Operation for Shingled Disk Devices
Back to Program
A leading strategy for driving the areal density of magnetic disk drives through 1–10 terabit/inch2 (the coming decade) is to shingle (partially overlap) adjacent tracks, imposing significant restrictions on where data can be written without incurring multi-track read-modify-write penalties. These restrictions and penalties can be 1) fully hidden from system software using techniques familiar in NAND Flash disks; 2) minimally exposed as multi-track, shingled bands of predetermined size that can be read normally, but only appended to or trimmed (erased); or 3) maximally exposed as dynamically sized bands of shingles separated by guard regions of previously erased tracks, allowing maximal capacity to be obtained by the most sophisticated system software. While the latter options require significant changes in system software, there is a rich history of demonstrations of log-structured file systems that should be able to do this, and a profusion of write-once cloud storage systems that could provide the economic "killer application". Now is a very good time for systems software experts to take interest and weigh in as magnetic disk technologists are experimenting and prototyping shingled disks. Experience shows that changes in the interface model for magnetic disks can take decades to change (for example, 512B to 4096B sectors) unless device vendors and systems software developers work together toward a mutually desired principles of operation. For more information, please see the full paper published as Parallel Data Lab (PDL) Technical Report #CMU-PDL-11-107.

?Need help? Use our Contacts page.

Last changed: 24 August 2011 jel