
The full Proceedings published by USENIX for the conference are available for download below. Individual papers can also be downloaded from the presentation page. Copyright to the individual works is retained by the author[s].

Proceedings Front Matter 
Cover Page | Title Page and List of Organizers | Table of Contents | Message from the Program Co-Chairs

Full Proceedings PDFs
 FAST '15 Full Proceedings (PDF)
 FAST '15 Proceedings Interior (PDF, best for mobile devices)
 Errata Slip (PDF)

Full Proceedings ePub (for iPad and most eReaders)
 FAST '15 Full Proceedings (ePub)

Full Proceedings Mobi (for Kindle)
 FAST '15 Full Proceedings (Mobi)

Download Proceedings Archives (Conference Attendees Only)

Attendee Files 
FAST '15 Proceedings Archive (ZIP, includes Conference Attendee list)

 

Tuesday, February 17, 2015

8:00 am–9:00 am Tuesday

Continental Breakfast

Grand Ballroom Foyer

9:00 am–10:30 am Tuesday

Opening Remarks and Best Paper Awards

9:00 am–9:15 am

Grand Ballroom ABCFGH

Program Co-Chairs: Jiri Schindler, SimpliVity, and Erez Zadok, Stony Brook University

The Theory of Everything: Scaling for Future Systems

Grand Ballroom ABCFGH
Session Chair: Raju Rangaswami, Florida International University

CalvinFS: Consistent WAN Replication and Scalable Metadata Management for Distributed File Systems

Alexander Thomson, Google; Daniel J. Abadi, Yale University

Existing file systems, even the most scalable systems that store hundreds of petabytes (or more) of data across thousands of machines, store file metadata on a single server or via a shared-disk architecture in order to ensure consistency and validity of the metadata.

This paper describes a completely different approach for the design of replicated, scalable file systems, which leverages a high-throughput distributed database system for metadata management. This results in improved scalability of the metadata layer of the file system, as file metadata can be partitioned (and replicated) across a (shared-nothing) cluster of independent servers, and operations on file metadata transformed into distributed transactions.

In addition, our file system is able to support standard file system semantics—including fully linearizable random writes by concurrent users to arbitrary byte offsets within the same file—across wide geographic areas. Such high-performance, fully consistent, geographically distributed file systems do not exist today.

We demonstrate that our approach to file system design can scale to billions of files and handle hundreds of thousands of updates and millions of reads per second—while maintaining consistently low read latencies. Furthermore, such a deployment can survive entire datacenter outages with only small performance hiccups and no loss of availability.
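
To make the metadata-partitioning idea concrete, here is a minimal Python sketch in which file metadata is hash-partitioned across a small set of servers and a rename touches the partitions owning the source and destination paths. The partitioning function and the in-process "transaction" are illustrative stand-ins for CalvinFS's distributed database layer, not its actual implementation.

```python
# Illustrative sketch only: hash-partitioned file metadata, with rename as an
# operation that spans the partitions owning the source and destination paths.
# In CalvinFS this would execute as a distributed transaction; here it is a
# plain in-process update so the structure of the idea is visible.
import zlib

N_PARTITIONS = 4
partitions = [{} for _ in range(N_PARTITIONS)]   # path -> metadata entry

def partition_of(path):
    return zlib.crc32(path.encode()) % N_PARTITIONS

def create(path, entry):
    partitions[partition_of(path)][path] = entry

def rename(src, dst):
    # would be one distributed transaction over two metadata partitions
    entry = partitions[partition_of(src)].pop(src)
    partitions[partition_of(dst)][dst] = entry

create("/a/log", {"size": 0, "blocks": []})
rename("/a/log", "/b/log.old")
print(partition_of("/a/log"), "->", partition_of("/b/log.old"))
```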

Available Media

Analysis of the ECMWF Storage Landscape

Matthias Grawinkel, Lars Nagel, Markus Mäsker, Federico Padua, and André Brinkmann, Johannes-Gutenberg University Mainz; Lennart Sorth, European Centre for Medium-Range Weather Forecasts

Although domain-specific digital archives are growing in number and size, there is a lack of studies describing their architectures and runtime characteristics. This paper investigates the storage landscape of the European Centre for Medium-Range Weather Forecasts (ECMWF), whose storage capacity has reached 100 PB and experiences an annual growth rate of about 45%. Out of this storage, we examine a 14.8 PB user archive and a 37.9 PB object database for meteorological data over periods of 29 and 50 months, respectively.

We analyze the system’s log files to characterize traffic and user behavior, metadata snapshots to identify the current content of the storage systems, and logs of tape libraries to investigate cartridge movements. We have built a caching simulator to examine the efficiency of disk caches for various cache sizes and algorithms, and we investigate the potential of tape prefetching strategies. While the findings for the user archive resemble previous studies on digital archives, our study of the object database is the first one in the field of large-scale active archives.

Available Media

Efficient Intra-Operating System Protection Against Harmful DMAs

Moshe Malka, Nadav Amit, and Dan Tsafrir, Technion—Israel Institute of Technology

Operating systems can defend themselves against misbehaving I/O devices and drivers by employing intra-OS protection. With “strict” intra-OS protection, the OS uses the IOMMU to map each DMA buffer immediately before the DMA occurs and to unmap it immediately after. Strict protection is costly due to IOMMU-related hardware overheads, motivating “deferred” intra-OS protection, which trades off some safety for performance.

We investigate the Linux intra-OS protection mapping layer and discover that hardware overheads are not exclusively to blame for its high cost. Rather, the cost is amplified by the I/O virtual address (IOVA) allocator, which regularly induces linear complexity. We find that the nature of IOVA allocation requests is inherently simple and constrained due to the manner by which I/O devices are used, allowing us to deliver constant time complexity with a compact, easy-to-implement optimization. Our optimization improves the throughput of standard benchmarks by up to 5.5x. It delivers strict protection with performance comparable to that of the baseline deferred protection.
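
A minimal sketch of the kind of constant-time allocation the authors argue is possible, assuming (as the paper observes) that devices request IOVA ranges of only a few fixed sizes: freed ranges are cached per size and handed back in O(1). The class and field names are invented for illustration and do not reflect the Linux or FreeBSD code.

```python
# Sketch: constant-time IOVA allocation by recycling freed ranges per request
# size. This illustrates the general idea only; it is not the authors' kernel
# optimization, and all names here are made up for the example.

class IovaAllocator:
    def __init__(self, base=1 << 20):
        self.next_unused = base      # bump pointer for never-used address space
        self.free_by_size = {}       # size -> list of recycled ranges

    def alloc(self, size):
        cache = self.free_by_size.get(size)
        if cache:                    # O(1): reuse a previously freed range
            return cache.pop()
        addr = self.next_unused      # O(1): carve a fresh range
        self.next_unused += size
        return addr

    def free(self, addr, size):
        self.free_by_size.setdefault(size, []).append(addr)  # O(1)

alloc = IovaAllocator()
a = alloc.alloc(4096)
alloc.free(a, 4096)
assert alloc.alloc(4096) == a        # the freed range is recycled immediately
```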

To generalize our case that OSes drive the IOMMU with suboptimal software, we additionally investigate the FreeBSD mapping layer and obtain similar findings.

Available Media

SNIA Industry Track Session 1

Grand Ballroom DE

Download tutorial materials at the SNIA Web site.

NVML: Implementing Persistent Memory Applications

Paul Von Behren, Intel

NVML is an open-source library that simplifies development of applications utilizing byte-addressable persistent memory. The SNIA NVM Programming Model describes basic behavior for a persistent memory-aware file system, enabling applications to directly access persistent memory. NVML extends the SNIA NVM Programming Model, providing application APIs that help applications create and update data structures in persistent memory while avoiding pitfalls such as persistent leaks and inconsistencies due to unexpected hardware or software restarts. This tutorial includes an overview of persistent memory hardware (NVDIMMs) and the SNIA NVM Programming Model, then describes the APIs provided by NVML and examples showing how these APIs may be used by applications.

Paul von Behren is a Software Architect at Intel Corporation. Currently he is co-chair of the SNIA NVM Programming Technical Work Group. His background includes software for managing Fibre Channel, iSCSI, and SAS storage devices, multipath management software, and RAID systems. He has helped lead the creation of the SNIA Storage Management Initiative—Standard (SMI-S) and Multipath Management API standards, and contributed to a number of other storage and management software standards.

Combining SNIA Cloud, Tape, and Container Format Technologies for Long-Term Retention

Sam Fineberg, Hewlett-Packard

Generating and collecting very large data sets is becoming a necessity in many domains that also need to keep that data for long periods. Examples include astronomy, genomics, medical records, photographic archives, video archives, and large-scale e-commerce. While this presents significant opportunities, a key challenge is providing economically scalable storage systems to efficiently store and preserve the data, as well as to enable search, access, and analytics on that data in the far future.

Both cloud and tape technologies are viable alternatives for storage of big data and SNIA supports their standardization. The SNIA Cloud Data Management Interface (CDMI) provides a standardized interface to create, retrieve, update, and delete objects in a cloud. The SNIA Linear Tape File System (LTFS) takes advantage of a new generation of tape hardware to provide efficient access to tape using standard, familiar system tools and interfaces. In addition, the SNIA Self-contained Information Retention Format (SIRF) defines a storage container for long term retention that will enable future applications to interpret stored data regardless of the application that originally produced it.

We'll present advantages and challenges in long-term retention of big data, as well as initial work on how to combine SIRF with LTFS and SIRF with CDMI to address some of those challenges. We will also describe an emerging SIRF specification as well as an implementation of SIRF for the cloud.

Dr. Fineberg is a Distinguished Technologist in the HP Storage Chief Technologist Office. In that role he leads technical strategy for big data, the cloud, and analytics storage. He helped develop the MPI2-I/O API and has over 20 years of experience in storage and high-performance computing I/O. Dr. Fineberg is currently co-chair of the Storage Networking Industry Association (SNIA) Long Term Retention Technical Working Group, and he is a member of the SNIA Analytics and Big Data Committee.

10:30 am–11:00 am Tuesday

Break with Refreshments

Grand Ballroom Foyer

11:00 am–12:30 pm Tuesday

Big: Big Systems

Grand Ballroom ABCFGH
Session Chair: Nisha Talagala, SanDisk

FlashGraph: Processing Billion-Node Graphs on an Array of Commodity SSDs

Da Zheng, Disa Mhembere, Randal Burns, Joshua Vogelstein, Carey E. Priebe, and Alexander S. Szalay, Johns Hopkins University

Graph analysis performs many random reads and writes, thus, these workloads are typically performed in memory. Traditionally, analyzing large graphs requires a cluster of machines so the aggregate memory exceeds the graph size. We demonstrate that a multicore server can process graphs with billions of vertices and hundreds of billions of edges, utilizing commodity SSDs with minimal performance loss. We do so by implementing a graph-processing engine on top of a user-space SSD file system designed for high IOPS and extreme parallelism. Our semi-external memory graph engine called FlashGraph stores vertex state in memory and edge lists on SSDs. It hides latency by overlapping computation with I/O. To save I/O bandwidth, FlashGraph only accesses edge lists requested by applications from SSDs; to increase I/O throughput and reduce CPU overhead for I/O, it conservatively merges I/O requests. These designs maximize performance for applications with different I/O characteristics. FlashGraph exposes a general and flexible vertex-centric programming interface that can express a wide variety of graph algorithms and their optimizations. We demonstrate that FlashGraph in semi-external memory performs many algorithms with performance up to 80% of its in-memory implementation and significantly outperforms PowerGraph, a popular distributed in-memory graph engine.
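
As a rough illustration of the vertex-centric, semi-external-memory style described above, the Python sketch below runs BFS with vertex state in memory while edge lists are fetched on demand; the in-memory dictionary stands in for the SSD-resident edge lists, and the function names are invented for the example.

```python
# Sketch of a vertex-centric BFS in the FlashGraph spirit: vertex state stays
# in memory, and a vertex requests its edge list (which FlashGraph would read
# from SSD) only when it is activated.

edge_lists = {0: [1, 2], 1: [3], 2: [3], 3: []}     # toy graph on "SSD"
distance = {v: None for v in edge_lists}             # in-memory vertex state

def fetch_edge_list(v):
    return edge_lists[v]             # stand-in for an asynchronous SSD read

frontier, distance[0] = [0], 0
while frontier:
    next_frontier = []
    for v in frontier:                       # only active vertices do I/O
        for w in fetch_edge_list(v):
            if distance[w] is None:
                distance[w] = distance[v] + 1
                next_frontier.append(w)
    frontier = next_frontier

print(distance)                              # {0: 0, 1: 1, 2: 1, 3: 2}
```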

Available Media

Host-side Filesystem Journaling for Durable Shared Storage

Andromachi Hatzieleftheriou and Stergios V. Anastasiadis, University of Ioannina

Hardware consolidation in the datacenter occasionally leads to scalability bottlenecks due to the heavy utilization of critical resources, such as the shared network bandwidth. Host-side caching on durable media is already applied at the block level in order to reduce the load of the storage backend. However, block-level caching is often criticized for added overhead and restricted data sharing across different hosts. During client crashes, writeback caching can also lead to unrecoverable loss of written data that was previously acknowledged as stable. We improve the durability of shared storage in the datacenter by supporting journaling at the kernel-level client of an object-based distributed filesystem. Storage virtualization at the file interface achieves clear consistency semantics across data and metadata blocks, supports native file sharing between clients over the same or different hosts, and provides flexible configuration of the time period during which the data is durably staged at the host side. Over a prototype implementation that we developed, we experimentally demonstrate improved performance of up to 58% for specific durability guarantees, and reduced network and disk bandwidth at the storage servers by up to 42% and 82%, respectively.

Available Media

LADS: Optimizing Data Transfers Using Layout-Aware Data Scheduling

Youngjae Kim, Scott Atchley, and Geoffroy R. Vallée, Oak Ridge National Laboratory; Galen M. Shipman, Los Alamos National Laboratory

While future terabit networks hold the promise of significantly improving big-data motion among geographically distributed data centers, significant challenges must be overcome even on today’s 100 gigabit networks to realize end-to-end performance. Multiple bottlenecks exist along the end-to-end path from source to sink. Data storage infrastructure at both the source and sink and its interplay with the wide-area network are increasingly the bottleneck to achieving high performance. In this paper, we identify the issues that lead to congestion on the path of an end-to-end data transfer in the terabit network environment, and we present a new bulk data movement framework called LADS for terabit networks. LADS exploits the underlying storage layout at each endpoint to maximize throughput without negatively impacting the performance of shared storage resources for other users. LADS also uses the Common Communication Interface (CCI) in lieu of the sockets interface to use zero-copy, OS-bypass hardware when available. It can further improve data transfer performance under congestion on the end systems by buffering data at the source in flash storage. In our evaluations, we show that LADS can avoid congested storage elements within the shared storage resource, improving I/O bandwidth and data transfer rates across high-speed networks.

Available Media

Having Your Cake and Eating It Too: Jointly Optimal Erasure Codes for I/O, Storage, and Network-bandwidth

KV Rashmi, Preetum Nakkiran, Jingyan Wang, Nihar B. Shah, and Kannan Ramchandran, University of California, Berkeley

Erasure codes, such as Reed-Solomon (RS) codes, are increasingly being deployed as an alternative to data replication for fault tolerance in distributed storage systems. While RS codes provide significant savings in storage space, they can impose a huge burden on the I/O and network resources when reconstructing failed or otherwise unavailable data. A recent class of erasure codes, called minimum-storage-regeneration (MSR) codes, has emerged as a superior alternative to the popular RS codes, in that it minimizes network transfers during reconstruction while also being optimal with respect to storage and reliability. However, existing practical MSR codes do not address the increasingly important problem of I/O overhead incurred during reconstructions, and are, in general, inferior to RS codes in this regard. In this paper, we design erasure codes that are simultaneously optimal in terms of I/O, storage, and network bandwidth. Our design builds on top of a class of powerful practical codes, called the product-matrix-MSR codes. Evaluations show that our proposed design results in a significant reduction in the number of I/Os consumed during reconstructions (a 5x reduction for typical parameters), while retaining optimality with respect to storage, reliability, and network bandwidth.
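
For intuition on the network-transfer side of this tradeoff, the snippet below applies the standard repair-bandwidth formulas: Reed-Solomon reconstruction reads k full blocks, while an MSR code contacting d helpers transfers about d/(d−k+1) of a block from each. These are textbook bounds, not measurements from the paper, and the parameters are illustrative.

```python
# Back-of-the-envelope repair traffic for one lost block of size B, comparing
# Reed-Solomon with the standard minimum-storage-regenerating (MSR) bound.

def rs_repair(k, block_size):
    return k * block_size                      # read k surviving blocks

def msr_repair(k, d, block_size):
    return d * block_size / (d - k + 1)        # cut-set bound with d helpers

B = 256       # MiB per block, illustrative
k, d = 10, 13
print(f"RS  repair traffic: {rs_repair(k, B):.1f} MiB")        # 2560.0 MiB
print(f"MSR repair traffic: {msr_repair(k, d, B):.1f} MiB")    #  832.0 MiB
```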

Available Media

SNIA Industry Track Session 2

Grand Ballroom DE

Download tutorial materials at the SNIA Web site.

SMB Remote File Protocol (Including SMB 3.x)

Tom Talpey, Microsoft

The SMB protocol evolved over time from CIFS to SMB1 to SMB2, with implementations by dozens of vendors including most major Operating Systems and NAS solutions. The SMB 3.0 protocol had its first commercial implementations by Microsoft, NetApp and EMC by the end of 2012, and many other implementations exist or are in-progress. The SMB3 protocol continues to advance. This SNIA Tutorial describes the basic architecture of the SMB protocol and basic operations, including connecting to a share, negotiating a dialect, executing operations and disconnecting from a share. The second part of the tutorial covers improvements in the version 2 of the protocol, including a reduced command set, support for asynchronous operations, compounding of operations, durable and resilient file handles, file leasing and large MTU support. The final part covers the latest changes in SMB3, including persistent handles (SMB Transparent Failover), active/active clusters (SMB Scale-Out), multiple connections per sessions (SMB Multichannel), support for RDMA protocols (SMB Direct), snapshot-based backups (VSS for Remote File Shares) opportunistic locking of folders (SMB Directory Leasing), and SMB encryption.

Tom Talpey is an Architect in the File Server Team at Microsoft. His responsibilities include SMB 3, SMB Direct (SMB over RDMA), and all the protocols and technologies that support the SMB ecosystem. Tom has worked in the areas of network filesystems, network transports, and RDMA for many years and recently has been working on storage traffic management, with application not only to SMB but in broad end-to-end scenarios. He is a frequent presenter at Storage Dev.

Separate vs. Combined Server Clusters for App Workloads and Shared Storage

Craig Dunwoody, GraphStream Incorporated

A widely used "separate-cluster" architecture for datacenters uses a server cluster, sometimes called a "storage array," that implements highly available shared-storage volumes, and a separate server cluster running application workloads that access those volumes.

An alternative "combined-cluster" architecture, sometimes called "hyperconverged," uses a single server cluster in which every node can participate in implementing highly available shared-storage volumes, and also run application workloads that access those volumes.

For each of these architectures, there are many commercially available implementations. Using technical (not marketing) language, and without naming specific products, this tutorial evaluates key strengths and weaknesses of each approach, including some practical issues that are often omitted in such evaluations.

Craig Dunwoody is co-founder and CTO of two Silicon Valley companies: GraphStream, an integrator of advanced scalable data infrastructure, and Birchbridge, an early-stage startup that is developing a new cabinet-scale datacenter building block product with an innovative physical-layer architecture. Previously, at Silicon Graphics, he developed system software for seven successive generations of industry-leading visual computing systems. He earned BSEE, MSEE, and MSCS degrees from Stanford University, and has co-authored five issued and seven pending U.S. patents.

12:30 pm–2:00 pm Tuesday

Conference Luncheon

Santa Clara Ballroom

2:00 pm–3:30 pm Tuesday

Hackers: Cutting Things to Pieces

Grand Ballroom ABCFGH
Session Chair: Nitin Agrawal, NEC Labs

Efficient MRC Construction with SHARDS

Carl A. Waldspurger, Nohhyun Park, Alexander Garthwaite, and Irfan Ahmad, CloudPhysics, Inc.

Reuse-distance analysis is a powerful technique for characterizing temporal locality of workloads, often visualized with miss ratio curves (MRCs). Unfortunately, even the most efficient exact implementations are too heavyweight for practical online use in production systems.

We introduce a new approximation algorithm that employs uniform randomized spatial sampling, implemented by tracking references to representative locations selected dynamically based on their hash values. A further refinement runs in constant space by lowering the sampling rate adaptively. Our approach, called SHARDS (Spatially Hashed Approximate Reuse Distance Sampling), drastically reduces the space and time requirements of reuse-distance analysis, making continuous, online MRC generation practical to embed into production firmware or system software. SHARDS also enables the analysis of long traces that, due to memory constraints, were resistant to such analysis in the past.
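
The following is a minimal Python sketch of fixed-rate spatially hashed sampling in the spirit of SHARDS: a location is monitored only if its hash falls below a threshold, and measured reuse distances are rescaled by the inverse sampling rate. The hash function, rate, and naive distance computation are illustrative choices; the paper's adaptive, constant-space variant is more involved.

```python
# Fixed-rate SHARDS-style sampling sketch: a block address is tracked only if
# hash(addr) mod P < T (sampling rate R = T/P), and reuse distances measured
# over the sampled stream are scaled up by 1/R to estimate true distances.
import hashlib

P, T = 1 << 24, 1 << 20                    # sampling rate R = T/P = 1/16

def sampled(addr):
    digest = hashlib.blake2b(str(addr).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") % P < T

def sampled_reuse_distances(trace):
    history, last_pos, dists = [], {}, []
    for addr in filter(sampled, trace):
        if addr in last_pos:
            between = set(history[last_pos[addr] + 1:])   # naive O(n) scan
            dists.append(len(between) * (P // T))         # rescale by 1/R
        last_pos[addr] = len(history)
        history.append(addr)
    return dists

# Cyclic trace over 1000 blocks: true reuse distance is 999; the sampled
# estimate lands close to it while tracking only ~1/16 of the locations.
dists = sampled_reuse_distances([i % 1000 for i in range(100000)])
print(sum(dists) / len(dists) if dists else "no sampled addresses")
```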

We evaluate SHARDS using trace data collected from a commercial I/O caching analytics service. MRCs generated for more than a hundred traces demonstrate high accuracy with very low resource usage. MRCs constructed in a bounded 1 MB footprint, with effective sampling rates significantly lower than 1%, exhibit approximate miss ratio errors averaging less than 0.01. For large traces, this configuration reduces memory usage by a factor of up to 10,800 and run time by a factor of up to 204.

Available Media

ANViL: Advanced Virtualization for Modern Non-Volatile Memory Devices

Zev Weiss, University of Wisconsin—Madison; Sriram Subramanian, Swaminathan Sundararaman, and Nisha Talagala, SanDisk; Andrea C. Arpaci-Dusseau and Remzi H. Arpaci-Dusseau, University of Wisconsin—Madison

We present a new form of storage virtualization based on block-level address remapping. By allowing the host system to manipulate this address map with a set of three simple operations (clone, move, and delete), we enable a variety of useful features and optimizations to be readily implemented, including snapshots, deduplication, and single-write journaling. We present a prototype implementation called Project ANViL and demonstrate its utility with a set of case studies.
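
The dict-based toy model below shows how the three remapping primitives named above manipulate a virtual-to-physical address map, for example a clone sharing physical blocks to implement a snapshot. Granularity and naming are simplified for illustration; ANViL itself operates on block ranges inside the storage stack.

```python
# Toy model of an ANViL-style address map exposing clone/move/delete.

class AddressMap:
    def __init__(self):
        self.map = {}                                  # virtual block -> physical block

    def write(self, vaddr, paddr):
        self.map[vaddr] = paddr

    def clone(self, src, dst, length):                 # snapshot / dedup building block
        for i in range(length):
            if src + i in self.map:
                self.map[dst + i] = self.map[src + i]  # share physical blocks

    def move(self, src, dst, length):                  # e.g., commit a journaled write
        for i in range(length):
            if src + i in self.map:
                self.map[dst + i] = self.map.pop(src + i)

    def delete(self, vaddr, length):                   # drop mappings (trim)
        for i in range(length):
            self.map.pop(vaddr + i, None)

amap = AddressMap()
amap.write(0, 1234)
amap.clone(0, 100, 1)          # snapshot block 0 at virtual address 100
amap.move(0, 200, 1)           # relocate the original mapping
print(amap.map)                # {100: 1234, 200: 1234}
```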

Available Media

Reducing File System Tail Latencies with Chopper

Jun He, Duy Nguyen, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau, University of Wisconsin—Madison

We present Chopper, a tool that efficiently explores the vast input space of file system policies to find behaviors that lead to costly performance problems. We focus specifically on block allocation, as unexpected poor layouts can lead to high tail latencies. Our approach utilizes sophisticated statistical methodologies, based on Latin Hypercube Sampling (LHS) and sensitivity analysis, to explore the search space efficiently and diagnose intricate design problems. We apply Chopper to study the overall behavior of two file systems, and to study Linux ext4 in depth. We identify four internal design issues in the block allocator of ext4 which form a large tail in the distribution of layout quality. By removing the underlying problems in the code, we cut the size of the tail by an order of magnitude, producing consistent and satisfactory file layouts that reduce data access latencies.
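
To illustrate the sampling machinery, here is a small Latin Hypercube Sampling sketch over a made-up file-system input space: each factor's range is divided into n strata and every stratum is used exactly once, so n runs cover the space more evenly than n random points would. The factors shown are invented; Chopper's real input space and statistical analysis are far larger.

```python
# Latin Hypercube Sampling over a toy file-system input space.
import random

def latin_hypercube(factors, n, seed=0):
    rng = random.Random(seed)
    columns = []
    for lo, hi in factors.values():
        strata = list(range(n))
        rng.shuffle(strata)                       # one stratum per run, per factor
        width = (hi - lo) / n
        columns.append([lo + (s + rng.random()) * width for s in strata])
    return [dict(zip(factors, point)) for point in zip(*columns)]

# Illustrative factors only; Chopper's real parameters differ.
factors = {"file_size_kb": (4, 4096), "dir_depth": (1, 8), "sync_interval": (0, 64)}
for run in latin_hypercube(factors, n=5):
    print(run)
```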

Available Media

Skylight—A Window on Shingled Disk Operation

Abutalib Aghayev and Peter Desnoyers, Northeastern University
Awarded Best Paper!

We introduce Skylight, a novel methodology that combines software and hardware techniques to reverse engineer key properties of drive-managed Shingled Magnetic Recording (SMR) drives. The software part of Skylight measures the latency of controlled I/O operations to infer important properties of drive-managed SMR, including type, structure, and size of the persistent cache; type of cleaning algorithm; type of block mapping; and size of bands. The hardware part of Skylight tracks drive head movements during these tests, using a high-speed camera through an observation window drilled through the cover of the drive. These observations not only confirm inferences from measurements, but resolve ambiguities that arise from the use of latency measurements alone. We show the generality and efficacy of our techniques by running them on top of three emulated and two real SMR drives, discovering valuable performance-relevant details of the behavior of the real SMR drives.

Available Media

SNIA Industry Track Session 3

Grand Ballroom DE

Download tutorial materials at the SNIA Web site.

Storage Grid Using iSCSI

Felix Xavier, CloudByte

The advent of the cloud has brought new requirements for storage. Storage nodes in the cloud have to communicate with each other and bring hot data near the application across data centers, and this communication must be standards-based.

This session proposes the iSCSI protocol for inter-storage-node communication in the enterprise-grade cloud. This may not be applicable to object storage. It is NOT intended to cover the underlying storage implementation that optimally supports this communication. iSCSI is chosen because it is emerging as a de facto standard for such communication.

Recognized as one of the top 250 MSP thought and entrepreneurial leaders globally by MSPmentor, Felix has more than 15 years of development and technology management experience. With the right blend of expertise in both networking and storage technologies, he co-founded CloudByte. Felix has built many high-energy technology teams, re-architected products and developed core features from scratch. Most recently, Felix helped NetApp gain leadership position in storage array-based data protection by driving innovations around its product suite. He has filed numerous patents with the U.S. Patent and Trademark Office around core storage technologies. Prior to this, Felix worked at Juniper, Novell and IBM, where he handled networking technologies, including LAN, WAN, and security protocols, and Intrusion Prevention Systems (IPS). Felix has Masters degrees in technology and business administration.

Practical Secure Storage: A Vendor-agnostic Overview

Walt Hubis, Hubis Technical Associates

This tutorial will explore the fundamental concepts of implementing secure enterprise storage using current technologies. It has been significantly updated to include current and emerging technologies and changes in international security standards (e.g., ISO/IEC).

The focus of this tutorial is the implementation of a practical secure storage system, independent of any specific vendor implementation or methodology. The high-level requirements that drive the implementation of secure storage for the enterprise, including legal issues, key management, current technologies available to the end user, and fiscal considerations will be explored in detail. In addition, actual implementation examples will be provided that illustrate how these requirements are applied to actual systems implementations.

Walt is the owner of Hubis Technical Associates. He provides expertise related to storage interface and storage security standards organizations with a focus on protocols and software interfaces and how innovative and disruptive computer storage technologies impact these standards. Walt has over 25 years of experience in storage systems engineering in both development and managerial positions, and has authored several key patents in RAID and other storage-related technologies. He is the vice-chair of the SNIA SSSI Initiative and has served as the Chair of the Trusted Computing Group Key Management Services Subgroup, Chair of the IEEE SISWG P1619.3 Key Management Subcommittee, and Secretary of the IEEE Security in Storage Work Group (SISWG). Walt holds a Bachelor of Science degree in Electrical Engineering.

3:30 pm–4:00 pm Tuesday

Break with Refreshments

Grand Ballroom Foyer

4:00 pm–5:30 pm Tuesday

The Internship: WiPS

Grand Ballroom ABCFGH
Session Chair: Donald Porter, Stony Brook University

The list of accepted Work-in-Progress reports is available here.

SNIA Industry Track Session 4

Grand Ballroom DE
Note: This session ends at 5:15 pm

Download tutorial materials at the SNIA Web site.

Storage Industry Forging Academic Alliances

Ramin Elahi, UC Santa Cruz Silicon Valley

The three most common challenges facing IT managers and CTOs today are: 1) ingesting the gigabytes of data generated globally every second, 2) managing and accessing data in the most efficient manner 24/7, and 3) extracting value from all these data. Hence the need to manage, analyze, and sustain this astronomical amount of corporate data in the most secure ways via new technologies such as data center storage and virtualization.

Today, the majority of Computer Science and Information Engineering programs lack storage and virtualization studies; consequently, their graduates miss out on many job opportunities. As a result, many storage companies have to provide extensive training for their new hires on data storage and virtualization, which are the building blocks of the rapidly growing Cloud Computing and Services and Big Data technologies.

Today, few companies have successfully forged academic alliances with colleges to fill the gap for much needed data storage and virtualization savvy new-hire engineers and IT staff.

Ramin Elahi, MSEE, is an Adjunct Faculty member and Advisory Board Member with the University of California, Santa Cruz Extension. He’s also an Engineering Training Solutions Architect at NetApp Inc., responsible for on-boarding and training curricula development. Prior to NetApp, he was Training Site Manager for Hitachi Data Systems Academy. He was also the Global Curriculum Manager at Hewlett-Packard. His areas of expertise are data center storage design and architecture, Data ONTAP, cloud storage, and virtualization. He also held a variety of positions at Cisco, Novell, and SCO as a consultant and escalation engineer. He implemented the first university-level Data Storage and Virtualization curriculum in Northern California in 2007. He has taught Data Storage, TCP/IP, and Unix System Administration at the UCSC Extension and UC Berkeley Extension since 1996.

6:00 pm–8:00 pm Tuesday

Poster Session and Reception I

Santa Clara Ballroom

Check out the cool new ideas and the latest preliminary research on display at the Poster Session and Reception. Take part in discussions with your colleagues over complimentary food and drinks.

The list of accepted posters is available here.

Wednesday, February 18, 2015

8:00 am–9:00 am Wednesday

Continental Breakfast

Grand Ballroom Foyer

9:00 am–10:15 am Wednesday

History of the World, Part I

Grand Ballroom ABCFGH
Session Chair: Margo Seltzer, Harvard School of Engineering and Applied Sciences and Oracle

FAST '15 Test of Time Award Presentation

Keynote Address: A Brief History of the BSD Fast Filesystem

Dr. Marshall Kirk McKusick, Author and Consultant

This talk provides a taxonomy of filesystem and storage development from 1979 to the present with the BSD Fast Filesystem as its focus. It describes the early performance work done by increasing the disk block size and by being aware of the disk geometry and using that knowledge to optimize rotational layout. With the abstraction of the geometry in the late 1980s and the ability of the hardware to cache and handle multiple requests, filesystems ceased trying to track geometry and instead sought to maximize performance through contiguous file layout. Small-file performance was optimized through the use of techniques such as journaling and soft updates. By the late 1990s, filesystems had to be redesigned to handle the ever-growing disk capacities. The addition of snapshots allowed for faster and more frequent backups. The increasingly harsh environment of the Internet required greater data protection provided by access-control lists and mandatory-access controls. The talk concludes with a discussion of the addition of symmetric multi-processing support needed to utilize all the CPUs found in the increasingly ubiquitous multi-core processors.

Dr. Marshall Kirk McKusick's work with Unix and BSD development spans over four decades. It begins with his first paper on the implementation of Berkeley Pascal in 1979, goes on to his pioneering work in the eighties on the BSD Fast File System, the BSD virtual memory system, the final release of 4.4BSD-Lite from the UC Berkeley Computer Systems Research Group, and carries on with his work on FreeBSD. A key figure in Unix and BSD development, his experiences chronicle not only the innovative technical achievements but also the interesting personalities and philosophical debates in Unix over the past thirty-five years.

Dr. McKusick’s second edition of The Design and Implementation of the FreeBSD Operating System was released in September 2014.

Available Media

10:15 am–10:45 am Wednesday

Break with Refreshments

Grand Ballroom Foyer

10:45 am–12:15 pm Wednesday

Speed

Grand Ballroom ABCFGH
Session Chair: Theodore Ts'o, Google

Non-blocking Writes to Files

Daniel Campello and Hector Lopez, Florida International University; Luis Useche, Google Inc.; Ricardo Koller, IBM T. J. Watson Research Center; Raju Rangaswami, Florida International University

Writing data to a page not present in the file-system page cache causes the operating system to synchronously fetch the page into memory first. Synchronous page fetch defines both policy (when) and mechanism (how), and always blocks the writing process. Non-blocking writes eliminate such blocking by buffering the written data elsewhere in memory and unblocking the writing process immediately. Subsequent reads to the updated page locations are also made non-blocking. This new handling of writes to non-cached pages allows processes to overlap more computation with I/O and improves page fetch I/O throughput by increasing fetch parallelism. Our empirical evaluation demonstrates the potential of non-blocking writes in improving the overall performance of systems with no loss of performance when workloads cannot benefit from it. Across the Filebench write workloads, non-blocking writes improve benchmark throughput by 7X on average (up to 45.4X) when using disk drives and by 2.1X on average (up to 4.2X) when using SSDs. For the SPECsfs2008 benchmark, non-blocking writes decrease overall average latency of NFS operations between 3.5% and 70% and average write latency between 65% and 79%. When replaying the MobiBench file system traces, non-blocking writes decrease average operation latency by 20–60%.
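
A toy model of the bookkeeping behind this idea: a write to a non-cached page is recorded as an in-memory patch and the writer continues; the patch is applied when the asynchronous fetch completes. This is only a sketch of the concept, not the kernel mechanism evaluated in the paper.

```python
# Non-blocking write sketch: buffer writes to absent pages as patches and
# merge them when the page fetch completes.

PAGE = 4096
page_cache = {}                  # page number -> bytearray
pending_patches = {}             # page number -> list of (offset, data)

def write(pageno, offset, data):
    if pageno in page_cache:
        page_cache[pageno][offset:offset + len(data)] = data
    else:
        pending_patches.setdefault(pageno, []).append((offset, data))
        issue_async_fetch(pageno)            # writer does NOT block here

def issue_async_fetch(pageno):
    pass                                     # stand-in for an async disk read

def on_fetch_complete(pageno, page_bytes):
    page = bytearray(page_bytes)
    for offset, data in pending_patches.pop(pageno, []):
        page[offset:offset + len(data)] = data       # apply buffered writes
    page_cache[pageno] = page

write(7, 10, b"hello")                       # returns immediately
on_fetch_complete(7, b"\0" * PAGE)           # later: fetch finishes, patch applied
assert bytes(page_cache[7][10:15]) == b"hello"
```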

Available Media

NV-Tree: Reducing Consistency Cost for NVM-based Single Level Systems

Jun Yang, Qingsong Wei, Cheng Chen, Chundong Wang, and Khai Leong Yong, Data Storage Institute, A-STAR; Bingsheng He, Nanyang Technological University

Non-volatile memory (NVM) has DRAM-like performance and disk-like persistence, which make it possible to replace both disk and DRAM to build single-level systems. Keeping data consistent in such systems is non-trivial because memory writes may be reordered by the CPU and the memory controller. In this paper, we study the consistency cost for an important and common data structure, the B+Tree. Although memory fence and CPU cacheline flush instructions can order memory writes to achieve data consistency, they introduce a significant overhead (more than 10X slower in performance). Based on our quantitative analysis of consistency cost, we propose NV-Tree, a consistent and cache-optimized B+Tree variant with reduced CPU cacheline flushes. We implement and evaluate NV-Tree and NV-Store, a key-value store based on NV-Tree, on an NVDIMM server. NV-Tree outperforms state-of-the-art consistent tree structures by up to 12X under write-intensive workloads. NV-Store increases throughput by up to 4.8X under YCSB workloads compared to Redis.
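
As a rough back-of-the-envelope illustration of why cacheline flushes dominate this consistency cost, the snippet below counts how many 64-byte cachelines must be flushed to insert one 8-byte entry into a leaf that keeps entries sorted (shifting on insert) versus one that simply appends. The counting model is our own simplification for intuition and is not taken from the paper.

```python
# Count dirtied 64-byte cachelines for one 8-byte insert under two leaf layouts.
CACHELINE, ENTRY = 64, 8

def lines_touched(byte_offsets):
    return len({off // CACHELINE for off in byte_offsets})

def flushes_sorted_leaf(n_entries, insert_pos):
    # entries at and after insert_pos shift right by one slot
    moved = range(insert_pos * ENTRY, (n_entries + 1) * ENTRY)
    return lines_touched(moved)

def flushes_append_leaf(n_entries):
    # new entry is appended; only its line (plus a small header) is dirtied
    return lines_touched(range(n_entries * ENTRY, (n_entries + 1) * ENTRY)) + 1

print(flushes_sorted_leaf(56, insert_pos=0))   # 8 cachelines flushed
print(flushes_append_leaf(56))                 # 2 cachelines flushed
```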

Available Media

Towards SLO Complying SSDs Through OPS Isolation

Jaeho Kim and Donghee Lee, University of Seoul; Sam H. Noh, Hongik University

Virtualization systems should be responsible for satisfying the service level objectives (SLOs) for each VM. Performance SLOs, in particular, are generally achieved by isolating the underlying hardware resources among the VMs. In this paper, we show through empirical evaluation that performance SLOs cannot be satisfied with current commercial SSDs. We show that garbage collection is the source of this problem and that this cannot be easily controlled because of the interaction between VMs. To control the effect of garbage collection on VMs, we propose a scheme called OPS isolation. OPS isolation allocates flash memory blocks so that blocks of one VM do not interfere with blocks of other VMs during garbage collection. Experimental results show that performance SLO can be achieved through OPS isolation.

Available Media

Boosting Quasi-Asynchronous I/O for Better Responsiveness in Mobile Devices

Daeho Jeong, Samsung Electronics Co.; Youngjae Lee and Jin-Soo Kim, Sungkyunkwan University

Providing quick system response for mobile devices is of great importance due to their interactive nature. However, we observe that the latency of file system operations increases dramatically under heavy asynchronous I/Os in the background. A careful analysis reveals that most of the delay arises from an unexpected situation where the file system operations are blocked until one or more asynchronous I/O operations are completed. We call such an I/O—which is issued as an asynchronous I/O but has the property of a synchronous I/O as some tasks are blocked on it—Quasi-Asynchronous I/O (QASIO).

We classify the types of dependencies between tasks and QASIOs and then show when such dependencies occur in the Linux kernel. Also, we propose a novel scheme to detect QASIOs at run time and boost them over the other asynchronous I/Os in the I/O scheduler. Our measurement results on the latest smartphone demonstrate that the proposed scheme effectively improves the responsiveness of file system operations.

Available Media

12:15 pm–1:45 pm Wednesday

Lunch, on your own

1:45 pm–3:15 pm Wednesday

The Fault in Our Stars: Reliability

Grand Ballroom ABCFGH
Session Chair: Haryadi Gunawi, University of Chicago

Failure-Atomic Updates of Application Data in a Linux File System

Rajat Verma and Anton Ajay Mendez, Hewlett-Packard; Stan Park, Hewlett-Packard Labs; Sandya Mannarswamy, Hewlett-Packard; Terence Kelly and Charles B. Morrey III, Hewlett-Packard Labs

We present the design, implementation, and evaluation of a file system mechanism that protects the integrity of application data from failures such as process crashes, kernel panics, and power outages. A simple interface offers applications a guarantee that the application data in a file always reflects the most recent successful fsync or msync operation on the file. Our file system furthermore offers a new syncv mechanism that failure-atomically commits changes to multiple files. Failure-injection tests verify that our file system protects the integrity of application data from crashes and performance measurements confirm that our implementation is efficient. Our file system runs on conventional hardware and unmodified Linux kernels and will be released commercially. We believe that our mechanism is implementable in any file system that supports per-file writable snapshots.

Available Media

A Tale of Two Erasure Codes in HDFS

Mingyuan Xia, McGill University; Mohit Saxena, Mario Blaum, and David A. Pease, IBM Research Almaden

Distributed storage systems are increasingly transitioning to the use of erasure codes since they offer higher reliability at significantly lower storage costs than data replication. However, these codes trade off recovery performance as they require multiple disk reads and network transfers for reconstructing an unavailable data block. As a result, most existing systems use an erasure code either optimized for storage overhead or recovery performance.

In this paper, we present HACFS, a new erasure-coded storage system that instead uses two different erasure codes and dynamically adapts to workload changes. It uses a fast code to optimize for recovery performance and a compact code to reduce the storage overhead. A novel conversion mechanism is used to efficiently upcode and downcode data blocks between fast and compact codes. We show that the HACFS design techniques are generic and successfully apply them to two different code families: Product and LRC codes.

We have implemented HACFS as an extension to the Hadoop Distributed File System (HDFS) and experimentally evaluate it with five different workloads from production clusters. The HACFS system always maintains a low storage overhead and significantly improves the recovery performance as compared to three popular single-code storage systems. It reduces the degraded read latency by up to 46%, and the reconstruction time and disk/network traffic by up to 45%.

Available Media

How Much Can Data Compressibility Help to Improve NAND Flash Memory Lifetime?

Jiangpeng Li, Kai Zhao, and Xuebin Zhang, Rensselaer Polytechnic Institute; Jun Ma, Shanghai Jiao Tong University; Ming Zhao, Florida International University; Tong Zhang, Rensselaer Polytechnic Institute

Although data compression can benefit flash memory lifetime, little work has been done to rigorously study the full potential of exploiting data compressibility to improve memory lifetime. This work attempts to fill this missing link. Motivated by the fact that memory cell damage strongly depends on the data content being stored, we first propose an implicit data compression approach (i.e., compress each data sector but do not increase the number of sectors per flash memory page) as a complement to conventional explicit data compression that aims to increase the number of sectors per flash memory page. Due to the runtime variation of data compressibility, each flash memory page almost always contains some unused storage space left by compressed data sectors. We develop a set of design strategies for exploiting such unused storage space to reduce the overall memory physical damage. We derive a set of mathematical formulations that can quantitatively estimate flash memory physical damage reduction gained by the proposed design strategies for both explicit and implicit data compression. Using 20nm MLC NAND flash memory chips, we carry out extensive experiments to quantify the content dependency of memory cell damage, based upon which we empirically evaluate and compare the effectiveness of the proposed design strategies under a wide spectrum of data compressibility characteristics.

Available Media

RAIDShield: Characterizing, Monitoring, and Proactively Protecting Against Disk Failures

Ao Ma, Fred Douglis, Guanlin Lu, and Darren Sawyer, EMC Corporation; Surendar Chandra and Windsor Hsu, Datrium, Inc.

Modern storage systems orchestrate a group of disks to achieve their performance and reliability goals. Even though such systems are designed to withstand the failure of individual disks, failure of multiple disks poses a unique set of challenges. We empirically investigate disk failure data from a large number of production systems, specifically focusing on the impact of disk failures on RAID storage systems. Our data covers about one million SATA disks from 6 disk models for periods up to 5 years. We show how observed disk failures weaken the protection provided by RAID. The count of reallocated sectors correlates strongly with impending failures.

With these findings we designed RAIDSHIELD, which consists of two components. First, we have built and evaluated an active defense mechanism that monitors the health of each disk and replaces those that are predicted to fail imminently. This proactive protection has been incorporated into our product and is observed to eliminate 88% of triple disk errors, which are 80% of all RAID failures. Second, we have designed and simulated a method of using the joint failure probability to quantify and predict how likely a RAID group is to face multiple simultaneous disk failures, which can identify disks that collectively represent a risk of failure even when no individual disk is flagged in isolation. We find in simulation that RAID-level analysis can effectively identify most vulnerable RAID-6 systems, improving the coverage to 98% of triple errors.
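
The RAID-level idea can be sketched as a Poisson-binomial tail computation: given each disk's estimated probability of failing within a window (in the paper, driven by signals such as reallocated-sector counts; the numbers below are invented), compute the probability that at least m disks in the group fail together. This is a generic formulation for illustration, not RAIDSHIELD's exact model.

```python
# Exact probability that at least m of a group's disks fail in the window,
# from per-disk failure probabilities (Poisson-binomial tail via DP).

def prob_at_least(m, per_disk_probs):
    # dist[j] = probability that exactly j of the disks processed so far fail
    dist = [1.0]
    for p in per_disk_probs:
        new = [0.0] * (len(dist) + 1)
        for j, q in enumerate(dist):
            new[j] += q * (1 - p)
            new[j + 1] += q * p
        dist = new
    return sum(dist[m:])

group = [0.02, 0.01, 0.01, 0.15, 0.12, 0.01]   # two suspicious disks (made up)
print(f"P(>=2 failures) = {prob_at_least(2, group):.4f}")
print(f"P(>=3 failures) = {prob_at_least(3, group):.6f}")
```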

Available Media

3:15 pm–3:45 pm Wednesday

Break with Refreshments

Grand Ballroom Foyer

3:45 pm–5:25 pm Wednesday

Quoth the Raven, Neveread: Write-Optimized File Systems

Grand Ballroom ABCFGH
Session Chair: Geoff Kuenning, Harvey Mudd College

Write Once, Get 50% Free: Saving SSD Erase Costs Using WOM Codes

Gala Yadgar, Eitan Yaakobi, and Assaf Schuster, Technion—Israel Institute of Technology

NAND flash, used in modern SSDs, is a write-once medium, where each memory cell must be erased prior to writing. The lifetime of an SSD is limited by the number of erasures allowed on each cell. Thus, minimizing erasures is a key objective in SSD design.

A promising approach to eliminate erasures and extend SSD lifetime is to use write-once memory (WOM) codes, designed to accommodate additional writes on write-once media. However, these codes inflate the physically stored data by at least 29%, and require an extra read operation before each additional write. This reduces the available capacity and I/O performance of the storage device, so far preventing the adoption of these codes in SSD design.

We present Reusable SSD, in which invalid pages are reused for additional writes, without modifying the drive’s exported storage capacity or page size. Only data written as a second write is inflated, and the required additional storage is provided by the SSD’s inherent overprovisioning space. By prefetching invalid data and parallelizing second writes between planes, our design achieves latency equivalent to a regular write. We reduce the number of erasures by 33% in most cases, resulting in a 15% lifetime extension and an overall reduction of up to 35% in I/O response time, on a wide range of synthetic and production workloads and flash chip architectures.
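
For readers unfamiliar with WOM codes, the classic Rivest–Shamir construction below stores two successive 2-bit values in 3 write-once cells, with the second write only setting additional cells. It illustrates the mechanism (and the capacity inflation) that second writes rely on; it is not the specific code or FTL integration used in Reusable SSD.

```python
# Textbook 2-bit-in-3-cell WOM code: cells only change from 0 to 1 between
# erases, yet a second 2-bit value can overwrite the first without an erase.

FIRST  = {(0, 0): (0, 0, 0), (0, 1): (1, 0, 0), (1, 0): (0, 1, 0), (1, 1): (0, 0, 1)}
SECOND = {(0, 0): (1, 1, 1), (0, 1): (0, 1, 1), (1, 0): (1, 0, 1), (1, 1): (1, 1, 0)}

def decode(cells):
    table = FIRST if sum(cells) <= 1 else SECOND
    return next(d for d, c in table.items() if c == cells)

def second_write(cells, data):
    if decode(cells) == data:                 # same value: nothing to change
        return cells
    target = SECOND[data]
    assert all(c <= t for c, t in zip(cells, target)), "would need an erase"
    return target                             # only 0 -> 1 transitions

cells = FIRST[(1, 0)]                         # first write stores 10
print(decode(cells))                          # (1, 0)
cells = second_write(cells, (0, 0))           # overwrite with 00, no erase
print(decode(cells), cells)                   # (0, 0) (1, 1, 1)
```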

Available Media

F2FS: A New File System for Flash Storage

Changman Lee, Dongho Sim, Joo-Young Hwang, and Sangyeun Cho, Samsung Electronics Co., Ltd.

F2FS is a Linux file system designed to perform well on modern flash storage devices. The file system builds on append-only logging and its key design decisions were made with the characteristics of flash storage in mind. This paper describes the main design ideas, data structures, algorithms and the resulting performance of F2FS.

Experimental results highlight the desirable performance of F2FS; on a state-of-the-art mobile system, it outperforms EXT4 under synthetic workloads by up to 3.1x (iozone) and 2x (SQLite). It reduces the elapsed time of several realistic workloads by up to 40%. On a server system, F2FS is shown to perform better than EXT4 by up to 2.5x (SATA SSD) and 1.8x (PCIe SSD).

Available Media

A Practical Implementation of Clustered Fault Tolerant Write Acceleration in a Virtualized Environment

Deepavali Bhagwat, Mahesh Patil, Michal Ostrowski, Murali Vilayannur, Woon Jung, and Chethan Kumar, PernixData, Inc.

Host-side flash storage opens up an exciting avenue for accelerating Virtual Machine (VM) writes in virtualized datacenters. The key challenge with implementing such an acceleration layer is to do so without breaking live VM migration, which is essential for providing distributed resource management and high availability. High availability also powers on VMs on a new host when the previous host crashes. We introduce FVP, a fault-tolerant host-side flash write acceleration layer that seamlessly integrates with the virtualized environment while preserving dynamic resource management and high availability, the holy tenets of a virtualized environment. FVP integrates with the VMware ESX hypervisor kernel to intercept VM I/O and redirects the I/O to host-side flash devices. VMs experience flash latencies instead of SAN latencies, and write-intensive applications such as databases and email servers benefit from predictable write throughput. No changes are required to the VM guest operating systems, so VM applications can continue to function seamlessly without any modifications. FVP pools together all the host-side flash devices in the cluster so every host can access another host’s flash device, preserving VM mobility. By replicating VM writes onto peer host-side flash devices, FVP is able to tolerate multiple cascading host and flash failures. Failure recovery is distributed, requiring no central coordination. We describe the workings of the FVP key components and demonstrate how FVP reduces VM latencies to accelerate VM writes, improves performance predictability, and increases virtualized datacenter efficiency.

Available Media

BetrFS: A Right-Optimized Write-Optimized File System

William Jannen, Jun Yuan, Yang Zhan, Amogh Akshintala, Stony Brook University; John Esmet, Tokutek Inc.; Yizheng Jiao, Ankur Mittal, Prashant Pandey, and Phaneendra Reddy, Stony Brook University; Leif Walsh, Tokutek Inc.; Michael Bender, Stony Brook University; Martin Farach-Colton, Rutgers University; Rob Johnson, Stony Brook University; Bradley C. Kuszmaul, Massachusetts Institute of Technology; Donald E. Porter, Stony Brook University

The Bε-tree File System, or BetrFS (pronounced “better eff ess”), is the first in-kernel file system to use a write-optimized index. Write-optimized indexes (WOIs) are promising building blocks for storage systems because of their potential to implement both microwrites and large scans efficiently.

Previous work on WOI-based file systems has shown promise but has also been hampered by several open problems, which this paper addresses. For example, FUSE issues many queries into the file system, superimposing read-intensive workloads on top of write-intensive ones, thereby reducing the effectiveness of WOIs. Moving to an in-kernel implementation can address this problem by providing finer control of reads. This paper also contributes several implementation techniques to leverage kernel infrastructure without throttling write performance.

Our results show that BetrFS provides good performance for both arbitrary microdata operations, which include creating small files, updating metadata, and small writes into large or small files, and for large sequential I/O. On one microdata benchmark, BetrFS provides more than 4x the performance of ext4 or XFS. BetrFS is an ongoing prototype effort, and requires additional data-structure tuning to match current general-purpose file systems on some operations such as deletes, directory renames, and large sequential writes. Nonetheless, many applications realize significant performance improvements. For instance, an in-place rsync of the Linux kernel source realizes roughly a 1.6–22x speedup over other commodity file systems.
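
The toy Python sketch below captures the write-optimized-index idea underlying BetrFS: updates land as messages in a node's buffer and are flushed to children in batches, so many microwrites are amortized into each larger write. It is purely illustrative, with no on-disk layout, rebalancing, or range operations.

```python
# Toy Bε-tree-style message buffering: inserts accumulate in a node's buffer
# and are pushed down in batches instead of one random write per update.

class Node:
    def __init__(self, pivots=None, children=None):
        self.buffer = {}                       # key -> value (pending messages)
        self.pivots = pivots or []             # routing keys
        self.children = children or []         # empty => leaf

    def insert(self, key, value, max_buffer=4):
        self.buffer[key] = value
        if self.children and len(self.buffer) > max_buffer:
            self.flush()                       # one batched push per overflow

    def child_for(self, key):
        idx = sum(1 for p in self.pivots if key >= p)
        return self.children[idx]

    def flush(self):
        for key, value in self.buffer.items():
            self.child_for(key).insert(key, value)
        self.buffer.clear()

    def lookup(self, key):
        if key in self.buffer:                 # newest message wins
            return self.buffer[key]
        return self.child_for(key).lookup(key) if self.children else None

root = Node(pivots=["m"], children=[Node(), Node()])
for k in ["a", "q", "b", "z", "c", "r"]:
    root.insert(k, k.upper())
print(root.lookup("q"))                        # 'Q'
```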

Available Media

6:00 pm–8:00 pm Wednesday

Poster Session and Reception II

Santa Clara Ballroom

Sponsored by NetApp
Check out the cool new ideas and the latest preliminary research on display at the Poster Session and Reception. Take part in discussions with your colleagues over complimentary food and drinks.

The list of accepted posters is available here.

Thursday, February 19, 2015

7:30 am–8:30 am Thursday

Continental Breakfast

Grand Ballroom Foyer

8:30 am–9:45 am Thursday

The Imitation Game: Benchmarking and Workloads

Grand Ballroom ABCFGH
Session Chair: Fred Douglis, EMC

SDGen: Mimicking Datasets for Content Generation in Storage Benchmarks

Raúl Gracia-Tinedo, Universitat Rovira i Virgili; Danny Harnik, Dalit Naor, and Dmitry Sotnikov, IBM Research Haifa; Sivan Toledo and Aviad Zuck, Tel-Aviv University

Storage system benchmarks either use samples of proprietary data or synthesize artificial data in simple ways (such as using zeros or random data). However, many storage systems behave completely differently on such artificial data than they do on real-world data. This is the case with systems that include data reduction techniques, such as compression and/or deduplication.

To address this problem, we propose a benchmarking methodology called mimicking and apply it in the domain of data compression. Our methodology is based on characterizing the properties of real data that influence the performance of compressors. Then, we use these characterizations to generate new synthetic data that mimics the real data in many aspects of compression. Unlike current solutions that only address the compression ratio of data, mimicking is flexible enough to also emulate compression times and data heterogeneity. We show that these properties matter to the system’s performance.

In our implementation, called SDGen, characterizations take at most 2.5KB per data chunk (e.g., 64KB) and can be used to efficiently share benchmarking data in a highly anonymized fashion; sharing it carries few or no privacy concerns. We evaluated our data generator’s accuracy on compressibility and compression times using real-world datasets and multiple compressors (lz4, zlib, bzip2 and lzma). As a proof-of-concept, we integrated SDGen as a content generation layer in two popular benchmarks (LinkBench and Impressions).
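
The mimicking idea can be illustrated with a minimal Python sketch (a simplification, not SDGen's actual characterization; the helper names are assumptions): summarize a real chunk with a compact description and later regenerate synthetic bytes from that description, so compressors see similar statistics without the real content ever being shared.

import random
import zlib
from collections import Counter

def characterize(chunk: bytes) -> dict:
    # Compact, shareable summary; no raw content is retained.
    return {"size": len(chunk), "histogram": Counter(chunk)}

def generate(characterization: dict, seed: int = 0) -> bytes:
    # Regenerate a synthetic chunk of the same size whose byte distribution
    # matches the recorded histogram.
    rng = random.Random(seed)
    values = list(characterization["histogram"].keys())
    weights = list(characterization["histogram"].values())
    return bytes(rng.choices(values, weights=weights, k=characterization["size"]))

# A byte histogram alone is a crude proxy: it matches the symbol distribution
# but not repetition structure, which is why richer characterizations (as in
# the paper) are needed to reproduce both compression ratio and compression time.
real = b"abcabcabcXYZ" * 512
fake = generate(characterize(real))
print(len(zlib.compress(real)), len(zlib.compress(fake)))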

Available Media

Design Tradeoffs for Data Deduplication Performance in Backup Workloads

Min Fu, Dan Feng, and Yu Hua, Huazhong University of Science and Technology; Xubin He, Virginia Commonwealth University; Zuoning Chen, National Engineering Research Center for Parallel Computer; Wen Xia and Yucheng Zhang, Huazhong University of Science and Technology; Yujuan Tan, Chongqing University

Data deduplication has become a standard component in modern backup systems. In order to understand the fundamental tradeoffs in each of its design choices (such as prefetching and sampling), we disassemble data deduplication into a large N-dimensional parameter space. Each point in the space corresponds to a particular combination of parameter settings and represents a tradeoff among backup and restore performance, memory footprint, and storage cost. Existing and potential solutions can be considered as specific points in the space. Then, we propose a general-purpose framework to evaluate various deduplication solutions in the space. Given that no single solution is perfect in all metrics, our goal is to find reasonable solutions that sustain backup performance and strike a suitable tradeoff among deduplication ratio, memory footprint, and restore performance. Our findings from extensive experiments using real-world workloads provide a detailed guide to making efficient design decisions according to the desired tradeoff.
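
One way to picture the parameter-space framing is the following minimal Python sketch (the dimensions and names are illustrative assumptions, not the paper's exact parameter list): enumerate combinations of design choices and record, for each point, the metrics it trades off, so candidate solutions can be compared on equal footing.

from itertools import product
from collections import namedtuple

DesignPoint = namedtuple("DesignPoint", ["sampling_ratio", "prefetch", "cache_mb"])

def evaluate(point, workload):
    # Placeholder for simulating a backup/restore of `workload` under `point`;
    # a real framework would return measured values for each metric.
    return {"dedup_ratio": 0.0, "backup_mbps": 0.0, "restore_mbps": 0.0,
            "memory_mb": point.cache_mb}

def sweep(workload):
    results = {}
    for combo in product([1, 32, 256],      # fingerprint sampling ratios
                         [True, False],     # prefetch fingerprints or not
                         [128, 1024]):      # fingerprint cache size (MB)
        point = DesignPoint(*combo)
        results[point] = evaluate(point, workload)
    return results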

Available Media

Chronicle: Capture and Analysis of NFS Workloads at Line Rate

Ardalan Kangarlou, Sandip Shete, and John D. Strunk, NetApp, Inc.

Insights from workloads have been instrumental in hardware and software design, problem diagnosis, and performance optimization. The recent emergence of software-defined data centers and application-centric computing has further increased the interest in studying workloads. Despite the ever-increasing interest, the lack of general frameworks for trace capture and workload analysis at line rate has impeded characterizing many storage workloads and systems. This is in part due to complexities associated with engineering a solution that is tailored enough to use computational resources efficiently yet is general enough to handle different types of analyses or workloads.

This paper presents Chronicle, a high-throughput framework for capturing and analyzing Network File System (NFS) workloads at line rate. More specifically, we designed Chronicle to characterize NFS network traffic at rates above 10Gb/s for days to weeks. By leveraging the actor programming model and a pluggable, pipelined architecture, Chronicle facilitates a highly portable and scalable framework that imposes little burden on application programmers. In this paper, we demonstrate that Chronicle can reconstruct, process, and record storage-level semantics at the rate of 14Gb/s using general-purpose CPUs, disks, and NICs.
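
The actor-based pipeline can be pictured with a minimal Python sketch (illustrative only; Chronicle itself is a native, high-performance implementation, and the stage names here are assumptions): each stage is an actor with its own mailbox and thread, and stages are chained so captured packets flow from parsing to recording without shared-state locking.

import queue
import threading
import time

class Actor(threading.Thread):
    """A pipeline stage with a private mailbox; subclasses define handle()."""
    def __init__(self, downstream=None):
        super().__init__(daemon=True)
        self.mailbox = queue.Queue()
        self.downstream = downstream

    def send(self, msg):
        self.mailbox.put(msg)

    def run(self):
        while True:
            out = self.handle(self.mailbox.get())
            if out is not None and self.downstream is not None:
                self.downstream.send(out)

class NfsParser(Actor):
    def handle(self, packet):
        # Reconstruct storage-level semantics from raw bytes (stubbed here).
        return {"op": "READ", "bytes": len(packet)}

class Recorder(Actor):
    def handle(self, record):
        print("record:", record)               # a real sink would write a trace

# Wire the stages, start their threads, and feed one captured packet.
recorder = Recorder()
parser = NfsParser(downstream=recorder)
for actor in (recorder, parser):
    actor.start()
parser.send(b"\x00" * 128)
time.sleep(0.1)                                # let the pipeline drain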

Available Media

9:45 am–10:00 am Thursday

Break with Refreshments

Grand Ballroom Foyer

10:00 am–10:50 am Thursday

The Social Network: Mobile and Social-Networking Systems

Grand Ballroom ABCFGH
Session Chair: Dean Hildebrand, IBM Research—Almaden

Reliable, Consistent, and Efficient Data Sync for Mobile Apps

Younghwan Go, Korea Advanced Institute of Science and Technology (KAIST) and NEC Labs; Nitin Agrawal, Akshat Aranya, and Cristian Ungureanu, NEC Labs

Mobile apps need to manage data, often across devices, to provide users with a variety of features such as seamless access, collaboration, and offline editing. To do so reliably, an app must anticipate and handle a host of local and network failures while preserving data consistency. For mobile environments, frugal use of cellular bandwidth and device battery is also essential. The above requirements place an enormous burden on the app developer. We built Simba, a data-sync service that provides mobile app developers with a high-level local-programming abstraction unifying tabular and object data—a need common to mobile apps—and transparently handles data storage and sync in a reliable, consistent, and efficient manner. In this paper we present a detailed description of Simba’s client software, which acts as the gateway to the data sync infrastructure. Our evaluation shows Simba’s effectiveness in rapid development of robust mobile apps that are consistent under all failure scenarios, unlike apps developed with Dropbox. Simba apps are also demonstrably frugal with cellular resources.
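
A minimal Python sketch of a unified tabular-plus-object abstraction (the names and API are assumptions for illustration, not Simba's actual client interface): a row carries both column values and objects, and the client tracks dirty rows so a sync layer can later ship table and object changes together.

class SyncTable:
    def __init__(self, name, columns, object_columns):
        self.name = name
        self.columns = columns
        self.object_columns = object_columns
        self.rows = {}            # row_id -> {column: value or bytes}
        self.dirty = set()        # rows that still need to be synced upstream

    def upsert(self, row_id, values, objects=None):
        row = self.rows.setdefault(row_id, {})
        row.update(values)
        for col, data in (objects or {}).items():
            assert col in self.object_columns
            row[col] = data        # object data kept alongside tabular data
        self.dirty.add(row_id)     # tabular and object changes sync together

    def changes_to_sync(self):
        # A sync service would push these rows (and only these) upstream,
        # preserving row-level atomicity across tabular and object updates.
        return {row_id: self.rows[row_id] for row_id in sorted(self.dirty)}

# A photo-album row whose thumbnail object syncs together with its metadata.
album = SyncTable("photos", ["title", "taken_at"], ["thumbnail"])
album.upsert(1, {"title": "beach"}, objects={"thumbnail": b"\x89PNG..."})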

Available Media

RIPQ: Advanced Photo Caching on Flash for Facebook

Linpeng Tang, Princeton University; Qi Huang, Cornell University and Facebook; Wyatt Lloyd, University of Southern California and Facebook; Sanjeev Kumar, Facebook; Kai Li, Princeton University

Facebook uses flash devices extensively in its photo caching stack. The key design challenge for an efficient photo cache on flash at Facebook is its workload: many small random writes are generated by inserting cache-missed content, or updating cache-hit content for advanced caching algorithms. The Flash Translation Layer on flash devices performs poorly with such a workload, lowering throughput and decreasing device lifespan. Existing coping strategies under-utilize the space on flash devices, sacrificing cache capacity, or are limited to simple caching algorithms like FIFO, sacrificing hit ratios.

We overcome these limitations with the novel Restricted Insertion Priority Queue (RIPQ) framework that supports advanced caching algorithms with large cache sizes, high throughput, and long device lifespan. RIPQ aggregates small random writes, co-locates similarly prioritized content, and lazily moves updated content to further reduce device overhead. We show that two families of advanced caching algorithms, Segmented-LRU and Greedy-Dual-Size-Frequency, can be easily implemented with RIPQ. Our evaluation on Facebook’s photo trace shows that these algorithms running on RIPQ increase hit ratios by up to ~20% over the current FIFO system, incur low overhead, and achieve high throughput.
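
A minimal Python sketch of the restricted-insertion idea (illustrative only, not Facebook's implementation; the names are assumptions): priorities are quantized into a small number of sections, each with one active buffer, so insertions at arbitrary priorities become appends to a handful of large blocks that reach flash as big sequential writes.

BLOCK_SIZE = 4          # objects per block; real blocks would be many MB

class Section:
    def __init__(self):
        self.active = []          # buffer absorbing new insertions
        self.sealed = []          # large, immutable blocks already on flash

    def insert(self, obj):
        self.active.append(obj)
        if len(self.active) >= BLOCK_SIZE:
            self.sealed.append(self.active)   # one big sequential flash write
            self.active = []

class RestrictedInsertionQueue:
    def __init__(self, num_sections=3):
        self.sections = [Section() for _ in range(num_sections)]

    def insert(self, obj, priority):
        # Restrict insertion to one of a few points: the head of the section
        # whose priority range covers `priority` (0.0 = evict soon, 1.0 = hot).
        idx = min(int(priority * len(self.sections)), len(self.sections) - 1)
        self.sections[idx].insert(obj)

# A caching policy computes a priority per photo and inserts it; small objects
# never turn into small random writes on the device.
q = RestrictedInsertionQueue()
q.insert("photo-123", priority=0.9)
q.insert("photo-456", priority=0.2)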

Available Media