Rhea: Automatic Filtering for Unstructured Cloud Storage

Authors: 

Christos Gkantsidis, Dimitrios Vytiniotis, Orion Hodson, Dushyanth Narayanan, Florin Dinu, and Antony Rowstron, Microsoft Research, Cambridge

Abstract: 

Unstructured storage and data processing using platforms such as MapReduce are increasingly popular for their simplicity, scalability, and flexibility. Using elastic cloud storage and computation makes them even more attractive. However, cloud providers such as Amazon and Windows Azure separate their storage and compute resources even within the same data center. Transferring data from storage to compute thus uses core data center network bandwidth, which is scarce and oversubscribed. As the data is unstructured, the infrastructure cannot automatically apply selection, projection, or other filtering predicates at the storage layer. The problem is even worse if customers want to use compute resources from one provider but data stored with one or more other providers. The bottleneck is then the WAN link, which not only limits performance but also incurs egress bandwidth charges.

This paper presents Rhea, a system to automatically generate and run storage-side data filters for unstructured and semi-structured data. It uses static analysis of application code to generate filters that are safe, stateless, side-effect free, best effort, and transparent to both storage and compute layers. Filters never remove data that is used by the computation. Our evaluation shows that Rhea filters achieve a reduction in data transfer of 2x–20,000x, which reduces job run times by up to 5x and dollar costs for cross-cloud computations by up to 13x.
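The filter properties listed above can be made concrete with a small sketch. It is written in Python purely for illustration and is not Rhea's implementation: the mapper `count_errors_mapper`, the hand-written `row_filter`, the helper `storage_side_scan`, and the log format are all hypothetical. In Rhea the filter would be generated by static analysis of the application code; here it is written by hand to show what "safe" and "best effort" mean in practice.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def count_errors_mapper(line: str) -> Iterator[Tuple[str, int]]:
    """Hypothetical map function: emits (host, 1) for each ERROR record."""
    fields = line.split(" ")
    if len(fields) >= 3 and fields[1] == "ERROR":
        yield (fields[0], 1)


def row_filter(line: str) -> bool:
    """Hand-written stand-in for a generated row filter (an assumption).

    It over-approximates the mapper's condition: every line for which the
    mapper emits output contains " ERROR", so no needed row is dropped.
    It is stateless and has no side effects.
    """
    return " ERROR" in line


def storage_side_scan(lines: Iterable[str],
                      keep: Callable[[str], bool]) -> Iterator[str]:
    """Apply a filter at the storage side, keeping a row whenever in doubt."""
    for line in lines:
        try:
            wanted = keep(line)
        except Exception:
            wanted = True  # best effort: a failing filter never drops data
        if wanted:
            yield line


def run_mappers(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Run the mapper over every line and collect its output."""
    return [pair for line in lines for pair in count_errors_mapper(line)]


records = [
    "web1 ERROR disk full",
    "web2 INFO started",
    "web1 INFO saw ERROR earlier",  # passes the filter, ignored by the mapper
    "web3 ERROR timeout",
    "web2 DEBUG noop",
]
filtered = list(storage_side_scan(records, row_filter))
```

In this toy input the filter ships three of the five rows, and `run_mappers(filtered)` produces the same output as `run_mappers(records)`. The third record shows why a conservative filter is acceptable: it transfers a row the mapper ends up ignoring, which costs bandwidth but never changes the result.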

BibTeX
@inproceedings {180321,
author = {Christos Gkantsidis and Dimitrios Vytiniotis and Orion Hodson and Dushyanth Narayanan and Florin Dinu and Antony Rowstron},
title = {Rhea: Automatic Filtering for Unstructured Cloud Storage},
booktitle = {10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13)},
year = {2013},
isbn = {978-1-931971-00-3},
address = {Lombard, IL},
pages = {343--355},
url = {https://www.usenix.org/conference/nsdi13/technical-sessions/presentation/gkantsidis},
publisher = {USENIX Association},
month = apr,
}

Public Summary: 

by Wenke Lee

This paper presents a new approach to reducing data transmission between storage nodes and compute nodes in the context of MapReduce in the cloud. The main idea is to use static analysis techniques to extract the row and column filtering logic implicitly contained in the original MapReduce programs. An evaluation was performed on 160 mappers, and the results showed that this approach is effective in filtering out data that is not needed for further processing.
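To illustrate what column filtering means here, the following is a minimal Python sketch and not the paper's Hadoop/Java implementation. It assumes a hypothetical comma-separated record layout and a hypothetical mapper, `spend_mapper`, that reads only fields 0 and 3; the set `USED_COLUMNS` stands in for what a static analysis would infer from the mapper's code. Blanking unused fields while keeping the separators is one possible design, chosen here so that field positions stay valid and the mapper runs unchanged.

```python
from typing import FrozenSet, Tuple

# Assumption: the field indexes the mapper actually reads.
USED_COLUMNS: FrozenSet[int] = frozenset({0, 3})


def column_filter(line: str,
                  used: FrozenSet[int] = USED_COLUMNS,
                  sep: str = ",") -> str:
    """Blank out unused fields but keep separators so indexes do not shift."""
    fields = line.split(sep)
    return sep.join(f if i in used else "" for i, f in enumerate(fields))


def spend_mapper(line: str) -> Tuple[str, float]:
    """Hypothetical map function: emits (user, amount) from fields 0 and 3."""
    fields = line.split(",")
    return (fields[0], float(fields[3]))


row = "alice,2013-04-02,Lombard,12.50,some long free-text comment"
slim = column_filter(row)
```

Here `slim` is `"alice,,,12.50,"`, which is much shorter than `row`, and `spend_mapper(slim)` returns the same pair as `spend_mapper(row)`.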

This paper identifies and addresses an important problem facing applications in the cloud environment. The solution is sound, simple and elegant, and is transparent to application programmers. The authors implemented a real system and evaluated it using real data.

However, the solution currently only works for Hadoop/Java. More seriously, the solution does not address the harder problem of extracting structured information from unstructured or semi-structured data.
