Scaling Databases and File APIs with Programmable Ceph Object Storage

Monday, February 24, 2020 - 1:30 pm–2:00 pm

Jeff LeFevre and Carlos Maltzahn, University of California, Santa Cruz

Abstract: 

The Skyhook Data Management project (SkyhookDM.com) at the Center for Research in Open Source Software (cross.ucsc.edu) at UC Santa Cruz implements customized extensions through Ceph's object class interface that enable offloading database operations to the storage system. In our previous Vault '19 talk, we showed how SkyhookDM can transparently scale out databases. The SkyhookDM Ceph extensions are an example of our 'programmable storage' research efforts at UCSC, and can be accessed through commonly available external/foreign table database interfaces. Utilizing fast in-memory serialization libraries such as Google FlatBuffers and Apache Arrow, SkyhookDM currently implements common database functions such as SELECT, PROJECT, AGGREGATE, and indexing inside Ceph, along with lower-level data manipulations such as transforming data from row to column formats on RADOS servers.

In this talk, we will present three of our latest developments on the SkyhookDM project since Vault '19. First, SkyhookDM can also be used to offload operations of access libraries that support plugins for backends, such as HDF5 and its Virtual Object Layer. Second, in addition to the row-oriented data format using Google's FlatBuffers, we have added support for column-oriented data formats using the Apache Arrow library within our Ceph extensions. Third, we added dynamic switching between row and column data formats within Ceph objects, a first step towards physical design management in storage systems, similar to physical design tuning in database systems.
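
To make the row and column formats mentioned above concrete, here is a minimal sketch in plain Python. It is an illustration only and is not SkyhookDM code: it uses no Ceph, FlatBuffers, or Arrow APIs, and every name in it (`rows_to_columns`, `columns_to_rows`, `select_project`, `aggregate_sum`, `OBJECT_ROWS`) is hypothetical. It assumes that one object holds a small batch of records and shows how the same data can be pivoted between layouts, and why a filter-and-project step reads naturally over rows while an aggregate reads naturally over a single column.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Columns = Dict[str, List[Any]]


def rows_to_columns(rows: List[Row]) -> Columns:
    """Pivot a row-oriented batch into a column-oriented one."""
    if not rows:
        return {}
    # Assumes every row has the same fields as the first row.
    return {name: [row[name] for row in rows] for name in rows[0]}


def columns_to_rows(columns: Columns) -> List[Row]:
    """Pivot a column-oriented batch back into rows."""
    if not columns:
        return []
    names = list(columns)
    length = len(columns[names[0]])
    return [{name: columns[name][i] for name in names} for i in range(length)]


def select_project(
    rows: List[Row], predicate: Callable[[Row], bool], fields: List[str]
) -> List[Row]:
    """Keep rows matching the predicate (SELECT), then keep only `fields` (PROJECT)."""
    return [{f: row[f] for f in fields} for row in rows if predicate(row)]


def aggregate_sum(columns: Columns, field: str) -> Any:
    """Sum one column (AGGREGATE); only that column's values are touched."""
    return sum(columns[field])


# Hypothetical contents of a single stored object.
OBJECT_ROWS: List[Row] = [
    {"id": 1, "region": "west", "price": 10.0},
    {"id": 2, "region": "east", "price": 4.5},
    {"id": 3, "region": "west", "price": 7.5},
]

if __name__ == "__main__":
    cols = rows_to_columns(OBJECT_ROWS)
    print(select_project(OBJECT_ROWS, lambda r: r["region"] == "west", ["id", "price"]))
    print(aggregate_sum(cols, "price"))
```

In an offloading design of the kind the abstract describes, functions like these would run on the storage servers next to the data, so that only the filtered or aggregated result travels back to the client. How SkyhookDM actually implements this inside Ceph is not shown here.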

Jeff LeFevre, University of California, Santa Cruz

Jeff LeFevre is an adjunct professor for Computer Science & Engineering at UC Santa Cruz. He currently leads the SkyhookDM project, and his research interests are in cloud databases, database physical design, and storage systems. Dr. LeFevre joined the CSE faculty in 2018, and has previously worked on the Vertica database for HP.

Carlos Maltzahn, University of California, Santa Cruz

Carlos Maltzahn is an adjunct professor for Computer Science & Engineering at UC Santa Cruz. He is the founder and director of the Center for Research in Open Source Software (cross.ucsc.edu), and a co-founder of the Systems Research Lab, known for its cutting-edge work on programmable storage systems, big data storage & processing, scalable data management, distributed system performance management, and practical replicable evaluation of computer systems. In 2005 he co-founded and became a key mentor on Sage Weil’s Ceph project. Dr. Maltzahn joined the CSE faculty in 2008, has graduated nine Ph.D. students since, and has previously worked on storage for NetApp.

BibTeX
@conference {246548,
author = {Jeff LeFevre and Carlos Maltzahn},
title = {Scaling Databases and File APIs with Programmable Ceph Object Storage},
year = {2020},
address = {Santa Clara, CA},
publisher = {{USENIX} Association},
month = feb,
}
