Canopus: Enabling Extreme-Scale Data Analytics on Big HPC Storage via Progressive Refactoring


Tao Lu, New Jersey Institute of Technology; Eric Suchyta, Jong Choi, Norbert Podhorszki, and Scott Klasky, Oak Ridge National Laboratory; Qing Liu, New Jersey Institute of Technology; Dave Pugmire and Matt Wolf, Oak Ridge National Laboratory; Mark Ainsworth, Brown University


High-accuracy scientific simulations on high-performance computing (HPC) platforms generate large amounts of data. To allow data to be efficiently analyzed, simulation outputs need to be refactored, compressed, and properly mapped onto storage tiers. This paper presents Canopus, a progressive data management framework for storing and analyzing big scientific data. Canopus refactors simulation results, with fairly low overhead, into a much smaller dataset plus a series of deltas. The refactored data are then compressed, mapped, and written onto storage tiers. For data analytics, refactored data are selectively retrieved to restore data at a specific level of accuracy that satisfies analysis requirements. Canopus enables end users to make trade-offs between analysis speed and accuracy on the fly. Canopus is demonstrated and thoroughly evaluated using blob detection on fusion simulation data.
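
The base-plus-deltas idea in the abstract can be illustrated with a minimal sketch. It is not the Canopus implementation: it assumes a 1-D list of samples, a keep-every-second-sample decimation, and a repeat-each-sample estimate of the finer level, and it omits compression and storage-tier mapping. The function names `decimate`, `upsample`, `refactor`, and `restore` are hypothetical and chosen only for illustration.

```python
def decimate(values):
    """Keep every second sample (an assumed stand-in for a real decimation step)."""
    return values[::2]


def upsample(coarse, length):
    """Estimate a finer level of `length` samples by repeating each coarse sample."""
    return [coarse[i // 2] for i in range(length)]


def refactor(values, levels):
    """Split `values` into a small base plus a list of deltas, coarsest delta first."""
    deltas = []
    current = list(values)
    for _ in range(levels):
        coarse = decimate(current)
        estimate = upsample(coarse, len(current))
        # The delta records what the coarse level cannot reproduce.
        deltas.append([v - e for v, e in zip(current, estimate)])
        current = coarse
    deltas.reverse()  # order the deltas the way restore() consumes them
    return current, deltas


def restore(base, deltas, num_deltas):
    """Rebuild the data using only the first `num_deltas` deltas."""
    current = list(base)
    for delta in deltas[:num_deltas]:
        estimate = upsample(current, len(delta))
        current = [e + d for e, d in zip(estimate, delta)]
    return current
```

In this sketch, calling `restore` with `num_deltas=0` returns only the small base, while passing the full number of deltas reproduces the original samples. Reading fewer deltas means less data to retrieve at the cost of a coarser result, which mirrors the speed versus accuracy trade-off the abstract describes.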

@inproceedings {203370,
author = {Tao Lu and Eric Suchyta and Jong Choi and Norbert Podhorszki and Scott Klasky and Qing Liu and Dave Pugmire and Matthew Wolf and Mark Ainsworth},
title = {Canopus: Enabling {Extreme-Scale} Data Analytics on Big {HPC} Storage via Progressive Refactoring},
booktitle = {9th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 17)},
year = {2017},
address = {Santa Clara, CA},
url = {},
publisher = {USENIX Association},
month = jul