Experiences in Managing the Performance and Reliability of a Large-Scale Genomics Cloud Platform

Authors: 

Michael Hao Tong, Robert L. Grossman, and Haryadi S. Gunawi, University of Chicago

Abstract: 

We share our technical experiences in improving the performance of long-running jobs on the Genomic Data Commons (GDC), a large-scale cancer genomics cloud platform. We show how common bioinformatics workloads can cause VMs to age after several days, triggering a large number of Extended Page Table (EPT) violations that significantly degrade performance. We present host- and VM-level EPT monitoring and evaluate several possible mitigation scenarios. We also highlight the long investigative process this research required, with individual experiments taking many days to complete.

BibTeX
@inproceedings {273829,
author = {Michael Hao Tong and Robert L. Grossman and Haryadi S. Gunawi},
title = {Experiences in Managing the Performance and Reliability of a {Large-Scale} Genomics Cloud Platform},
booktitle = {2021 USENIX Annual Technical Conference (USENIX ATC 21)},
year = {2021},
isbn = {978-1-939133-23-6},
pages = {973--988},
url = {https://www.usenix.org/conference/atc21/presentation/tong},
publisher = {USENIX Association},
month = jul
}
