Experiences in Managing the Performance and Reliability of a Large-Scale Genomics Cloud Platform


Michael Hao Tong, Robert L. Grossman, and Haryadi S. Gunawi, University of Chicago


We share our technical experiences in improving the performance of long-running jobs on the Genomic Data Commons (GDC), a large-scale cancer genomics cloud platform. We show how common bioinformatics workloads can cause VMs to age after several days, triggering a large number of Extended Page Table (EPT) violations that significantly degrade performance. We present host- and VM-level EPT monitoring and evaluate several possible mitigation scenarios. We also highlight the long investigative process this research required, with individual experiments taking many days to complete.

