Hyrax: Fail-in-Place Server Operation in Cloud Platforms

Authors: 

Jialun Lyu, Microsoft Azure and University of Toronto; Marisa You, Celine Irvene, Mark Jung, Tyler Narmore, Jacob Shapiro, Luke Marshall, and Savyasachi Samal, Microsoft Azure; Ioannis Manousakis and Lisa Hsu, Formerly of Microsoft Azure; Preetha Subbarayalu, Ashish Raniwala, Brijesh Warrier, and Ricardo Bianchini, Microsoft Azure; Bianca Schroeder, University of Toronto; Daniel S. Berger, Microsoft Azure and University of Washington

Abstract: 

Today’s cloud platforms handle server hardware failures by shutting down the affected server and only turning it back online once it has been repaired by a technician. At cloud scale, this all-or-nothing operating model is becoming increasingly unsustainable. This model is also at odds with technology trends, such as the need for new cooling technology.

This paper introduces Hyrax, a datacenter stack that enables compute servers with failed components to continue hosting VMs while hiding the underlying degraded capacity and performance. A key enabler of Hyrax is a novel model of changes in memory interleaving when deactivating faulty memory modules. Experiments on cloud production servers show that Hyrax overcomes common hardware failures without impacting peak VM performance. In large-scale simulations with production traces, Hyrax reduces server repair requirements by 50-60% without impacting VM scheduling.

OSDI '23 Open Access Sponsored by King Abdullah University of Science and Technology (KAUST)

BibTeX
@inproceedings {288590,
author = {Jialun Lyu and Marisa You and Celine Irvene and Mark Jung and Tyler Narmore and Jacob Shapiro and Luke Marshall and Savyasachi Samal and Ioannis Manousakis and Lisa Hsu and Preetha Subbarayalu and Ashish Raniwala and Brijesh Warrier and Ricardo Bianchini and Bianca Schroeder and Daniel S. Berger},
title = {Hyrax: {Fail-in-Place} Server Operation in Cloud Platforms},
booktitle = {17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23)},
year = {2023},
isbn = {978-1-939133-34-2},
address = {Boston, MA},
pages = {287--304},
url = {https://www.usenix.org/conference/osdi23/presentation/lyu},
publisher = {USENIX Association},
month = jul
}
