Organizational Design for Technical Emergency Response in Distributed Computing Systems

Note: Presentation times are in Pacific Daylight Time (PDT).

Wednesday, June 02, 2021 - 9:45 am–10:30 am

Adrienne Walcer and Alexander Perry, Google Inc.


When a company critically relies on the ongoing functioning of a complex and highly interconnected technical stack, supporting that stack requires that appropriate personnel be reliably available to troubleshoot and correct issues as they occur. We refer to these personnel as responders. When the scope of a technical stack grows beyond one person's capacity to understand and maintain state, the stack is split up so that multiple responders can each provide coverage on a single component of the whole. Such a highly interconnected system-of-systems (SoS) allows production issues to cascade throughout wide swaths of the SoS, or to sneak in between system-to-system (StS) boundaries. Here we explore one private industry implementation of a responder group designed to respond to emergent failures in a distributed computing SoS. By contrasting the functions of component responders and SoS responders, we demonstrate that the component ownership skill set is distinguishable from the core skill set of an SoS responder. Technical organizations can benefit from setting up SoS response to enable expedient mitigation of distributed system outages.

Adrienne Walcer, Google Inc.

Adrienne has been at Google for 8 years, currently as a Technical Program Manager in Site Reliability Engineering (SRE). She is the program lead for Incident Management, and focuses on the lifecycle of large scale emergencies. Before Google, Adrienne was a Data Scientist at Explorys Inc. She studied Biostatistics at the University of Rochester and is currently pursuing a Master of Science in Systems Engineering at George Washington University.

Alexander Perry, Google Inc.

Dr. Perry received his Ph.D. and Masters in Engineering from the University of Cambridge, England, completing research to develop new techniques for precision electromagnetic characterization of superconductors. He has worked at Google for 15 years as a Staff Site Reliability Engineer (SRE) on high performance network technologies and their associated security systems. He currently leads testing programs in support of disaster resiliency.

@conference {272763,
author = {Adrienne Walcer and Alexander Perry},
title = {Organizational Design for Technical Emergency Response in Distributed Computing Systems},
year = {2021},
publisher = {{USENIX} Association},
month = jun,