The Secret Lives of SREs - Controlling the Costs of Coordination across Remote Teams

Monday, December 07, 2020 - 10:15 am–11:00 am

Laura Maguire, PhD

Abstract: 

If you ask a group of engineers how they resolved a particularly difficult outage, they typically talk about the dashboards that got pulled up, the logs they looked at, the node someone restarted, or the jobs they killed on their way to restoring the service. But that doesn't do much to tell us how, given conditions of uncertainty and time pressure, practitioners flexibly apply their knowledge to novel problems.

In other words, how did an engineer know what was the 'right' thing to do in spite of ambiguous data or who was the 'right' person to help diagnose a particularly out-of-control problem?

Over the last 3 years, as part of her dissertation work, Dr. Maguire studied both established and ad hoc teams of engineers responding in real time to service outages ranging from minor disruptions to potentially organizational-viability-crushing events. In her research, she examined 62 cases of incident response across 4 organizations of varying scale and complexity to understand how engineering teams manage the costs of coordination under differing circumstances.

In this talk, Dr. Maguire will highlight some surprising (and provocative!) findings such as:

  • Incident management works very differently than existing domain models (like Google SRE) suggest, and incident command can actually be counterproductive to fast resolution;
  • The choreography of the cognitive work in this joint activity is shown to be much more subtle and more highly integrated into the technical efforts of dynamic fault management than previously understood;
  • Tooling designed to aid coordination can incur additional cognitive costs for practitioners;
  • Strategies of 'adaptive choreography' enable practitioners to cope with dynamic events and dynamic coordination demands;
  • Tooling and intra-organizational dependencies can shift costs of coordination across time and organizational boundaries, increasing complexity for SREs.

Laura Maguire, PhD

Laura leads the research program at Jeli.io, where she studies software engineers as they cope with the cognitive complexities of keeping distributed, continuous deployment systems reliably functioning. She helps to translate those findings into a product that is advancing the state of the art of incident management in the software industry. Her research interests lie in resilience engineering, coordination design, resilient systems control, and building tooling to enable adaptive capacity across distributed work teams. She was a researcher with the SNAFU Catchers Consortium from 2017–2020, working closely with large and medium-sized digital service companies to identify and support resilient performance within their engineering teams. Laura has a Master's degree in Human Factors & Systems Safety, a Ph.D. in Integrated Systems Engineering from the Ohio State University, and extensive experience working in industrial safety & risk management.

BibTeX
@inproceedings {262255,
author = {Laura Maguire},
title = {The Secret Lives of SREs - Controlling the Costs of Coordination across Remote Teams},
booktitle = {SREcon20 Americas (SREcon20 Americas)},
year = {2020},
url = {https://www.usenix.org/node/262256},
publisher = {{USENIX} Association},
month = dec,
}
