We're Still Down: A Metastable Failure Tale

Tuesday, March 21, 2023 - 11:00 am–11:45 am

Kyle Lexmond

Abstract: 

"The status? The system has been down for hours, and we haven't been able to get it back up yet"—words on an incident conference call that you probably don't want to hear.

This talk explores how a globally distributed CDN experienced a metastable failure, design changes that make future failures less likely, and the unorthodox fix that made a recovery possible (and can hopefully apply to future metastable failures—maybe even yours).

Kyle Lexmond

Kyle is an almost-SWE who learned about Site Reliability Engineering in passing conversation during university, changing the course of his career.

Having worked at big names (Twitter, Amazon, Facebook) and small (CBSA, Kik), he mainly enjoys building optimized, efficient systems that break less often after he touches them.

He currently lives in Seattle with a partner and an adorable dog. (Yes, he has pictures.)


BibTeX
@conference {286238,
author = {Kyle Lexmond},
title = {We{\textquoteright}re Still Down: A Metastable Failure Tale},
year = {2023},
address = {Santa Clara, CA},
publisher = {USENIX Association},
month = mar
}
