We're Still Down: A Metastable Failure Tale

Tuesday, March 21, 2023 - 11:00 am–11:45 am

Kyle Lexmond


"The status? The system has been down for hours, and we haven't been able to get it back up yet"—words on an incident conference call that you probably don't want to hear.

This talk explores how a globally distributed CDN experienced a metastable failure, design changes that make future failures less likely, and the unorthodox fix that made a recovery possible (and can hopefully apply to future metastable failures—maybe even yours).

Kyle is an almost-SWE who learned about Site Reliability Engineering in passing conversation during university, changing the course of his career.

Having worked at big names (Twitter, Amazon, Facebook) and small (CBSA, Kik), he mainly enjoys building optimized and efficient systems that break less often after he touches them.

He currently lives in Seattle with a partner and an adorable dog. (Yes, he has pictures.)

