Improving a Distributed System Post-Incident

Due to the evolving Coronavirus/COVID-19 situation, SREcon20 Americas West has been rescheduled to June 2–4, 2020.

Thursday, March 26, 2020 - 2:30 pm–3:20 pm

Julius Zerwick, DigitalOcean

Abstract: 

What happens when a team faces an extended, complex incident that forces them to reevaluate their distributed system's performance and reliability?

In this talk, we will take a deep dive into such an incident faced by the software-defined networking team at DigitalOcean. This incident led us to reevaluate our system's architecture and overhaul key areas in our codebase to improve our monitoring, testing, database interactions, reliability, and system performance.

We'll also explore how DigitalOcean's practices and processes cultivate a blameless culture that enables teams to rally together during high-pressure incidents. Join us as we explore a case study in how a team of engineers can improve a distributed system post-incident!

Julius Zerwick, DigitalOcean

Julius Zerwick is a software engineer at DigitalOcean, where he works on software-defined networking and distributed systems. His areas of interest include distributed systems, computer networking, web development, and Go. He lives in New York City and can be found touring national parks when not coding.

BibTeX
@conference {247351,
author = {Julius Zerwick},
title = {Improving a Distributed System Post-Incident},
year = {2020},
address = {Santa Clara, CA},
publisher = {{USENIX} Association},
month = mar,
}