Jupyter as Incident Response Tool

Monday, December 07, 2020 - 3:25 pm–3:45 pm

Moshe Zadka, Twisted Matrix Laboratories


Jupyter is commonly thought of as a "data science tool". But the same features that make it appealing to data scientists also make it appealing for Site Reliability Engineering: dynamic exploration and the ability to share results. The talk will set up an "incident" in which a cache slowdown is causing site problems and will show how we can use Jupyter to triage and remediate the problem. I'll also cover post-incident best practices: how to make sure that what has been done is properly documented and ready for the incident retrospective.

Moshe Zadka, Twisted Matrix Laboratories

Moshe has been a DevOps/SRE since before those terms existed, caring deeply about software reliability, build reproducibility, and other such things. He has worked at companies as small as three people and as big as tens of thousands, usually someplace around where software meets system administration.

@inproceedings {262192,
author = {Moshe Zadka},
title = {Jupyter as Incident Response Tool},
booktitle = {SREcon20 Americas (SREcon20 Americas)},
year = {2020},
url = {https://www.usenix.org/conference/srecon20americas/presentation/zadka},
publisher = {{USENIX} Association},
month = dec,
}
