Jupyter as Incident Response Tool

Monday, December 07, 2020 - 3:25 pm–3:45 pm

Moshe Zadka, Twisted Matrix Laboratories


Jupyter is commonly thought of as a "data science tool". But the same features that make it appealing to data scientists also make it appealing for Site Reliability Engineering: dynamic exploration and the ability to share results. The talk will set up an "incident" in which a cache slowdown is causing site problems and will show how we can use Jupyter to triage and remediate the problem. I'll also cover post-incident best practices: how to make sure that what has been done is properly documented and ready for the incident retrospective.

Moshe has been a DevOps/SRE since before those terms existed, caring deeply about software reliability, build reproducibility, and other such things. He has worked at companies as small as three people and as big as tens of thousands, usually somewhere near where software meets system administration.

