Fixing On-Call When Nobody Thinks It's (Too) Broken

Monday, March 25, 2019 - 11:05 am–11:35 am

Tony Lykke, Hudson River Trading


What's a team to do when they receive more than 30 pages a day, every day, for almost a decade? Deny there's a problem, of course! Join me as we relive the data-informed journey from around 70,000 pages over 7 years (~200/week) to under 50/week in just a few short months. We did it in a way that shows those carrying the pager that improvement is possible, and that empowers them to keep questioning and improving the status quo. We'll look not only at the technical challenges (like working with Nagios 3 in 2018, *shudder*), but also at the non-technical ones, like getting buy-in when nobody thinks there's a problem and managing risk when the on-call team is concerned about silencing legitimate pages along with the noise.

Tony is an SRE on the trade systems team at Hudson River Trading based in NYC, where he gets to tackle hard (often not just technically) automation problems and tech debt cleanup projects across a variety of environments. He is obsessively anti-toil, and regularly refuses to accept "that's just the way it is" as an answer.

@conference {229509,
author = {Tony Lykke},
title = {Fixing On-Call When Nobody Thinks It{\textquoteright}s (Too) Broken},
year = {2019},
address = {Brooklyn, NY},
publisher = {{USENIX} Association},