How To Not Get Paged: Managing Oncall to Reduce Outages
LISA: Where systems engineering and operations professionals share real-world knowledge about designing, building, and maintaining the critical systems of our interconnected world.
The LISA conference has long served as the annual vendor-neutral meeting place for the wider system administration community. The LISA14 program recognized the overlap and differences between traditional and modern IT operations and engineering, and developed a highly-curated program around 5 key topics: Systems Engineering, Security, Culture, DevOps, and Monitoring/Metrics. The program included 22 half- and full-day training sessions; 10 workshops; and a conference program consisting of 50 invited talks, panels, refereed paper presentations, and mini-tutorials.
Grand Ballroom A
People think of “oncall” as responding to a pager that beeps because of an outage. In this class you will learn how to use oncall as a vehicle to improve system reliability so that you get paged less often.
This talk includes never-before-seen material from the new book, “The Practice of Cloud System Administration” by Limoncelli, Chalup, and Hogan.
Anyone with an oncall responsibility (or their manager).
- How to monitor more accurately so you get paged less
- How to design an oncall schedule so that it is more fair and less stressful
- How to ensure that preventive work and long-term solutions get done between oncall shifts
- How to conduct “Fire Drills” and “Game Day Exercises” to create antifragile systems
- How to write a good Post-mortem document that communicates better and prevents future problems
- Why your monitoring strategy is broken and how to fix it
- Building a more fair oncall schedule
- Monitoring to detect outages vs. monitoring to improve reliability
- Alert review strategies
- Conducting “Fire Drills” and “Game Day Exercises”
- “Blameless Post-mortem” documents