How To Not Get Paged: Managing Oncall to Reduce Outages
Grand Ballroom A
People think of “oncall” as responding to a pager that beeps because of an outage. In this class you will learn how to use oncall as a vehicle to improve system reliability so that you get paged less often.
This talk includes never-before-seen material from the new book, “The Practice of Cloud System Administration” by Limoncelli, Chalup, and Hogan.
Anyone with an oncall responsibility (or their manager).
- How to monitor more accurately so you get paged less
- How to design an oncall schedule so that it is more fair and less stressful
- How to ensure that preventative work and long-term solutions get done between oncall shifts
- How to conduct “Fire Drills” and “Game Day Exercises” to create antifragile systems
- How to write a good post-mortem document that communicates clearly and helps prevent future problems
- Why your monitoring strategy is broken and how to fix it
- Building a more fair oncall schedule
- Monitoring to detect outages vs. monitoring to improve reliability
- Alert review strategies
- Conducting “Fire Drills” and “Game Day Exercises”
- Blameless post-mortem documents