People think of “on call” as responding to a pager that beeps because of an outage. In this class, you will learn how to run an on-call system that improves uptime and reduces how often you are paged. We will start with a monitoring philosophy that prevents outages. Then we will discuss how to construct an on-call schedule, possibly in more detail than you've cared about before, and, as a result, it will be more fair and less stressful. We'll discuss how to conduct “fire drills” and “game day exercises” that create antifragile systems. Lastly, we'll discuss how to conduct a postmortem exercise that promotes better communication and prevents future problems.
Managers or sysadmins with on-call responsibility
- Knowledge that makes being on call more fair and less stressful
- Strategies for using monitoring to improve uptime and reliability
- Team-training techniques such as "fire drills" and "game day exercises"
- How to conduct better postmortems/learning retrospectives
- Why your monitoring strategy is broken and how to fix it
- Building a more fair on-call schedule
- Monitoring to detect outages vs. monitoring to improve reliability
- Alert review strategies
- Conducting “fire drills” and “game day exercises”
- "Blameless postmortem documents"