
How to Not Get Paged: Managing On-call to Reduce Outages

Half Day Afternoon
(1:30 pm-5:00 pm)

Lincoln 4

LISA15: Culture
T7
Thomas A. Limoncelli, Stack Overflow
Description: 

People think of “on call” as responding to a pager that beeps because of an outage. In this class, you will learn how to run an on-call system that improves uptime and reduces how often you are paged. We will start with a monitoring philosophy that prevents outages. Then we will discuss how to construct an on-call schedule, possibly in more detail than you've cared about before, so that it is more fair and less stressful. We'll discuss how to conduct “fire drills” and “game day exercises” that create antifragile systems. Lastly, we'll discuss how to conduct a postmortem exercise that promotes better communication and prevents future problems.

Who should attend: 

Sysadmins, devs, operations, and their managers

Take back to work: 
  • Knowledge that makes being on call more fair and less stressful
  • Strategies for using monitoring to improve uptime and reliability
  • Team-training techniques such as "fire drills" and "game day exercises"
  • How to conduct better postmortems/learning retrospectives
Topics include: 
  • Why your monitoring strategy is broken and how to fix it
  • Building a more fair on-call schedule
  • Monitoring to detect outages vs. monitoring to improve reliability
  • Alert review strategies
  • Conducting “fire drills” and “game day exercises”
  • "Blameless postmortem documents"
Presentation Type: 
Training

© USENIX

LISA is a registered trademark of the USENIX Association.
