Site Reliability Engineering (SRE) and the Art of Service Level Objectives (SLOs)

Due to the evolving Coronavirus/COVID-19 situation, SREcon20 Americas West has been rescheduled to June 2–4, 2020.

Wednesday, March 25, 2020 - 11:30 am–1:00 pm

Nathen Harvey and Stephanie Hippo, Google

Abstract: 

SRE is a set of principles, practices, and organizational constructs that seek to balance the reliability of a service with the need to continually deliver new features. An error budget is the primary construct used to help balance these seemingly competing goals.

This workshop introduces error budgets and their components: service level indicators (SLIs) and service level objectives (SLOs). Participants will learn how to create and implement SLOs through a series of guided discussions and group exercises.

The workshop is appropriate for all levels of technical capability, and non-technical participants from "the business" are encouraged to attend; we seek to build a common language across teams.

By the end of this workshop, participants will be able to:

  • Describe key concepts: Error Budget, SLIs, and SLOs
  • Create an error budget
  • Recommend actions to take when the error budget is consumed
  • Recommend actions to take when excess error budget remains
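
As a rough illustration of how an SLI, an SLO, and an error budget relate, here is a minimal Python sketch. It is not taken from the workshop materials. It assumes a request-based SLI (the fraction of requests that were "good") and the common definition of the budget as 1 minus the SLO target; the class name, field names, and sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Error budget for a request-based SLI over one SLO window (hypothetical model)."""

    slo_target: float     # e.g. 0.999 means 99.9% of requests should be good
    total_requests: int   # valid requests observed in the window
    failed_requests: int  # requests that did not meet the "good" criteria

    def __post_init__(self) -> None:
        # Assumption: a target of exactly 0 or 1 leaves no meaningful budget.
        if not 0 < self.slo_target < 1:
            raise ValueError("slo_target must be strictly between 0 and 1")
        if self.total_requests <= 0:
            raise ValueError("total_requests must be positive")
        if not 0 <= self.failed_requests <= self.total_requests:
            raise ValueError("failed_requests must be between 0 and total_requests")

    @property
    def sli(self) -> float:
        """Measured fraction of good requests in the window."""
        return 1 - self.failed_requests / self.total_requests

    @property
    def allowed_failures(self) -> float:
        """Failures the SLO tolerates: (1 - SLO target) * total requests."""
        return (1 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        """Share of the budget still unspent; negative once it is overspent."""
        return 1 - self.failed_requests / self.allowed_failures

    @property
    def exhausted(self) -> bool:
        """True when no budget remains."""
        return self.remaining_fraction <= 0


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests against a 99.9% SLO.
    budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    print(f"SLI: {budget.sli:.4%}")
    print(f"Budget remaining: {budget.remaining_fraction:.0%}")
    print(f"Exhausted: {budget.exhausted}")
```

In this sketch, an exhausted budget and a largely unspent one are the two situations the last two learning objectives refer to; what a team actually does in either case is a policy decision that the code does not model.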

Stephanie Hippo, Google

Stephanie is a Senior Site Reliability Engineer at Google, where she leads a team of eight supporting Google's internal services. She enjoys exploring the data-driven nature of reliability work, teaching teams how to adopt reliability principles, and encouraging the career growth of others. Away from the keyboard, she enjoys baking desserts, playing soccer, and hanging with her lazy, tiny dog.

BibTeX
@conference {247271,
author = {Nathen Harvey and Stephanie Hippo},
title = {Site Reliability Engineering ({SRE}) and the Art of Service Level Objectives ({SLOs})},
year = {2020},
address = {Santa Clara, CA},
publisher = {{USENIX} Association},
month = mar,
}