Avoiding the 'SLOs as Reliability Theater' trap

July 31, 2021

Column

Authors:

Jacob Scott

Article shepherded by:

Laura Nolan

In her recent ;login: article 'Seeing Like an SRE: Site Reliability Engineering as High Modernism' [1], Laura Nolan considers SRE through the prism of James C. Scott’s Seeing Like a State [2], and suggests that for some sysadmins, “SRE is seen as a high modernist project, intent on scientifically managing their systems, all techne and no metis” [3]. This is an astute observation, but one complicated by the extent to which the term SRE has become overloaded in recent years. What exactly in modern software reliability is at risk of being undermined by high modernism? Let us consider the common practice of managing reliability according to Service Level Objectives (SLOs).

In the foreword to Implementing Service Level Objectives [4], David N. Blank-Edelman frames SLOs as a way to have “better reliability conversations. Conversations which put humans first.” The Site Reliability Workbook [5] characterizes them as “collaborative decisions among stakeholders.” This is not high-modernist language, but it is unfortunately also not always an accurate description of how SLOs are used in practice.

Modern software organizations tend to be enamored of metrics, an affinity that can be traced from Peter Drucker’s Management by Objectives [6] to Andy Grove’s seminal High Output Management [7] to Google’s Objectives and Key Results goal-setting methodology [8]. Metrics are a powerful abstraction: they can be transformed, stored, visualized, composed, and rolled up. For practitioners working on production systems, metrics are foundational to how we monitor those systems. This is also true for executives, whose system is the organization writ large.

Metrics are vulnerable to surrogation: what we care about and our measure of it can drift apart, and we tend to focus on the latter (a phenomenon closely related to Goodhart’s Law). If we care about user happiness, measure happiness as the percentage of HTTP 200s, then return incorrect data with a 200 status and are not alerted, we’re in trouble. In this simple example we can, of course, update our measure to consider data correctness. However, SLOs can surrogate their way into problematic high modernism when their materialization as metrics replaces the collaborative conversation between stakeholders that they actually represent.
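To make the HTTP 200 example concrete, here is a minimal sketch in Python of the two measures side by side. Everything in it is an assumption made for illustration: the `Response` record, its `body_correct` field (in practice, knowing whether a payload was correct is the hard part), and the choice to treat an empty window as fully healthy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    """One served request. `body_correct` is a hypothetical field that
    assumes some out-of-band check can tell us the payload was right."""
    status: int
    body_correct: bool


def status_only_sli(responses):
    """Fraction of responses that returned HTTP 200, ignoring the payload."""
    if not responses:
        return 1.0  # assumption: no traffic counts as no failures
    return sum(r.status == 200 for r in responses) / len(responses)


def correctness_aware_sli(responses):
    """Fraction of responses that returned HTTP 200 with correct data."""
    if not responses:
        return 1.0  # same assumption as above
    good = sum(r.status == 200 and r.body_correct for r in responses)
    return good / len(responses)


# 97 good responses, 2 "successful" responses with wrong data, 1 server error.
sample = (
    [Response(200, True)] * 97
    + [Response(200, False)] * 2
    + [Response(500, False)]
)

print(f"status-only SLI:       {status_only_sli(sample):.2%}")
print(f"correctness-aware SLI: {correctness_aware_sli(sample):.2%}")
```

On this sample the status-only measure reports 99.00% while the correctness-aware one reports 97.00%. The gap between the two is exactly the set of users who received wrong data with a 200 status, and it is invisible to anyone looking only at the first graph.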

If SLOs are introduced top-down with the primary goal of enabling leadership to exert control over teams and hold them accountable for (insufficient) reliability, and teams review them only before presentations to leadership to make sure their graphs look good, then much of what is happening is reliability theater and problematic high modernism. What makes this path a real risk rather than a rare edge case is the extent to which improvements in metric collection and computation have outpaced our ability to align stakeholders on meaningful SLOs. Sadly, while Moore’s law turbo-charges our ability to generate graphs, it doesn’t increase our capacity to communicate with each other or to understand and manage the complexity of our systems.

Much as a complex system can fail despite components working as intended, metric surrogation doesn’t require ill intent. SLOs are challenging to do right, especially when everything is on fire — which is precisely when most reliability initiatives are launched!

Perhaps an elephant in the room here is that the software organizations inside which reliability practitioners sit are, almost by definition, themselves high modernist. The success of our industry is built on rapid, disruptive growth made possible by working with bits rather than atoms. Automation is the currency of our realm; promotions and status come from releasing new products. The downside is that this magic falls apart when things go wrong. During incidents, abstractions which allowed for rapid development through composition become blobs of implementation detail where expertise is suddenly necessary. For example, while load balancing and service discovery are powerful abstractions that are often critical to developer productivity in modern systems, implementation bugs can be complex and painful — as Slack found out in May of 2020, when just that sort of bug caused a forty-five minute site-wide outage [9].

Discussing her experience working on Chaos Engineering at Netflix [10], Nora Jones says that “the most insightful part of doing this journey was not actually coming up with this automation, or coming up with these experiments, it was actually designing it. It was working with the teams. It was getting them to talk about their mental models.” SLOs succeed when they remain rooted in the ground truth teams have about the health of services they own and operate, and when they are seen by those teams as an asset, rather than a tax imposed by leadership. An exercise for the reader (or a future article) is to consider more generally the difference in contributing factors to the success — or failure — of reliability initiatives when they are driven from below vs above.

Beyond the platforms and automation we use for computing SLOs, beyond even choosing a contextually appropriate organizational design for how SRE is structured or embedded [11], I suggest you consider what will happen when SLOs are missed. It is likely that you already have an ad-hoc SLO for your most critical services: a general sentiment that they are not on fire! If you have a smooth process for responding to degradations (e.g., shifting resources or priorities) but feel like sizing the fire is ambiguous (one alarm? five alarms?), then SLOs may be just the tool you’re looking for.
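One common way to size the fire, sketched here under stated assumptions rather than as a prescription, is to express a degradation as the fraction of an error budget it has consumed. The error-budget framing, the function names, and especially the alarm thresholds below are illustrative choices, not something this column or any standard defines.

```python
def error_budget_consumed(slo_target, good_events, total_events):
    """Return the fraction of the error budget used in a window.

    slo_target is the objective as a fraction, e.g. 0.999 for "three nines".
    A result of 1.0 means the budget is spent; above 1.0 the SLO is missed.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 0.0  # assumption: no traffic consumes no budget
    allowed_bad = (1 - slo_target) * total_events
    actual_bad = total_events - good_events
    return actual_bad / allowed_bad


def alarm_level(consumed):
    """Map budget consumption onto a hypothetical one-to-five alarm scale."""
    thresholds = [0.25, 0.5, 1.0, 2.0]  # illustrative cut points only
    return 1 + sum(consumed >= t for t in thresholds)


# A 99.9% objective over one million requests, 600 of which were bad.
consumed = error_budget_consumed(0.999, 999_400, 1_000_000)
print(f"budget consumed: {consumed:.0%}, alarm level: {alarm_level(consumed)}")
```

With these numbers roughly 60% of the budget is gone, which the hypothetical scale calls a three-alarm fire. The arithmetic is the easy part; where the thresholds sit and what each level triggers are decisions the stakeholders still have to make together.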

On the other hand, introducing SLOs to an organization whose current reaction to clear reliability misses looks more like acrimony, stress, or heroics runs a greater risk of turning into high modernist reliability theater. Consider applying James C. Scott’s suggestions to government planners to your software organization:

Take small steps
Favor reversibility
Plan on surprise
Plan on human inventiveness

Finally, remember Scott’s maxim: “no Taylorist factory (web service) can sustain production (nines of availability) without the unplanned improvisations of an experienced workforce.”

Appendix

References:

[1] Laura Nolan, 'Seeing Like an SRE: Site Reliability Engineering as High Modernism', ;login: (USENIX Association, April 2021). https://www.usenix.org/publications/loginonline/seeing-sre-site-reliabil...

[2] James C. Scott, Seeing Like a State (Veritas, 2012).

[3] Techne is universal knowledge: things like the boiling point of water, Pythagoras’ theorem, or that we should alert if no instances of our jobs are running. Metis, on the other hand, is local, specific, and practical, and can’t be easily codified.

[4] Alex Hidalgo, Implementing Service Level Objectives (O’Reilly Media, 2020).

[5] B. Beyer, N. R. Murphy, D. K. Rensin, K. Kawahara, and S. Thorne, eds., The Site Reliability Workbook: Practical Ways to Implement SRE (O’Reilly Media, 2018).

[6] Peter Drucker, The Practice of Management (Harper Books, 1954).

[7] Andy Grove, High Output Management (Vintage, 1995).

[8] Christina Wodtke, Introduction to OKRs (O’Reilly Media, 2016).

[9] Laura Nolan, 'A Terrible, Horrible, No-Good, Very Bad Day at Slack' (Slack Engineering Blog, 2020). https://slack.engineering/a-terrible-horrible-no-good-very-bad-day-at-sl...

[10] Nora Jones, 'Rethinking How the Industry Approaches Chaos Engineering' (InfoQ, 2019). https://www.infoq.com/presentations/rethinking-chaos-engineering/

[11] Gustavo Franco and Matt Brown, 'How SRE teams are organized, and how to get started' (Google Cloud Blog, 2019). https://cloud.google.com/blog/products/devops-sre/how-sre-teams-are-orga...

Article Categories:

SRE

Last updated February 8, 2023

Authors:

Jacob Scott is a software engineer currently focusing on reliability at Stripe. He is an enthusiastic participant in the resilience engineering community and curious about the application of learnings in modern safety science to real, concrete complex socio-technical software systems.
Find him as @jhscott on Twitter.
