Reimagining Correctness SLOs: When 100% Means Failure

A statistical approach to building business-centric SLOs

September 5, 2023

Opinion

Authors:

Adam Newman

Article shepherded by:

Laura Nolan

As new industries adopt SRE, product ownership may shift from IT-savvy owners toward more business-oriented ones. That shift brings different, but still effective, definitions of success. Getting business-focused product owners on board with SRE principles often produces a very different answer to “what does happiness mean to your customer?”, which in turn leads to SLI and SLO patterns that aren’t solely focused on traditional availability measures. This article showcases one such pattern and a creative way to solve for it.

The gambler's fallacy is a simple yet powerful cognitive bias: the belief that if a coin is flipped and lands on heads nine times in a row, the tenth flip is more likely to be heads — or, from the eternal pessimist's viewpoint, tails is overdue and therefore more probable.

Of course, the fallacy is that — barring a double-headed coin — past results do not alter the mathematical probabilities of future outcomes.

Now consider this: someone creates a coin-flipping program which, after an astronomical number of flips — let's say 100 trillion — produces heads 99% of the time. At this point, is it still a gambler's fallacy to predict a high likelihood of the next flip landing on heads? Clearly, this is not a case of gambler's fallacy; there is something significantly wrong with the program.

This scenario introduces a fascinating challenge: How would we, as SRE practitioners, define and then structure a Correctness SLO for this coin-flipping scenario such that we would breach said SLO if our program returned heads 99% of the time?

An approach for defining Correctness SLOs

First, a pair of SLIs is introduced where healthy is not 100%. In our coin-flipping example, the SLIs would be “number of flips that land heads” and “number of total flips”, and ideal health would be 50% of the flips landing heads.

Then, an SLO is built on top of those SLIs which would breach if the percentage deviated in either direction. In our coin-flipping example, this could be something simple like a 2% deviation – the SLO breaches if the heads rate is less than 48% or greater than 52%.

Use Cases

While the coin flip example might seem overly simplistic, it neatly illustrates a principle that holds true across a wide range of real-world scenarios. Here are a few to whet the appetite.

Online Gaming Balance

Balance is essential to the integrity and enjoyment of online competitive games, particularly those that feature diverse playable characters. A Correctness SLO in this context could focus on maintaining a near 50% win rate for each character. If a particular character's win rate consistently exceeds or falls below this mark, it may indicate an issue with the game's balance — perhaps that character is too powerful, too weak, or has some other advantage or disadvantage that's distorting the overall game balance. The aim isn't necessarily to achieve an exact 50% win rate for each character, but rather to avoid any significant, persistent deviations that might suggest a problem with the game's design.

Risk-Assessment Approval Rates

Credit card companies usually aim for a balance between approvals and rejections. If the approval rate of automated applications drops significantly, it's an indication that the risk assessment system may not be working correctly. On the other hand, if the approval rate skyrockets to 100%, it might ironically suggest the very same thing. An SLO here might aim to maintain an approval rate that is sustainable and profitable, and an SLO breach would be triggered if the actual rate diverges significantly from that target in either direction over time.

Recommendation Systems

Recommendation systems aim to provide users with content they will find engaging and relevant. One metric for the effectiveness of such systems is the "click-through rate" (CTR) — the percentage of recommendations that users click on. Over time, a well-tuned recommendation system might expect a fairly steady CTR, reflective of the system's understanding of its users' preferences.

A correctness SLO for this scenario could be designed around maintaining this expected click-through rate. If the CTR significantly increases or decreases, it could signal a potential issue. A sudden drop might indicate the system is suggesting less relevant content, while an unexpected surge could mean the system returned a too-good-to-be-true offer that would destroy the profitability of a business.

Compare and Contrast: The Infamous Global Chubby Planned Outage

These scenarios may seem similar to Google's discussion of Chubby in the Service Level Objectives chapter of the SRE book [1]. Chubby, known to schedule planned outages if its uptime exceeds its SLO, seeks to avoid fostering single-point-of-failure dependencies. The concept Chubby brings is very similar to the concepts explained here: if an SLI is significantly over its target, then “do something.” The critical distinction between the two situations, however, is the purpose of the “doing something.” With Chubby, the goal is to ensure clients are not overreliant on the system, whereas in the situations described above, a high-percentage SLI actually means failure and the error budget policy is enacted.

Constructing Correctness SLIs: The Bane of SRE Existence?

Product owners have always desired Correctness SLOs as a complement to the more popular and easier to implement Availability and Latency SLOs. In some cases, a well-defined Correctness SLO may bring more value than Availability or Latency SLOs, as incorrectness can lead to financial, reputational, or even legal ramifications.

However, it can be difficult to build suitable Correctness SLIs. The central challenge is that we cannot normally measure individual transaction correctness. In our coin-flipping example, 'heads' and 'tails' are both valid outcomes, making it impossible to assess the correctness of a single flip.

Correctness evaluation does become viable in aggregate, however. As the sample grows large enough to be statistically meaningful, the results should converge on our expected outcome (a 50/50 split between heads and tails in our coin-flip scenario) within an acceptable deviation range.
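To make the "in aggregate" point concrete, here is a minimal Python sketch of how the expected range of the heads rate narrows as the number of flips grows. It assumes independent flips and uses the normal approximation to the binomial distribution; the function name `expected_band` and the choice of three standard errors are illustrative assumptions, not part of the approach described in this article.

```python
import math


def expected_band(p: float, n: int, z: float = 3.0) -> tuple[float, float]:
    """Return the (low, high) range in which the observed rate should fall.

    Assumes n independent trials with true probability p and uses the
    normal approximation; z is the number of standard errors (assumed 3).
    """
    if n <= 0:
        raise ValueError("n must be a positive number of trials")
    standard_error = math.sqrt(p * (1.0 - p) / n)
    return (p - z * standard_error, p + z * standard_error)


# Show how the band around 50% tightens as the sample grows.
for flips in (100, 10_000, 1_000_000):
    low, high = expected_band(0.5, flips)
    print(f"{flips:>9} flips: {low:.4f} .. {high:.4f}")
```

Under these assumptions, a fair coin flipped 100 times can plausibly land anywhere between roughly 35% and 65% heads, so a 2% tolerance would be dominated by noise. At 10,000 flips the band shrinks to roughly 48.5% to 51.5%, which is where a tight deviation target starts to carry meaning.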

The two SLIs we would use for this scenario are:

Count of coin flips that land heads
Total number of coin flips

Embracing this approach entails accepting three important concepts:

SLIs without clear pass/fail conditions: Standard SLIs typically let us establish a cut-off line; any API request that returns a 500-series response or takes longer than 3 seconds is a failure. However, in our coin-flip scenario, a 'heads' flip isn't inherently good or bad; it gains significance only in relation to the entire data set.
An ideal value less than 100%: This is evident when considering the above SLIs, where a 50% rate is the ideal target for the percentage of 'heads' flips compared to total flips.
Deviation from the ideal value is negative, regardless of direction: In our coin-flip scenario, a 45% or 55% heads rate is equally sub-optimal, as is a 1% or 99% heads rate.

Once these concepts are firmly in hand, how to construct the SLIs becomes apparent and, in many cases, very simple.

Establishing Correctness SLOs

Once we establish feasible correctness SLIs, the next step is to slightly modify the standard SLO process.

A product owner can specify:

Ideal success rate for 'heads' in a coin flip is 50%
Deviations beyond +/- 2% breach the SLO

This would trigger the error budget policy over a predetermined time period (e.g., 30 days) if:

Heads appear < 48% of the time
Heads appear > 52% of the time

The typical SLO breach formula will need slight modifications:

Introduce IDEALTARGET: The optimal target percentage (usually 100% for standard SLOs)
Define SLODIFF: The acceptable deviation from IDEALTARGET before SLO breach
Use an absolute value function to handle deviation, positive or negative, from IDEALTARGET

Expr: (
job:slo_errors_per_request_ratio_rate30d{job="myjob"} > (1 - SLO)
)

An example of a standard SLO formula for unavailability over a trailing 30 day period (PromQL syntax).

Expr: (
abs(IDEALTARGET - job:slo_errors_per_request_ratio_rate30d{job="myjob"}) > SLODIFF
)

An example of a potential correctness SLO measuring deviation from an ideal target over a trailing 30 day period (PromQL syntax).

Consider these scenarios with our coin flip example:

IDEALTARGET 50%, SLODIFF 2%, heads flip rate 47%: Abs(0.50 – 0.47) > 0.02 = True, SLO BREACHED
IDEALTARGET 50%, SLODIFF 2%, heads flip rate 54%: Abs(0.50 – 0.54) > 0.02 = True, SLO BREACHED
IDEALTARGET 50%, SLODIFF 2%, heads flip rate 51%: Abs(0.50 – 0.51) > 0.02 = False, SLO NOT BREACHED

This formula works with standard SLOs too:

IDEALTARGET 100%, SLODIFF 2% (i.e., 98% SLO), success rate 97%: Abs(1 – 0.97) > 0.02 = True, SLO BREACHED
One humorous caveat: if somehow the data was corrupted and showed a pass rate > 102%, the SLO would technically be BREACHED.
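The breach check above can be sketched in a few lines of Python. This is an illustration only: the function name `slo_breached` and its parameters are hypothetical, and it uses exact fractions on raw counts (an assumption of this sketch) so that a rate sitting exactly on the boundary, such as 48%, is not tipped over by floating-point rounding.

```python
from fractions import Fraction


def slo_breached(
    matching: int,
    total: int,
    ideal_target: Fraction,
    slo_diff: Fraction,
) -> bool:
    """Return True when the observed rate deviates from ideal_target
    by more than slo_diff, in either direction."""
    if total <= 0:
        raise ValueError("total must be a positive event count")
    observed = Fraction(matching, total)
    return abs(ideal_target - observed) > slo_diff


HALF = Fraction(1, 2)
TWO_PERCENT = Fraction(2, 100)

# The coin-flip scenarios listed above, expressed as counts out of 100.
print(slo_breached(47, 100, HALF, TWO_PERCENT))  # 47% heads
print(slo_breached(54, 100, HALF, TWO_PERCENT))  # 54% heads
print(slo_breached(51, 100, HALF, TWO_PERCENT))  # 51% heads

# A standard 98% SLO: ideal target of 100% with a 2% allowed deviation.
print(slo_breached(97, 100, Fraction(1), TWO_PERCENT))  # 97% success
```

The same function covers both the two-sided correctness case and the conventional one-sided case, which is the point of introducing IDEALTARGET and SLODIFF as separate values.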

Adapting Burn Rate Alerting

Burn rate alerting can mirror our SLO calculations: apply the usual burn rate multipliers to SLODIFF and keep the multiwindow pattern of pairing each long window with a short one.

Expr: (
(
abs(IDEALTARGET - job:slo_errors_per_request_ratio_rate1h{job="myjob"}) > (14.4*SLODIFF)
and
abs(IDEALTARGET - job:slo_errors_per_request_ratio_rate5m{job="myjob"}) > (14.4*SLODIFF)
)
or
(
abs(IDEALTARGET - job:slo_errors_per_request_ratio_rate6h{job="myjob"}) > (6*SLODIFF)
and
abs(IDEALTARGET - job:slo_errors_per_request_ratio_rate30m{job="myjob"}) > (6*SLODIFF)
)
)

Expr: (
(
abs(IDEALTARGET - job:slo_errors_per_request_ratio_rate24h{job="myjob"}) > (3*SLODIFF)
and
abs(IDEALTARGET - job:slo_errors_per_request_ratio_rate2h{job="myjob"}) > (3*SLODIFF)
)
or
(
abs(IDEALTARGET - job:slo_errors_per_request_ratio_rate3d{job="myjob"}) > SLODIFF
and
abs(IDEALTARGET - job:slo_errors_per_request_ratio_rate6h{job="myjob"}) > SLODIFF
)
)

Short-term burn rate alerting and multiwindow pattern (PromQL syntax).
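For readers who want to reason about these expressions outside of a monitoring system, the following Python sketch evaluates the same multiwindow logic against a dictionary of observed ratios keyed by window. The multipliers and window pairs are taken from the expressions above; the names `FAST_RULES`, `SLOW_RULES`, and `should_alert`, and the idea of passing ratios as a plain dictionary, are assumptions made for illustration.

```python
# (multiplier, long window, short window), as in the first expression above.
FAST_RULES = [(14.4, "1h", "5m"), (6.0, "6h", "30m")]
# (multiplier, long window, short window), as in the second expression above.
SLOW_RULES = [(3.0, "24h", "2h"), (1.0, "3d", "6h")]


def deviation_exceeds(
    ratio: float, ideal_target: float, slo_diff: float, factor: float
) -> bool:
    """True when the ratio is further from ideal_target than factor * slo_diff."""
    return abs(ideal_target - ratio) > factor * slo_diff


def should_alert(
    rules: list[tuple[float, str, str]],
    ratios: dict[str, float],
    ideal_target: float,
    slo_diff: float,
) -> bool:
    """True when, for any rule, both the long and short windows deviate."""
    return any(
        deviation_exceeds(ratios[long_w], ideal_target, slo_diff, factor)
        and deviation_exceeds(ratios[short_w], ideal_target, slo_diff, factor)
        for factor, long_w, short_w in rules
    )


WINDOWS = ("5m", "30m", "1h", "2h", "6h", "24h", "3d")

# A program stuck on heads: every window reports a 99% heads rate.
stuck = {window: 0.99 for window in WINDOWS}
print(should_alert(FAST_RULES, stuck, ideal_target=0.5, slo_diff=0.02))

# A milder, sustained drift to 57% heads across every window.
drift = {window: 0.57 for window in WINDOWS}
print(should_alert(FAST_RULES, drift, ideal_target=0.5, slo_diff=0.02))
print(should_alert(SLOW_RULES, drift, ideal_target=0.5, slo_diff=0.02))
```

One consequence worth checking when choosing values: with an IDEALTARGET of 50%, the deviation can never exceed 50 percentage points, so the 14.4 multiplier can only fire when SLODIFF is below roughly 3.5%.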

What Next? How to Use Correctness SLOs

Once the SLIs are fleshed out and the SLOs are established with the burn rate alerting described above, they can be integrated just like any other SLO. Execute the error budget policy if the SLO breaches. Act on burn rate alerts to notify an engineer of variance. Use the SLO in canary deployments to gauge whether a deployment succeeded or, should the SLO signal failure, to trigger a reversion to the stable version of the system.

One caveat is that the playbooks for resolving correctness issues look significantly different from others, because a correctness failure typically stems from different kinds of problems than an unavailable or slow application does. For instance, an entropy issue might be at play in a coin-flipping program that deviates significantly from a 50% heads ratio. A playbook deserves careful thought about what should be investigated when a Correctness SLO burn rate alert fires.

Another point to consider is that, in many of these cases, Correctness SLO breaches are triggered by the business itself. Using our earlier example of risk-assessment approval rates in the credit card industry, a business partner might accidentally delete a subset of rules, causing too many or too few approvals.

In these scenarios, we can treat a business rule change the same way we might treat a firewall policy push that broke connectivity: as part of the incident review, ask whether the application can be better protected against such rule changes in the future. Potential safety mechanisms include automated integration testing or bulk deletion protection that prevents more than one rule from being deleted without a secondary confirmation.
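As one possible shape for the bulk deletion protection mentioned above, here is a small Python sketch. Everything in it is hypothetical: the `BulkDeletionError` exception, the `apply_rule_deletions` function, the rule names, and the representation of rules as a set of names are assumptions for illustration, not a description of any real rules engine.

```python
class BulkDeletionError(Exception):
    """Raised when too many rules would be deleted without confirmation."""


def apply_rule_deletions(
    rules: set[str],
    to_delete: set[str],
    confirmed: bool = False,
    max_unconfirmed: int = 1,
) -> set[str]:
    """Return the rule set after deletion, refusing unconfirmed bulk deletes."""
    unknown = to_delete - rules
    if unknown:
        raise KeyError(f"unknown rules: {sorted(unknown)}")
    if len(to_delete) > max_unconfirmed and not confirmed:
        raise BulkDeletionError(
            f"deleting {len(to_delete)} rules requires a secondary confirmation"
        )
    return rules - to_delete


# Hypothetical rule names for a risk-assessment system.
rules = {"min_credit_score", "max_debt_ratio", "fraud_flag"}

# A single deletion goes through without confirmation.
rules_after = apply_rule_deletions(rules, {"fraud_flag"})

# Deleting two rules at once is blocked until someone confirms it.
try:
    apply_rule_deletions(rules, {"min_credit_score", "max_debt_ratio"})
except BulkDeletionError as error:
    print(f"blocked: {error}")
```

A guard like this does not replace the Correctness SLO; it narrows one way the approval rate can swing suddenly, while the SLO and its alerts catch the swings that still get through.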

Conclusion

This article advocates for Correctness SLOs and provides a blueprint for creating and monitoring them. These Correctness SLOs not only align with the objective of SRE, improving system reliability, but also foster stronger collaboration between business and IT with metrics that speak their language. SRE has always been focused on improving user experience: it is time to challenge convention and devise innovative methods to construct more meaningful, customer-centric SLOs.

Appendix

References:

[1] Chris Jones, John Wilkes, Niall Murphy, and Cody Smith, 'Service Level Objectives', in Betsy Beyer, Chris Jones, Jennifer Petoff and Niall Murphy (eds.) Site Reliability Engineering: How Google Runs Production Systems (2016, O'Reilly). https://sre.google/sre-book/service-level-objectives/

Article Categories:

SRE

Culture

Last updated September 6, 2023

Authors:

Adam Newman is one of the founding members of the Site Reliability Engineering organization at USAA. Before becoming a part of Site Reliability Engineering, Adam was deeply involved in influencing the Remote Deposit Capture industry and holds numerous patents in the space. Adam received his bachelor's degree in Computer Science from the University of Texas at San Antonio.
