SLX: An Extended SLO Framework to Expedite Incident Recovery

Qian Ding; Xuan Zhang

Wednesday, 13 October, 2021 - 03:30–04:00

Qian Ding and Xuan Zhang, Ant Group

This talk is based on a real journey of establishing SLOs for an infrastructure SRE team whose availability target is higher than 99.999%. First, we describe our process for defining SLOs and show the gaps between expectations and reality when using SLOs with dev teams. Second, we present the design of a unified SLO framework (SLX) that helps SREs manage hundreds of SLOs. For example, beyond using SLO data for basic alerting and weekly reporting, we combine the SLO framework with statistical anomaly detection algorithms to locate problems automatically. To achieve that, we introduce several new concepts, such as Service-Level-Factor (SLF) and Service-Level-Dependency (SLD), and use them to build SLO knowledge graphs across multiple infrastructure systems. Finally, we present our intent-driven SLX implementation, inspired by the Kubernetes design and the GitOps paradigm.
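To give a sense of scale for the availability target mentioned above, the following minimal sketch computes the downtime budget implied by a 99.999% SLO. It is plain arithmetic, not part of SLX; the function name `error_budget_seconds` and the 30-day and 365-day windows are assumptions chosen for illustration.

```python
def error_budget_seconds(slo: float, window_days: int) -> float:
    """Allowed downtime, in seconds, for an availability SLO over a window.

    Hypothetical helper for illustration; not taken from the talk.
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in (0, 1]")
    window_seconds = window_days * 24 * 60 * 60
    # The error budget is the fraction of the window that may be unavailable.
    return (1.0 - slo) * window_seconds


budget_30d = error_budget_seconds(0.99999, 30)
budget_365d = error_budget_seconds(0.99999, 365)
print(f"{budget_30d:.2f} s per 30 days")
print(f"{budget_365d / 60:.2f} min per 365 days")
```

At 99.999%, the budget is about 26 seconds per 30 days, or a little over 5 minutes per year, which leaves almost no time for manual diagnosis during an incident.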

Qian works at Ant Group as a staff engineer focusing on site reliability engineering. He is the SRE tech lead for adopting Kubernetes in Ant Group's production environment. He is passionate about adopting and promoting the SRE philosophy for managing large-scale production systems. His current interests include designing SLOs for Kubernetes from the end user's perspective, as well as using SLOs to drive reliability feature development for Kubernetes.

Xuan Zhang works at Ant Group as an SRE. He is a full-stack engineer with a passion for coding and building all kinds of systems. He has been focusing on automation and, with his out-of-the-box thinking, has pushed the boundaries of automating processes that were deemed implausible.

BibTeX

@conference {276721,
author = {Qian Ding and Xuan Zhang},
title = {{SLX}: An Extended {SLO} Framework to Expedite Incident Recovery},
year = {2021},
publisher = {USENIX Association},
month = oct
}
