SLX: An Extended SLO Framework to Expedite Incident Recovery

Note: Presentation times are in Coordinated Universal Time (UTC).

Wednesday, October 13, 2021 - 03:30–04:00

Qian Ding and Xuan Zhang, Ant Group


This talk is based on a real journey of establishing SLOs for an infrastructure SRE team whose availability target is higher than 99.999%. First, we describe our process for defining SLOs and show the gaps between expectations and reality when using SLOs with dev teams. Second, we present the design of a unified SLO framework (SLX) that helps SREs manage hundreds of SLOs. For example, beyond using SLO data for basic alerting and weekly reporting, we combine the SLO framework with statistical anomaly detection algorithms to locate pitfalls automatically. To achieve that, we introduce several new concepts, such as Service-Level-Factor (SLF) and Service-Level-Dependency (SLD), and use them to build SLO knowledge graphs across multiple infrastructure systems. Finally, we present our intent-driven SLX implementation, inspired by the Kubernetes design and the GitOps paradigm.

Qian Ding, Ant Group

Qian works at Ant Group as a staff engineer focusing on site reliability engineering. He is the SRE tech lead for adopting Kubernetes in Ant Group's production environment. He is passionate about adopting and promoting the SRE philosophy for managing large-scale production systems. His current interests include designing SLOs for Kubernetes from the end user's perspective, as well as using SLOs to drive reliability feature development for Kubernetes.

Xuan Zhang, Ant Group

Xuan Zhang works at Ant Group as an SRE. He is a full-stack engineer with a passion for coding and building all kinds of systems. He has been focusing on automation and, with his out-of-the-box thinking, has pushed the boundaries of automating processes that were deemed implausible.

