Experiments for SRE

Note: Presentation times are in Coordinated Universal Time (UTC).

Thursday, 14 October, 2021 - 15:30–16:00

Debbie Ma, Google LLC


Incident management for complex services can be overwhelming. SREs can use experiments to attribute and mitigate production changes that contribute to an outage. When production changes are guarded by experiments, SREs can also reduce a (potential) outage's impact by halting further experiment ramp-up if the production change is associated with unhealthy metrics. Beyond incident management, SREs can use experiments to ensure that reliable changes are introduced to production.

Debbie Ma, Google

Debbie is a Site Reliability Engineer/Software Engineer (SRE/SWE) at Google focusing on improving the reliability of experiment infrastructure. Debbie initially worked on experiments for mobile devices but expanded her work area to include server experiments as well. Her current work interest is ensuring SREs and SWEs can easily develop and introduce new features into production safely.

SREcon21 Open Access Sponsored by Indeed

@conference {276701,
author = {Debbie Ma},
title = {Experiments for {SRE}},
year = {2021},
publisher = {USENIX Association},
month = oct
}
