10 Lessons Learned in 10 Years of SRE

Note: Presentation times are in Coordinated Universal Time (UTC).

Tuesday, 12 October, 2021 - 15:00-15:30

Andrea Spadaccini, Microsoft Azure

Abstract: 

In this talk we'll discuss some key principles and lessons learned that I've developed and refined in more than 10 years of experience as a Site Reliability Engineer across several teams within Google and Microsoft.

These are topics that often come up as I discuss Site Reliability Engineering with Microsoft customers who are at different stages of their own SRE journey, and that they (hopefully!) find insightful. They broadly belong to the areas of "Starting SRE" and "Steady-state SRE."

Please join us if you want to discuss fundamental principles of adopting SRE, want to listen to my mistakes (so you can avoid making them!), and want to compare notes on different ways of doing SRE.

Andrea Spadaccini, Microsoft Azure

Andrea is a Principal Software Engineer in SRE at Microsoft Azure, where he currently acts as a tech lead for all the Azure SRE teams. He works on cross-team projects (currently, SLOs for all Azure products), while being on-call for Azure Resource Manager—the entry point for most Control Plane operations in Azure.

He joined Microsoft in 2018. Before that, he had been a Site Reliability Engineer at Google since 2011, in various technical and management roles across teams in CorpEng, Ads, and GCP. He's been lucky enough to contribute to the first and second SRE books, mostly to the chapters about on-call.

He received his Ph.D. in Computer Engineering from the University of Catania in 2012, with a thesis on novel traits for biometric recognition, and he's the maintainer of the free CPU simulator EduMIPS64.


BibTeX
@conference {276643,
author = {Andrea Spadaccini},
title = {10 Lessons Learned in 10 Years of {SRE}},
year = {2021},
publisher = {USENIX Association},
month = oct
}
