Challenges of Starting an SRE Team from Scratch in an Enterprise

Tuesday, December 08, 2020 - 2:55 pm–3:35 pm

Pauline Narvas, Wayne Bridgman, Graeme Bye, Amreen Firdouse, Anand Bobade, Shiv Patil, BT

Abstract: 

Implementing Site Reliability Engineering principles and values, and building an SRE team, at a large enterprise has proven to be quite challenging. It turns out that creating an SRE team is much more complex than just copying Google or renaming your Ops team to SRE.

Rather than copying what Google has done, we began our SRE journey by first identifying the issues that mattered most to the business. For us, these were security, reducing cloud sprawl, and getting our cloud costs under control. Off the back of this, we established our own standards.

This will be a structured talk in which we share the journey of building our SRE team, the main challenges that we've faced in a large enterprise, some reflections on what we learned along the way, and advice for newly formed SRE teams.

Pauline Narvas, BT

Pauline is a Site Reliability Engineer at BT, where she is part of the newly formed team bringing SRE values to life within the organization. She's also a Women in Tech advocate and blogger, and enjoys weight training.

Wayne Bridgman, BT

Wayne is a Principal DevOps and Site Reliability Engineer consultant at BT. He has over 15 years of experience in leading strategic cloud architecture transformations and enterprise-wide DevOps and CI/CD initiatives in the Finance, Insurance, and Telecommunications sectors.

Graeme Bye, BT

Graeme is a DevOps Engineering Manager at BT. Having worked in IT across the financial and telecom sectors for over 20 years, leading engineering teams across all stages of the software development lifecycle, he now oversees environment governance in Digital Engineering. His latest challenge? Leading the progressive development of a new SRE team from inception.

Amreen Firdouse, BT

Amreen is a Site Reliability Engineer at THBS with over 5 years of experience in Cloud, Operations, and DevOps. She began her career as a Technical Consultant (AWS and Azure) at a start-up, then joined the Production Operations Team and later Platform Services for EE/BT, handling security, monitoring, and incident management, and learning infrastructure as code. Amreen is now part of the SRE team.

Anand Bobade, BT

Anand has 7+ years of experience in designing, implementing, and managing cloud infrastructure and containers. He has worked as a NOC engineer and has experience in virtualization, orchestration, AWS cloud, and IaaS and PaaS in the medial, US government, and telecom sectors. He is now a Site Reliability Engineer at BT.

Shiv Patil, BT

Shiv has over 4.5 years of experience in managing AWS infrastructure and automation. He enjoys writing scripts in Python and finding new ways to improve the performance of systems. He is currently a Site Reliability Engineer at BT.

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone.

BibTeX
@inproceedings {262194,
author = {Pauline Narvas and Wayne Bridgman and Graeme Bye and Amreen Firdouse and Anand Bobade and Shiv Patil},
title = {Challenges of Starting an {SRE} Team from Scratch in an Enterprise},
booktitle = {SREcon20 Americas (SREcon20 Americas)},
year = {2020},
url = {https://www.usenix.org/conference/srecon20americas/presentation/narvas},
publisher = {{USENIX} Association},
month = dec,
}
