The MTTR Chronicles: Evolution of SRE Self Service Operations Platform

Friday, June 14, 2019 - 10:00 am–10:30 am

Jason Wik, Jayan Kuttagupthan, and Shubham Patil, VMware

Abstract: 

Running a cloud platform reliably comes with its own set of challenges. Whether the solution is built in-house or bought off the shelf, when incident management is spread across integrated systems it is easy to lose sight of latent issues. Impact assessment, communication, and coordination add to the mix.

Being an SRE is a tough job, since the SRE is expected to know almost all aspects of software delivery. The life of an SRE becomes easier, and the SRE more empowered, when equipped with the right set of tools.

Join us at our talk and we'll walk you through our experience of building an SRE Operations Platform for VMware Managed Cloud on AWS. We'll talk about how we have combined Automation, Monitoring, and Incident Management under a single umbrella to drive down MTTR, increase productivity, communicate fleet-wide health effectively to incident managers and customer success engineers, and, above all, reduce the toil of an SRE.

Jason Wik, VMware

I have been focused on service reliability and operating services at scale for 20+ years. My experience has been shaped by the challenges of engineering and supporting many large global services. I am the Director of the VMC SRE teams for the VMware Managed Cloud on AWS service. Defining and measuring service health is one of my areas of passion.

Jayan Kuttagupthan, VMware

I have been in software development for 9 years, contributing to both backend and front-end development. I am an SRE for VMware Managed Cloud on AWS (VMC on AWS), contributing to the development of Automation & Reporting platforms. My current interest is exploring how ML, Deep Learning, and AI can contribute to the SRE arena.

Shubham Patil, VMware

I currently work on problems ranging from Service Health Measurement to developing Scalable Automation Platforms for VMware Managed Cloud. In the past, I have worked on VMware's ESXi Kernel to optimize schedulers and memory management in the hypervisor. In my spare time, I like playing around with Distributed Systems Design and Artificial Intelligence problems.


BibTeX
@conference {233249,
author = {Jason Wik and Jayan Kuttagupthan and Shubham Patil},
title = {The {MTTR} Chronicles: Evolution of {SRE} Self Service Operations Platform},
year = {2019},
address = {Singapore},
publisher = {{USENIX} Association},
month = jun,
}
