Reliability When Everything Is a Platform: Why You Need to SRE Your Customers

Tuesday, March 14, 2017 - 5:15pm-6:10pm

Dave Rensin, Google


The general trend in software over the last several years is to give every system an API and turn every product into a platform. When these systems served only end users, their reliability depended solely on how well we did our jobs as SREs. Increasingly, however, our customers' perceptions of our reliability are being driven by the quality of the software they bring to our platforms. The normal boundaries between our platforms and our customers are being blurred, and it's getting harder to deliver a consistent end-user reliability experience.

In this talk we'll discuss a provocative idea—that as SREs we should take joint operational responsibility and go on-call for the systems our customers build on our platforms. We'll discuss the specific technical and operational challenges in this approach and the results of an experiment we're running at Google to address this need.

Finally, we'll try to glimpse what these changes mean for the future of SRE as a discipline.

Dave Rensin, Google

Dave Rensin is a Google SRE Director leading Customer Reliability Engineering (CRE), a team of SREs pointed outward at customer production systems. Previously, he led Global Support for Google Cloud. As a longtime startup veteran, he has lived through an improbable number of "success disasters" and pathologically weird failure modes. Ask him how to secure a handheld computer by accidentally writing software to make it catch fire, why a potato chip can is a terrible companion on a North Sea oil derrick, or about the time he told Steve Jobs that the iPhone was "destined to fail."

@conference {202297,
author = {Dave Rensin},
title = {Reliability When Everything Is a Platform: Why You Need to {SRE} Your Customers},
year = {2017},
address = {San Francisco, CA},
publisher = {USENIX Association},
month = mar
}
