Björn Rabenstein, SoundCloud Ltd.
SoundCloud runs a complex microservice architecture to serve a great diversity of features to a large user base. All of this is done by a relatively small number of engineers, under constant pressure to innovate in the not exactly easy market of music streaming. While this might appear quite similar to the situation of many other startups, SoundCloud is a rather extreme example. As such, it is perfectly suited as a case study in how to tackle this tech-debt-prone situation.
About six years ago, with the microservice migration in full swing, site reliability became more and more problematic at SoundCloud. At about the same time, SoundCloud happened to employ a handful of ex-Google SREs. Naively, one might have expected they would simply wave their magic G-wands and make the site reliable again. However, simply copying Google-style SRE and applying it to an organization very different in scale and culture was doomed to fail. Studying the exact reasons for the failure and SoundCloud's subsequent mission to find their own implementation of SRE is a helpful exercise for many smaller organizations in a similarly challenging situation of sustainably running a diverse set of services.
Björn Rabenstein is a Production Engineer at SoundCloud and a Prometheus developer. Previously, Björn was a Site Reliability Engineer at Google and a number cruncher for science.
SREcon18 Europe/Middle East/Africa Open Access Videos