Trustworthy Graceful Degradation: Fault Tolerance across Service Boundaries

Note: Presentation times are in Coordinated Universal Time (UTC).

Wednesday, 13 October, 2021 - 04:30–05:00

Daniel Rodgers-Pryor, Stile Education


Does your service gracefully degrade when a database, cache, or remote service becomes unavailable? Are you sure?

Graceful failure when downstream services are overloaded or down is critical to maintaining availability in complex systems, but because each of these failures is (hopefully!) rare, it can be hard to really trust that your systems will respond as expected in an emergency.

Come along to learn: how easy it is to mess this up (repeatedly), what it takes to maintain a trustworthy system, and a set of simple approaches for testing system failures in both CI and production.

Daniel's academic background in Physics and Computer Science gave him a passion for promoting scientific literacy and an interest in managing complex computing systems. Since joining Stile as a junior engineer in 2014, Daniel has spent his time developing features, optimizing processes, and battling fires. As CTO since 2017, Daniel spearheads the technical and organizational challenges of delivering exciting interactive science lessons to more than 1 in 3 Australian high school students, while expanding internationally.

SREcon21 Open Access Sponsored by Indeed

@conference {276731,
author = {Daniel Rodgers-Pryor},
title = {Trustworthy Graceful Degradation: Fault Tolerance across Service Boundaries},
year = {2021},
publisher = {USENIX Association},
month = oct
}
