Chaos Patterns—Architecting for Failure in Distributed Systems

Jos Boumans, Krux

As we architect our systems for greater demands, scale, uptime, and performance, the hardest things to control become the environment in which we deploy and the subtle but crucial interactions between complicated systems. Chaos Patterns help us establish and implement a virtuous cycle that lets us both prove and improve our system along each of these dimensions before the inevitable happens.

While deliberately inducing failure may seem reckless or counterintuitive, our experience has shown that it is a matter of how and when, not if, we will learn about the limitations and failure modes of the system.

This is the story of the pitfalls we encountered, and how, through architecture, convention, and common sense, we managed to build an infrastructure that is "Always Up" from the end-user perspective and incredibly economical to build, scale, and operate. Using chaos testing, we learn more about how our system fails from a 10-second controlled failure than from a multi-hour uncontrolled outage.

In this session we will cover various implementation techniques, available to any developer and operator, that will vastly increase the resilience of your systems and provide a superior end-user experience, from optimizing your use of DNS for failure, to configuring your CDN to have your back, to synthetic responses and expected database outages.

BibTeX

@conference {208666,
author = {Jos Boumans},
title = {Chaos {Patterns{\textemdash}Architecting} for Failure in Distributed Systems},
year = {2015},
address = {Washington, D.C.},
publisher = {USENIX Association},
month = nov
}
