Paul Carleton, Stripe Inc.
This talk will cover a security-focused project that evolved into a chaos injection system.
The system is called “Lifespan Management,” and it enforces a lifespan on a cloud-hosted VM. After the lifespan expires, the host is terminated and a replacement is brought up. This has two benefits: it makes it easier to apply fixes for CVEs (a CVE comes out on day X, and we know hosts will age out by day Y), and it reduces the value of a compromised machine (“I’ve finally captured a host! It’s being shut down?? No!”).
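The lifespan rule described above can be sketched in a few lines. This is only an illustration, not the system presented in the talk: the `Host` type, the function names, and the 7-day lifespan are all hypothetical, and the actual termination and replacement calls to a cloud provider are left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Host:
    """Hypothetical record for a cloud-hosted VM."""
    host_id: str
    launched_at: datetime


def expired_hosts(hosts, max_lifespan, now):
    """Return the hosts whose age has reached max_lifespan."""
    return [h for h in hosts if now - h.launched_at >= max_lifespan]


def latest_patch_deadline(fix_available_at, max_lifespan):
    """Day Y for a fix that ships on day X.

    Assumes every host launched at or after fix_available_at carries
    the fix, so the last unpatched host ages out by this time.
    """
    return fix_available_at + max_lifespan


# Example with an assumed 7-day lifespan.
now = datetime(2018, 10, 1, tzinfo=timezone.utc)
lifespan = timedelta(days=7)
fleet = [
    Host("host-a", now - timedelta(days=8)),  # past its lifespan
    Host("host-b", now - timedelta(days=2)),  # still within its lifespan
]
to_replace = expired_hosts(fleet, lifespan, now)
deadline = latest_patch_deadline(now, lifespan)
```

Here `to_replace` contains only `host-a`, and `deadline` falls seven days after the fix becomes available, which is the "day Y" guarantee the abstract refers to.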
This seemed simple enough, but the complexity it uncovered made for a fun, year-long adventure in chaos engineering.
In this talk, I’ll cover the evolution of the system and some lessons we learned along the way, such as:
- Not all termination API calls are created equal
- Zero failing health checks does not mean a host is healthy
- Answering “Was this the chaos system?” quickly is essential
I’ll also include anecdotes, such as how it helped with Spectre/Meltdown mitigations, how it mercilessly killed all our Kubernetes workers, and how it locked us out of our QA environment.
Paul Carleton, Stripe Inc.
I am an infrastructure engineer on the Cloud team at Stripe and I want to make systems that fix themselves. Outside of work, I write about tech, hike, bike, and spend too much time thinking about where the moon is in relation to Earth.
author = {Paul Carleton},
title = {How Our Security Requirements Turned Us into Accidental Chaos Engineers},
year = {2018},
address = {Nashville, TN},
publisher = {USENIX Association},
month = oct
}