How Our Security Requirements Turned Us into Accidental Chaos Engineers

Wednesday, October 31, 2018 - 9:00 am–9:30 am

Paul Carleton, Stripe Inc.


This talk will cover a security-focused project that evolved into a chaos injection system.

The system is called “Lifespan Management,” and it enforces a lifespan on a cloud-hosted VM. After the lifespan expires, the host is terminated and a replacement is brought up. This makes it easier to apply fixes for CVEs (a CVE comes out on day X, and we know hosts will age out by day Y), and it reduces the value of a compromised machine (“I’ve finally captured a host! It’s being shut down?? No!”).
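The core rule can be sketched in a few lines of Python. This is an illustrative sketch only, not Stripe's implementation: the `Host` type, the `expired_hosts` and `age_out_deadline` helpers, and the 30-day lifespan in the usage note are all hypothetical names and values chosen for the example. It assumes a fix is baked into the image used for replacement hosts from the moment it becomes available.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Host:
    # Hypothetical record of a VM; a real system would read this from the cloud provider.
    host_id: str
    launched_at: datetime


def expired_hosts(hosts, max_lifespan, now):
    """Return the hosts whose age has reached max_lifespan and are due for replacement."""
    return [h for h in hosts if now - h.launched_at >= max_lifespan]


def age_out_deadline(fix_available_at, max_lifespan):
    """Latest time by which every host launched before the fix has been replaced.

    Assumes replacements launched after fix_available_at already include the fix
    (the "day X" to "day Y" reasoning from the abstract).
    """
    return fix_available_at + max_lifespan
```

For example, with a hypothetical 30-day lifespan, `age_out_deadline(datetime(2018, 1, 3), timedelta(days=30))` gives February 2, 2018 as the day by which every host launched before the fix would have been replaced.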

This seemed simple enough, but the complexity it uncovered made for a fun, year-long adventure in chaos engineering.

In this talk, I’ll cover the evolution of the system and some lessons we learned along the way, such as:

  • Not all termination API calls are created equal
  • Zero failing health checks does not mean a host is healthy
  • Answering “Was this the chaos system?” quickly is essential

I’ll also include anecdotes, such as how it helped with Spectre/Meltdown mitigations, how it mercilessly killed all our Kubernetes workers, and how it locked us out of our QA environment.

I am an infrastructure engineer on the Cloud team at Stripe and I want to make systems that fix themselves. Outside of work, I write about tech, hike, bike, and spend too much time thinking about where the moon is in relation to Earth.

