GameDay: Creating Resiliency Through Destruction

Jesse Robbins started his presentation "GameDay: Creating Resiliency Through Destruction" (slides) with this awesome quote:

"You don’t choose the moment,
the moment chooses you.
You only choose how prepared
you are when it does."
-Fire Chief Mike Burtch

Although it was originally aimed at firefighters, it applies just as well to system administrators. Throughout the session, Jesse drew parallels between two of his greatest passions: firefighting and operations.

GameDay is an exercise designed to increase resilience through large-scale fault injection across critical systems. Resilience here means the ability of a system to adapt to changes, failures, and disturbances. By "system", he means people, culture, processes, applications and services, infrastructure, software, and hardware.

GameDay increases resilience in 3 ways:
Preparation
- Identification and mitigation of risks and impact from failure
- Reduces frequency of failure (i.e. increases MTBF, the mean time between failures)
- Reduces duration of recovery (MTTR, the mean time to recovery)
Participation
- Builds confidence & competence responding to failure and under stress.
- Strengthens individual and cultural ability to anticipate, mitigate, respond to, and recover from failures of all types.
Exercises
- Trigger and expose “latent defects”
- Choose to discover them, instead of letting that be determined by the next real disaster.
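To see why the two metrics under Preparation matter together, here is a minimal sketch (not from the talk) that uses the common steady-state approximation availability = MTBF / (MTBF + MTTR), treating MTBF as the average uptime between failures. The function name and the sample numbers are made up purely for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a system is up, using MTBF / (MTBF + MTTR).

    Assumes MTBF is the average uptime between failures and MTTR is the
    average time needed to recover from one failure.
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must not be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical numbers: failures become rarer and recovery becomes faster.
before = availability(mtbf_hours=500, mttr_hours=4)
after = availability(mtbf_hours=1000, mttr_hours=1)
print(f"before: {before:.4%}  after: {after:.4%}")
```

Either lever helps on its own: a longer MTBF or a shorter MTTR both push the ratio closer to 1.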

Jesse also had some great practical advice on how people can start doing such a 'frightening' exercise: start small. Begin with small, controlled failures. Announce them to the team in advance and let everyone prepare as best they can. Then run the GameDay as planned, even if it feels scary to actually turn off a vital part of your infrastructure. Doing this lets the team grow and learn from a controlled fire drill.

Once you have a good level of trust, you can move on to a full-scale exercise, where you turn off a full datacenter or whatever is identified as the 'scariest' thing in your organization, the one thing that everybody is afraid to turn off. This will probably cause a disaster, but it is the only way your team will learn and build the trust and experience to perform at its best during a real emergency.

Jesse concluded: "there is no substitute for experience... Failure free operations require experience with failure". Great presentation and great advice. I'm sure many of the attendees were intrigued by the GameDay idea and will run such fire drills soon.
I know I will.