Autopsy of a MySQL Automation Disaster

Friday, October 4, 2019 - 11:45–12:30

Jean-François Gagné, MessageBird

Abstract: 

You deployed automation, enabled automatic database master failover, and tested it many times: great, you can now sleep at night without being paged by a failing server. However, when you wake up in the morning, things might not have gone the way you expected. This talk will be about such a surprise.

Once upon a time, a failure brought down a MySQL master database. Automation kicked in and fixed things. However, a fancy failure, combined with human errors, an edge-case recovery, and a lack of oversight in tooling and scripting, led to a split-brain and data corruption. This talk will go into detail about the convoluted (but still real-world) sequence of events that led to this disaster. I will cover what could have prevented the split-brain and what could have made data reconciliation easier.

Jean-François Gagné, MessageBird

Jean-François is a System/Infrastructure Engineer and MySQL Expert. One year ago, he joined MessageBird, an IT telco startup in Amsterdam, with the mission of scaling the MySQL infrastructure. Before that, J-F worked on growing the Booking.com MySQL and MariaDB installations (he also worked on many other non-MySQL-related projects). Some of his latest projects are finding the best way to automate MySQL master failover, making Parallel Replication run faster, and promoting Binlog Servers. He also has a good understanding of replication in general and a respectable understanding of InnoDB, MySQL, Linux, and TCP/IP. Before Booking.com, he worked as a System/Network/Storage Administrator in a Linux/VMware environment, as an Architect for a Mobile Network and Service Provider, and as a C and Java Programmer in an IT Service Company. Even before that, when he was learning computer science, Jeff studied cache consistency in distributed systems and network group communication protocols.

BibTeX
@conference {239466,
author = {Jean-Fran{\c c}ois Gagn{\'e}},
title = {Autopsy of a MySQL Automation Disaster},
year = {2019},
address = {Dublin},
publisher = {{USENIX} Association},
month = oct,
}