Autopsy of a MySQL Automation Disaster

Friday, 4 October, 2019 - 11:45-12:30

Jean-François Gagné, MessageBird


You deployed automation, enabled automatic database master failover and tested it many times: great, you can now sleep at night without being paged by a failing server. However, when you wake up in the morning, things might not have gone the way you expected. This talk will be about such a surprise.

Once upon a time, a failure brought down a MySQL master database. Automation kicked in and fixed things. However, a fancy failure, combined with human errors, an edge-case recovery, and a lack of oversight in tooling and scripting, led to a split-brain and data corruption. This talk will go into detail about the convoluted (but still real-world) sequence of events that led to this disaster. I cover what could have prevented the split-brain and what could have made data reconciliation easier.

Jean-François Gagné, MessageBird

Jean-François is a System/Infrastructure Engineer and MySQL Expert. One year ago, he joined MessageBird, an IT telco startup in Amsterdam, with the mission of scaling the MySQL infrastructure. Before that, J-F worked on growing the MySQL and MariaDB installations (he also works on many other non-MySQL related projects). Some of his latest projects are finding the best way to automate MySQL master failover, making Parallel Replication run faster and promoting Binlog Servers. He also has a good understanding of replication in general and a respectable understanding of InnoDB, MySQL, Linux, and TCP/IP. Before, he worked as a System/Network/Storage Administrator in a Linux/VMware environment, as an Architect for a Mobile Network and Service Provider, and as a C and Java Programmer in an IT Service Company. Even before that, when he was learning computer science, Jeff studied cache consistency in distributed systems and network group communication protocols.


@conference {239466,
author = {Jean-Fran{\c c}ois Gagn{\'e}},
title = {Autopsy of a {MySQL} Automation Disaster},
year = {2019},
address = {Dublin},
publisher = {USENIX Association},
month = oct
}
