From Exceptional Maintenance to Automated Routine Operation: A Story of the Datacenter Switchover for Wikipedia

Giuseppe Lavagetto

Wednesday, 11 October, 2023 - 16:00–16:40

Giuseppe Lavagetto, Wikimedia Foundation

This is a tale about reducing toil by automating it away.

Once upon a time, there was a single core datacenter for Wikipedia. Later, we added a second core datacenter as a failover; the first time we moved traffic between them, it was a multi-day operation involving multiple engineers that required about 40 minutes of read-only time and an extended period of degraded performance.

Nowadays, we switch traffic multiple times per year. The newest team member conducts the process and, with a senior member as their sidekick, carries out a switchover that requires less than two minutes of read-only time and causes virtually no degraded performance.

I'll introduce the tools we built (all open source) as well as the architectural approach we took to get to that point, which should also be applicable to other problems and architectures in the same space.

An astrophysicist by trade, I have been disguising myself for the last decade as an SRE for the Wikimedia Foundation, the non-profit that runs your favourite free online encyclopedia. My work focuses mostly on making our application layer flexible, automated and dynamic.

BibTeX

@conference {292244,
author = {Giuseppe Lavagetto},
title = {From Exceptional Maintenance to Automated Routine Operation: A Story of the Datacenter Switchover for Wikipedia},
year = {2023},
address = {Dublin},
publisher = {USENIX Association},
month = oct
}
