When One Line Took Thousands of Websites Offline

Wednesday, 11 October, 2023 - 11:50–12:30

Francisco Borges Aurindo Barros and Jack Henschel, CERN


This talk describes an incident in which an innocuous change in a configuration management system caused a highly visible outage of thousands of websites, followed by an intense recovery procedure. The talk covers the part of the infrastructure that prevented more widespread damage, the lessons learned (in terms of infrastructure design and operational procedures), and the significant improvements that have been implemented since then. All of this happened on Kubernetes infrastructure, so the talk dives into Kubernetes operators, automation, manual intervention, configuration management, and backups.

Francisco Borges Aurindo Barros, CERN

Francisco Barros is an SRE at CERN. He likes to specialize in automating the repetitive, working with open source technologies, and helping to develop and maintain reliable, modern solutions. Currently he manages a Kubernetes-flavored cluster that handles all the CMS websites at CERN. He lives near Geneva and likes to snowboard in winter.

Jack Henschel, CERN

Jack Henschel is a Cloud Computing Engineer at CERN, where he develops and administers several Kubernetes clusters, ensuring all components integrate smoothly with the rest of CERN's computing environment. His special areas of interest are systems performance, observability, and efficiency. In his free time he likes exploring the French and Swiss Alps on foot and by bike.

