Going Off the Rails: Infrastructure Outage Planning
Matt Provost, Avere Systems
Over the 2014 Christmas holiday, major engineering work was scheduled on some of London's main train lines into King's Cross Station. The work overran the outage window by a day and significantly disrupted the travel plans of tens of thousands of passengers. In January, Network Rail published a detailed report on the causes of the overrun.
What can SREs learn from physical infrastructure maintenance and outage procedures? Even in this new era of cloud computing, someone still has to build and maintain the infrastructure that keeps the cloud working. Even with redundant systems and redundant data centres, there will still be times when the underlying infrastructure needs an outage, whether for preventative maintenance to replace older equipment or for planned work on electrical or HVAC systems.
What lessons from the Christmas King's Cross outage can we apply to data centre infrastructure? This is still relevant even for cloud providers: Verizon Cloud had a 40-hour scheduled outage in January. Although underlying technical problems during the railway maintenance added up and caused the delays, the larger problems were around planning, staff rotation, and communication. Teams in the field focused on solving technical problems as they came up rather than escalating them to managers, who could see the bigger picture, communicate with other teams to form an overall strategy, and make go/no-go decisions based on accurate information. This situation can be even worse in a data centre, where time estimation is notoriously unreliable.
In this talk, I will break down and analyse the King's Cross report and relate each finding back to the data centre environment.
Matt Provost started as a systems administrator in 1998. Before moving to London in 2014, he was the Systems Manager at Weta Digital in Wellington, New Zealand, where he oversaw the commissioning of a new water-cooled data centre that hosted the render infrastructure for Avatar, Rise of the Planet of the Apes, Tintin, and the Hobbit films. During the production of Avatar, this data centre hosted 7 systems in the Top 500 Supercomputer list. Matt has presented at LISA conferences about storage performance management, monitoring, and complex systems failure analysis.
Open Access Media
USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone.