Catch Fire and Halt: Fire in the Datacenter

Thursday, December 8, 2016 - 4:45pm–5:30pm

Jon Kuroda, University of California, Berkeley

Abstract: 

What do you do when a fire in the datacenter takes your entire organization down until you can recover? We found out the hard way on Friday, September 18, 2015, when one of our research group’s servers caught fire at the UC Berkeley campus datacenter. The fire activated the facility's fire suppression and emergency power-off systems and caused an outage of nearly all campus-hosted online services, with recovery efforts lasting through the weekend. We will detail the circumstances surrounding the incident, examine the post-mortem process that followed, and compare our experiences with those of other engineering disciplines after a critical incident.

Jon Kuroda, University of California, Berkeley

Jon is a sysadmin and research engineer at the Department of Electrical Engineering at the University of California, Berkeley, where he spends his days (and nights) puzzling over HDFS/Spark clusters, debugging business processes, and trying to keep datacenter spaces clean(er) and more usable, all while trying to keep up with dozens of computer science researchers.


BibTeX
@conference {201535,
author = {Jon Kuroda},
title = {Catch Fire and Halt: Fire in the Datacenter},
year = {2016},
address = {Boston, MA},
publisher = {{USENIX} Association},
month = dec,
}
