Case Study: Lessons Learned from Our First Worldwide Outage

Thursday, 31 August 2017 - 10:50am–11:45am

Yoav Cohen, Imperva Incapsula

Abstract: 

Last year, on March 10, Incapsula experienced the first worldwide outage in its history… While relatively short in duration, it affected thousands of websites that rely on our security and acceleration every day.

Rooted in a 3-year-old dormant bug in our IncapRules code, this outage made us realize there were changes we needed to make in the way we write and qualify code. As VP of Engineering, I am responsible for the faulty code and our testing procedures, and it was up to me to lead the team to achieve an order of magnitude higher reliability.

One of the key things we were missing was a way to propagate customer configuration across our network quickly without compromising safety. The result was a new configuration sandbox system that achieves both.

In this talk I’ll present the process we followed to analyze the true reliability of our system and the framework we use to reason about it, to prioritize tasks across teams, and to design a more reliable service.

BibTeX
@conference {205520,
author = {Yoav Cohen},
title = {Case Study: Lessons Learned from Our First Worldwide Outage},
year = {2017},
address = {Dublin},
publisher = {{USENIX} Association},
}