Case Study: Lessons Learned from Our First Worldwide Outage

Yoav Cohen

Thursday, 31 August, 2017 - 10:50–11:45

Yoav Cohen, Imperva Incapsula

Last year, on March 10, Incapsula experienced the first worldwide outage in its history… While relatively short in duration, it affected thousands of websites that rely on our security and acceleration every day.

Rooted in a 3-year-old dormant bug in our IncapRules code, this outage made us realize there were changes we needed to make in the way we write and qualify code. As VP of Engineering, I am responsible for the faulty code and our testing procedures, and it was up to me to lead the team to an order of magnitude higher reliability.

One of the key things we were missing was a way to propagate customer configuration across our network quickly without compromising on safety. The result was a new configuration sandbox system that achieves both.

In this talk I’ll present the process we followed to analyze the true reliability of our system and the framework we use to reason about it, to prioritize tasks across teams, and to design a more reliable service.

Yoav is VP of Engineering for Imperva Incapsula and has been with the company since it made its first sale. Between meetings you will find him working on build systems or nasty performance bugs. When not doing so, he tries to sneak in a few minutes on his guitar or a few laps in the pool. Yoav holds an M.Sc. in Computer Science from Tel Aviv University, where he studied multi-core programming.

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone.

BibTeX

@conference {205520,
author = {Yoav Cohen},
title = {Case Study: Lessons Learned from Our First Worldwide Outage},
year = {2017},
address = {Dublin},
publisher = {USENIX Association},
month = aug
}
