Reliability at Massive Scale: Lessons Learned at Facebook

As the Facebook Web site and platform grow to an ever-larger scale, one of the most difficult challenges is running reliably while constantly changing our product. Over the years we have developed a number of principles for avoiding large failures while making frequent, small changes to our system. These principles have allowed us to run with a low rate of serious incidents, but such incidents still occur. I'll walk through the details of a recent site outage to illustrate how these principles work and how things can go wrong when they aren't followed.

Robert Johnson, Director of Engineering, Facebook, Inc.

Sanjeev Kumar, Engineering Manager, Facebook, Inc.

BibTeX
@conference {267059,
author = {Robert Johnson and Sanjeev Kumar},
title = {Reliability at Massive Scale: Lessons Learned at Facebook},
year = {2010},
address = {San Jose, CA},
publisher = {USENIX Association},
month = nov
}