Laura Nolan, Google
Processes crash or may need to be restarted. Hard drives fail. Natural disasters can take out several datacenters in a region. Site Reliability Engineers need to anticipate these sorts of failures and develop strategies to keep systems running in spite of them.
This usually means running systems across multiple sites, and this means that you need to make tradeoffs between availability and consistency of your system state.
This talk explores distributed consensus algorithms, such as RAFT and Paxos in production: how they work, how they perform, what can go wrong, when to use them and not to use them.
Laura Nolan, Google
Laura Nolan has been a Site Reliability Engineer at Google for four years, working on large data infrastructure projects and most recently, networking. Her background is in software engineering and computer science. She wrote the 'Managing Critical State' chapter in the O'Reilly SRE book, and is co-chair of SRECon EMEA 2017.
Open Access Media
USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.
author = {Laura Nolan},
title = {Diistributed Consensus Algorithms},
year = {2017},
publisher = {USENIX Association},
month = may
}