Commas Save Lives, or at Least LinkedIn

Wednesday, 26 October, 2022 - 11:45-12:30 CEST

Todd Palino, LinkedIn

Abstract: 

What happens when the only good thing you can say about a site outage is that it was detected quickly? In February of 2021, LinkedIn was taken down by the smallest of things: a comma (or the lack thereof). Through a number of contributing factors, including challenges with the incident response process itself, the error knocked out the public site and significant amounts of internal tooling, and the incident spiraled into a much longer time to mitigation.

Not all is bleak, however. All problems can be fixed when there is a solid foundation to build on. For LinkedIn, this includes bricks laid down by our former head of SRE, who clearly stated that "we are here to attack the problem, not the person." And it includes the culture and values that focus us not only on getting things done, but on having fun while we do it.

Todd Palino is a Principal Staff Engineer in Site Reliability at LinkedIn, focused on Efficiency Engineering, Resilience, and Incident Response. Prior to that, he was responsible for architecture, day-to-day operations, and tools development for one of the largest Apache Kafka deployments. He is also the co-author of Kafka: The Definitive Guide (O’Reilly Media). Out of the office, you can find him sharing his experience from years in SRE technical leadership with conference audiences, or out on the trails, training for the next marathon.

BibTeX
@conference {284623,
author = {Todd Palino},
title = {Commas Save Lives, or at Least {LinkedIn}},
year = {2022},
address = {Amsterdam},
publisher = {USENIX Association},
month = oct
}
