How LinkedIn Performs Maintenances at Scale

Note: Presentation times are in Coordinated Universal Time (UTC).

Wednesday, 13 October, 2021 - 02:30–03:00

Akash Vacher, LinkedIn

Abstract: 

LinkedIn runs on a fleet of hundreds of thousands of dedicated servers and network devices, distributed across the globe, that serve the website. Downtime on any of these devices can disrupt the applications running on top of this infrastructure. It is therefore crucial that all maintenances on these devices, such as firmware upgrades or hardware replacements, are performed without affecting the overall availability of the services that run on them.

This talk describes the SRE principles that guided the inception and development of a platform to schedule and execute infrastructure maintenances, and shares what we learned along the way.

Akash Vacher is a Site Reliability Engineer at LinkedIn. He worked on large-scale streaming data infrastructure services such as Kafka, Samza, and Brooklin before transitioning to facilitating infrastructure maintenance at scale at LinkedIn.

SREcon21 Open Access Sponsored by Indeed

BibTeX
@conference {276717,
author = {Akash Vacher},
title = {How {LinkedIn} Performs Maintenances at Scale},
year = {2021},
publisher = {USENIX Association},
month = oct
}
