How LinkedIn Performs Maintenances at Scale

Akash Vacher

Wednesday, 13 October, 2021 - 02:30–03:00

Akash Vacher, LinkedIn

LinkedIn runs on a fleet of hundreds of thousands of dedicated servers and network devices, distributed across the globe, that serve the website. Downtime on any of these devices can disrupt the applications running on top of this infrastructure. It is therefore crucial that all maintenances on these devices, such as firmware upgrades or hardware replacements, are performed without impacting the overall availability of the services that depend on them.

This talk describes the SRE principles that guided the inception and development of a platform to schedule and execute infrastructure maintenances, and shares what we learned along the way.

Akash Vacher is a Site Reliability Engineer at LinkedIn. He worked on large-scale streaming data infrastructure services such as Kafka, Samza, and Brooklin before transitioning over to help facilitate infrastructure maintenance at scale at LinkedIn.

Connect:

@AkashVacher

BibTeX

@conference {276717,
author = {Akash Vacher},
title = {How {LinkedIn} Performs Maintenances at Scale},
year = {2021},
publisher = {USENIX Association},
month = oct
}
