Deep Dive: Azure Resource Manager Outage

Wednesday, 26 October, 2022 - 11:00-11:40 CEST

Benjamin Pannell and Brendan Burns, Microsoft

Abstract: 

Microsoft Azure Resource Manager is the globally distributed system through which customers purchase, configure, and maintain their Azure workloads. When services were recently interrupted, a temporary outage impacted the ability of some customers to manage their workloads. This situation was particularly interesting because the system architecture both helped reduce customer impact and simultaneously made it challenging for engineers to understand and mitigate the issue.

Through this talk, we’ll share our post-incident findings to give audience members insight into how this specific system operates and how emergent behavior in a socio-technical system can evade a system’s defenses. We’ll also demonstrate an incident report structure which helps prevent common problems like after-the-fact reasoning and instead helps readers effectively learn from the investigation.

Benjamin Pannell, Microsoft

Ben is the technical lead of Microsoft Azure's Control Plane SRE team. Based out of Dublin, Ireland, he has helped guide improvements to the operability, resiliency, and performance of mission-critical control plane services including Azure Resource Manager. Prior to his role at Microsoft, he worked as an SRE on a global gaming platform where he helped automate away the role of a NOC, and as a software engineer building an agricultural GIS platform in South Africa.

Brendan Burns, Microsoft

Brendan Burns is a co-founder of the Kubernetes open source project and a corporate vice president at Microsoft where his teams are responsible for Microsoft Azure APIs, governance and management, as well as the Azure Kubernetes Service and cloud-native open source. He built and has run high-scale, mission-critical distributed systems for more than a decade. Prior to working on distributed systems, he was a professor of computer science at Union College in Schenectady, New York. He received a Ph.D. in computer science with a specialty in robotics from the University of Massachusetts Amherst and an undergraduate degree in computer science and studio art from Williams College.

BibTeX
@conference {284621,
author = {Benjamin Pannell and Brendan Burns},
title = {Deep Dive: Azure Resource Manager Outage},
year = {2022},
address = {Amsterdam},
publisher = {USENIX Association},
month = oct
}
