Challenges, Best Practices, and Solutions for Monitoring and Alerting with Big Data

Thursday, December 08, 2022 - 3:20 pm–3:50 pm AEDT

Daniel O'Dea, Atlassian

Abstract: 

Imagine you run an automated ice cream gigafactory. You notice your vanilla ice cream output has dropped. You check your monitoring system—it can't find the problem because the error is hidden somewhere in your thousands of ice cream machines (you make too much ice cream).

After days of manually searching the factory:

  1. You find a pond of vanilla ice cream next to a row of faulty machines, and
  2. You realise that 30% of your ice cream melts on its way to be packaged.

In cloud SaaS products (no, not Sugar as a Service), good system monitoring is similarly crucial. We'll discuss how we monitor Jira at Atlassian for millions of customers, and explore some challenges we've faced. We'll go over best practices and solutions to help you maintain the sweetest systems, with a particular focus on managing big, high-cardinality data.

Daniel O'Dea, Atlassian

Daniel O'Dea is a Site Reliability Engineer at Atlassian, where he is a key player in major incidents involving Jira. Daniel is the co-founder of Thorial, a startup building software tools for managers. Daniel is also a classical pianist, composer, and artist.

BibTeX
@conference {284925,
author = {Daniel O{\textquoteright}Dea},
title = {Challenges, Best Practices, and Solutions for Monitoring and Alerting with Big Data},
year = {2022},
address = {Sydney},
publisher = {USENIX Association},
month = dec
}
