How a Single API Endpoint Saved Us 3000 CPU

Wednesday, 30 October 2024, 15:10–15:30 GMT

Lasse Hels, Maersk

Abstract: 

How do you run a time series database exclusively on spot nodes? With great difficulty!

Grafana Mimir is the centrepiece of our observability platform at Maersk. For a long time, rollouts of Mimir's most crucial component would consistently trigger significant performance degradations in the platform. Getting to the root cause of the issue proved laborious and took us deep into the internals of Mimir.

Join us as we go through the issue postmortem and reflect on how to create consistency in a chaotic environment. The talk touches on topics such as CPU throttling, hash rings, compute utilisation analysis and metric series cardinality.

Lasse is a software engineer at Maersk. As a member of the telemetry team, he took part in building the Maersk Observability Platform, and now spends much of his time keeping it running. Outside of work, his interests include speedrunning, powerlifting, etymology, and camels.

BibTeX
@conference {302189,
author = {Lasse Hels},
title = {How a Single {API} Endpoint Saved Us 3000 {CPU}},
year = {2024},
address = {Dublin},
publisher = {USENIX Association},
month = oct
}
