A Dashboard Is Worth a Thousand Words: Better Monitoring for Better Ops

Thursday, June 13, 2019 - 9:00 am–10:00 am

Luca Magnoni, CERN

Abstract: 

Not everyone is doing SRE. Consider a large-scale scientific organisation with decades of experience in distributed systems and IT service operations: it may have a solid, well-established ops culture and still benefit from adopting some of the new concepts and practices that SRE has defined in recent years. This is the story of how the creation of a new monitoring system, gathering together metrics and logs for infrastructure and services and based on a well-known technology stack (e.g. Kafka, Grafana, InfluxDB, Elasticsearch), led not only to better service operations but also to greater awareness of SRE practices and culture among service managers. The talk will discuss the design decisions, the operational challenges in building and scaling the system up to tens of thousands of hosts, and the strategy adopted to enhance monitoring practices, introducing concepts such as SLI/SLO, and the benefits derived.

Luca Magnoni, CERN

Luca is a Senior Software Engineer with more than ten years of experience in designing and operating distributed systems. He is currently a computing engineer and solution architect at the CERN IT Department working on monitoring infrastructures for the data centre and IT services.

BibTeX
@conference {233263,
author = {Luca Magnoni},
title = {A Dashboard Is Worth a Thousand Words: Better Monitoring for Better Ops},
year = {2019},
address = {Singapore},
publisher = {USENIX Association},
month = jun
}