A Dashboard Is Worth a Thousand Words: Better Monitoring for Better Ops

Thursday, June 13, 2019 - 9:00 am–10:00 am

Luca Magnoni, CERN


Not everyone is doing SRE. Consider a large-scale scientific organisation with decades of experience in distributed systems and IT service operations: it may have a solid, well-established ops culture and still benefit from adopting some of the new concepts and practices that SRE has defined in recent years. This is the story of how the creation of a new monitoring system, which gathers metrics and logs for infrastructure and services and is based on a well-known technology stack (e.g., Kafka, Grafana, InfluxDB, Elasticsearch), led not only to better service operations but also to greater awareness of SRE practices and culture among service managers. The talk will discuss the design decisions, the operational challenges of building and scaling the system up to tens of thousands of hosts, and the strategy adopted to enhance monitoring practices, including the introduction of concepts such as SLIs and SLOs and the benefits derived.

Luca Magnoni, CERN

Luca is a Senior Software Engineer with more than ten years of experience in designing and operating distributed systems. He is currently a computing engineer and solution architect at the CERN IT Department working on monitoring infrastructures for the data centre and IT services.

