Operating within Normal Parameters: Monitoring Kubernetes

Monday, March 25, 2019 - 5:00 pm–5:30 pm

Elana Hashman, Two Sigma


After Kubernetes takes over your data centers, how can you be sure that it's operating within normal parameters? What does "normal" even mean? By formalizing your expected quality of service, you can measure and compare against known targets with open source tools like Prometheus. In this talk, we'll use Kubernetes as a case study for introducing service level objectives (SLOs) to guide monitoring efforts. Come learn the how and why of metric selection for monitoring Kubernetes quality of service, what gaps exist in the open source Kubernetes monitoring ecosystem, how to use Prometheus and its exporters to establish predictability and "normal" baselines, and how to use this telemetry to debug service degradations in a Kubernetes cluster.


Elana Hashman works as a Reliability Engineer at Two Sigma, wrangling Kubernetes clusters and automating operations. She is currently a member of the Kubernetes Instrumentation SIG, where she focuses on benchmarking and metrics usability. In the wider FOSS community, she is a Debian Developer, maintaining the Clojure package ecosystem in Debian and Ubuntu, and a Python Packaging Authority committer, hacking on portable binary Python wheels for Linux.



@conference {229517,
author = {Elana Hashman},
title = {Operating within Normal Parameters: Monitoring Kubernetes},
year = {2019},
address = {Brooklyn, NY},
publisher = {USENIX Association},
month = mar
}
