Operating within Normal Parameters: Monitoring Kubernetes

Monday, March 25, 2019 - 5:00 pm–5:30 pm

Elana Hashman, Two Sigma


After Kubernetes takes over your data centers, how can you be sure that it's operating within normal parameters? What does "normal" even mean? By formalizing your expected quality of service, you can measure and compare against known targets with open source tools like Prometheus. In this talk, we'll use Kubernetes as a case study for introducing service level objectives (SLOs) to guide monitoring efforts. Come learn the how and why of metric selection for monitoring Kubernetes quality of service, what gaps exist in the open source Kubernetes monitoring ecosystem, how to use Prometheus and its exporters to establish predictability and "normal" baselines, and how to use this telemetry to debug service degradations in a Kubernetes cluster.

Elana Hashman currently works as a Reliability Engineer at Two Sigma, wrangling Kubernetes clusters and automating operations. She is currently a member of the Kubernetes Instrumentation SIG, where she focuses on benchmarking and metrics usability. In the wider FOSS community, she is a Debian Developer, maintaining the Clojure package ecosystem in Debian and Ubuntu, and a Python Packaging Authority committer, hacking on portable binary Python wheels for Linux.

