Mastering Near-Real-Time Telemetry and Big Data: Invaluable Superpowers for Ordinary SREs

Wednesday, October 31, 2018 - 2:00 pm–2:30 pm

Ivan Ivanov, Netflix

Abstract: 

One of the fundamental requirements for a typical SRE team is solid operational insight into the holistic state of the supported system. This usually involves collecting, aggregating, correlating, visualizing, and reacting to data generated by a diverse set of data sources.

In this talk I will go through the high-level approach, sample data sources, and some implementation details used by the Netflix Open Connect CDN Reliability Engineering team to support the infrastructure and services that are hosted on thousands of physical servers and provide streaming video delivery for hundreds of millions of clients.

I will cover some examples of using Hive, Presto, Spark, Elasticsearch, Tableau, and Netflix-developed tools for monitoring, alerting, debugging, long-term analysis, planning, and related tasks. I will also speak about the benefits of correlating detailed server-reported and client-reported telemetry.

To be clear, this is not a talk about how to build, implement, develop, or support Big Data or near-real-time telemetry systems. It is about how you can use them as a platform and a powerful toolset to make your operations team stronger.

Ivan Ivanov, Netflix

Ivan is a Senior CDN Reliability Engineer on the Netflix Open Connect team. He has been designing, deploying, supporting, and optimizing online services on a global scale in various operations roles for more than 17 years. He focuses on service reliability, availability, scalability, quality of experience (QoE), service optimization, and investigating and troubleshooting server, client, network, and application issues. Prior to joining Netflix, Ivan was a Principal Service Engineer at Microsoft, working on Windows Update, Microsoft.com, and Windows Store.

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.

BibTeX
@conference {221700,
author = {Ivan Ivanov},
title = {Mastering Near-Real-Time Telemetry and Big Data: Invaluable Superpowers for Ordinary SREs},
year = {2018},
address = {Nashville, TN},
publisher = {{USENIX} Association},
month = oct,
}
