Lessons Learned in Black Box Monitoring 25,000 Endpoints and Proving the SRE Team's Value

Monday, March 25, 2019 - 11:40 am–12:10 pm

Aaron Wieczorek, United States Digital Service

Abstract: 

How do you monitor systems that don't want to be monitored or ones that you don't have internal access to? Why monitor these systems at all? The United States Digital Service finds the truth and tells the truth, and fights fires across government, even when those fires don't want to be found. We put together a system to black-box monitor all 25,000 .gov domains and then expanded it to perform more robust monitoring of important citizen-facing, government-provided services so we can go where the work is and restore services. In the process, we're hoping to change the culture and prove the value of SRE teams across government. This is how we're doing it.

Aaron Wieczorek, United States Digital Service

Aaron is a Site Reliability Engineer on the United States Digital Service Headquarters team. He works on hard technical problems and hard bureaucratic problems, from infrastructure and CI/CD pipelines to network engineering.

SREcon19 Americas Open Access Videos Sponsored by Salesforce

BibTeX
@conference {229537,
author = {Aaron Wieczorek},
title = {Lessons Learned in Black Box Monitoring 25,000 Endpoints and Proving the {SRE} Team{\textquoteright}s Value},
year = {2019},
address = {Brooklyn, NY},
publisher = {{USENIX} Association},
month = mar,
}