Fast, Available, Catastrophically Failing? Safely Avoiding Behavioral Incidents in Complex Production Systems

Ramin Keene

Wednesday, 2 October, 2019 - 16:45–17:30

Ramin Keene, fuzzbox.io

Operators are increasingly asked to release and manage services whose behavior is far harder to reason about than that of traditional application services. Data products, model-based machine learning services, ensemble models, and large microservice architectures are founded on deliberate complexity. Their availability is correctly measured only via an SLA/QoS around their behavior, and it is also threatened by the unknown unknowns of emergent behavior arising from their interactions.

Incidents shift from being about general service availability to being about behavior.

Safely operating these types of services in production presents a host of challenges that even the most experienced SRE may not expect. Severe incidents with stable infrastructure, invisible error rates, and IMPROVING response times, while the business fails catastrophically and loses millions of dollars? Absolutely!

Ramin has helped enterprises large and small put machine learning, A/B testing, and data science products into production. He’s made ALL the mistakes and then some, helping companies lose thousands, if not millions, of dollars along the way. He is currently based in Los Angeles and spends his time working on adversarial experimentation tools that target infrastructure and release artifacts to help teams inspect and learn how their systems behave AFTER they have been baked and released.

Connect:

@rmn

Open Access Media

USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.

BibTeX

@conference {239547,
author = {Ramin Keene},
title = {Fast, Available, Catastrophically Failing? Safely Avoiding Behavioral Incidents in Complex Production Systems},
year = {2019},
address = {Dublin},
publisher = {USENIX Association},
month = oct
}
