Fast, Available, Catastrophically Failing? Safely Avoiding Behavioral Incidents in Complex Production Systems

Wednesday, October 2, 2019 - 16:45-17:30

Ramin Keene, fuzzbox.io

Abstract: 

Operators are increasingly asked to release and manage services whose behavior is far harder to reason about than that of traditional application services. Data products, model-based machine learning services, ensemble models, and large microservice architectures are founded on deliberate complexity, so that their availability can only be measured correctly via an SLA/QoS around their behavior, and is also threatened by the unknown unknowns of emergent behavior arising from their interactions.

Incidents shift from being about general service availability to being about behavior.

Safely operating these types of services in production presents a host of challenges that even the most experienced SRE may not expect. Severe incidents with stable infrastructure, invisible error rates, and IMPROVING response times, while the business fails catastrophically and loses millions of dollars? Absolutely!

Ramin Keene, fuzzbox.io

Ramin has helped enterprises large and small put machine learning, A/B testing, and data science products into production. He’s made ALL the mistakes and then some, helping companies lose thousands, if not millions, of dollars along the way. He is currently based in Los Angeles and spends his time working on adversarial experimentation tools that target infrastructure and release artifacts to help teams inspect and learn how their systems behave AFTER they have been baked and released.

BibTeX
@conference {239547,
author = {Ramin Keene},
title = {Fast, Available, Catastrophically Failing? Safely Avoiding Behavioral Incidents in Complex Production Systems},
year = {2019},
address = {Dublin},
publisher = {{USENIX} Association},
month = oct,
}