Fast, Reliable, Yet Catastrophically Failing!?! Safely Avoiding Incidents When Putting Machine Learning into Production

Ramin Keene, Fuzzbox

Safely releasing machine-learning-based services into production presents a host of challenges that even the most experienced SRE may not expect. We'll outline some severe outages seen in the wild and their causes, then detail how emerging, cutting-edge techniques from the DevOps and SRE world around "testing in prod", progressive delivery, and deterministic simulation are the PERFECT solution for increasing safety, resilience, and confidence for SREs operating and managing ML-based services at scale.

Ramin has spent the last 5 years working with data teams and large enterprises to put machine learning, A/B testing, and data science products into production. He’s made ALL the mistakes and then some, helping companies lose thousands, if not millions, of dollars along the way. He is currently based in Seattle and spends his time working on adversarial experimentation tools that target infrastructure and release artifacts to help teams inspect and learn about their software AFTER it has been baked and released.

Connect:

@rmn

BibTeX

@conference {232947,
author = {Ramin Keene},
title = {Fast, Reliable, Yet Catastrophically {Failing!?}! Safely Avoiding Incidents When Putting Machine Learning into Production},
year = {2019},
address = {Santa Clara, CA},
publisher = {USENIX Association},
month = may
}
