Deployment Automation: Releasing Quickly and Reliably

Monday, March 13, 2017 - 11:25am–11:50am

Sebastian Yates, Uber SRE


We’ve always encouraged engineers to push code quickly to production. Developers at Uber have had full control over how they deploy (which instances, datacenters, etc. they deploy to), and a full production upgrade could complete in minutes. This helped our business achieve incredible growth but has hurt reliability. As a business, we need to remain fast while taking smarter risks and reducing the potential impact of changes.

We set out to build automated deployment workflows that all services could use to make deploying code safer. As we onboarded services, the biggest battle wasn’t the technical challenges but convincing teams that, for our most critical services, we needed to trade some deployment velocity for availability. Uber emphasizes moving fast, so slowing down deployments was controversial.

Collecting time-to-detect (TTD) and time-to-mitigate (TTM) statistics for our most significant outages let us make principled decisions about ideal deploy lengths and ground our discussions with teams in data. We added features to the workflows to improve the deployment experience: automated canaries, deployment batch metrics, automated rollbacks, continuous deployment, and load testing.
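
As a rough illustration of how outage statistics can inform deploy length, the sketch below picks a per-batch wait ("bake") time from historical TTD values and flags a batch for rollback when its error rate rises past a tolerance. Everything here is an assumption made for illustration: the function names, the 90th-percentile choice, the batch sizes, and the error-rate tolerance are hypothetical and do not describe Uber's actual workflows.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 1..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def plan_rollout(ttd_minutes, batch_fractions, pct=90):
    """Return (bake_minutes_per_batch, total_bake_minutes).

    Assumption: each batch bakes long enough that pct% of past
    outages would have been detected before the next batch starts.
    """
    bake = percentile(ttd_minutes, pct)
    return bake, bake * len(batch_fractions)


def should_rollback(baseline_error_rate, batch_error_rate, tolerance=0.01):
    """Hypothetical rollback rule: the batch exceeds baseline by more than tolerance."""
    return batch_error_rate > baseline_error_rate + tolerance


# Made-up TTD samples (minutes) and batch sizes, for illustration only.
historical_ttd = [2, 3, 5, 8, 13, 21, 4, 6, 9, 30]
batches = [0.01, 0.10, 0.50, 1.00]
bake, total = plan_rollout(historical_ttd, batches)
```

With these made-up numbers, a longer bake time catches more regressions before they spread but lengthens the full deploy, which is the velocity-for-availability trade-off described above.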

Changing the way we deploy has improved reliability at Uber. It’s required a culture change, and we hope the lessons we learned about ourselves and our systems will benefit your work.

An SRE at Uber for 2 years, responsible for the end-to-end availability of Uber's marketplace systems. Loves all things distributed and should never be allowed to name systems of any kind.

