Registration Fee: $400
Thanks to generous sponsorship, early bird pricing is now permanent for SREcon15!
Venue:
Hyatt Regency Santa Clara
5101 Great America Pkwy
Santa Clara, CA 95054
Netflix RaaS: Reliability as a Service
Coburn Watson, Netflix, Inc.
The Netflix architecture is based on hundreds of microservices running in the cloud at massive scale across numerous AWS regions. Achieving excellent availability of such a complex system requires a capable operations methodology. At Netflix we have a shared services team that seeks to lower operational barriers for individual service teams in order to improve both aggregate and microservice-level reliability. The challenge lies in finding the right balance of responsibility between a shared service support team and the devops engineers on the microservice team itself. We have taken an approach in which tooling and associated methodologies developed by our Operations Engineering organization tackle the following subset of operational activities at the platform level:
- Continuous integration and deployment
  - Automated staggered deployment of microservice code across cloud regions
  - Automated analysis of canary versus baseline code
- Tuning of circuits in the system which respond to localized failures
- Improved observability for both macro and micro performance dimensions
- Identification and termination of server instances which are outliers
Through elimination of such undifferentiated heavy lifting, the teams can shift their focus to product development rather than being mired in operational complexity. The key benefit is the improvement of engineering velocity alongside reliability. An organization needs to decide where to draw the line for operational responsibilities. This is no different in the Netflix "Freedom and Responsibility" culture.
This presentation will cover the operational complexities we have abstracted away from our microservice engineering teams, the associated decision factors, and future direction of the program.
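As a rough illustration of one item in the list above, identifying server instances which are outliers, the following is a minimal sketch that flags instances whose metric sits far from the fleet median. It is not the tooling described in the talk: the function name `find_outlier_instances`, the choice of a median-absolute-deviation score, the 3.5 threshold, and the instance IDs are all assumptions made for the example.

```python
from statistics import median


def find_outlier_instances(metric_by_instance, threshold=3.5):
    """Return the sorted IDs of instances whose metric deviates strongly
    from the fleet median, using a modified z-score based on the median
    absolute deviation (MAD). Hypothetical helper, not a Netflix API."""
    values = list(metric_by_instance.values())
    if len(values) < 3:
        # Too few instances to say what "normal" looks like.
        return []

    fleet_median = median(values)
    mad = median(abs(v - fleet_median) for v in values)
    if mad == 0:
        # At least half the fleet reports the same value; the score is
        # undefined, so this sketch conservatively flags nothing.
        return []

    outliers = []
    for instance_id, value in metric_by_instance.items():
        # 0.6745 rescales the MAD so the score is comparable to a
        # standard z-score under a normal distribution.
        modified_z = 0.6745 * (value - fleet_median) / mad
        if abs(modified_z) > threshold:
            outliers.append(instance_id)
    return sorted(outliers)


# Hypothetical per-instance latencies in milliseconds.
fleet = {"i-a": 100, "i-b": 102, "i-c": 98, "i-d": 101, "i-e": 400}
print(find_outlier_instances(fleet))
```

A median-based score is used here because a single badly behaving instance can inflate a mean and standard deviation enough to hide itself. What to do with a flagged instance (terminate, drain, or only alert) is a separate policy decision that the sketch leaves out.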
Coburn leads the Cloud Performance and Reliability Engineering team at Netflix. His team works to optimize the use of massive cloud resources with a keen focus on system performance and reliability. Prior to Netflix, he was at Rearden Commerce, HP, and numerous other companies, working to improve the performance of large-scale distributed systems.
Open Access Media
USENIX is committed to Open Access to the research presented at our events. Papers and proceedings are freely available to everyone once the event begins. Any video, audio, and/or slides that are posted after the event are also free and open to everyone. Support USENIX and our commitment to Open Access.
author = {Coburn Watson},
title = {Netflix {RaaS}: Reliability as a Service},
year = {2015},
address = {Santa Clara, CA},
publisher = {USENIX Association},
month = mar
}