Program

Videos are now available for the SREcon14 Keynote Address and the Talks 1 track.


All sessions will take place at the Hyatt Regency Santa Clara.
7:30 a.m.–5:00 p.m.
Registration and Badge Pickup Mezzanine
8:30 a.m.–9:00 a.m.
Continental Breakfast Mezzanine
9:00 a.m.–10:00 a.m.
Keynote Address
Keys to SRE
Ben Treynor
Ben Treynor joined Google as Site Reliability Tsar in 2003. He is the founder of Google's Site Reliability team and grew it organically from an original core of 7 "production" engineers to its current ranks of more than 1,200 software engineers. SRE is responsible for everything from Google's internal software infrastructure to user services like Search, Gmail, and Ads to the burgeoning Cloud Platform. Additionally, Ben is responsible for Google's worldwide internal and external network (since 2004) and its data centers and hardware operations (since 2009), and is part of the Google Cloud Platform management team (since late 2013).
Prior to Google, Ben held engineering management roles at Seven Networks, E.piphany, and Versant Object Technology, and started his career as a software engineer at Oracle in 1986. Ben holds a B.S. and M.S. in Computer Science from Stanford, and an M.B.A. from U.C. Berkeley's Haas School of Business.
10:00 a.m.–10:30 a.m.
Break with Refreshments Mezzanine

10:30 a.m.–11:25 a.m.
Talks 1 Winchester and Stevens Creek
Design Review Best Practices
Mandi Walls, Chef
Design reviews are the foundation for a successful product or feature launch. In this session we will broach a few of the critical questions an SRE asks during the design review process to ensure the design and deployment will result in a sustainable system. We will cover real-world examples of the pitfalls of not engaging the operations/infrastructure team early in the process.
Mandi Walls is Technical Practice Manager at Chef. For Chef, she travels the world helping organizations increase their effectiveness using configuration management and modernizing IT practices. Prior to joining Chef, she ran large web properties for AOL, including AOL.com, Games.com and Moviefone. Mandi has a Master's degree in Computer Science from George Washington University and an M.B.A. from UNC Kenan-Flagler. She is a regular speaker at technical conferences and is the author of Building a DevOps Culture, published by O'Reilly.

Talks 2 San Tomas and Lawrence
Proactive Monitoring @Twitter
Joe Smith, Twitter
Most systems provide statistics for tracking usage, performance, and failures. However, some do not fit into pre-existing monitoring infrastructure, or provide no metrics at all. This talk discusses a proactive approach to monitoring services at Twitter. Using case studies from multiple services at Twitter, we illustrate the value of a holistic approach to monitoring from design review up to tuning alerts in production.
Joe is one of the founding members of Twitter’s Aurora/Mesos SRE team, and collaborated to build a highly available tiered fleet at Twitter. He has contributed to everything from the Puppet infrastructure that orchestrates Twitter’s Mesos cluster all the way up to automation of maintenance, deploys, and upgrades. Before that, he was part of Google's Internal Technology Residency Program in Mountain View, Boston, and Zurich. He holds a B.S. in Computer Science from Chapman University.

Panels Lafayette
Releasing at Scale
Panelists: Chuck Rossi, Facebook; Helena Tian, Google; Jos Boumans, Krux Digital; Daniel Schauenberg, Etsy
A product’s ability to release quickly can be critical to its success in exceeding the expectations of its users. In the current environment there is more and more pressure to speed up a team’s release cycle, and SRE and release teams are innovating to meet this demand. Hear from experts about some of the challenges they’ve had to overcome and the successes they’ve found in the process.
Chuck Rossi leads Release Engineering at Facebook and started as Facebook's first release engineer in 2008. He manages both the web frontend releases (facebook.com: updated twice a day, every day) and the iOS and Android products (the #1 mobile application on both platforms). Chuck has worked as a software engineer for over 20 years and has held software engineering and release positions at IBM Almaden Research, Silicon Graphics, VMware, and Google. Chuck holds a Bachelor of Science in Computer Science from Rochester Institute of Technology.
Daniel Schauenberg is a Senior Software Engineer on Etsy's infrastructure and development tools team. Automation, documentation and simplicity are his usual tools for improving the status quo. He previously worked in systems and network administration, on connecting chemical plants to IT systems, and as an embedded systems networking engineer. Things he thoroughly enjoys when not writing code include coffee, breakfast, TV shows, and basketball.

Open Space Camino Real
SREcon14 will have an open space for discussion. Submit your topics today!

11:25 a.m.–11:30 a.m.
Short Break

11:30 a.m.–12:30 p.m.
Talks 1 Winchester and Stevens Creek
Grilled Cheese at Scale: A Recipe for Stability Practices
Connie-Lynne Villani, Groupon
Long-term site stability requires a fundamental shift in perspective: planning for the future vs. reacting to catastrophe. Whether it's growing a food festival from 1,500 attendees to 11,000, or dramatically increasing the scale of your production operations, it can be hard to foster acceptance of a new approach while soothing fears about cultural change. In this talk, Connie-Lynne will discuss the techniques of consensus-building, team autonomy, and fast failure she learned while running the Grilled Cheese Invitational, and how she transferred those to her work in Site Reliability Engineering.
Connie-Lynne joined Groupon in 2012, and founded Groupon's Site Reliability Engineering team. Building on her successes with SRE, in January of 2014 she helped start the Engineering Standards and Culture team as the Sr. Manager for Global Engineering Education. With degrees in both Electrical Engineering and Theater Management, Connie-Lynne brings not only 20 years of System Engineering experience to the table, but also a keen understanding of how to connect technical and creative people to each other. Connie-Lynne has worked at Linden Lab, Yahoo, and Caltech, but admits that her most fun position is as a board member for the Grilled Cheese Invitational.

Talks 2 San Tomas and Lawrence
Cascading Failures
Mike Ulrich, Google
Along with success comes the risk that early design decisions can be traps just waiting to cause you the worst-case outage, often referred to as a cascading failure. Mike will talk about his experience with identifying the root causes and strategies developed over time to avoid common problems. His experience in Gmail has been shared across Google’s product and SRE teams and now the SRE community.
Mike Ulrich is a Staff Site Reliability Engineer at Google where he has worked since 2006. He began his career as a System Engineer but transitioned into Software Engineer while working on the Gmail SRE team. He earned a bachelor’s degree in Chemical Engineering from the University of Minnesota. He is currently the tech lead of the Gmail SRE team at Google but consults across the organization in the areas of design reviews, failover, and load balancing strategies.

Panels Lafayette
Disaster Preparedness
Moderator: Bethanye Blount, Facebook
Panelists: Kripa Krishnan, Google; Richard Waid, LinkedIn; Mat Schaffer, Netflix
While a good SRE team is proactive during the design phase, many things can affect the reliability of the service. A critical aspect of planning is practicing what happens in the worst-case scenario and whether the product and team can respond and recover quickly. Many teams do continuous testing, spot-checking, or large-scale annual tests. This panel will discuss how to do this safely, how to justify it to the company, and the follow-up that is required to really leverage the advantages of these tests.
Richard Waid joined LinkedIn in 2012, moving to Mountain View from a diverse background in Software Engineering and Operations in New Zealand. As a Staff SRE at LinkedIn, Richard faces the daily paradox of SRE everywhere: figuring out what went wrong while making sure that it never happens again. His responsibilities at LinkedIn include leading the SRE team responsible for the LinkedIn profile and engagement.

Open Space Camino Real
SREcon14 will have an open space for discussion. Submit your topics today!

12:30 p.m.–1:30 p.m.

Luncheon, sponsored by Google

Terra Courtyard

1:30 p.m.–2:30 p.m.
Talks 1 Winchester and Stevens Creek
Monitoring at “Systems”
Mark Smith, Dropbox
Monitoring is more than just looking at averages and up/down states. This session will focus on how and what to monitor, and will compare and contrast whitebox and blackbox monitoring. This talk will also touch on best practices Mark has developed over time for setting up monitoring when launching a new service.

Talks 2 San Tomas and Lawrence
Mobile Reliability
David Barshow, Mailbox
Mobile imposes a different set of challenges: platforms, different hardware and OS types, different bandwidth requirements, and more. Mailbox launched as an iOS-only client for Gmail in 2013. During the course of launching we learned how different maintaining and launching mobile backends is from the traditional "refresh" browser-based model. We will also cover how different this experience is from maintaining a traditional website.
David Barshow is the operations tech lead for Mailbox. He works on building and maintaining the infrastructure to support the mobile Mailbox products.

Panels Lafayette
Load Shedding
Moderator: Akhil Gupta, Dropbox
Panelists: Manjot Pahwa, Google; Jos Boumans, Krux Digital; Nick Berry, LinkedIn; Bruce Wong, Netflix
When your service starts to get overloaded, one strategy for graceful degradation is to offload some of the traffic or load so that you don’t experience a complete meltdown. This panel will discuss the various strategies for handling this problem.
Akhil Gupta is responsible for the infrastructure at Dropbox. Prior to joining Dropbox in 2012, he was a principal engineer at Google responsible for the search ads serving infrastructure. He earned a Bachelor's Degree in Computer Science from the Indian Institute of Technology, Kanpur. He received his Master's Degree in Computer Science from University of Maryland, College Park.
Nick Berry joined SRE at LinkedIn in 2011 and is responsible for Infrastructure: internal/external traffic routing and load balancing, presentation, security, and monitoring/alerting/automation. Before LinkedIn, Nick had a colorful background, from Yahoo’s global home pages to data centers (Wavve, RagingWire) to dialup ISPs (JPSNet, OneMain, Internet49). Nick dropped out of American River College while pursuing a degree in Mathematics and is a Cho Dan in the Tang Soo Do Martial Way Association.

Open Space Camino Real
SREcon14 will have an open space for discussion. Submit your topics today!

2:30 p.m.–2:35 p.m.
Short Break

2:35 p.m.–3:30 p.m.			Friday
Talks 1 Winchester and Stevens Creek
Finding and Resolving Problems at Scale
Ben Maurer, Facebook
Large-scale systems fail in surprising ways. To stay on top of the game, the detection and data visualization systems have to evolve together with the underlying systems, and you need to learn lessons that lead to long-term solutions which prevent the same failure from happening over and over again in different parts of your infrastructure. This talk is about breakage, debugging, and building long-term scalable solutions from those experiences.

Talks 2 San Tomas and Lawrence
SRE in the Cloud
Rich Adams, Gracenote
The future of the cloud changes the role of the SRE. In a large company where you are deploying your service on infrastructure built and managed in-house, the SRE has the home field advantage of understanding the intricacies of that infrastructure. With more and more startups launching in clouds that are maintained by the vendor, the local SRE's role offers different challenges. Rich has launched several businesses on AWS and he will talk about his journey towards incorporating reliability into products and ensuring the development team had access to the information needed to improve their services. He will share the highlights of what he’s learned about Amazon Web Services and what it took for him to make it work for his companies.

Panels Lafayette
SRE Culture Fundamentals
Moderator: Tom Cook, Dropbox
Panelists: Grier Johnson, Square; Kevin Park, Dropbox; Avleen Vig, Etsy; Pedro Canahuati, Facebook
SRE, Production Engineer, DevOps: regardless of what you call the role at your company, it's the engineering that truly defines the work of the engineer who is focused on the reliability of their product. Without a culture that defines the role appropriately in the company, you could just be another Operations team. This panel will discuss how to make sure SRE is part of engineering and the team is focused on the right work.

Open Space Camino Real
SREcon14 will have an open space for discussion. Submit your topics today!

3:30 p.m.–4:00 p.m.

Break with Refreshments, Sponsored by Twitter

Mezzanine

4:00 p.m.–5:00 p.m.
Closing Talk Santa Clara Ballroom
How Silicon Valley’s SREs Saved Healthcare.gov
Michael “Mikey” Dickerson, U.S. Citizen
Healthcare.gov was on the brink of failure, and a team of SREs from Silicon Valley joined the effort and helped the website undo some decisions that caused it to buckle under the scale of the American people. This talk will discuss how the team was recruited to volunteer their skills in order to ensure Healthcare.gov accomplished its goal of allowing millions of Americans to sign up for healthcare. Mikey will discuss how he and the team approached the stability problems and the four major engineering challenges they had to overcome.
5:00 p.m.–6:30 p.m.
Happy Hour, Sponsored by Facebook Mezzanine
