general information

Date:
Friday, May 30, 2014
9:00 a.m.–5:00 p.m.

Thank you for your interest in SREcon14. Due to reaching maximum capacity, registration is now closed. Registered attendees can pick up their badges beginning at 7:30 a.m. on May 30 in the Mezzanine at the Hyatt Regency Santa Clara.

Program:
The SREcon14 Program is now online!

Venue:
Hyatt Regency Santa Clara
5101 Great America Pkwy
Santa Clara, CA 95054
(408) 200-1234

Program

Videos are now available for the SREcon14 Keynote Address and the Talks 1 track.

 

All sessions will take place at the Hyatt Regency Santa Clara.

7:30 a.m.–5:00 p.m.  

Registration and Badge Pickup

Mezzanine

8:30 a.m.–9:00 a.m.  

Continental Breakfast

Mezzanine

9:00 a.m.–10:00 a.m.  

Keynote Address

Keys to SRE

Ben Treynor

Ben Treynor joined Google as Site Reliability Tsar in 2003. He is the founder of Google's Site Reliability team and grew it organically from an original core of 7 "production" engineers to its current ranks of >1200 software engineers. SRE is responsible for everything from Google's internal software infrastructure, to user services like Search, Gmail, Ads, and to the burgeoning Cloud Platform. Additionally, Ben is responsible for Google's worldwide internal and external network (since 2004), its data centers and hardware operations (since 2009), and is part of the Google Cloud Platform management team (since late 2013).

Prior to Google, Ben held engineering management roles at Seven Networks, E.piphany, and Versant Object Technology, and started his career as a software engineer at Oracle in 1986. Ben holds a B.S. and M.S. in Computer Science from Stanford, and an M.B.A. from U.C. Berkeley's Haas School of Business.

10:00 a.m.–10:30 a.m.  

Break with Refreshments

Mezzanine

10:30 a.m.–11:25 a.m.  

Talks 1

Winchester and Stevens Creek

Design Review Best Practices

Mandi Walls, Chef

Design reviews are the foundation for a successful product or feature launch. In this session we will broach a few of the critical questions an SRE asks during the design review process to ensure the design and deployment will result in a sustainable system. We will cover real-world examples of the pitfalls of not engaging the operations/infrastructure team early in the process.

Mandi Walls is Technical Practice Manager at Chef. For Chef, she travels the world helping organizations increase their effectiveness using configuration management and modernizing IT practices. Prior to joining Chef, she ran large web properties for AOL, including AOL.com, Games.com and Moviefone. Mandi has a Master's degree in Computer Science from George Washington University and an M.B.A. from UNC Kenan-Flagler. She is a regular speaker at technical conferences and is the author of Building a DevOps Culture, published by O'Reilly.

Talks 2

San Tomas and Lawrence

Proactive Monitoring @Twitter

Joe Smith, Twitter

Most systems provide statistics for tracking usage, performance, and failures. However, some do not fit into pre-existing monitoring infrastructure, or provide no metrics at all. This talk discusses a proactive approach to monitoring services at Twitter. Using case studies from multiple services at Twitter, we illustrate the value of a holistic approach to monitoring from design review up to tuning alerts in production.

Joe is one of the founding members of Twitter’s Aurora/Mesos SRE team, and collaborated to build a highly available tiered fleet at Twitter. He has contributed to everything from the Puppet infrastructure that orchestrates Twitter’s Mesos cluster all the way up to automation of maintenance, deploys, and upgrades. Before that, he was part of Google's Internal Technology Residency Program in Mountain View, Boston, and Zurich. He holds a B.S. in Computer Science from Chapman University.

Panels

Lafayette

Releasing at Scale

Panelists: Chuck Rossi, Facebook; Helena Tian, Google; Jos Boumans, Krux Digital; Daniel Schauenberg, Etsy

A product’s ability to release quickly can be critical to its success in exceeding its users’ expectations. In the current environment, there is more and more pressure to speed up a team’s release cycle, and SRE and release teams are innovating to meet this demand. Hear from experts about some of the challenges they’ve had to overcome and the successes they’ve found in the process.

Chuck Rossi leads Release Engineering at Facebook and started as Facebook's first release engineer in 2008. He manages both the web frontend releases (facebook.com: updated twice a day, every day) and the iOS and Android products (the #1 mobile application on both platforms). Chuck has worked as a software engineer for over 20 years and has held software engineering and release positions at IBM Almaden Research, Silicon Graphics, VMware, and Google. Chuck holds a Bachelor of Science in Computer Science from Rochester Institute of Technology.

Daniel Schauenberg is a Senior Software Engineer on Etsy's infrastructure and development tools team. Automation, documentation and simplicity are his usual tools for improving the status quo. He previously worked in systems and network administration, on connecting chemical plants to IT systems, and as an embedded systems networking engineer. Things he thoroughly enjoys when not writing code include coffee, breakfast, TV shows, and basketball.

Open Space

Camino Real

SREcon14 will have an open space for discussion. Submit your topics today!

11:25 a.m.–11:30 a.m.  

Short Break

11:30 a.m.–12:30 p.m.  

Talks 1

Winchester and Stevens Creek

Grilled Cheese at Scale: A Recipe for Stability Practices

Connie-Lynne Villani, Groupon

Long-term site stability requires a fundamental shift in perspective: planning for the future vs reacting to catastrophe. Whether it's growing a food festival from 1,500 attendees to 11,000, or dramatically increasing the scale of your production operations, it can be hard to foster acceptance of a new approach while soothing fears about cultural change. In this talk, Connie-Lynne will discuss the techniques of consensus-building, team autonomy, and fast failure she learned while running the Grilled Cheese Invitational, and how she transferred those to her work in Site Reliability Engineering.

Connie-Lynne joined Groupon in 2012, and founded Groupon's Site Reliability Engineering team. Building on her successes with SRE, in January of 2014 she helped start the Engineering Standards and Culture team as the Sr. Manager for Global Engineering Education. With degrees in both Electrical Engineering and Theater Management, Connie-Lynne brings not only 20 years of System Engineering experience to the table, but also a keen understanding of how to connect technical and creative people to each other. Connie-Lynne has worked at Linden Lab, Yahoo, and Caltech, but admits that her most fun position is as a board member for the Grilled Cheese Invitational.

Talks 2

San Tomas and Lawrence

Cascading Failures

Mike Ulrich, Google

Along with success comes the risk that early design decisions can be traps just waiting to cause the worst-case outage, often referred to as a cascading failure. Mike will talk about his experience identifying root causes and the strategies developed over time to avoid common problems. His experience in Gmail has been shared across Google’s product and SRE teams and now the SRE community.

Mike Ulrich is a Staff Site Reliability Engineer at Google where he has worked since 2006. He began his career as a System Engineer but transitioned into Software Engineer while working on the Gmail SRE team. He earned a bachelor’s degree in Chemical Engineering from the University of Minnesota. He is currently the tech lead of the Gmail SRE team at Google but consults across the organization in the areas of design reviews, failover, and load balancing strategies.

Panels

Lafayette

Disaster Preparedness

Moderator: Bethanye Blount, Facebook

Panelists: Kripa Krishnan, Google; Richard Waid, LinkedIn; Mat Schaffer, Netflix

While a good SRE team is proactive during the design phase, many things can affect the reliability of the service. A critical aspect of planning is practicing what happens in the worst-case scenario and checking whether the product and team can respond and recover quickly. Many teams do continuous testing, spot-checking, or large-scale annual tests. This panel will discuss how to do this safely, how to justify it to the company, and the follow-up required to really leverage the advantages of these tests.

Richard Waid joined LinkedIn in 2012, moving to Mountain View from a diverse background in Software Engineering and Operations in New Zealand. As a Staff SRE at LinkedIn, Richard faces the daily paradox of SRE everywhere: figuring out what went wrong while making sure that it never happens again. His responsibilities at LinkedIn include leading the SRE team responsible for the LinkedIn profile and engagement.

Open Space

Camino Real

12:30 p.m.–1:30 p.m.  

Luncheon, sponsored by Google

Terra Courtyard

1:30 p.m.–2:30 p.m.  

Talks 1

Winchester and Stevens Creek

Monitoring at “Systems”

Mark Smith, Dropbox

Monitoring is more than just looking at averages and up/down states. This session will focus on how and what to monitor, and will compare and contrast whitebox and blackbox monitoring. This talk will also touch on best practices Mark has developed over time for setting up monitoring when launching a new service.

Talks 2

San Tomas and Lawrence

Mobile Reliability

David Barshow, Mailbox

Mobile imposes a different set of challenges: platforms, different hardware and OS types, different bandwidth requirements, and more. Mailbox launched as an iOS-only client for Gmail in 2013. In the course of launching, we learned how different maintaining and launching mobile backends is from the traditional "refresh" browser-based model. We will also cover how this experience differs from maintaining a traditional website.

David Barshow is the operations tech lead for Mailbox. He works on building and maintaining the infrastructure to support the mobile Mailbox products.

Panels

Lafayette

Load Shedding

Moderator: Akhil Gupta, Dropbox

Panelists: Manjot Pahwa, Google; Jos Boumans, Krux Digital; Nick Berry, LinkedIn; Bruce Wong, Netflix

When your service starts to get overloaded, one strategy for graceful degradation is to offload some of the traffic/load so that you don’t experience a complete meltdown. This panel will discuss the various strategies for handling this problem.

Akhil Gupta is responsible for the infrastructure at Dropbox. Prior to joining Dropbox in 2012, he was a principal engineer at Google responsible for the search ads serving infrastructure. He earned a Bachelor's Degree in Computer Science from the Indian Institute of Technology, Kanpur. He received his Master's Degree in Computer Science from the University of Maryland, College Park.

Nick Berry joined SRE at LinkedIn in 2011 and is responsible for Infrastructure: internal/external traffic routing and load balancing, presentation, security, and monitoring/alerting/automation. Nick's background before LinkedIn is colorful, ranging from Yahoo’s global home pages to data centers (Wavve, RagingWire) to dialup ISPs (JPSNet, OneMain, Internet49). Nick dropped out of American River College while pursuing a degree in Mathematics and is a Cho Dan in the Tang Soo Do Martial Way Association.

Open Space

Camino Real

2:30 p.m.–2:35 p.m.  

Short Break

2:35 p.m.–3:30 p.m.  

Talks 1

Winchester and Stevens Creek

Finding and Resolving Problems at Scale

Ben Maurer, Facebook

Large-scale systems fail in surprising ways. To stay on top of the game, detection and data visualization systems have to evolve together with the underlying systems, and you need to learn lessons that lead to long-term solutions that prevent the same failure from happening over and over again in different parts of your infrastructure. This talk is about breakage, debugging, and building long-term scalable solutions from those experiences.

Talks 2

San Tomas and Lawrence

SRE in the Cloud

Rich Adams, Gracenote

The future of the cloud changes the role of the SRE. In a large company where you are deploying your service on infrastructure built and managed in-house, the SRE has the home-field advantage of understanding the intricacies of that infrastructure. With more and more startups launching on cloud infrastructure maintained by the vendor, the local SRE’s role offers different challenges. Rich has launched several businesses on AWS, and he will talk about his journey toward incorporating reliability into products and ensuring the development team had access to the information needed to improve their services. He will share the highlights of what he’s learned about Amazon Web Services and what it took to make it work for his companies.

Panels

Lafayette

SRE Culture Fundamentals

Moderator: Tom Cook, Dropbox

Panelists: Grier Johnson, Square; Kevin Park, Dropbox; Avleen Vig, Etsy; Pedro Canahuati, Facebook

SRE, Production Engineer, DevOps: regardless of what you call the role at your company, it's the engineering that truly defines the work of the engineer focused on the reliability of their product. But without a culture that defines the role appropriately in the company, you could just be another Operations team. This panel will discuss how to make sure SRE is part of engineering and the team is focused on the right work.

Open Space

Camino Real

3:30 p.m.–4:00 p.m.  

Break with Refreshments, Sponsored by Twitter

Mezzanine

4:00 p.m.–5:00 p.m.  

Closing Talk

Santa Clara Ballroom

How Silicon Valley’s SREs Saved Healthcare.gov

Michael “Mikey” Dickerson, U.S. Citizen

Healthcare.gov was on the brink of failure when a team of SREs from Silicon Valley joined the effort and helped the website undo some decisions that caused it to buckle under the scale of the American people. This talk will discuss how the team was recruited to volunteer their skills in order to ensure Healthcare.gov accomplished its goal of allowing millions of Americans to sign up for healthcare. Mikey will discuss how he and the team approached the stability problems and the four major engineering challenges they had to overcome.

5:00 p.m.–6:30 p.m.  

Happy Hour, Sponsored by Facebook

Mezzanine
