
general information

Early Bird Registration Deadline: March 16, 2016

SREcon16 is SOLD OUT.
No walkup registrations will be accepted.

Venue:
Hyatt Regency Santa Clara
5101 Great America Pkwy
Santa Clara, CA 95054

Rooms at the Hyatt Regency Santa Clara are sold out.

Rooms available at:
Biltmore Hotel & Suites
2151 Laurelwood Road
Santa Clara, CA 95054

Rooms are $225 single or double, plus tax. Book your room online, or call (800) 255-9925 or (408) 988-8411 and reference the USENIX Association or Billing ID #32992. The room rate includes WiFi and a complimentary shuttle to the Hyatt Regency Santa Clara.

Conference Program

All sessions will take place at the Hyatt Regency Santa Clara.

Attendee Files

Registered attendees: Sign in to your USENIX account to download the SREcon16 Attendee List (PDF).

 

Thursday, April 7, 2016

7:30 am–5:00 pm  

Badge Pickup

Santa Clara Ballroom Foyer

8:00 am–9:00 am  

Continental Breakfast

Mezzanine East/West

9:00 am–10:00 am  

Keynote Address

Santa Clara Ballroom

The Realities of the Job of Delivering Reliability

Rachel Kroll, Facebook

Rachel has been doing the whole sysadmin thing for a while and has a lot of stories to tell. She works at Facebook as a production engineer on the Web Foundation team, looking for things that might break and occasionally fighting fires when something goes wrong. This is the "swat team" from last year's sessions. Rachel found the "kill -9 -1" in that story.

When not wrangling production systems at work she plays with software defined radio and writes about life in Silicon Valley and the industry as "rachelbythebay". Two collections of these stories have been self-published electronically.

10:00 am–10:30 am  

Break with Refreshments, Sponsored by PagerDuty

Mezzanine East/West

10:30 am–11:25 am  

Track 1

Winchester/Stevens Creek Rooms

Beyond Repair: Proactive Maintenance Work at Scale

Romain Komorn, Facebook

FBAR has enabled Facebook's production engineering teams to automate break/fix responses to many of the single-host events, such as hardware failures or application daemon crashes, that occur on a daily basis, leaving engineers to focus on larger, more complex, and more interesting problems.

To provide more opportunities and time for engineers to take on more meaningful work, the team developed automated responses for the less frequent but larger-scale proactive work necessary to maintain a healthy infrastructure. This work takes different shapes, including top-of-rack switch replacements, disruptive BIOS/firmware updates, and power supply work (backup or primary).

This talk will use a (fictitious) example of a set of racks undergoing maintenance to give an overview of how the automation provided by FBAR to handle single-host repairs was expanded to cover proactive maintenance work, including an explanation of the automated maintenance process, the API engineers implement to automate the work, and the way it interfaces with humans when automation won't (or fails to) work. It will also include a few small lessons we learned along the way, and explain why a simple approach covers much of the use cases with few drawbacks.
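
To make the idea concrete, here is a hedged sketch of what such a maintenance API might look like; FBAR's actual interface is not public, so the names and shapes below are invented for illustration:

```python
# Invented sketch of a service-owner maintenance API; FBAR's real
# interface is not public, so all names here are assumptions.
class MaintenanceHandler:
    """Callbacks a service implements so automation can safely take a
    set of hosts (e.g., a rack) out of production and return them."""

    def can_start(self, hosts: list[str]) -> bool:
        """Return False to defer, e.g., if too few healthy replicas remain."""
        return True

    def prepare(self, hosts: list[str]) -> None:
        """Drain traffic / move data off the hosts before work begins."""

    def restore(self, hosts: list[str]) -> None:
        """Re-enable the hosts once the maintenance work has completed."""

def run_maintenance(handler: MaintenanceHandler, hosts: list[str], work) -> bool:
    if not handler.can_start(hosts):
        return False          # automation retries later, or escalates to a human
    handler.prepare(hosts)
    work(hosts)               # e.g., switch replacement, firmware update
    handler.restore(hosts)
    return True
```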

Romain is a manager in Facebook's Production Engineering organization, working on the team that maintains Facebook's Auto Remediation tool (FBAR). He originally came to Facebook as part of the Site Reliability Operations team and has spent the last five years focused on safely automating operational tasks. Most recently, this has taken the shape of creating new tooling that allows the team to automate maintenance affecting multiple machines (and multiple racks) at a time. The team has spent the last two years refining the process and keeping the automation simple while covering the majority of use cases.


nrrd 911 ic me: The Incident Commander Role

Alice Goldfuss, New Relic

Shit hit the fan—now what?

You know to build resilient systems and make small, planned changes, but computers (and humans) still fail. How do you deal with such failures? How do you recover?

Enter the Incident Commander. Adapted from the government and military’s incident response process, the Incident Commander handles the technical triage and orchestration necessary to get a swift resolution during crisis. The IC process focuses on clear communication, delegation, and trust between teams working in harmony.

New Relic has used the IC process for over two years, iterating on and refining the process as we go. We train all our engineers to be ICs and have used this process to handle everything from small deployment hiccups to network outages. We've built tools to support and archive our incident responses and have seen significant improvement in our understanding of and response to such situations.


This talk will discuss the IC role, why you want it, how we iterated over it, lessons learned in the field, and the tools we built to support it.

Alice Goldfuss is a Site Reliability Engineer at New Relic, where she spends her days wading through containers, comforting servers, and performing dark sacrifices to the network tier. She’s been a technical reviewer on Docker: Up & Running and Effective DevOps and bemused audiences on both sides of the pond. You can find her on Twitter (@alicegoldfuss), drinking tea, ranting about feminism, and trying to kidnap every cat she meets.


Track 2

Lawrence/San Tomas/Lafayette Rooms

How to Improve a Service by Roasting It

Caskey L. Dickson and Jake Welch, Microsoft

At Microsoft, SRE is not part of the current operational landscape; instead, it is an ongoing project being adapted into a VERY mature company. Our team has had to develop new and interesting ways to introduce SRE and its tenets to a traditional IT-Ops-based organization. This process has proven to be quite complex and socially delicate. You can't go into a team and just tell them they are doing things wrong, even if they clearly are (as evidenced by their crushing operational load). You need to find the right way to show a developer all the warts on their baby and motivate them to work with you on addressing them. Furthermore, you have to deal with their earnest desire to treat you as "just another ops team" that is only there to take the pager from them.

All that said, one of the tools we've experimented with to start this kind of conversation is to hold what we call a Service Roast for some of our SRE engagements. Named after the famous Friars Club roasts, the goal is to (in as safe a manner as possible) dig into and expose those warts, wrinkles, design flaws, shortcomings, and problems everyone knows a service has but doesn't want to talk about. (We can't help you if you won't tell us where it hurts.)

To run these, we've developed a process, ground rules, a new role of impartial referee, and a useful structure for hosting this kind of meeting. Thus far we've gained great insight into some of our services and, more importantly, created some very interesting (and lively) conversations.

To be sure, this is a high-risk activity, and shouldn't be done without careful consideration of the teams participating, but we'll present what we've learned about holding these roasts, guidance teams need for successful participation, and (importantly) why we don't use this approach everywhere.

Caskey L. Dickson is a Site Reliability Engineer at Microsoft, where he is part of the leadership team reinventing operations at Azure. Before that he was at Google, where he worked as an SRE/SWE writing and maintaining monitoring services that operate at "Google scale," as well as business intelligence pipelines. He has worked in online services since 1995, when he turned up his first web server, and has been online ever since. Before working at Google, he was a senior developer at Symantec, wrote software for various Internet startups such as CitySearch and CarsDirect, ran a consulting company, and even taught undergraduate and graduate computer science at Loyola Marymount University. He has a B.S. in Computer Science, a master's in Systems Engineering, and an M.B.A. from Loyola Marymount.

Jake Welch is a Site Reliability Engineer/Software Engineer on the Microsoft Azure team in NYC. He has worked on large scale services at Microsoft for eight years, primarily in Azure infrastructure and Storage in software engineering/operational/managerial roles and on the major disaster on-call team. In 2014, he started the first SRE pilot in Azure and continues to drive forward Microsoft SRE culture. Prior to Microsoft, Jake worked as a developer building websites and automating backend business workflows across OSX and Windows.


College Student to SRE: Onboarding Your Entry Level Talent

Michael Kehoe and Nina Mushiana, LinkedIn

Just over two years ago, I finished college, moved countries, and started as an SRE at LinkedIn. Over those two years I've gone from a junior engineer to a senior engineer; however, I wouldn't have been able to do it without a great onboarding experience and the mentors who have guided me.

This session will take the perspectives of myself as a new college graduate and Nina Mushiana (SRE manager) on how to onboard new SREs and help them reach their full potential. We will dive deep into onboarding, mentoring, and training topics and reflect on lessons learnt over my two-year experience. Finally, I'll present advice for entry-level talent (ELTs) on how to become an important and successful part of their organization.


Michael Kehoe, Senior Site Reliability Engineer in the Production-SRE team, joined the LinkedIn operations team as a New College Graduate in January 2014. Prior to that, Michael studied Engineering at the University of Queensland (Australia) where he majored in Electrical Engineering. During his time studying, he interned at NASA Ames Research Center working on the PhoneSat project.

Nina Mushiana has been an SRE Manager for the Production-SRE and NOC teams at LinkedIn since May 2013. Prior to that, Nina was project managing Business Continuity Planning for Yahoo and was assistant manager for the Yahoo NOC. Nina has 14 years of operations experience.


Track 3

Cypress Room

Operations at (Small) Scale

Elliott Sims, Backblaze

Backblaze is not a traditional unicorn or unicorn-aspirant. We take a bootstrapped and simple low-cost approach to operations that differs from the current cloud-based trends. This has some key advantages in terms of not requiring heavy VC backing and all that entails, but also presents a different set of challenges. I’ll be describing some of those challenges from the point of view of someone who recently moved from large scale and is now adapting to a smaller scale.

Elliott Sims is a Senior Sysadmin at Backblaze. Previously, he worked at Facebook from 2009-2015 building and maintaining infrastructure through exponential growth.


Don't Burn Out the Night

Dave Dash

Note: Due to technical difficulties, there is no audio or video of this talk.

It's easy for engineers to overcommit to things, especially with on call. Let's build an on-call service that keeps people fresh and in it for the long haul.

In this session, Dave Dash will share lessons learned from Mozilla, Pinterest, and operations consulting on implementing a humane on-call service.

Dave Dash is a software engineer who loves developer tools. He currently builds tools for Summit Partners and assists with due diligence.

Prior to Summit, Dave did operations consulting for a number of startups. He was a founding member of the Pinterest Operations Team. He worked on the web team at Mozilla and worked on del.icio.us at Yahoo.

In his spare time he likes to read comic books, hike, and spend time with his wife and two kids.

11:25 am–11:30 am  

Short Break

11:30 am–12:25 pm  

Track 1

Winchester/Stevens Creek Rooms

Continuous Deployment to Millions of Users 40 Times a Day

Michael Gorven, Facebook

Instagram deploys code 40 times a day to its fleet of thousands of webservers and userbase of 400M monthly active users automatically when engineers land changes. This talk describes the iterative approach we took to building this system, the problems we faced along the way, the solutions we implemented, and the key principles which enable this to work. These principles include a reliable and fast test suite automatically run during code review and before and after landing; an automated canary to detect significant breakages before being widely deployed; good visibility of and stop mechanisms for the automation; good site monitoring and alarming; and a fast rollback mechanism.
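
As a rough sketch of the "automated canary" principle (this is not Instagram's actual tooling; the thresholds and names are invented), a deploy pipeline might gate the full rollout on the canary slice's error rate staying close to the baseline fleet's:

```python
# Hypothetical canary gate, loosely illustrating the "automated canary"
# principle above; thresholds are invented for the sketch.
def canary_ok(canary_error_rate: float,
              baseline_error_rate: float,
              max_ratio: float = 1.5,
              noise_floor: float = 0.001) -> bool:
    """Pass if the canary's error rate is not significantly worse than
    the baseline fleet's (ignoring noise at very small rates)."""
    if canary_error_rate <= noise_floor:
        return True
    return canary_error_rate <= baseline_error_rate * max_ratio

# Example: baseline at 0.2% errors; a canary at 0.5% aborts the rollout.
assert canary_ok(0.002, 0.002)
assert not canary_ok(0.005, 0.002)
```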


Track 2

Lawrence/San Tomas/Lafayette Rooms

Service Levels and Error Budgets

Chris Jones and Niall Murphy, Google

100% is almost never the right reliability target for a service, and service level agreements (SLAs) aren't the right tool for SREs to manage a service. These two (apparent) heresies are fundamental to how Google SRE thinks about running large-scale distributed computing services: we set service level objectives (SLOs) expressing how reliable a service needs to be and manage our service to maximize product development and feature velocity within the agreed "error budget."

We'll discuss the differences between indicators, objectives, and agreements; error budgets in practice; and how this brings product managers, product developers, and SREs together in a spirit of peaceful coexistence and cooperation.

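For readers new to the concept, a quick back-of-the-envelope illustration (not from the talk): an availability SLO directly implies how much unavailability per month the error budget allows.

```python
# Simple arithmetic illustrating error budgets; not from the talk.
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes of allowed unavailability per period for a given SLO."""
    return (1.0 - slo) * days * 24 * 60

for slo in (0.999, 0.9995, 0.9999):
    print(f"{slo:.2%} SLO -> {error_budget_minutes(slo):.1f} minutes/month")
# 99.90% SLO -> 43.2 minutes/month
# 99.95% SLO -> 21.6 minutes/month
# 99.99% SLO -> 4.3 minutes/month
```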

Chris Jones is a Site Reliability Engineer for Google App Engine, a cloud platform-as-a-service product serving over 28 billion requests per day. Based in San Francisco, he has previously been responsible for the care and feeding of Google's advertising statistics, data warehousing, and customer support systems. In other lives, Chris has worked in academic IT, analyzed data for political campaigns, and engaged in some light BSD kernel hacking, picking up degrees in Computer Engineering, Economics, and Technology Policy along the way. He's also a licensed professional engineer.

Niall Murphy leads the Ads Site Reliability Engineering team at Google Ireland. He has been involved in the Internet industry for about 20 years, and is currently chairperson of INEX, Ireland's peering hub. He is the author or co-author of a number of technical papers and books, including "Site Reliability Engineering" for O'Reilly, and a number of RFCs. He is currently co-writing a history of the Internet in Ireland, and he is the holder of degrees in Computer Science, Mathematics, and Poetry Studies, which is surely some kind of mistake. He lives in Dublin with his wife and two sons.


Stepping Up to Scale

Matt Davis, OpenX

We all know the scenario: what was once an experiment soon becomes a desired product. Suddenly it's time to scale the system, and all the time we hear about running "at scale," but what does that really mean? This talk is about the challenges OpenX faces when scaling a small proof-of-concept to a complex global product. It focuses on our strategies for planning, adapting, and growing distributed systems into fully productionized services.

Matt Davis is a Sr. Site Reliability Engineer in Platform Engineering at OpenX. As an expert on running some of the largest Riak clusters on the planet, Matt has given talks about operating distributed platforms everywhere from small hackerspace meetups to RICON and SCaLE. His work at OpenX designing, instrumenting, and scaling its Riak KV clusters gives him a unique perspective on operational growth with distributed systems.


Track 3

Cypress Room

Operational Buddhism: Building Reliable Services from Unreliable Components

Ernie Souhrada, Pinterest

The rise of utility computing has revolutionized much about the way organizations think about infrastructure and back-end serving systems compared to the "olden days" of dedicated physical data centers. However, in the final analysis, success is still driven by meeting your SLAs. If services are up and sufficiently performant, you win. If not, you lose.

In the traditional data center environment, fighting the uptime battle was typically driven by a philosophy I call "Operational Materialism." The primary goal of OM is preventing failures at the infrastructure layer, and the mechanisms for making this happen are plentiful and well understood; many boil down to simply spending enough money to have at least N+1 of anything that might fail and create significant downtime. Redundant power supplies, NIC bonding, replicated SANs, and hot-standby servers are some of the common artifacts of an OM world.

In the cloud, however, Operational Materialism cannot succeed. Although the typical cloud provider tends to be holistically reliable, there are no guarantees that any individual virtual instance will not randomly or intermittently drop off the network or be terminated outright. Yet we still need to keep our services up and running and meet our SLAs, and thus we need a different mindset that accounts for the fundamentally opaque and ephemeral nature of the public cloud.

In this talk, I will present an alternative to OM, a worldview that I refer to as "Operational Buddhism." Like traditional Buddhism, OB has Four Noble Truths:

    1. Cloud-based servers can fail at any time for any reason.
    2. Trying to prevent this server failure is an endless source of suffering for DBAs and SREs alike.
    3. Accepting the impermanence of individual servers, we can focus on designing systems that are failure-resilient, rather than failure-resistant.
    4. We can escape the cycle of suffering and create a better experience for our customers, users, and colleagues.

To illustrate these concepts with concrete examples, I will discuss how configuration management, automation, and service discovery help us to practice Operational Buddhism at Pinterest for both stateful (MySQL, HBase) and stateless (web) services. Moreover, as our path is not the only road to infrastructure enlightenment, I'll also talk about some of the roads not taken, including the debate over Infrastructure-as-a-Service (IaaS) vs. Platform-as-a-Service (PaaS).
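
As a loose illustration of failure-resilient (rather than failure-resistant) client logic under these Truths, a caller might treat every host as disposable, re-resolving replicas from service discovery on each attempt; the registry and transport below are stand-ins, not Pinterest's actual stack:

```python
import random

# Hypothetical failure-resilient call path: no host is special, so on
# failure we simply re-resolve from service discovery and retry.
REGISTRY = {"web": ["10.0.0.1", "10.0.0.2", "10.0.0.3"]}  # stand-in registry

def discover(service: str) -> list[str]:
    """Stand-in for a real service-discovery lookup."""
    return REGISTRY.get(service, [])

def send(host: str, request: str) -> str:
    """Stand-in transport; a real client would issue an RPC here."""
    return f"{host} handled {request}"

def call_with_retries(service: str, request: str, attempts: int = 3) -> str:
    last_error = None
    for _ in range(attempts):
        replicas = discover(service)       # fresh view on every attempt
        if not replicas:
            continue
        host = random.choice(replicas)
        try:
            return send(host, request)     # treat host failures as routine...
        except ConnectionError as e:
            last_error = e                 # ...and just try another replica
    raise RuntimeError(f"{service} unavailable") from last_error

print(call_with_retries("web", "GET /"))
```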

Ernie Souhrada is a database engineer on the SRE team at Pinterest where his current focus is on improving the performance and operational efficiency of a petabyte-scale hybrid deployment of MySQL, HBase, and Redis. Over the past two decades, Ernie has worked in almost every aspect of information technology, from network engineering and software development to systems administration and information security. When not slinging data or thinking about infrastructure, he can be found on a ski slope, at a sushi bar, or in search of the next great psytrance track. Ernie holds a B.S. in mathematics and a B.A. in political science from Arizona State University.


Finding the Order in Chaos

Sue Lueder, Google

In 2015, Google SRE's BreakFix team set out on an exciting journey of reading and cataloging thousands of postmortems in an effort to mine useful data.

This effort has come a long way since then. In this talk, we will cover how Google used a structured data approach in incident data analysis, what we were able to unlock in the process, and what tools made it all possible.

12:25 pm–1:30 pm  

Conference Luncheon, Sponsored by SignalFx

Terra Courtyard

1:30 pm–2:25 pm  

Track 1

Winchester/Stevens Creek Rooms

What's NetDevOps? How Do I Start?

Leslie Carr, SFMIX

The systems world has wholly accepted the DevOps way of thinking. The network world, however, has not yet started down this path, mainly due to a lack of knowledge and cultural resistance to change. This presentation shows the network world how to begin the DevOps transition in a non-threatening manner.

Leslie Carr is currently on the board of directors at SFMIX and happily running away from the responsibilities of day jobs. Before becoming a funemployed bum, Leslie was a devops engineer at Cumulus Networks. Previous to that, she was on the production side of the world at many large websites, such as Google, Craigslist, and Wikimedia. She is a lover and user of open source and automation, and she dreams of robots taking over all of our jobs one day.


Netflix: 190 Countries and 5 CORE SREs

Jonah Horowitz, Netflix

How does Netflix scale SRE? How do we manage over 70 million customers around the world without a 24/7 operations center? With tens of thousands of Linux instances in a distributed system architecture, and thousands of daily production changes, it's an environment that's both challenging and exciting. Netflix had to change how our teams run applications in production and adopt a true DevOps culture. We also learned how to give teams the tools they need to be successful.

In this talk you'll hear from one of Netflix's CORE SREs about the challenges we've learned from and the tools we use to keep everything running. Throughout the talk we'll discuss how Netflix views the role of the SRE, how it differs from the traditional systems administrator role, and why freedom and responsibility are key, trust is required, and chaos is your friend.


Jonah Horowitz is a Senior Site Reliability Architect at Netflix with over 20 years of experience keeping servers and sites online. He started with a home-built BBS and has worked at both large and small tech companies including Walmart.com, Looksmart, and Quantcast.


Track 2

Lawrence/San Tomas/Lafayette Rooms

From Ops to SRE on a Brazilian Startup

Matheus Rossato and Luiz Muller, ContaAzul

We will share how a small startup in Brazil went from a few hundred to thousands of customers on our platform with a small Ops team, and how that team evolved into a central SRE team plus SRE members distributed across other teams. We intend to point out how SRE diverges from Ops and why we weren't doing real DevOps.

Shopping Event Reliability

Jun Liu, Baidu, Inc.

In China, shopping events are one of the most important marketing strategies in O2O commerce, one of the fastest-developing segments of the Chinese e-commerce market. There are shopping events almost every month in China, and the two biggest may be Singles Day (China's Black Friday) and Chinese New Year. Keeping the website and mobile app highly available through these events is very challenging for the SRE team. The first challenge is that traffic is much higher than usual, and the second is that many different engineers and change events are involved in a shopping event. Latency and data consistency are also important for the tens of millions of customers online on a shopping day.

In this talk, we will share our systematic approach to solving these challenges of Shopping Event Reliability (SER), including the following (a short load-shedding sketch follows the list):

    1. Traffic Control: Overload Protection & Graceful Degradation
    2. Capacity Management: Automated Measurement & Elastic Container Infrastructure
    3. Monitoring and Recovery: Event-Flow Graph & Intelligent Callback
    4. Remote Disaster Recovery System: Latency and Data Consistency
    5. Process Management: Standardization Exercise
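
As a loose illustration of item 1, overload protection and graceful degradation (this is not Baidu's system; the priorities and thresholds are invented), a server can shed optional work first as load rises, so that critical flows keep working:

```python
# Hypothetical load-shedding sketch: degrade gracefully before failing.
# Thresholds and request priorities are invented for illustration.
CRITICAL, NORMAL, OPTIONAL = 0, 1, 2  # e.g., checkout, browse, recommendations

def admit(request_priority: int, current_load: float) -> bool:
    """Admit a request based on its priority and current load (0.0-1.0+).

    Under light load everything is served; as load rises, optional work
    (e.g., recommendations) is shed first, then normal browsing, so that
    critical flows like checkout keep working as long as possible.
    """
    if current_load < 0.7:
        return True                        # healthy: serve everything
    if current_load < 0.9:
        return request_priority <= NORMAL  # shed optional features
    return request_priority == CRITICAL    # overload: protect checkout only

# Example: at 95% load, only critical requests get through.
assert admit(CRITICAL, 0.95) and not admit(NORMAL, 0.95)
```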

Track 3

Cypress Room

Moving Large Workload from a Public Cloud to an OpenStack Private Cloud: Is It Really Worth It?

Nicolas Brousse, TubeMogul

It can be easy to come up with a TCO analysis that would challenge any public cloud and make you think, "Let's go in-house!" But what are the challenges, and is it really worth it? The TubeMogul Operations team went through the technical challenges of building a private cloud. In this presentation you will learn how the team went from R&D to automated deployment of bare-metal servers, and finally migrated a large workload from a public cloud to its own private cloud infrastructure. We will detail how the team dealt with unexpected issues, and also how we chose the hardware, estimated capacity, stayed cost-effective, improved overall system performance, and gained better control and visibility.

This talk will cover the technical details of:

  • Evaluating OpenStack, and building and automating a CI environment for a mix of bare-metal and cloud servers
  • The network limitations of OpenStack, and how we creatively leverage VLANs to handle high packet-per-second rates
  • How to efficiently monitor your cloud infrastructure
  • How to quickly find your bottlenecks
  • What we missed and what should be considered before moving in-house
  • Lessons learned and a post-migration cost analysis

Nicolas Brousse is Senior Director of Operations Engineering at TubeMogul (NASDAQ: TUBE). The company's sixth employee and first operations hire, Nicolas has grown TubeMogul's infrastructure over the past seven years from several machines to over two thousand servers that handle billions of requests per day for clients like Allstate, Chrysler, Heineken, and Hotels.com.

Adept at adapting quickly to ongoing business needs and constraints, Nicolas leads a global team of site reliability engineers and database architects that monitor TubeMogul's infrastructure 24/7 and adhere to "DevOps" methodology. Nicolas is a frequent speaker at top U.S. technology conferences and regularly gives advice to other operations engineers. Prior to relocating to the U.S. to join TubeMogul, Nicolas worked in technology for over 15 years, managing heavy traffic and large user databases for companies like MultiMania, Lycos and Kewego. Nicolas lives in Richmond, CA, and is an avid fisherman and aspiring cowboy.


SREs + Software Engineers: Making It Work

Nina Schiff, Facebook

Just got sent a binary to deploy - by email? Been told ‘it works on my machine’? Or that monitoring isn’t a big deal because ‘I’ll notice if it goes down'?

As companies and infrastructure scale, the need for individuals with skills outside of those taught in typical CS programmes becomes more apparent. However, this doesn't replace the role of more traditional software engineers. As one of those engineers, having worked with SREs and done SRE work myself, I'll discuss the ways in which SREs can be integrated into a broader engineering org. From complete role separation, to Jacks and Jills of all trades, and finally to Facebook's more nuanced approach, I look at why this integration is important and the ways in which it can make everyone involved more effective.


Nina is currently working as a software engineer on Facebook’s deployment infrastructure. There she gets to use the words “container,” “scale,” and “failure domain” at least five or six times a day. Before that she worked on AWS EC2 in sunny Cape Town.

2:25 pm–2:30 pm  

Short Break

2:30 pm–3:25 pm  

Track 1

Winchester/Stevens Creek Rooms

Debugging Distributed Systems

Donny Nadolny, PagerDuty

Despite our best efforts, our systems fail. Sometimes it’s our fault - code that we wrote or bugs that we caused. But sometimes the fault is with systems that we rely on.

ZooKeeper is a very useful distributed system that is often used as a building block for other distributed systems, like Kafka and Spark. It is used by PagerDuty for many critical systems, and for five months it failed on us a lot.

We will walk through the process of finding and fixing one cause of many of these failures. You will learn how to use various tools to stress test the network, some intricate details of how ZooKeeper works, and possibly more than you wanted to know about TCP including an example of machines having a different view of the state of a TCP stream.


Donny Nadolny is a Scala developer at PagerDuty, working on improving the reliability of their backend systems. He spends a large amount of time investigating problems experienced with distributed systems like Cassandra and ZooKeeper.


Doorman: Global Distributed Client Side Rate Limiting

Jos Visser, Google

Doorman is a Google-developed system for global distributed client-side rate limiting; we are in the process of open sourcing it. With Doorman, an arbitrary number of globally distributed clients can coordinate their usage of a shared resource so that global usage does not exceed global capacity. (A rough client-side sketch follows the bullet list below.) This presentation:

  • Describes the fundamentals of the Doorman system
  • Explains the concepts of the RPC protocol between Doorman components
  • Shows code examples of Doorman configurations and clients
  • Shows graphs of how Doorman clients ask for and get capacity, and how this sums up globally
  • Explains how Doorman deals with spikes, clients going away, and servers going away
  • Explains Doorman's system reliability features
  • Points to the Doorman open source repository
  • Explains the Doorman simulation (in Python) which can be used to quickly verify Doorman's behaviour in a specific scenario
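
As a rough sketch of the client-side idea only (not the real Doorman protocol or API; names and shapes are invented), each client could receive a capacity lease from a central server and enforce its granted share locally with a token bucket:

```python
import time

# Hypothetical sketch of client-side rate limiting with leased capacity;
# this is not the real Doorman protocol, just the general shape of it.
class RateLimitedClient:
    def __init__(self, granted_qps: float):
        self.granted_qps = granted_qps   # share handed out by the server
        self.tokens = granted_qps
        self.last_refill = time.monotonic()

    def refresh_lease(self, new_qps: float) -> None:
        """Called when the central server grants a new capacity share."""
        self.granted_qps = new_qps

    def try_request(self) -> bool:
        """Token bucket enforcing the granted share locally."""
        now = time.monotonic()
        self.tokens = min(self.granted_qps,
                          self.tokens + (now - self.last_refill) * self.granted_qps)
        self.last_refill = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # over budget: caller should back off or queue

# Example: a server splits 100 QPS of global capacity across 4 clients.
client = RateLimitedClient(granted_qps=100 / 4)
```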

Jos Visser has been working in the field of reliable and highly available systems since 1988. Starting as a systems programmer (MVS) at a bank, Jos's more than 25-year career has seen him working with a variety of mission-critical systems technologies, including Stratus fault-tolerant systems, HP MC/ServiceGuard, Sun Enterprise Cluster, and Linux Lifekeeper. Jos joined Google in 2006 as an engineer in the Maps SRE team. Since then he has worked in a number of different areas, including Social (Orkut SRE), Google's cloud computing, backup, and monitoring teams, and YouTube. Since early 2016 he has been working in the Travel SRE team in Cambridge, MA, where he is tech lead for the pipelines that ingest airline and travel industry data.


Track 2

Lawrence/San Tomas/Lafayette Rooms

Incident Management and Chatops @ Netflix Feat Scorebot

Al Tobey, Netflix

Born in December 2015 (a Sagittarius), Scorebot has only been around for a couple months, but it has already made its mark on incident management at Netflix. Shortly after being compiled, it noticed that SREs were doing all kinds of tasks that were better suited to machines, such as traversing personnel graphs, archiving images, and updating the status page. Its favorite programming language is Go and it loves running Linux, but sometimes Windows is OK too. During downtime, Scorebot likes to listen in on conversations, peek at metrics, and text people when they least expect it. This interview with Scorebot will cover how Scorebot emerged, its job at Netflix, and what it takes to be on-call 24x7x365.


Using Salt to Make Infrastructure Consumable (Tasty, Even)

Warren Turkal, SignalFx

SignalFx is an advanced monitoring platform for modern applications. It ingests, processes, and runs in-stream analytics against high volumes of high resolution metric data from all over the world.

We've used Salt from the beginning of SignalFx. In this talk, Warren will describe how SignalFx's operating environment is organized and how Salt is used to turn it into an infrastructure product that can be easily consumed by engineers. For example, when service owners spin up a service or scale a service, no manual intervention is required. Warren will also go over some of the work he's done to integrate Salt with AWS.


Track 3

Cypress Room

Monitoring the Unmeasureable

Jennifer Davis, Chef

From system performance to application metrics, we continue to further our understanding of what to monitor, why, and how to present it appropriately to the various audiences who need to act on this information. Yet there are things across our environment that we agree we can't measure because they are unquantifiable. That doesn't mean that there is zero signal to be analyzed and monitored.

We can look at open source software that is in wide use, yet becomes stale and unusable after years due to the atrophy of maintainers keeping it up to date with security and integrations with other software, or implementation of new features that keep it useful. How do you measure the health of your current implemented software solutions so that you know when to start planning change, or committing intentional time to a project?

How do you know when Amazon's Relational Database Service is more costly than having qualified in-house DBAs?


How do you measure the value of sending your employees to one conference over another? How do you convey the value of going to one conference over another?

In this talk, I'll tackle these questions in addition to sharing other observations about monitoring within our environments with the goal of inspiring others to examine available signals, their impact, and the value of monitoring.


Go for SREs Using Python

Andrew Hamilton, Zefr, Inc.

Python has become the default language for tool building for SREs. Go stands a good chance to become the new preferred language for such development. In this talk we'll quickly discuss the similarities and differences between Python and Go and why Go is a great fit for SRE and Operations tasks. Then we'll go through a basic example of building a tool and how Go can make us more efficient.

Skirting the line between development and operations, Andrew is currently an SRE focusing on making developers more productive. He is always looking for ways to make things easier for others by building tools and services. He tends to have his head in the clouds and is always looking for a fun new technology to tear apart and explore.

3:25 pm–4:00 pm  

Break with Refreshments, Sponsored by Dropbox

Mezzanine East/West

4:00 pm–5:00 pm  

Thursday Closing Address

Santa Clara Ballroom

A Young Lady's Illustrated Primer to Technical Decision-Making

Charity Majors, Hound

Charity is the cofounder and CTO of Hound, a new startup focused on mining machine data for humans. Before that she spent several years running infrastructure at Parse, and working as an engineering manager for Facebook.

Likes: crazy data problems, running distributed systems, building happy teams

Loves: single malt scotch (the peatier the better).


Disclaimer: This talk may contain strong or potentially offensive language.

6:00 pm–8:00 pm  

Reception, Sponsored by Google

Terra Courtyard

 

Friday, April 8, 2016

8:00 am–noon  

Badge Pickup

Santa Clara Ballroom Foyer

8:00 am–9:00 am  

Continental Breakfast, Sponsored by Baidu

Mezzanine East/West

9:00 am–10:00 am  

Keynote Address

Santa Clara Ballroom

Putting Together Great SRE Teams

Kripa Krishnan, Google

Kripa Krishnan is a Technical Program Director at Google and has led Google's Disaster Recovery Program (DiRT) and related efforts for about nine years. She also heads up the Google Cloud Product Operations Group. Her work at Google has included privacy and security initiatives in Google Apps and the new gTLDs program. Prior to Google, she worked with the Telemedicine Program of Kosovo and ran a theater and performing arts organization in India for several years.

What kinds of people make up a great SRE team? This talk explores whether SRE just means software/systems engineers, and what value other roles bring to a team. How can you fully utilize specialist roles and diverse skills in your SRE organization?

Kripa Krishnan is a Technical Program Director at Google and has led Google's Disaster Recovery Program (DiRT) and related efforts for ~9 years. She also heads up the Google Cloud Product Operations Group. Her work in Google has included Privacy and Security initiatives in Google Apps and new gTLDs program. Prior to Google, she worked with the Telemedicine Program of Kosovo and ran a theater and performing arts organization in India for several years. 

What kinds of people make up a great SRE team? This talk explores whether SRE just means software/systems engineers, and what value other roles bring to a team. How can you fully utilize specialist roles and diverse skills in your SRE organization?

Kripa Krishnan is a Technical Program Director at Google and has led Google's Disaster Recovery Program (DiRT) and related efforts for ~9 years. She also heads up the Google Cloud Product Operations Group. Her work in Google has included Privacy and Security initiatives in Google Apps and new gTLDs program. Prior to Google, she worked with the Telemedicine Program of Kosovo and ran a theater and performing arts organization in India for several years. 

10:00 am–10:30 am  

Break with Refreshments, Sponsored by Bloomberg

Mezzanine East/West

10:30 am–11:25 am  

Track 1

Winchester/Stevens Creek Rooms

Server Provisioning in an IPv6 Only World

Matthew Almond, Facebook, Inc.

Facebook began deploying IPv6-only clusters in 2014. This talk covers the challenges and solutions of provisioning in a v6-only world, from hardware to software.
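For readers who haven't worked against v6-only networks, the sketch below (a generic illustration, not Facebook's tooling; the host name is a placeholder) shows the basic shift required in client code: resolve AAAA records only, with no IPv4 path to fall back on.

    import socket

    def connect_v6_only(host, port):
        """Resolve and connect over IPv6 only; fails if no AAAA record exists."""
        # AF_INET6 restricts resolution to AAAA records -- on a v6-only
        # network there is no IPv4 fallback to hide provisioning mistakes.
        for family, socktype, proto, _canon, sockaddr in socket.getaddrinfo(
                host, port, socket.AF_INET6, socket.SOCK_STREAM):
            sock = socket.socket(family, socktype, proto)
            try:
                sock.connect(sockaddr)
                return sock
            except OSError:
                sock.close()
        raise OSError("no reachable IPv6 address for %s" % host)

    # Hypothetical endpoint for illustration:
    # sock = connect_v6_only("provisioning.example.com", 443)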



Privacy Reliability Engineering: Looking at Privacy through the Lens of SRE

Amber Yust, Google

SRE traditionally focuses on maintaining reliable service uptime, but creating reliable privacy has many similar challenges—though often with a few unique and interesting twists. Senior Privacy Engineer (and previously Site Reliability Engineer) Amber Yust will talk about applying many of the skills and techniques that SREs value to building reliable privacy.

Amber spent a couple of years each in Google's SRE and Yelp's infrastructure teams before joining Google's privacy efforts ~2 years ago. Since then she's been applying her talents to engineering reliable privacy into Google's social products as a Senior Privacy Engineer.



Track 2

Lawrence/San Tomas/Lafayette Rooms

Building Reliable Social Infrastructure for Google

Marc Alvidrez, Google

If you were evaluating a system design for anti-patterns, you might look for the following characteristics:

  • significant amounts of global, mutable data with a very high update rate requiring ACID semantics
  • application data with no natural sharding (i.e., partitioning) dimension
  • data storage-level hot-spotting
  • interactive latency requirements for a global base of users

These undesirable characteristics in your system, however, are precisely the desirable characteristics of your application! They form the basis of critical features needed for any successful Social network, and many socially-enabled applications.

Come and hear about the tradeoffs explored and design idioms we discovered as we built Google's Social infrastructure.
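To make the sharding discussion concrete: when data does have a natural partitioning dimension, a textbook scheme like the consistent-hash ring sketched below (a generic illustration, not Google's infrastructure) spreads keys across nodes and limits re-mapping when nodes are added or removed. The talk's premise is that social data tends to defeat exactly this kind of scheme: hot keys and globally mutable state concentrate load no matter how you hash.

    import bisect
    import hashlib

    class ConsistentHashRing:
        """Minimal consistent-hash ring: maps keys to nodes, few moves on resize."""

        def __init__(self, nodes, replicas=64):
            self._ring = []            # sorted list of (hash, node)
            for node in nodes:
                for i in range(replicas):
                    self._ring.append((self._hash("%s#%d" % (node, i)), node))
            self._ring.sort()
            self._keys = [h for h, _ in self._ring]

        @staticmethod
        def _hash(value):
            return int(hashlib.md5(value.encode()).hexdigest(), 16)

        def node_for(self, key):
            # Walk clockwise from the key's hash to the first virtual node.
            idx = bisect.bisect(self._keys, self._hash(key)) % len(self._keys)
            return self._ring[idx][1]

    ring = ConsistentHashRing(["db-1", "db-2", "db-3"])
    print(ring.node_for("user:12345"))   # stable assignment for this key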


Marc Alvidrez is a Senior Staff Site Reliability Engineer with Google. He joined the company in 2004, and starting as an early SRE he has led a variety of teams responsible for both infrastructure and major user-facing services. These have included the first team responsible for Google File System (GFS), and the teams responsible for Google's Display and AdSense advertising serving systems, Google+ and Google Photos. Prior to Google he held systems engineering roles at Vodafone and Internet startup Topica, where he was the Director of Operations.


The Evolution of Global Traffic Routing and Failover

Aaron Heady, Microsoft

Traffic routing and failover strategies grow with the complexity of your service offering, and sometimes seemingly simple design choices in your services can dramatically impact what routing strategies you can use. Are all of your datacenters 100% identical, with all data and services available everywhere? Do you have anything that requires affinity to a DC between multiple calls? What’s the experience for an end user when you do shift traffic? Do you have a client app on the user’s computer or phone that you can leverage to assist routing decisions or hide failovers? Or is the F5 key in the browser going to be how users recover? A 50-minute conversation can’t be thorough, but we can give you insight into what you need to be thinking about by sharing some of the things we’ve learned running Bing.com for a while.
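As one concrete building block behind these questions, a minimal health-check-driven failover loop might look like the sketch below (hypothetical endpoints, not Bing's system): probe each datacenter and compute the traffic weights a DNS or load-balancing layer would serve, dropping a failed DC to zero.

    import urllib.request

    # Hypothetical per-datacenter health endpoints for illustration.
    DATACENTERS = {
        "us-west": "https://us-west.example.com/health",
        "us-east": "https://us-east.example.com/health",
        "europe":  "https://eu.example.com/health",
    }

    def healthy(url, timeout=2):
        """A DC stays in rotation only if its health endpoint answers 200 quickly."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except OSError:
            return False

    def traffic_weights():
        """Spread traffic evenly across healthy DCs; failed DCs get weight 0."""
        up = {dc: healthy(url) for dc, url in DATACENTERS.items()}
        n = sum(up.values()) or 1
        return {dc: (1.0 / n if ok else 0.0) for dc, ok in up.items()}

    print(traffic_weights())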



Track 3

Cypress Room

Lightning Talks

Lightning Talks are back-to-back five-minute presentations on just about anything.


11:25 am–11:30 am  

Short Break

11:30 am–12:25 pm  

Track 1

Winchester/Stevens Creek Rooms

SRE: It’s People All the Way Down

Lex Neva and Courtney Eckhardt, Heroku

Is your root cause really “human error”? How did your environment let the human make the error? How did their error take down the service? How many outages did humans prevent? Can your dev teams’ priorities be aligned with reliability, instead of only with churning out features?

At Heroku, we do ops as a service -- reliability is our product. If we go down, we take thousands of businesses with us. In SRE, we push for reliability and resiliency in designs, sure, but it’s more than that. We iterate on process, automation, tooling, and incident response, because people are at the heart of everything we do.


Lex Neva is probably not a super-villain. He has 6 years of experience keeping large services running, including Linden Lab's Second Life, DeviantArt.com, and his current position as a Heroku SRE. While originally trained in computer science, he’s found that he most enjoys applying his software engineering skills to operations. A veteran of many large incidents, he has strong opinions on incident response, on-call sustainability, and reliable infrastructure design, and he currently runs SRE Weekly.

Courtney comes from a background in customer support and internet anti-abuse policy. She combines this human-focused experience with the principle of Conway’s Law and the work of Kathy Sierra and Don Norman into a wide-reaching and humane concept of operational reliability.


Track 2

Lawrence/San Tomas/Lafayette Rooms

Running Consul at Scale—Journey from RFC to Production

Darron Froese, Datadog

We had many VMs in AWS, were ingesting millions of metrics per second, and were having pain around service discovery and quick configuration changes. This is the story of how we integrated Consul into our environment, what it helped us with, the mistakes we made, and some tips for a successful implementation in your own environment. Ten months later, our growing cluster was using Consul to facilitate 60-second cluster-wide configuration changes and to make service discovery simpler and more flexible.
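For context, the two Consul primitives the talk leans on are service discovery through the health API and cluster-wide configuration through the KV store, which agents can watch to pick up changes within seconds. A minimal sketch against a local Consul agent's standard HTTP API (the service and key names here are made up):

    import json
    import urllib.request

    CONSUL = "http://127.0.0.1:8500"  # local Consul agent, default port

    def discover(service):
        """List address:port of passing instances via /v1/health/service."""
        url = "%s/v1/health/service/%s?passing" % (CONSUL, service)
        with urllib.request.urlopen(url) as resp:
            entries = json.load(resp)
        return [(e["Service"]["Address"] or e["Node"]["Address"],
                 e["Service"]["Port"]) for e in entries]

    def set_config(key, value):
        """Write a config value to the KV store; watching agents see it quickly."""
        req = urllib.request.Request("%s/v1/kv/%s" % (CONSUL, key),
                                     data=value.encode(), method="PUT")
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)  # Consul returns true on success

    # Hypothetical names for illustration:
    # print(discover("metrics-intake"))
    # set_config("config/intake/batch_size", "500")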

Darron has been building things for the internet since the early 90s when he first discovered Mixmaster remailers and Usenet. In 2014, after running nonfiction studios for 12 years, he moved to Datadog to be a Site Reliability Engineer. Darron enjoys short build times, resilient infrastructure, clusters that keep their quorum, and breathing compressed gases underwater.



Track 3

Cypress Room

Panel: Who/What Is SRE?

Moderator: Jennifer Petoff, Google

Panelists: Jeremy Carroll, Pinterest; Andrew Fong, Dropbox; Niall Murphy, Google; Craig Sebenik, Matterport; Blake Scrivner, Netflix; Brian Smith, Facebook

12:25 pm–1:30 pm  

Conference Luncheon, Sponsored by Datadog

Grand Ballroom EF

1:30 pm–2:25 pm  

Track 1

Winchester/Stevens Creek Rooms

Terraform at Adobe

Kelvin Jasperson, Adobe

I will show how my team has been using HashiCorp's Terraform for a year to manage our AWS infrastructure. I'll go over a small intro to Terraform, then jump into some internal design decisions, internal best practices, and a short demo to help solidify some of the principles.

Kelvin is a Systems Engineer for Adobe where he is working in their Digital Marketing BU. There he has been using Terraform and Puppet to transform the way the ops team does business. Before that he was working as a "full stack" ops guy for MX where he racked servers, deployed code, and generally administered systems. He loves automation, making sure everything is committed to version control, and completing fulfilling and interesting projects. You can find him on Twitter @zxjinn.



Transforming Tier 1 Caterpillars to Butterflies

Nina Mushiana, LinkedIn

One of LinkedIn's key cultural values is Career Transformation: helping the people you manage build new abilities and skills, working with them to define their career goals, and supporting their efforts to accomplish them. Applying this to a Tier 1 support team is challenging.

A Tier 1 support team manages the day-to-day operations of your business and engages higher tiers when needed. Its members end up with a very wide field of view but very little depth of knowledge. They are always the bearers of bad news and only noticed when something is broken. The morale of such teams is notoriously low. Furthermore, capitalizing on this experience for the business is a challenge because of retention issues stemming from low morale. This was LinkedIn in 2013.


Today, we have transformed our Tier 1 into the foundation of our SRE organization and an incubator for our SREs. Our objective was to add depth to their breadth: they are part of the resolution instead of just passing on bad news, their work is more valued, and they have gained the trust of higher tiers. As a result, team morale is at an all-time high. Investing in automation, training, and mentorship was the key to their transformation. This is LinkedIn today.

This session will discuss our roadblocks, learnings, and achievements.

Nina Mushiana has been the SRE Manager for the Production-SRE and NOC teams at LinkedIn since May 2013. Prior to that, she was the project manager for Business Continuity Planning at Yahoo and assistant manager for the Yahoo NOC. Nina has 14 years of operations experience.


Track 2

Lawrence/San Tomas/Lafayette Rooms

The Art of Performance Monitoring

Brian Smith, Facebook

In this talk, we share our experience monitoring production performance for Facebook and the services that back the site. We will go over what makes performance monitoring effective and how to produce more useful results. You will learn how to think about monitoring holistically during the design and development of new services and applications.



Avoiding Cascading Failures at eBay?

Craig Fender and Ravindra Punati, eBay

eBay is a well-known leader in online marketplaces, with more than three billion page views per day. eBay runs the sixth largest compute footprint on the planet, enabling billions of dollars of commerce among upwards of 162 million people in over 100 countries. The infrastructure contains tens of thousands of compute nodes running thousands of applications, hundreds of databases, and thousands of network devices. Failures can happen at the infrastructure, platform, framework, or application layer, or even at the availability-zone (region) level. Our goal is to ensure that failures at one layer of the stack are not compounded by cascading to upstream layers, causing a total system or service outage. This talk covers the automation capabilities our engineering teams have built to mark data and service paths up or down to ensure a consistent user experience. We will also go over some case studies in shifting traffic from failed zones (regions) to healthy ones.

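A common primitive behind this kind of automation is the circuit breaker: once a downstream layer is failing, stop calling it so its failure doesn't consume threads and latency budget upstream. A minimal generic sketch of the pattern (not eBay's implementation):

    import time

    class CircuitBreaker:
        """Open after `threshold` consecutive failures; retry after `reset_after` s."""

        def __init__(self, threshold=5, reset_after=30.0):
            self.threshold = threshold
            self.reset_after = reset_after
            self.failures = 0
            self.opened_at = None

        def call(self, fn, *args, **kwargs):
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.reset_after:
                    # Fail fast: don't let a dead dependency eat upstream capacity.
                    raise RuntimeError("circuit open; downstream marked down")
                self.opened_at = None  # half-open: allow one probe call through
            try:
                result = fn(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.failures >= self.threshold:
                    self.opened_at = time.monotonic()
                raise
            self.failures = 0
            return result

    # Usage: breaker.call(fetch_from_database, query) wraps each downstream call.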

Ravindra Punati has held many senior roles at eBay and is presently the senior leader of the Site Reliability Engineering team. In other roles at eBay, Ravi has been responsible for infrastructure automation initiatives and cloud operations. Ravi brings extensive expertise in the fields of database engineering, application development, and software-as-a-service product lines. In addition to holding multiple degrees in computer science, Ravi has held several roles as an engineer, architect, manager, and executive in various Silicon Valley start-ups.

Craig is presently a Senior Technical Duty Officer at eBay and is responsible for commanding all types of large-scale site incidents. In addition to an undergraduate degree, Craig holds numerous professional certifications related to the computer industry (RHCE, SSCA, ITIL, etc.). Craig has held several roles at multiple start-up and Fortune 500 companies, including Senior Systems Engineer, Project Manager, Presenter, and Major Incident Commander.


Track 3

Cypress Room

Managing Grumpy: Embracing Diversity to Build Stronger Teams

Lisa Phillips, Fastly

An organization performs best when it is made up of a diverse group of people, opinions and ideas from a variety of locations and backgrounds. As a manager, your job is to balance this diversity to support and grow a strong team poised for individual and organizational success. Learning how to best motivate a group of disparate and distinct people with unique talents best prepares you to overcome unforeseen situations and obstacles as you scale your operations organization.


At Fastly, we’ve built a world-class team of ops experts from many backgrounds and timezones around the world, amidst immense growth and operational pressure. This talk will cover strategies for overcoming individual and organizational management challenges in a globally diverse environment. We’ll discuss strategies for moving your team toward a shared positive future amidst friction in a fast-paced 24x7 operational environment. We’ll explore different people management challenges and discuss methods to work through the grumpiest admin. This talk will leave you thinking of new methods for focusing your efforts on total team strength and looking where you want to go instead of where you’re afraid of heading.


SRE at a Start-Up: Lessons from LinkedIn

Craig Sebenik, Matterport

Many large companies have strong SRE teams that are a great example to follow. But, applying the techniques seen at larger companies to a smaller company has many challenges. Bringing about change is a combination of culture shifts as well as technical challenges. In this talk, I will discuss many of the concepts that worked at LinkedIn and how I have gradually implemented them over the past year and a half at a start-up.

Craig has been a researcher, a software developer, a sysadmin, and most recently, an SRE. He has worked for Fortune 500 companies as well as small startups. He has been programming since grade school and you will find him in front of a computer as often as possible. While he worked for a couple of startups doing SRE-type work, he learned a great deal from his (nearly) four years at LinkedIn. He is currently working at Matterport, a startup working on 3D technology.


He spent a couple of years at Le Cordon Bleu in Sydney and another year studying Italian cuisine in Florence (including working in the kitchen of a restaurant). He prefers pastry over cuisine, but is happy to talk about anything related to food.

2:25 pm–2:30 pm  

Short Break

2:30 pm–3:25 pm  

Track 1

Winchester/Stevens Creek Rooms

Less Alarming Alerts!

Robert Treat, OmniTI

This talk discusses techniques and philosophy for making on-call more humane and reasonable. We will discuss terminology, how to sort through "alert overload," suggestions for tooling to help, and processes you can put in place to get development and management on board. This comes from years of taking over and managing various on-call services for different architectures, often starting from extreme states of disrepair.
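One concrete tactic for sorting through alert overload is suppressing repeats of the same alert within a window, so on-call sees each problem once rather than every time the check fires. A minimal sketch of the idea (generic, not OmniTI's tooling):

    import time

    class AlertDeduper:
        """Pass an alert through at most once per `window` seconds per key."""

        def __init__(self, window=600.0):
            self.window = window
            self.last_sent = {}   # (host, check) -> timestamp of last page

        def should_page(self, host, check):
            key = (host, check)
            now = time.monotonic()
            last = self.last_sent.get(key)
            if last is not None and now - last < self.window:
                return False      # same alert already paged recently; suppress
            self.last_sent[key] = now
            return True

    dedupe = AlertDeduper(window=600)
    for _ in range(3):
        if dedupe.should_page("web-01", "disk_full"):
            print("PAGE: web-01 disk_full")   # fires once, not three times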

With more than 15 years of experience building database-backed, internet-based systems, Robert currently spends his days as CEO of OmniTI, a technical services firm focused on helping clients with web application and infrastructure management at scale. Author and speaker at conferences worldwide, Robert is a recognized expert within the industry on topics such as Open Source, databases, and managing web operations at scale. He occasionally blogs at xzilla.net.



Track 2

Lawrence/San Tomas/Lafayette Rooms

Ansible to Chef: Going from a Science Fiction Messaging System to Cooking with Chef

Grant Ridder, Banjo

Ansible is an easy-to-use configuration management tool. However, as your infrastructure grows and becomes more complicated, it has a lot of pitfalls. At Banjo, we are migrating from Ansible to Chef. In the migration we have had several struggles but many more optimizations. We will discuss what we are doing and what we have learned so you don't have to repeat our drawn-out process.

Grant is a DevOps engineer at Banjo and is one of several people responsible for designing, automating, and maintaining the infrastructure. He also led the migration to Chef. He is active in the Chef, Python, and Ruby communities. Previously Grant was an operations engineer at PagerDuty, as well as a NOC engineer at LinkedIn.



Shaping Reality to Shape Outcomes: Making SRE Work with Uber Growth

Tom Croucher, Uber Technologies, Inc.

In an organization growing as rapidly as Uber (when every week engineers add 10 new services to the stack) it’s impossible to know every engineer. How do Uber SRE ensure our systems are reliable?

In this talk, I will explain how Uber SRE focused on shaping the reality that the engineering teams operate in to create a culture of reliability. Given a specific set of constraints engineers are smart and solve problems, so we made one of the primary roles of the SRE team to create a reality that made reliability part of the DNA of services, not just an add-on.


Over time, by shifting failure from a problem engineers almost never saw to one that was right in front of them, we were able to change the behavior of teams we didn’t even interact with. SRE organizations are in danger of creating an adversarial relationship with engineering teams by punishing them for failures in reliability or burdening them with “additional requirements.” By reframing reality we created an organizational baseline where our SRE team is helping engineering teams meet their goals.

This talk will look at a little of the history of SRE at Uber, what we are doing today, and our future work. Some of the major initiatives that have changed reality for the wider engineering population at Uber include bootstrapping our own datacenters in just a few months, changing our systems to actively serve traffic from multiple regions concurrently, and many kinds of testing: datacenter failure, chaos simulations, and load testing. The technology we have built is important, but more important is making reliability part of our DNA.


Track 3

Cypress Room

Panel: SRE Managers

Moderator: Michael Kacirek, Facebook

Panelists: Liz Fong-Jones, Google; Harrison Fisk, Facebook; Blake Scrivner, Netflix

3:25 pm–4:00 pm  

Break with Refreshments, Sponsored by Bloomberg

Mezzanine East/West

4:00 pm–5:00 pm  

Closing Talk

Santa Clara Ballroom

Performance Checklists for SREs

Brendan Gregg, Netflix

Brendan Gregg is a senior performance architect at Netflix, where he does large scale computer performance design, analysis, and tuning. He also assists production triage for performance and availability issues, and participates in the primary on-call rotation for the Netflix CORE and SRE team. He is the author of Systems Performance published by Prentice Hall, and received the USENIX LISA Award for Outstanding Achievement in System Administration. He has previously worked as a performance and kernel engineer, and has created performance analysis tools included in multiple operating systems, as well as visualizations and methodologies.

There's limited time for performance analysis in the emergency room. When there is a performance-related site outage, the SRE team must analyze and solve complex performance issues as quickly as possible, and under pressure. Many performance tools and techniques are designed for a different environment: an engineer analyzing their system over the course of hours or days, and given time to try dozens of tools: profilers, tracers, monitoring tools, benchmarks, as well as different tunings and configurations. But when Netflix is down, minutes matter, and there's little time for such traditional systems analysis. As with aviation emergencies, short checklists and quick procedures can be applied by the on-call SRE staff to help solve performance issues as quickly as possible.


In this talk, I'll cover a checklist for Linux performance analysis in 60 seconds, as well as other methodology-derived checklists and procedures for cloud computing, with examples of performance issues for context. Whether you are solving crises in the SRE war room, or just have limited time for performance engineering, these checklists and approaches should help you find some quick performance wins. Safe flying.
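For reference, the checklist in question was published on the Netflix Tech Blog as "Linux Performance Analysis in 60,000 Milliseconds": ten standard commands, run in order. The wrapper below just sequences them; the counts of 5 and batch-mode top are additions here so each command terminates on its own, and mpstat/pidstat/iostat/sar require the sysstat package.

    import subprocess

    # The ten commands from the published 60-second checklist; the interval
    # commands are given a count of 5 here so each one exits on its own.
    CHECKLIST = [
        "uptime",
        "dmesg | tail",
        "vmstat 1 5",
        "mpstat -P ALL 1 5",
        "pidstat 1 5",
        "iostat -xz 1 5",
        "free -m",
        "sar -n DEV 1 5",
        "sar -n TCP,ETCP 1 5",
        "top -b -n 1",
    ]

    for cmd in CHECKLIST:
        print("\n=== %s ===" % cmd)
        # shell=True so the dmesg pipe works.
        subprocess.run(cmd, shell=True)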


5:00 pm–6:00 pm  

Happy Hour, Sponsored by LinkedIn

Mezzanine East/West


© USENIX

SREcon is a registered trademark of the USENIX Association.
