general information

Registration Fee: $400
Thanks to generous sponsorship, early bird pricing is now permanent for SREcon15!

Venue:
Hyatt Regency Santa Clara
5101 Great America Pkwy
Santa Clara, CA 95054


SREcon15 Program


All sessions will take place at the Hyatt Regency Santa Clara.

 

Monday, March 16, 2015

7:30 am–5:00 pm  

Registration and Badge Pickup

Santa Clara Ballroom Foyer

8:30 am–9:00 am  

Continental Breakfast

Mezzanine East/West

9:00 am–10:00 am  

Keynote Address

Santa Clara Ballroom

Notes from Production Engineering

Pedro Canahuati, Director, Production Engineering, Facebook


More than 1.39 billion people hit Facebook's infrastructure per month—more than 1.19 billion on mobile alone. Nearly 1 billion photos are shared and more than 3 billion videos are viewed every day.

Facebook's services run on top of hundreds of thousands of servers spread across multiple geographically separated data centers. To balance the need for constant availability with a fast-moving and experimental engineering culture, the Facebook operations team has evolved over time to be as nimble as possible.


This talk will describe the evolution of a small, centralized, and sometimes marginalized operations team overwhelmed by production issues into a decentralized, high-performing, and well-regarded engineering team that enables Facebook's product and infrastructure teams to move fast and whose role is to say "yes" to ever-changing demands. Topics include culture, hiring practices, prioritization, and philosophies that bring developers closer to ops and even get them doing ops themselves.

Pedro Canahuati is director of the production engineering team at Facebook, leading the teams that scale Facebook's infrastructure and making sure Facebook's products are available 24x7. Prior to this, Pedro was director of operations at SpinMedia and Qloud. He previously leveraged his network and systems knowledge to build data centers and scale web operations at companies like NameMedia, Relera and Verio/NTT.

10:00 am–10:30 am  

Break with Refreshments

Mezzanine East/West

10:30 am–11:25 am  

Track 1

Winchester/Stevens Creek Rooms

Monitoring without Infrastructure @ Airbnb

Igor Serebryany, Airbnb

Millions of requests flow through Airbnb’s systems in a given day. Interruptions in traffic can be extremely costly for us, and leave our users stranded in foreign countries with no recourse. We needed a comprehensive suite of introspection tools which would allow us to prevent or quickly identify and remediate any issues.

However, we didn’t have the engineering resources to build our own in-house monitoring system. Instead, we leveraged a combination of open-source software, third-party vendors, and just enough glue code to make it all stick together.

All of these tools are available to you, too! Come hear how we manage tens of thousands of metrics and billions of log lines per day from thousands of machines, all operated by less than a full-time engineer. We will cover Logstash, StatsD, New Relic, Datadog, and our own open-sourced configuration-as-code alerting framework.
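The "configuration-as-code" idea is that alert rules live in version control as plain data and are evaluated mechanically against incoming metric samples. A minimal sketch of the pattern in Python; the rule names, metric names, and thresholds are hypothetical, not Airbnb's actual framework:

```python
# Sketch of configuration-as-code alerting: rules are plain data that can
# be code-reviewed and versioned alongside the application. All names and
# thresholds here are illustrative.
ALERTS = [
    {"name": "high_error_rate", "metric": "app.errors_per_sec", "op": ">", "threshold": 50},
    {"name": "low_disk_space", "metric": "host.disk_free_pct", "op": "<", "threshold": 10},
]

def evaluate(alerts, samples):
    """Return names of alerts whose condition holds for the given samples."""
    firing = []
    for alert in alerts:
        value = samples.get(alert["metric"])
        if value is None:
            continue  # no data for this metric; treat as not firing in this sketch
        breached = value > alert["threshold"] if alert["op"] == ">" else value < alert["threshold"]
        if breached:
            firing.append(alert["name"])
    return firing

print(evaluate(ALERTS, {"app.errors_per_sec": 75, "host.disk_free_pct": 40}))
# → ['high_error_rate']
```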


Igor Serebryany is an engineer on Airbnb’s Developer Happiness team. He is the author of SmartStack, an open-source distributed service discovery framework. Prior to building the backend infrastructure at Airbnb, he has worked on running Hadoop clusters, automating datacenters, and running scientific computing simulations in biology and astrophysics.


Track 2

Lawrence/San Tomas/Lafayette Rooms

Smart Monitor System For Automatic Anomaly Detection @Baidu

Xianping Qu, Baidu, Inc.

Billions of requests are served by hundreds of thousands of servers at Baidu. So many servers and modules pose a huge anomaly-detection challenge for engineers. When an anomaly occurs, various alarms and incidents are sent to engineers, and it is very difficult to find the root cause in a large volume of unorganized monitoring data and alarms. Thus, we built a smarter monitoring system named BIMS (Baidu Intelligent Monitoring System) to help engineers analyze problems and suggest the most likely causes of important anomalies such as revenue loss.

In this talk, we demonstrate the core procedure of BIMS using actual cases from the production environment of Baidu’s core products. Topics include the incident data model, proactive anomaly detection algorithms, correlation analysis, and visualization. Based on BIMS, we’ll also share some ideas about intelligent systems for the SRE team.
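To give a flavor of what proactive anomaly detection can mean, here is a deliberately simple rolling-window detector; BIMS's real algorithms are certainly more sophisticated, and the window size and threshold below are illustrative choices:

```python
import statistics

def detect_anomalies(series, window=10, threshold=3.0):
    """Flag indices whose value is more than `threshold` standard
    deviations away from the mean of the preceding `window` points."""
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mean = statistics.mean(history)
        stdev = statistics.pstdev(history)
        if stdev == 0:
            # Perfectly flat history: any change at all is anomalous.
            if series[i] != mean:
                anomalies.append(i)
            continue
        if abs(series[i] - mean) > threshold * stdev:
            anomalies.append(i)
    return anomalies

print(detect_anomalies([100, 102, 98, 101, 99, 100, 103, 97, 100, 101, 180]))
# → [10]
```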


Xianping Qu is a senior software engineer on the SRE team at Baidu. He is currently working on using monitoring and operations data to automate SRE work, and has experience in trend analysis, anomaly detection, and root cause analysis. Prior to this, he worked on Baidu’s monitoring system.


Track 3

Cypress Room

Case Study: Adopting SRE Principles at StackOverflow

Tom Limoncelli, Stack Exchange, Inc.

Adopting SRE principles at sites other than "unicorn companies" can be a challenge. In this talk I’ll review our experience trying to adopt SRE principles at StackExchange.com/StackOverflow.com. The failures are as educational as the successes. I’ll cover a number of tools that are publicly available, and techniques that work well at smaller companies. These include monitoring solutions like Bosun, and sections of The Practice of Cloud System Administration that can be used to educate others about the SRE ways.

Tom Limoncelli works in New York City at Stack Exchange, home of ServerFault.com and StackOverflow.com. He tweets and blogs at everythingsysadmin.com. His new book, The Practice of Cloud System Administration, is an SRE/DevOps look at system administration: http://the-cloud-book.com.


Tom is a frequent speaker and keynote presenter at both enterprise and web-scale conferences (SpiceWorks, LISA, LOPSA-East, CascadiaIT, NLUUG).

11:25 am–11:30 am  

Short Break

11:30 am–12:30 pm  

Track 1

Winchester/Stevens Creek Rooms

SRE Hiring

Andrew Fong, Dropbox

Hiring SREs can be one of the most challenging tasks an organization can undertake. This talk will discuss best practices around recruiting, evaluating and hiring SREs into your organization. During this talk we will explore how to do this from first principles, what best practices tend to be, and how we applied them when building the foundation of SRE at Dropbox.

Andrew leads the SRE teams at Dropbox. Prior to Dropbox, he worked at YouTube, helping to scale their infrastructure. He was previously at AOL running proxy/cache and video search infrastructure.



Track 2

Lawrence/San Tomas/Lafayette Rooms

Scaling Networks through Software

Joao Taveira, Fastly

While networking is a crucial component to ensuring site reliability, reacting to network events such as outages or DDoS attacks has traditionally been constrained by the capabilities of closed platform network devices. Certain vendors, however, allow anyone to redefine how their network behaves programmatically, enabling network designers to disregard conventional protocol usage entirely. This talk will cover how Fastly did just that — relying on software to redirect traffic reliably at various layers — and how such fine-grained control allowed us to scale a global network while minimizing operational costs.

Joao Taveira is a network engineer at Fastly, where he is responsible for making dumb switches do clever things. In addition to writing software for network orchestration, Joao works on protocol design and performance, and holds a PhD from University College London on something to that effect.



Track 3

Cypress Room

Ensuring Success During Disaster

Doug Barth, PagerDuty

Surviving a large scale outage requires more than just standing up a few extra servers. Validation and capacity planning can mean the difference between proper mitigation, or just a bunch of wasted effort. This talk will explore how to ensure DR success, gleaned from PagerDuty's production systems.
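The capacity-planning side of DR can be made concrete with a toy validation: after losing any single site, does the remaining capacity still cover peak load with headroom? This is an illustrative sketch, not PagerDuty's actual tooling; the site names, capacities, and headroom factor are hypothetical:

```python
def survives_site_loss(site_capacity, peak_load, headroom=1.2):
    """Return (ok, failed_site): ok is False if losing `failed_site`
    would leave less than `headroom` times the peak load in capacity.
    Numbers and site names are illustrative."""
    total = sum(site_capacity.values())
    for site, capacity in site_capacity.items():
        if total - capacity < peak_load * headroom:
            return (False, site)
    return (True, None)

print(survives_site_loss({"us-east": 100, "us-west": 100}, peak_load=70))
# → (True, None)
```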

Doug Barth is a Senior Operations Engineer at PagerDuty. He has worked on all parts of the PagerDuty system, but especially moving PagerDuty to a multi-master MySQL cluster, host-to-host transport-layer encryption via IPSec, and recently rebuilding PagerDuty's DR site. Doug developed and operated production systems for Orbitz, a large scale online travel company, and Signal Engage, a startup marketing tool provider in Chicago.


12:30 pm–1:30 pm  

Conference Luncheon

Terra Courtyard

1:30 pm–2:30 pm  

Track 1

Winchester/Stevens Creek Rooms

Instagration: A Case Study in Cloud Migration at Scale

Chris Bray, Facebook/Instagram

Instagram recently completed a large migration from AWS to Facebook's internal systems. This talk will go into more detail of what and how the team accomplished a very large scale AWS migration. It will touch on tools and lessons learned that Instagram applied to their migration that others might be able to both utilize and apply to their migration (to or from AWS) to help make it as painless and trouble free as possible.


Chris Bray is a Production Engineer at Instagram and Facebook. He works with large scale software deployments running on, amongst other things, CentOS, Chef, Python, designer drip coffee, Ruby, RedBull, vi, bash, cable ties and duct tape, using a large infrastructure of both Amazon EC2 and Facebook's OpenCompute hardware. He was part of the team that recently completed the migration of Instagram from EC2 to Facebook infrastructure and is currently focusing on working with new acquisitions at Facebook to help them take best advantage of Facebook's internal infrastructure.


Track 2

Lawrence/San Tomas/Lafayette Rooms

Netflix RaaS: Reliability as a Service


Coburn Watson, Netflix, Inc.

The Netflix architecture is based on hundreds of microservices running in the cloud at massive scale across numerous AWS regions. Achieving excellent availability of such a complex system requires a capable operations methodology. At Netflix we have a shared services team which seeks to lower operational barriers for individual service teams in order to improve both aggregate and microservice-level reliability. The challenge lies in finding the right balance of responsibility between a shared service support team and the devops engineers on the microservice team itself. We have taken an approach in which tooling and associated methodologies developed by our Operations Engineering organization tackle the following subset of operational activities at a platform-level:


  • Continuous integration and deployment
    • automated staggered deployment of microservice code across cloud regions
    • automated analysis of canary versus baseline code
  • Tuning of circuits in the system that respond to localized failures
  • Improved observability for both macro and micro performance dimensions
  • Identification and termination of server instances which are outliers

Through elimination of such undifferentiated heavy lifting, the teams can shift their focus onto product development versus being mired in operational complexity. The key benefit is the improvement of engineering velocity alongside reliability. As an organization, a direction needs to be taken on where to draw the line for operational responsibilities. This is no different in the Netflix "Freedom and Responsibility" culture.

This presentation will cover the operational complexities we have abstracted away from our microservice engineering teams, the associated decision factors, and future direction of the program.
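One of the activities listed above, identifying outlier instances, can be sketched with a median-absolute-deviation test across the fleet. This is illustrative only, not Netflix's actual implementation; the host names, latency figures, and cutoff are hypothetical:

```python
import statistics

def find_outliers(latencies_by_host, k=3.5):
    """Return hosts whose latency deviates from the fleet median by more
    than k times the median absolute deviation (MAD)."""
    values = list(latencies_by_host.values())
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # Fleet is essentially uniform; anything different is an outlier.
        return [h for h, v in latencies_by_host.items() if v != med]
    return [h for h, v in latencies_by_host.items() if abs(v - med) / mad > k]

print(find_outliers({"i-1": 10, "i-2": 11, "i-3": 10, "i-4": 12, "i-5": 95}))
# → ['i-5']
```

An instance flagged this way would then be a candidate for automated termination and replacement rather than manual triage.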

Coburn leads the Cloud Performance and Reliability Engineering team at Netflix. His team works to optimize the use of massive cloud resources with a keen focus on system performance and reliability. Prior to Netflix, he was at Rearden Commerce, HP, and numerous other companies, working to improve the performance of large scale distributed systems.


Track 3

Cypress Room

Making Every SRE Hire Count

Chris Stankaitis, The Pythian Group

Our industry is changing rapidly to keep up with the demands of the business. This means that many of the tools, processes, and techniques that were perfectly valid only a few years ago are now antiquated and no longer meet our needs. The hiring process is not immune to this shift, and finding the right people to fill a 21st century SRE position is challenging.

From designing and grading technical tests, to participating in technical interviewing / hiring manager "FIT" interviewing, making great people decisions is critical to the success of an SRE team and the pressure to get it right is high.

We will look at the process, how it has changed, and how we have had to evolve to find the people we are looking for.


Chris Stankaitis is a Manager for the Site Reliability Engineering group at Pythian, an organization providing Managed Services and Premium Consulting to companies whose data availability, reliability, and integrity is critical to their business.

Chris is a key member of the hiring team for the Pythian SRE group and has participated in hundreds of candidate screenings and interviews over the past two years, resulting in the hiring of over 30 Site Reliability Engineers.

2:30 pm–2:35 pm  

Short Break

2:35 pm–3:30 pm  

Track 1

Winchester/Stevens Creek Rooms

Incident Analysis

Sue Lueder, Google

Outages and incidents happen. Sh*t breaks, fibers get cut, bugs get pushed to production, teams fail to communicate, and all hell breaks loose. But those who don't learn from mistakes are doomed to repeat them...over and over and over again, with increasing frustration for those on the frontlines fixing the problems and from the users who suffer the impacts.

In an effort to better learn from what happened across all products and services, Google launched an initiative in 2014 to gather data from all outages and incidents that occurred on production systems for trend analysis into system and user impacts, incident timelines, and root causes. The data is then used to drive improvements across systems, processes, and tools to improve the balance between system stability and development velocity. This talk aims to share Google's approach to setting up and running such an analysis program, some preliminary results, and lessons learned.


Sue Lueder joined Google as a Site Reliability Program Manager in 2014 and is on the team responsible for disaster testing and readiness, incident management processes and tools, and incident analysis. Previous to Google, Sue was a technical program manager and a systems, software, and quality engineer in wireless and smart energy industries (OnRamp Wireless, Texas Instruments, Qualcomm). She has a M.S. in Organization Development from Pepperdine University and a B.S in Physics from UCSD.


Track 2

Lawrence/San Tomas/Lafayette Rooms

Collin and the Slingbot

Joe Ruscio, Librato

In this talk we describe our internal feature-flagging system, which combines Rollout, ZooKeeper, and our in-house campfire chatbot (twke), to transparently enable features for targeted production end-users without disrupting other customers.

Our talk follows the escapades of our intrepid engineer Collin as he cleverly employs feature flagging to manage production traffic flow in order to re-engineer and replace a core component of our production SaaS infrastructure. Like the engineering acrobatics involved in reworking a bridge that must continue to bear traffic, Collin's story is a prolonged, high-stakes, surgical endeavor with a lot of moving parts. It is also a textbook illustration of the multi-disciplinary balance between architecture, programming, and operational ingenuity that exemplifies site reliability engineering in the wild.
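The core of a Rollout-style gate is small: a flag can be switched on for specific users or ramped to a deterministic percentage of all users. A minimal sketch of that pattern (the real system described in the talk keeps this state in ZooKeeper so every node sees flag changes immediately; the feature and user names here are hypothetical):

```python
import hashlib

class FeatureFlags:
    """Minimal feature gate: per-user activation plus deterministic
    percentage rollout. A sketch of the pattern, not Librato's system."""

    def __init__(self):
        self._flags = {}  # name -> {"users": set, "percent": int}

    def _flag(self, name):
        return self._flags.setdefault(name, {"users": set(), "percent": 0})

    def activate_user(self, feature, user_id):
        self._flag(feature)["users"].add(user_id)

    def activate_percentage(self, feature, percent):
        self._flag(feature)["percent"] = percent

    def active(self, feature, user_id):
        flag = self._flags.get(feature)
        if flag is None:
            return False
        if user_id in flag["users"]:
            return True
        # Hash feature+user so each user gets a stable answer as the
        # percentage ramps up, rather than flapping per request.
        bucket = int(hashlib.md5(f"{feature}:{user_id}".encode()).hexdigest(), 16) % 100
        return bucket < flag["percent"]

flags = FeatureFlags()
flags.activate_user("new_ingest_path", "user-42")
print(flags.active("new_ingest_path", "user-42"))  # → True
```

Because the gate is consulted on every request, traffic can be shifted onto a rebuilt component one user, then one percent, at a time.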


Joseph Ruscio is a Co-Founder and the Chief Technology Officer at Librato. He's responsible for the company's technical strategy, product architecture, and hacks on all levels of their vision for the future of monitoring. Joe has 15 years of experience developing distributed systems in startups, academia, and the telecommunications industry and he holds a Masters in Computer Science from Virginia Tech. In his spare time he enjoys snowboarding and obsessing over the details of brewing both coffee and beer. He loves graphs.


Track 3

Cypress Room

Making the Sum of AWS Networking Greater than Its Parts—Achieving High Availability in VPCs

Warren Turkal, SignalFx

If you're running in VPC (and you should be by now), what is your network reliability strategy? This talk proposes a method of building flexible, redundant networking for VPC that goes far beyond a single NAT router per VPC. CloudFormation + boto + BGP come together in beautiful harmony for an architecture that you can depend on. Learn about the architecture and our implementation in Python, and grab the scripts to run your own.
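The failover decision behind such a design can be separated from the AWS calls that apply it: for each availability zone, prefer the local NAT instance and fall back to a healthy NAT elsewhere. A hypothetical sketch of that pure logic only; applying the result would mean replacing each subnet's default route via the EC2 API (e.g. with boto), and all AZ and instance names below are made up:

```python
def choose_nat_routes(nat_by_az, health):
    """For each availability zone, route through the local NAT instance
    when healthy; otherwise fail over to any healthy NAT in another AZ.
    Pure decision logic; names are hypothetical."""
    healthy = [nat for nat in nat_by_az.values() if health.get(nat)]
    routes = {}
    for az, nat in nat_by_az.items():
        if health.get(nat):
            routes[az] = nat         # prefer the AZ-local NAT
        elif healthy:
            routes[az] = healthy[0]  # fail over cross-AZ
        else:
            routes[az] = None        # nothing healthy: leave routes alone
    return routes

print(choose_nat_routes({"us-east-1a": "nat-a", "us-east-1b": "nat-b"},
                        {"nat-a": False, "nat-b": True}))
# → {'us-east-1a': 'nat-b', 'us-east-1b': 'nat-b'}
```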

Warren Turkal is a Site Reliability Engineer at SignalFx, a stealth startup in San Mateo, CA. He works to provide infrastructure components like SaltStack and Docker in a production environment. Warren has been working with AWS for about 3 years and has used the boto library to write many tools used at SignalFx. He previously worked at companies heavily invested in cloud technologies, such as Ooyala and Google.


3:30 pm–4:00 pm  

Break with Refreshments

Mezzanine East/West

4:00 pm–5:00 pm  

Tracks 1 & 2

Santa Clara Ballroom

Building a Billion User Load Balancer


Patrick Shuff, Facebook

Want to learn how Facebook scales their load balancing infrastructure to support more than 1.3 billion users? We will be revealing the technologies and methods we use to globally route and balance Facebook's traffic. The traffic team at Facebook has built several systems for managing and balancing our site traffic, including both a DNS load balancer and a software load balancer capable of handling several protocols. This talk will focus on these technologies and how they have helped improve user performance, manage capacity, and increase reliability.
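A DNS load balancer of this kind ultimately needs a way to steer users toward clusters in proportion to available capacity. A minimal weighted-choice sketch of that idea (the cluster names and weights are illustrative, not Facebook's system):

```python
import bisect
import random

def make_picker(cluster_weights):
    """Return a function that chooses a cluster with probability
    proportional to its weight, roughly how a DNS load balancer can
    steer users toward clusters with spare capacity."""
    names = list(cluster_weights)
    cumulative, total = [], 0
    for name in names:
        total += cluster_weights[name]
        cumulative.append(total)

    def pick(rng=random):
        point = rng.uniform(0, total)
        # bisect_right skips zero-weight clusters correctly.
        return names[bisect.bisect_right(cumulative, point)]
    return pick

# With weights 3:1, "oregon" is returned about 75% of the time.
pick = make_picker({"oregon": 3, "virginia": 1})
```

In practice the weights would be recomputed continuously from capacity and health measurements, which is where most of the engineering lives.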


Patrick Shuff is a production engineer on the traffic team at Facebook. His team's responsibilities include maintaining and monitoring the global load balancing infrastructure, DNS, and our content delivery platform (i.e., photo/video delivery). Other roles at Facebook include being on the global site reliability team, where he works with various infrastructure teams (messaging, real-time infrastructure, email) to help increase service reliability and monitoring for their services.


Track 3

Cypress Room

Error Budgets and Risks

Marc Alvidrez, Google

Striving for Imperfection: Using an error budget to move fast without compromising high reliability


You may assume Site Reliability Engineers aim to build systems that never go down. What that assumption misses is that 100% reliability is almost never the goal. Instead, our task is to trade off reliability against the many other goals we have for our services. SREs want to provide great service to end users and customers, and also have the flexibility to change the systems often and quickly. We want to ensure that the queries and the revenue keep flowing, and do so as efficiently as possible, provisioning as little excess as necessary to deliver good service. Taking an engineering approach to meeting these goals means we need to make these tradeoffs measurable, and this is where error budgets come in. Your error budget is a measure of risk: it is the amount of headroom you have above your SLA. Being smart about how you manage and spend this error budget is one of the best tools that SRE has to meet the various contending goals that services at Internet scale present.
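The arithmetic behind an error budget is straightforward: the budget is the gap between 100% and the SLO target, expressed as allowable downtime over a window. For instance, a 99.9% availability target over a 30-day month leaves roughly 43 minutes of downtime to "spend" on risky changes:

```python
def error_budget_minutes(slo, days=30):
    """Downtime allowed per `days`-day window for an availability SLO
    expressed as a fraction (e.g. 0.999 for 99.9%)."""
    return (1 - slo) * days * 24 * 60

def budget_remaining(slo, downtime_minutes, days=30):
    """Budget left after observed downtime; negative means the budget
    is blown and risk-taking should slow down."""
    return error_budget_minutes(slo, days) - downtime_minutes

print(round(error_budget_minutes(0.999), 1))    # → 43.2
print(round(budget_remaining(0.999, 30.0), 1))  # → 13.2
```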

Marc Alvidrez is a Senior Staff Site Reliability Engineer with Google. He joined the company in 2004, and starting as an early SRE he has led a variety of teams responsible for both infrastructure and major user-facing services. These have included the first team responsible for Google File System (GFS), and the teams responsible for Google's Display and AdSense advertising serving systems, Google+ and Google Photos. Prior to Google he held systems engineering roles at Vodafone and Internet startup Topica, where he was the Director of Operations.

5:00 pm–6:30 pm  

Happy Hour, Sponsored by Google

Mezzanine East/West

 

Tuesday, March 17, 2015

7:30 am–5:00 pm  

Registration and Badge Pickup

Santa Clara Ballroom Foyer

8:30 am–9:00 am  

Continental Breakfast

Mezzanine East/West

9:00 am–10:00 am  

Track 1

Santa Clara Ballroom

Mux: How I Stopped Worrying and Learned to Love the Multiplexing

9:00 am-10:00 am

Berk Demir, Twitter

At its core, Mux is a generic RPC multiplexing protocol created at Twitter. As a strictly OSI Layer 5 session protocol, it can be used in conjunction with protocols from other layers. We'll discuss the motivation for creating a session protocol as well as gains such as elimination of head-of-line blocking, explicit queue management, and better networking economics.
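The core idea behind session-layer multiplexing can be sketched with a toy example. This is purely illustrative and much simpler than Mux's actual wire protocol; the class and method names are invented:

```python
import itertools

class MuxedConnection:
    """Many logical requests share one transport. Each exchange gets a unique
    tag, so replies can arrive in any order and be matched to their callers;
    a slow response no longer blocks the responses queued behind it (no
    head-of-line blocking at the session layer)."""

    def __init__(self):
        self._tags = itertools.count(1)
        self._pending = {}              # tag -> request payload

    def send(self, payload):
        tag = next(self._tags)          # unique tag identifies this exchange
        self._pending[tag] = payload
        return tag                      # caller waits on the tag, not the socket

    def on_reply(self, tag, reply):
        self._pending.pop(tag)          # out-of-order replies are fine
        return reply

conn = MuxedConnection()
t1 = conn.send("slow query")
t2 = conn.send("fast query")
# The reply for t2 can arrive and be dispatched before t1's.
print(conn.on_reply(t2, "fast result"))   # fast result
```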


Track 2

Cypress Room

Lightning Talks

9:00 am-10:00 am

Lightning Talks are back-to-back five-minute presentations on just about anything.

Managing Bare Metal @Spotify

Drew Michel, Spotify

At Spotify, the way we provision and manage servers has seen many iterative improvements. The latest is a platform that enables squads to fully manage the lifecycle of a server. I’ll go over why this is important for a movement we call "Ops in Squads."

Drew Michel is an SRE at Spotify focusing on providing engineers with a self-service platform for provisioning bare metal hosts.

Don't Fear the Rise of the Machines (or) Why You Can't Scale Ops by Cloning John Connor

Philip Fisher-Ogden, Netflix

We all know what machines are better at than humans, and vice versa. Why, then, do we over-rely on humans to solve problems in production? I think we're afraid of the machines taking over. Once we get over that fear, we can move to more scalable approaches to operations. Machines can be programmed to automatically shed load, engage fallbacks, and gracefully degrade. Insight tools can gather forensic data from affected nodes, search for deadlocks and memory leaks, and perform correlation analysis to identify probable causes of the outage. By the time a human gets paged, the only thing left to do should be higher-order thinking: asking why things failed, rather than gathering signals and twiddling knobs to try to recover while racing the "time to recovery" clock.

John Connor might have been able to fend off Skynet's attempt to end humanity by himself, but I say let's accept the rise of the machines. They’ll do what they do best and we'll do the same. Sure, it might lead to the eventual end of humanity, but until then I'll be able to sleep through a few more pages without production falling over.
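The "engage fallbacks automatically" idea above can be sketched as a generic wrapper. This is a hedged illustration of the pattern, not Netflix's implementation; all names here are invented:

```python
def with_fallback(primary, fallback, failure_threshold=3):
    """Call primary until it fails `failure_threshold` times in a row, then
    degrade gracefully by serving the fallback instead of paging a human."""
    state = {"failures": 0}

    def call(*args, **kwargs):
        if state["failures"] >= failure_threshold:
            return fallback(*args, **kwargs)   # tripped: stop hitting primary
        try:
            result = primary(*args, **kwargs)
            state["failures"] = 0              # success resets the counter
            return result
        except Exception:
            state["failures"] += 1
            return fallback(*args, **kwargs)   # degrade for this request
    return call

def recommend(user):
    raise RuntimeError("personalization service down")

def popular(user):
    return ["top-10 default row"]              # unpersonalized but functional

get_row = with_fallback(recommend, popular)
print(get_row("alice"))   # ['top-10 default row']
```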

Philip Fisher-Ogden, Director of Engineering @ Netflix, ensures "click play" works every time.

Near-line Processing with Apache Samza

Jon Bringhurst, LinkedIn

Apache Samza is a near-line stream processing framework. This lightning talk will briefly cover an example use case. In addition, we'll also talk about how Samza fits into our overall data pipeline here at LinkedIn.

Jon Bringhurst is an SRE for Samza, Kafka, and Zookeeper at LinkedIn.

A Half-Petabyte NAS Using Commodity iSCSI Storage and OpenSource Tools

Benjamin O'Connor, TripAdvisor

At TripAdvisor we built a central storage system for data backups, near-line log aggregation, and general shared storage needs. Clients are mostly NFS and rsync, and data is aged off to S3 after 5 months. We opted against vendor solutions from EMC, NetApp, etc., and built using commodity hardware and open source software. Cheap Dell iSCSI chassis, a couple of Linux servers, 10Gb Ethernet networking, BTRFS, NFS, rsync, and CentOS come together to make the system work. We encountered several challenges in running such a huge BTRFS filesystem and in making a system like this work reliably and consistently without vendor support. We achieved significant cost savings at the expense of some reliability and administration/maintenance time.

Benjamin O'Connor is currently a Technical Operations Engineer at TripAdvisor with over 15 years experience working on everything from large academic systems (UIUC, MIT) to video game backends (Rock Band, Dance Central, Second Life).

Benchmarking TLS With IPython

Chris Niemira, AOL

The talk is about using IPython as a foundation for running distributed performance tests and analyzing the results. I use TLS benchmarking as an example and show our general method for evaluating the performance of hardware acceleration for different cipher suites. The talk touches on a number of topics, but is fundamentally about using a powerful tool in a novel way. I believe it’s interesting for this audience because I don’t see much about either security or benchmarking on the program.

Chris Niemira is a Senior Site Reliability Engineer at AOL and is responsible for helping maintain the performance and availability of AOL's entire portfolio of products.

10:00 am–10:30 am  

Break with Refreshments

Mezzanine East/West

10:30 am–11:25 am  

Track 1

Santa Clara Ballroom

Being Afraid—How Paranoia at Dropbox Protects Your Data

David Mah, Dropbox

Dropbox is built around our users' trust in us to not lose their data, and the engineering challenges to preserve this are immense. We leverage external managed storage vendors in order to help us solve many of these challenges, but we've recently been transitioning some customer data into a storage platform designed for exabyte-scale that is developed from scratch and operated in-house by Dropbox engineers.

A crucial consideration of building this new system is the mitigation of accidental or intentional deletion and corruption through entropy, and we've taken special care to build defenses against these risks. In this talk we will describe the system's inherent failure domains and the operations infrastructure required to maintain strong durability. Attendees should leave the talk with a healthy dose of paranoia for their own systems and practical strategies for mitigating failure in inherently failure-prone infrastructure.

David Mah is a Dropbox SRE characterized by his paranoia when dealing with infrastructure. Sometimes people consider this to be 'attention to detail'; other times it would just be considered paranoia. His current focus is on building protections in Dropbox's in-house storage system to avoid losing user data.


Track 2

Cypress Room

Panel: Fifty Shades of Grey: Different Models for Reliability Work

Fernanda Weiden, Facebook; Stephanie Dean, Dropbox; David Barr, Twitter; Abe Hassan, Google

There are many different approaches to reliability work, with different needs depending on each company's infrastructure. One might spend time dealing with private clouds, public clouds, in-house software, third-party software, developing, watching, instrumenting, and debugging. Sometimes we need to build, sometimes we need to optimize, sometimes we need to un-break things.

How do we build teams that adapt to these different needs, and how can we hire effectively for each of those cases? What are the different approaches we can take to the work? In this panel we'll discuss some of that and compare approaches from different organizations actively working on systems reliability.

Stephanie Dean has built, grown, and supported various types and sizes of operational engineering teams for the past 10 years at companies such as Amazon, Facebook, and Twitter. She's now living the life enabling teams to build out Infrastructure as a TPM at Dropbox.

Dave Barr has been in the industry for 20+ years and a Site Reliability Engineer with Google, StumbleUpon, and Twitter. He now manages several SRE teams at Twitter, covering core Twitter services, Search, and Ads serving.

Abe Hassan is a Site Reliability Manager at Google, working on the Search Indexing team. Prior to joining Google, Abe managed the operations team at Say Media and at Six Apart, responsible for the web operations of blogging services LiveJournal and Typepad.

11:25 am–11:30 am  

Short Break

11:30 am–12:30 pm  

Track 1

Santa Clara Ballroom

Learning from Mistakes and Outages at Facebook

Rajesh Nishtala, Facebook

Facebook, like most large-scale infrastructures, is not immune to failures that cause entire subsystems to degrade or become completely unavailable. These Site Events (or SEVs) can lead to very user-visible performance degradations or, worse yet, downtime. This talk will use major site events as case studies to describe lessons learned that have helped make Facebook more robust.

Rajesh Nishtala is a software engineer on the infrastructure team and currently works on understanding the site's performance and capacity bottlenecks. Rajesh has also spent significant time working on the systems that efficiently serve the social graph (e.g., memcache, TAO, and mcrouter). Rajesh holds a Ph.D. in Computer Science from the University of California, Berkeley. His dissertation focused on high-performance computing and scaling applications to tens of thousands of processor cores.


Track 2

Cypress Room

Panel: The Weeping Angels of Site Reliability

Kurt Andersen, LinkedIn; Ryan Boyd, Groupon; Pete Cheslock, Threat Stack; Mark Risher, Google; Cory Scott, LinkedIn; Jamie Tomasello, AccessNow.Org

Site reliability is not just about keeping your service running, users clicking, and transactions humming along. Your service does not end at your network interconnect. Reliability has to include principles and practices of safe computing for your users, their data, and all of the other systems that your site connects to. Bake security in from the beginning to protect your site, your users, and user data from an increasing array of assailants: DevOps needs InfoSec too.

Threats to the stability of a service can come from poor code, unreliable hardware, or bad network or power connections, but they also come from people who want to exploit the power and resources which you have accumulated to operate and grow your site. Hardware failures and other random acts of nature will test the abilities of your operations teams, but they do not present the same threat as a sentient enemy. And any site with resources of value will have intelligent, resourceful predators. If you haven't seen them, you are not looking in the right places.

This panel of experienced anti-abuse and security experts will cover the threat space for services, from fraudulent accounts to nation state actors who want your user's data, to scammers who want to use your infrastructure to wreak havoc on other portions of the internet ecosystem. Learn about developing a data security action plan for your site.

Join us for an adventure to foil the dark side, and be careful not to blink.

Kurt Andersen has been active in the anti-abuse community for over 15 years and currently leads the Growth and Comm SRE team at LinkedIn, as well as serving as one of the Program Committee Chairs for the Messaging, Malware and Mobile Anti-Abuse Working Group (M3AAWG). He has spoken at M3AAWG, Velocity, and SANOG on various aspects of reliability, authentication and security.

Ryan Boyd manages the global email delivery team at Groupon, which is responsible for email delivery in 48 countries around the world. Ryan’s team integrates within the engineering and marketing organizations at Groupon to maintain industry-leading deliverability rates through establishing and maintaining best practices for email delivery and security.

Pete Cheslock is the head of Threat Stack's operations and support teams. He has over 15 years' experience in DevOps, and understands the challenges and issues faced by security, development, and operations professionals every day, and how we can help deal with them.

Mark Risher runs the Spam & Abuse team at Google, protecting services across the company from attacks of all kinds. Previously, he was CEO and Co-Founder of Impermium (which Google acquired in 2014), and "Spam Czar" at Yahoo! He has regularly presented worldwide to government, industry, and the media about spam, abuse and cyber security issues.

Cory Scott joined LinkedIn in 2013 and currently leads the House Security team, which is responsible for application and infrastructure security across all of LinkedIn. Prior to joining LinkedIn, he was the director of Matasano Security, an information security consultancy. He has spoken at BlackHat, USENIX LISA, SANS, and OWASP.

Jamie Tomasello is the technology director at AccessNow.Org, an international human rights organization dedicated to defending the digital rights of at-risk users. Jamie has been combating Internet abuse and addressing policy issues for more than 13 years. With a background in applied behavior analysis and pattern recognition, Jamie has focused her efforts on analyzing technical data points underlying cybercrime, identifying non-obvious data relationships, and profiling cybercriminals.

12:30 pm–1:30 pm  

Conference Luncheon

Terra Courtyard

1:30 pm–2:30 pm  

Track 1

Santa Clara Ballroom

From Zero to Hero: Recommended Practices for Training your Ever-Evolving SRE Teams

Andrew Widdowson, Google

or, "How can I strap a jetpack to my newbies, while keeping everyone up to speed?"

SRE teams go to where the action is, but when team members are deeply embedded in large-scale problems, little time is left for things like training one's newest teammates. "Here kid, grab a hose and help me fight this fire" only works up to a limit that you will definitely exceed when you're trying to mold your newest systems or software engineer into a fully functional Site Reliability Engineer. Plus, the stack(s) your team is on call for are rapidly evolving, and if you blink, even your most senior SREs can quickly be out of touch with the state of the systems. Uh oh!

The often understated truth is that SREs need to be as good at scaling humans as they are at scaling computers, or better, if they want to keep up with the systems they oversee. How, then, can you keep your existing SREs up to speed and sharp as a tack, while making sure that your newest teammates can learn the ropes and become just as seasoned, sooner rather than later or never?

In this talk, Andrew will share a set of practices we're using at Google to train our next wave of SREs better, stronger, and faster... and then keep them that way! You'll learn about ways to encourage large-scale systems thinking, provide hands-on opportunities for learning, and impress upon new SREs, as quickly as possible, the technical and philosophical subtleties that make the best SREs so effective.

Andrew Widdowson is a Staff Software Engineer in Site Reliability Engineering at Google. In addition to being a long time member of the Search SRE team, he is the tech lead for "SRE EDU", Google's internal efforts to support a culture of teaching of and from SREs. Andrew also leads a team of engineers that fight abusers, scrapers, and attackers of the search stack. In his spare time, Andrew serves as a chair of the Carnegie Mellon University School of Computer Science Alumni Advisory Board.


Track 2

Cypress Room

Panel: Ask Me Anything with the SREcon Chairs and Speakers

2:30 pm–2:35 pm  

Short Break

2:35 pm–3:30 pm  

Track 1

Santa Clara Ballroom

MySQL Automation at Facebook Scale

Shlomo Priymak, Facebook

Facebook has one of the largest MySQL database clusters in the world, comprising thousands of servers across multiple data centers. Operating a cluster of this size requires automating most of what a conventional MySQL Database Administrator (DBA) might do, so that the cluster can almost run itself. Learn about the design and architecture of our automation systems, and hear a few war stories.

Shlomo has been on the MySQL Infrastructure team at Facebook since 2011, managing one of the biggest MySQL clusters in the world—mostly by being lazy and making automation manage it instead of him. Before making the switch to manage 1000s of MySQL servers, Shlomo was pretty happy with 100s of them at companies like Sears and Wix, where he was a DBA and a developer.

Prior to diving into the MySQL world in 2006, Shlomo was a SQL Server DBA at the Israeli Intelligence Corps, but he can't tell you how many servers he managed there.


Track 2

Cypress Room

Panel: Educating SRE

Craig Sebenik, Matterport; David Mah, Dropbox; Andrew Widdowson, Google; Philip Boyle, Facebook

Join us for a panel discussing the various challenges facing the education of SREs. The panel will focus on new SREs coming out of higher education, ongoing education, and how we as a group educate developers.

Training the next generation of SREs coming into our field means figuring out what information is important from other fields, as well as trying to pass on lessons learned from years of experience. We also need to listen to new people who bring an unfettered perspective.

Educating SREs within a company presents slightly different issues, namely keeping up to date on tools that may be very specific to the processes used at any one company. Companies may choose to follow industry norms or may do something totally different, and they need to keep their employees up to date with their specific goals.

Lastly, developers have their own priorities, but we need their help to emit metrics, add caching, and make a host of other code-level changes. We can learn from their experiences and also educate them in our best practices.

Craig Sebenik works for a startup as the only SRE (infrastructure engineer). This presents plenty of opportunities to empower developers to maintain their own code. But it also presents a number of challenges with educating devs on adding metrics, improving deployment, etc. Craig recently left LinkedIn where he was an SRE and led a few different initiatives aimed at training SREs. There are valuable lessons from both large and small companies. Craig has a passion for education and is very interested in sharing knowledge throughout our industry.

David Mah is a Dropbox SRE characterized by his paranoia when dealing with infrastructure. Sometimes people consider this to be 'attention to detail'; other times it would just be considered paranoia. His current focus is on building protections in Dropbox's in-house storage system to avoid losing user data. He recently graduated from the University of Washington and brings the perspective of one who didn't know much about SRE until fairly recently.

Andrew Widdowson is a Staff Software Engineer in Site Reliability Engineering at Google. In addition to being a long time member of the Search SRE team, he is the tech lead for "SRE EDU", Google's internal efforts to support a culture of teaching of and from SREs. Andrew also leads a team of engineers that fight abusers, scrapers, and attackers of the search stack. In his spare time, Andrew serves as a chair of the Carnegie Mellon University School of Computer Science Alumni Advisory Board.

3:30 pm–4:00 pm  

Break with Refreshments

Mezzanine East/West

4:00 pm–5:30 pm  

Closing Talk

Santa Clara Ballroom

Architecting and Launching the Halo 4 Services

4:00 pm-5:00 pm

Caitie McCaffrey, Twitter

Halo 4 is a first-person shooter on the Xbox 360, with fast-paced, competitive gameplay. To complement the code on disc, a set of services were developed and deployed in Azure to store player statistics, display player presence information, deliver daily challenges, modify playlists, catch cheaters, and more. As of June 2013, Halo 4 had 11.6 million players who played 1.5 billion games, logging 270 million hours of gameplay.

The Halo 4 services were built from the ground up to support high demand, low latency, and high availability. In addition, video games have unique load patterns where the majority of the traffic and sales occurs within the first few weeks after launch, making this a critical time period for the game and supporting services. Halo 4 went from 0 to 1 million users on day 1, and 4 million users within the first week.

This talk will discuss the architectural challenges faced when building these services and how they were solved using Windows Azure and Project Orleans. In addition, we'll discuss the path to production, some of the difficulties faced, and the tooling and practices that made the launch successful.


Gold Sponsors

Silver Sponsors

Bronze Sponsors

Media Sponsors & Industry Partners

© USENIX

SREcon is a registered trademark of the USENIX Association.

  • Privacy Policy
  • Contact Us