SREcon15 Europe Programme

All events will take place at The Foundry, which is located inside Gordon House on Barrow Street, Dublin 4, at Google Dublin, unless otherwise noted.

Thursday, May 14, 2015

07:30–09:00  

Badge Pickup

08:00–09:00  

Continental Breakfast

Lounge

09:30–10:00  

Keynote Address

Auditorium

PostOps: Recovery from Operations

9:30 am-10:00 am

Todd Underwood, Google

This is a shorter, tighter, updated version of my LISA13 talk, PostOps: A Non-Surgical Tale of Software, Fragility and Reliability, which was a call (much like NoOps) for the elimination of operations functions rather than simply their automation.

Todd Underwood is Site Reliability Director at Google. Prior to that, he was in charge of operations, security, and peering for Renesys, a provider of Internet intelligence services; and before that he was CTO of Oso Grande, a New Mexico ISP. He has a background in systems engineering and networking. Todd has presented work related to Internet routing dynamics and relationships at NANOG and RIPE, and presented about SRE at LISA.

Available Media
10:00–11:00  

Track 1

Auditorium

Distributed Teams and Successful Remote Engineering

10:00 am-10:30 am

Avleen Vig, Etsy, Inc.

Over half of Etsy's Operations team are remote engineers, spread out between North America and Europe. Most individuals work from home, with little face time with other members of the team. This type of environment is growing more common as organisations look outside their local areas when hiring new staff, and potential employees leverage the current market to relocate out of major metropolitan areas.

Avleen will discuss the challenges faced by engineers, management, and organizations, and ways to address the issues to successfully build distributed teams.

Avleen is a Staff Operations Engineer at Etsy, where he spends much of his time growing the infrastructure for selling knitted gloves and cross-stitch periodic tables. Before joining Etsy, he worked at several large tech companies, including EarthLink and Google, as well as a number of small successful startups.

Available Media

Prometheus: A Next-Generation Monitoring System (Talk)

10:30 am-11:00 am

Julius Volz and Björn Rabenstein, SoundCloud Ltd.

Prometheus is a popular open-source monitoring system and time series database written in Go. It features a multi-dimensional data model, a flexible query language, and integrates aspects all the way from client-side instrumentation to alerting.

In an introductory talk, we will explain the architecture of Prometheus and the motivation behind it. Taking the instrumentation and monitoring of services at SoundCloud as an example, we will demonstrate how Prometheus helps us stay on top of a growing microservice architecture and detect and investigate outages.

In a follow-up workshop, participants will set up all critical components of the Prometheus ecosystem to monitor some toy services.
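
As a flavour of the client-side instrumentation the talk starts from, here is a minimal sketch using the official Python client, prometheus_client; the metric names and port are placeholders for illustration, not SoundCloud's.

    # Minimal sketch of Prometheus client-side instrumentation in Python.
    import random
    import time

    from prometheus_client import Counter, Histogram, start_http_server

    REQUESTS = Counter('app_requests_total',
                       'Total HTTP requests handled.', ['method', 'status'])
    LATENCY = Histogram('app_request_latency_seconds',
                        'Request latency in seconds.')

    @LATENCY.time()
    def handle_request():
        time.sleep(random.uniform(0.01, 0.1))  # stand-in for real work
        REQUESTS.labels(method='GET', status='200').inc()

    if __name__ == '__main__':
        start_http_server(8000)  # exposes /metrics for Prometheus to scrape
        while True:
            handle_request()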

Julius and Björn are production engineers at SoundCloud. Coincidentally, both were SREs at Google in their respective previous lives. Julius is a co-founder of the Prometheus project, and both are maintainers and main contributors.

Available Media

Track 2

What's Up, Doc?

Load Testing at Yandex

10:00 am-10:30 am

Alexey Lavrenuke, Yandex

The talk is about the methodology that we use to measure our services’ performance. I will go through different kinds of tests and some ideas on analyzing their results, and then we’ll discuss Yandex.Tank, an open-source load testing tool that we use.

Alexey Lavrenuke has about 10 years of overall experience in IT. For the past three years, he has worked as a performance testing engineer in the Advertisement Technologies department at Yandex. His job is to provide information about service performance to system administrators and developers. He also makes tools for load testing and test automation and is a committer to the Yandex.Tank open-source project. Alexey has spoken at some of the biggest IT events in Russia and abroad, and co-organized "Load Testing Theory and Practice," the first event in Russia fully dedicated to load testing.

Available Media

Making Every SRE Hire Count

10:30 am-11:00 am

Chris Stankaitis, The Pythian Group

Our industry is changing rapidly to keep up with the demands of the business. This means that many of the tools, processes, and techniques that were perfectly valid only a few years ago are now antiquated and no longer meet our needs. The hiring process is not immune to this shift, and finding the right people to fill a 21st century SRE position is challenging.

From designing and grading technical tests to participating in technical and hiring-manager "fit" interviews, making great people decisions is critical to the success of an SRE team, and the pressure to get it right is high.

We will look at the process, how it has changed, and how we have had to evolve to find the people we are looking for.

Chris Stankaitis is a Manager for the Site Reliability Engineering group at Pythian, an organization providing managed services and premium consulting to companies whose data availability, reliability, and integrity is critical to their business.

Chris is a key member of the hiring team for the Pythian SRE group and has participated in hundreds of candidate screenings and interviews over the past two years, resulting in the hiring of more than 30 Site Reliability Engineers.

Track 3

Farmer's Market

Statistics for Engineers

10:00 am-1:30 pm

Heinrich Hartmann, Circonus

Gathering all kinds of telemetry data is key to operating reliable distributed systems at scale. Once you have set up your monitoring systems and recorded all relevant data, the challenge becomes to make sense of it and extract valuable information, like:

  • Are we fulfilling our SLA?
  • How did our query response times change with the last update?

Statistics is the art of extracting information from data. In this tutorial, we address the basic statistical knowledge that helps you in your daily work as a system operator. We will cover probabilistic models; summarizing distributions with mean values, quantiles, and histograms; and the relations between them. Advanced topics like time series forecasting and scalability analysis will also be touched on.

The tutorial focuses on practical aspects and will give you hands-on knowledge of how to handle, import, analyze, and visualize telemetry data with UNIX command line tools, gnuplot, and the IPython toolkit.

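For a taste of the hands-on portion, here is a minimal sketch of the kind of summary statistics the tutorial covers, run on synthetic latency data (NumPy stands in here for the IPython toolkit):

    # Summarize a latency sample with mean, quantiles, and a histogram.
    # The data below is synthetic.
    import numpy as np

    latencies_ms = np.random.lognormal(mean=3.0, sigma=0.5, size=10_000)

    print('mean   %.1f ms' % latencies_ms.mean())
    for q in (50, 90, 99):
        print('p%-4d  %.1f ms' % (q, np.percentile(latencies_ms, q)))

    # A text histogram: skewed distributions make the mean misleading,
    # which is why quantiles matter for SLAs.
    counts, edges = np.histogram(latencies_ms, bins=20)
    for count, left in zip(counts, edges):
        print('%7.1f ms | %s' % (left, '#' * (50 * count // counts.max())))
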
Heinrich Hartmann is the data science lead at Circonus. He is driving the development of analytics methods that transform monitoring data into actionable information as part of the Circonus monitoring platform.

In his prior life, Heinrich pursued an academic career as a mathematician. He earned his PhD from the University of Bonn (Germany) on geometric aspects of string theory and worked as a researcher for the University of Oxford, UK afterward. In 2012 he transitioned into computer science and now applies his 10+ years of mathematical expertise to data analytics.

11:00–11:30  

Break with Refreshments

Lounge

11:30–13:30  

Track 1

Auditorium

Continuous Improvement Using Comprehensive Root Cause Analysis

11:30 am-12:00 pm

Susan Coghlan, Argonne National Laboratory

At the Argonne Leadership Supercomputer Facility, we operate Mira, a 786K core tightly coupled supercomputer, built for scalable, tightly coupled scientific workloads. In the operation of this system and its predecessor, we have developed a process for continuous system improvement through the performance of root cause analysis of all failed jobs. On a weekly basis, failed jobs are analyzed, correlated with hardware and software failures, and categorized by area and type. The data is rolled up into monthly statistics and analyzed for trends. The trends are then used as a basis for prioritizing system development, hardware and software procurements, and major projects.

Over the last five years, this process has contributed to improved reliability (e.g., a 5x improvement in Mira’s MTTI) and utilization (a 10% increase), and has changed the way we decide which problems to tackle. The tools created and the process refinements made have also cut the time spent on failure analysis each week in half.

In this talk, I'll describe how we've implemented this process, what infrastructure we needed in order to make it work smoothly, and changes to the management process that have taken place as a result of this work.
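
As a hedged illustration of the weekly roll-up described above (the field names and categories here are invented, not Argonne's actual schema):

    # Categorize failed jobs by area and roll them up by month; rising
    # counts in one category become the case for targeted fixes.
    from collections import Counter

    failed_jobs = [
        {'month': '2015-03', 'area': 'hardware', 'type': 'node DIMM error'},
        {'month': '2015-03', 'area': 'software', 'type': 'MPI timeout'},
        {'month': '2015-04', 'area': 'hardware', 'type': 'node DIMM error'},
        # ... one record per failed job, produced by the weekly analysis
    ]

    by_month_area = Counter((j['month'], j['area']) for j in failed_jobs)
    for (month, area), n in sorted(by_month_area.items()):
        print(f'{month}  {area:<10} {n}')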

Susan Coghlan is the Deputy Division Director for the Argonne Leadership Computing Facility (ALCF) and the project director for the facility’s powerful supercomputing systems including Mira, the fifth fastest supercomputer in the world. She is responsible for overseeing the installation of the supercomputers and ensuring they meet the U.S. Department of Energy’s mission needs.

In her previous roles as Associate Division Director and Director of Operations for the ALCF, she was responsible for the installation and operation of the world's fastest open science computer (TOP500 List, June 2008), the ALCF's 557-teraflops Blue Gene/P production system. Susan has worked on parallel and distributed computers for over 25 years, from developing scientific applications, including her work on a model of the human brain at the Center for NonLinear Studies at Los Alamos National Laboratory, to managing ultra-scale supercomputers like ASCI Blue Mountain, a 6,144 processor system at Los Alamos. In 2000, she co-founded a research laboratory in Santa Fe (sponsored by Turbolinux, Inc.) that developed the world's first dynamic provisioning system for large clusters and data centers. Susan is well known within the high-performance computing community, and has presented numerous tutorials, lectures, and papers on her work.

Available Media

Continuous Pipelines at Google

12:00 pm-12:30 pm

Dan Dennison, Google, Inc.

This presentation will focus on the real-life challenges of managing data processing pipelines of any depth and complexity. I’ll cover the frequency continuum, from periodic pipelines that run very infrequently up through continuous pipelines that never stop running, discussing the discontinuities along the axis that can produce significant operational problems. A fresh take on the master-slave model is presented as a better alternative to the periodic pipeline for reliable Big Data scaling.

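As a generic sketch (not Google's implementation) of the leader/worker pattern the talk contrasts with cron-style periodic runs, a leader can hand shards to workers continuously instead of launching one large periodic batch:

    # A leader thread-pool pattern: shards are processed as they arrive
    # rather than in one big periodic run.
    import queue
    import threading

    tasks = queue.Queue()
    results = []

    def worker():
        while True:
            shard = tasks.get()
            if shard is None:          # sentinel: no more work
                break
            results.append(f'processed shard {shard}')
            tasks.task_done()

    workers = [threading.Thread(target=worker) for _ in range(4)]
    for w in workers:
        w.start()

    for shard in range(16):           # the leader enqueues work continuously
        tasks.put(shard)
    tasks.join()
    for _ in workers:
        tasks.put(None)
    for w in workers:
        w.join()
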
Dan Dennison has been a Site Reliability Engineer at Google for the past eight years, working on various large pipelines in Ads Quality. The Ads Quality team at Google improves, expands, and supports one of the world’s largest machine learning deployments. Ads Quality also crawls and indexes all advertisements, and enforces policies on new and existing ads.

Available Media

Perimeter Management at Spotify

12:30 pm-1:00 pm

Alexey Lapitsky, Spotify

This presentation will focus on the web load balancers and various proxy systems across the Spotify perimeter. I’ll explain how we expose our service network to the internet and the challenges we faced while automating that.

I will cover the process that allows our developers to punch holes in the perimeter in an easy and secure way, without any SRE assistance.

Alexey Lapitsky is an SRE at Spotify. In recent years he has been focusing on the operational autonomy of feature developers, load balancing, security, and various availability aspects of the Spotify infrastructure.

Available Media

Upgrade Your Database without Losing Your Data, Your Perf, or Your Mind

Charity Majors, Parse/Facebook

Upgrading databases can be terrifying and perilous, and for good reason: you can totally screw yourself! Every workload is unique, and standardized test suites will never give you enough information to evaluate how an upgrade will perform for your query set. We will talk about how paranoid you should be about various types of workloads and upgrades, how to balance risk vs. engineering effort, and how to safely execute the most challenging upgrades by capturing and replaying real production workloads. The principles apply to any database, but we’ll go particularly deep into war stories and tooling options for MongoDB and MySQL.

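To make the capture-and-replay idea concrete, here is a hedged sketch; the capture file format, host names, and use of PyMySQL are assumptions for illustration, not the tooling the talk describes.

    # Replay recorded production queries against an old and a candidate
    # server, then compare tail latencies before cutting over.
    import json
    import time

    import pymysql  # or your database's client library

    def replay(conn, capture_path):
        timings = []
        with open(capture_path) as f:
            for line in f:
                query = json.loads(line)['sql']   # one recorded query per line
                start = time.monotonic()
                with conn.cursor() as cur:
                    cur.execute(query)
                    cur.fetchall()
                timings.append(time.monotonic() - start)
        return timings

    old = replay(pymysql.connect(host='db-old-shadow'), 'prod_queries.jsonl')
    new = replay(pymysql.connect(host='db-candidate'), 'prod_queries.jsonl')
    for name, t in (('old', old), ('new', new)):
        t = sorted(t)
        print(name, 'p99: %.3fs' % t[int(len(t) * 0.99)])
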
Charity Majors is a Production Engineering Manager on Parse, now working at Facebook. She has 10+ years of experience wrangling scaling problems and databases, and she has the scars to prove it. She loves free software, free speech, and free single malt scotch.

Available Media

Track 2

What's Up, Doc?

Building Blocks of MySQL Automation

5:00 pm-5:30 pm

Simon Martin, Facebook Dublin

This talk describes the building-block applications used to manage a large number of MySQL servers, and how they interact: automated master promotions, fast automatic fault detection, fast failover, and node fencing combine to provide a reliable system that is unlikely to lose a transaction in the event of a server failure, while keeping downtime to less than a few seconds.

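For readers unfamiliar with the moving parts, a generic sketch of a fence-then-promote sequence follows; all helper functions are hypothetical placeholders, and this is not Facebook's actual tooling.

    # Generic automated master-promotion sequence: fence first, then promote
    # the most caught-up replica.
    def failover(old_master, replicas):
        # 1. Fence the old master so it cannot accept writes if it returns
        #    (e.g., revoke its VIP, set read_only, or power it off).
        fence(old_master)

        # 2. Pick the replica that has applied the most of the old master's
        #    log; promoting anything less risks losing acknowledged
        #    transactions.
        candidate = max(replicas, key=lambda r: r.applied_log_position)

        # 3. Promote it and repoint the surviving replicas.
        candidate.stop_replication()
        candidate.set_writable()
        for replica in replicas:
            if replica is not candidate:
                replica.change_master_to(candidate)

        # 4. Publish the new topology so clients find the new master fast;
        #    fast detection plus fast repointing keeps downtime to seconds.
        update_service_discovery(master=candidate)
        return candidate
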
Simon Martin is a Production Engineer whose primary role is to write applications that facilitate the management of Facebook's MySQL infrastructure, keeping the operational overhead small in comparison to the growing fleet of servers to manage.

Available Media

Track 3

Farmer's Market

(continued from 10:00 session)

Statistics for Engineers

10:00 am-1:30 pm

Heinrich Hartmann, Circonus

13:30–14:30  

Lunch

Garage Cafe

14:30–16:30  

Track 1

Auditorium

Running Open Compute Hardware in Facebook's Data Centers

2:30 pm-3:00 pm

Joel Kjellgren, Facebook

Facebook runs some of the world's most efficient data centers that are powered by hardware from the Open Compute project. Hear more on how Facebook designs, deploys, automates, and maintains Open Compute hardware at scale, and how we engineer the building, hardware, network, and software to operate as efficiently as possible.

Joel Kjellgren is the manager for Facebook's award-winning data center site in Luleå, Sweden. In that role, he leads the teams operating the first data centers outside of the U.S. and the first to use 100% Open Compute Project hardware. Built to be cooled only using the arctic air in northern Sweden and powered 100% by renewable energy, the Luleå data center sets a new standard for efficiency and sustainability.

Prior to joining Facebook, Joel worked with large-scale IT and data center infrastructure at European internet businesses such as Spray, Lycos, Ongame, and Bwin.

Available Media

Disruptive Data Center Networks

3:00 pm-3:30 pm

Jason Hoffman, Ericsson

Jason Hoffman is the Head of Cloud Technology at Ericsson, where he's responsible for product architecture and engineering. Previously he was the Head of Product Line, Ericsson Cloud System and Platforms in Business Unit Cloud and IP. Prior to that he was a founder and the CTO at Joyent, an early high-performance cloud IaaS and software provider, where he ran product, engineering, operations, and commercial management for nearly a decade. He is considered to be one of the pioneers of large-scale cloud computing, in particular the use of container technologies, asynchronous high-concurrency runtimes, and converged server, storage, and networking systems. Jason is also an angel investor, strategy and execution advisor, venture and private equity advisor, and on the boards of the WordPress Foundation and New Context, a Digital Garage company. Jason has a B.S. and M.S. from UCLA and a Ph.D. from UCSD. He is a San Francisco native and now lives in Stockholm with his wife and daughters.

Available Media

Hardware Panel

3:30 pm-4:30 pm

Panel: Susan Coghlan, Argonne National Laboratory; Narayan Desai and Jason Hoffman, Ericsson; John Looney, Google; Joel Kjellgren, Facebook

Available Media

Distributed Consensus Algorithms for Extreme Reliability

4:00 pm-4:30 pm

Laura Nolan, Google

Distributed consensus algorithms (such as Paxos, ZAB, and Raft) deal with reaching agreement among a group of processes connected by an unreliable communications network. For decades these protocols were largely of interest to academics, but in today's world they are incredibly relevant as a way of building reliable, stateful, multihomed systems that do not require a lot of human effort to run.

This talk will cover:

  • Use cases for distributed consensus algorithms (and what can go wrong with ad-hoc solutions): split-brain problems, faulty group membership, avoiding the need for humans to resolve bad data resulting from these problems, etc.
  • System-level design patterns that have been used successfully to create reliable systems at scale: replicated state machines, replicated datastores, highly available services with a single leader and failover, distributed queuing and messaging, distributed coordination and locking
  • Performance of these protocols in different operating environments and geographical deployments
  • Scaling these systems: batching, sharding
  • Pitfalls and monitoring

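To make the replica-count arithmetic concrete, here is a small worked example (not tied to any one protocol): majority quorums are the building block all of the listed algorithms share, and they are why consensus groups run with odd numbers of replicas.

    # Majority-quorum arithmetic shared by Paxos, ZAB, and Raft: any two
    # majorities of n replicas must intersect, so with 2f+1 replicas the
    # group survives f failures without risking split brain.
    def quorum(n: int) -> int:
        return n // 2 + 1

    for n in (3, 5, 7):
        print(f'{n} replicas: quorum {quorum(n)}, '
              f'tolerates {n - quorum(n)} failure(s)')
    # 3 replicas: quorum 2, tolerates 1 failure(s)
    # 5 replicas: quorum 3, tolerates 2 failure(s)
    # 7 replicas: quorum 4, tolerates 3 failure(s)
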
Laura Nolan has been a Site Reliability Engineer at Google for the past two years, and has worked with multiple large systems that use distributed consensus to achieve reliable, stable multihoming that really works without human babysitting. Prior to that she kept the e-commerce site gilt.com fast and stable during flash sales, and worked for the Irish software company Curam Software, which is now part of IBM Smarter Cities. She is a keen traveller and scuba diver and an international-level weightlifting referee.

Available Media

Track 2

Zinc/Copper/Bronze

Unconference

2:30 pm-4:30 pm

The Unconference area is a reserved room for attendees to propose their own talks. This may also be useful for follow-up discussion on topics of interest.

Track 3

Farmer's Market

Machine Learning for Machine Data

2:30 pm-4:30 pm

Adam Oliner and Jacob Leverich

Machine learning is a process for generalizing from examples. In this hands-on tutorial, we'll apply state-of-the-art methods to data from production systems to perform a variety of tasks relevant to SREs, including identifying anomalous systems, classifying messages by their content, and forecasting system state.

This will be a practical tutorial, aimed to instruct both about the capabilities of machine learning as well as its limitations. Machine learning topics include anomaly detection, classification, and clustering. We assume no prerequisite knowledge of machine learning, though some familiarity with statistics and linear algebra will be helpful.
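
As a taste of the anomaly-detection topic, a minimal z-score example on synthetic per-host error rates:

    # Flag hosts whose error rate sits far from the fleet mean.
    # The data below is synthetic.
    import numpy as np

    rates = np.random.normal(loc=2.0, scale=0.3, size=100)  # errors/sec/host
    rates[17] = 9.0                                         # one bad host

    z = (rates - rates.mean()) / rates.std()
    anomalous = np.flatnonzero(np.abs(z) > 3)
    print('anomalous hosts:', anomalous)                    # -> [17]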

16:30–17:00  

Break with Refreshments

Lounge

17:00–17:30  

Track 1

Auditorium

Debugging and Extending Distributed Coordination Systems

5:00 pm-5:30 pm

Raúl Gutiérrez Segalés, Twitter

At Twitter, we use Apache ZooKeeper as a cornerstone for nearly all of our distributed systems. SREs run ZooKeeper clusters and write tools for diagnosing bad actors. They also extend the service to make it more reliable.

During this talk, we'll go over the tools we've written (and open sourced!) to monitor our systems and debug problems seen at large scale. In addition, as part of our work to rate-limit our myriad clients, we’ve developed several protocol extensions for ZooKeeper without breaking backwards compatibility. We’ll discuss the design considerations involved when extending a service with a multitude of client library versions, as well as how it has enabled us to improve our ability to quickly experiment and iterate.

Raúl is a Staff Site Reliability Engineer at Twitter, with experience on the Coordination Team as well as the Traffic Team. He is the primary author of ZKTraffic and several of the extensions that we use to keep ZooKeeper sane at Twitter.

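As a small, generic illustration of this kind of monitoring (not one of the tools from the talk), ZooKeeper's built-in "four letter word" commands, such as mntr, expose server metrics over a bare TCP socket:

    # Query a ZooKeeper server's `mntr` command and parse the tab-separated
    # metrics it returns. Requires Python 3.8+ for the walrus operator.
    import socket

    def zk_mntr(host='localhost', port=2181):
        with socket.create_connection((host, port), timeout=5) as sock:
            sock.sendall(b'mntr')
            data = b''
            while chunk := sock.recv(4096):
                data += chunk
        metrics = {}
        for line in data.decode().splitlines():
            key, _, value = line.partition('\t')
            metrics[key] = value
        return metrics

    stats = zk_mntr()
    print(stats.get('zk_server_state'), stats.get('zk_outstanding_requests'))
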
Available Media

Track 2

What's Up, Doc?

Unconference

5:00 pm-5:30 pm

The Unconference area is a reserved room for attendees to propose their own talks. This may also be useful for follow-up discussion on topics of interest.

Track 3

Farmer's Market

(continued from 14:30 session)

Machine Learning for Machine Data

2:30 pm-4:30 pm

Adam Oliner and Jacob Leverich

17:30–18:00  

Thursday Closing Talk

Auditorium

Operational Software Design

5:30 pm-6:00 pm

Theo Schlossnagle, Circonus

There are two common tenets of operations: "hell is other people's software," and "better software is produced by those forced to operate it." In this session I'll take a fly-by tour of two pieces of software that were built from the ground up for operability, drawing on the hard-earned teachings of their inoperable predecessors: a distributed datastore replacing PostgreSQL, and a message queue replacing RabbitMQ.

We'll discuss specific design aspects that increase resiliency in the event of failure and observability at all times.

After earning undergraduate and graduate degrees from Johns Hopkins University in computer science with a focus on randomized algorithms in distributed systems, Theo went on to research resource allocation techniques in distributed systems during four years of post-graduate work. Since starting OmniTI in 1997, Theo has gone on to found three software companies and produce and participate in countless open source projects.

A widely respected industry thought leader, Theo is the author of Scalable Internet Architectures (Sams) and is a frequent speaker at worldwide IT conferences. Theo is a computer scientist in every respect. He is a member of the ASF and the IEEE, and a senior member of the ACM.

Available Media
18:30–20:30  

Reception at The Westin Dublin

Sponsored by Facebook
Ticket required for admission. Attendees who did not add this option may modify their registration to include a reception ticket as long as tickets remain available.

Friday, May 15, 2015

08:30–09:00  

Continental Breakfast

Lounge

09:00–11:00  

Track 1

Auditorium

Dr NMS or: How Facebook Learned to Stop Worrying and Love the Network

9:00 am-9:30 am

Jose Leitao and David Rothera, Facebook

Want to learn how Facebook operates its global network to support more than 1.3 billion users? We will describe the technologies and methods we use to manage Facebook's production network. The neteng org at Facebook has built or leveraged several systems for managing and operating the production network, including an audit framework, alarm daemons, drainers, and an automatic remediation engine. This talk will focus on these technologies and how they have helped improve user experience, manage complexity, automate day-to-day operations, mitigate impact, and increase reliability.

Jose Leitao and David Rothera are production netengs in the Network Infrastructure Engineering team at Facebook. Their team responsibilities include maintaining, monitoring, and improving the global production network infrastructure.

Available Media

How Container Clusters Like Kubernetes Change Operations

9:30 am-10:00 am

Brendan Burns, Google

Containers have been at the core of how many web-scale companies build their distributed systems for years. More recently containers have become an increasingly popular tool for developers everywhere. As containers are moved from a developer's workstation to production, they are generally managed by cluster management systems like Kubernetes or Omega.

These cluster management systems offer significant opportunities to improve operations across the data center. By separating concerns along a variety of axes (developers from managing machines, applications from the OS on which they run, etc.), container cluster management systems enable ops specialization. Different operations teams can focus on kernel and machine maintenance, cluster operations, and application operations, and they can do these jobs in relative isolation. This focus and isolation means that the teams are significantly more productive, and less likely to make mistakes due to inexperience or interactions that they don't fully understand.

Additionally, the homogenization of the container cluster, and the presence of the cluster management system, make management tools (roll-outs, monitoring, health maintenance) a property of the cluster environment, not of each individual application. This means that these tools are deployed once for an entire cluster, and they are designed and deployed by experts who are, again, specialized to their specific task.

Finally, container clusters and cluster management APIs enable the switch to immutable infrastructure and the development of patterns that make it possible to do operations without manually manipulating individual machines.

Brendan Burns is a Staff Software Engineer at Google, Inc and a founder of the Kubernetes project. He works in the Google Cloud Platform, leading engineering efforts to make the Google Cloud Platform the best place to run containers. He also has managed several other cloud teams, including the Managed VMs team and Cloud DNS. Prior to Cloud, he was a lead engineer in Google's web search infrastructure, building backends that powered social and personal search. Prior to working at Google, he was a professor at Union College in Schenectady, NY. He received a PhD in Computer Science from the University of Massachusetts Amherst, and a BA in Computer Science and Studio Art from Williams College.

Available Media

DHCP Infrastructure Evolution at Facebook and the Importance of Designing Stateless Services

10:00 am-10:30 am

Angelo Failla, Facebook

Facebook is one of the largest sites in the world, with multiple datacenters (and POPs in multiple continents) hosting a pretty large amount of machines. This talk is about the evolution of the DHCP production infrastructure at Facebook.

In this talk we will use the DHCP case as an example to discuss why it's good to design your systems to be stateless, and the fine line between leveraging OSS projects where possible and taking a “Not Invented Here” approach instead. We will also talk about the challenges of driving large-scope projects from remote offices and the importance of possessing skills in both systems and software development.

We'll look at DHCP at Facebook in both the IPv4 and IPv6 worlds, dive into the old architecture and its limitations, and then talk about how the Cluster Operations team in Dublin leveraged the ISC Kea open source project to migrate from a stateful service to a stateless one, discussing the challenges faced in the process and the benefits we gained.

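A generic illustration of the stateless idea (not Facebook's implementation): if the reply is a pure function of the request plus static inventory data, any server instance can answer any request and no shared lease database is needed.

    # Stateless address assignment: the offered IP is derived from a
    # hypothetical out-of-band inventory, never remembered in a lease DB,
    # so servers can be restarted, replaced, or load-balanced freely.
    INVENTORY = {
        '00:25:90:aa:bb:01': '10.1.0.11',
        '00:25:90:aa:bb:02': '10.1.0.12',
    }

    def offer(mac: str) -> str:
        # Each reply is computed, not stored, so no state to replicate.
        return INVENTORY[mac]

    print(offer('00:25:90:aa:bb:01'))  # -> 10.1.0.11
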
Angelo is a Production Engineer at Facebook. He joined the company in early 2011 as a Site Reliability Engineer and recently moved to the Cluster Operations Team. In this period he has contributed to various projects, like our cluster turnup tool Kobold and F.B.A.R. (the Facebook Auto Remediation tool). More recently he has been involved in revamping the DHCP architecture for the Facebook production network, which he will discuss in this talk. He is interested in automation tools and large-scale distributed systems.

Available Media

Going Off the Rails: Infrastructure Outage Planning

10:30 am-11:00 am

Matt Provost, Avere Systems

Over the 2014 Christmas holiday there was major engineering work scheduled on some of London's main train lines into King's Cross Station. This overran the outage window by a day and significantly disrupted the travel plans of tens of thousands of passengers. In January, Network Rail published a detailed report with their findings detailing the causes of the overrun.

What can SREs learn from physical infrastructure maintenance and outage procedures? Even in this new era of cloud computing, someone still has to build and maintain the infrastructure that keeps the cloud working. With redundant systems and even data centres, there are still going to be times when the underlying infrastructure needs an outage, whether for preventative maintenance to replace older equipment or for planned work on electrical or HVAC systems.

What lessons can be learned from the Christmas King's Cross outage that we can apply to data centre infrastructure? Even with cloud providers this is still relevant: Verizon Cloud had a 40-hour scheduled outage in January. Although there were underlying technical problems during the railway maintenance which added up and caused the delays, the larger problems were around planning, staff rotation, and communication. Teams in the field were focused on solving technical problems as they came up, and not escalating these problems to managers who could see the bigger picture and communicate with other teams to form an overall strategy and make go/no-go decisions based on accurate information. This situation can be even worse in a data centre where time estimation is notoriously unreliable.

In this talk, I will break down and analyse the King's Cross report and relate each finding back to the data centre environment.

Matt Provost started as a systems administrator in 1998. Before moving to London in 2014 he was the Systems Manager at Weta Digital in Wellington, New Zealand, where he oversaw the commissioning of a new water cooled data centre which hosted the render infrastructure for Avatar, Rise of the Planet of the Apes, Tintin, and the Hobbit films. During the production of Avatar, this hosted 7 systems in the Top 500 Supercomputer list. Matt has presented at LISA conferences about storage performance management, monitoring, and complex systems failure analysis.

Available Media

Track 2

Lighthouse Cinema

(session starts at 09:30)

Software Defined Data Center is More Than a Software Defined Network

9:30 am-10:00 am

Astrid Kreissig, IBM

Cloud-scale automation requires significant changes in the operation of a data center. Depending on the usage pattern of the cloud, either cloud provider or cloud consumer, the impact to your data center differs. In the industry, the technologies covered under the term "Software Defined Data Center (SDDC)" are supposed to address these changes. But are Software Defined Network and Software Defined Storage really enough? I will show that there is more required to achieve a true "Software Defined Environment (SDE)." This talk will describe an architectural view on all aspects of SDE and how it addresses the requirements triggered by cloud environments.

After achieving a degree in Computer Science from RWTH Aachen, Astrid joined IBM Research and Development as a developer. She went through a career as software and firmware developer for System z and Power Systems, reaching the level of an IT architect. After that Astrid changed direction into technical sales as an IT architect for IBM systems as well as for IBM cloud solutions. She consulted clients in Europe, the U.S. and Asia on their transition into the cloud. The clients either wanted to expand their business and become a cloud provider or wanted more efficiency in their data centers by consuming cloud offerings. This gave Astrid a good background to understand the benefits of a Software Defined Environment. In 2014 she joined a team that developed a client adoption model for SDE.

Available Media

HTTPS and Forward Secrecy at Scale

10:00 am-10:30 am

Chris Niemira, AOL

For years, the conversation about using TLS was little more than an argument over whether proper key management was worth the effort. But today’s news reports are riddled with stories about data theft and Internet espionage, and secure content delivery is the new normal. Now SSL is dead, new CVEs show up fast and furious, and our reaction time to the latest bug reports is measured not only in hours but customers at risk. In a world where we expect forward secrecy and elliptic curves to save us, we have to realize that it’s never as easy as flipping a few switches. We have to balance the performance and cost implications of different grades of security while keeping an eye on both compatibility and the latest threats.

This talk will discuss what forward secrecy is, and how it’s achieved. It will also describe the mechanics of Diffie-Hellman exchanges and how we measure the “cost” of enabling them on different platforms, as well as the benefits of ECC. We discuss how we validate and benchmark different points of encryption termination (notably appliance ADCs). We will specifically describe how we used our methodology to accomplish the HTTP to HTTPS migration of our webmail platform, and how we overcame the problems that we ran into along the way.
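
For reference, the mechanics of a Diffie-Hellman exchange fit in a few lines. This is a toy (the prime is far too small, and real TLS uses standardized 2048-bit groups or their elliptic-curve equivalents), but it shows where forward secrecy comes from: fresh exponents generated per session and then discarded.

    # Toy Diffie-Hellman key exchange. Only g^x mod p crosses the wire; an
    # eavesdropper would have to solve a discrete-log problem to recover
    # the shared secret.
    import secrets

    p = 2**127 - 1                        # a Mersenne prime; toy-sized only
    g = 3

    a = secrets.randbelow(p - 2) + 2      # client's ephemeral secret
    b = secrets.randbelow(p - 2) + 2      # server's ephemeral secret

    A = pow(g, a, p)                      # sent in the clear
    B = pow(g, b, p)                      # sent in the clear

    assert pow(B, a, p) == pow(A, b, p)   # both sides derive the same key
    # Discarding a and b after the session is what makes the secrecy
    # "forward": a later key compromise can't decrypt recorded traffic.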

Chris Niemira is an AOL veteran who spent over seven years running the public gateways for the AOL Mail system, one of the world's largest email platforms. Today, he works as a reliability engineer, writing tools and running analyses to help to ensure the performance and availability of many of AOL's high traffic properties across the Internet. He previously spent time building solutions and running web properties in the Banking and Pharmaceutical industries (as well as some dot-coms we won't talk about), and also is currently pursuing an MBA and Master of Finance.

Available Media

SRE UIs—Transcending the CLI

10:30 am-11:00 am

Michael Avrukin, Google

SRE tooling is often limited to command line interfaces, making it inaccessible to a "non-SRE" audience. The barrier to entry for building usable UIs for SRE often seems insurmountable and plagued with horror stories, compounded by the fact that SRE engineers tend not to have the training to do UI design. This talk will present a framework and approach for integrating and building UIs into the SRE toolbox. Topics covered will include general architectural approaches to library design for SRE tooling in order to facilitate UI buildout, an overview of some basic UI pitfalls and best practices, an approach to crafting your own unique toolset with minimal development investment, and ongoing maintenance.

Michael Avrukin is a Site Reliability Engineering Manager at Google. Prior to Google Michael worked at numerous startups, helping to scale and build out teams across multiple countries and continents working with AWS, AliYun, GCE, and other cloud platforms to deliver enterprise-level QoS solutions and services. Michael spent his early career doing UI development on Windows, OS X, and later with early JavaScript before transitioning to back-end and infrastructure engineering where he brought his passion for visual eye candy.

Available Media

Track 3

Zinc/Copper/Bronze

Unconference

9:00 am-11:00 am

The Unconference area is a reserved room for attendees to propose their own talks. This may also be useful for follow-up discussion on topics of interest.

Track 4

Kelvin

Non-Abstract Large Scale Design

9:00 am-1:30 pm

Laura Nolan and Diego Elio Pettenò, Google

This is a half-day version of the SRE Classroom event that has been run at other USENIX conferences in 2013 and 2014. Unlike previous SRE Classroom events, it won't include much in the way of formal presentations. We expect this event to appeal to senior and experienced engineers.

The workshop problem will be a real-world large-scale engineering problem that requires some distributed-systems knowhow to tackle. Attendees will work in groups with Googlers on the problem, trying to come up with:

  • A high-level design that scales horizontally
  • Initial SLIs, SLOs
  • Estimates for hardware required to run each component
  • If time permits, monitoring design and disaster testing scenarios

Attendees will be provided with a workbook that guides them through tackling a problem like this, with some worked examples, so junior attendees should be able to make progress and learn too.
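
For a flavour of the workbook's approach, here is a back-of-envelope estimate of the kind attendees will produce; every input number below is invented for illustration.

    # Back-of-envelope capacity estimation for a hypothetical service.
    qps = 100_000                     # peak queries per second
    bytes_per_query = 2_000           # average response payload
    replication = 3

    bandwidth_gbps = qps * bytes_per_query * 8 / 1e9
    print('egress: %.1f Gb/s' % bandwidth_gbps)               # 1.6 Gb/s

    stored_tb = 10e9 * bytes_per_query * replication / 1e12   # 10B records
    print('storage: %.0f TB replicated' % stored_tb)          # 60 TB

    qps_per_server = 5_000            # measured per-machine capacity
    n = qps / qps_per_server
    print('servers: %d (plus headroom for failures)' % (n * 1.3))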

Laura Nolan has been a Site Reliability Engineer at Google for the past two years, and has worked with multiple large systems that use distributed consensus to achieve reliable, stable multihoming that really works without human babysitting. Prior to that she kept the e-commerce site gilt.com fast and stable during flash sales and worked for the Irish software company Curam Software, which is now part of IBM Smarter Cities. She is a keen traveller and scuba diver and an international-level weightlifting referee.

11:00–11:30  

Break with Refreshments

Lounge

11:30–13:30  

Track 1

Auditorium

Facebook Cache Invalidation Pipeline

11:30 am-12:00 pm

Melita Mihaljevic, Facebook

“There are only two hard things in Computer Science: cache invalidation and naming things.”—Phil Karlton

Facebook serves 1.3 billion users across multiple regions. To make sure that all users have a consistent experience on the site, we built a cache invalidation pipeline. This talk will cover the cache invalidation pipeline for both of our caching solutions, Memcache and TAO, across multiple regions. The talk will also touch a bit on how we monitor and debug cache consistency problems.

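An illustrative sketch of the fan-out pattern (not Facebook's pipeline): writes commit to the source of truth, then invalidations propagate to every region's cache so stale copies are dropped rather than updated in place.

    # Cross-region cache invalidation in miniature. Dicts stand in for the
    # cache tier and database; real pipelines do step 2 asynchronously.
    REGIONS = {'us-east': {}, 'us-west': {}, 'eu': {}}

    def write(key, value, db):
        db[key] = value                 # 1. commit to the source of truth
        invalidate(key)                 # 2. fan out the invalidation

    def invalidate(key):
        for cache in REGIONS.values():
            cache.pop(key, None)        # drop stale entry; next read refills

    def read(key, region, db):
        cache = REGIONS[region]
        if key not in cache:
            cache[key] = db[key]        # cache miss: refill from database
        return cache[key]

    db = {'user:42': 'old name'}
    print(read('user:42', 'eu', db))    # warms the eu cache
    write('user:42', 'new name', db)    # commit + cross-region invalidation
    print(read('user:42', 'eu', db))    # refilled with the new value
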
Melita Mihaljevic is a Production Engineer on the Global Consistency team at Facebook. The team is the first responder for keeping the cache consistent and for the health of Facebook's proprietary real-time data streaming infrastructure. They ensure that users have a consistent experience on the site across all their devices. Melita is actively developing and maintaining one of the cache invalidation pipeline services.

Available Media

Configuration Pinocchio: The Lies Plainly Seen and the Quest to Be a Real Discipline

12:00 pm-12:30 pm

Andre Masella, Ontario Institute for Cancer Research

Creating configuration files has always been pushed into the domain of “not programming,” but configuration files have a way of growing more complex. There is a struggle between keeping a configuration terse, by having the system infer information automatically, and explicit without having duplication. Either the configuration file develops embedded domain-specific programming languages (e.g., Apache, Asterisk, Postfix) or a text-based macro language is put in front (e.g., M4-based wrapper for sendmail, automake as an M4-based wrapper around Make). The middle path is to put a structured macro language in front of a configuration; a language with the smarts to semantically verify the configuration (unlike a text-based macro language) and that has well-defined, observable semantics outside the binaries being configured (unlike an in-configuration DSL).

There are three languages currently working toward this goal: NixOS, Jsonnet, and Flabbergast. Both Jsonnet and Flabbergast descend from Google's proprietary configuration language[4] with the intention of changing design decisions that made this language difficult to use.

In particular, one common myth is that the complexity of a configuration is determined by the structure of the configuration format. That is, there is a tacit assumption that INI configurations are semantically simpler than JSON ones, which is demonstrably false. By understanding how semantically rich configurations really are, writing configurations can be elevated to a real discipline, related to but distinct from general-purpose programming, that focuses on making configurations easier to understand, easier to write, and more sophisticated.
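
To give a flavour of the structured-macro approach (sketched here in Python; Jsonnet and Flabbergast have their own syntax), a template can carry defaults and semantic checks while each instance overrides only what varies:

    # Hypothetical sketch of a structured, semantically checked config template.
    from dataclasses import dataclass, replace

    @dataclass(frozen=True)
    class ServerConfig:
        port: int = 8080
        threads: int = 16
        backend: str = "localhost:9090"

        def validate(self):
            # Semantic checks that a text-based macro language cannot express.
            if not 1 <= self.port <= 65535:
                raise ValueError(f"port out of range: {self.port}")
            if self.threads < 1:
                raise ValueError("need at least one thread")

    base = ServerConfig()
    canary = replace(base, port=8081, threads=2)   # inherit everything, override two fields
    for cfg in (base, canary):
        cfg.validate()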

Andre Masella previously worked at Google as an SRE supporting the AdSense serving stack. While there, he spent most of his time refactoring the configuration files of the serving stack in Google's much-despised proprietary configuration language and developing idioms to manage the complexity. After leaving, he set out to create a configuration language with the same expressive power, but simpler to understand and write effectively, yielding Flabbergast.

Available Media

A Scalable and Resilient Microservice Environment with Apache Mesos and Apache Aurora

12:30 pm-1:00 pm

Florian Pfeiffer, Gutefrage.net GmbH

"Treat your data center as a single machine." This idea has been getting more and more traction over the last years. Beside a couple of other projects, there's Apache Mesos which is providing a solution to easily create this one big single pool of your resources. There are companies which are running it on 10,000s of machines, but for an architecture that relies on easy scalability and good resilience, it completely makes sense to run it on a small cluster as well. A framework that runs upon Mesos is the scheduler Aurora, which takes care about how many instances of a job should run on which machines, and reschedules running jobs if machines in your cluster die.

"Treat your data center as a single machine." This idea has been getting more and more traction over the last years. Beside a couple of other projects, there's Apache Mesos which is providing a solution to easily create this one big single pool of your resources. There are companies which are running it on 10,000s of machines, but for an architecture that relies on easy scalability and good resilience, it completely makes sense to run it on a small cluster as well. A framework that runs upon Mesos is the scheduler Aurora, which takes care about how many instances of a job should run on which machines, and reschedules running jobs if machines in your cluster die.

After introducing these projects, I will show how and why Gutefrage.net has glued these technologies together with Jenkins to implement a continuous-deployment workflow that handles our 100+ daily deployments on a relatively small Mesos cluster, with the goal of providing a fault-tolerant, low-latency user experience to our customers.
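
For flavour, Aurora jobs are declared in a Python-based DSL. A minimal definition, along the lines of the upstream tutorial (the cluster, role, and resource numbers below are placeholders), looks roughly like this:

    # hello_world.aurora -- minimal Aurora job definition (placeholder names/numbers).
    hello = Process(
        name='hello',
        cmdline='python hello_world.py')

    task = Task(
        name='hello_task',
        processes=[hello],
        resources=Resources(cpu=0.5, ram=128 * MB, disk=256 * MB))

    jobs = [Service(
        cluster='devcluster',       # placeholder cluster name
        role='www-data',            # placeholder role
        environment='prod',
        name='hello_world',
        task=task,
        instances=3)]               # Aurora keeps 3 instances running, rescheduling on failure

Such a job would typically be submitted with a command like "aurora job create devcluster/www-data/prod/hello_world hello_world.aurora".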

Florian is Head of Data and Infrastructure at gutefrage.net. Before that, he learned the ropes at Yahoo! Together with his team, he takes on the challenges that running Germany's biggest Q&A site brings. For an agile company, this means not only the usual scaling and high-availability topics, but also multiple daily releases and branchless development with feature switches.

In February 2014 he introduced Mesos as the basic building block for the next-generation platform for gutefrage.net.

Available Media

Bad Machinery: Managing Interrupts Under Load

1:00 pm-1:30 pm

Dave O'Connor, Google Dublin

Lots of thought is given to how to organise oncall rotations: around people's schedules, around periods of critical coverage, and with fairness in mind. Less thought is given to the human aspect of oncall: how it affects people's ability to get other work done, their general cognitive flow state, and burnout rates. This talk will present a paper used internally in several SRE teams at Google to organise rotations around people, bearing in mind that people are not machines.

Dave O'Connor is a Senior Site Reliability Manager at Google. He has been at Google for almost 11 years, 9 of which were spent oncall and organising oncall rotations. He has spent time on several teams in Google SRE, and currently manages the teams that run Google's storage in their Dublin, Ireland office. His specialty is being spectacularly grumpy at being interrupted, both for himself and on others' behalf; he has spent many years building teams that handle heavy interrupt load without everyone hating their lives.

Available Media

Track 2

Lighthouse Cinema

Heraclitus Wears Prada: DevOps, SRE, and Organisational Activities

11:30 am-12:00 pm

Niall Murphy, Google

Taking some lessons from Meryl Streep as Anna Wintour, and the ancient Greek philosopher Heraclitus, we look at the role of SRE and devops within organisations. What is it that we do, and why? What are some organisational factors that contribute to the success of an SRE/devops group, and its failure? What incentives need to be in place, and what happens if they are missing?

Niall Murphy is currently head of the Ads Reliability Engineering function in Google Dublin, ostensibly in charge of the infrastructure supporting $18bn/quarter. Prior to Google he was in Amazon.com Network Engineering, and a variety of self- or other-founded startups and Irish Internet institutions. He is the author of a book, a lecture course, and numerous articles, and is probably one of the few people in the world with a degree in Computer Science & Mathematics, and Poetry Studies.

Available Media

Show Me the Metrics: OpenTSDB and OpenTSP at Betfair

12:00 pm-12:30 pm

James Brooks and Vilian Atmadzhov, Betfair

Time series metrics can be an important part of a comprehensive monitoring solution. Betfair will present a talk on their experiences running OpenTSDB and a new open source tool called OpenTSP, designed to streamline the process of gathering and delivering system metrics quickly and reliably to multiple endpoints so that you can use any of your favourite tools to analyse the stream.
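
OpenTSP's own wire details are part of the talk, but the destination side is simple to picture: OpenTSDB ingests data points over a plain line protocol ("put <metric> <timestamp> <value> <tags...>"). A minimal Python sketch of a collector pushing one point follows; the host name is a placeholder, and 4242 is OpenTSDB's default ingest port.

    # Push a single data point to OpenTSDB's telnet-style ingest port.
    import socket, time

    def put_metric(metric, value, tags, host="tsdb.example.com", port=4242):
        line = "put %s %d %s %s\n" % (
            metric, int(time.time()), value,
            " ".join("%s=%s" % kv for kv in sorted(tags.items())))
        with socket.create_connection((host, port), timeout=5) as sock:
            sock.sendall(line.encode("ascii"))

    put_metric("sys.cpu.user", 42.5, {"host": "web01", "dc": "dub"})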

James Brooks is a Senior Engineer with Betfair's site reliability engineering team. Having spent time at Sun Microsystems and Interoute Communications, James is keenly aware of the power of well-implemented monitoring to be a driver for continuous improvement.

Vilian Atmadzhov is the youngest member of Betfair's SRE team. Fresh from university, he has taken on a number of projects at Betfair, including the design of a config generator for Riemann, to allow our teams to rapidly deploy alerting and dashboarding functionality.

Available Media

Lightning Talks

12:30 pm-1:30 pm

The programme committee will select a number of 5-minute lightning talks from those proposed by attendees, on topics that interest them. These talks usually spur further discussion in the breakout areas.

Track 3

Steak House

(session starts at 12:30)

PostMortem Facilitation: Theory and Practice of "New View" Debriefings

12:30 pm-5:30 pm

John Allspaw, Etsy

Many organizations want to learn from failures. Postmortem debriefings and documents are a part of that learning process. However, leading a debriefing is not straightforward and done haphazardly can bring cultural and technical damage to an organization.

This 3-hour session will cover the theory and fundamentals of the “New View” on complex systems failure and “human error”, as well as techniques for facilitating an adverse event debriefing. Attendees should walk away with a more evolved sense of accident/outage investigation and a model to explore in their own organizations.

What will be covered

  • Foundations and limitations of generating post-hoc narratives
  • Fundamentals of the New View: accountability, responsibility, risk, and "safety"
  • Debriefing techniques to facilitate dialogue with diverse perspectives and potential cognitive biases
  • Plotting your exploration of dynamic fault management: the phases of anomaly response, communication, and diagnosis
  • Interviewing tips and tricks: handling defensiveness and setting the stage for a productive and blame-free environment
  • We will use case studies of known accidents and outages to discuss concepts
  • How to think about the scope, purpose, and implementation of recommendations and remediation items

The session is intended to be very interactive, and sections will require back-and-forth with the attendees on the various topics.

Background reading for attendees can be found here: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.411.4985&rep=rep1&type=pdf

Track 4

Kelvin

(continued from 09:00 session. Prometheus workshop starts at 12:30)

Non-Abstract Large Scale Design

9:00 am-1:30 pm

Laura Nolan and Diego Elio Pettenò, Google

This is a half-day version of the SRE Classroom event that has been run at other USENIX conferences in 2013 and 2014. It won't include much in the way of formal presentations, as previous SRE Classroom events have done. We expect this event to appeal to senior and experienced engineers.

The workshop problem will be a real-world large-scale engineering problem that requires some distributed-systems know-how to tackle. Attendees will work in groups with Googlers on the problem, trying to come up with:

  • A high-level design that scales horizontally
  • Initial SLIs, SLOs
  • Estimates for hardware required to run each component
  • If time permits, monitoring design and disaster testing scenarios

Attendees will be provided with a workbook that guides them through tackling a problem like this, with some worked examples, so junior attendees should be able to make progress and learn too.

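To give a sense of the "estimates for hardware" step, a back-of-the-envelope calculation of the sort the workbook walks through might look like this (all numbers below are made up for illustration):

    # Back-of-envelope hardware estimate (all numbers hypothetical).
    import math

    peak_qps = 500_000           # expected peak queries per second
    qps_per_server = 5_000       # assumed single-machine serving capacity
    replication = 3              # copies across failure domains
    bytes_per_response = 2_048

    servers = math.ceil(peak_qps / qps_per_server) * replication
    egress_gbps = peak_qps * bytes_per_response * 8 / 1e9

    print(f"{servers} servers, ~{egress_gbps:.1f} Gbit/s egress at peak")
    # -> 300 servers, ~8.2 Gbit/s egress at peak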

Laura Nolan has been a Site Reliability Engineer at Google for the past two years, and has worked with multiple large systems that use distributed consensus to achieve reliable, stable multihoming that really works without human babysitting. Prior to that she kept the e-commerce site gilt.com fast and stable during flash sales and worked for the Irish software company Curam Software, which is now part of IBM Smarter Cities. She is a keen traveller and scuba diver and an international-level weightlifting referee.

Prometheus: A Next-Generation Monitoring System (Workshop)

12:30 pm-1:30 pm

Julius Volz and Björn Rabenstein, SoundCloud Ltd.

Prometheus is a popular open-source monitoring system and time series database written in Go. It features a multi-dimensional data model, a flexible query language, and integrates aspects all the way from client-side instrumentation to alerting.

In an introductory talk, we will explain the architecture of Prometheus and the motivation behind it. Taking the instrumentation and monitoring of services at SoundCloud as an example, we will demonstrate how Prometheus helps us stay on top of a growing microservice architecture and detect and investigate outages.

In this workshop, participants will set up all critical components of the Prometheus ecosystem to monitor some toy services.

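As a taste of the client-side instrumentation step, this is roughly what instrumenting a toy service looks like with the Python client library; the metric names and port here are illustrative.

    # Instrument a toy service with prometheus_client (pip install prometheus_client).
    import random, time
    from prometheus_client import Counter, Histogram, start_http_server

    REQUESTS = Counter("toy_requests_total",
                       "Total requests handled", ["method"])
    LATENCY = Histogram("toy_request_duration_seconds",
                        "Request latency in seconds")

    @LATENCY.time()                       # records how long each call takes
    def handle_request():
        REQUESTS.labels(method="GET").inc()
        time.sleep(random.random() / 10)  # pretend to do some work

    if __name__ == "__main__":
        start_http_server(8000)           # exposes /metrics for Prometheus to scrape
        while True:
            handle_request()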

Julius and Björn are production engineers at SoundCloud. Coincidentally, both were SREs at Google in their respective previous lives. Julius is a co-founder of the Prometheus project, and both are maintainers and main contributors.

Available Media

13:30–14:30  

Lunch

Lounge

14:30–16:30  

Track 1

Auditorium

Infrastructure Kata and Moving a Large Enterprise to the Cloud

2:30 pm-3:00 pm

Keith Mosher, Yelp

Up until a few years ago, Yelp had a traditional datacenter-based, bare-metal infrastructure for the production web site. The flexibility and dynamic nature of cloud storage and data processing environments was our first introduction to the cloud, and was soon heavily adopted internally. As the company continued to expand, both in its home (U.S.-based) territory, and internationally, scaling infrastructure in traditional data centers became expensive and constraining.

The operations team’s board-level goal therefore became to build out production environments in the cloud, initially in one region and then in other (more remote) regions. Taking an application and infrastructure built for a traditional environment and porting them to run in the cloud is not always a simple endeavour. Each build-out has involved a significant amount of work to clean up and iterate on our architecture; however, at each step, pragmatic compromises have had to be made to deliver business value.

This talk will recap the first several iterations of this build-out, concentrating on the problems that we encountered both technically with the new deployments and socially as the team scaled and became international. The focus will be on the "Kata" of each step—deliberate meditations on the nature of our engineering and social problems, and the improvements we made by repeatedly solving the same problem(s) again and again in an iterative manner as we built out each new environment.

Equal coverage will be given to things that worked out well (being a wise investment of time, or not too painful to postpone until the next iteration) and to those that didn’t: where we spent resources automating things not on the critical path, or chose not to tackle a problem that caused trouble later.

Tomas works on infrastructure automation and hybrid cloud deployments at large scale for Yelp. He speaks regularly at devops events and other technical conferences on topics spanning testing, development, architecture, automation, and systems administration. Tom came to the dark side of systems and devops after many years as a professional Perl developer. He's an avid open source contributor and lives in London with his fiancé and far too many cats.

Available Media

Signatures, Patterns, and Trends: Timeseries Data Mining at Etsy

3:00 pm-3:30 pm

Andrew Clegg, Etsy

Etsy loves metrics. Everything that happens in our data centres gets recorded, graphed, and stored. But with over a million metrics flowing in constantly, it’s hard for any team to keep on top of all that information. Graphing everything doesn’t scale, and traditional alerting methods based on thresholds become very prone to false positives.

That’s why we started Kale, an open-source software suite for pattern mining and anomaly detection in operational data streams. These are big topics with decades of research, but many of the methods in the literature are ineffective on terabytes of noisy data with unusual statistical characteristics, and techniques that require extensive manual analysis are unsuitable when your ops teams have service levels to maintain.

In this talk I’ll briefly cover the main challenges that traditional statistical methods face in this environment, and introduce some pragmatic alternatives that scale well and are easy to implement (and automate) on Elasticsearch and similar platforms. I’ll talk about the stumbling blocks we encountered with the first release of Kale, and the resulting architectural changes coming in version 2.0. And I’ll go into a little technical detail on the fingerprinting and anomaly detection algorithms we apply to metrics and their associated statistical metadata. These techniques have applications in clustering, outlier detection, similarity search, and supervised learning, and they are not limited to the data centre but can be applied to any high-volume timeseries data.
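
Kale's actual algorithms are the subject of the talk; as general background, the shape of a threshold-free anomaly check can be sketched with a robust statistic such as the median absolute deviation. The snippet below is purely illustrative and is not Kale's implementation.

    # Robust anomaly check via median absolute deviation (illustrative only).
    import statistics

    def is_anomalous(series, latest, threshold=3.5):
        """Flag `latest` if it sits too many robust deviations from the median."""
        med = statistics.median(series)
        mad = statistics.median(abs(x - med) for x in series)
        if mad == 0:
            return latest != med
        # 0.6745 scales the MAD so it is comparable to a standard deviation.
        return abs(0.6745 * (latest - med) / mad) > threshold

    history = [12, 11, 13, 12, 14, 12, 13, 11, 12, 13]
    print(is_anomalous(history, 12))   # False
    print(is_anomalous(history, 40))   # True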

Andrew joined Etsy in 2014, becoming their first data scientist outside the USA. In the past he has worked on visualization, information retrieval, and data mining techniques for streaming data, so he naturally gravitated towards the Kale project and the broader issues of operational data mining, areas ripe for fruitful collaboration between devops and data science specialists. He has an MSc in bioinformatics and a PhD in natural language processing, both from the University of London.

Available Media

Building A Billion User Load Balancer

4:00 pm-4:30 pm

Patrick Shuff, Facebook

Want to learn how Facebook scales its load balancing infrastructure to support more than 1.3 billion users? We will reveal the technologies and methods we use to globally route and balance Facebook's traffic. The Traffic team at Facebook has built several systems for managing and balancing our site traffic, including both a DNS load balancer and a software load balancer capable of handling several protocols. This talk will focus on these technologies and how they have helped improve user performance, manage capacity, and increase reliability.
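
Facebook's actual stack is what the talk covers; as general background, one building block such a software load balancer can use is a consistent-hash pick over healthy backends, so a given client keeps hitting the same backend unless the backend set changes. A small illustrative Python sketch (not Facebook's code; backend names invented):

    # Illustrative consistent-hash backend selection (not Facebook's code).
    import bisect, hashlib

    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    class ConsistentHashRing:
        def __init__(self, backends, vnodes=100):
            # Place each backend at many points on the ring to smooth the load.
            self._ring = sorted(
                (_hash("%s#%d" % (b, i)), b) for b in backends for i in range(vnodes))
            self._keys = [h for h, _ in self._ring]

        def pick(self, client_key):
            idx = bisect.bisect(self._keys, _hash(client_key)) % len(self._ring)
            return self._ring[idx][1]

    ring = ConsistentHashRing(["edge-dub-1", "edge-dub-2", "edge-lhr-1"])
    print(ring.pick("203.0.113.7"))    # stable pick for this client key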

Patrick Shuff is a production engineer on the traffic team at Facebook. His team's responsibilities include maintaining and monitoring the global load balancing infrastructure, DNS, and our content delivery platform (i.e., photo/video delivery). He has also been on the global site reliability team, where he worked with various infrastructure teams (messaging, real-time infrastructure, email) to help increase service reliability and monitoring for their services.

Available Media

Mapping a Service-oriented Architecture

4:30 pm-5:00 pm

Peter Bourgon

The modern service-oriented architecture is a rat's nest of incidental complexity. Truly understanding what's running in production is a massive task, made more complex by heterogeneous runtimes, frameworks, and idioms.

In this talk, I explore the problem space of mapping process-to-process communication in a large network. I enumerate the options we've explored, focusing on their costs and benefits. Finally, I present Cello, a tool I've developed over the past 18 months that deduces a service topology as a directed graph.
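
The collection options and Cello itself are the subject of the talk; the reduction step at the end is the easy part, which is worth seeing to appreciate where the real difficulty lies (observation, attribution, and scale). A toy Python sketch with invented records:

    # Toy sketch: reduce observed connections to a directed service graph
    # (records and service names are hypothetical).
    from collections import defaultdict

    connections = [                      # (source service, destination service)
        ("web", "auth"), ("web", "search"),
        ("search", "index"), ("web", "auth"),
    ]

    graph = defaultdict(set)             # adjacency: service -> set of callees
    for src, dst in connections:
        graph[src].add(dst)

    for service in sorted(graph):
        print("%s -> %s" % (service, ", ".join(sorted(graph[service]))))
    # search -> index
    # web -> auth, search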

Peter Bourgon is a distributed systems engineer who has seen things. He's currently working on the infrastructure that powers SoundCloud, the web's biggest audio platform.

Available Media

Track 2

Lighthouse

What Our SRE Culture Looks Like Now, and How It Got There

2:30 pm-3:00 pm

Simon McCartney, HP Cloud Services

HP Cloud’s current SRE team was built from a private cloud operations team that spent too long fighting fires caused by other people's bad decisions. This talk will chart the progress so far, and the steps yet to come, as we transition from a fire-fighting team that engineered its way out of a dire situation to an SRE group helping our business unit build better products and services. We’ll discuss how we turned a bad reputation into one where colleagues in dev, QA, operations, and documentation have faith and confidence in our abilities. We’ll also discuss the interaction models we follow for the products and services we contribute to.

Simon works for HP Cloud Services, having worked in both development and operations at companies large and small over his 20 years in IT. At HP, Simon has been the Technical Lead for a private OpenStack cloud used to host public services, and is now the Technical Lead for HP’s Platform SRE team. He lives in Northern Ireland with his wife Emma and three beautiful children, Zoe, Rory, and Ollie, who make sure Simon enjoys life away from the keyboard.

Canonical's DevOps Culture

3:00 pm-3:30 pm

Tom Haddon, Canonical

In this talk I'll explain our approach to "DevOps" at Canonical. I'll go into some detail on the specific tools we use, which are all Open Source, including OpenStack, Juju, and Mojo. I'll explain some of the policy choices we've made about the interaction between developers and sysadmins, and where we plan to improve things in the future.

Tom Haddon is a squad lead within Canonical’s IS department. He manages a globally distributed team of senior systems administrators rotating between three functions: Projects, Operations, and Webops (devops). He has a strong focus on cloud technologies including OpenStack, Juju, and MAAS, and has worked most recently on helping to architect and implement Canonical’s approach to DevOps.

Available Media

Track 3

Steak House

(continued from 12:30 session)

PostMortem Facilitation: Theory and Practice of "New View" Debriefings

2:30 pm-5:30 pm

John Allspaw, Etsy

Many organizations want to learn from failures. Postmortem debriefings and documents are a part of that learning process. However, leading a debriefing is not straightforward and done haphazardly can bring cultural and technical damage to an organization.

This 3-hour session will cover the theory and fundamentals of the “New View” on complex systems failure and “human error”, as well as techniques for facilitating an adverse event debriefing. Attendees should walk away with a more evolved sense of accident/outage investigation and a model to explore in their own organizations.

What will be covered

  • Foundations and limitations of generating post-hoc narratives
  • Fundamentals of the New View: accountability, responsibility, risk, and "safety"
  • Debriefing techniques to facilitate dialogue with diverse perspectives and potential cognitive biases
  • Plotting your exploration of dynamic fault management: the phases of anomaly response, communication, and diagnosis
  • Interviewing tips and tricks: handling defensiveness and setting the stage for a productive and blame-free environment
  • We will use case studies of known accidents and outages to discuss concepts
  • How to think about the scope, purpose, and implementation of recommendations and remediation items

The session is intended to be very interactive, and sections will require back-and-forth with the attendees on the various topics.

Background reading for attendees can be found here: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.411.4985&rep=rep1&type=pdf

Track 4

Kelvin

(continued from 12:30 session)

Prometheus: A Next-Generation Monitoring System (Workshop)

2:30 pm-4:30 pm

Julius Volz and Björn Rabenstein, SoundCloud Ltd.

Prometheus is a popular open-source monitoring system and time series database written in Go. It features a multi-dimensional data model, a flexible query language, and integrates aspects all the way from client-side instrumentation to alerting.

In an introductory talk, we will explain the architecture of Prometheus and the motivation behind it. Taking the instrumentation and monitoring of services at SoundCloud as an example, we will demonstrate how Prometheus helps us stay on top of a growing microservice architecture and detect and investigate outages.

In this workshop, participants will set up all critical components of the Prometheus ecosystem to monitor some toy services.

Julius and Björn are production engineers at SoundCloud. Coincidentally, both were SREs at Google in their respective previous lives. Julius is a co-founder of the Prometheus project, and both are maintainers and main contributors.

Available Media

16:30–17:00  

Break with Refreshments

Lounge

17:00–18:00  

Closing Session

Auditorium

Closing Words

Gareth Eason, Google; Narayan Desai, Ericsson; John Looney, Google

Available Media