SREcon18 Europe/Middle East/Africa Conference Program

Wednesday, 29 August, 2018

08:00–09:00

Morning Coffee and Tea

Rheinlandsaal Ballroom Foyer

09:00–10:50

Opening Plenary

Rheinlandsaal Ballroom

Circonus: Design (Failures) Case Study

Wednesday, 09:00–09:40

Theo Schlossnagle and Heinrich Hartmann, Circonus

Available Media

The Circonus platform is a telemetry (time-series) ingest, storage, and analysis platform that provides engineers with tooling to manage systems via SLOs. As SREs, we use SLOs to manage Circonus. Herein lie some interesting recursive lessons. This talk will detail the systems architecture from inception to current day including a migration from bare-metal to Google Cloud. Along this path have been many crimes against computing. I will talk specifically about the architectural evolution as punctuated by my failure.

The Founder/CEO of Circonus, Theo Schlossnagle is a practicing software engineer and serial entrepreneur. At Johns Hopkins University he earned undergraduate and graduate degrees in computer science, with a focus on graphics and randomized algorithms in distributed systems. Theo founded four technology startups focusing on large systems scalability and distributed systems. He is a Distinguished Member of the ACM, sits on the ACM Practitioners Board, and serves as co-chair for ACM Queue.

Connect:

@postwait

Heinrich Hartmann is the Analytics Lead at Circonus. He is driving the development of analytics methods that transform monitoring data into actionable information as part of the Circonus monitoring platform. In his prior life, Heinrich pursued an academic career as a mathematician. Later he transitioned into computer science and worked as a consultant for a number of different companies and research institutions.

Connect:

@heinrichhartman

SRE Theory vs. Practice: A Song of Ice and TireFire

Wednesday, 09:40–10:20

Corey Quinn, Last Week in AWS, and John Looney, Facebook

Available Media

In many technical talks, you see a speaker from a renowned tech company stand up and describe a perfect utopia of an environment. You look at the perfect environment and dedicated hordes of senior engineers they describe, and you despair of ever getting to that point. Your environment looks nothing like that.

Surprise—their environment doesn't really look like that either! In this talk, a speaker from an unnamed tech unicorn describes their amazing environment—and then what they just said gets translated from "thought leader" into plain English for you by an official SREcon translator. Stop feeling sad—everything is secretly terrible!

Currently a Cloud Economist at the Quinn Advisory Group, and an advisor to ReactiveOps, Corey has a history as an engineering director, public speaker, and cloud architect. He specializes in helping companies address horrifying AWS bills, and curates LastWeekinAWS.com, a weekly newsletter summarizing the latest in AWS news, blogs, and tips, sprinkled with snark. Outside of his professional work, Corey is known for overdressing, telling entertaining stories, and carrying a cigarette case full of tiny umbrellas.

Connect:

@Quinnypig

John Looney is a Production Engineer at Facebook, improving datacenter automation. Previously, he helped build a modern SaaS-based infrastructure platform for Intercom, one of the fastest-growing technology companies in the world. Before that, he was a full-stack SRE at Google, who did everything from rack design and data-center automation through ads-serving, stopping at GFS, Borg, and Colossus along the way. He wrote chapters for The SRE Book and Seeking SRE, and is on the steering committee for USENIX SREcon.

Data Protection Update and Tales from the Introduction of the GDPR

Wednesday, 10:20–10:50

Simon McGarr, Data Compliance Europe

Available Media

The General Data Protection Regulation (GDPR) came into force on 25 May 2018, and this, in addition to other current legal developments such as cases relating to the Privacy Shield, makes the area of data protection rather fast-moving and interesting at the moment. SRE organisations are often involved in ensuring compliance with these initiatives. This talk will be an update on current events and an analysis of how they may impact SREs in the immediate future.

Simon McGarr is recognised as one of Ireland’s leading experts in Data Protection. A practising solicitor, Data Protection consultant and external DPO, he has lectured in the Law Society, regularly appears on national media discussing data issues and was recently invited by the Irish parliament to give evidence on the implementation of the GDPR. He has been involved in many of the landmark cases developing Data Protection law in the EU and focuses much of his work on helping organisations to understand their data protection law needs.

10:50–11:20

Break with Refreshments

Rheinlandsaal Ballroom Foyer

11:20–12:30

Track 1

Rheinlandsaal Ballroom A

What Makes a Good SRE: Findings from the SRE Survey

Wednesday, 11:20–12:00

Dawn Parzych, Catchpoint

Available Media

Site Reliability Engineering is a relatively new discipline when it comes to careers, having only been in existence for about 15 years. While 15 years may seem like an eternity, the SRE role can be considered to be in its infancy. This leads to challenges defining the role and understanding exactly what it is. Browsing through job descriptions or speaking with other SREs you will see many different job descriptions and responsibilities.

Earlier this year Catchpoint conducted a survey of 416 professionals with the title or responsibility of an SRE in an attempt to create a real-world profile of the SRE and the organizations where they work. This session will review the findings of the survey. Learn what the top technical and non-technical skills are, and whether they vary by industry or size of the company. What surprised us in the findings? What additional questions arose while analyzing the results? And how can the survey results be used to help organizations build out an SRE team?

Dawn is a Director at Catchpoint where she uses her storytelling prowess to write and speak about the intersection of technology and psychology. She makes technical information accessible, avoiding buzzwords and jargon whenever possible. Dawn has spoken at DevOpsDays, Velocity, Interop, and Monitorama. Her articles have appeared in numerous technical publications. She uses her non-existent spare time to organize web performance meetups and serve as a chapter organizer for Write/Speak/Code, a non-profit organization that empowers women and non-binary coders to become speakers, writers, and leaders.

Sustainability Starts Early: Creating a Great Ops Internship

Wednesday, 12:00–12:30

Fatema Boxwala, University of Waterloo

Available Media

Representing the next generation of engineers, interns are integral to a sustainable SRE team. Why, then, do some revolutionize your deployment process, while others never make it past their first commit? Interns and junior engineers can be particularly susceptible to burnout, and effective mentorship can make all the difference.

With little effort, you can drastically increase the likelihood of successful internships. Supporting your intern with realistic projects and expectations, reasonable docs, and frequent feedback can tip the scales. Unfortunately, there is very little information out there about what this really looks like. I’ve known students who chose to leave tech entirely after an internship, and others who were inspired to start a career in operations. In this talk, you’ll hear about how you can set your intern up for success with concrete steps and examples.

Fatema Boxwala is a CS student at the University of Waterloo. At school she’s involved with the Women in Computer Science Committee and the Computer Science Club, occasionally teaching people about Python, Git and Systems Administration. At work she’s been an intern at Rackspace, Yelp, and Facebook.

Connect:

@fatty_box

Track 2

Rheinlandsaal Ballroom BC

The Silver Lining Consortium: Post-Mortems for the Rest of Us

Wednesday, 11:20–12:00

Niall Murphy, Microsoft

Available Media

This talk describes why the industry would really be much better off if we shared our post-mortems more widely, and in a more structured way. It shows why it benefits both cloud consumers and cloud providers. It proposes the creation of a consortium to do precisely this, and suggests how to express interest in joining.

Niall Richard Murphy is Director of Engineering for Microsoft Azure Ireland, where his group works on cloud engineering systems and SRE. He is the instigator, co-author, and co-editor of two books on SRE, and a history of the Irish Internet. He is the holder of degrees in Computer Science and Mathematics, and Poetry Studies. He lives in Dublin with his wife and two children.

Connect:

@niallm

Migrations under Production Load: How to Switch Your Database without Disrupting Service

Wednesday, 12:00–12:30

Vilde Opsal, SoundCloud

Available Media

How do you handle live traffic and serve correct results while switching databases? In this talk, Vilde will share a general step-by-step approach to live database migrations from a backend application developer perspective. Using two recent migrations as examples she will walk through what went wrong, what went right, the patterns such projects follow, and how you can apply this to your own projects. You will hear about a database migration from MySQL to Cassandra in a high-traffic critical production system and a data migration that involved recreating the whole database in a complex legacy system.

Vilde Opsal is a feminist developer from Norway, currently living in Berlin by way of California.

Track 3

Hegel Room

Instrumenting an Existing Service for Monitoring

Wednesday, 11:20–12:30

Ingvar Mattsson

Space is limited: add this to your schedule if you plan to attend.

A new service lands on your desk, ready for production! Only, it doesn't seem to have any monitoring, and the dev team have moved on to more urgent needs. Starting with a design document, an existing code-base, an editor and a compiler, we will take this from "runs," to "runs, and is suitable for production use."

Some familiarity with Go is expected. There's (probably) no need for extensive coding, but the workshop does include modifying code and then compiling it. During the workshop, you will need a Go compiler and a way of cloning git repositories, as well as a way of running the resulting binaries.

Ingvar Mattsson has 20+ years in the "unix admin" trenches paired with an unhealthy fascination for monitoring and alerting systems.

Track 4

Leibniz Room

Data Visualization for SREs—An Essential Skill for Quick Debugging

Wednesday, 11:20–12:30

Yash Shah, LinkedIn

Space is limited: add this to your schedule if you plan to attend.

SREs are software engineers with a broad skill set who work with systems in general. Depending on the type of work and teams, our time is usually spent correlating incident data to determine the causes of issues. While we use ELK, Splunk, etc. to visualize our logs, it’s an essential skill to parse a log file by hand and visualize it to make useful observations quickly. Many times, we end up writing APIs and command line shortcuts to accelerate our debugging. We can make use of some of the techniques I’ll show you to visualize this data quickly.

WHY this talk?

SREs usually come from more architectural/back-end backgrounds and generally lack experience working with front-end and visualization. The techniques I’ll show will hopefully be helpful to SREs in day-to-day scenarios.

Yash Shah is a site reliability engineer at LinkedIn.

Connect:

@yashness_

12:30–14:00

Luncheon

Sponsored by Booking.com

Aristoteles, Platon, and Sokrates Rooms

14:00–15:30

Track 1

Rheinlandsaal Ballroom A

The 7 Deadly Sins of Documentation

Wednesday, 14:00–14:50

Chastity Blackwell, Yelp

Available Media

Documentation can often be forgotten when it comes to the pillars of SRE practice; developing tools and new infrastructure has supremacy, and a constant sense of urgency leads to documentation being a "we'll get around to it later" task. Even in places where documentation is actually written, it's often done quickly or poorly in the first place, not maintained, or not organized in a way that makes it easy to use. This can impair onboarding, obfuscate problems, and impair incident response. But if done correctly, documentation can preserve institutional knowledge, streamline incident response, and allow new team members to quickly begin contributing. This talk will discuss the biggest problems surrounding creating, maintaining, and providing utility with documentation, and how to solve them.

Chastity Blackwell has worked in Operations for almost 20 years in a variety of environments, from the hallowed halls of the University of Illinois to a smaller startup, and now works for Yelp as an SRE. In addition, she has done work as a freelance writer and content developer for CCP, the developers of Eve Online.

Connect:

@Black_Isis

Y2K and Other Disappointing Disasters: Risk Reduction and Harm Mitigation

Wednesday, 14:50–15:30

Heidi Waterhouse, LaunchDarkly

Available Media

Every disaster is a concatenation of smaller failures. How can we design software and processes to accept that we live in an imperfect world? Explore the concepts of resiliency, harm reduction, over-engineering, and planning for failure with real examples.

Risk Reduction is trying to make sure bad things happen as rarely as possible. It's anti-lock brakes and vaccinations and irons that turn off by themselves and all sorts of things that we think of as safety modifications in our life. We are trying to build lives where bad things happen less often.

Harm Mitigation is what we do so that when bad things do happen, they are less catastrophic. Building fire sprinklers and seatbelts and needle exchanges are all about making the consequences of something bad less terrible.

This talk is focused on understanding where we can prevent problems and where we can just make them less bad, and what kinds of tools we can use to make every disaster a disappointing fizzle.

Heidi is a developer advocate with LaunchDarkly. She delights in working at the intersection of usability, risk reduction, and cutting-edge technology. One of her favorite hobbies is talking to developers about things they already knew but had never thought of that way before. She sews all her conference dresses so that she's sure there is a pocket for the mic.

Connect:

@wiredferret

Track 2

Rheinlandsaal Ballroom BC

Availability, Latency, and Cost: Withstanding Regional Outages

Wednesday, 14:00–14:50

Aaron Blohowiak, Netflix

Available Media

Running in multiple regions is better for your users through increased availability and lower latencies, and it won't cost as much as you think. We've turned region resiliency from a driver of cost and complexity into a strategic advantage by understanding human and system dynamics both at a high-level and in the nitty-gritty details. Calamity, heartbreak, and inefficiency drove us to refine our approach—and our understanding—as we've matured.

Executing a failover used to be an all-hands-on-deck situation that would bring VPs to the table. Now, it's a matter of routine that usually concludes with a brief "all is well" email.

This talk dives into the experiences of operating in multiple regions at scale and the algebraic models, code and incident management playbooks we've developed to tame, refine, and leverage our approach. Once you've decided to go multi-region, the three major questions that arise are: how many regions, how should we steer users to regions, and how do we actually perform the failover? In addition to the story of how we got to where we are, I'll present the design considerations and system models we used to make those decisions.

Aaron has been building, breaking, and fixing systems for over a decade, from tiny startups to serving over 100M users at Netflix. He is presently applying his passion for empiricism and system design to multi-region high-availability architecture and operations on the Traffic team at Netflix. Previously, Aaron co-authored Chaos Engineering (O'Reilly, 2017).

Connect:

@aaronblohowiak

SRE for Mobile Applications

Wednesday, 14:50–15:30

Samuel Littley, Google

Available Media

In the server side world, we can and do lean heavily on redundancy, scaling, and direct control to engineer reliability; however, in the mobile world these well known facets of SRE (amongst others) are virtually non-existent. We typically have no binary rollbacks or downgrades, no forced updates/upgrades, and no ability to turn it off and back on again. Users rely on mobile applications more and more, and it's important that SREs consider client-side reliability. This talk discusses practices, principles, and processes that can be applied in the mobile world to make client-side code a first-class citizen, along with real-world case studies of what has and hasn't worked.

Note: Much of the talk will centre around Android, although much of what is discussed can be applied to other platforms.

Samuel Littley has been a Site Reliability Engineer at Google for 2 years, and is currently working in London on a team supporting the Google Search App and Google Play Services, as well as working on a wider effort supporting production best practices for Google's wide array of mobile applications.

Track 3

Hegel Room

Statistics for Engineers

Wednesday, 14:00–17:30

Heinrich Hartmann, Circonus

Space is limited: add this to your schedule if you plan to attend.

Gathering all kinds of telemetry data is key to operating reliable distributed systems at scale. Once you have set up your monitoring systems and recorded all relevant data, the challenge becomes to make sense of it and extract valuable information. Some key questions become:

How to interpret the telemetry data that is emitted from the systems you are running?
How to measure the quality of APIs you provide and consume?
How to aggregate metrics from single nodes to service-level views?

In this workshop we will address these questions with statistical methods like: data visualisation, averages, percentiles, outlier-analysis, histograms, regressions, robustness, and mergeability. We will cover the material from a theoretical and a practical perspective. Bring pen and paper and a laptop!

Heinrich Hartmann is the Analytics Lead at Circonus. He is driving the development of analytics methods that transform monitoring data into actionable information as part of the Circonus monitoring platform. In his prior life, Heinrich pursued an academic career as a mathematician. Later he transitioned into computer science and worked as a consultant for a number of different companies and research institutions.

Connect:

@heinrichhartman

Track 4

Leibniz Room

SRE Classroom, Or, How to Design a Distributed System in 3 Hours

Wednesday, 14:00–17:30

Salim Virji, Fabian Geisberger, and Jean Joswig, Google

Space is limited: add this to your schedule if you plan to attend.

Available Media

This workshop ties together academic and practical aspects of systems engineering, with an emphasis on applying principles of systems design to a production service. We will analyze the service to quantify its performance, and iteratively improve the design. Participants will work together in small groups to sketch out the design, identify components and their relationships, and to assess the suitability of the design to the system’s Service Level Objective (SLO).

Participants will have a system design and bill of materials at the conclusion of this workshop.

Participants will not need laptops or specific coding experience; participants will need enthusiasm for collaborating in small groups, and for discussion-based problem-solving. Participants will come away with an understanding of the principles of iterative systems engineering, popularly known as “Non-abstract large systems design.”

This workshop covers material critical for SRE, an increasingly broad field that combines software engineering and systems design.

Salim Virji is a Site Reliability Engineer at Google New York City.

Connect:

@salim

Fabian is a Site Reliability Engineer at Google in New York, where he currently works on monitoring systems. He previously worked on the Ganeti SRE team, the Production Monitoring team, and several other Google services. Fabian received a Masters (Diploma) in Computer Science from the Karlsruhe Institute of Technology (KIT), Germany, in 2012.

Connect:

@theotherfabi

15:30–16:00

Break with Refreshments

Rheinlandsaal Ballroom Foyer

16:00–17:30

Track 1

Rheinlandsaal Ballroom A

Your System Has Recovered from an Incident, but Have Your Developers?

Wednesday, 16:00–16:45

Jaime Woo, DigitalOcean

Available Media

Mistakes are inevitable, and happen to the best of us. Our industry adopts a blame-free culture, but that doesn't negate the sting that occurs when we're at the heart of a mess-up.

Developers continually raise the bar on how to prevent errors, mitigate damage from the ones that arise, and wring out as many learnings as possible after the damage is done. But much of this work is focused on the products, and not the people. And given the high stakes in SRE, the psychological impact of a mistake on people can run the gamut from minor to near-traumatic.

Where are the game day exercises that simulate how to support a coworker who just caused 3 am pings and 20-hour work days? What resources should we share to help people understand the stages of emotions they'll feel after a major incident?

The concept of psychological safety is well understood as a key predictor for high-performing teams, but what does that entail? Drawing from original research, and lessons from fields like sports, medicine, and even stand-up comedy, attendees will leave with a series of tangible actions and exercises to help restore team trust and rebuild a developer's confidence.

Jaime Woo started his career as a molecular biologist, working on cartilage replacements. While he adored nurturing genetically-modified E. coli, he realized his main passion was storytelling. He has written an award-nominated book, launched the Engineering blog at Riot, built the technology communications team at Shopify, and currently shepherds content at DigitalOcean. He has a dog named Taco that he will absolutely show you pictures of.

Against On-Call: A Polemic

Wednesday, 16:45–17:30

Niall Murphy, Microsoft

Available Media

There have been computer emergencies as long as there have been computers and emergencies. There is evidence dating computer on-call shifts from the 1940s, and we can trace on-call activity in an essentially unbroken line from today back through seven decades of the computer industry.

But just because we've been doing it for seven decades doesn't make it right. In fact, if you think about it, the fact we've been doing something essentially unchanged for seven decades, given everything else that has changed around the practice, might cause us to ask: is it right that we are doing this? Is there something better? What are the alternatives?

This talk builds on previous work to analyse and establish what the real reasons for continuing to do on-call are, provides evidence that human beings are actually really bad at it (in fact, it's really harmful for them to do it), and proposes solutions, ending with a call to action for the industry to bring into being an on-call-free future.

(Note: this talk builds on an article written for David N. Blank-Edelman's "Seeking SRE.")

Niall Richard Murphy is Director of Engineering for Microsoft Azure Ireland, where his group works on cloud engineering systems and SRE. He is the instigator, co-author, and co-editor of two books on SRE, and a history of the Irish Internet. He is the holder of degrees in Computer Science and Mathematics, and Poetry Studies. He lives in Dublin with his wife and two children.

Connect:

@niallm

Track 2

Rheinlandsaal Ballroom BC

Impact of Network Automation

Wednesday, 16:00–16:45

Roman Romanyak, Squarespace Inc.

Available Media

Running a network efficiently requires a lot of time and effort. Provisioning resources and configuring networking devices often relies on many manual processes and requires highly trained and skilled engineers. Using automation saves time and helps maintain consistent configuration across many networking devices.

In this talk, I will discuss how we, a small network engineering team, started using automation to build and maintain the Squarespace network. We began our move towards orchestration by aggregating small Ansible roles and playbooks. For example, when we migrated the internal routing protocol from OSPF to BGP, we created an Ansible role to fully control the internal routing. Now we run Ansible playbooks when we need to move traffic away from devices before performing maintenance.

I will also describe how we use route servers to minimize configuration changes on network devices. BGP route servers are an elegant and sophisticated tool that we use to shift internal or external traffic without impact. We integrated route servers with our CI/CD tools to do BGP traffic engineering in the internal and external network. I will provide an example of how we route traffic onto a DDoS mitigation service with the click of a button.

I'm a Staff Network Engineer on the Infrastructure Engineering team at Squarespace. I have been building reliable and redundant datacenter networks, and I also focus on building automation tools for network deployment and management.

Connect:

@rromanyak

Migrating Your Old Server Products to Be Stateless Cloud Services

Wednesday, 16:45–17:30

Kurt Scherer and Craig Knott, Atlassian

Available Media

This talk will be a technical walkthrough of how we built out an infrastructure in AWS using industry open-source tooling (like Packer, Ansible, and Terraform) to build stateless instances that run our antiquated applications, and how the CI/CD pipeline allows for reliable rollouts. Because these products are business critical (like billing services), it was essential to have a solid migration plan.

I'll cover the before architecture, the goal, the migration, the end architecture, the upgrade build and deployment process, and the monitoring and alerting taxonomy. I'll share what worked and what didn't and the complications (and outages) along the way.

The goal is to give any SRE the knowledge needed to be confident to perform this type of migration themselves.

Kurt Scherer has been in the industry for 15 years, working in both software engineering and operations roles. He started in the United States military as a Systems and Network Engineer in its ISP, then moved on to consulting for a few years. Later, he joined Yahoo as a software engineer on a video streaming platform and then moved to Microsoft to work on its Virtual Networking stack. He spent time at Google as an SRE for a streaming data pipeline that indexes events for ads targeting. Most recently, he moved to Atlassian to pave the way for a new SRE organization.

Connect:

@srekurt

Craig Knott is a Cloud Farmer at Atlassian.

Connect:

@craigknott92

Track 3

Hegel Room

(Continued from previous session)

Statistics for Engineers

Wednesday, 14:00–17:30

Heinrich Hartmann, Circonus

Track 4

Leibniz Room

(Continued from previous session)

SRE Classroom, Or, How to Design a Distributed System in 3 Hours

Wednesday, 14:00–17:30

Salim Virji, Fabian Geisberger, and Jean Joswig, Google

18:00–19:00

Lightning Talks

Wednesday, 18:00–19:00

Discovering the Beauty of Tech Debt
Daisy Galvan, Facebook
Incident Management Simulation using LEGO
Will Nowak, Google, Inc.
From 32 hours to 7: How the Traffic@FB Team Stopped Dreading Deployments
Kyle Lexmond, Facebook Inc.
Thanos - High Availability and Long Term Storage of Prometheus Metrics
Dominic Green, Improbable
SLA-driven container updates and host maintenance in Apache Aurora
Stephan Erb, Blue Yonder GmbH
The Curious Case of Hiring and Being Hired
Effie Mouzeli
How We Created More Relevant Alerting Even with the Old-school Zabbix-tool
Vladimir Dobriakov, infrastructure-as-code.de
Mastering AIOps with Deep Learning and Time-series Analysis
Jorge Cardoso, Huawei Munich Research Center and University of Coimbra
System Reliability that Plant Trees
Jason Gwartz, Ecosia.org

Available Media

19:00–20:00

Happy Hour

Sponsored by Microsoft Azure

Aristoteles, Platon, and Sokrates Rooms and Rheinlandsaal Ballroom Foyer

Come for the refreshments and the opportunity to meet and network with other attendees, speakers, and conference organizers at the Wednesday Happy Hour.

20:00–22:00

Birds-of-a-Feather Sessions (BoFs)

BoFs, or Birds-of-a-Feather sessions, are an opportunity for informal and ad hoc discussion of a topic of shared interest between a group of conference attendees. SREcon Europe will have BoF sessions on the Wednesday and Thursday evenings.

Go to the BoFs page for information on scheduling BoFs.

Thursday, 30 August, 2018

08:00–09:00

Morning Coffee and Tea

Rheinlandsaal Ballroom Foyer

09:00–10:30

Track 1

Rheinlandsaal Ballroom A

Dealing with Dark Debt: Lessons Learnt at Goldman Sachs

Thursday, 09:00–09:55

Vanessa Yiu, Goldman Sachs International

Available Media

Dark debt is a form of technical debt that is invisible until it causes failures. In enterprise systems, technical engineering liabilities often develop over time due to system complexities, and changes down the line might have unintended consequences due to unforeseen inter-dependencies or hidden problems.

In this talk, we will cover strategies to prevent and mitigate dark debt, as well as tools we have developed or adopted at Goldman Sachs to help manage complexity across our environment.

Attendees of this session can expect to learn:

Challenges of managing distributed systems
Practical tips on how to manage technical debt in your environment
How SRE can help with tackling this form of risk

Vanessa is a site reliability engineer at Goldman Sachs in London. She has worked in a number of engineering roles across Goldman Sachs including sysadmin, storage, market data, trading support as well as managing the firm’s SDLC tool stack for build, test, and deploy.

Halt and Don’t Catch Fire

Thursday, 09:55–10:30

Effie Mouzeli

Available Media

Every young systems team (regardless of whether it’s in a startup) is burdened with someone else’s decisions, good and bad.

Startups are unique in that most of the people who contributed to their technical debt are still there. They are flexible and, in many cases, can still create a solid foundation (both cultural and technological) for the future. On the other hand, they are fragile, change fast, and frequently adopt new technologies prematurely. New systems teams in medium-sized companies face even harder challenges. They must keep everything running and take care of an unknown, complex infrastructure and development team, all while dealing with the lack of original information.

We will discuss how we can slow down the generation of technical debt in new teams, put it in perspective using practical examples and real-life stories, and show how small iterations and changes in culture can lead to more sustainable systems and teams.

Effie is a systems engineer who has worked in a number of startups and small organisations, where her responsibilities are usually automation, infrastructure architecture, and working closely with developers.

She is interested in the challenges (tech and non-tech) systems engineers come across in those environments and how they can overcome them. Away from work, she loves camping, concerts, and sewing.

Connect:

@manjiki

Track 2

Rheinlandsaal Ballroom BC

Applying the Principles of Chaos to Serverless

Thursday, 09:00–09:55

Yan Cui

Available Media

Chaos engineering is a discipline that focuses on improving system resilience through experiments that expose the inherent chaos and failure modes in our system, in a controlled fashion, before these failure modes manifest themselves like a wildfire in production and impact our users. Netflix is undoubtedly the leader in this field, but many of the publicised tools and articles focus on killing EC2 instances, and the efforts in the serverless community have been largely limited to moving those tools into AWS Lambda functions.

But how can we apply the same principles of chaos to a serverless architecture built around AWS Lambda functions?

These serverless architectures have more inherent chaos and complexity than their serverful counterparts, and we have less control over their runtime behaviour. In short, there are far more unknown unknowns with these systems.

Can we adapt existing practices to expose the inherent chaos in these systems? What are the limitations and new challenges that we need to consider?

Yan is an experienced engineer with nearly 10 years of experience working with AWS. He is a regular speaker at user groups and conferences internationally, and he is also the author of AWS Lambda in Motion and a co-author of F# Deep Dives. In his spare time he keeps an active blog at http://theburningmonk.com where he shares his thoughts on topics such as AWS, serverless, functional programming, and chaos engineering.

Connect:

@theburningmonk

Know Your Kubernetes Deploys

Thursday, 09:55–10:30

Felix Glaser, Shopify

Available Media

Containers changed the way we develop and package our code. Kubernetes made it easy to deploy and orchestrate our workloads. Now that those steps are well understood, it is time to draw attention to securing the software supply chain. This talk shows how Shopify secures and tracks its workloads.

We secure our software supply chain by creating signatures on our containers which state that they originate from the correct deploy pipeline, were tested, and contain no known vulnerabilities or outdated software.

During deployment, we use an admission controller to enforce deploy-time policies that check for the presence of the previously created signatures, so that we prevent privilege escalation via code deployment.

Since new exploits show up all the time, we need to add another piece to the puzzle to secure containers: a place to track all the metadata created during the lifetime of a container. For example, we track where a container is deployed so that if it becomes vulnerable it gets pulled out of production, fixed, and redeployed.

Felix likes to climb, cycle, and code. He does the first two outside. And the last but not least at Shopify, where he works on securing containers and their deployment into the cloud.

Track 3

Hegel Room

Developing Effective Service Level Indicators and Service Level Objectives

Thursday, 09:00–12:30

Liz Fong-Jones, Kristina Bennett, Daniel Quinlan, Gwendolyn Stockman, and Stephen Thorne, Google

Space is limited: add this to your schedule if you plan to attend.

Available Media

Devising service level indicators and service level objectives that effectively measure user experiences requires knowledge of the fundamentals of SLIs and SLOs and practice examining SLIs/SLOs for potential weaknesses. In this workshop, you will review SLI/SLO fundamentals, then practice devising and checking SLIs/SLOs in groups with guidance from the Google Customer Reliability Engineering team.

Attendees are expected to have some basic knowledge of monitoring (e.g. whitebox/blackbox) and some experience with how systems can fail.

From the exercise of devising SLOs and SLIs for an example application, attendees will gain the knowledge and experience to propose SLOs and SLIs for their own businesses. They will understand what SLAs, SLOs, SLIs, and error budgets are and how they are related; they will understand how to define and measure meaningful SLOs and how to run a service or application targeting those SLOs.
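As a rough illustration of how SLIs, SLOs, and error budgets relate, the following minimal sketch assumes a simple request-based SLI (good events divided by total events). The names `Window`, `sli`, `error_budget`, and `budget_consumed` are hypothetical and are not taken from the workshop material.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Event counts observed over one SLO measurement window (hypothetical type)."""
    good_events: int
    total_events: int


def sli(window: Window) -> float:
    """Service level indicator: fraction of events that were good."""
    return window.good_events / window.total_events


def error_budget(slo_target: float, window: Window) -> float:
    """Number of events allowed to be bad while still meeting the SLO target."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * window.total_events


def budget_consumed(slo_target: float, window: Window) -> float:
    """Fraction of the error budget already spent (values above 1.0 mean the SLO is missed)."""
    bad_events = window.total_events - window.good_events
    return bad_events / error_budget(slo_target, window)


if __name__ == "__main__":
    w = Window(good_events=999_500, total_events=1_000_000)
    print(f"SLI: {sli(w):.4%}")
    print(f"Budget consumed at a 99.9% SLO: {budget_consumed(0.999, w):.1%}")
```

Under these assumptions, a 99.9% target over one million events allows roughly 1,000 bad events, so 500 bad events would spend about half of the budget.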

Liz is a Staff Site Reliability Engineer at Google and works on the Google Cloud Customer Reliability Engineering team in New York. She lives with her wife, metamour, and two Samoyeds in Brooklyn. In her spare time, she plays classical piano, leads an EVE Online alliance, and advocates for transgender rights.

Connect:

@lizthegrey

Kristina Bennett has worked at Google since 2009. Although she recently joined the Customer Reliability Engineering team in their mission to apply the principles and lessons of SRE at Google towards customers, prior to that she spent five years working on data integrity across Google.

Connect:

@kilobitten

Stephen Thorne is a Senior Site Reliability Engineer at Google. He currently works in Customer Reliability Engineering, helping to integrate Google's Cloud customers' operations with Google SRE. Stephen learned how to be an SRE on the team that runs Google's advertiser and publisher user interfaces, and later worked on App Engine. Before his time at Google, he fought against spam and viruses in his home country of Australia, where he also earned his B.S. in Computer Science.

Connect:

@jerub

Track 4

Leibniz Room

The EU's New Data Protection Law—A Survival Guide

Thursday, 09:00–12:30

Simon McGarr, Data Compliance Europe; Laura Nolan

Space is limited: add this to your schedule if you plan to attend.

What data do you hold?

Are you processing the data, or controlling it?

Do you have the consents to use that data like that?

Do you have a register of all that data and every way you use it, and what for?

Can you find every piece of data you hold that relates to an individual, copy it and send it to them—for free—within 30 days?

What happens when they say they want it erased?

The General Data Protection Regulation (GDPR) took effect on 25 May 2018. New powers mean regulators can impose fines for breaches of up to 4% of annual turnover. But that’s not the only thing that will drive compliance: the data supply chain and a fresh threat of litigation are both driving change in organisations as well. This workshop is for anyone trying to make sure that their organisation isn't in breach and can deal with requests related to the GDPR.

GDPR isn't just a compliance project. It's a business culture change project. Let's struggle our way through together.

This will be an audience-driven workshop session, so bring your hardest questions.

Simon McGarr is recognised as one of Ireland’s leading experts in Data Protection. A practising solicitor, Data Protection consultant and external DPO, he has lectured in the Law Society, regularly appears on national media discussing data issues and was recently invited by the Irish parliament to give evidence on the implementation of the GDPR. He has been involved in many of the landmark cases developing Data Protection law in the EU and focuses much of his work on helping organisations to understand their data protection law needs.

Laura Nolan’s background is in Site Reliability Engineering, software engineering, distributed systems and computer science. She wrote the 'Managing Critical State' chapter in the O'Reilly ‘Site Reliability Engineering’ book, and is co-chair of SREcon18 Europe/Middle East/Africa. Laura is currently enjoying a well-earned sabbatical (and tinkering with some of her own projects) after 15 years in industry, most recently at Google.

Connect:

@lauralifts

10:30–11:00

Break with Refreshments

Rheinlandsaal Ballroom Foyer

11:00–12:30

Track 1

Rheinlandsaal Ballroom A

Not Invented Here Syndrome and Dark Debt: The PagerDuty Story

Thursday, 11:00–11:35

Aish Raj Dahal, PagerDuty

Available Media

The conundrum of building something in-house vs using something off-the-shelf is an important one. Organizations sometimes have their own set of priorities and business requirements that lead them to conclude that no off-the-shelf solution fits the bill. In such scenarios, companies and teams often end up building their own custom software. Once such custom software is written and deployed, organizations rarely consider revisiting their original choices. Proliferation of such decisions within an organization ultimately results in dark debt.

This talk examines the important conundrum of build vs buy. I will start with the different choices that teams and organizations have when considering an in-house solution. Next, I will share a case study about PagerDuty’s journey from building a highly available Cassandra-based distributed queue solution to address a specific use case, to using it (albeit incorrectly) everywhere else, to deprecating and retiring it in favour of something off-the-shelf. Along the way, the audience will also learn how such upgrades were done without affecting the health of production systems, and how revisiting decisions from the past helped uncover dark debt.

Aish works as an Engineer at PagerDuty in San Francisco. He currently works on building PagerDuty’s event intelligence platform, often dealing with the fallacies of distributed computing. His recent focus has been on Elixir/OTP and building event-driven microservices using Kafka and Elixir. In the past, he has worked as an early-stage employee at HackerRank as well as a programmer at Goldman Sachs.

Connect:

@aishrajdahal

Building a Debuggable Go Server

Thursday, 11:35–12:00

Keeley Erhardt, Improbable

Available Media

Microservice architectures bring flexibility and scalability to service-based applications; however, they also drastically increase operational complexity. This presentation will explore how Improbable uses a common Go server with built-in debuggability as a basis for new services, minimizing the complexity inherent in microservices. This server provides consistent metrics, logging, tracing, and more, to deliver a reliable system, and it is now open-source!

Keeley is a software engineer at Improbable, a London-based tech company focused on enabling massive-scale simulation. She graduated from MIT with a B.S. and an M.Eng in Computer Science. Keeley is passionate about distributed systems and open source and has contributed to a variety of open source projects, including Buck (a build system open-sourced by Facebook) and Chromium. More recently, her focus has turned towards DevOps and building systems that scale.

Connect:

@keeleyerhardt

Building a Fellowship Program to Mentor and Grow Your SRE Team

Thursday, 12:00–12:30

Tom Spiegelman, DigitalOcean

Available Media

Mentorship is invaluable at any point in your career. At DigitalOcean, we introduced an internal two-week fellowship program pairing any developer interested in learning more about what infrastructure does with a senior engineer. We followed Tuckman's four stages of group development: forming, storming, norming, and performing. We believe we create the best-performing team when mentors and mentees go through the four stages together as a team. Two weeks may seem brief, but it let us iterate quickly, and it meant we could focus our energies on mentoring just one person at a time to limit the strain on the team’s bandwidth. The benefits were manifold: our infrastructure team gained a better perspective on what other teams go through and work on daily, which helps us build better tools and workflows to support them. Not only did participants strengthen their skills, but some joined infrastructure, realizing it was right for them. And for those who didn't join, it was an excellent way to cross-pollinate ideas and build the infrastructure team's relationships with other teams. In this talk, attendees will hear about the theory, lessons learned, and how to create their own fellowship program.

I am truly passionate about all things tech, especially infrastructure, so come and talk to me about it. I have been at DigitalOcean for coming on three years now and really love it.

Track 2

Rheinlandsaal Ballroom BC

SoundCloud's Story of Seeking Sustainable SRE

Thursday, 11:00–11:35

Björn Rabenstein, SoundCloud Ltd.

Available Media

SoundCloud runs a complex microservice architecture to serve a great diversity of features to a large user base. All of this is done by a relatively small number of engineers, under constant pressure to innovate in the not exactly easy market of music streaming. While this might appear quite similar to the situation of many other startups, SoundCloud is a rather extreme example. As such, it is perfectly suited to find out how to tackle this tech-debt-prone situation.

About six years ago, with the microservice migration in full swing, site reliability became more and more problematic at SoundCloud. At about the same time, SoundCloud happened to employ a handful of ex-Google SREs. Naively, one might have expected they would simply wave their magic G-wands and make the site reliable again. However, simply copying Google-style SRE and applying it to an organization very different in scale and culture was doomed to fail. Studying the exact reasons for the failure and SoundCloud's subsequent mission to find their own implementation of SRE is a helpful exercise for many smaller organizations in a similarly challenging situation of sustainably running a diverse set of services.

Björn Rabenstein is a Production Engineer at SoundCloud and a Prometheus developer. Previously, Björn was a Site Reliability Engineer at Google and a number cruncher for science.

How We Un-Scattered Our DNS Setup and Unlocked New Automation Options

Thursday, 11:35–12:00

Dan Lüdtke, eGym.com

Available Media

We own over a hundred different domains. They were spread over multiple registrars. DNS servers were not under active management and DNS data was neither version controlled nor reviewed. Deployments were risky and rollbacks challenging.

We gained control over the situation by reducing the number of contracts with registrars, selecting a cloud-based DNS service, and convincing the teams to manage DNS data in a version-controlled manner. To deploy DNS changes, we built tooling that we open-sourced. Today, we are able to deploy much faster and more safely. We also have automated checks and have implemented some safety measures to prevent the most common mistakes I made in the past. Sharing the mistakes will be part of the presentation, as well as a quick outlook on the new automation options we unlocked by having a more robust DNS setup.

Dan served his country, worked as a security consultant, wrote a book about IPv6, contributes to open source software projects, regularly helps to organize large hacker events, runs an autonomous system for fun, and dreams of space travel.

Connect:

@danrl_com

Kernel Upgrades at Facebook

Thursday, 12:00–12:30

Pradeep Nayak Udupi Kadbet

Available Media

The goal of this talk is to explain the importance of automating your kernel upgrades and why you should invest time in building automation which reliably and continuously enforces newer kernels on your hosts.

The Kernel Team at Facebook is in charge of the Linux kernel used at Facebook, along with other 'system level' packages that go with it. The kernel team works on tasks like:

Merging upstream changes into the Facebook Linux Kernel
Creating custom kernel changes for our needs
Investigating Linux-related performance issues and failures
Periodically building and performing initial testing of new Facebook kernel RPMs

MySQL is one of the primary data stores which Facebook relies on. We have tens of thousands of database hosts which run on Linux boxes with different kernel versions. No kernel is perfect, and database hosts often hit kernel bugs which impact production traffic. The remediation is often to upgrade to newer kernels which contain the fixes.

In this talk I will go over some of the kernel bugs which impacted our production database servers and how we invested time in developing an automation framework to enforce new kernels on our database hosts in a continuous fashion at Facebook scale. I will also go over how MySQL Infrastructure at Facebook adopted this and is successfully upgrading tens of thousands of database servers without impacting production traffic.

Pradeep is a Production Engineer at Facebook and works with MySQL Infrastructure. He loves hacking code in Python and builds bots to do things for him. When he is not working, he enjoys traveling and taking pictures.

Connect:

@_prdp

Track 3

Hegel Room

(Continued from previous session)

Developing Effective Service Level Indicators and Service Level Objectives

Thursday, 09:00–12:30

Liz Fong-Jones, Kristina Bennett, Daniel Quinlan, Gwendolyn Stockman, and Stephen Thorne, Google

Track 4

Leibniz Room

(Continued from previous session)

The EU's New Data Protection Law—A Survival Guide

Thursday, 09:00–12:30

Simon McGarr, Data Compliance Europe; Laura Nolan

12:30–14:00

Luncheon

Sponsored by Google

Aristoteles, Platon, and Sokrates Rooms

14:00–15:30

Track 1

Rheinlandsaal Ballroom A

Managing Misfortune for Best Results

Thursday, 14:00–14:45

Kieran Barry, SRE @ Google

Available Media

Simulated outage training games are a regular part of SRE training at Google and elsewhere. They represent great opportunities to simulate an outage and to practice problem debugging and escalation. Perhaps equally important, they provide an opportunity to simulate the stress of an outage for an oncall engineer.

This talk describes techniques to ensure a productive training environment. It will emphasise the importance of providing context to the trainee engineer. It will also talk about the importance of calibrating the level of stress to the needs of the student. Since training games are often observed by whole teams, the talk will cover ways to maintain engagement among the group of observers.

Finally, it will talk about potential anti-patterns to be avoided.

Kieran has worked at Google as an SRE on the search team for the past four years. He has also volunteered on new-SRE education as part of the SREEDU team.

Clearing the Way for SRE in the Enterprise

Thursday, 14:45–15:30

Damon Edwards, Rundeck

Available Media

Do you get excited listening to the SRE success stories from companies whose practices, infrastructure, and tooling seems ready-made for this new style of working? But then do you quickly become disillusioned when you look back at the hairball that is your company's multiple generations of legacy infrastructure, tooling, skills, processes, and business constraints? This talk is for you.

We'll first look at how decades of traditional operations beliefs and practices leave organizational scar tissue that is difficult to overcome. We'll examine examples of how silos, excessive toil, reliance on queues, and incorrectly applied governance models undermine the adoption of SRE principles and practices in the enterprise.

We will also look at how high-performing enterprises are having success transforming their operations. Specifically, we'll look at the design patterns they applied to change organizational structures, update roles/responsibilities, align incentives, and change individual mindsets. The result was that these organizations cleared away that scar tissue and made it easier to move to an SRE model of working.

Note: This talk is based on Damon Edwards's chapter "Clearing the Way for SRE in the Enterprise" in the upcoming Seeking SRE (O'Reilly 2018).

Damon Edwards is a Co-Founder of Rundeck Inc., the makers of Rundeck, the popular open source Operations Management Platform. Damon has spent over 15 years working with both the technology and business ends of IT Operations and is noted for being a leader in porting Lean and cutting-edge DevOps techniques to large-scale enterprise organizations. Damon is a frequent conference speaker and writer who focuses on DevOps, SRE, and Operations improvement topics. Damon is active in the international DevOps community, a co-host of the DevOps Cafe podcast, and a content chair for Gene Kim’s DevOps Enterprise Summit.

Connect:

@damonedwards

Track 2

Rheinlandsaal Ballroom BC

Care and Feeding of Data Processing Pipelines

Thursday, 14:00–14:45

Rita Sodt, Google

Data processing pipelines have important use cases ranging from business analytics, machine learning, eliminating spam and abuse, and delivering billing invoices to transforming data for many important user-facing serving jobs. These pipelines are often composed of multiple steps, where the input of one is the output of another, with dependencies on external systems and storage, all of which can break. When they do, and pipelines fail to meet SLOs, fixes are often expensive and time-consuming, especially if a large data set needs to be reprocessed or repaired. It is best to focus on prevention and on quickly detecting and responding to issues, which is where SRE can help.

In part, the difficulty of managing pipelines lies in their difference from serving jobs. Because RPC latency and errors cannot be monitored directly as a proxy for customer happiness, it is necessary to gain visibility into the age of the oldest unprocessed data and to measure data correctness, since corrupt output data may be customer-visible and persisted even when serving jobs report no errors. To prevent issues and minimize impact, techniques such as canarying, incremental rollout, automatic failover, and autoscaling can be used, all of which have specific considerations for pipelines.
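The "age of oldest unprocessed data" signal mentioned above can be sketched as follows. This is a minimal illustration, assuming each pending item carries an arrival timestamp in seconds since the epoch; the function names and the freshness threshold are hypothetical, not part of the talk.

```python
import time
from typing import Iterable, Optional


def oldest_unprocessed_age(pending_timestamps: Iterable[float],
                           now: Optional[float] = None) -> float:
    """Return the age in seconds of the oldest pending item, or 0.0 if nothing is pending."""
    timestamps = list(pending_timestamps)
    if not timestamps:
        return 0.0
    if now is None:
        now = time.time()
    # Clamp at zero so clock skew cannot produce a negative age.
    return max(0.0, now - min(timestamps))


def freshness_slo_met(pending_timestamps: Iterable[float],
                      max_age_seconds: float,
                      now: Optional[float] = None) -> bool:
    """True when the oldest pending item is no older than the allowed age."""
    return oldest_unprocessed_age(pending_timestamps, now) <= max_age_seconds


if __name__ == "__main__":
    pending = [100.0, 250.0, 400.0]  # hypothetical arrival times
    print(oldest_unprocessed_age(pending, now=1000.0))
    print(freshness_slo_met(pending, max_age_seconds=600.0, now=1000.0))
```

Unlike request latency, this value keeps growing while a pipeline is stalled, which is why it can work as an alerting signal even when no errors are being reported.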

Rita is an SRE at Google with experience managing data processing pipelines, including Google Analytics. She has worked with other pipeline groups at Google on automation and, in particular, monitoring products that meet the needs of pipelines as well as serving jobs. She started her career as a software developer on Google Cloud and comes from an interdisciplinary research background at the University of Washington, including projects to predict and model brain tumor growth and to interface sensors with mobile devices for applications in the developing world.

Panel: Data Pipelines—Scaling and Reliability

Thursday, 14:45–15:30

Moderator: Laura Nolan
Panelists include: Narayan Desai, Google; Matthew Flaming, New Relic; Theo Schlossnagle, Circonus; Rita Sodt, Google

Data processing pipelines, in some form or another, are the lifeblood of all large systems that aggregate data, sort and structure unordered input, or compute features for machine learning. These kinds of systems have become much more common in recent years, and problems with delayed or incorrect results are becoming more likely to have business and user impact. Designing and running pipelines is quite different from designing and running serving jobs. In this panel, pipeline experts from a range of organisations and applications will discuss their experiences scaling pipelines and dealing with their pitfalls.

Track 3

Hegel Room

Chaos Engineering Bootcamp

Thursday, 14:00–17:30

Ana Medina, Gremlin

Space is limited: add this to your schedule if you plan to attend.

Available Media

Chaos engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production. Chaos engineering can be thought of as the facilitation of controlled experiments to unearth weaknesses.

Ana Medina leads a hands-on tutorial on chaos engineering, covering the tools and practices you need to implement chaos engineering in your organization. Everyone will be given a Kubernetes cluster with a demo application to perform a set of chaos engineering attacks using a variety of chaos engineering tools. You’ll identify new ways to use chaos engineering within your engineering organization and discover how other companies are using chaos engineering to create reliable distributed systems.

Pre-Reading List
1. Production-Ready Microservices by Susan Fowler—especially the section on chaos testing at Uber (p. 94)

Prerequisites, Skills, and Tools
1. A basic understanding of production environments and the infrastructure required to run systems
2. Experience with Linux, cloud infrastructure, hardware, networking, and systems troubleshooting

Ana is a Software Engineer living in San Francisco. She is currently working as a Chaos Engineer at Gremlin, helping companies avoid outages by running proactive chaos engineering experiments. She last worked at Uber where she was an engineer on the SRE and Infrastructure teams specifically focusing on chaos engineering and cloud computing. Catch her tweeting at @Ana_M_Medina mostly about traveling, diversity in tech and mental health.

Track 4

Leibniz Room

The Art of Debugging

Thursday, 14:00–17:30

Avishai Ish-Shalom, Aleph VC, and Nati Cohen, HERE Mobility

Space is limited: add this to your schedule if you plan to attend.

Available Media

Are you one of those "gifted debuggers" that everyone turns to when they need to solve a difficult problem? Great! This workshop isn't for you. For the rest of us, debugging is often considered a mysterious trait that some engineers were born with, but alas, some simply aren't. This workshop is here to bust that myth.

In this workshop, we will practice a well-structured debugging methodology—conducting debugging "katas" with the aim of mastering debugging technique.

Let's stop using trial and error (and other witchcraft tactics) to find the cause(s) of our problems!

This workshop is for junior and senior engineers interested in improving their debugging methodology. Although debugging is a very common activity, in real-world scenarios noise, cognitive biases, system complexity, and production pressures can easily lead us astray. By training yourself in debugging methodology, you can improve your real-world performance under these harsh conditions.

Avishai is a veteran operations and software engineer with years of high-scale production experience. At present, Avishai helps growing startups and the Israeli high-tech ecosystem as Engineer in Residence at the Aleph VC fund. In his spare time, Avishai spreads weird ideas and conspiracy theories like DevOps and Operations Engineering.

Connect:

@nukemberg

Nati Cohen is a Production Engineer at Here Technologies and a Teaching Assistant at the Interdisciplinary Center Herzliya. Previous experience includes: operations consulting, software development, *nix administration and security research in the Intelligence Corps as well as in multiple startup companies.

Connect:

@nocoot

15:30–16:00

Break with Refreshments

Rheinlandsaal Ballroom Foyer

16:00–17:30

Track 1

Rheinlandsaal Ballroom A

The Math behind Project Scheduling, Bug Tracking, and Triage

Thursday, 16:00–16:45

Avery Pennarun

Available Media

Many projects have poorly defined (and often overridden) priorities, hopelessly optimistic schedules, and overflowing bug trackers that are occasionally purged out of frustration in a mysterious process called "bug bankruptcy." But a few projects seem to get everything right. What's the difference? Avery collected the best advice from the best-running teams at Google, then tried to break down why that advice works—using math, psychology, an ad-hoc engineer simulator (SimSWE), and pages torn out of Agile Project Management textbooks.

We'll answer questions like:

Why are my estimates always too optimistic, no matter how pessimistic I make them?
How many engineers have to come to the project planning meetings?
Why do people work on tasks that aren't on the schedule?
What do I do when new bugs are filed faster than I can fix them?
Should I make one release with two features or two releases with one new feature each?
If my bug tracker is already a hopeless mess, how can I clean it up without going crazy or declaring bankruptcy?

Once upon a time, Avery was the lead engineer for Google Fiber's home wifi devices, building, managing, and monitoring the whole fleet in customers' homes. More recently, he's branched out into projects that are harder to explain. Before that, he started startups including one that deployed Lotus Domino in 10 minutes, and one that did unspeakable things to Microsoft Access databases. He's also on the board of directors for a Canadian bank. Nobody knows why.

Connect:

@apenwarr

Ethics in Computing

Thursday, 16:45–17:30

Theo Schlossnagle, Circonus

Available Media

As computing evolved to touch billions of lives, it did so over the rapid course of about 25 years. In that time, the profession of computing has not self-regulated with respect to professional domain-specific ethical enforcement. In fact, many computing professionals have not taken a course on ethics, know very little about ethics, and don't understand how it uniquely applies to their industry. This talk aims to cast light on this important subject in a tone and tenor that should resonate with computing professionals.

The Founder/CEO of Circonus, Theo Schlossnagle is a practicing software engineer and serial entrepreneur. At Johns Hopkins University he earned undergraduate and graduate degrees in computer science, with a focus on graphics and randomized algorithms in distributed systems. Theo founded four technology startups focusing on large systems scalability and distributed systems. He is a Distinguished Member of the ACM and sits on the ACM Practitioners Board and serves as co-chair for the ACM Queue.

Connect:

@postwait

Track 2

Rheinlandsaal Ballroom BC

Canarying Well: Lessons Learned from Canarying Large Populations

Thursday, 16:00–16:45

Štěpán Davidovič, Google

Available Media

Canarying, the process of controlled and observed partial rollout in production to mitigate risk, is one of the common techniques used to ensure safe production changes. In this talk, we will cover common pitfalls, discuss best practices, and outline an end-to-end strategy for the canary process.
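
As a rough illustration of the "controlled and observed partial rollout" idea, here is a minimal sketch that compares a canary population's error rate against a control population. It is not the speaker's method or Google's Canary Analysis Service; the names `Population` and `canary_passes` and the thresholds `max_relative_increase` and `min_requests` are assumptions chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Population:
    """Request counts observed for one group of servers during the canary window."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # A population that served no traffic is reported as 0.0 here.
        return self.errors / self.requests if self.requests else 0.0


def canary_passes(control: Population, canary: Population,
                  max_relative_increase: float = 0.10,
                  min_requests: int = 1000) -> bool:
    """Return True if the canary's error rate is within the allowed margin.

    Both thresholds are hypothetical defaults, not recommended values.
    """
    # Refuse to judge a canary that has not seen enough traffic.
    if canary.requests < min_requests:
        return False
    allowed = control.error_rate * (1.0 + max_relative_increase)
    return canary.error_rate <= allowed
```

A real canary evaluation would typically look at more than one signal and account for statistical noise; this sketch only shows the basic comparison between the two populations.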

Štěpán Davidovič is a Site Reliability Engineer at Google. He currently works on internal infrastructure for automatic monitoring. In previous Google SRE roles, he developed Canary Analysis Service, worked on a distributed Cron solution, and worked on both a wide range of shared infrastructure projects and AdSense reliability. He obtained his bachelor's degree from Czech Technical University, Prague, in 2010.

Real World SLOs and SLIs: A Deep Dive

Thursday, 16:45–17:30

Matthew Flaming and Elisa Binette, New Relic

Available Media

If you've read almost anything about SRE best practices, you've probably come across the idea that clearly defined and well-measured Service Level Objectives (SLOs) and Service Level Indicators (SLIs) are a key pillar of any reliability program. SLOs allow organizations and teams to make smart, data-driven decisions about risk and the right balance of investment between reliability and product velocity.

But in the real world, SLOs and SLIs can be challenging to define and implement. In this talk, we’ll dive into the nitty-gritty of how to define SLOs that support different reliability strategies and modalities of service failure. We’ll start by looking at key questions to consider when defining what “reliability” means for your organization and platform. Then we'll dig into how those choices translate into specific SLI/SLO measurement strategies in the context of different architectures (for example, hard-sharded vs. stateless random-workload systems) and availability goals.
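
To make the relationship between an SLI, an SLO, and the resulting error budget concrete, here is a minimal sketch for a request-based availability SLI. It is an illustrative assumption, not the measurement strategy presented in the talk; the function names and the example figures are hypothetical.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """SLI: fraction of events in the window that were good."""
    if total_events == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return good_events / total_events


def error_budget(slo_target: float, total_events: int) -> float:
    """Number of bad events the SLO tolerates in the window."""
    return (1.0 - slo_target) * total_events


def budget_remaining(slo_target: float, total_events: int,
                     bad_events: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(slo_target, total_events)
    if budget == 0:
        # A 100% target (or an empty window) leaves no budget to spend.
        return 0.0 if bad_events == 0 else float("-inf")
    return 1.0 - bad_events / budget
```

For example, with a hypothetical 99.9% target over 1,000,000 requests, the budget is about 1,000 bad requests, so 250 failures leave roughly three quarters of the budget unspent.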

Matthew Flaming began his career in software engineering back when creating a web portal meant hacking together your own version of JSP and racking your own Solaris boxes. Since then he has led the development of complex, high-scale backend systems ranging from CDNs to IoT platforms with an equal emphasis on technical architecture and building organizations where innovation thrives. In his current role as VP of Site Reliability at New Relic, he focuses on the SRE practice and the technical, operational, and cultural aspects of scaling and reliability.

Connect:

@mflaming

Elisa Binette is a Senior Engineering Manager within the Site Reliability Organization at New Relic. The group focuses on helping teams measure and achieve their reliability goals, improving reliability for both the engineers within the company and for the end customers of New Relic. She’s actively involved with PDXWIT, a local non-profit whose purpose is to strengthen the Portland women in tech community. She also loves martial arts, and has enjoyed both practicing and teaching classes for many years.

Connect:

@elisabPDX

Track 3

Hegel Room

(Continued from previous session)

Chaos Engineering Bootcamp

Thursday, 14:00–17:30

Ana Medina, Gremlin

Space is limited: add this to your schedule if you plan to attend.

Available Media

Track 4

Leibniz Room

(Continued from previous session)

The Art of Debugging

Thursday, 14:00–17:30

Avishai Ish-Shalom, Aleph VC, and Nati Cohen, HERE Mobility

Space is limited: add this to your schedule if you plan to attend.

Available Media

17:30–19:00

Conference Reception

Mingle with fellow attendees at the Conference Reception. Enjoy dinner, drinks, and the chance to connect with other attendees, speakers, and conference organizers.

19:00–22:00

Birds-of-a-Feather Sessions (BoFs)

BoFs, or Birds-of-a-Feather sessions, are an opportunity for informal and ad hoc discussion of a topic of shared interest between a group of conference attendees. SREcon Europe will have BoF sessions on the Wednesday and Thursday evenings.

Go to the BoFs page for information on scheduling BoFs.

Friday, 31 August, 2018

08:00–09:00

Morning Coffee and Tea

Rheinlandsaal Ballroom Foyer

09:00–10:30

Track 1

Rheinlandsaal Ballroom A

I’m SRE and You Can Too!—A Fine Manual for Migrating Your Organization to the New Hotness

Friday, 09:00–09:55

Blake Bisset, Dropbox, and Jonah Horowitz

Available Media

Tales of brown-field SRE transitions from Netflix, Stripe, YouTube, Chrome, and the Google acquisitions process that you can use to overcome your own obstacles to continuous operational improvement.

How do you build a sustainable SRE program? How do you migrate existing infrastructure, teams, and workflows to an architecture that can realize the benefits of SRE? What choices have companies made with their SRE programs, and what should you consider when adopting the SRE model?

We've been through this process a half dozen times in differing circumstances and organizations, and will share with you what has worked—and occasionally failed—for us as well as attempt to answer some of the most common questions we've had from previous attendees undergoing their own SRE/DevOps transitions.

Blake got his first legal tech job at 16, long enough ago that he’s entitled to make shakeyfists while shouting “Get off my LAN!”

He did three startups (a Dupont/ConAgra venture; a UW biotech spinoff; and this other time some kids were sitting around New Year's Eve, wondering why they couldn’t watch movies on the Internet) before becoming an SRM.

At YouTube and Chrome his "20%" was organizing Google's global tech management conference and MentorsOnCall programs, but his happiest accomplishment was holding the go/bestpostmortem link for multiple years. He currently serves as head of Reliability Engineering at Dropbox.

Connect:

@blakebisset

Jonah is a Senior Site Reliability Engineer with 18 years of experience building and scaling production applications. He's worked at several startups and large companies including Quantcast, Netflix, and Stripe.

Connect:

@jonahhorowitz

Lessons Learned—Data Driven Hiring 3 Years Later

Friday, 09:55–10:30

Chris Stankaitis, The Pythian Group

Available Media

In 2015 I gave a talk on Data Driven Hiring at the first SREcon EMEA. Three years later I would like to revisit that talk, and discuss what's changed and what has remained the same since I first put in place a new hiring system based on evaluating "potential" rather than hard skills.

With the speed of changing technology, we have shifted to a reality where companies need people with growth mindsets who are able to embrace change and pick up new technologies fast. Gone are the days when the most technologies on a resume wins the race, and hiring managers are now working to replace their old methods with ones that validate a baseline of technical competency but focus on hiring for unknown unknowns.

Even if you are not a manager, interviewing for new roles is a constant in our world. This talk will help you understand what many managers are looking for and will give you an understanding of the evolution of the hiring process in the event you have not changed jobs recently.

Working for Pythian, Chris builds and manages high-performing SRE and Hadoop teams that are globally distributed (follow the sun) and remote (work from home). Working with companies from startup to web-scale, Chris's teams keep many of the sites and services people use on a daily basis up and running smoothly. Chris is a passionate advocate and evangelist for the transformation of the traditional systems administration role into the current SRE paradigm. Outside of work Chris is a gamer geek, a volunteer with Scouts Canada, and works with his local LGBTQ+ community on education and outreach initiatives.

Connect:

@drtns

Track 2

Rheinlandsaal Ballroom BC

SRE Team Lifecycles

Friday, 09:00–09:55

Stephen Thorne, Google

Available Media

Site Reliability Engineering is more than a job title or the name of a team: we know it is a collection of approaches and practices that allow an organisation to deliver their applications at scale.

We set out to give a definition of what the principles of SRE should be able to deliver to an organisation, and to supply you with a roadmap of how to go from first considering implementing SRE practices, to what your first SREs should be tasked with, to making your SRE team a stable and productive fixture in your organisation.

Stephen Thorne is a Senior Site Reliability Engineer at Google. He currently works in Customer Reliability Engineering, helping to integrate Google's Cloud customers' operations with Google SRE. Stephen learned how to be an SRE on the team that runs Google's advertiser and publisher user interfaces, and later worked on App Engine. Before his time at Google, he fought against spam and viruses in his home country of Australia, where he also earned his B.S. in Computer Science.

Connect:

@jerub

Capacity Planning in Four Parts: Telling the Future without a Crystal Ball

Friday, 09:55–10:30

Evan Smith, SRE at Hosted Graphite

Available Media

Capacity Planning is a pretty difficult topic to wrap your head around and can seem almost impossible to start with. In this talk, I present the four basic steps to go from monitoring what you have to planning for the future. If you've found yourself woken up at 4 AM to expand nodes to meet demand, it's time to pull away from reactive actions and start proactive capacity planning.

The goal of the session is to:

Provide an approachable method for capacity planning right now.
Highlight concerns and considerations before starting your plan.
Explain how to enforce your plan after you've made it.
Explain when to re-evaluate your plans and reduce toil.
Recommend some methods for avoiding common pitfalls engineers/business folk fall into.

Evan Smith is a Service Reliability Engineer with Hosted Graphite in Dublin, Ireland. He's responsible for architecting capacity plans and the company ChatOps bot while helping manage an ingestion pipeline of over 100 billion data points a day. A lover of systems, security, and making spreadsheets, Evan's new to being an SRE, but six years of experience as a Web Dev/Ops Engineer has made the transition just a little easier.

Connect:

@TheJokersThief

Track 3

Hegel Room

Unconference: Resilient Design

Friday, 09:00–10:30

Avishai Ish-Shalom, Aleph VC

Space is limited: add this to your schedule if you plan to attend.

Building resilient applications is not easy. We expect our servers to have decent, predictable latency even in the face of high loads and unexpected load or latency spikes, and we want graceful degradation in the face of failures. This unconference session provides a space for practitioners to discuss experiences of designing and building applications with these concerns in mind. The agenda will be decided by the participants on the day, but examples of topics relevant to the conference are:

Queueing delays
Sizing queues, thread pools, concurrency limits
The effect of outliers (high percentiles) on system performance and capacity
Load shedding
Circuit breakers
Cascading failures
Isolating interdependent systems

Avishai is a veteran operations and software engineer with years of high-scale production experience. At present, Avishai helps growing startups and the Israeli high-tech ecosystem as Engineer in Residence at the Aleph VC fund. In his spare time, Avishai spreads weird ideas and conspiracy theories like DevOps and Operations Engineering.

Connect:

@nukemberg

Track 4

Leibniz Room

Building Blocks of Distributed Systems

Friday, 09:00–12:40

John Looney, Facebook

Space is limited: add this to your schedule if you plan to attend.

Available Media

All distributed systems make tradeoffs and compromises. Different designs behave very differently with respect to cost, performance, and how they behave under failure conditions.

It's important to understand the tradeoffs that the building blocks in your systems make, and the implications this has for your system as a whole. In this workshop we'll look at several examples of different real-world distributed systems and discuss their strengths and shortcomings.

This workshop will include some practical elements. You will be given some system designs to read and to evaluate, and then we'll discuss the implications of each design together as a group.

John Looney is a Production Engineer at Facebook, improving datacenter automation. Previously, he helped build a modern SaaS-based infrastructure platform for Intercom, one of the fastest-growing technology companies in the world. Before that, he was a full-stack SRE at Google, where he did everything from rack design and data-center automation through ads-serving, stopping at GFS, Borg, and Colossus along the way. He wrote chapters for The SRE Book and Seeking SRE, and is on the steering committee for USENIX SREcon.

10:30–11:00

Break with Refreshments

Rheinlandsaal Ballroom Foyer

11:00–12:40

Track 1

Rheinlandsaal Ballroom A

The Nth Region Project: An Open Retrospective

Friday, 11:00–11:45

Andrew Bloomgarden, New Relic

Available Media

For the past year, a small team of engineers and I have had one job: allow New Relic to run an independent European region for data sovereignty reasons. That means taking around 500 services written by around 50 teams that have historically been assumed to run in just one deployment and changing them to work anywhere. And, at the end of the process, we needed to be able to spin up new regions quickly and sustainably operate them with our existing staff.

The talk will be in two parts, because a project like this isn't purely technical or organizational.

We needed to choose technical changes that turned building out a new region from a many-month-long process for all teams into a project for one small team. We decided that the key was to move all services to run in containers, and have them all do service discovery via dependency injection.

The reality of working at a medium-sized organization meant we had to have a lot of coordination and buy-in. I'll talk about how our roadmapping process both hindered and enabled this project to work at all, and how we used test buildouts and teardowns to integrate early and often.

This wouldn't be an open retrospective without talking about what didn't work well, which was primarily organizational rather than technical. We've learned some lessons on how to run large-scale projects that will hopefully help us on our next one, so I hope that we can provide some hard-earned lessons.

Andrew Bloomgarden is a Principal Software Engineer at New Relic. He's worked on a wide range of projects, including the NRDB distributed event database, charting, the autocompleting NRQL query editor, bare metal hardware provisioning, and supporting multiple regions. He lives in Pittsburgh, Pennsylvania, USA, where he also sings classically in the Mendelssohn Choir of Pittsburgh.

Connect:

@aughr

This IS NOT Fine: Putting Out (Code) Fires

Friday, 11:45–12:15

Emily Freeman, Kickbox

Available Media

So the dumpster is on fire. Again. The site’s down. Your boss’s face is an ever-deepening purple. And you begin debating whether you should join the #incident channel or call an ambulance to deal with his impending stroke.

Fires are never going to stop. We’re human. We miss bugs. Or we fat finger a command — deleting dozens of servers and bringing down S3 in US-EAST-1 for hours — effectively halting the internet. These things happen.

But we can fundamentally change the way we approach fires. And that requires adopting the techniques of industries much older than ours.

Firefighters have clear procedures and a strong hierarchy. The first truck at a scene immediately begins assessing the situation. They evaluate building construction, visible smoke, fire and flow paths. The Incident Commander gives orders to his or her personnel regarding fire attack, calls for additional resources, communicates with those not yet at the scene, and prepares an action plan.

After many years of ghostwriting, Emily Freeman made the bold (insane?!) choice to switch careers into software engineering. Emily is the curator of JavaScript January — a collection of JavaScript articles which attracts 20,000 visitors in the month of January. She works as a developer advocate for Kickbox and lives in Denver, Colorado.

Connect:

@editingemily

What Medicine Can Teach Us about Being On-Call

Friday, 12:15–12:40

Daniel Turner, Shopify

Available Media

Being on-call is a critical and stressful part of being an SRE. While most organizations want and are willing to take steps to reduce the on-call burden, few have used quantitative research methods to try to optimize being on-call.

At the same time, being on-call is a part of most physicians' practice. This is especially true for medical residents (postgraduate doctors in training), who can be on-call as often as once every three days. The field of medicine has undertaken numerous studies and research projects to optimize the handling of on-call duties. These studies have explored work-life balance, ways to decrease the number of critical incidents (which can literally mean life or death), and ways to reduce mistakes.

This talk breaks down the techniques and research that have led to practices that can be adopted for SREs. It also looks at issues that remain unsolved in both fields, like pages sent to the wrong team or those that shouldn’t have been sent at all. Finally, it concludes with words of warning that SREs are not physicians, and as with any interdisciplinary study, we must be mindful of these differences when borrowing techniques.

Daniel Turner is a Sr. Production Engineer at Shopify. He is part of the team building a company-wide platform on top of Kubernetes as well as maintaining Shopify’s data centers. He is married to a wonderful physician who is the inspiration for this talk.

Track 2

Rheinlandsaal Ballroom BC

Tradeoffs in Resiliency: Managing the Burden of Data Recoverability

Friday, 11:00–11:45

Kristina Bennett, Google

Available Media

Almost every service has critical data somewhere, whether it's large-scale blob storage or minimalistic index tables or just the service's own production configuration. The data's sizes and shapes and storage technologies vary widely; and yet, the possibilities for data loss remain, and the same obstacles to recovery consistently appear. This talk reviews the practices that can prepare a service for practical data recoveries, highlights some of the hidden dangers waiting to ambush a recovery attempt, and examines some of the risk/cost tradeoffs that inevitably dominate data integrity coverage, based on the lessons of five years of data integrity tooling and consulting across Google.

Kristina Bennett has worked at Google since 2009. Although she recently joined the Customer Reliability Engineering team in their mission to apply the principles and lessons of SRE at Google towards customers, prior to that she spent five years working on data integrity across Google.

Connect:

@kilobitten

Scalable Coding—Find the Error

Friday, 11:45–12:15

Igor Ebner de Carvalho, Microsoft

Available Media

An important function in our job as SREs is being able to write code and review code, but even here we miss a lot of details and many bugs remain undiscovered. Why is this? A major part of the problem is how our brain works—we only see what we want to see. This talk will explain this important "shortcoming" of our brain and give some insights on how to overcome this problem by addressing common examples of code snippets that can easily lead to bugs and problems at scale.

At Microsoft Azure, Igor has been working with some of the largest and most resilient services on the planet, where he helps drive the necessary changes needed for Azure to be ahead of the current growth curve we are experiencing in the Cloud business.

Connect:

@_igore_

Delete This: Decommissioning Servers at Scale

Friday, 12:15–12:40

Anirudh Ra, Facebook

Available Media

Facebook's datacenter footprint has increased significantly; we now have 12 locations across the USA and Europe. As these new locations come online, we have had to plan for the end-of-life process: decommissioning server racks and replacing them in a timely and streamlined manner. Until recently, decommissioning a cluster entailed a lot of manual work: service oncalls were ticketed by project managers and then migrated off the old hardware onto new hardware, after which hardware was unplugged and rolled out.

We realized the need for automation that covered all of this. We started with a framework that allows for automated service migration, given a list of retiring machines and a list of replacements. We moved on to an automated process that looks at a decommission schedule and kicks off jobs to drain server clusters on time so that old racks can be taken away and new racks rolled into their place.

With this automated process in place, we have learned lessons and figured out how to minimize the time that old servers spend without services running on them before being rolled out of the datacenter. We are also exploring ways to reuse parts of this framework in other ways to increase efficiency.

Customer support tech turned production engineer, Anirudh tries to remember that his job is even now about helping people succeed. He builds frameworks for service owners to run their services with minimal bother and enjoys baking bread, using Oxford commas, and reading fiction, histories, and fictional histories.

Track 3

Hegel Room

Unconference: Developing Effective SRE Teams

Friday, 11:00–12:40

Kurt Andersen, LinkedIn

Space is limited: add this to your schedule if you plan to attend.

Skill acquisition applies to both individuals and teams. For a team, it is helpful to understand where the team is currently executing and potential changes to move to greater levels of effectiveness. For convenience, we can usefully divide skill levels into five ranges: novice, advanced beginner, competent, proficient, and expert.

The precise agenda will be decided by the participants on the day, but the idea in this unconference is to take aspects of SRE practice (such as monitoring, measurement against SLOs, incident management and postmortems, managing toil, etc.) and discuss what these look like at the different skill levels, not so much at an individual level as at an organizational one.

We would also like to discuss how attendees can gauge their own team's and company's state and progress, and develop a plan for growing weak areas. This unconference session is a follow-up to Kurt's talk "The Never-Ending Story of Site Reliability" from SREcon17 Europe and a later version presented at SREcon Asia 2018, "Characterizing and Phases of SRE Practice."

Kurt Andersen is one of the co-chairs for SREcon18 Americas and has been active in the anti-abuse community for over 15 years. He is currently the senior IC for the Product SRE (site reliability engineering) team at LinkedIn. He also works as one of the Program Committee Chairs for the Messaging, Malware, and Mobile Anti-Abuse Working Group (M3AAWG.org). He has spoken at M3AAWG, Velocity, SREcon, and SANOG on various aspects of reliability, authentication, and security.

Connect:

@drkurta

Track 4

Leibniz Room

(Continued from previous session)

Building Blocks of Distributed Systems

Friday, 09:00–12:40

John Looney, Facebook

Space is limited: add this to your schedule if you plan to attend.

Available Media

12:40–14:00

Luncheon

Sponsored by Circonus

Aristoteles, Platon, and Sokrates Rooms

14:00–15:30

Track 1

Rheinlandsaal Ballroom A

Observability for Emerging Infra: What Got You Here Won't Get You There

Friday, 14:00–14:50

Charity Majors, honeycomb.io

Available Media

Distributed systems, microservices, containers and schedulers, polyglot persistence... modern infrastructure is ever more fluid and dynamic, chaotic and transient. Likewise, individual engineering roles can no longer be broken down neatly into software engineers (who write the code) and ops engineers (who deploy the code (and buffer the consequences)). Many teams have already sailed past an event horizon of complexity and found that their old tools and processes no longer work for them. But why, exactly? What was wrong with traditional metrics and logs? Why are they failing to keep pace with modern requirements? Isn't observability just a marketing term for monitoring? And what on earth can we do about it? In this talk, we'll cover the shortcomings of traditional metrics and logs, and the technical and cultural differences between monitoring and observability. We'll also talk about the deep cultural revolution underway from siloed specialties towards software ownership (in every type of engineering role)—and what exactly does that mean when it comes to systems observability?—as well as the technical practices and mental shifts that are absolutely required to keep pace with modern infrastructure.

Charity is an ops engineer and accidental CEO at honeycomb.io. Before this she worked at Parse, Facebook, Linden Lab, etc., on operations and developer tools, and always seemed to wind up running the databases. She is the co-author of O'Reilly's Database Reliability Engineering, and loves free speech, free software, and single malt scotch.

Deploying SRE Training Best Practices to Production: What We Learned (a.k.a. Strapping Jetpacks on Unicorns, the Postmortem)

Friday, 14:50–15:30

Jennifer Petoff, Google Ireland

Available Media

In 2015, Andrew Widdowson gave a talk at SREcon Americas titled “From Zero to Hero: Recommended Practices for Training your Ever-Evolving SRE Teams.” His recommendations were based on nearly a decade of personal experience ramping up new SREs at Google.

Fast forward to 2018. Google SRE now has a global training organization called SRE EDU. In many ways, SRE EDU was charged with developing a formal program to deploy these training best practices into production. Our goal? Spin up a globally consistent and reliable education program for Site Reliability Engineering.

Of course a cornerstone of SRE practice is the blameless postmortem. This talk addresses what we learned when scaling training best practices globally. Along the way, we’ll share tips for small and large organizations alike on how you can learn from our experience and ensure that you deliver an effective training experience for your SREs.

Jennifer Petoff is a Senior Program Manager for Google's Site Reliability Engineering team based in Dublin, Ireland and is one of the co-editors of the best-selling book, Site Reliability Engineering: How Google Runs Production Systems. Jennifer currently co-leads the global SRE EDU training program at Google.

Connect:

@jennski

Track 2

Rheinlandsaal Ballroom BC

Keep Building Fresh: Shopify's Journey to Kubernetes

Friday, 14:00–14:50

Niko Kurtti, Shopify

Available Media

Shopify, in 2014, was one of the first large-scale users of Docker in production. We ran 100% of our production traffic in hundreds of containers. We saw the value of containerization and aspired to also introduce a real orchestration layer.

Fast forward two years to 2016, when instead we had a clumsy and fragile homemade middleware for controlling containers. We started looking at orchestration solutions again and the technology behind Kubernetes intrigued us.

In this talk I'll briefly go over challenges we saw in moving from a traditional host-based infrastructure to a cloud native one, moving not only our core app to Kubernetes but also hundreds of our other apps at the same time. I'll focus on the cluster tooling solutions we've built like controllers, cluster creators, and deploy tools. We've automated things ranging from our DNS to certificates and even complex cluster creations—and all with a real programming language and projects rather than a handful of random scripts.

The ability to extend Kubernetes to fit our needs has been the greatest reward of this project. It's given us a new paradigm to build upon rather than relying on old patterns.

Niko Kurtti is a tinkerer with a keen interest in container technologies, working on the cloud platform team at Shopify and building an internal PaaS on top of Kubernetes.

The Myth of Cloud Agnosticism

Friday, 14:50–15:30

Corey Quinn, Last Week in AWS

Available Media

In theory, the idea of having infrastructure that can seamlessly deploy between different cloud providers is a wonderful concept. Who wouldn't love to migrate workloads seamlessly between providers for a variety of reasons? In theory, a tiger with an anger management problem is just a scaled up house-cat.

This talk explores the practical reality of cloud agnosticism, with all of its warts. The financial, technical, and operational complexities introduced by multiple providers can take companies by surprise. Come explore the basic truth of "however much you hate your cloud provider, you will hate the migration process far more."

Currently a Cloud Economist at the Quinn Advisory Group, and an advisor to ReactiveOps, Corey has a history as an engineering director, public speaker, and cloud architect. He specializes in helping companies address horrifying AWS bills, and curates LastWeekinAWS.com, a weekly newsletter summarizing the latest in AWS news, blogs, and tips, sprinkled with snark. Outside of his professional work, Corey is known for overdressing, telling entertaining stories, and carrying a cigarette case full of tiny umbrellas.

Connect:

@Quinnypig

Track 3

Hegel Room

SparkPost: The Day the DNS Died

Friday, 14:00–14:50

Jeremy Blosser, SparkPost

Available Media

More than 30% of the world's non-spam email is sent using SparkPost's technology, and our cloud service sends over 15 billion messages per month. Deploying that service on AWS has provided all the expected cloud benefits of flexibility and scalability, but also unique challenges due to email's unique profile and needs.

Our DNS needs are particularly extreme. Our infrastructure currently has to support 8,000 DNS queries per second. We have experienced several issues deploying a service model that can meet this need, and a major DNS-related outage in May of 2017 caused significant pain for our customers and sent us back to the drawing board once again. We recently completed a ground-up DNS service redesign that includes dedicated VPCs with optimized security groups and ACLs, distribution across tiers and availability zones, resolver tuning and custom configurations, and multiple local caching resolvers per instance.

In this talk, we will discuss our history addressing this challenge and lessons learned, the May outage event itself, and our current architecture's design and results. Attendees will gain an understanding of what it takes to host a robust DNS service in AWS at a scale beyond what is currently natively supported by AWS' resolver services.

Jeremy Blosser has worked in systems administration and engineering for 20 years, and most of that time has included a focus on reliably delivering email and other traffic at scale. He is currently the Principal Operations Engineer at SparkPost, responsible for technical architecture oversight and keeping the cloud service operating and healthy. He lives in Texas with his wife and five kids.

Unikernels—The New Black

Friday, 14:50–15:30

Hristo Mohamed, CERN

Available Media

The Operations world has never been more exciting! Long gone are the days of monolithic systems being responsible for multiple services, and one cannot help but smile with a bit of sentiment when reading books of old that gave advice along the lines of "If possible, try to run a single service on a machine."

From the rise of VMs some 15 years ago to the rise in popularity of containers in the last few years, the need for performant, easily scalable, and manageable solutions to the age-old problem of running our software reliably, often on a massive scale, is obvious and is being addressed by both academia and industry.

Containers have been the new black for quite some time, but are they the future?

In this talk, I am going to:

Explore the amazing and exciting world of unikernels! Unikernels are not a new technology and have been around for quite a while, but they are gaining popularity as an application container.
Explore some "young" and "old" unikernel projects.
Talk about the growing unikernel community and just how easy it is nowadays to start experimenting with unikernels.

Hristo Mohamed really likes system administration and has decided to make this his main activity during the day. He works at CERN in the LHCb Online Team as a System Administrator and mainly deals with making things run as smoothly as possible. His main activities are automation and monitoring of the LHCb Online Cluster. He also tries to make sure that Linux is on its best behavior.

Track 4

Leibniz Room

Lessons Learned from Our Main Database Migrations at Facebook

Friday, 14:00–14:50

Yoshinori Matsunobu, Facebook

At Facebook, we created a new MySQL storage engine called MyRocks. Our objective was to migrate one of our main databases (UDB) from compressed InnoDB to MyRocks and reduce the amount of storage and number of servers used by half. In August 2017, we finished converting from InnoDB to MyRocks in UDB. The migration was very carefully planned and executed, and it took nearly a year, but that was not the end of the migration. SREs needed to continue to operate MyRocks databases reliably. It was also important to find any production issue and to mitigate or fix it before it became critical. Since MyRocks was a new database, we encountered several issues after running in production. In this session, I will introduce several interesting production issues that we have faced, and how we have fixed them. Some of the issues were very hard to predict. These will be interesting for attendees to learn too.

Attendees will learn about the following topics:

What MyRocks is, and why it was beneficial for large services like Facebook
What should be considered for a production database migration
How a migration should be executed
Four to six real production issues

Yoshinori Matsunobu is a Production Engineer at Facebook, where he leads the MyRocks project and its deployment. Yoshinori has been around the MySQL community for over 10 years. He was a senior consultant at MySQL Inc. from 2006 to 2010. Yoshinori created a couple of useful open source products and tools, including MHA (an automated MySQL master failover tool) and quickstack.

Have You Tried Turning It Off (and Not On Again)?

Friday, 14:50–15:30

Josh Deprez, Google Australia

Available Media

Productivity may not be a simple function of person-hours, but having zero person-hours available because the persons are preoccupied with legacy services will kill a launch or landing. Turning legacy off is important for sub-linear headcount growth.

The most promising targets for turndown are:

Old systems riddled with technical debt;
Internal tools and services nobody cares for anymore;
Yesterday's formerly-shiny ${THING} that ${THING2} has replaced;
Services at least one or two layers beneath what really matters to the organisation.

But there are always blockers even when the goal is clear and the motives are strong:

The system is in the critical path for something critical;
Other systems rely on it in some long-forgotten way;
The organisation's current "Death Star megaproject" is using it and you can't shift their migration timeline;
It offers some useful feature that the replacement doesn't have.

In this talk I will explain why turning things off (for good) is desirable, describe a few services my team was responsible for turning down, how they related to other services with illustrations, what made turning down possible, and how we got there in the end (or didn't get there).

Josh is a senior SRE at Google, where he is TL in a team internally renowned for turning things off. He has a PhD in Mathematics, a fact which is not relevant to the talk topic.

Connect:

@DrJosh9000

15:30–16:00

Break with Refreshments

Rheinlandsaal Ballroom Foyer

16:00–17:40

Closing Plenary

Rheinlandsaal Ballroom

SRE for Good: Engineering Intersections between Operations and Social Activism

Friday, 16:00–16:40

Liz Fong-Jones, Google, and Emily Gorcenski

Available Media

Our job as engineers does not stop purely with adherence to Service Level Objectives. A service that does a reliable job of harming people, exacerbating injustices, or excluding marginalized groups is not at all a service worth building and maintaining. Technology is poised to change the world, for good or for ill, and engineers of all kinds share a responsibility to ensure that their work is "for the public good." We can apply SRE practices to advocate for justice in the products we build and in the broader industries we work in.

Attendees will learn about the parallels between SRE and social activism movements, and how they can advocate for changing their workplaces and the world.

Liz is a Staff Site Reliability Engineer at Google and works on the Google Cloud Customer Reliability Engineering team in New York. She lives with her wife, metamour, and two Samoyeds in Brooklyn. In her spare time, she plays classical piano, leads an EVE Online alliance, and advocates for transgender rights.

Connect:

@lizthegrey

Can I Tell You a Secret? I See Dead Systems

Friday, 16:40–17:00

Avishai Ish-Shalom, Aleph VC

Available Media

We live in a world of shiny new tech introduced all the time. Heck, we even made cars that drive themselves. Yet all around us, unseen and hidden, lurk ancient, forgotten systems. They're in our kernels, our terminals and our CPUs... They are everywhere. This is the story of how remnants of dead systems continue to haunt us in present days and why we can't seem to get rid of them.

Legacy is a fact of life, riddled with hacks and weird workarounds that have survived 30+ years. We all complain about it, yet we are building tomorrow's legacy systems today. The aim of this talk (besides being amusing) is to raise awareness of the end-of-life phase of software products.

Avishai is a veteran operations and software engineer with years of high-scale production experience. At present, Avishai helps growing startups and the Israeli high-tech ecosystem as Engineer in Residence at the Aleph VC fund. In his spare time, Avishai spreads weird ideas and conspiracy theories like DevOps and Operations Engineering.

Connect:

@nukemberg

Junior Engineers Are Features, Not Bugs

Friday, 17:00–17:40

Kate Taggart, HashiCorp

Available Media

There are many benefits to hiring junior engineers, but when it comes to teams responsible for production infrastructure, we default to thinking such risky environments are no place for newbies. However, "the things we do are too risky to have junior engineers working on them" often instead means "we haven’t invested properly in resiliency." Hiring junior engineers onto production-critical teams can guide you to reduce risk to your production systems by highlighting needed improvements you should be making regardless. We’ll walk through three categories—architecture, process, and tooling—and discuss the specific ways that a junior engineer may illustrate the need for improvement in each. We'll also discuss concrete approaches to effectively hiring and onboarding junior engineers on SRE teams specifically.

Kate has worked in multiple areas of tech over the past decade, ranging from power grid resiliency to fintech to enterprise software. Kate has managed a variety of teams across the devops spectrum at New Relic and Simple, and is now at HashiCorp helping build tools for other companies and teams to manage their own infrastructure.

Connect:

@qkate

17:40–17:45

Closing Remarks

Rheinlandsaal Ballroom

Program Co-Chairs: Laura Nolan; Avleen Vig, Facebook
