SREcon19 Europe/Middle East/Africa Program Grid
Wednesday, 2 October
07:45–08:45
Morning Coffee and Tea
The Forum
08:45–10:30
The Liffey B
Opening Remarks
Program Co-Chairs: Emil Stolarsky, Incident Labs, and Murali Suriar, Google
The SRE I Aspire to Be
Yaniv Aknin, Google Cloud
Yaniv Aknin dives into the secret sauce for a successful SRE organization: high-quality measurements of reliability. He explains why measuring reliability is crucial (and why it’s so hard), shares a couple of tips for getting it right, and explores why it's key for SREs doing Engineering work.
Yaniv Aknin, Google Cloud
Yaniv Aknin is Google Cloud Platform’s lead for quantitative reliability. He works with product managers, developers, and fellow SREs to create availability and performance metrics that accurately model customers’ experience, then optimizes those metrics toward the right reliability/cost point. He’s been an SRE with Google since 2013, working on network infrastructure and several parts of the Google Cloud Platform. He has over two decades’ experience solving business problems in corporate, early startup, government, and nonprofit organizations. Outside of work, he enjoys travel, food, improv theater, and pop-sci, especially behavioral economics.
A Systems Approach to Safety and Cybersecurity
Nancy Leveson, MIT
Nancy Leveson, MIT
10:30–11:00
Break with Refreshments
The Forum
11:00–12:30
The Liffey B
SLOs for Data-Intensive Services
Yoann Fouquet, Booking.com
Designing and maintaining a search engine service can be challenging. One of the challenges is to set insightful SLOs where standard availability/latency SLOs do not fit. We will go through our journey towards defining a monitoring process for such services at Booking.com, from ineffective availability/latency SLOs to the current setup and all its advantages; travelling in a world where providing accurate and consistent responses can be as important as availability.
Yoann Fouquet, Booking.com
Yoann is a Site Reliability Engineer at Booking.com, working on core services within the Booking.com infrastructure.
Latency SLOs Done Right
Heinrich Hartmann, Circonus
Latency is a key indicator of service quality, and important to measure and track. However, measuring latency correctly is not easy. In contrast to familiar metrics like CPU utilization or request counts, the "latency" of a service is not easily expressed in numbers. Percentile metrics have become a popular means to measure request latency, but they have several shortcomings, especially when it comes to aggregation. The situation is particularly dire if we want to use them to specify Service Level Objectives (SLOs) that quantify performance over longer time horizons. In this talk, we will explain these pitfalls and suggest three practical methods for implementing effective latency SLOs.
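The aggregation pitfall mentioned in the abstract can be illustrated with a small sketch. The latency values, the 250 ms threshold, and the helper names below are made up for illustration, and the counting approach at the end is just one aggregation-friendly option, not necessarily one of the three methods presented in the talk.

```python
def percentile(samples, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(samples)
    rank = (q * len(ordered) + 99) // 100  # integer ceil(q * n / 100)
    return ordered[max(rank, 1) - 1]


def good_count(samples, threshold_ms):
    """Number of requests that finished within the threshold."""
    return sum(1 for s in samples if s <= threshold_ms)


# Hypothetical latencies in milliseconds for two nodes of one service.
node_a = [10] * 990 + [500] * 10  # busy node, mostly fast
node_b = [200] * 10               # quiet node, uniformly slow

# Averaging per-node percentiles ignores how many requests each node served.
avg_of_p99 = (percentile(node_a, 99) + percentile(node_b, 99)) / 2

# The percentile of the merged raw data is the value we actually wanted.
true_p99 = percentile(node_a + node_b, 99)

# Counts, unlike percentiles, can simply be added across nodes and time windows.
threshold_ms = 250
good = good_count(node_a, threshold_ms) + good_count(node_b, threshold_ms)
total = len(node_a) + len(node_b)

print(avg_of_p99, true_p99, good, total)
```

Here the averaged p99 is well below the p99 of the merged data, while the good/total request counts combine exactly and can be compared directly against an SLO target.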
Heinrich Hartmann, Circonus
Heinrich Hartmann is the Analytics Lead at Circonus. He is driving the development of analytics methods that transform monitoring data into actionable information as part of the Circonus monitoring platform. In his prior life, Heinrich pursued an academic career as a mathematician. Later he transitioned into computer science and worked as a consultant for a number of different companies and research institutions.
Building a Scalable Monitoring System
Molly Struve, Kenna Security
A year ago, my company's monitoring setup was a disaster! We had 6 different monitoring tools sending alerts all over the place. In this talk, I will share how we overhauled our entire monitoring system and created a single, centralized, easy-to-use system that fits all of our needs. Not only does it get the job done, but because it is so simple to use, developers have bought into the system and are actively helping to improve it as well.
Molly Struve, Kenna Security
Molly Struve is the Lead Site Reliability Engineer at Kenna Security. She joined Kenna in 2015 and has had the opportunity to work on some of the most challenging aspects of Kenna’s code base. This includes scaling Elasticsearch, sharding MySQL databases, and creating infrastructure that can grow as fast as Kenna's business. When not making code run faster, she can be found fulfilling her need for speed by riding and jumping her show horses.
The Liffey A
A Tale of Two Rotations: Building a Humane & Effective On-Call
Nick Lee, Uber
Everyone wants to provide excellent and reliable service to their customers, but the world is a messy place. Things will break for reasons inside and outside of your control, and for the most unexpected reasons. At the end of the day, someone is going to be the on-call and step in to restore order to keep customers happy.
The question is, how do you keep your on-call as happy as your customers?
This talk examines how highly critical on-call rotations that protect core functionality can be made extremely effective and low-stress, and how completely ordinary rotations can get out of hand. We’ll discuss how best practices from the first rotation were successfully applied to the second, and how to apply them in your own rotation.
Nick Lee, Uber
Nick Lee has worked at Uber Amsterdam for three years, starting off as a backend engineer building user-facing features and transitioning over to Production Engineering as he discovered how great reliability and toil-reduction work is.
Support Operations Engineering: Scaling Developer Products to the Millions
Junade Ali, Cloudflare
Large scale internet infrastructure companies are increasingly relied upon by other engineering organisations, from self-serve customers to large enterprise organisations. The duty of helping customers' SRE and engineering teams diagnose complex and stressful issues will likely rest with technical support. Support Operations Engineers complement this by treating support as an optimisation problem with engineering solutions.
This talk describes the essential principles that Cloudflare’s Support Operations Engineers use to scale developer support in a large-scale internet infrastructure company serving 16+ million customer domains and more than 10% of global HTTP requests, whilst driving dramatic improvements in operational efficiency and delivering exceptional business value.
This talk will cover how Stateless Testing is used to introduce proactive support and improve customer retention, how Safety Engineering strategies are used with Machine Learning to automate customer support and how Operations Research with alerting data is used to create a next-gen Security Operations Centre.
Junade Ali, Cloudflare
Junade Ali is an Engineering Manager at Cloudflare, focusing on building the Support Operations Group. He has previously worked on high-integrity software for safety critical applications and previously served as Lead Developer of the largest digital agency in the UK (by headcount). He has publications in the fields of combinatorics, software engineering and cryptography. He has an MSc in Computer Science with an award-winning thesis and is extending his work as a part-time PhD student.
The Unmonitored Failure Domain: Mental Health
Jaime Woo, Incident Labs
As stigma around mental health slowly peels away, a lot of our current conversations are centered around this individual model: Operators are responsible for watching their own stress levels, well-being, and avoiding burnout.
Yet, mental health can be contagious among team members. Studies show that if one team member is feeling stressed, anxious, or burnt out, that feeling will slowly spread to their co-workers. We must start addressing mental health on a team, organizational, and systemic level.
Attendees will leave with a new perspective of how they can use existing SRE approaches to improve mental health (e.g. SLOs) and a set of strategies for improving mental health (e.g. self-compassion and mindfulness). They’ll understand how the benefits from improving team well-being are widespread, and that, just as there are patterns for ensuring our systems remain resilient in the face of pressure, we can arm our teams with techniques as well.
Jaime Woo, Incident Labs
Jaime began his career as a molecular biologist before following his passion for communications, working at DigitalOcean, Riot Games, and Shopify, where he launched the engineering communications function. He is an award-nominated writer, focusing his work on the locus between culture and technology, with recent works in the Advocate, the Globe and Mail, and StarTrek.com. He has spent two years learning about mental health and mindfulness. He is also an avid lover of dumplings.
Liffey Hall 2
Automating HA Deployments with BGP, IPv6, and Anycast
John Studarus, JHL Consulting LLC
Using BGP to load-balance and fail over an application to increase site reliability has traditionally been expensive and tricky. In this workshop, we’ll walk through setting up a multi-host, load-balanced web application with failover using BGP and open source technologies, including Terraform, BIRD (an open source router), and IPv6.
John Studarus, JHL Consulting LLC
John merges his interests in computing infrastructure, networking, and software security. His background includes leading product teams, writing prototype code and examining distributed systems at Fortune 500s and startups alike. He brings a rare combination of technical expertise and product strategy and is just as comfortable writing code as he is developing a product strategy.
12:30–14:00
Luncheon
The Forum
14:00–15:30
The Liffey B
Control Theory for SRE
Ted Hahn, TCB Technologies, and Mark Hahn, Ciber Global
Control Theory is a long-established and well-studied discipline in engineering. Nearly every large scale industrial process has dedicated control engineers, who create and maintain safety and quality systems by ensuring that parameters remain within bounds, or that alerts fire appropriately when they do not.
This session will teach you how to create a PID (Proportional, Integral, Derivative) controller to autoscale your Kubernetes deployment based on a custom target. This controller ensures smooth scale-up and scale-down.
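As a rough illustration of the idea, a PID controller that adjusts a replica count might be sketched as below. This is a plain-Python toy, not the controller built in the session: the gains, the per-replica target, the load figure, and the helper `next_replicas` are all assumptions, and no Kubernetes API is called (applying the replica count to a deployment is left out).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PID:
    kp: float  # proportional gain
    ki: float  # integral gain
    kd: float  # derivative gain
    integral: float = 0.0
    prev_error: Optional[float] = None

    def update(self, error: float, dt: float) -> float:
        """Return the control output for the current error."""
        self.integral += error * dt
        if self.prev_error is None:
            derivative = 0.0  # no history on the first sample
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def next_replicas(current: int, output: float, min_r: int, max_r: int) -> int:
    """Apply the controller output as a replica delta, clamped to bounds."""
    return max(min_r, min(max_r, round(current + output)))


# Toy simulation: hypothetical total load shared evenly by the replicas.
total_load = 1000.0         # requests per second (assumed)
target_per_replica = 100.0  # custom target (assumed)
pid = PID(kp=0.02, ki=0.005, kd=0.0)  # gains chosen arbitrarily
replicas = 2
for step in range(10):
    measured = total_load / replicas
    error = measured - target_per_replica  # positive means overloaded
    replicas = next_replicas(replicas, pid.update(error, dt=1.0), 1, 50)
    print(step, replicas)
```

The clamp keeps the replica count within bounds no matter how large the controller output is; tuning the three gains is what determines how smoothly the count settles.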
Ted Hahn, TCB Technologies
Ted Hahn is an experienced Site Reliability Engineer, having worked at Google, Facebook and Uber, and most recently as the primary SRE for Houseparty, maintaining an infrastructure used for thousands of QPS by millions of users in a company of fewer than 50.
Mark Hahn, Ciber Global
Mark Hahn is the Director of the DevOps Practice at Ciber Global where he enables teams and organizations to deliver applications faster and with higher quality in a digitally connected world.
Eventually Consistent Service Discovery
Suhail Patel, Monzo
Traditionally, service discovery has leaned towards strong consistency. If you are querying an endpoint, ideally you don't have to deal with split brain on the set of active healthy nodes. This talk will demonstrate how systems such as Envoy are shifting away from the strong consistency coordinator model and keeping service discovery out of the hot path of the data plane.
We will cover systems like ZooKeeper and Raft from first principles to discuss how systems like Kafka and etcd handle their service discovery. We will talk about the practicalities of scaling systems like Envoy for tens of thousands of endpoints in a constantly shifting environment.
Suhail Patel, Monzo
Suhail is a Backend Engineer at Monzo focused on working on the core Platform. His role involves building and maintaining Monzo's infrastructure which spans more than 1000 microservices and leverages key infrastructure components like Kubernetes, Cassandra, Etcd, Envoy Proxy and more.
He focuses specifically on investigating deviant behaviour and ensuring services continue to work reliably in the face of a constantly shifting environment in the cloud.
Network Monitor: A Tale of ACKnowledging an Observability Gap
Jason Gedge, Shopify
In the Fall of 2018 we spent nearly 6 weeks debugging Redis connection issues from our core app, pulling in many engineers along the way. The smoking gun to get our cloud provider involved was a high number of TCP retransmits. After bringing this evidence to them, their network engineers were able to fix the issue.
This incident showed us that we had an observability gap, due to lack of access and monitoring in our cloud environment. To this end, we built network monitor, a daemon running on all of our nodes to collect relevant network data. This daemon has evolved into a generic eBPF (extended Berkeley Packet Filter) orchestrator. In this talk, you'll learn about what we've built, and should walk away understanding why monitoring your network is a valuable endeavour, as well as how your teams can use eBPF to improve your observability stack.
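For readers who want a feel for the signal involved, the sketch below estimates a TCP retransmit ratio from two snapshots of the Linux `/proc/net/snmp` counters. It is not the eBPF-based daemon described in the talk; the helper names and the sample counter values are invented for illustration.

```python
def tcp_counters(snmp_text):
    """Parse the two 'Tcp:' lines of /proc/net/snmp into a name -> int dict."""
    rows = [line.split()[1:] for line in snmp_text.splitlines()
            if line.startswith("Tcp:")]
    header, values = rows[0], rows[1]  # first line names, second line values
    return dict(zip(header, map(int, values)))


def retransmit_ratio(before, after):
    """Fraction of segments sent between two snapshots that were retransmits."""
    sent = after["OutSegs"] - before["OutSegs"]
    retrans = after["RetransSegs"] - before["RetransSegs"]
    return retrans / sent if sent else 0.0


def sample(out_segs, retrans_segs):
    """Build a made-up snapshot in /proc/net/snmp format (subset of fields)."""
    return ("Tcp: InSegs OutSegs RetransSegs\n"
            f"Tcp: 0 {out_segs} {retrans_segs}\n")


before = tcp_counters(sample(1000, 10))
after = tcp_counters(sample(2000, 60))
print(retransmit_ratio(before, after))

# On a Linux host, a real snapshot could be taken with:
#   with open("/proc/net/snmp") as f:
#       counters = tcp_counters(f.read())
```

Host-wide counters like these only show that retransmits are happening; attributing them to specific connections is where per-connection tooling such as eBPF becomes useful.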
Jason Gedge, Shopify
Jason is a Staff Production Engineer on the service communication team at Shopify. In the past, he spearheaded the first iteration of Shopify’s self-serve cloud platform and is now rolling out their first cloud service communication mesh. On the side, he is keeping busy in the #crazy-cat-people Slack channel and is working on becoming a next level eBPF wizard. Before Shopify, he was responsible for developer productivity at YouTube's San Bruno office.
The Liffey A
Being Reasonable about SRE
Vítek Urbanec, Unity Technologies
When companies try to adopt SRE they're often just following a trend. They're doing so without previous analysis of the situation, expecting magic to start happening from day one. By the time they learn that this step hasn't really given them what they hoped for, there's a ton of frustration and bad taste. Let's look at how to explore what SRE is going to be doing in your company and how to build strong relationships with other teams.
Vítek Urbanec, Unity Technologies
Vítek joined the SRE movement with a background in systems architecture, consulting and infrastructure automation. He likes bridging the gap between operations and service owners to get the most out of the DevOps ideals. He also leads the Unity DevMetal band in Helsinki.
From Nothing to SRE: Practical Guidance on Implementing SRE in Smaller Organisations
Matthew Huxtable, Sparx
2019 is a brilliant time for SRE, and it's time to bring the field to organisations of every size! Smaller tech teams (<= ~50 engineers) often encounter unique technical and management challenges during SRE adoption. For example, a full on-call load may spell burnout, yet a typical SRE approach to risk may cause concern. Moreover, misaligned incentives impede operational excellence whenever handing back the pager could spell the end of the service—and the organisation!
This talk follows the journey of an SRE team built from scratch. Starting with "Is SRE right for you?", we explore practical technical and team guidance to gain buy-in for SRE and usher in cultures of continual experimentation. We discuss challenges and blind spots which may cause surprise for teams at all stages of maturity.
Whether you are preparing to establish a reliability team or you already practise SRE, the practical guidance in this talk will ensure your efforts are a success.
Matthew Huxtable, Sparx
Matt founded and now leads the Site Reliability Engineering team at Sparx, an evidence-based education technology and data science company. With a background in systems engineering and Computer Science, he spends his days maintaining and promoting reliability of the core Sparx platform, continually trying to put himself out of a job through pursuit of automation, and relentlessly encouraging his peers to do likewise. When he's not convincing Kubernetes who's really at the helm, you can find him being fascinated by aviation or dreaming about one day literally exploring the clouds as a qualified skydiver.
My Life as a Solo SRE
Brian Murphy, G-Research
2015 was the worst year of my professional career. Between botched fail-forward releases, major customer-impacting incidents and weakly supported features, I was being worn down. And then in 2016, I helped introduce the organisation to SRE culture. I became the first SRE and helped change the engineering organisation. Over the next few months and years, I championed SLIs, drove down MTTA/MTTR and improved release cadence. In this talk, you will hear my tales of woe, but I will leave you with advice and tips on how to make your life and your organisation better.
Brian Murphy, G-Research
Brian Murphy is an SRE by nature and a manager by training. He currently works for G-Research where he built and leads an SRE team. Previously, he was an SRE Manager for a startup that was bought by Cisco. He currently lives with his wife, son and sassy dog in West London. His superpower is the ability to find the best coffee and vegan food in any city.
Liffey Hall 2
SRE Classroom, Or, How to Design a Reliable Distributed System in 3 Hours
Alex Perry, Google LLC, and Andrew Suffield, Goldman Sachs
This workshop ties together academic and practical aspects of systems engineering, with an emphasis on applying principles of systems design to a production service. We will analyze the service to quantify its performance, and iteratively improve the design.
Participants will work together in small groups to sketch out the design, identify components and their relationships, and to assess the suitability of the design to the system’s Service Level Objective (SLO). Participants will have a system design and bill of materials at the conclusion of this workshop.
Participants will not need laptops or specific coding experience; participants will need enthusiasm for collaborating in small groups, and for discussion-based problem-solving. Participants will come away with an understanding of the principles of iterative systems engineering, popularly known as “Non-abstract large systems design”.
This workshop covers material critical for SRE, an increasingly broad field that combines software engineering and systems design.
Alex Perry, Google LLC
Alex Perry is a Staff SRE in Los Angeles and has been at Google for the last 13 years. He has worked on many layers of network infrastructure, from fabrics to beyond corp services, as well as social and other applications. Recently, he has been working on migrating internal enterprise systems from existing virtualization infrastructure onto Google Cloud Platform. His interests are reliability, relevant monitoring, and disaster preparedness.
Andrew Suffield, Goldman Sachs
Andrew Suffield is an SRE at Goldman Sachs in London. They tend to focus on production automation, distributed systems design, and teaching.
Liffey Meeting Room 2
Implementing Distributed Consensus
Dan Lüdtke and Kordian Bruck, Google
May we introduce "Skinny," an education-focused, distributed lock service.
With the help of Skinny, we will:
- briefly look at the Paxos protocol
- see an example of a typical Paxos run
- design a simple distributed consensus protocol
- learn the tricky parts of implementing our simple distributed consensus protocol
- gradually move from theory-level to coding-level, solving small challenges (network, availability, fault-tolerance) along the way
This short workshop addresses engineers who have had little exposure to the inner workings of distributed consensus, who want to learn about distributed consensus as they start building distributed systems, and who have worked with ready-made distributed consensus solutions such as ZooKeeper and etcd but strive to understand the underlying theory as well.
Disclaimer: This work is not affiliated with any company (including Google) and is purely educational!
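To give a flavor of the Paxos run mentioned in the list above, here is a minimal single-decree Paxos sketch with in-memory acceptors. It is not Skinny and not the workshop's code: the class and function names are hypothetical, calls are synchronous, and networking, retries, and failures are all left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Acceptor:
    promised: int = 0                 # highest proposal number promised
    accepted_n: int = 0               # number of the accepted proposal (0 = none)
    accepted_v: Optional[str] = None  # value of the accepted proposal

    def prepare(self, n: int):
        """Phase 1: promise to ignore proposals numbered below n."""
        if n > self.promised:
            self.promised = n
            return True, self.accepted_n, self.accepted_v
        return False, self.accepted_n, self.accepted_v

    def accept(self, n: int, value: str) -> bool:
        """Phase 2: accept unless a higher-numbered promise was made."""
        if n >= self.promised:
            self.promised = n
            self.accepted_n = n
            self.accepted_v = value
            return True
        return False


def propose(acceptors: List[Acceptor], n: int, value: str) -> Optional[str]:
    """Run one proposal round; return the chosen value or None on failure."""
    majority = len(acceptors) // 2 + 1
    replies = [a.prepare(n) for a in acceptors]
    promises = [(an, av) for ok, an, av in replies if ok]
    if len(promises) < majority:
        return None
    # If any acceptor already accepted a value, adopt the highest-numbered one.
    prev_n, prev_v = max(promises, key=lambda p: p[0])
    if prev_n > 0:
        value = prev_v
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= majority else None


cluster = [Acceptor() for _ in range(3)]
print(propose(cluster, 1, "alice"))  # first proposer gets its value chosen
print(propose(cluster, 2, "bob"))    # later proposer must adopt the chosen value
print(propose(cluster, 1, "carol"))  # stale proposal number is rejected
```

The key safety property is visible in the second call: once a value has been chosen by a majority, a later proposer with a higher number learns it during the prepare phase and re-proposes it instead of its own.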
Dan Lüdtke, Google
Dan is a Site Reliability Manager in Munich. He contributes to open source software projects, regularly helps to organize large hacker events, runs an autonomous system for fun, and dreams of space travel. Prior to Google, Dan served his country, worked as a security consultant, joined a start-up, and wrote a book about IPv6. Dan earned a master's degree in Computer Aided Engineering from the Munich University of the German Federal Armed Forces.
Kordian Bruck, Google
As a Site Reliability Engineer, Kordian is touching production systems every day to prevent disasters. He loves iterating over architecture and organization structure to overcome Conway's Law. Pizza and funny cat videos enabled him to get a master's degree in computer science from the Technical University of Munich.
15:30–16:00
Break with Refreshments
The Forum
16:00–17:30
The Liffey B
Zero Touch Prod: Towards Safer and More Secure Production Environments
Michał Czapiński and Rainer Wolafka, Google Switzerland
Many outages are caused by human mistakes when interacting with the production environment: typos when running the tools, accidentally running tests against production systems, errors in configuration files, etc. In addition, there is a risk of such outages being caused by malicious insiders. Zero Touch Prod (ZTP) mitigates those risks by providing principles and tooling to make all production changes via automation, safe proxies, or audited break-glass.
Michał Czapiński, Google Switzerland
Michał Czapiński is a senior SRE focusing on security and safety of the compute infrastructure at Google. Before joining Google Switzerland in 2014, he received a PhD in High-Performance Computing at Cranfield University (UK). Outside of work he loves mountaineering, race snowboarding, and windsurfing.
Rainer Wolafka, Google Switzerland
Rainer Wolafka is a Site Reliability Manager focusing on planet scale technical infrastructure and production safety at Google. Before joining Google Switzerland in 2015, Rainer worked on distributed file systems for IBM's Research and Development organization. Outside of work he likes mountain biking, snowboarding and playing the drums.
Zero-Downtime Rebalancing and Data Migration of a Mature Multi-Shard Platform
Justin Li and Florian Weingarten, Shopify
Application-level sharding is a common pattern for scaling multi-tenant architectures. However, once it has been put into production, you inevitably run into follow-up problems that aren't as widely discussed. In this talk, we will share years' worth of experience and connect the dots to outline a full sharding solution that goes beyond the initial implementation and deployment. At the core of our toolkit is the "binlog", an event stream used by the MySQL replication protocol. The tooling we've built on top of this idea is used in production at Shopify to balance hundreds of MySQL shards for uniform load distribution and to isolate heavy tenants from each other, and it has in the past been used to safely transfer the entire dataset of our over 800,000 tenants from physical datacenters to a cloud environment. All of this happens online, without downtime, and is practically invisible to the tenants.
Justin Li, Shopify
Justin is a production engineer at Shopify. He likes performance problems, parsers, and distributed systems, and has worked on many aspects of Shopify’s production system, notably resiliency, sharding, flash sale preparations, scriptable load balancing and routing, and optimizing Shopify’s storefront rendering engine.
Florian Weingarten, Shopify
Florian is a production engineer at Shopify. For the past 5 years, he has been working on all aspects of Shopify's sharding and multi-tenancy stack, including resiliency, region failovers, load distribution and isolation, shard rebalancing, as well as Shopify's migration to Google Cloud.
The Liffey A
All of Our ML Ideas Are Bad (and We Should Feel Bad)
Todd Underwood, Google
The vast majority of proposed production engineering uses of Machine Learning (ML) will never work. They are structurally unsuited to their intended purposes. There are many key problem domains where SREs want to apply ML but most of them do not have the right characteristics to be feasible in the way that we hope. After addressing the most common proposed uses of ML for production engineering and explaining why they won't work, several options will be considered, including approaches to evaluating proposed applications of ML for feasibility. ML cannot solve most of the problems most people want it to, but it can solve some problems. Probably.
Todd Underwood, Google
Todd Underwood is an SRE Director for Google in Pittsburgh and leads Machine Learning for SRE for Google. He's been working on making ML work better at Google since 2009. It's still not done. He last presented at SREcon in 2015.
Fast, Available, Catastrophically Failing? Safely Avoiding Behavioral Incidents in Complex Production Systems
Ramin Keene, fuzzbox.io
Operators are increasingly being asked to release and manage services whose behavior is much harder to reason about than that of traditional application services. Data products, model-based machine learning services, ensemble models, and large microservice architectures are founded on deliberate complexity, such that their availability is only correctly measured via an SLA/QoS around their behavior, and is also threatened by the unknown-unknown emergent behavior arising from their interactions.
Incidents move from being about general service availability to being about behavior.
Safely operating these types of service in production presents a host of challenges that even the most experienced SRE may not expect. Severe incidents with stable infrastructure, invisible error rates, and IMPROVING response times, while the business fails catastrophically, losing millions of dollars? Absolutely!
Ramin Keene, fuzzbox.io
Ramin has helped enterprises large and small to put machine learning, A/B testing, and data science products into production. He’s made ALL the mistakes and then some, helping companies lose thousands, if not millions, of dollars along the way. He is currently based in Los Angeles and spends his time working on adversarial experimentation tools that target infrastructure and release artifacts to help teams inspect and learn about how their systems behave AFTER they have been baked and released.
Liffey Hall 2
(Continued from previous session)
SRE Classroom, Or, How to Design a Reliable Distributed System in 3 Hours
Alex Perry, Google LLC, and Andrew Suffield, Goldman Sachs
17:30–18:30
Social Hour
The Forum
Sponsored by Microsoft Azure
Come for the refreshments and the opportunity to meet and network with other attendees, speakers, and conference organizers at the Wednesday Social Hour.
Thursday, 3 October
08:00–09:00
Morning Coffee and Tea
The Forum
Sponsored by Bloomberg
09:00–10:30
The Liffey B
Advanced Napkin Math: Estimating System Performance from First Principles
Simon Eskildsen, Shopify
Ever stood in front of the whiteboard with a group of your co-workers designing a system, but found yourself in that awkward position where none of you were able to answer whether something would be fast enough? In this talk, you'll learn how to combine base rates to answer challenging questions in a jiffy on pull requests, technical reviews, or in meetings, for example: Can we, in the 5-second critical window of a regional failover, do a snapshot-and-restore of Memcached to the other region? How much overhead should we expect a proxy to incur? How far is this system from the optimum performance? With the methodology from this talk, you'll learn to quickly estimate expected system performance instead of building the systems first!
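A back-of-the-envelope estimate for the Memcached failover question might look like the sketch below. The cache size and the network throughput are placeholder assumptions, not figures from the talk; only the 5-second window comes from the abstract.

```python
GB = 10**9  # bytes, decimal gigabyte


def transfer_seconds(size_bytes, throughput_bytes_per_s):
    """Time to move a payload at a sustained throughput, ignoring overhead."""
    return size_bytes / throughput_bytes_per_s


cache_size = 50 * GB            # assumed size of the Memcached dataset
cross_region = 10 * GB / 8      # assumed 10 Gbit/s link, i.e. 1.25 GB/s
window_seconds = 5              # critical window named in the abstract

estimate = transfer_seconds(cache_size, cross_region)
fits = estimate <= window_seconds
max_bytes_in_window = window_seconds * cross_region

print(f"estimated transfer: {estimate:.0f} s, fits in window: {fits}")
print(f"largest payload that fits: {max_bytes_in_window / GB:.2f} GB")
```

Under these assumptions the copy takes about eight times longer than the window allows, so the estimate rules the design out before anything is built; changing either assumed number changes the answer.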
Simon Eskildsen, Shopify
As Director of Production Engineering at Shopify, Simon works with teams to increase the performance, scalability, and resiliency of Shopify. Other than that, as a new resident of Canada, he is fulfilling his obligation to call everyone out when they think they've experienced "cold weather."
The Map Is Not the Territory: How SLOs Lead Us Astray, and What We Can Do about It
Narayan Desai, Google
SLOs are a wonderfully intuitive concept: a quantitative contract that describes expected service behavior. These are often used in order to build feedback loops that prioritize reliability, communicate expected behavior when taking on a new dependency, and synchronize priorities across teams with specialized responsibilities when problems occur, among other use cases. However, SLOs are built on an implicit model of service behavior, with a raft of simplifying assumptions that don't universally hold.
These simplifying assumptions make SLO rules of thumb fall apart with complex modern services, which can result in bad decision making. In this talk, I will catalog a range of these issues with SLOs and demonstrate how they cause systematic failures of SLO-based processes. Armed with the knowledge of these failure modes, I'll present a set of best practices for understanding when SLOs produce incorrect and unexpected results and a set of techniques for constructing robust SLOs.
Narayan Desai, Google
Narayan Desai is an SRE at Google, where he focuses on the reliability of Google Cloud Platform Data Analytics products. He has a checkered past, having worked on scheduling, configuration management, supercomputers, and metagenomics—always in the context of production systems.
The Liffey A
Deploying SRE Training Best Practices to Production: How We SRE'ed Our SRE Education Program
Jennifer Petoff, Google Ireland, and JC van Winkel, Google Switzerland
Structured education is important for ramping up new SREs to build confidence and fight imposter syndrome. In this talk, we take a look behind the scenes of the SRE EDU Orientation curriculum at Google from a technical standpoint and organizational point of view while highlighting best practices that can be applied at organizations of all sizes. We’ll show how we applied SRE best practices to the program itself to minimize toil for the organizers (keyword: automation!) and keep the training software reliable and up to date.
By implementing judicious monitoring, we learned that hands-on exercises are a more successful way to ramp people up than one-way lectures. We built a rigged production system where an instructor can trigger outages that the students need to triage, mitigate and resolve. As the system is internal only, students cannot cause externally visible harm, creating a safe learning environment that allows for experimentation.
Jennifer Petoff, Google Ireland
Jennifer Petoff is a Senior Program Manager for Google's Site Reliability Engineering team based in Dublin, Ireland. She is the global lead for Google’s SRE EDU program and is one of the co-editors of the best-selling book, Site Reliability Engineering: How Google Runs Production Systems.
JC van Winkel, Google Switzerland
JC has been teaching UNIX and programming languages since 1992, working for AT Computing, a small courseware spin-off of the University of Nijmegen, the Netherlands. JC joined Google's Site Reliability Engineering team in 2010 and is both a founding member and lead educator of the SRE education team, SRE EDU.
SRE by Influence, Not Authority: How the New York Times Prepares for Large-Scale Events
Vinessa Wan and Brett Haranin, The New York Times
How do you SRE in a large decentralized organization where development teams manage their own deployments and infrastructure? In this session, we’ll cover how we formed our team at The New York Times and our rationale behind it, talk through our challenges, and explain how we leveraged an incident to kick off Elections readiness, our largest SRE effort to date.
Attendees will understand how we organized this effort and integrated our team to partner with application teams. We will detail how we increased reliability through a combination of architecture reviews, monitoring improvements, and stress testing.
Vinessa Wan, The New York Times
Vinessa Wan has been working in project management for the past 10 years. In her past 5 years at The New York Times, she has worked in R&D, product discovery, and now oversees the SRE and internal tooling & automation portfolio.
Brett Haranin, The New York Times
Brett Haranin has been working as a software engineer and tech lead at various companies, large and small, for the last 17 years. Currently, he works as an SRE at The New York Times and is focused on helping teams mature the security and reliability of their systems. In his spare time, Brett is an avid beekeeper and helps develop IoT systems for beehive monitoring.
Liffey Hall 2
Effective Distributed Tracing Workshop
Pedro Alves, Serbay Arslanhan, and Luis Mineiro, Zalando SE
If you're working in a large organization that uses a microservice architecture, you can find it hard to keep tabs on what is going on under the hood. Performing root cause analysis of incidents can be as complicated as your organization makes it. Traditional metrics and logging, although essential, are limited in some regards. To help with these issues, we can turn to Distributed Tracing and the view it gives us into our services.
In this workshop, we will give an introduction to Distributed Tracing and to OpenTracing, the open specification for vendor-neutral APIs for Distributed Tracing. After the introduction, there will be a hands-on opportunity to see how a distributed system is instrumented. Finally, we will break those applications and use Distributed Tracing to help us figure out what is going on.
For the hands-on part, make sure to have either Java, Golang, or Docker to run the test application.
Pedro Alves, Zalando SE
Pedro has been focusing on developing back-end code for web apps since 2008. At Zalando since 2013, he has worked in different areas of Zalando’s business, and is now working in the SRE team, making sure people can buy shoes reliably.
Serbay Arslanhan, Zalando SE
Serbay is learning to be a Site Reliability Engineer at Zalando in Berlin. Before joining the SRE team at Zalando, he worked on building systems related to customer-facing Checkout solutions at Zalando and on personal health, social networks, and news aggregators at different companies and locations.
Luis Mineiro, Zalando SE
Luis's broad background in software engineering includes experience in DevOps, networks, mobile development, and more. Luis has been with Zalando since 2013—shaving yaks and creating the most beautiful bike sheds in the Shop team, later joining Platform Infrastructure to support the company’s move to the Cloud and currently heading Site Reliability Engineering.
10:30–11:00
Break with Refreshments
The Forum
11:00–12:30
The Liffey B
Load Balancing Building Blocks
Kyle Lexmond, Facebook
Load balancing is often presented as a simple solution for difficult application problems, like providing redundancy and smooth blue/green application upgrades. But not all load balancers are created equal. Is an L7 load balancer better than an L4 one? What makes DNS a load balancing technique? Does using a CDN help?
This talk answers these questions and more. It covers 3 common variants of load balancing (L4, L7 and DNS) in a product agnostic manner, important properties of each variant, and why you would consider using them. It concludes with an overview of how Facebook uses all 3 variants to manage and control traffic flows globally.
Kyle Lexmond, Facebook
Kyle is a Production Engineer on the Traffic Applications team at Facebook Seattle, working to make sure requests from people get a 200 OK, not an error or vanishing into the ether(net). Previously at Twitter and AWS, he focuses on simplifying systems and making them more resilient to failure, ideally fixing more things than he breaks.
What Happens When You Type en.wikipedia.org?
Effie Mouzeli and Alexandros Kosiaris, Wikimedia Foundation
What happens when you type en.wikipedia.org? It is one of the most popular interview questions, and one we have been asked quite a few times. But what about what happens on the server side? What happens on our end?
At Wikimedia, we run the world’s favourite encyclopædia and one of the top 5 websites on the Internet! In our talk, we will describe the architecture of Wikipedia, how routers, load balancers, caching, a bit more caching, message queues, databases, microservices, and containers are pieced together to serve you, and how open source plays a leading role in it.
Furthermore, we will briefly talk about our transition from a monolith, to service-oriented architecture and microservices, to migrating them to Kubernetes.
Wikipedia is a very good example of a complex system; joining this talk will help you demystify one in an understandable way.
Effie Mouzeli, Wikimedia Foundation
Effie studied physics and scientific computing but decided to follow neither. Instead she became a sysadmin, later systems engineer, now SRE. She has worked in a number of startups and small organisations, where her responsibilities were usually automation, infrastructure architecture, and working closely with developers. Currently, she is one of the newer members of the SRE team at the Wikimedia Foundation. Away from work, she loves camping, concerts, and dressmaking.
Alexandros Kosiaris, Wikimedia Foundation
A Linux sysadmin, turned FreeBSD sysadmin, turned Linux sysadmin, turned systems engineer (somewhere along that path there’s a DevOps hat as well), Alexandros has been in the space since 1999, starting as a hobbyist, then a professional. Currently working with the Wikimedia Foundation, he has pushed forward for more virtualization and better orchestrated microservices and environments for their execution. The Kubernetes project is a current passion.
The Liffey A
Are We All on the Same Page? Let's Fix That
Luis Mineiro, Zalando
The industry regards it as good practice to have as few alerts as possible, alerting on symptoms that are associated with end-user pain rather than trying to catch every possible way that pain could be caused.
Organizations with complex distributed systems that span dozens of teams can have a hard time following such practice without burning out the teams owning the client-facing services. A typical solution is to have alerts on all the layers of their distributed systems. This approach almost always leads to an excessive number of alerts and results in alert fatigue.
Adaptive Paging is an alert handler that leverages the causality from tracing and OpenTracing's semantic conventions to page the team closest to the problem. From a single alerting rule, a set of heuristics can be applied to identify the most probable cause, paging the respective team instead of the alert owner.
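As a rough illustration of the idea in this abstract (not Zalando's actual implementation), the sketch below follows failing child spans from the span that triggered the alert down to the deepest failing span, then pages that span's owner. It relies on the OpenTracing convention of tagging failed spans with `error: true`. The `Span` class, the `OWNERS` mapping, the service names, and the "follow the first failing child" heuristic are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    # Hypothetical, simplified span record; real tracers carry much more.
    span_id: str
    parent_id: Optional[str]
    service: str
    tags: dict = field(default_factory=dict)


def probable_cause(spans, alerting_span_id):
    """Walk from the alerting span down through failing children.

    Returns the deepest span reachable via spans tagged error=True.
    """
    by_id = {s.span_id: s for s in spans}
    children = {}
    for s in spans:
        children.setdefault(s.parent_id, []).append(s)

    current = by_id[alerting_span_id]
    while True:
        failing = [c for c in children.get(current.span_id, [])
                   if c.tags.get("error") is True]
        if not failing:
            return current
        current = failing[0]  # assumed heuristic: follow the first failing child


# Hypothetical service-to-team ownership table.
OWNERS = {"checkout": "team-checkout", "cart": "team-cart", "payment": "team-payment"}


def team_to_page(spans, alerting_span_id):
    return OWNERS[probable_cause(spans, alerting_span_id).service]


spans = [
    Span("a", None, "checkout", {"error": True}),
    Span("b", "a", "cart"),
    Span("c", "a", "payment", {"error": True}),
]
print(team_to_page(spans, "a"))  # the payment team, not the alert owner
```

If no child span is failing, the function returns the alerting span itself, so the alert owner is paged as before.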
Luis Mineiro, Zalando SE
Luis's broad background in software engineering includes experience in DevOps, networks, mobile development, and more. Luis has been with Zalando since 2013—shaving yaks and creating the most beautiful bike sheds in the Shop team, later joining Platform Infrastructure to support the company’s move to the Cloud and currently heading Site Reliability Engineering.
Weathering the Storm: How Early Warnings Save the Farm
Brian Sherwin, LinkedIn Corporation
LinkedIn’s production stack consists of thousands of different applications with complex dependencies. In this environment, when a production issue is caused by one or more misbehaving microservices, finding the right culprit can be both challenging and time consuming.
At LinkedIn, we have built a framework to automate the incident correlation process by ingesting data pertaining to incidents and associated dependencies to identify the unhealthy microservice(s). This gives us the ability to directly escalate an incident to the corresponding team, thus cutting down MTTD/MTTR while improving quality of life for the on-call engineers.
In this talk, we will give a high-level overview of the correlation engine, how we are doing correlations, how we reduce false positives and increase the accuracy of the correlated results, and finally lessons learned.
Brian Sherwin, LinkedIn
Brian Cory Sherwin has been a Sr. SRE at LinkedIn since 2012. Brian has had many responsibilities at LinkedIn, ranging from autoremediation, business metric collection and analysis, host-level monitoring, disaster recovery, and data center decommissions to incident command. The common thread between all these is the need to find a solution to a problem that needed a solution yesterday.
Outside of solving whatever problems get thrown at him, Brian enjoys spending time with his wife and son, coffee, learning Spanish, and classic science fiction.
Liffey Hall 2 (Continued from previous session)
Effective Distributed Tracing Workshop
Pedro Alves, Serbay Arslanhan, and Luis Mineiro, Zalando SE
12:30–14:00
Luncheon
The Forum
Sponsored by Blameless
14:00–15:30
The Liffey B
Refining Systems Data without Losing Fidelity
Liz Fong-Jones, honeycomb.io
It is not feasible to run an observability infrastructure that is the same size as your production infrastructure. Past a certain scale, the cost to collect, process, and save every log entry, every event, and every trace that your systems generate dramatically outweighs the benefits. If your SLO is 99.95%, then you'll naively be collecting about 2,000 times as much data about requests that satisfied your SLI as about those that burnt error budget. How do you scale back the flood of data without losing the crucial information your engineering team needs to troubleshoot and understand your system's production behaviors?
Statistics can come to our rescue, enabling us to gather accurate, specific, and error-bounded data on our services' top-level performance and inner workings. We can keep the context of the anomalous data flows and cases in our supported services while not allowing the volume of ordinary data to drown it out.
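The arithmetic behind the 2,000-to-1 figure, and one common way to cut the volume, can be sketched as follows. This is a minimal illustration and not necessarily the approach presented in the talk: it keeps every failing event, keeps roughly 1 in N successful events, and stores the sample rate on each kept event so that totals can still be estimated. The event layout, the `sample_rate` field name, and the rate of 100 are assumptions made for this example.

```python
import random


def good_to_bad_ratio(slo):
    """Good requests per bad request when a service sits exactly at its SLO."""
    return slo / (1.0 - slo)


def sample(events, success_rate, rng):
    """Keep every error and about 1 in `success_rate` successes."""
    kept = []
    for event in events:
        if event["error"]:
            kept.append({**event, "sample_rate": 1})
        elif rng.random() < 1.0 / success_rate:
            # Each kept success stands in for `success_rate` originals.
            kept.append({**event, "sample_rate": success_rate})
    return kept


def estimate_counts(kept):
    """Estimate original totals by summing the weights of kept events."""
    total = sum(e["sample_rate"] for e in kept)
    errors = sum(e["sample_rate"] for e in kept if e["error"])
    return total, errors


# At 99.95%, about 1,999 requests meet the SLI for each one that does not.
print(good_to_bad_ratio(0.9995))

rng = random.Random(42)  # seeded so the run is repeatable
events = [{"error": i % 2000 == 0} for i in range(200_000)]  # 100 errors
kept = sample(events, success_rate=100, rng=rng)
total, errors = estimate_counts(kept)
print(len(kept), total, errors)
```

The error count is exact because no failing event is dropped, while the estimated total carries a sampling error that shrinks as more events are kept.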
Liz Fong-Jones, honeycomb.io
Liz is a developer advocate, labor and ethics organizer, and Site Reliability Engineer (SRE) with 15+ years of experience. She is an advocate at Honeycomb.io for the SRE and Observability communities, and previously was an SRE working on products ranging from the Google Cloud Load Balancer to Google Flights.
Tracing Real-Time Distributed Systems
Evgeny Yakimov, Bloomberg LP
The concept of distributed tracing has often been explored in the context of web-based microservices in predominantly request/response style systems. But, what if you're dealing with a real-time data streaming system? How do you even start to model strongly asynchronous message flows, consisting of multi-service pipelines originating from many sources and distributed to even more consumers? These are the general characteristics of trading systems, which make tracing incredibly challenging.
This talk will explore our approach to applying these concepts to latency-sensitive real-time data streaming in large scale distributed systems. We will discuss the challenges of tracking long-running sessions, handling fan-in/fan-out data flows, and reducing storage costs while still capturing granular in-process tracing data. We will demonstrate how we utilise tracing to diagnose issues and measure service level indicators, as well as share our thoughts on how to further improve observability by applying these concepts on the client-side.
Evgeny Yakimov, Bloomberg LP
Evgeny is a software engineer turned SRE working at Bloomberg London with a focus on real-time distributed systems. He is a keen technology enthusiast, exploring how to apply SRE concepts such as tracing to the area of trading systems. He advocates for an SRE culture shift at Bloomberg throughout engineering and product management, utilising methods like SLIs and SLOs to put reliability at the heart of the organisation.
The Liffey A
How to Do SRE When You Have No SRE
Joan O'Callaghan, Udemy
This talk is for engineering organisations that don't have anyone dedicated to SRE-type work, but know enough to know that they really REALLY need it.
Where on earth can you even start? You want to make things better but you already have at least 2 other jobs to do as well. It doesn't seem possible.
Unfortunately, you're also the person that will be stuck fixing things when it all breaks so it's doubly in your best interest to try to prevent disasters if you can.
This talk gives extremely practical and realistic advice on how to get started, even if you only have 1 hour a week to dedicate to it. It'll help you find your weakest points and make things a little better, even without dedicated resources. Step by step, you'll be able to make your org much more reliable and reduce your stress levels. Also, fewer things will be on fire.
Joan O'Callaghan, Udemy
Joan O'Callaghan is a Site Operations Lead at Udemy. She has worked in SRE and Incident Management (in one form or another), for 13 years. She likes to host and write blameless post-mortems and take long walks on the beach where she has imaginary arguments with people that don't like reliability as much as she does. She is always very happy when she meets people more paranoid than her.
One on One SRE
Amy Tobey
As someone who has carried the title of Site Reliability Engineer at many companies, I have struggled with how to influence an organization to make the changes necessary to ensure high availability, sustainably. Fixing things directly doesn't scale with the organization, so a broader approach is needed. Leadership usually desires the outcome of better availability and sometimes even says so publicly, but then what? In this talk, I will discuss the one-on-one approach I created for SRE outreach, both proactively and in incident debriefs. I will demonstrate how this approach enables vulnerability and better information gathering through application of psychological safety principles.
Liffey Hall 2
Statistics for Engineers
Heinrich Hartmann, Circonus
Gathering all kinds of telemetry data is key to operating reliable distributed systems at scale. Once you have set up your monitoring systems and recorded all relevant data, the challenge becomes making sense of it and extracting valuable information, like:
- Are we fulfilling our SLO/SLA?
- How did our query response times change with the last update?
- When will I run out of disk space if we continue to grow like this?
Statistics is the art of extracting information from data. In this tutorial, we address the basic statistical knowledge that helps you in your daily work as a system operator. From the mathematical side, we will cover probabilistic models, summarising distributions with mean values, quantiles, and histograms and their relations. From the technological side, we will discuss metrics vs. event data, the effects of sub-sampling, how not to aggregate percentiles, and t-digest and histogram summaries.
The tutorial will be tool agnostic, but tailored towards applications. In the computational examples we will be using Python and data from our production systems. At the end of the workshop attendees should have a clear picture of the mathematical features they need from their monitoring tools, for their application at hand.
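To illustrate the "how not to aggregate percentiles" point, here is a small sketch with made-up numbers (not the tutorial's own material or data). It compares the average of per-host 90th percentiles with the 90th percentile of the combined data, and shows that merging histograms by adding counts gives the combined answer. The host names, latencies, and nearest-rank percentile definition are assumptions; real histogram summaries use value buckets rather than exact values.

```python
import math
from collections import Counter


def percentile(values, p):
    """Nearest-rank percentile: the value at rank ceil(p/100 * n)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def histogram_percentile(hist, p):
    """Same nearest-rank percentile, computed from value -> count."""
    total = sum(hist.values())
    rank = max(1, math.ceil(p * total / 100))
    seen = 0
    for value in sorted(hist):
        seen += hist[value]
        if seen >= rank:
            return value


# Hypothetical latencies in ms: a busy fast host and a quiet slow host.
host_a = [10] * 990
host_b = [1000] * 10

# Wrong: averaging percentiles ignores how many requests each host served.
averaged = (percentile(host_a, 90) + percentile(host_b, 90)) / 2

# Right: take the percentile of the combined data ...
true_p90 = percentile(host_a + host_b, 90)

# ... or merge histograms by adding counts, then read off the percentile.
merged_hist = Counter(host_a) + Counter(host_b)
merged_p90 = histogram_percentile(merged_hist, 90)

print(averaged, true_p90, merged_p90)
```

Averaging reports 505 ms even though 99% of all requests finished in 10 ms. Histograms can be merged by adding their counts, which is why the per-host summaries remain usable after aggregation.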
Heinrich Hartmann, Circonus
Heinrich Hartmann is the Analytics Lead at Circonus. He is driving the development of analytics methods that transform monitoring data into actionable information as part of the Circonus monitoring platform. In his prior life, Heinrich pursued an academic career as a mathematician. Later he transitioned into computer science and worked as a consultant for a number of different companies and research institutions.
Liffey Hall 1
Managing Microservices with Istio Service Mesh
Rafik Harabi, Innovsquare
Managing disparate microservices at scale is a real challenge for Ops and SRE teams. The workshop will explain and demonstrate the service mesh patterns implemented by Istio, which uses the same declarative approach as Kubernetes to implement microservices concerns without affecting your services.
Requirements
- Participants should be able to install gcloud CLI on their laptops.
- Participants should have a Google Cloud Platform account. If you don’t yet have one, you could create one using Google Cloud Free Tier.
Workshop: https://www.istioworkshop.io
Rafik Harabi, Innovsquare
Rafik Harabi is a Solution Architect devoted to helping customers in their digital transformation journey. He currently spends most of his time architecting and deploying Cloud Native Services Platforms using Kubernetes and, recently, Istio Service Mesh. Before working on cloud migration projects, Rafik had long experience in designing, building, and deploying highly available web platforms serving thousands of users.
15:30–16:00
Break with Refreshments
The Forum
16:00–17:30
The Liffey B
A Customer Service Approach to SRE
John Looney, Facebook
SREs are highly technical people, and have a bias toward technical solutions to technical problems. They enjoy well-crafted APIs that they can build solid SLAs around, and that allow teams to work out where a problem lies.
This strength can hide an antipattern; being able to tell ourselves that our system is OK, it's everyone else who has the problem. This talk will take some case-studies from Facebook's "Server Lifecycle" team, to show how engineers can pretend that the systems they have built are perfect, and that it's actually the rest of the world that is to blame for misusing them.
I will talk about how the team used a customer service ethos to redesign their metrics, their service and the support methods, to build something that really served its customers.
John Looney, Facebook
John Looney has been an SRE since 2005, working with large distributed systems for Google and Facebook. He enjoys teaching SRE concepts with concrete examples. His day job is supporting teams that manage and deploy operating systems and firmware for Facebook.
SRE & Product Management: How to Level up Your Team (and Career!) by Thinking like a Product Manager
Jen Wohlner, Livepeer
SRE and product management—do those even go together? Yes! In this talk, we'll go over small ways and big strategies to form sustainable, impactful relationships with your users and build products that they love whether or not your SRE team has an official product manager. SRE teams' users are other engineers, data scientists, designers, and anyone else who pushes code at your company. It's not enough to build perfectly engineered platforms and tooling. SRE teams must build scalable, opinionated, USABLE products and workflows. This talk will give you the framework to get there and show you what traits translate to good product managers.
Jen Wohlner, Livepeer
Jen Wohlner leads product management at Livepeer, a decentralized video transcoding and live-streaming platform built on the Ethereum blockchain. Before Livepeer, Jen was the product manager for platform engineering at Fastly, an edge cloud platform that provides a content delivery network, Internet security products, load balancing, and video and streaming services for major companies across the globe. Previously, Jen worked as a senior technical program manager for site reliability engineering at LinkedIn where she had a special focus on resiliency projects across LinkedIn's stack and at BuzzFeed where she led the software infrastructure and tools infrastructure groups. In her spare time, Jen runs marathons, cooks feasts for her wife, draws, makes ceramics, and serves as vice chair on the board of directors for the Point Foundation, the nation's largest LGBT scholarship fund for higher education.
The Liffey A
Prioritizing Trust While Creating Applications
Jennifer Davis, Microsoft
Managing risk needs to scale as your product grows in popularity and complexity. In traditional software development, security was often treated as a last gating factor at best and a post-incident concern at worst. How do we shift our security processes left—in other words, earlier in the development lifecycle? The cost of applying security practices too late can be catastrophic to a company, leading to the loss of customer trust and affecting the bottom line.
Join me in this session to learn how to leverage security tools and recommended practices to enable everyone to play a part in securing your application from discovery to the operation of your application.
Jennifer Davis, Microsoft
Jennifer Davis is the co-author of Effective DevOps. She is a senior cloud advocate at Microsoft. Previously, she was a principal site reliability engineer at RealSelf, developed cookbooks to simplify building and managing infrastructure at Chef, and built reliable service platforms at Yahoo. She is a core organizer of devopsdays and organizes the Silicon Valley event. She is the founder of CoffeeOps.
Software Patching Needn't Be a Can of Worms
Philip Rowlands
Let's face it—manual patching is no fun, all toil. The solution? Better automation, better patching.
"There's no record of what third-party software/versions we use. I don't know what updates are available, and of those, which are the most important. It's hard to get downtime on production systems. There's no test environment for this. I'm scared the upgrade will break stuff, and when it does, rolling back will be even harder."
If these complaints ring true for your organisation, this talk chases each one away with examples of applying automation for an easier life.
Philip Rowlands
Philip Rowlands has been an SRE since before he really understood what it was. Because he doesn't scale, he relies on software for leverage. He has worked over the years on automated telephony, Google Production SRE, Mainframe Linux, and more recently for various financial firms. He cannot juggle.
Liffey Hall 2
(Continued from previous session)
Statistics for Engineers
Heinrich Hartmann, Circonus
Liffey Hall 1
(Continued from previous session)
Managing Microservices with Istio Service Mesh
Rafik Harabi, Innovsquare
17:30–18:30
The Liffey B
Lightning Talks
- How to Achieve "100%" Availability
  Igor Ebner de Carvalho, Microsoft
- Understanding Vicious Cycles with Causal Loop Diagrams
  Laura Nolan, Slack
- Flamegraphs—A Meeting Point between SRE and Developers
  Amir Langer and Doron Sekler, eBay
- Smart and Effective Way to Reduce Distributed Tracing Overhead
  Susobhit Panigrahi, VMware
- No, Your Hardware is Still Mostly Software: Handling FW in Your Fleet
  Yannick Brosseau, Facebook
- TLS Certificate Issuance Controls
  James Renken, Let’s Encrypt
- Copilot: Stateless Service Mesh Routing for Performance and Resiliency
  Brennen Smith, Ookla (Speedtest.net)
- How Shopify Launched the Welcome Back Returnship
  Jane Maguire
- Managing On-Call Atrophy
  James Wynne, Pivotal
18:30–20:00
Conference Reception
Level 3 Foyer
Friday, 4 October
08:00–09:00
Morning Coffee and Tea
The Forum
09:00–10:30
The Liffey B
Bigtable: A Journey from Binary to Service and the Lessons Learned along the Way
Brendan Gleason and Gaurav Prabhu Gaonkar, Google
In this talk, we'll examine the development of a global multi-tenant "Bigtable Service" based on Bigtable, a highly scalable wide column store originally developed for single-user, single-cluster instances. Because SREs value deduplication of effort, this type of service development work is often undertaken by SREs, but building a service is far more complicated than just wrapping "deploy" in a for loop. We'll discuss the challenges of correctly defining your "product", the revelation that the service layer wrapped around the core is a complex distributed system itself, some common traps that SREs fall into when designing services, and the challenges of migrating users to a central service. Finally, we will describe how the relationship between the core product development team and the SRE team has evolved and highlight best practices and anti-patterns for the developer-SRE relationship that we've learned on our journey.
Brendan Gleason, Google
Brendan is a Site Reliability Engineer in Google's New York City office. He primarily worked on Bigtable during his first six years at Google, first on the Bigtable Service and eventually on Cloud Bigtable. More recently Brendan has worked on Google Cloud Platform's developer and monitoring tools.
Gaurav Prabhu Gaonkar, Google
Gaurav Prabhu Gaonkar leads a team of site reliability engineers responsible for running Bigtable as a service at Google. He is passionate about problem solving and has held multiple roles across various companies. Gaurav has deep experience in designing, building and scaling distributed systems. He is currently based in New York City.
SDKs Are Not Services and What This Means for SREs
Justin Coffey, Criteo
Building an SDK or embedded libraries for client teams to integrate into their software seems like a straightforward approach to onboarding a large number of teams onto your infrastructure. Depending on the domain, it can even seem like an obvious and durable decision. Unfortunately, this path is mired in hidden complexity that can have serious productivity consequences for client teams, significantly increasing support-related toil for the SRE team and imperiling the viability of both the team and the service they provide.
Justin Coffey, Criteo
Justin Coffey is an engineering director in the SRE department of Criteo where he has led efforts in building out much of Criteo's data processing platform. In past lives he has built ecommerce, emailing and real estate platforms. He got his start in the industry way back in 1996 as a sysadmin working for a small ISP in San Diego, CONNECTnet.
The Liffey A
Building Resilience: How to Learn More from Incidents
Nick Stenning, Microsoft
Learning from incidents: it's not as easy as it sounds! Research from numerous safety-critical industries (aviation! healthcare! firefighting!) is changing what we know about how to build resilient systems and organizations in a turbulent world. This talk is going to share some of that research with you in a direct and practically-applicable way.
One major obstacle to building resilience in an engineering organization is the traditional approach to post-incident review, which focuses heavily on incident prevention. Come and learn:
- that there is and always will be more to incident response and review than prevention,
- how to recognize and avoid four common traps during incident investigations, and
- when to apply four concrete recommendations on how to learn more from incidents in your organization.
Nick Stenning, Microsoft
Nick Stenning is a Site Reliability Engineer on Azure, poking and prodding at the internals of "somebody else's computers." He previously worked at the UK's Government Digital Service and at open-source startup Travis CI. He's been talking his colleagues' ears off on the topic of post-incident review for close to a decade.
How Stripe Invests in Technical Infrastructure
Will Larson, Stripe
Deciding what to work on is always difficult and is especially treacherous for folks working as infrastructure engineers and leaders. Will Larson unpacks the process of picking and prioritizing technical infrastructure work, which is essential to long-term company success but discussed infrequently. Will shares Stripe's approaches to prioritizing infrastructure as your company scales, justifying (and maybe even expanding) your company's spend on technical infrastructure, exploring the whole range of possible areas of infrastructure investment, adapting your approach between periods of firefighting and periods of innovation, and balancing investment in supporting existing products and enabling new product development.
Will Larson, Stripe
Will Larson has been an engineering leader and software engineer at a number of technology companies including Digg, Uber, and Stripe. He is also the author of An Elegant Puzzle: Systems of Engineering Management.
Liffey Hall 2
What I Wish I Knew before Going On-Call
Chie Shu, Yelp
Firefighting a broken system is time-sensitive and stressful, but becomes even more challenging as teams and systems evolve. As an on-call engineer, scaling processes among humans is an important problem to solve. How do we ramp up new engineers effectively? How can we bring existing on-call engineers up to speed? In this workshop we’ll share common myths among new on-call engineers and the Do’s and Don’ts of on-call onboarding, as well as run through hands-on activities you can take back to work and apply directly to your own on-call processes.
Chie Shu, Yelp
Chie Shu is a backend Software Engineer at Yelp. She has worked on improving Yelp's revenue-critical Ads data pipeline to be more resilient to system failures, and designed heuristics used internally by executives and Product Managers to assess the financial impact of on-call incidents. Chie holds a bachelor's degree in Computational Biology from Cornell University.
10:30–11:00
Break with Refreshments
The Forum
11:00–12:30
The Liffey B
Why Automating Everything Adds to Your Toil
Colin Thorne and Cameron McAllister, IBM
An often-heard phrase whenever there is toil is "just automate it." You would expect that from an SRE who has a software engineering background. It is what distinguishes Site Reliability Engineering from operations. But the wrong automation and too much automation can replace the existing toil with different toil, or in the worst case grow the existing mound of toil.
Find out when you should or shouldn't add automation, and how to build the right, sustainable automation.
Colin Thorne, IBM
Colin is the worldwide SRE lead for IBM's Kubernetes Service with a career-long dedication to clean code and architecture. He loves learning and applying new practices and technologies, and then sharing them with anyone who happens to be near.
Colin runs new graduate education, cloud native workshops and code retreats in IBM to help promote good Software Engineering practices. He is on a mission to get Software Engineering seen as a proper engineering discipline within the world of engineers.
Cameron McAllister, IBM
The hardest thing for Cameron McAllister when automating something is coming up with a good name for it! With that problem solved, he has been responsible for many automated systems, most of them with Slack integrations, providing a self-service model to enhance development and SRE alike. Ultimately his vision for automation over the last 2 years has proved successful, allowing huge growth in customers with no SRE growth.
Cameron is an SRE for IBM's Kubernetes Service and a lifelong advocate for automation. He hates doing anything more than once and ultimately wants to automate himself out of a job. He loves the SRE discipline, with many deep and difficult problems to solve, and great flexibility on how to go about solving them; this keeps him happy as he can bring his creativity to the fore.
Prior to working on IBM Kubernetes Service, Cameron had over a decade of experience in IBM Storage Systems, leading test and development teams across both clustered file and block storage controllers.
Autopsy of a MySQL Automation Disaster
Jean-François Gagné, MessageBird
You deployed automation, enabled automatic database master failover and tested it many times: great, you can now sleep at night without being paged by a failing server. However, when you wake up in the morning, things might not have gone the way you expect. This talk will be about such a surprise.
Once upon a time, a failure brought down a MySQL master database. Automation kicked in and fixed things. However, a fancy failure, combined with human errors, an edge-case recovery, and a lack of oversight in tooling and scripting led to a split-brain and data corruption. This talk will go into details about the convoluted (but still real-world) sequence of events that led to this disaster. I cover what could have avoided the split-brain and what could have made data reconciliation easier.
Jean-François Gagné, MessageBird
Jean-François is a System/Infrastructure Engineer and MySQL Expert. One year ago, he joined MessageBird, an IT telco startup in Amsterdam, with the mission of scaling the MySQL infrastructure. Before that, J-F worked on growing the Booking.com MySQL and MariaDB installations (he also worked on many other non-MySQL related projects). Some of his latest projects are finding the best way to automate MySQL master failover, making Parallel Replication run faster and promoting Binlog Servers. He also has a good understanding of replication in general and a respectable understanding of InnoDB, MySQL, Linux, and TCP/IP. Before Booking.com, he worked as a System/Network/Storage Administrator in a Linux/VMware environment, as an Architect for a Mobile Network and Service Provider, and as a C and Java Programmer in an IT Service Company. Even before that, when he was learning computer science, Jeff studied cache consistency in distributed systems and network group communication protocols.
The Liffey A
Pushing through Friction
Dan Na, Squarespace
Things are broken. The deployment pipeline is painfully slow. Your engineering team has doubled in the last year and there's a lack of sufficient process and management. You git blame a file that's used everywhere but nobody understands it; the person who wrote it left the company five years ago.
As a senior-level engineering leader, experience tells you things could be better. You see the gaps. If only the company adopted policy A or dumped technology B, everyone would benefit. But there's so much inertia. The company has always used B. You are frustrated. Can you actually make a difference?
Yes. You are encountering organizational friction, and learning to identify, accept and push through friction is a key skill of engineering leaders. In this talk, Dan will talk about why organizational friction occurs and how to mitigate it. The ability to push through friction will distinguish you throughout your career.
Note: This talk contains language that some viewers may find objectionable.
Dan Na, Squarespace
Dan Na is a Staff Engineer and Team Lead on the Internationalization Platform team at Squarespace in NYC. Previously he was an Engineering Manager and Senior Software Engineer at Etsy. He loves learning and teaching in a collaborative environment and solving both the technical and people problems of producing software. He’s a fan of NBA basketball, iced coffee, New York City and exploring the boroughs with his wife and 1.5-year-old son.
Perks and Pitfalls of Building a Remote First Team
Ryan Neal, Netlify
Building teams is hard. Building remote teams is harder but definitely worth it. You get access to a global talent pool. You get engineering coverage that follows the sun. And you get to build a much more diverse and inclusive team. A remote team isn't without its perils, though: it is easy to build silos, burn out engineers, inflame imposter syndrome, and starve community building.
Ryan Neal, Netlify
Ryan Neal is Head of Infrastructure and part of the founding team at Netlify. Previously, he worked on the infrastructure team at Yelp and worked in the defense sector at Palantir in the Middle East. Ryan is based in the Bay Area and loves distributed systems, firespinning, and his golden retriever.
Liffey Hall 2
Unconference: Unsolved Problems in SRE
Kurt Andersen, LinkedIn
There are a variety of unsolved problems in and around site reliability engineering. Some of these have been addressed during the conference, some deserve additional discussion, and some may have become apparent during the hallway tracks here in Dublin. If you would like to discuss any of these, this is your session. You can suggest topics on the USENIX-SREcon Slack #unsolved_problems channel.
Kurt Andersen, LinkedIn
Kurt Andersen has been one of the co-chairs for SREcon Americas and has been active in the anti-abuse community for over 15 years. He is currently the senior IC for the Product SRE (site reliability engineering) team at LinkedIn. He also works as one of the Program Committee Chairs for the Messaging, Malware and Mobile Anti-Abuse Working Group (M3AAWG.org). He has spoken at M3AAWG, Velocity, SREcon, and SANOG on various aspects of reliability, authentication, and security. He also serves on the USENIX Board of Directors and as liaison to the SREcon conferences worldwide.
12:30–14:00
Luncheon
The Forum
14:00–15:30
The Liffey B
Expect the Unexpected: Preparing SRE Teams for Responding to Novel Failures
John Arthorne, Shopify
A core concept in SRE is that we learn from major system failures, using the experience gained to improve the resiliency of our systems. If we are successful at this, we avoid repeating the same customer impact the next time our systems fail in a similar way. This means when the next big failure happens, it will often be a novel problem. This talk will focus on how to prepare for novel large-scale failures. I will start by summarizing common methods of incident training. This includes simulated disaster scenarios, and live system exercises involving controlled but real production system failures. I will outline the benefits of each approach, and our experience in employing them at Shopify as our team has grown. This talk will wrap up with a summary of a large-scale incident exercise we ran involving a hundred people, an office building, and 20,000 pieces of LEGO.
John Arthorne, Shopify
John leads a developer team within the Shopify Production Engineering group, with a focus on building tools to improve the quality of production systems, and on engineering incident response. John is a frequent speaker at technical conferences, including most recently SREcon, DevOps Days, and GitHub Universe. He has served on conference program committees and was voted a JavaOne Rock Star. His current interests are in tools and practices for infrastructure automation, and incident response. Before joining Shopify, John led a team building cloud-based developer tooling for IBM Bluemix and was a prominent leader within the Eclipse open source community.
Hiring Great SREs
Brian Rutkin, Twitter, Inc.
Hiring is hard. Hiring in tech is often harder because we tend to focus on concrete, measurable skills and often ignore or devalue soft skills since they're not as easy to evaluate.
Geared toward both ICs and managers, this talk covers directed ways of thinking about hiring, conducting interviews, and performing evaluations, with concrete examples that can be used in practice to improve your hiring pipeline.
Brian Rutkin, Twitter, Inc.
Brian is an SRE at Twitter where he works on Core Services and all the things they touch (so pretty much everything). Often that means just trying to ensure all the different services and people get along together.
SRE in the Third Age
Björn Rabenstein, Grafana Labs
In the first age, SRE was proprietary to Google. As a term, it was so puzzling that Google recruiters tried for a while to avoid it in job descriptions because nobody would apply for such a mystery job.
In the second age, SRE became a well-known discipline in the tech community, including books and conferences (like this one). Organizations that were distinctly different from Google, not only in terms of scale but also culturally, adopted SRE for their own circumstances and needs.
These days, it appears we are approaching the late stage of the second age. Signs are that recruiters now use the term SRE in job descriptions to attract applicants and that we can pride ourselves on our desirability in the work market.
The time is ripe to think about the third age—it might very well mean the end of SRE as we know it!
Björn Rabenstein, Grafana Labs
Björn is an engineer at Grafana and a Prometheus developer. Previously, he was a Production Engineer at SoundCloud, a Site Reliability Engineer at Google, and a number cruncher for science.
The Liffey A
Evolution of Observability Tools at Pinterest
Naoman Abbas, Pinterest
This talk will cover how observability tools at Pinterest evolved over time to fulfill the changing requirements as we grew from a small startup to web scale. These tools include a metrics system, log search, and distributed tracing.
Naoman Abbas, Pinterest
Naoman Abbas is the engineering manager for Pinterest's Observability team, which is responsible for building and operating observability tools like the company's metrics system, log search, and distributed tracing. Previously, Naoman was a software engineer building cloud platform components at Netflix and Microsoft.
How to SRE When Everything's Already on Fire
Alex Hidalgo and Alex Lee, Squarespace
We've all read the SRE books and heard stories of a magical land of Engineering organizations with functioning SRE; one where following SRE best practices will lead to a better reality for both you and your users. But how do we get there? And, what does that road look like?
This talk presents a case study on how our team, stuck in a deep reliability hole maintaining our company's centralized logging platform, adopted many SRE best practices to resolve a several-months-long incident. It's the story of how we took the highest-trafficked system in our infrastructure from being reliable ~85% of the time to a trusted and documented 99.9%.
Alex Hidalgo, Squarespace
Alex Hidalgo has been a Site Reliability Engineer since 2011. During that time he has developed a deep love for sustainable operations, metrics and monitoring, and using error budgets to drive almost every decision. Alex's previous jobs have included IT support, network security, restaurant work, t-shirt design, and hosting game shows at bars. When not sharing his passion for technology with others, you can find him scuba diving or watching college basketball. He lives in Brooklyn with his partner Jen and a dog named Taco. Alex has a BA in philosophy from Virginia Commonwealth University.
Alex Lee, Squarespace
Alex Lee is an SRE at Squarespace, where he's spent the past 5 years working on systems and processes that enable more reliable engineering. He currently leads the Observability Team, building and maintaining the tools that monitor Squarespace. Based out of New York City, Alex is passionate about work that empowers others to more effectively succeed in their own goals.
15:30–16:00
Break with Refreshments
Level 1 Foyer
16:00–17:35
The Liffey B
Fault Tree Analysis Applied to Apache Kafka
Andrey Falko, Lyft
This talk should provide a framework for answering the following common questions a Kafka operator or user might have: What should your replication factor be for your Kafka topics? How many partitions should you have? How many consumers should you provision? What should your ISR setting be? Should you use RAID or not?
Andrey Falko, Lyft
Andrey Falko is one of the first Reliability Software Engineers hired at Lyft, where he has been for more than a year. He is currently focused on building and scaling reliable PubSub systems for Lyft's Data Platform. Prior to Lyft, Andrey worked at Salesforce for nine years where he researched Kafka and Pulsar performance and reliability. While there, he also built an IaaS system, many CI/CD systems, a Zipkin service, and features for the Salesforce platform.
Applicable and Achievable Formal Verification
Heidy Khlaaf, Adelard LLP
Formal verification is often considered an overly rigorous and potentially unnecessary technique to be deployed on everyday systems. There are numerous misconceptions about the capability and automation of formal verification techniques, and when and how they can be deployed. This talk will thus provide an introductory overview of the verification tools and techniques deployed in industry, specifically the safety-critical industry, at different rigour levels, and how these techniques can be adapted to your existing system infrastructure.
Heidy Khlaaf, Adelard LLP
Heidy Khlaaf is a Research Consultant at Adelard LLP where she evaluates, specifies, and verifies the implementations of safety-critical systems. She received her Ph.D. from University College London where she developed novel research methodologies, in part with Microsoft Research, to fully automate the verification of temporal properties over software systems.
Closing Remarks
Program Co-Chairs: Emil Stolarsky, Incident Labs, and Murali Suriar, Google