SREcon22 Europe/Middle East/Africa Conference Program

Tuesday, 25 October

08:15–08:45

Morning Coffee and Tea

Grote Zaal

08:45–09:00

Opening Remarks

Effectenbeurszaal

Daria Barteneva, Microsoft, and Niall Murphy, Stanza

09:00–10:30

Plenary Session

Room Captains: Daria Barteneva, Microsoft, and Niall Murphy, Stanza

Effectenbeurszaal

Knowledge and Power: A Sociotechnical Systems Discussion on the Future of SRE

Tuesday, 09:00–09:45 CEST

Dr. Laura Maguire, Jeli, and Lorin Hochstein, Netflix

Available Media

This talk shares the findings from a series of exploratory discussions between two prominent Site Reliability Engineers from industry-leading organizations and two socio-technical systems researchers with extensive experience in distributed human-machine teaming.

Drawing from both academic and industry perspectives, this talk elaborates on topics relevant to socio-technical software systems - both practical considerations and philosophical concerns. The practical includes: the tradeoffs inherent in balancing operational load and work in support of feature delivery with the less-immediately-tangible - but no less important - work of learning about our systems; sharing knowledge as a team; and using that knowledge to reduce risk. Our philosophical inquiries relate to the impact of the history of SRE on its future, meditations on the ‘practice’ and values of SRE, and provocative, promising new directions.

Attendees will come away with new perspectives on how knowledge and power structures operate in their organizations, shaping the ways that we conduct and understand our work.

Laura Maguire - Laura leads the research program at Jeli.io. She has a Master’s degree in Human Factors & Systems Safety and a PhD in Cognitive Systems Engineering. Her doctoral work focused on distributed incident response practices in DevOps teams responsible for critical digital services. As a backcountry skier and alpine climber, she also studies cognition & resilient performance in high risk, high consequence mountain environments.

Connect:

@LauraMDMaguire

Lorin Hochstein - Software engineer (distributed systems) in the overlap between software engineering and operations, currently working on the Managed Delivery team. Advocate of resilience engineering and cognitive systems engineering. Fascinated by complex systems, how they work, succeed, change, and fail. Once upon a time I was an academic, but don't hold that against me.

Connect:

@norootcause

SRE as She Is Spoke

Tuesday, 09:45–10:30 CEST

Andrew Clay Shafer

Available Media

Some things get lost in translation. Words have meaning but all language is alive and ever changing. In order to better understand what SRE 'could be' in an imagined future, what tools do we have to understand what SRE 'is'? Now and forever? The co-evolution of SRE language and practice suggests there are already obvious points of divergence in the understanding and application of both. A naive analysis of the evolution of similar movements suggests some divergence may be inevitable, but at what point do we lose the 'essence of SRE'? In every possible SRE 'could' is there ever an SRE 'should'? Are we able to move SRE beyond eternal 'it depends'? If we are, who counts as part of 'we'? Is my SRE your SRE? Would either of us benefit if this was the case?

A familiar face in the DevOps community, Andrew Clay Shafer evangelized DevOps tools and practices before DevOps was a word. Having experience in every role in software delivery across two decades, Andrew focuses on building socio-technical systems and communities of practice. Beyond the buzzwords, you haven’t learned anything until you change your behavior.

Connect:

@littleidea

10:30–11:00

Break with Refreshments

Grote Zaal

11:00–12:30

Track 1

Room Captains: Murali Suriar, Snowflake, and Björn Rabenstein, Grafana Labs

Effectenbeurszaal

Oncall: An Equal Opportunity Waste of Time

Tuesday, 11:00–11:40 CEST

Dave O'Connor, Twilio

Available Media

Live 24/7 support of production services is often completely ingrained into the model in stakeholders' minds of what SRE does. It remains a huge part of the "value" of most SRE groups. This talk explores what might happen if it wasn't. How do SREs demonstrate their value in a post-oncall world? How do we aim toward that place? Also, even if you don't get to throw your pager in the sea tomorrow, how can you apply these principles anyway?

Dave is an SRE practitioner based in Dublin, Ireland. He's currently VP of Engineering at Twilio, leading Twilio's SRE group. Previous to that, he led SRE at Elastic, building the Elastic cloud. Previous to that, Dave spent 16 years as an SRE at Google, failing to prevent and even being complicit in the development of the function of SRE from its inception. His interests include organisation and team development, leadership coaching, and telling you about problems you didn't know you had.

Connect:

@gerrowadat

Financial Regulators Worldwide Are Getting the Legal Right to Regulate the Operational Resilience of Big Cloud Service Providers

Tuesday, 11:45–12:30 CEST

Andrew Ellam, Monzo Bank

Available Media

New legal powers are coming for financial regulators worldwide. New laws are being written and passed now.

Financial regulators are being given powers to regulate the "critical" cloud service providers.

The things on which they will enforce standards are absolutely central to the SRE discipline (e.g. availability/uptime).

Cloud service providers could face large fines or being barred from the market, if they fail to meet required standards.

We (the SRE discipline) will need to successfully collaborate with "risk and compliance" disciplines from the finance field.

We need to represent to the regulators the SRE way of doing things and how effective it can be.

Software engineer turned Technical Program Manager turned Chief of Staff to CTO of UK bank. Ran my own businesses for a while, worked for startups and scaleups and for big tech companies.

Track 2

Room Captains: Stephane Dudzinski, Reddit, and Effie Mouzeli, Wikimedia Foundation

Graanbeurszaal

Statistics for Engineers

Tuesday, 11:00–11:40 CEST

Heinrich Hartmann, Zalando

Available Media

As an SRE you are constantly confronted with a wealth of telemetry data collected from your systems. Interpreting this data to extract operational information is a key part of your job as an SRE. Statistics is here to help! Statistics is the art of extracting information from data. In this talk, we will discuss the statistical methods that are most relevant to your daily work as an SRE. You will get up to speed with the basics and see how they apply to the operational domain. We will discuss statistical pitfalls that are commonly found in telemetry systems. Specifically, we will cover the following subjects:

Summarizing and Visualizing data with Mean values, Percentiles, and Histograms
Implementing Latency-SLOs
Impact of Sampling on Rate, Error, and Duration (RED) metrics
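
As a rough illustration of the first two subjects above (this sketch is not taken from the talk), the following Python snippet summarizes a small set of latency samples with a mean and nearest-rank percentiles, then checks them against a latency SLO threshold. The sample data, the 100 ms threshold, and the function names percentile and latency_slo_compliance are all assumptions made for the example.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Rank is 1-based; clamp to 1 so that p=0 returns the minimum.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_slo_compliance(samples, threshold_ms):
    """Fraction of requests served at or below threshold_ms."""
    if not samples:
        raise ValueError("samples must not be empty")
    good = sum(1 for s in samples if s <= threshold_ms)
    return good / len(samples)


# Hypothetical request latencies in milliseconds, with one slow outlier.
latencies_ms = [12, 15, 11, 14, 13, 16, 12, 15, 14, 900]

mean_ms = statistics.mean(latencies_ms)               # pulled up by the outlier
p50 = percentile(latencies_ms, 50)                    # typical request
p90 = percentile(latencies_ms, 90)
p99 = percentile(latencies_ms, 99)                    # tail latency
compliance = latency_slo_compliance(latencies_ms, 100)

print(f"mean={mean_ms} p50={p50} p90={p90} p99={p99} compliance={compliance}")
```

With this made-up data the single 900 ms outlier lifts the mean above 100 ms even though nine of the ten requests finished in 16 ms or less, which is one reason percentiles and histograms tend to describe latency better than mean values.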

Heinrich Hartmann is leading the Site Reliability department at Zalando and is responsible for all telemetry tooling at the company. Before joining Zalando, he designed and built telemetry analysis systems at the monitoring vendor Circonus. Heinrich holds a Ph.D. in mathematics and has been frequently talking about statistical analysis of telemetry data over the past 8 years.

Connect:

@HeinrichHartman

Measuring Reliability: What Got Us Here Won't Get Us There

Tuesday, 11:45–12:30 CEST

Štěpán Davidovič, Google

Available Media

Do you use data to decide if your system has been recently unreliable to your users? What data, and how do you interpret it?

Measuring reliability provided by our production systems has made great strides, and frequently we focus on metrics much closer to the business. But the appeal of now-common metrics like SLI, or models like SLO, can hide our inability to answer our questions with consistent application of data.

This talk will show some questions about reliability we might want answered, discuss the ways we might be coming up short today, and how we can improve our ability to make data-inspired decisions going forward.

Štěpán is currently Senior Staff SRE at Google, working in the office of the technical advisor to the CFO. Prior to that, he worked among other things on building internal monitoring, reliability insights and canarying systems. He obtained his BSc from Czech Technical University in Prague.

12:30–14:00

Luncheon

Grote Zaal

Sponsored by DataSet

14:00–15:30

Track 1

Room Captains: Avishai Ish-Shalom and Ralph Bateman, IBM

Effectenbeurszaal

Crayon Drawing Is a Vital Engineering Skill

Tuesday, 14:00–14:20 CEST

Murali Suriar, Snowflake

Available Media

Complex systems are difficult to understand, and such understanding, once obtained, is difficult to communicate to others. Both of these increase the difficulty of learning about the system, and increase the time it takes for new team members to feel comfortable proposing and making changes to such systems.

System overview diagrams are one tool which can be very helpful in mitigating these challenges. This talk will walk through system overview diagrams from different problem domains, and discuss the benefits such diagrams can yield.

Murali Suriar is a lapsed computer science graduate, turned network engineer, turned SRE. Working on traffic management at Snowflake after 12 years at Google. Currently learning what "the cloud is just someone else's computer" means.

Connect:

@msuriar

Building Dynamic Configuration into Terraform

Tuesday, 14:25–14:40 CEST

Isabelle Miller and Hosh Sadiq, LaunchDarkly

Available Media

What if platform teams could define safe, controlled entry points for developers and managers to update and manage infrastructure and services without directly touching the configurations or the repositories themselves?

This talk covers the complexities of managing Terraform at scale and how you can consolidate and simplify certain processes by extracting configuration values into an external interface. We'll detail the process we went through to enable platform engineers to dynamically edit their Terraform configurations on the fly without having to push it through a whole release and apply process.

Isabelle Miller is an engineer on the Integrations team at LaunchDarkly. In a previous life, she worked in governance and the public sector. Currently, she enjoys ceramics, novels, and an unhealthy obsession with her cat.

Connect:

@sl0loris

Hosh is a Platform Engineer at LaunchDarkly. He is an open source and Linux enthusiast, and is an avid gamer.

Hunting for Risky Dependencies in the World of Microservices

Tuesday, 14:45–15:05 CEST

Theo Klein, Google LLC

Available Media

How many of your internal-only backends are actually exposed to the outside world? Probably more than you think. With the rise of microservices and complex systems, service owners are less aware of the critical user journeys depending on their systems.

In this talk, you will learn about a simple yet powerful application of OpenTelemetry to find and fix major serving outages before they occur. I will also share several high risk dependencies within Google Maps that we caught by using this tool.

Theo Klein is a Site Reliability Engineer working on Google Maps. Over the past year, he has led a team that systematically removed unneeded dependencies on critical systems which de-risked Google's many serving layers from global outages. Previously, he developed a system that validated datasets to ensure they have no anomalies.

His primary interests are in dependency management and horizontal analyses of large-scale systems.

How We Implemented High Throughput Logging at Spotify

Tuesday, 15:10–15:30 CEST

Lauren Muhlhauser, Spotify

Available Media

As a company scales, observability tooling is not always able to scale with it. This talk will cover how we discovered a majority of our logs were unintentionally being dropped and the steps we took to find a logging solution that worked at our scale in a multi-tenant Kubernetes set up. It will also cover steps taken to control throughput and cost once all the logs were able to flood in.

Lauren Muhlhauser is the Engineering Manager for the Observability team at Spotify based in New York. She has supported metrics, tracing, and logging infrastructure both internally and open source.

Track 2

Room Captains: Björn Rabenstein, Grafana Labs, and Daria Barteneva, Microsoft

Graanbeurszaal

Engineering for Sustainability

Tuesday, 14:00–14:20 CEST

Namrata Namrata, Workday

Available Media

Many businesses and institutions are taking active measures to help achieve carbon neutrality. However, we have a lot of work to do in software and systems engineering. The industry is increasingly working with huge data volumes, which make the sustainability of our technology choices all the more important for the overall picture. As a "horizontal" function that addresses the whole software lifecycle, SRE is well positioned in many organisations to bring the necessary people together to examine and act on making infrastructure more sustainable.

Namrata is a Senior Software Engineer at Workday, working as a member of the Metrics, Insights and Alerts team. As part of her work experience in cost and resource optimisation of large-scale data products, she found an interest in the sustainability of modern software infrastructure. She is also a keen advocate of Women in Tech initiatives and collective action on climate change. She writes about various topics at namc.in, loves swans, and enjoys baking and knitting.

SLOs, SREs, and GHGs

Tuesday, 14:25–14:40 CEST

Bill Johnson, Microsoft

Available Media

With all the different technologies and methodologies available to us, we still share one thing -- the planet. This talk will show how software is impacting the planet through greenhouse gas (GHG) emissions and how you can use SLOs to measure and drive down the emissions that your systems are responsible for. You will see a real-world example of this and come away with many different ways to set and effect your own GHG SLOs!

Bill is a Principal Software Engineering Manager at Microsoft focused on building frontend applications in the Office for the Web organization. Previously, he was a Site Reliability Engineer for the Azure Reliability team working with various Azure Kubernetes Service, Networking, and IoT teams. He spends a lot of time outdoors (usually running or hiking in the mountains) and wants to use his "day job" to help improve the planet.

Connect:

@dubrie

The Biases Confronting SREs

Tuesday, 14:45–15:05 CEST

David Owczarek

Available Media

SRE and the underlying practices have evolved over decades, from the early days of Unix wizards through the more recent SRE and devops movements. SREs play a key role in operating services that drive massive amounts of global business and trade. But the role itself is still awkwardly incorporated into so many organizations. And with this awkwardness comes bias.

To correct for these biases, we must raise awareness by identifying them. Then we can find allies to help in the day to day work of correcting them. That means acknowledging them, having candid and open conversations about them, and then making decisions on how to mitigate them.

Dave Owczarek (he/him) is a reliability expert currently running global SRE for Flexport. He has a side mission to use the arts - speaking, writing, and illustrating - to bring clarity to confusing concepts and inspire innovation in how we build and operate services. Dave has operated lots of services at scale over a 30+ year career, from the first versions of Monster board back in the 90s to some of the largest web hosting providers. When not focused on the SRE experience, you will probably find him playing a guitar somewhere, most likely a Fender.

Connect:

@thatdaveo

Market Data: Applying SRE Techniques to Legacy Designs

Tuesday, 15:10–15:30 CEST

George Brighton, Goldman Sachs

Available Media

Market Data is the lifeblood of financial applications. Real-time ticker plants must distribute billions of updates per day quickly and robustly, with latency tightly correlated with business performance. Vendor products, rather than in-house software, are employed to handle this task. Due to the criticality of the platform and associated risk aversion, ticker plants evolve slowly, and are behind on modern operational techniques. This talk will convey Goldman Sachs’ key lessons learned while implementing a ticker plant in a cloud environment, including how we increased the number of failure domains, and the techniques we found most useful when overlaying observability.

George Brighton is a Vice President at Goldman Sachs, where he leads the Market Data SRE team. A Prometheus and OTel committer, he is responsible for uplifting observability and navigating towards operational excellence. Besides his day job, George is a volunteer instructor for Code First: Girls, a social enterprise aiming to encourage more women into technology.

Connect:

@gbrightn

15:30–16:00

Break with Refreshments

Grote Zaal

16:00–17:45

Plenary Session

Room Captains: Björn Rabenstein, Grafana Labs, and Ralph Bateman, IBM

Effectenbeurszaal

Life after The Chocolate Factory

Tuesday, 16:00–16:30 CEST

Murali Suriar, Snowflake, and Emil Stolarsky, Wave Mobile Money

Available Media

It’s no secret that in the SRE community, for better or worse, we often look up to large companies (almost always Google), on how we should go about building secure, reliable & maintainable systems. Lots has been said about what they do right but in reality, these approaches are rarely applicable to SREs at smaller organizations who just don’t have the resources to replicate the same systems.

Here’s the thing - it’s not just the core technology that ex-Google SREs miss the most: it’s all the little things. How do you make on-boarding not suck? What’s an effective way of securing systems, but not getting in the way of people doing their jobs? And who the hell ever thought JIRA was a good idea?

This talk is structured as a comedy double act: a recovering Google Engineer seeks therapy from someone who’s lived their whole career outside the walled garden that is Google.

Murali Suriar is a lapsed computer science graduate, turned network engineer, turned SRE. Working on traffic management at Snowflake after 12 years at Google. Currently learning what "the cloud is just someone else's computer" means.

Connect:

@msuriar

Emil is an SRE at Wave Mobile Money, helping make Africa the first cashless continent. Previously he worked on caching, performance, & disaster recovery at Shopify, the internal Kubernetes platform at DigitalOcean, and everything in between at Cheddar. In addition to speaking at & organizing a number of conferences, he was a contributor to Seeking SRE and co-authored 97 Things Every SRE Should Know.

Connect:

@EmilStolarsky

Is Our Team as Resilient as Our Systems?

Tuesday, 16:30–17:00 CEST

Effie Mouzeli, Wikimedia Foundation

Available Media

Are our teammates learning from each other? Are our teammates comfortable to ask literally anything? Does our team have heroes? Are they overwhelmed? Are our new members getting the support they need to succeed? Are new members living up to the team’s expectations? Is our team growing but not scaling? Most importantly, is our team ready for future challenges and changes?

Our organisations may set their own sets of skills and characteristics that define us as SREs, but we must not forget: the path that led each one of us here is different, and we need our paths to converge. In this talk, we will discuss what threatens an SRE team's well-being, as well as what we can do to bring some balance, and make it resilient for the future.

Effie studied Physics, and Distributed Scientific Computing, but didn't turn out to be a physicist or a scientific computer scientist. In previous roles, she worked at small organisations as a general purpose systems engineer, either solo or in small teams. Currently, she is a Site Reliability Engineer at the Wikimedia Foundation, focusing on Mediawiki and friends.

Connect:

@manjiki

What SRE Could Be: Systems Reliability Engineering

Tuesday, 17:00–17:45 CEST

Laura Nolan, Stanza

Available Media

As a profession, SRE is still in its infancy, but it isn't too young to be experiencing a profound identity crisis.

Google's Ben Treynor originally said that SRE is what you get when you ask software engineers to do operations (but recently contradicted himself by saying that SRE isn't operations). Google later described SRE as an implementation of DevOps. Lots of people see SRE as holding the pager, or as use of SLOs.

SRE is bigger than any of the definitions above. I define SRE as systems thinking applied to software in production, including the sociotechnical aspects of running software systems. Systems thinking has a set of associated tools and methodologies, but, at its core, it is a philosophy that is aware that systems have underlying structures that cause particular patterns of behaviour. Systems thinking aims to model complex problems and make them tractable.

The best SREs and the best SRE organisations are thoroughly immersed in systems thinking, but it largely remains an implicit organising principle, not something we make a first-class citizen in our profession.

Niall Murphy, at SREcon 2021, asked what SRE 2.0 should be. My answer is that SRE 2.0 should be a profession where we put systems thinking first. This talk will explore how we might do that.

Laura Nolan is a software engineer and SRE. She has contributed to several books on SRE, such as the Site Reliability Engineering book, Seeking SRE, and 97 Things Every SRE Should Know. Laura is a Principal (and principled) Engineer at Stanza Systems, where she is building software to help humans understand and control their production systems. Laura is a member of the USENIX board of directors and a long-time SREcon volunteer. She lives in rural Ireland in a small village full of medieval ruins.

Connect:

@lauralifts

17:45–18:45

Social Hour

Grote Zaal

Sponsored by Lightstep

Wednesday, 26 October

08:30–09:00

Morning Coffee and Tea

Grote Zaal

09:00–10:30

Plenary Session

Room Captains: Daria Barteneva, Microsoft, and Niall Murphy, Stanza

Effectenbeurszaal

Diamonds with Flaws: Examining the Pressures, Realities, and Future of Site Reliability Engineering

Wednesday, 09:00–09:45 CEST

Alex Hidalgo, Nobl9

Available Media

The technology industry moves at an incredible pace. Innovation and change are always at the forefront of everyone's mind. Especially in the Site Reliability Engineering space, people feel pressured more than ever to keep up with all of the newest tools, processes, and philosophies. For many organizations, however, chasing all of the shiny things can end up being a detriment as opposed to a benefit. Let's examine these pressures, what the realities of most SRE organizations are, and how we can all best move into the future -- together, thoughtfully, and meaningfully.

Alex Hidalgo is the Principal Reliability Advocate at Nobl9 and author of "Implementing Service Level Objectives." During his career he has developed a deep love for sustainable operations, proper observability, and using SLO data to drive discussions and make decisions. Alex's previous jobs have included IT support, network security, restaurant work, t-shirt design, and hosting game shows at bars. When not sharing his passion for technology with others, you can find him scuba diving or watching Premier League football. He lives in Brooklyn with his partner Jen and a rescue dog named Taco. Alex has a BA in philosophy from Virginia Commonwealth University.

Connect:

@ahidalgosre

How We Drained Every Backbone Router Simultaneously

Wednesday, 09:45–10:30 CEST

Francois Richard, Meta

Available Media

On October 4, 2021, we experienced a severe outage lasting approximately 6 hours.

Our engineering teams learned that configuration changes from commands issued as part of routine infrastructure maintenance on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication. This disruption to network traffic had a cascading effect on how our data centers communicate, bringing our services to a halt.

During this presentation, we aim to describe the chain of events that led us to this situation and how the underlying cause of this outage also impacted internal tools and systems we use in our day-to-day operations. We will also delve into our reflections after the event, how continuous validation of support structures, DR capabilities tooling, and processes have helped us and how we are thinking about the future.

Francois currently supports the Reliability Infra team at Meta. The team focuses on both proactive and reactive reliability: from reacting and managing incidents to planning and testing for disaster to validating for resilience & fault tolerance, including delivering realistic environments enabling services to truly certify their recovery procedures. Francois has been at Meta for 5+ years, previously working in Core Systems focusing on the reliability and security of the control plane components. He is an incident manager oncall and also a crisis manager.

10:30–11:00

Break with Refreshments

11:00–12:30

Track 1

Room Captains: Murali Suriar, Snowflake, and Effie Mouzeli, Wikimedia Foundation

Effectenbeurszaal

Break Free of the Template: Incident Writeups They Want to Read

Wednesday, 11:00–11:40 CEST

Laura Nolan, Stanza

Available Media

Most of us write incident reviews (IRs) or postmortems occasionally. Unfortunately, many IRs are never read by anyone other than those involved in the incident, and therefore have limited benefit to an organisation.

However, IRs that are well-crafted can create learning that will last in your organisation for years (and maybe even beyond). This talk will give practical advice on how to write the most engaging and valuable IR possible.

Laura Nolan is a software engineer and SRE. She has contributed to several books on SRE, such as the Site Reliability Engineering book, Seeking SRE, and 97 Things Every SRE Should Know. Laura is a Principal (and principled) Engineer at Stanza Systems, where she is building software to help humans understand and control their production systems. Laura is a member of the USENIX board of directors and a long-time SREcon volunteer. She lives in rural Ireland in a small village full of medieval ruins.

Connect:

@lauralifts

Making the Impossible Impossible: Improving Reliability by Preventing Classes of Problems

Wednesday, 11:45–12:30 CEST

Chris Sinjakli, PlanetScale

Available Media

Service Level Objectives (SLOs) are a familiar topic in SRE circles. They provide a framework for measuring and thinking about the reliability of a service in terms of a percentage of successful operations, such as HTTP requests.

That key strength of SLOs - viewing reliability as a percentage game - can also be a weakness. Within that framing, there are certain solutions we're likely to overlook.

This talk explores another lens for reliability - one that's complementary to SLOs: structuring software in a way that rules out entire classes of problem.

We'll explore this idea via three worked examples, and finish with some concrete take-aways, including how to spot problems that fit this shape.
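
For readers unfamiliar with the percentage framing of SLOs mentioned in this abstract, here is a minimal Python sketch of it. It is not taken from the talk; the request counts, the 99.9% target, and the function names availability and error_budget_remaining are assumptions chosen for illustration.

```python
def availability(successful, total):
    """Fraction of operations (e.g. HTTP requests) that succeeded."""
    if total <= 0:
        raise ValueError("total must be positive")
    return successful / total


def error_budget_remaining(successful, total, slo_target):
    """Number of additional failures the SLO still tolerates.

    A negative result means the SLO has already been missed.
    """
    allowed_failures = (1 - slo_target) * total
    actual_failures = total - successful
    return allowed_failures - actual_failures


# Hypothetical window: one million requests, 600 of which failed,
# measured against an assumed 99.9% success target.
total_requests = 1_000_000
successful_requests = 999_400

measured = availability(successful_requests, total_requests)
remaining = error_budget_remaining(successful_requests, total_requests, 0.999)

print(f"availability={measured:.4%} remaining_budget={remaining:.0f} requests")
```

In this framing every failure is interchangeable and only the ratio matters, which is the view the abstract contrasts with ruling out whole classes of problem.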

Chris enjoys working on the strange parts of computing where software and systems meet. He particularly likes the challenges of databases and distributed systems.

All his programs are made from organic, hand-picked, artisanal keypresses.

Connect:

@ChrisSinjo

Track 2

Room Captains: Dan Fainstein, The D. E. Shaw Group, and Daria Barteneva, Microsoft

Graanbeurszaal

Deep Dive: Azure Resource Manager Outage

Wednesday, 11:00–11:40 CEST

Benjamin Pannell and Brendan Burns, Microsoft

Available Media

Microsoft Azure Resource Manager is the globally distributed system through which customers purchase, configure, and maintain their Azure workloads. When services were recently interrupted, a temporary outage impacted the ability of some customers to manage their workloads. This situation was particularly interesting because the system architecture both helped reduce customer impact and simultaneously made it challenging for engineers to understand and mitigate the issue.

Through this talk, we’ll share our post-incident findings to give audience members insight into how this specific system operates and how emergent behavior in a socio-technical system can evade a system’s defenses. We’ll also demonstrate an incident report structure which helps prevent common problems like after-the-fact reasoning and instead helps readers effectively learn from the investigation.

Ben is the technical lead of Microsoft Azure's Control Plane SRE team. Based out of Dublin, Ireland, he has helped guide improvements to the operability, resiliency, and performance of mission-critical control plane services including Azure Resource Manager. Prior to his role at Microsoft, he worked as an SRE on a global gaming platform where he helped automate away the role of a NOC, and as a software engineer building an agricultural GIS platform in South Africa.

Connect:

@sierra563

Brendan Burns is a co-founder of the Kubernetes open source project and a corporate vice president at Microsoft where his teams are responsible for Microsoft Azure APIs, governance and management, as well as the Azure Kubernetes Service and cloud-native open source. He built and has run high-scale, mission-critical distributed systems for more than a decade. Prior to working on distributed systems, he was a professor of computer science at Union College in Schenectady, New York. He received a Ph.D. in computer science with a specialty in robotics from the University of Massachusetts Amherst and an undergraduate degree in computer science and studio art from Williams College.

Connect:

@brendandburns

Commas Save Lives, or at Least LinkedIn

Wednesday, 11:45–12:30 CEST

Todd Palino, LinkedIn

Available Media

What happens when the only good thing you can say about a site outage is that it was detected quickly? In February of 2021, LinkedIn was taken down by the smallest of things - a comma (or the lack thereof). Through a number of contributing factors, including challenges with the incident response process itself, this knocked out the public site and significant amounts of internal tooling, spiraling into a much longer time to mitigation.

Not all is bleak, however. All problems can be fixed when there is a solid foundation to build on. For LinkedIn, this includes bricks laid down by our former head of SRE where he clearly states "we are here to attack the problem, not the person." And it includes the culture and values that not only focus us on getting things done, but on having fun while we do it.

Todd Palino is a Principal Staff Engineer in Site Reliability at LinkedIn, focused on Efficiency Engineering, Resilience, and Incident Response. Prior to that, he was responsible for architecture, day-to-day operations, and tools development for one of the largest Apache Kafka deployments. He is also the co-author of Kafka: The Definitive Guide (O’Reilly Media). Out of the office, you can find him sharing his experience from years in SRE technical leadership with conference audiences. Or out on the trails, training for the next marathon.

Connect:

@bonkoif

12:30–14:00

Luncheon

Grote Zaal

Sponsored by LaunchDarkly

14:00–15:30

Track 1

Room Captains: Avishai Ish-Shalom and Ralph Bateman, IBM

Effectenbeurszaal

Passing the Torch - Building a New Grad Program to Mentor the Next Generation of SRE

Wednesday, 14:00–14:40 CEST

Chris Stankaitis, The Pythian Group

Available Media

We all stand upon the shoulders of giants. SRE/DevOps remains one of the last areas of our industry where you do not learn what we do in school. Each of us is here because someone took us under their wing and mentored us. By building a new grad program, we can pay back the investment that the previous generation made in us, create a culture of learning and growth, mature our existing processes and workflows, and set the next generation of SRE/DevOps people up for success in a demanding industry.

Chris Stankaitis is a passionate leader in the SRE and DevOps communities. Focusing on culture and transformation, Chris is thrilled to be speaking at SREcon again and is excited to have the chance to drive discussions around the evolving role SRE/DevOps plays in tech. Chris leads a large, diverse, high change/pace, fully remote DevOps consulting group for Pythian that helps design, build, and maintain many of the services and apps that we use daily.

Going from 30 to 30 Million SLOs

Wednesday, 14:45–15:30 CEST

Alex Palcuie, Google

Available Media

I will present the evolution of Service Level Objectives (SLOs) for the GCE Compute API over the past 6 years: starting from the initial 30 or so SLOs, going through a mid-term phase of about a thousand, and ending with millions of per-customer SLOs. I will share anecdotes, better techniques for handling low QPS (think continuous over discrete metrics), and ways to aggregate the data for better leadership visibility.

Alex has been working as a Site Reliability Engineer in the team that takes care of the GCE Compute API for over 5 years. He’s also been part of the team that built a control plane framework that’s now powering over 20 products in Google Cloud. His current 20% project is helping with huge outages in the Tech Incident Response Team (Tech IRT), like powering down computers in a data centre when the weather is too hot.

Connect:

@AlexPalcuie

Track 2

Room Captains: Murali Suriar, Snowflake, and Laura Nolan, Stanza

Graanbeurszaal

Disaster Recovery Testing at Booking.com

Wednesday, 14:00–14:40 CEST

Yoann Fouquet, Booking.com, and Paola Martinucci, Mollie.com

Available Media

Large disasters can be due to equipment failures, user errors, natural disasters, malware, and other unexpected events. At Booking.com, we have established a program to test the impact of these disasters and the recovery mechanisms that we have in place. Those tests started years ago with simple region evacuations in normal conditions, and later expanded to latency injection at the network level, packet dropping, cutting inter-datacenter connections, cutting power feeds, and even region-wide shutdowns.

In this talk, we are not giving a master class on disaster recovery testing, but rather sharing 4 years of knowledge acquired during this program: the improvement of the reliability of our platform, the organisational impact, the automation created for or after the tests, … but also a few things that actually went wrong. We will finally discuss how this was applied to mitigate real incidents that have happened since the start of the project.

Yoann Fouquet is a Senior Site Reliability Engineering Manager, with experience in building and operating resilient applications at high-scale. He joined Booking.com in 2018, where he is supporting company core services on performance, reliability, disaster recovery and security topics with a continuous focus on making SRE practices scale through efficient tooling and processes.

Paola has recently joined the Engineering Team as Technical Program Manager at Mollie.com in Amsterdam.

Paola is an enthusiastic and passionate woman in tech, and mainly a happy family woman and mother of 3 wonderful children. A Venezuelan-Italian industrial engineer, she has gathered long combined experience as a manufacturing planner and controller, as a project manager, and in administration, and has a strong customer-oriented and teamwork mindset. In the past two and a half years, she had the opportunity to be part of the transformation of the AZ Failover Project at Booking.com into the established, large-company-scale Disaster Recovery Testing Program that it is today.

Slack's DNSSEC Rollout: Third Time's the Outage

Wednesday, 14:45–15:30 CEST

Rafael Elvira, Slack

Available Media

We all have to manage DNS. DNS changes are inherently high-blast-radius and high-visibility.

We present a case study of what happened when a large SaaS company enabled DNSSEC. We did significant planning and testing beforehand. The rollout went smoothly for most of our domains, but one domain caused problems. We attempted three times to enable DNSSEC on this domain. Twice we rolled back after a partial rollout because of actual (or suspected) customer impact.

On the third occasion, we rolled out DNSSEC fully, then determined that the change had broken a small subset of our customers. While attempting to roll back… we made it worse. This talk will describe what happened.

Main Takeaways

A better appreciation of DNSSEC’s workings, including how various DNS TTLs work between root, TLD name servers and recursive resolvers
Strategies for mitigating risk of DNS changes to critical/high impact zones (and some areas we missed)
An appreciation of some of the long-tail problems with DNS that are difficult to de-risk entirely with current tooling
An entertaining outage story

Rafael is a Staff Software Engineer for the Demand Engineering team at Slack based in Madrid, Spain. The Demand Engineering team enables fast and reliable delivery of Slack to our 12M+ globally distributed daily active users.

Outside work, Rafa enjoys traveling, cooking and spending time in the mountains: climbing, hiking, mountain biking or skiing with friends.

Connect:

@rdelvira

15:30–16:00

Break with Refreshments

16:00–17:10

Track 1

Room Captains: Murali Suriar, Snowflake, and Laura Nolan, Stanza

Effectenbeurszaal

Meatbag Systems: How Our Reliability Culture & Practice Evolved over Time

Wednesday, 16:00–16:40 CEST

Andrew Howden and Salomé Santos, Zalando

Available Media

The reliability of production systems is not only a property of how the system is designed and implemented, but also of how the meatbags (people) who maintain those systems understand them, the value of those systems to the end consumers, and their roles and responsibilities.

Our journey to become more reliable has been a wild ride. Through our journey, we've had to examine what the value of reliability is, anchor the view of "reliable systems" in the value our users realize, and create a common understanding about what is "sufficiently reliable". We've had to develop a language, culture, management, and support network to create this common model of reliability and help teams up the "operational maturity curve", all ultimately in service of customers.

Within this talk, we'll examine this journey by reviewing three major incidents that happened at this organization over some time, and examine how we've evolved our approach.

Andrew is a failed sports science student who wandered into software engineering by virtue of luck and the necessity to find a job in a hurry. Through the grace and patience of his colleagues, he has spent nearly the past decade learning how to be a software engineer, systems engineer, site reliability engineer and student of human factors. Most recently, he is learning how to become an engineering manager and trying to pass on what knowledge he has gained so far to the next generation of software adventurers.

Connect:

@andrewhowdencom

Salomé was a math student who followed a career in software engineering after having had her first contact with programming at university and fallen in love with it. She started as a shy person, and in many ways still is. She grew and developed herself in a world dominated by men, taking on opportunities and improving her self-confidence, and along the way she felt equal to others around her. Nowadays she is an Engineering Manager, applying her passion for people on a daily basis and aiming to have a positive impact on those she works with.

Principled Identification of "Root Causes" Using Techniques from Safety Engineering

Wednesday, 16:45–17:10 CEST

Laura de Vesine, Datadog Inc

Available Media

Industry approaches to "root cause analysis" are often ad-hoc and unprincipled. Causes lie somewhere in the space between "the last change before it broke" and "The Big Bang", but how do we decide which to focus on? "Five whys" - or relying on our best guesses - are unsatisfying: we try to identify and prevent specific occurrences of worst-case conditions, and end up playing "whack-a-mole" with outage triggers. This talk draws from safety engineering to present a framework for choosing the "right" root causes, with a worked example. We sharply distinguish our system from our environment, allowing us to design and build safe system behavior under any reasonable conditions - including worst-case.

Laura de Vesine is a 20+ year software industry veteran. She has spent the last 6 years in SRE working in incident analysis and prevention, chaos engineering, and the intersection of technology and organizational culture. Laura is currently a staff engineer at Datadog, Inc. She also has a PhD in computer science, but mostly her cats nap on her diploma.

Track 2

Room Captains: Dan Fainstein, The D. E. Shaw Group, and Ralph Bateman, IBM

Graanbeurszaal

A Case Study in Chaos Testing: Uncovering Kernel Scaling Issues

Wednesday, 16:00–16:40 CEST

Gary Liku, Bloomberg LP

Available Media

Chaos testing has become a popular approach to conducting reliability experiments on distributed systems, but can it be used to discover issues as low-level as the kernel? In this talk we present a case study using chaos testing to uncover scaling issues from an unexpected source. You will learn about wide-ranging investigative techniques from the cluster to the node level.

Gary Liku started his career in 2017 as a software engineer in the hedge fund space. After building experience with both applications and systems he joined the Bloomberg team as an SRE working on the large scale distributed systems that make up a trading system. He also currently leads the chaos testing initiative for the Trading Systems SRE team.

A Better Way to Manage Command Line Tools: What We Learned from the Desktop Software Industry to Improve Command Line Application Development

Wednesday, 16:45–17:10 CEST

Bo Hou, Criteo

Available Media

Let us show more love to command line tools. Automation is at the heart of SRE. Command line tools to a site reliability engineer are like hammers to a builder. However, command line tools are often not considered first-class citizens like web applications. Web application engineers can focus on the business logic and get features like continuous delivery, progressive rollout, live monitoring, and A/B testing for free. We don't make enough effort to build infrastructure and services to help command line tools access such features. In this talk, we will study and be inspired by how web browsers and SaaS changed the desktop software industry and introduce our solution to bring these features to command line tools for free and offer even more: auto-update, auto-complete, credential management, command sharing, and discovery.

Bo HOU, Ph.D., Engineer. He currently leads a team at Criteo providing development tools and CI services for more than 500 engineers to apply development best practices. Before Criteo, he developed the high-performance ad-server at Dailymotion. He also had several years of engineering experience at IBM China Development Lab, Research Lab, and France Lab.

Connect:

@bhoustudio

17:15–17:50

Plenary Session

Room Captains: Daria Barteneva, Microsoft, and Niall Murphy, Stanza

Effectenbeurszaal

Honey, I Broke the Things: Debugging Gray Failures in Production!

Wednesday, 17:15–17:30 CEST

Radha Kumari, Slack Technologies

Available Media

Migrations are one of the most challenging tasks we do as infrastructure engineers. At Slack, we switched from HAProxy to Envoy Proxy. Overall, this migration was a success, and did not cause any downtime, but even so, we ran into several interesting edge cases that caused minor problems.

Troubleshooting these sorts of 'gray' failures can be a difficult technical challenge. So this talk will discuss some of those facepalm moments: how they were detected, steps taken to investigate them, and how they were solved.

Takeaways from this talk include a specific set of approaches for debugging such problems with Envoy Proxy and other web proxies that we learnt via these events, along with some engineering practices that ease the stress during a large migration.

Radha is a Staff Software Engineer for the Demand Engineering team at Slack (Ireland) where she focuses on ensuring "bytes" move in and out of Slack as expected.

Outside work, she loves travelling around the world and has been to over 25 countries since 2013. She also has a passion for collecting shoes.

Connect:

@KumariRadha3

The Repeat Incident Fallacy: What Jurassic Park Can Teach Us about Incidents

Wednesday, 17:30–17:50 CEST

Emily Ruppe, Jeli.io

Available Media

“...to prevent this incident from ever happening again.” Every SRE has seen this phrase in almost every public incident write-up of the modern tech age. But the terrible truth is, no matter how thorough the action items are, we can’t actually prevent incidents from happening again. Why? Because truly “repeated” incidents within complex systems at scale are as likely as finding dozens of mosquitoes full of viable "Dino DNA" preserved in amber (they don’t actually exist).

Emily Ruppe is a Solutions Engineer at Jeli.io whose greatest accomplishment was once being referred to as “the Bob Ross of incident reviews.” Previously Emily has written hundreds of status posts, incident timelines and analyses at SendGrid, and was a founding member of the Incident Command team at Twilio. She’s written on human centered incident management and facilitating incident reviews. Emily believes the most important thing in both life and incidents is having enough snacks.

Connect:

@themortalemily

17:50–20:20

Conference Reception

Grote Zaal

Sponsored by Jeli.io

Thursday, 27 October

08:30–09:00

Morning Coffee and Tea

09:00–10:30

Plenary Session

Room Captains: Daria Barteneva, Microsoft, and Niall Murphy, Stanza

Effectenbeurszaal

SRE in Enterprise

Thursday, 09:00–09:45 CEST

Steve McGhee and James Brookbank, Google Cloud

Available Media

A talk on Enterprise Roadmap for SRE, a new O'Reilly report authored by Steve and James. We review challenges that enterprises are having with adopting SRE today and how to overcome them. Paper copies of the report will be available.

Steve was an SRE at Google for about 10 years in Android, YouTube and Cloud. He then joined a company to build reliable systems on the Cloud. Now he's back at Google, helping more companies do that.

Connect:

@stevemcghee

James Brookbank is a cloud solutions architect at Google. Solution architects help make cloud easier for Google’s customers by solving complex technical problems and providing expert architectural guidance. Before joining Google, James worked at a number of large enterprises with a focus on IT infrastructure and financial services.

Connect:

@jamesbrookbank

Unified Theory of SRE

Thursday, 09:45–10:30 CEST

Emil Stolarsky, Wave Mobile Money

Available Media

Site Reliability Engineering was born at a large company, but it was enshrined at a massive company. When you have over 70 SREs, from a single organization, contributing to a book that documents their approach to running infrastructure, you’re going to be getting a very particular snapshot of the world: the perspective of a place that’s big enough to support over 70 SREs. This is the crux of Niall Murphy’s call to action from SREcon21: what we’ve been treating as gospel can’t possibly be true for everyone.

Since the start, our dialogue has always been rooted in this level of scale. Even when we venture into discussions of solo SREs or bootstrapping a team, it’s stretching the “traditional” model down. We’ve never truly grappled with what SRE would look like at a startup (the 4-person kind, not the billion-dollar “startup” with hundreds of people). But forcing ourselves to work from first principles, to understand what SRE is like at that scale, could be just as insightful as it was for the physics community to attempt to bridge their classic and quantum models.

In this talk, I’d like to do just that. Let’s start at a hypothetical 4-person startup where the default is running out of money within 18 months. From all our cherished SRE ideals, what truly matters at this stage? Do you run incident reviews, do you make SLOs, what is on-call even like? From there, we’ll start turning the dial and watch our startup grow. At a certain point, we’ll reach a size where we have everything one might expect: established SLOs, proper incident response, etc. Throughout that growth, if we take a look at every significant stage and ask ourselves what’s changed, what matters, what needs to be done - we can glean a better understanding of what SRE really is and continue to outline the elephant.

Emil is an SRE at Wave Mobile Money, helping make Africa the first cashless continent. Previously he worked on caching, performance, & disaster recovery at Shopify, the internal Kubernetes platform at DigitalOcean, and everything in between at Cheddar. In addition to speaking at & organizing a number of conferences, he was a contributor to Seeking SRE and co-authored 97 Things Every SRE Should Know.

Connect:

@EmilStolarsky

10:30–11:00

Break with Refreshments

Grote Zaal

11:00–12:30

Track 1

Room Captains: Murali Suriar, Snowflake, and Laura Nolan, Stanza

Effectenbeurszaal

Dissecting the Humble LSM Tree and SSTable

Thursday, 11:00–11:40 CEST

Suhail Patel

Available Media

As reliability engineers, part of our role is to provide guidance on how systems should be developed and implemented.

In this talk, we will start from the ground up, describing what the LSM data structure is and implementing it using code. We'll then talk about the operational characteristics and how they are used in practice for high performance applications that need to store data.
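As a rough illustration of the data structure named above (not the speaker's implementation), the sketch below keeps writes in an in-memory memtable, flushes it to an immutable sorted table once it reaches a size limit, and serves reads from the newest data first. The class name `TinyLSM`, the `memtable_limit` parameter, and the in-memory lists standing in for on-disk SSTables are all assumptions made for this example.

```python
import bisect


class TinyLSM:
    """Toy LSM tree: a mutable memtable plus immutable sorted tables.

    Real SSTables live on disk; here each one is a pair of parallel
    lists (sorted keys, values) held in memory for illustration only.
    """

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.sstables = []  # oldest first, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # Writes only touch the memtable, which keeps them cheap.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Freeze the memtable into a sorted, immutable table.
        if not self.memtable:
            return
        items = sorted(self.memtable.items())
        self.sstables.append(([k for k, _ in items], [v for _, v in items]))
        self.memtable = {}

    def get(self, key):
        # Newest data wins: memtable first, then tables newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for keys, values in reversed(self.sstables):
            i = bisect.bisect_left(keys, key)  # binary search in sorted keys
            if i < len(keys) and keys[i] == key:
                return values[i]
        return None

    def compact(self):
        # Merge all tables into one, letting newer values overwrite older.
        merged = {}
        for keys, values in self.sstables:
            merged.update(zip(keys, values))
        items = sorted(merged.items())
        self.sstables = (
            [([k for k, _ in items], [v for _, v in items])] if items else []
        )
```

A read may have to consult several tables before finding a key, which is why compaction matters operationally: it trades background merge work for fewer tables per lookup.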

Suhail is a Staff Engineer at Monzo Bank focused on building the Core Platform. His role involves building and maintaining Monzo's infrastructure, which spans over two thousand microservices and leverages key infrastructure components like Kubernetes, Cassandra, Etcd and more. He focuses specifically on investigating deviant behaviour and ensuring services continue to work reliably in the face of a constantly shifting environment in the cloud.

Connect:

@suhailpatel

Caching Entire Systems without Invalidation

Thursday, 11:45–12:30 CEST

Peter Sperl, Bloomberg

Available Media

Caching is fundamental to scaling large systems and keeping them fast. The viability of caching is often constrained by design decisions made early in a system's life cycle. In this talk, we advocate that engineers consider maintainable and thorough caching as a core design goal of a system, and present numerous patterns to enable that.

Caching is simplest when dealing with stateless components, so the solutions presented revolve around controlling for states that are sometimes not accounted for, such as user settings, software versions, dependencies, wall time, and database state.

The "punch line" is that resolving state early in the request flow of your system will allow you to divide your system into two parts, the "state gathering" portion and the "immutable" or stateless portion. The immutable portion would have excellent cache characteristics, such as the ability to transparently cache responses throughout multiple service layers without an active cache invalidation strategy.
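One way to picture the split described above is the minimal sketch below. It is an assumption-laden illustration rather than the speaker's design: `ResolvedState`, `gather_state`, `render_report`, and the `CALLS` counter are hypothetical names, and `functools.lru_cache` stands in for whatever cache layers a real system would use. All mutable inputs are resolved once into an immutable value, and the stateless function is cached on that value, so a change in any input produces a new cache key instead of requiring invalidation.

```python
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)  # frozen makes instances immutable and hashable
class ResolvedState:
    user_locale: str
    software_version: str
    db_snapshot_id: int
    as_of_date: str


# Hypothetical counter so the effect of caching is observable.
CALLS = {"render": 0}


def gather_state(user_settings, version, snapshot, today):
    """'State gathering' portion: resolve every mutable input up front."""
    return ResolvedState(
        user_locale=user_settings.get("locale", "en"),
        software_version=version,
        db_snapshot_id=snapshot,
        as_of_date=today,
    )


@lru_cache(maxsize=1024)
def render_report(report_id, state):
    """'Immutable' portion: output depends only on its arguments."""
    CALLS["render"] += 1
    return (
        f"{report_id}|{state.user_locale}|v{state.software_version}"
        f"|snap{state.db_snapshot_id}|{state.as_of_date}"
    )
```

Calling `render_report` twice with the same resolved state runs the body once. Resolving against a newer database snapshot yields a different key, so stale entries are never served and simply age out of the cache.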

Peter Sperl is an Engineering Manager in the Structured Products group at Bloomberg. Before joining Bloomberg, Peter worked as an engineer in the aerospace, medical, and information technology sectors, including at two startups. Peter has a bachelor's degree in electrical and computer engineering from Carnegie Mellon University.

Track 2

Room Captains: Stephane Dudzinski, Reddit, and Avishai Ish-Shalom

Graanbeurszaal

Over Nine Billion Dollars of SRE Lessons - the James Webb Space Telescope

Thursday, 11:00–11:40 CEST

Robert Barron, IBM

Available Media

Over a decade of hard labor and over nine billion dollars invested... it was imperative that the James Webb Space Telescope launched perfectly and that every possible point of failure passed successfully. This was a deployment that any SRE would have been proud of!

After only a few months of activity, the James Webb Space Telescope has already proven to be a spectacular engineering and scientific success. Webb's predecessor, the Hubble Space Telescope, famously required many astronaut visits to repair and upgrade it. But Webb was designed to fly further away from Earth, out of reach of repair by either astronaut or SRE. Any failure would be final and public.

How did NASA succeed in building the confidence it needed to deploy a completely automated, self-healing, observatory?

In this session you will learn SRE lessons through the lens of NASA's experiences in developing and launching exploratory probes.

Robert is an SRE Architect in the office of the IBM CIO where he enjoys helping others solve problems even more than he enjoys solving them himself. Robert has over 20 years of experience in IT and is still happiest when learning something new. He lives in Israel with his wonderful wife and two children. His hobbies include history, space exploration, and bird photography.

Connect:

@flyingbarron

Emotional Disaster Recovery: Debugging the Self with Effective Monitoring

Thursday, 11:45–12:30 CEST

Tao Hansen, garden.io

Available Media

We may tend to a cosmos of computers in our day jobs but we still suffer from an abundance of feelings its constellations lack. It's easy to lose ourselves in anger or stress or feel overwhelmed. Totally normal, healthy and human. How you react to those big feelings is a choice. From Western thinkers like Marcus Aurelius to Eastern contemplatives like The Buddha, we unearth practical techniques you can adopt in the workplace and on Twitter to keep yourself sane, kind and effective.

Tao Hansen is a Developer Advocate at garden.io. Formerly a short film maker, he stumbled into a monastery after a heartbreak, where he stayed for months practicing silent meditation. He keeps an OpenShift cluster in his electrical closet just for the thrill of it.

Connect:

@worldofgeese

12:30–14:00

Luncheon

Grote Zaal

14:00–15:30

Track 1

Room Captains: Effie Mouzeli, Wikimedia Foundation, and Björn Rabenstein, Grafana Labs

Effectenbeurszaal

An SRE Guide to Linux Kernel Upgrades

Thursday, 14:00–14:40 CEST

Ignat Korchagin, Cloudflare

Available Media

The Linux kernel lies at the heart of many high-profile services and applications. And since the kernel code executes at the highest privilege level, it is very important to keep up with kernel updates to ensure that production systems are patched in a timely manner for numerous security vulnerabilities. Yet, because the kernel code executes at the highest privilege level and a kernel bug usually crashes the whole system, many engineers try to avoid upgrading the kernel too often just for the sake of stability. But not every kernel update is dangerous: there are bugfix/security releases (which should be applied ASAP) and feature releases (which should be tested better). This talk tries to demystify Linux kernel releases and provides guidance on how to update your Linux kernel safely and in a timely manner.

Ignat Korchagin is a systems engineer at Cloudflare working mostly on platform and hardware security. Ignat's interests are cryptography, hacking, and low-level programming. Before Cloudflare, Ignat worked as a senior security engineer for Samsung Electronics' mobile communications division. His solutions may be found in many older Samsung smart phones and tablets. Ignat started his career as a security researcher in the Ukrainian government's communications services.

Connect:

@ignatkn

The Math of Scalability

Thursday, 14:45–15:30 CEST

Avishai Ish-Shalom

Available Media

Math is often viewed as "too academic" and detached from the realities of engineering. Not so! Math is about rigorous modeling and flushing out hidden assumptions and is an invaluable tool for understanding the world. This talk showcases mathematical models that provide insights on how things scale.

"In a world where anything has an API, everything is a software problem."

This insight has guided Avishai Ish-Shalom throughout his diverse career working on improving the complex socio-technical systems that create and operate modern software and promoting the use of mathematics in system design and operations. Over 17 years in various software fields and capacities, Avishai has served as Developer Advocate for ScyllaDB (the boring database), Engineer in Residence at Aleph VC, engineering manager at Wix.com, and co-founder of Fewbytes. Currently Avishai is an independent researcher and consultant.

Connect:

@nukemberg

Track 2

Room Captains: Murali Suriar, Snowflake, and Laura Nolan, Stanza

Graanbeurszaal

Rock Fishing and Incident Analysis: Increasing Insight

Thursday, 14:00–14:40 CEST

Thai Wood

Available Media

Rock fishing is a dangerous sport that has seen recent efforts to improve safety. This starts with an understanding of what creates safety. I'll explore how incident analysis can help us learn more about what creates safety and how to contrast these different ways of knowing, through the lens of these rock fishers and researchers. Whether someone is an experienced incident responder, has never had to be on call, or is looking to start an SRE program, there will be something for everyone in learning how to surface more information from incidents and how to better understand the systems at work.

Thai helps teams build better systems and improve their ability to effectively respond to incidents. A former EMT, he applies his experience managing emergency situations to the software industry. He writes about resilience engineering at ResilienceRoundup.com

Connect:

@ThaiWoodHere

How Can SRE Help Security Governance? Sub-title: How to Unstuck GRC with SRE

Thursday, 14:45–15:30 CEST

Mario Platt

Available Media

Governance, Risk Management and Compliance (GRC) have been largely stuck in the same way of doing things for decades. The rise of SRE and its methods and practices provides a unique opportunity for GRC functions to radically re-think their role and what would be better managed by SRE functions in keeping organisations secure, by leveraging new ways to think about operational risk, being able to answer "how much security?" and integrating analysis of trade-offs and constraints which SRE already figured out in the context of reliability. Security needs that too.

With over 20 years of security experience, and with roles spanning penetration testing, operations, engineering and Governance, Risk Management and Compliance, Mario is known for his strategic thinking and pragmatic approaches, often bridging the communication gap between technical and governance professionals to enable real collaboration. Mario is the Director of GRC for LastPass and owns the blog www.securitydifferently.com, where he talks about different ways to think about security management.

Connect:

@madplatt

15:30–16:00

Break with Refreshments

Grote Zaal

16:00–16:25

Track 1

Room Captains: Murali Suriar, Snowflake, and Björn Rabenstein, Grafana Labs

Effectenbeurszaal

Schema-First Application Telemetry

Thursday, 16:00–16:25 CEST

Yuri Shkuro, Meta

Available Media

The journey to cloud-native observability starts with basic telemetry gathering: metrics, logs, events, traces, etc. Teams soon realize that they are collecting a lot of fragmented telemetry data that is difficult to interpret without additional metadata. Instead, users often need to rely on tribal knowledge about telemetry datasets. The missing metadata can range from simple things like descriptions, types, and units of measure, to machine-readable semantic data identifying joinable dimensions, privacy policies, etc. In this talk we present a schema-first approach to application telemetry, including an improved developer experience that minimizes the initial overhead of authoring telemetry signals, and the schema definition language based on Thrift IDL with annotations and rich types that capture semantic meaning and other metadata suitable for automated processing. We contrast this approach with the existing techniques popular in the industry, including the OpenTelemetry Semantic Conventions, to demonstrate the benefits and the trade-offs.

Yuri is a software engineer who works on distributed tracing, observability, reliability, and performance problems; author of the book "Mastering Distributed Tracing"; creator of Jaeger, an open source distributed tracing platform; co-founder of the OpenTelemetry project; member of the W3C Distributed Tracing Working Group.

Connect:

@yurishkuro

Track 2

Room Captains: Effie Mouzeli, Wikimedia Foundation, and Laura Nolan, Stanza

Graanbeurszaal

Navigating in the Dark

Thursday, 16:00–16:25 CEST

Nati Cohen, AWS

Available Media

In recent years an increasing amount of resources has been invested in system observability, notification management and incident response. While these systems provide us with better visibility into our applications and shorten the time to mitigation, what happens when both suffer an outage? In this talk we will review the different ways we can find ourselves flying blind, what other systems and processes are likely to suffer a correlated failure, and what we can do about it.

Nati is a Solutions Architect with AWS. He delights in helping customers simplify complex systems, teaching them about the inner workings of cloud services and debugging annoying technical oddities. When he is not at his computer he is soldering electronic kits, tinkering with smaller computers and drumming on a Taiko.

Connect:

@nocoot

16:30–17:45

Plenary Session

Room Captains: Daria Barteneva, Microsoft, and Niall Murphy, Stanza

Effectenbeurszaal

SRE Is Weird, Down the Stack

Thursday, 16:30–17:00 CEST

John P. Looney, Reddit, Inc.

Available Media

Most SREs are "product" SREs, concerned with customers external to your organisation. They care about latency, reliability, and ensuring you have enough capacity to keep your customers happy. Most product teams work in human-reaction-speed latency: 30 to 500 milliseconds. They take direction from a VP somewhere that moves a lever between RELIABILITY and VELOCITY while trying to work out how to make both better. Happy product SREs are able to ship changes multiple times an hour.

A good number of SREcon folk are "infrastructure" SREs: the doozers who ensure that their product teams spend most of their time caring about external matters, not internals that don't add customer value. This talk explores why that world is different, and why those SREs need different skills.

John Looney is an infrastructure software development manager for Reddit, based in Dublin. Before that, he supported various large-scale distributed systems vital to Google, Facebook and Intercom. He has taught classes in site reliability, computer hardware, networking, and distributed systems, and sits on the Steering Committee for SREcon.

Connect:

@john_p_looney

SRE and ML: Why It Matters

Thursday, 17:00–17:45 CEST

Todd Underwood, Google

Available Media

Machine Learning is an incredibly hyped set of technologies. It seems that ML is becoming an important part of distributed computing. I'll review whether SREs need to know anything about ML yet (probably you do—sorry!). And since ML reliability is challenging, I'll suggest some changes required for most SREs and even some significant changes to our profession. Finally, I'll review the state of using ML to automate production with an extremely skeptical eye.

Todd Underwood is a Senior Director at Google and the founder of Google's Machine Learning SRE team, which supports many of Google's internal ML services as well as its Cloud AI products. He is also the Site Lead for Google’s Pittsburgh office in Pennsylvania, US. He is interested in how to make computers and people work much, much better together.

Connect:

@tmu

17:45–18:00

Closing Remarks

Effectenbeurszaal

Daria Barteneva, Microsoft, and Niall Murphy, Stanza