SREcon21 Conference Program

Tuesday, October 12 (Day 1 - EMEA/Americas East)

14:00–14:15

Opening Remarks

Program Co-Chairs: Frances Rees, Google, and Vanessa Yiu, Goldman Sachs

14:15–15:00

Opening Plenary Session

"Don't Follow Leaders" or "All Models Are Wrong (and So Am I)"

Tuesday, 14:15–15:00

Niall Murphy, RelyAbility

Available Media

Five years after the publication of the SRE book, it's a good time to reflect on what it did—the good and the bad, the ugly and the beautiful, and relate it to what is going on in production engineering in general, SRE in particular, and the problems in the field we've left unaddressed and/or created for ourselves.

Niall Murphy has worked in Internet infrastructure since the mid-1990s, specializing in large online services. He has worked with all of the major cloud providers from their Dublin, Ireland offices, and most recently at Microsoft, where he was global head of Azure Site Reliability Engineering (SRE). He is the instigator, co-author, and editor of the two Google SRE books, and he is probably one of the few people in the world to hold degrees in Computer Science, Mathematics, and Poetry Studies. He lives in Dublin with his wife and two children.

Connect:

@niallm

15:00–16:00

Track 1

10 Lessons Learned in 10 Years of SRE

Tuesday, 15:00–15:30

Andrea Spadaccini, Microsoft Azure

Available Media

In this talk we'll discuss some key principles and lessons learned that I've developed and refined in more than 10 years of experience as a Site Reliability Engineer across several teams within Google and Microsoft.

These are topics that often come up as I discuss Site Reliability Engineering with Microsoft customers that are at different stages of their own SRE journey, and that they—hopefully!—find insightful. They broadly belong to the areas of "Starting SRE" and "Steady-state SRE."

Please join us if you want to discuss fundamental principles of adopting SRE, want to listen to my mistakes (so you can avoid making them!), and want to compare notes on different ways of doing SRE.

Andrea is a Principal Software Engineer in SRE at Microsoft Azure, where he currently acts as a tech lead for all the Azure SRE teams. He works on cross-team projects (currently, SLOs for all Azure products), while being on-call for Azure Resource Manager—the entry point for most Control Plane operations in Azure.

He joined Microsoft in 2018. Before that, he worked as a Site Reliability Engineer for Google since 2011, in various technical and management roles across teams in CorpEng, Ads, and GCP. He's been lucky enough to contribute to the first and second SRE books, mostly to the chapters about on-call.

He received his Ph.D. in Computer Engineering from the University of Catania in 2012, with a thesis on novel traits for biometric recognition, and he's the maintainer of the free CPU simulator EduMIPS64.

Connect:

@lupino3

Let the Chaos Begin—SRE Chaos Engineering Meets Cybersecurity

Tuesday, 15:30–16:00

Francesco Sbaraglia and Adriana Petrich, Accenture

Available Media

Security Chaos Engineering is built around observability and cyber resiliency practices, aiming to uncover the "unknown unknowns" and build confidence in the system. Engineering teams will progressively work against missing understanding for security concerns within complex infrastructure and distributed systems.

This session will enable you to formulate a valuable, simple hypothesis that can be verified based on security chaos experimentation.

Francesco is a Site Reliability Engineering and DevSecOps Coach and SME. He has over 20 years of experience solving production problems in corporate, startups, and governments. He has deep experience in automation, security, observability, multi-cloud, and chaos engineering. He is currently growing the SRE Capability at Accenture Germany.

Adriana has a background within the broader context of social sciences, politics, and security. She joined Accenture Germany to kickstart her career as an SRE/DevOps Engineer. Because Adriana has a passion for security and risks of all kinds, she tries to bring those topics into everything she does. As a result, she has established expertise in SRE and its practices, such as Chaos Engineering, Monitoring, and Observability, within Public Services.

Track 2 (Core Principles)

What's the Cost of a Millisecond?

Tuesday, 15:00–15:30

Avishai Ish-Shalom, ScyllaDB

Available Media

We all want fast services, but how fast is fast? Would you work hard to shave a millisecond off the mean latency? The 99th percentile? If aiming for 300ms latency, you might answer "probably not." However, due to various phenomena collectively known as "latency amplification," a single millisecond deep in your stack can turn into a large increase in user-visible latency—and this is very common in microservices-based systems. What is the true cost of a millisecond?
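One familiar way a small backend delay gets amplified is request fan-out. The sketch below is an illustration only, not material from the talk: it assumes a request waits on several backend calls that are independent of each other, and the function name and the 1% figure are hypothetical.

```python
def prob_slow_request(fanout: int, backend_slow_prob: float) -> float:
    """Probability that at least one of `fanout` independent backend calls is slow.

    Assumption: calls are independent and the request waits for all of them.
    """
    return 1.0 - (1.0 - backend_slow_prob) ** fanout


# Each backend call lands in its own slowest 1% with probability 0.01.
for fanout in (1, 10, 100):
    p = prob_slow_request(fanout, 0.01)
    print(f"fan-out {fanout:>3}: {p:.1%} of requests wait on a slow backend call")
```

Under these assumptions, a latency that only 1% of backend calls experience affects a majority of user requests once the fan-out reaches 100.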

"In a world where anything has an API, everything is a software problem": this insight has guided Avishai Ish-Shalom throughout his diverse career working on improving the complex socio-technical systems that create and operate modern software and promoting the use of Mathematics in system design and operations. Spending 15 years in various software fields and capacities, Avishai has served as Engineer in Residence in Aleph VC, engineering manager at Wix.com, co-founded Fewbytes, and consulted for many other companies on software operations, reliability, design, and culture. Currently, Avishai is a Developer Advocate for ScyllaDB (the boring database ;-)

Connect:

@nukemberg

Spike Detection in Alert Correlation at LinkedIn

Tuesday, 15:30–16:00

Nishant Singh, LinkedIn

Available Media

LinkedIn's stack consists of thousands of different microservices and the complex dependencies among them. When a production outage happens due to a misbehaving service, finding the exact service responsible is challenging and time-consuming. Although each service has multiple alerts configured in a distributed infrastructure, finding the real root cause during an outage is like finding a needle in a haystack, even with all the right instrumentation, since every service in the critical path of a client request would have multiple active alerts. The lack of a proper mechanism to derive meaningful information from these disjoint alerts often leads to false escalations, increasing issue resolution time. In this talk, we will showcase how we used spike (anomaly) detection in the alert correlation system at LinkedIn, which helps us separate real alerts from false positives and reduces toil on engineers.
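As a rough illustration of what spike detection on an alert stream can look like, here is a minimal rolling z-score sketch. It is not LinkedIn's system: the function name, window size, threshold, and sample counts are all hypothetical.

```python
from statistics import mean, stdev


def find_spikes(counts, window=5, threshold=3.0):
    """Return indices whose count exceeds mean + threshold * stdev of the preceding window."""
    spikes = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu = mean(history)
        # Floor sigma at 1.0 so a nearly flat history does not flag tiny changes.
        sigma = max(stdev(history), 1.0)
        if counts[i] > mu + threshold * sigma:
            spikes.append(i)
    return spikes


# Hypothetical alerts-per-minute series for one service.
alerts_per_minute = [4, 5, 3, 4, 5, 4, 30, 6, 4]
print(find_spikes(alerts_per_minute))  # prints [6]
```

A service whose alert volume spikes first is a more plausible starting point for investigation than one whose alerts are at their usual background level.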

Nishant Singh is a Site Reliability Engineer at LinkedIn, where he works toward improving the reliability of the site with a focus on reducing the MTTD and MTTR of incidents. Prior to joining LinkedIn, he worked at companies like PayTM and Gemalto as a DevOps Engineer, spending his time building custom solutions for clients and managing and maintaining services on the public cloud. Nishant loves building distributed systems and exploring the breadth of technologies to support business needs along with a focus on the usage of modern scalable solutions in SRE/DevOps environments.

16:00–16:30

Break

16:30–18:30

Track 1

What To Do When SRE is Just a New Job Title?

Tuesday, 16:30–16:45

Benjamin Bütikofer, Ricardo.ch

Available Media

When the SRE Book was published in 2016 the job title of SRE was not widely used outside Google. Fast-forward five years and it seems like every company is hiring SREs. Did the System Administrator and Operations jobs disappear or have their job titles simply changed?

At the end of the talk, you will know one way of transforming a disjoint team of engineers into a high-performing SRE team. If you are a manager of a team or you are interested in team building this talk is for you. This is not a technical talk; I will focus solely on how to set up a team for success.

Benjamin Bütikofer is the Head of Platform Services at the Swiss online marketplace Ricardo. He started working in the systems administration space in 1999. He worked in datacenters, managed Unix mainframes, and was a Linux admin. After an excursion into Software Development and via an acquisition he ended up at Microsoft. At Microsoft he built his first team of Site Reliability Engineers.

From 2017 to 2020 he did an overland trip through North and South America with his wife and their dog. This journey has greatly influenced his approach to team building.

Connect:

@gran_viaje

Capacity Management for Fun & Profit

Tuesday, 16:45–17:15

Aly Fulton, Elastic

Available Media

Are you looking to move past the "throw unoptimized infrastructure at a problem and worry about waste later" stage? Join me as I talk about my journey green fielding all things infrastructure capacity for Elastic's growing multi-cloud-based SaaS. This talk is NOT about cost savings or cost optimization directly in the traditional sense, but you will discover that proper capacity management and planning do lead to increasing profit margins!

Aly Fulton is a Senior Site Reliability Engineer at Elastic focusing on all things capacity for the Cloud team. She loves infrastructure, numbers, and when her code works the first time (that one time). When she's not preventing Cloud fires, you can find her geeking out about native trees, exploring the world with her family, or living a digital alter ego in some MMORPG.

Connect:

@sinthetix

A Political Scientist's View on Site Reliability

Tuesday, 17:15–17:45

Dr. Michael Krax, Google

Available Media

Political science can provide novel and fresh insights into software engineering problems:

Empirical research on social change is helpful to understand team dynamics and how to evolve teams.
Analyzing political systems as self-organizing systems provides insight on how to simplify modern production environments.

If you are interested in a different look at your everyday questions, join us. No prior political science training or knowledge expected.

Structured talk, conceptual session

Michael works as an SRM with Google in Dublin. In a former life, he researched the different methods that European governments use to coordinate European policymaking. He has been working in different engineering roles (mostly in Berlin) and holds degrees from Freie Universitaet Berlin and Sciences Po Paris.

Panel: Engineering Onboarding

Tuesday, 17:45–18:30

Moderator: Daria Barteneva, Microsoft

Panelists: Jennifer Petoff, Google; Anne Hamilton, Microsoft; Sandi Friend, LinkedIn; Ilse White, Learnovate

Available Media

In this panel on Engineering Onboarding, we will discuss with a few industry experts their thoughts on the big questions and challenges in this field, the significant changes of the past few years, and, finally, what comes next.

Daria Barteneva is a Principal Site Reliability Engineer in Observability Engineering in Azure. With a background in Applied Mathematics, Artificial Intelligence, and Music, Daria is passionate about machine learning, diversity in tech, and opera. In her current role, Daria is focused on changing organizational culture, processes, and platforms to improve service reliability and on-call experience.

Jennifer Petoff is Google's Director of SRE Education and is based in Dublin, Ireland. She leads the SRE EDU program globally and is one of the co-editors of the best-selling book, Site Reliability Engineering: How Google Runs Production Systems.

Anne Hamilton leads the Azure Engineering Learning (AEL) team at Microsoft. AEL creates and delivers training on Azure and related topics for internal engineering teams at Microsoft to help them learn new techniques and improve practices. When not working, Anne x/c skis (slowly), cycles (sparingly), and golfs (poorly).

Sandi Friend, now the manager of Technical Training and Documentation at LinkedIn, has been onboarding LinkedIn engineers for 9 years. During this time she has established several programs, including an award-winning Engineer Bootcamp that has proven to reduce onboarding time, increase productivity, and prepare the engineer for embedding in their team. She has also created and scaled programs to facilitate knowledge sharing through Tech Talks, labs, videos, and other methods completely generated by the engineers themselves. Prior to joining LinkedIn she spent twenty-plus years in the high tech industry creating and delivering technical training for various customers throughout the world. To better help the engineers at LinkedIn, Sandi is expanding her expertise to include accessibility so she is able to foster training, documentation, and tooling that is available to everyone. She is also evaluating options for improving training, documentation, and development processes to accommodate the hybrid workforce while still maintaining quality and engagement. She is a world traveller who resides in Austin, TX with her teenage daughter. She enjoys exploring new places and experiencing different cultures. But right now she is settling for a good movie.

Ilse White has been a Corporate Learning Researcher at Learnovate since February 2021. Prior to Learnovate, she spent 15 years working on the People Development team in Google where she gained extensive experience in learning design, learning strategy & program management as well as technology-enhanced learning across a wide range of topics like onboarding and management/leadership development. She cares about delivering great learning experiences and has a strong client focus.

Ilse holds an MSc in Business Communication from Radboud University Nijmegen (Netherlands) and a Graduate Diploma in Education & eLearning from Dublin City University (DCU) and is currently considering her next educational focus, possibly psychology or organizational development.

Originally from a small town in the Netherlands she has made Dublin's northside her home for the past 16 years, where she lives with her husband and 3 children. If she had any spare time, she would be walking and cycling all around the country and writing children's books.

Track 2 (Core Principles)

Sparking Joy for Engineers with Observability

Tuesday, 16:30–16:45

Zac Delagrange, Area 1 Security

Available Media

Too often we're so concerned with sparking joy and delight for our customers that we forget that our developers' experience and joy have a direct impact on the end product and customer happiness. We found that without an Exception Reporting tool, we were relying on customers' unhappiness to report issues in our web application, and that by integrating Exception Reporting with Distributed Tracing we could get ahead of customer issues while improving our developers' experience.

You'll learn how to improve time to resolution and lift your engineers out of a sea of logs and hearsay and into context-driven fixes in this story about exercising distributed services with Observability tools.

Zac has been improving product quality through continuous integration, test automation, transparent metrics, and collaborative, social development processes. Specializing in enabling peers to ship code quickly and confidently, Zac is a self-taught software engineer with an academic background in Information Systems and Marketing. He has proven experience as a DevOps engineer with a big Quality voice, engineering solutions at Cyber Security start-ups for the last 6 years.

Latency Distributions and Micro-Benchmarking to Identify and Characterize Kernel Hotspots

Tuesday, 16:45–17:15

Danny Chen, Bloomberg LP

Available Media

For many systems, a fully distributed microservices architecture that scales perfectly horizontally remains an unrealized goal. Many systems still co-locate processes on hosts because of performance considerations and/or because local on-host state must be shared. In our department at Bloomberg LP, architectural and performance constraints force us to run on bare metal hardware with many cores and terabytes of main memory.

Operating systems for bare metal hardware have developed and grown in order to enable greater scale (e.g., large numbers of processes/threads, large numbers of open files, etc.). But the OS doesn't always scale correspondingly for runtime scale (i.e., large levels of concurrency/contention). Furthermore, the systems we manage don't always scale with newer, larger, and faster hardware.

In this talk, we will present some case studies across a variety of operating systems that illustrate how we run into scale limits in the OS and how we used micro-benchmarks to collect insights into the nature of these scale limits in order to develop fixes and workarounds. These micro-benchmarks also complement the wonderful new tracing facilities in modern operating systems by eliminating "noise" and focusing data collection on kernel "hot spots."

Danny Chen has been involved in UNIX performance engineering for over 40 years. He's worked on the UNIX SVR3 and SVR4 kernels, market data, messaging and transactional systems, and enterprise systems monitoring. Most recently, he has been applying performance ENGINEERING (not art) principles to his SRE responsibilities as a member of the Trading Solutions SRE team at Bloomberg LP.

Connect:

@malaclypsedjung

When Linux Memory Accounting Goes Wrong

Tuesday, 17:15–17:45

Minhaj Ahammed, LinkedIn

Available Media

This talk is about debugging an issue where hosts ran out of memory and became inaccessible even though the applications were limited by cgroups. It covers topics such as memory accounting in cgroups, which is not always straightforward when there are multiple variables at play. We discuss a case where cgroups may not properly account for memory usage, which can be disastrous for cohosted applications or the host itself.

Minhaj Ahammed is a Site Reliability Engineer at LinkedIn, where he works as a part of the Search Infrastructure team. Prior to joining LinkedIn, he worked at Yahoo as a Production Engineer, spending his time building automations and managing the Advertising and Reporting pipeline. Minhaj likes spending time debugging issues and deep-diving into Linux internals.

Panel: Observability

Tuesday, 17:45–18:30

Moderator: Daria Barteneva, Microsoft

Panelists: Liz Fong-Jones, honeycomb.io; Gabe Wishnie, Microsoft; Štěpán Davidovič, Google; Richard Waid, LinkedIn; Partha Kanuparthy, Facebook

Available Media

In this panel on Observability, we will discuss with a few industry experts their thoughts on the big questions and challenges in this field, the significant changes of the past few years, and, finally, what comes next.

Daria Barteneva is a Principal Site Reliability Engineer in Observability Engineering in Azure. With a background in Applied Mathematics, Artificial Intelligence, and Music, Daria is passionate about machine learning, diversity in tech, and opera. In her current role, Daria is focused on changing organizational culture, processes, and platforms to improve service reliability and on-call experience.

Liz is a developer advocate, labor and ethics organizer, and Site Reliability Engineer (SRE) with 16+ years of experience. She is an advocate at Honeycomb for the SRE and Observability communities, and previously was an SRE working on products ranging from the Google Cloud Load Balancer to Google Flights.

Gabe Wishnie is a Partner Engineering Manager at Microsoft. He leads teams responsible for both the metrics and distributed tracing capabilities for Microsoft. The products are utilized across the company for large internal workloads and externally by Azure customers.

Štěpán Davidovič is a site reliability engineer at Google. He currently works on internal infrastructure for automatic monitoring. In previous Google SRE roles, he developed Canary Analysis Service and has worked on both a wide range of shared infrastructure projects and AdSense reliability. He obtained his bachelor's degree from Czech Technical University, Prague, in 2010.

Richard Waid is the Director of Monitoring Infrastructure at LinkedIn, encompassing emission and storage of time series telemetry, as well as alerting through triage, auto-remediation, and notifications. In addition, he is leading the team to automate the migration of LinkedIn to Azure.

Partha Kanuparthy is a Software Engineer in the Monitoring area at Facebook. His work covers the overall Observability space: scalable systems for logs, relational data, traces, metrics, events, and metadata; leveraging them for real-time automated analyses and data interfaces; and domain-specific observability use cases.

Wednesday, October 13 (Day 1 - APAC/Americas West)

01:00–01:15

Opening Remarks

Program Co-Chairs: Heidi Waterhouse, LaunchDarkly, and Daria Barteneva, Microsoft

01:15–02:00

Opening Plenary Session

Rethinking the SDLC

Wednesday, 01:15–02:00

Emily Freeman, AWS

Available Media

The software (or systems) development lifecycle has been in use since the 1960s. And it’s remained more or less the same since before color television and the touchtone phone. While it’s been looped into circles and infinity loops and designed with trendy color palettes, the stages of the SDLC remain almost identical to its original layout.

Yet the ecosystem in which we develop software is radically different. We work in systems that are distributed, decoupled, complex and can no longer be captured in an archaic model. It’s time to think different. It’s time for a revolution.

The Revolution model of the SDLC captures the multi-threaded, nonsequential nature of modern software development. It embodies the roles engineers take on and the considerations they encounter along the way. It builds on Agile and DevOps to capture the concerns of DevOps derivatives like DevSecOps and AIOps. And it, well, revolves to embrace the iterative nature of continuous innovation. This talk introduces this new model and discusses the need for how we talk about software to match the experience of development.

02:00–03:00

Track 1

Elephant in the Blameless War Room—Accountability

Wednesday, 02:00–02:30

Christina Tan and Emily Arnott, Blameless

Available Media

How do you reconcile the ideal of blamelessness with the demand for blame? When is it constructive to hold someone accountable, and how? To change a blameful culture, we must empathize with those that point the finger and see how their goals align with our own. We'll show you how to communicate that their goals can be achieved blamelessly. Lastly, we'll share how to hold true accountability well.

Christina is on the strategy team at Blameless, architecting interpersonal dynamics for conflict resolution, high performance teams, and executive alignment. Prior to that, Christina coached TED speakers for public speaking and startup founders for fundraising. Her clients have collectively raised over $240M. In her spare time, she runs the mindfulness community Serenity Lounge.

Emily Arnott is a writer for Blameless. Although she's an outsider to the SRE space, she's been eagerly developing her own perspectives on its practices and culture. You can find her writing at Blameless' blog Failure is Inevitable.

How LinkedIn Performs Maintenances at Scale

Wednesday, 02:30–03:00

Akash Vacher, LinkedIn

Available Media

LinkedIn runs on a fleet of hundreds of thousands of dedicated servers and network devices distributed across the globe which are used to serve the website. Any downtime for these devices may result in disruption to the applications running on top of this infrastructure. Hence, it's crucial to ensure that all maintenances on these devices, such as firmware upgrades or hardware replacements, are performed without impacting the overall availability of services running on top of these devices.

This talk describes the various SRE principles that guided the inception and development of a platform to schedule and execute infrastructure maintenances and share the learnings we had along the way.

Akash Vacher is a Site Reliability Engineer at LinkedIn. He worked on large-scale streaming data infrastructure services such as Kafka, Samza, and Brooklin before transitioning over to help facilitate infrastructure maintenance at scale at LinkedIn.

Connect:

@AkashVacher

Track 2 (Core Principles)

Take Me Down to the Paradise City Where the Metric Is Green and Traces Are Pretty

Wednesday, 02:00–02:30

Ricardo Ferreira, Elastic

Available Media

Observability is a software discipline that goes back to when virtually any problem could be solved by tailing web server logs. But the world has changed, and systems these days are comprised of different services running in their own stacks that cooperatively build up what we understand as the end-to-end architecture. Thus, observability had to evolve as well.

Today we have OpenTelemetry—an observability framework for cloud-native software. OpenTelemetry provides the tools, APIs, and SDKs to create a reusable, robust, and non-vendor-driven observability strategy. But the reality is that most developers are still confused about the lines that separate OpenTelemetry from the past and which parts of the framework are stable enough to be used in production. This talk will explain how OpenTelemetry works and provide examples in Java and Go to illustrate the APIs you can use to produce traces and metrics.

Ricardo is Principal Developer Advocate at Elastic—the company behind the Elastic Stack (Elasticsearch, Kibana, Beats, and Logstash), where he does community advocacy for North America. With 20+ years of experience, he may have learned a thing or two about Distributed Systems, Observability, Streaming Systems, and Databases. Before Elastic, he worked for other vendors such as Confluent, Oracle, Red Hat, and different consulting firms. These days Ricardo spends most of his time making developers fall in love with technology.

While not working, he loves barbecuing in his backyard with his family and friends, where he gets the chance to talk about anything that is not IT-related. He lives in North Carolina, USA, with his wife and son.

Connect:

@riferrei

Need for SPEED: Site Performance Efficiency, Evaluation and Decision

Wednesday, 02:30–03:00

Kingsum Chow, Alibaba, and Zhihao Chang, Zhejiang University

Available Media

When you are tackling many servers in the data center, saving a small percentage of servers can bring a significant return. We will describe how we evaluate performance at scale, and also how it is different from optimization on a single system.

The emergence of large-scale software deployments in the data center has led to several challenges: (1) measuring software performance in the data center, and (2) evaluating performance impact of software or hardware changes. We will highlight a couple of problems that may lead to wrong conclusions. We will present a sketch of our solutions.

Kingsum is a principal engineer at Alibaba CTO Line Technology Risk and Efficiency Group. Since receiving his Ph.D. in Computer Science and Engineering from the University of Washington in 1996, he has been working on performance, modeling, and analysis of software applications. After working at Intel for 20 years, Kingsum joined Alibaba in 2016. Since then, he has been driving software performance optimization at data center scale. He has been issued more than 23 patents. He has presented more than 110 technical papers.

Zhihao Chang is a PhD student in the College of Computer Science and Technology, Zhejiang University. His research interests include spatial query optimization.

03:00–03:30

Break

03:30–05:30

Track 1

SLX: An Extended SLO Framework to Expedite Incident Recovery

Wednesday, 03:30–04:00

Qian Ding and Xuan Zhang, Ant Group

Available Media

This talk is based on a real journey of establishing SLOs for an infrastructure SRE team whose availability target is higher than 99.999%. First, we reveal our process for defining SLOs and demonstrate the gaps between expectations and reality when using SLOs with dev teams. Second, we present a uniform SLO framework (SLX) designed to help SREs manage hundreds of SLOs. For example, other than using SLO data for basic alerting and weekly reporting, we combine the SLO framework with statistical anomaly detection algorithms to locate the pitfalls automatically. To achieve that, we introduce several new concepts like Service-Level-Factor (SLF) and Service-Level-Dependency (SLD) and use them to build SLO knowledge graphs across multiple infrastructure systems. Finally, we present our intent-driven SLX implementation inspired by the Kubernetes design and the GitOps paradigm.

Qian works at Ant Group as a staff engineer focusing on site reliability engineering. He is the SRE tech lead for adopting Kubernetes in Ant Group's production environment. He is passionate about adopting and promoting SRE's philosophy for managing large-scale production systems. His current interests include designing SLOs from the end user's perspective for using Kubernetes, as well as using SLOs to drive reliability feature development for Kubernetes.

Xuan Zhang works at Ant Group as an SRE. He is a full-stack engineer with a passion for coding and building all kinds of systems. He has been focusing on automation and, with his out-of-the-box thinking, has pushed the boundaries by automating processes that were deemed implausible.
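To put an availability target of 99.999% in perspective, the arithmetic below converts an SLO into an error budget. This is a generic sketch, not part of the SLX framework: the function name and the 30-day window are assumptions.

```python
def error_budget_seconds(slo: float, window_days: int = 30) -> float:
    """Seconds of allowed unavailability in the window for a given availability SLO."""
    window_seconds = window_days * 24 * 60 * 60
    return window_seconds * (1.0 - slo)


# Compare common availability targets over an assumed 30-day window.
for slo in (0.999, 0.9999, 0.99999):
    print(f"{slo:.3%} availability -> {error_budget_seconds(slo):.2f} s of budget")
```

At 99.999% the budget over 30 days is roughly 26 seconds, which leaves very little room for manual detection and response.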

Watching the Watchers: Generating Absent Alerts for Prometheus

Wednesday, 04:00–04:15

Nick Spain, Stile Education

Available Media

You've written some great recording rules and alerts for your Prometheus monitoring system, you've carefully recreated scenarios to check that the alerts fire—awesome! Your app is never failing silently again! And yet, months later you realize that your system has silently fallen over. How? The cron job that exports the metrics just didn't run, the collector changed its labels: the metrics are missing. Your Prometheus alerts aren't going to fire and you won't know that they've gone away. You could write the alerts manually, but that's a lot of toil and you don't trust yourself not to forget—let's automate it! At Stile Education, we built a tool for generating these alerts automatically. Come along to find out what we did, why we did it, and how it's been useful in the 6 months since we introduced it.
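For a sense of what generating such alerts can involve, the sketch below renders a Prometheus alerting rule built on the PromQL `absent()` function for a given metric name. It is not Stile Education's tool: the function name, the alert naming scheme, the `for` duration, the severity label, and the sample metric are all hypothetical.

```python
def absent_alert_rule(metric: str, for_duration: str = "10m") -> str:
    """Render one Prometheus alerting rule (YAML) that fires when `metric` has no samples."""
    # Hypothetical naming scheme: snake_case metric name -> CamelCase + "Absent".
    alert_name = "".join(part.capitalize() for part in metric.split("_")) + "Absent"
    return (
        f"- alert: {alert_name}\n"
        f"  expr: absent({metric})\n"
        f"  for: {for_duration}\n"
        f"  labels:\n"
        f"    severity: warning\n"
        f"  annotations:\n"
        f"    summary: \"No samples for {metric}\"\n"
    )


# Hypothetical metric exported by a cron job.
print(absent_alert_rule("backup_last_success_timestamp"))
```

A real generator would more likely derive the metric names and label selectors from the existing alert expressions instead of taking them as arguments.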

Nick is a Software Engineer working at Stile Education, helping build a platform that enables teachers to provide a world-class science education to their students. He loves automating things and getting out for a good hike.

Connect:

@nick_espana

A Principled Approach to Monitoring Streaming Data Infrastructure at Scale

Wednesday, 04:15–04:30

Eric Schow and Praveen Yedidi, CrowdStrike

Available Media

We ingest over a trillion events per day into our cloud platform and it is very important that this platform is available, operational, reliable, and maintainable.

In creating a comprehensive monitoring strategy for our data processing platform, we found it strategic to model our platform's efficiency and resilience along two axes—complexity of implementation and engineer experience—from which we can define four quadrants—observability, operability, availability, and quality.

In this talk, we present how we've employed this four-quadrant model to establish key indicators and enforceable quality SLAs in order to improve the resilience of our cloud platform while reducing operational complexity.

Computational Biophysicist turned Mobile Engineer turned Cloud Engineer turned Site Reliability aficionado. Currently on a mission to stop breaches at CrowdStrike, where I lead the Site Reliability team.

Connect:

@ericvschow

Distributed systems developer with experience in mentoring, facilitating, and leading teams, offering a decade of experience in large-scale cloud-native application and tooling development. Possessing excellent analytical skills combined with strong knowledge of Go, JavaScript, Kubernetes, AWS, Terraform, Vault, Consul, Service Meshes, Observability, and monitoring tools. Active open-source contributor to projects like Kubernetes, gvisor, grafana, terraform, and firecracker-containerd. I enjoy speaking and have spoken at conferences like Kafka Summit, JS Conf, ContainerCamp AU, DDD Sydney, and Go Days. Organizer of Serverless Days Melbourne.

Connect:

@geek4evr

Let's Bring System Dynamics Back to CS!

Wednesday, 04:30–05:00

Marianne Bellotti, Rebellion Defense

Available Media

System Dynamics is the process of modeling systems in feedback loops. Developed by computer scientists at MIT, System Dynamics as a technical approach eventually fell out of fashion in favor of formal verification. And yet as we build distributed applications and the environment of automation to support them we see more and more outages triggered by dysfunctional feedback loops. These problems are impossible to model in formal verification but can be reasoned about in System Dynamics. This talk will discuss the history of System Dynamics, what tooling is available for software engineers to build and run models, and how to represent various architectures using the abstractions of System Dynamics.

Marianne Bellotti has worked as a software engineer for over 15 years. She built data infrastructure for the United Nations to help humanitarian organizations share crisis data and tackled some of the oldest and most complicated computer systems in the world as part of the United States Digital Service. At Auth0 she ran Platform Services, a portfolio that included shared services, untrusted code execution, and developer tools. Currently, she runs engineering teams at Rebellion Defense.

Connect:

@bellmar

From 15,000 Database Connections to under 100—A Tech Debt Tale

Wednesday, 05:00–05:30

Sunny Beatteay, DigitalOcean

Available Media

Whenever a company scales quickly, they invariably take on technical debt. It's unavoidable. However, the existence of tech debt isn't the problem; the problem is how you handle it when you can't put it off any longer.

In this talk, I will be telling the story of one company's largest technical re-architecture to date. It was a company-wide effort that extended over multiple years and taught us many lessons. And it all revolved around a single, overloaded database.

This talk is geared towards mid-level engineers and up who have a solid understanding of tech debt and distributed systems. I discuss some advanced topics and show various architectural diagrams, though beginners will likely find it interesting as well.

Sunny is a software engineer at DigitalOcean where he works on building managed storage products. He has a passion for whiskey and working on distributed systems—preferably in that order. When he's not breaking production, he can be found trying to figure out how to pipe all his troubles into /dev/null.

Connect:

@SunnyPaxos

Track 2 (Core Principles)

MySQL and InnoDB Performance for the Rest of Us

Wednesday, 03:30–04:00

Shaun O'Keefe

Available Media

I know you.

You lose sleep wondering what your SQL is actually doing as you lie in bed at night. You quickly leave the room after saying "storage engine" lest someone ask you to explain what one actually is. You plead with your queries to use your indexes or whisper secret oaths to deities so they may grant you the clarity to understand the results of your EXPLAINs. When asked where your data is stored on disk, your answer is most likely going to be the name of a city.

I know you because I once was you. But then I read the docs. And I'm here to share what I learned.

Shaun O'Keefe is the Head of Platform Engineering (HoPE) at Stile Education, a small but growing group of people who feel very strongly about improving scientific literacy, and about being very, very nice to one another. Shaun likes his infrastructure boring, his pager quiet, and his engineers as friendly as circumstances allow. For the past 10 years, Shaun has slowly been fleeing down the east coast of Australia to finally settle in Melbourne, where he lives with his wife and two kids. Almost all of the peanut butter on his screen is from the two kids.

Connect:

@yomilk

Cache Strategies with Best Practices

Wednesday, 04:00–04:30

Tao Cai, LinkedIn

Available Media

Problem statement
It's expensive to retrieve remote data in many data-intensive and latency-sensitive online services. A cache is widely used as a common practice to speed up the service. It's challenging to build an efficient cache, and a bad cache implementation can do the system more harm than good.

Proposal
This is a technical deep-dive about a set of cache strategies with real examples. We will describe Cache Item strategies, Multiple Cache TTL strategies, and Cache warm-up strategies. We'll explain how they significantly improve system performance while maintaining cache efficiency and increasing availability. We will also share our practices in adopting those strategies.

Tao Cai is a Staff Software Engineer in Ads Serving Infra at LinkedIn and a former Site Reliability Engineer on the Ads team. He focuses on the ads system's scalability, reliability, and latency improvement.

Trustworthy Graceful Degradation: Fault Tolerance across Service Boundaries

Wednesday, 04:30–05:00

Daniel Rodgers-Pryor, Stile Education

Available Media

Does your service gracefully degrade when a database, cache, or remote service becomes unavailable? Are you sure?

Graceful failure when downstream services are overloaded or down is critical to maintaining availability in complex systems, but because each of these failures is (hopefully!) rare, it can be hard to really trust that your systems will respond as expected in an emergency.

Come along to learn: how easy it is to mess this up (repeatedly), what it takes to maintain a trustworthy system, and a set of simple approaches for testing system failures in both CI and production.

Daniel's academic background in Physics and Computer Science gave him a passion for promoting scientific literacy and an interest in managing complex computing systems. Since joining Stile as a junior engineer in 2014, Daniel has spent his time developing features, optimizing processes, and battling fires. Since stepping into the role of CTO in 2017, Daniel has spearheaded the technical and organizational challenges of delivering exciting interactive science lessons to more than 1 in 3 Australian high school students, while expanding internationally.

Optimizing Cost and Performance with arm64

Wednesday, 05:00–05:30

Liz Fong-Jones, honeycomb.io

Available Media

Honeycomb.io, a Series B startup in the observability space, evaluated the arm64 processor architecture in order to improve cost and performance of its telemetry ingest and indexing workload. Over a year, 92% of all its compute workloads migrated successfully to arm64, cost of compute dropped by 40%, and end-user visible latency improved modestly. However, the journey was not without roadblocks and challenges such as lack of full software compatibility, hidden performance quirks, and additional complexity. This talk describes the process of setting up the evaluation, full migration, and improvements made to the ecosystem to make the workload run smoothly at scale in the end.

Liz is a developer advocate, labor and ethics organizer, and Site Reliability Engineer (SRE) with 16+ years of experience. She is an advocate at Honeycomb for the SRE and Observability communities, and previously was an SRE working on products ranging from the Google Cloud Load Balancer to Google Flights.

Wednesday, October 13 (Day 2 - EMEA/Americas East)

14:00–14:55

Plenary Session

DevOps Ten Years After: Review of a Failure with John Allspaw and Paul Hammond

Wednesday, 14:00–14:55

Thomas Depierre, Liveware Problems; John Allspaw, Adaptive Capacity Labs; Paul Hammond

Available Media

Ten years after a talk that started the DevOps movement, we are bringing John Allspaw and Paul Hammond for a discussion with old men yelling at clouds.

From EE to launching a company focused on dev tools, Thomas Depierre's journey has been wide. French, CEO, freelancer sometimes, in general, a lover of cats and tea. Particularly interested in system thinking, incidents, and Resilience Engineering, Thomas loves to listen to war stories. He is now trying to bring Cognitive System Engineering to dev tools.

Connect:

@Di4naO

John Allspaw has worked in software systems engineering and operations for over twenty years in many different environments. John's publications include the books The Art of Capacity Planning (2009) and Web Operations (2010) as well as the foreword to "The DevOps Handbook." John served as CTO at Etsy, and holds an MSc in Human Factors and Systems Safety from Lund University.

Connect:

@allspaw

Paul Hammond is a software engineer, manager, and advisor. His career has spanned twenty years at companies including Slack, Adobe, Typekit, Flickr, and the BBC. His recent work has focused on how software is developed and operated, including building Slack’s continuous deployment pipeline, development environments, and test infrastructure.

Connect:

@ph

14:55–16:05

Track 1

Grand National 2021: Managing Extreme Online Demand at William Hill

Wednesday, 14:55–15:10

Matthew Berridge and Josh Allenby, William Hill

Available Media

The Grand National is the biggest horse race in the world, watched by over 500 million people worldwide, with 1 in 4 people in the UK placing a bet. It makes Black Friday look like a wet Tuesday, and it is William Hill's biggest betting day of the year. 2021 was a year like no other: with retail closed, online demand was huge. The challenge was coping with the demand of once-a-year customers whilst maintaining service to our long-serving customers. This talk looks at how SRE prepared for and ran the day and how we implemented Lambda@Edge logic and a queueing system to help us achieve this.

Matt Berridge is Head of Site Reliability UK at William Hill. Matt has spent 15 years in various roles supporting, troubleshooting, and scaling the systems at William Hill. He specializes in building and running teams that prepare for big events and doing proper Post-Mortems.

Josh Allenby is a Site Reliability Engineer at William Hill focusing on improving the performance and availability of William Hill's Sportsbook, Retail, and Gaming products. Despite only working in technology for a few years, Josh has a real passion for identifying issues and resolving complex incidents whilst learning and improving with every Post-Mortem.

Microservices above the Cloud—Designing the International Space Station for Reliability

Wednesday, 15:10–15:40

Robert Barron, IBM

Available Media

The International Space Station has been orbiting the Earth for over 20 years. It was not launched fully formed, as a monolith in space. It is built out of dozens of individual modules, each with a dedicated role—life support, engineering, science, commercial applications, and more. Each module (or container) functions as a microservice, adding additional capabilities to the whole. While the modules independently deliver both functional and non-functional capabilities, they were designed, developed, and built by different countries on Earth at different times and once launched into space (deployed in multiple different ways) somehow manage to work together—perfectly.

Despite the many minor reliability issues which have occurred over the decades, the ISS remains a highly reliable platform for cutting-edge scientific and engineering research.

In this talk, I will describe the way the space station was developed and the lessons SREs can learn from it.

Robert works for IBM as an SRE, ChatOps, and AIOps Solution Engineer who enjoys helping others solve problems even more than he enjoys solving them himself. Robert has over 20 years of experience in IT development & operations and is happiest when learning something new. He blogs about operations, space and AIOps at https://flyingbarron.medium.com.

Connect:

@flyingbarron

Horizontal Data Freshness Monitoring in Complex Pipelines

Wednesday, 15:40–16:05

Alexey Skorikov, Google

Available Media

The growing complexity of data pipelines and organizations poses more challenges to Data Reliability. The risk of data incidents multiplies with each hop downstream. Teams see a decrease in operational readiness to deal with specific classes of outages (data staleness, corruption, and loss), while the costs of incident resolution and the revenue impact of outages can grow non-linearly. Without understanding the full data dependency graph, it is hard to measure completeness of monitoring, leading to gaps. For any sizable organization, figuring out data dependencies manually is a non-starter. We will talk about:

Understanding critical business data flows, upstream & downstream dependencies;
Holistic data monitoring at scale to eliminate unnoticed data staleness, which can potentially lead to accumulated negative business impact;
One of the approaches Google took in this direction and lessons learned;
How you can practically instrument this approach in your organization.

Alexey Skorikov is a Technical Program Manager for Google Ads, specializing in Site Reliability Engineering (SRE), and working on programs for data reliability and data pipelines efficiency. Prior to Google, he helped tech companies in the US and EU to manage software programs, cross-functional engineering teams, and organizational changes.

Alexey has over 15 years of hands-on software engineering experience, having specialized in SaaS, network services, business processes automation, and data processing, across several industries such as GIS, AdTech, ISP, E-commerce, Oil & Gas.

As a speaker, Alexey delivered four distinct talks at professional conferences and five workshops for software engineers and managers.

16:05–16:30

Break

16:30–18:30

Track 1

How We Built Out Our SRE Department to Support over 100 Million Users for the World's 3rd Biggest Mobile Marketplace

Wednesday, 16:30–17:00

Sinéad O'Reilly, Aspiegel SE

Available Media

March 2020 was a strange month for everyone: our work and employee interactions changed fundamentally, and perhaps permanently, as the entire office-bound workforce shifted to working from home. Here at Aspiegel, it wasn't the only challenge that came our way. We combined an increased role in Huawei service management with retiring our managed services SRE team. This meant that over the year we would need to hire aggressively to replace the team, and also to support our new growth. Working through this onboarding over the course of the year would cause some hiccups along the way, but ultimately it would force us to change into a leaner and more professional SRE Department. Join us as we talk about what we did, what we learned, and how we can help others get there too!

After studying Engineering at UCD and completing a Master's in Telecoms, Sinéad spent a few years traveling around the world for Ericsson, teaching local Engineers how to set up 3G UMTS mobile phone networks. This was followed by a move to Salesforce and the consulting sphere for a few years, and then a move into Technical Operations leadership. After a couple of years in Salesforce Security, managing a DevOps SysSec Access Control team, she moved to Aspiegel to help lead the Site Reliability Department. There she enjoys managing several DevOps SRE teams, while also looking after Operations, and Training & Development for the Department as a whole.

You've Lost That Process Feeling: Some Lessons from Resilience Engineering

Wednesday, 17:00–17:30

David D. Woods, Ohio State University and Adaptive Capacity Labs; Laura Nolan, Slack

Available Media

Software systems are brittle in various ways, and prone to failures. We can sometimes improve the robustness of our software systems, but true resilience always requires human involvement: people are the only agents that can detect, analyze, and fix novel problems.

But this is not easy in practice. Woods' Theorem states that as the complexity of a system increases, the accuracy of any single agent's own model of that system—their 'process feel'—decreases rapidly. This matters, because we work in teams, and a sustainable on-call rotation requires several people.

This talk brings a researcher and a practitioner together to discuss some Resilience Engineering concepts as they apply to SRE, with a particular focus on how teams can systematically approach sharing experiences about anomalies in their systems and create ongoing learning from 'weak signals' as well as major incidents.

David Woods (Ph.D., Purdue University) has worked to improve systems safety in high-risk complex settings for 40 years. These include studies of human coordination with automated and intelligent systems and accident investigations in aviation, nuclear power, critical care medicine, crisis response, military operations, and space operations. Beginning in 2000-2003 he developed Resilience Engineering on the dangers of brittle systems and the need to invest in sustaining sources of resilience as part of the response to several NASA accidents. His results on proactive safety and resilience are in the book Resilience Engineering (2006). He developed the first comprehensive theory on how systems can build the potential for resilient performance despite complexity. Recently, he started the "SNAFU Catchers Consortium," an industry-university partnership to build resilience in critical digital services.

Connect:

@ddwoods2

Laura Nolan is a Senior Staff Engineer and tech lead at Slack, working mainly on service networking and ingress load balancing, as well as occasionally writing outage reports for the Slack Engineering blog. Laura has contributed to a number of books on SRE, including Site Reliability Engineering: How Google Runs Production Systems, Seeking SRE, and 97 Things Every SRE Should Know. She also regularly writes for USENIX's ;login: magazine, and is a member of the USENIX board and SREcon Steering Committee.

Connect:

@lauralifts

Scaling for a Pandemic: How We Keep Ahead of Demand for Google Meet during COVID-19

Wednesday, 17:30–17:45

Samantha Schaevitz, Google

Available Media

Many teams will have practiced and refined their Incident Management skills and practices over time, but no one had a playbook ready to go to manage the dramatic Coronavirus-driven usage growth of Google Meet without a user-facing incident. The response resembled more a temporary reorganization of more than 100 people than it did your typical page—the fact that there was no user-facing outage (yet), notwithstanding.

This talk will cover what this incident response structure looked like, and what made it successful.

Samantha Schaevitz is a Senior Staff Software Engineer in Site Reliability Engineering (SRE) in Zürich, where she keeps Google Meet, Calendar, Tasks, and Voice running. Originally from California, she enjoys thinking about why complex systems fail, and why airplanes do not. Her maximum latitudinal position is 67.853°.

Ceci N'est Pas un CPU Load

Wednesday, 17:45–18:00

Thomas Depierre, Liveware Problems

Available Media

Why can we not combine and troubleshoot code like we do for analog electronics? Come discover why digital forces us to use limited mental models.

From EE to launching a company focused on dev tools, Thomas Depierre's journey has been wide. French, CEO, freelancer sometimes, in general, a lover of cats and tea. Particularly interested in system thinking, incidents, and Resilience Engineering, Thomas loves to listen to war stories. He is now trying to bring Cognitive System Engineering to dev tools.

Connect:

@Di4naO

What If the Promise of AIOps Was True?

Wednesday, 18:00–18:30

Niall Murphy, RelyAbility

Available Media

Many SREs treat the idea of AIOps as a joke, and the community has good reasons for this. But what if it were actually true? What if our jobs were in danger? What if AI—which can play chess, Go, and Breakout more fluidly than any human being—was poised to conquer the world of production engineering as well? What could we, or should we, do about it?

Join me in this talk as we examine the current state of affairs, and the future, in the light of the promises made by AIOps companies. Or, in short, to ask the question, what if AIOps were true?

Niall Murphy has worked in Internet infrastructure since the mid-1990s, specializing in large online services. He has worked with all of the major cloud providers from their Dublin, Ireland offices, and most recently at Microsoft, where he was global head of Azure Site Reliability Engineering (SRE). He is the instigator, co-author, and editor of the two Google SRE books, and he is probably one of the few people in the world to hold degrees in Computer Science, Mathematics, and Poetry Studies. He lives in Dublin with his wife and two children.

Connect:

@niallm

Track 2 (OpML)

Model Monitoring: Detecting and Analyzing Data Issues

Wednesday, 16:30–16:45

Dmitri Melikyan, Graphsignal, Inc.

Available Media

Machine learning models are deployed to production environments increasingly often. Unlike traditional coded applications, ML model serving is fully data-driven. Unexpected input data, which the model has never seen in training and is not prepared to handle, may produce garbage model output that will be treated as valid by the rest of the application without producing any error or exception. For this reason, production models introduce a new monitoring requirement, model monitoring, that is not addressed by the existing tools. This talk focuses on monitoring and detection of production data issues, such as data anomalies and drift.

Dmitri Melikyan is a founder and CEO at Graphsignal, an ML model monitoring company. He spent the last decade developing application monitoring and profiling tools used by thousands of developers and SREs.

Connect:

@GraphsignalAI

Leveraging ML to Detect Application HotSpots [@scale, of Course!]

Wednesday, 16:45–17:00

Sanket Patel, LinkedIn

Available Media

This talk will explore various analyses done on service latency metrics and their correlation, while LinkedIn's data-center is under a stress test.

Note: you do not need to be a machine learning expert to make sense of this talk. We will not dive deep into the mathematics but will rather focus on the approach.

Sanket is a Site Reliability Engineer at LinkedIn, where he works in the infrastructure space with the capacity engineering team. He is also into cycling and blogging [superuser.blog].

Demystifying Machine Learning in Production: Reasoning about a Large-Scale ML Platform

Wednesday, 17:00–17:30

Mary McGlohon, Google

Available Media

Machine Learning is often treated as mysterious or unknowable. This can lead to SREs choosing to work around ML-related reliability problems in their systems rather than through them. This avoidance is not only risky but also unnecessary: Any given SRE operates with systems that they themselves may not know in great depth. To manage risk, they use a series of generalized techniques to understand the properties of the system and its failure modes.

In this talk, we apply this outside-in approach towards ML reliability, drawing from experiences with a large-scale ML production platform. We describe common failure modes (spoiler alert: they tend to be the same things that happen in other large systems), and based on these failure modes, recommend best practices for productionization: Monitor systems and protect them from human error. Understand data integrity needs, and meet them. Prioritize pipeline workloads for efficiency and backlog recovery.

Mary McGlohon is a Site Reliability Engineer at Google, who has worked on large-scale ML systems for the past 4 years. Prior to that, her career included data mining research, software development, and distributed pipeline systems. She completed a B.S. in computer science from the University of Tulsa and a Ph.D. in machine learning from Carnegie Mellon University. She is interested in how production techniques can make ML better for human operators and users.

Designing an Autonomous Workbench for Data Science on AWS

Wednesday, 17:30–17:45

Dipen Chawla, Episource LLC

Available Media

This talk is a synopsis of how we built a self-service workbench for the team of data scientists at Episource and designed it to be scalable, secure, and cost-efficient. The talk will also include the challenges we faced while navigating the architecture, the lessons learned, and the impact the workbench has had on Episource's ML dev workflows. If you are an organization looking to improve autonomy and promote rapid experimentation within your data science ranks, this talk will help you in your journey.

Dipen is a member of the MLOps and Engineering team at Episource, where he works on the deployment of scalable and secure architectures to the cloud. His primary areas of interest include container tech and ML in production.

Connect:

@dipen_chawla

Panel: OpML

Wednesday, 17:45–18:30

Moderator: Vanessa Yiu, Goldman Sachs

Panelists: Todd Underwood, Google; Josh Hartman, LinkedIn; Zhangwei Xu, Azure; Nisha Talagala, Pyxeda AI

Available Media

Vanessa Yiu leads Enterprise Architecture and Engineering Risk Management at Goldman Sachs, and is Co-Chair of SREcon21. She has worked in, and managed teams across, many engineering disciplines including SRE, platform security, infrastructure, data, and electronic trading platforms. Vanessa is a contributor to O’Reilly’s "97 Things Every SRE Should Know", and has been active with SREcon as a speaker, program committee member, and co-chair since 2018.

Todd Underwood is a Director at Google. He leads Machine Learning for Site Reliability Engineering (SRE) for Google. ML SRE teams build and scale internal and external ML services and are critical to almost every Product Area at Google. He is also the Engineering Site Lead for Google's Pittsburgh office.

Connect:

@tmu

Josh Hartman currently leads the ML infrastructure team at LinkedIn and is responsible for the Pro-ML platform, which is an opinionated framework for industrial AI. The platform encompasses model creation and training pipelines for a variety of model classes, a feature store, a model cloud for hosted inference, and a health assurance suite to ensure the system is functioning correctly on an ongoing basis. Before leading ML infrastructure, Josh was the first architect for the systems that powered AI for LinkedIn's feed and notifications systems. Josh then led the Careers AI organization, where his team introduced the first DL ranking model at LinkedIn to recruiter-search. In his free time, Josh enjoys playing with his three young kids and lifting weights.

Connect:

@hartmanster

Zhangwei Xu is a Director of Engineering in the Azure Edge+Platform team responsible for production infrastructure and engineering services for Azure and Microsoft, including Engineering Pipeline, Incident Management Platform, AIOps, DevOps experiences, etc. Before joining Azure, he worked in Windows, Xbox, and Bing as a software engineer and engineering manager.

Nisha Talagala is the CEO and founder of AIClub.World which is bringing AI Literacy to K-12 students and individuals worldwide. Nisha has significant experience in introducing technologies like Artificial Intelligence to new learners from students to professionals. Previously, Nisha co-founded ParallelM which pioneered the MLOps practice of managing Machine Learning in production for enterprises - acquired by DataRobot. Nisha is a recognized leader in the operational machine learning space, having also driven the USENIX Operational ML Conference, the first industry/academic conference on production AI/ML. Nisha was previously a Fellow at SanDisk and Fellow/Lead Architect at Fusion-io, where she worked on innovation in non-volatile memory technologies and applications. Nisha has more than 20 years of expertise in enterprise software development, distributed systems, technical strategy and product leadership. She has worked as technology lead for server flash at Intel - where she led server platform non-volatile memory technology development, storage-memory convergence, and partnerships. Prior to Intel, Nisha was the CTO of Gear6, where she designed and built clustered computing caches for high performance I/O environments. Nisha earned her PhD at UC Berkeley where she did research on clusters and distributed systems. Nisha holds 73 patents in distributed systems and software, over 25 refereed research publications, is a frequent speaker at industry and academic events, and is a contributing writer to Forbes and other publications.

Thursday, October 14 (Day 2 - APAC/Americas West)

01:00–01:45

Plenary Session

SRE for ML: The First 10 Years and the Next 10

Thursday, 01:00–01:45

Todd Underwood, Google

Available Media

Over 10 years ago we started building SRE for a large multi-model ML service at Google. We faced many interesting challenges including:

Defining scope: Why do these services need ML anyway?
Unclear SLOs: What are we measuring and how can we actually be responsible for those things?
Fuzzy demarcation with our modeling teams: What is a model quality problem caused by infrastructure vs a model quality problem caused by the model or the data?

With the explosion of ML training and serving platforms, the choices we faced are now confronting many SRE teams across the industry. I will review the history focusing on the decisions we made and why those made sense to us at the time and might make sense for others. And I'll try to answer the question of whether there is a real need for SRE for ML at all.

Todd Underwood is a Director at Google. He leads Machine Learning for Site Reliability Engineering (SRE) for Google. ML SRE teams build and scale internal and external ML services and are critical to almost every Product Area at Google. He is also the Engineering Site Lead for Google's Pittsburgh office.

Connect:

@tmu

01:45–03:00

Track 1

When Systems Flatline—Enhancing Incident Response with Learnings from the Medical Field

Thursday, 01:45–02:00

Sarah Butt, Salesforce

Available Media

In many ways, incident management is the "emergency room" for technical systems. As technology has evolved, it has progressed from auxiliary systems, to essential business systems of record, to critical systems of engagement across multiple industries. As these systems become increasingly critical, SRE's role in incident management and resolution has become vital for any essential technical system. This talk focuses on how various strategies used in the medical field can be applied to incident response. From looking at algorithm-guided decisions (and learning a bit about what "code blue" really means) to discussing approaches to triage and stabilization based on the ATLS protocol, to considering the role of response standardization such as surgical checklists in reducing cognitive overhead (especially when PagerDuty goes off at 2 a.m.!), this talk aims to take key learnings from the medical field and apply them in practical ways to incident management and response. This talk is largely conceptual in nature, with takeaways for attendees from a wide variety of backgrounds and technical experience levels.

Sarah is a former audio engineer turned technology professional who has spent the past 6 years of her career at Salesforce and Dell devoted to customer-perceived reliability. She is a 2021 MBA graduate from The University of Texas (Hook'em!) where she did graduate work studying the intersection of technology, business, and people in the context of SRE. A few of her favorite topics include user-centric monitoring, intelligent alerting, and using innovative technology to drive high availability of complex distributed systems. Sarah is currently part of Salesforce's SRE organization, where you'll likely find her talking about topics such as resilience, observability, and incident management and response. In her free time, you'll often find her hiking in the Texas Hill Country with Rosie, her yellow lab.

Evolution of Incident Management at Slack

Thursday, 02:00–02:30

D. Brent Chapman, Slack

Available Media

At Slack, we deliver over 150 million messages per minute at peak. Some fraction of those messages is us, managing incidents affecting the same platform that so many have come to rely on to manage their own incidents. How do we handle dozens of incidents a week, big and small, most of which our users are never aware of? Learn how we've made incident management a core capability of everyone on our engineering team: where we are, how we got here, and where we're going.

Brent Chapman is a Staff Engineer at Slack, with a focus on incident management. This involves building Slack's incident management capabilities; keeping incident management running smoothly day-to-day; helping the company learn from past incidents, and prevent and prepare for future incidents; and sharing Slack's incident management story with customers and the industry at large. Brent leads the training program for incident management processes throughout Slack Engineering, is the lead developer and instructor of Slack's classes for incident commanders, and is the co-developer and lead instructor of our classes for incident responders (which includes all engineers at Slack). He frequently coaches incident commanders throughout the company, at all levels, especially when they are new and growing into the role.

Connect:

@brent_chapman

Improving Observability in Your Observability: Simple Tips for SREs

Thursday, 02:30–03:00

Dan Shoop

Available Media

Would it surprise you to know that time-series data dates back hundreds of years to the early days of science, statistics, and data collection? It shouldn't when you think about it, and there are plenty of good lessons learned that we can re-apply to our dashboards and other engineering presentations today.

"Improving Observability In Your Observability" will present practical takeaway lessons that engineers can immediately utilize across various practices and at all levels to improve the visual understanding and credibility of their informational presentations of metrics and observability data.

As SREs interested in better presenting our telemetry, we'll explore what practical lessons we can learn from the works of Galileo, Charles Joseph Minard, and Edward Tufte as well as simple pitfalls to avoid and what we can do to improve the transfer and content density of our dashboards and other engineering graphics.

Dan Shoop is a systems engineer focusing on observability, incident management, and building resilient and evolving systems architectures. He's led SRE and Production Engineering teams for Venmo, Sesame Street, United States Technical Services, and HBO at Game of Thrones scale, where the team won a technical Emmy award for contributions to HBOGO. He's had a passion for improving the visualization and presentation of information after exploring the work of Edward Tufte and enjoys sharing lessons learned with his fellow engineers. Dan is also an avid photographer and outdoorsman and enjoys rebuilding retro-computers including PDP-11s and original Macintosh systems, as well as classical computing platforms and technologies.

Connect:

@colonelmode

Track 2 (OpML)

Hacking ML into Your Organization

Thursday, 01:45–02:00

Cathy Chen, Capriole Consulting Inc, Google LLC

Available Media

As a future chapter of the upcoming O'Reilly book, Reliable Machine Learning: Applying SRE Principles to ML in Production, this talk is aimed at SRE, developers, data scientists, and business people alike.

Consider SRE and other parts of the business as a business implements or scales ML. What does the business have to learn from ML, and what organizational changes should it think about as it expands support for ML as business-optimizing work?

Attendees will learn:

How to apply the organization design Star model to ML deployment
Where to build ML skills outside the tech team
Organizational changes needed to integrate ML into the organization

Cathy Chen, CPCC, MA, specializes in coaching tech leaders, enabling the development of their own skills in leading teams. She has held the role of technical program manager, product manager, and engineering manager. She has led teams in large tech companies as well as startups launching product features, internal tools, and operating large systems. Cathy has a BS in Electrical Engineering from UC Berkeley and an MA in Organizational Psychology from Teachers College at Columbia University. Cathy also works at Google in Machine Learning SRE.

Automating Performance Tuning with Machine Learning

Thursday, 02:00–02:30

Stefano Doni, Akamas

Available Media

SRE's main goal is to achieve optimal application performance, stability, and availability. A crucial role is played by configurations (e.g., container resource limits and replicas, runtime settings, etc.): wrong settings are among the top causes of poor performance, inefficiency, and incidents. But tuning configurations is a very complex and manual task, as there are hundreds of settings in the stack. We present a novel approach that leverages machine learning to find optimal configurations of the tech stack in an automated fashion. This approach leverages reinforcement learning techniques to find the best configurations based on an optimization goal that SREs can define (e.g., minimize service latency or cloud costs). We show an example of optimizing Kubernetes microservice cost and latency by tuning container resources and JVM options. We analyze the optimal configurations that were found, the most impactful parameters, and the lessons learned for tuning microservices.

Stefano is obsessed with performance optimization and leads the Akamas vision for Autonomous Performance Optimization powered by AI. With more than 15 years of experience in the performance industry, he has worked on projects for major national and international enterprises. He has presented several talks at the Computer Measurement Group international conference and in 2015, he won the Best Paper award for his contributions to capacity planning and performance optimization of Java applications.

Connect:

@stef3a

Nothing to Recommend It: An Interactive ML Outage Fable

Thursday, 02:30–03:00

Todd Underwood, Google

Available Media

This is the story of an ML outage, based on a real outage, but anonymized and recast to protect the innocent (and the guilty). An ML model is misbehaving, causing serious damage to the company. As with many outages, the underlying cause is unclear. In fact, even the timeline of the outage is unclear. This talk walks the audience through how the outage was detected, how troubleshooting worked, how it was mitigated and resolved, and what follow-up work was scheduled. The talk will aim to be (honor system for asynchrony) interactive!

Todd Underwood is a Director at Google. He leads Machine Learning for Site Reliability Engineering (SRE) for Google. ML SRE teams build and scale internal and external ML services and are critical to almost every Product Area at Google. He is also the Engineering Site Lead for Google's Pittsburgh office.

Connect:

@tmu

03:00–03:30

Break

03:30–05:30

Track 1

Practical TLS Advice for Large Infrastructure

Thursday, 03:30–04:00

Mark Hahn, Ciber Global; Ted Hahn, TCB Technologies

Available Media

We will present practical advice for leveraging TLS to secure communications across your infrastructure. This applies to nodes and pods on Kubernetes or on other large deployment infrastructure. The current tool set for large infrastructure leaves various gaps for deploying TLS and also causes friction within your infrastructure.

Protocols like ACME and tools like service meshes provide some support for distributing certificates but do not help with the larger problems of certificate authority architecture, nor provide advice for how to build certificates that strengthen your security posture.

PKI can be used to reduce security risks and simplify reporting. Public key infrastructure can be used to identify services to one another with a very different set of tradeoffs than shared-secret infrastructure.

Mark Hahn is Practice Director for Cloud Strategies and DevOps for Ciber Global. He has 25+ years of experience as a Principal Architect delivering large-scale systems, including Wall Street trading systems, multinational retail payments systems and supply chain systems. Mark practices and coaches continuous delivery techniques that improve delivery timelines and increase system reliability, including Lean software development and continuous improvement.

Ted Hahn is an experienced Site Reliability Engineer, having worked at Google, Facebook, and Uber, and most recently having been the primary SRE for Houseparty—maintaining an infrastructure used for thousands of QPS by millions of users in a company of less than 50.

User Uptime in Practice

Thursday, 04:00–04:30

Anika Mukherji, Pinterest

Available Media

As SREs, ultimately user experience is our most important metric. At Pinterest, like many other companies, we were using success rate as a proxy for the quality of our service to our users. However, success rate is fraught with many issues when it comes to representing product quality, which made it difficult for us to understand, measure, and react to changes in "Pinner" experience.

We landed upon User Uptime as our solution. This is a "time-based" metric that presents many advantages over a "count-based" metric like success rate. During this talk, we will discuss how Pinterest went about implementing such a metric—in terms of technological stack and design decisions—and what we learned in the process, about both our product and our users.
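To make the count-based versus time-based distinction concrete, here is a minimal sketch. The definition used (a user-minute counts as "up" only if every request that user made in that minute succeeded), the function names, and the sample data are all illustrative assumptions, not Pinterest's implementation.

```python
from collections import defaultdict

def success_rate(requests):
    """Count-based: fraction of all requests that succeeded."""
    return sum(1 for _, _, ok in requests if ok) / len(requests)

def user_uptime(requests, bucket_seconds=60):
    """Time-based: fraction of active (user, time bucket) pairs with no failures.

    This definition is an assumption made for illustration only.
    """
    buckets = defaultdict(lambda: True)
    for user, ts, ok in requests:
        key = (user, ts // bucket_seconds)
        buckets[key] = buckets[key] and ok
    return sum(buckets.values()) / len(buckets)

# (user, unix_seconds, succeeded): hypothetical sample data
requests = (
    [("heavy_user", t, True) for t in range(0, 60)]  # 60 successes in minute 0
    + [("light_user", 5, False)]                     # 1 failure in minute 0
)
```

With this sample, the heavy user's 60 successes dominate the success rate (60 of 61 requests), while the time-based figure is 0.5 because one of the two active users had a bad minute. Both functions assume a non-empty list of requests.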

Anika is a senior SRE at Pinterest's HQ in San Francisco. She is embedded in several teams, including the API platform team, the web platform team, the traffic team, and the continuous delivery team. She focuses on making the core "Pinner" experience reliable and measurable, with a special emphasis on safe production changes. She also has experience in the performance realm and has worked on improving the speed of the Pinterest product.

Learning More from Complex Systems

Thursday, 04:30–05:00

Andrew Hatch, LinkedIn

Available Media

Complex systems are everywhere: from simple organisms like slime molds and the neural processing power of human brains, to rush-hour traffic and the vast resources of people and technology interacting with each other in modern organizations. They can be both organic and man-made; they evolve and adapt to their environments and possess a number of common traits, notably emergent behaviors, innate levels of resiliency, a history responsible for their current state, and, most importantly, a tendency to become more complex.

But how do we learn about complex systems? Is the pursuit of knowledge through linear analysis enough? Can we cope with complexity by root-cause analyzing our way out of failure? Or are there other approaches to learning that we need?

This talk will distill a bit of history, science, and philosophy into takeaways that will encourage the audience to think differently about how we can learn from incidents in complex systems.

Andrew moved to the Bay Area last year to become an SRE Manager at LinkedIn. Prior to this, Andrew spent over 20 years working in Australia (plus some time in India), working on small through to large scale software systems, in multiple roles and industries. Before migrating to the US, Andrew spent 6 years moving Australia's biggest online jobs and recruitment platform into AWS. Since 2013, Andrew has predominantly worked as an IC and Manager of SRE teams and through this experience developed a passion for learning and adapting to complex systems and helping teams and organizations learn more from incidents to create better software, more resilient systems, and happier empowered teams. Andrew is also a life-long surfer and can now be found adapting to the crowds at Santa Cruz when not at work or at home with his family.

Connect:

@hatchman76

Of Mice & Elephants

Thursday, 05:00–05:30

Koon Seng Lim and Sandeep Hooda, DBS

Available Media

Just as in the well-known myth that the mighty elephant cowers before the tiny mouse, the presence of small files in the Hadoop Distributed File System can literally bring the elephant of big data to its knees! In this talk, we chronicle our journey of managing small files after an unfortunate incident rendered our multi-petabyte cluster inactive for close to 5 days. The aftermath of the incident spawned a flurry of collaborative work between our infrastructure SRE, enterprise SRE, platform SRE, and application teams. We discuss the various lessons learned and experience gained from experimenting with and implementing various mitigating measures in the domain of people, process, and technology to combat the scourge of small files, a perennial problem in Hadoop. We show that with the proper mitigating controls and technical capabilities afforded by newer distributed file systems, mice and elephants can coexist happily, just as in real life.

Koon Seng received a Masters in Electrical Engineering from Columbia University in 1996 and a Bachelors in Computer Science from NUS in 1991. By day, he is an Executive Director and heads the SRE team of DBS Middle Office. By night, he prowls through code with his trusty pet snake Python 3, hunting for the occasional bug. In his past life, he spent 17 years founding and working for various startups in the US before returning to Singapore in 2010.

Sandeep leads the SRE team focused on learning from incidents and containerization, and is responsible for various SDLC tools specializing in DevOps. Prior to this, he was managing cloud infrastructure teams where his responsibilities were usually automation, infrastructure architecture, and working closely with solution architects. He is a passionate user of open source with a strong focus on creating quintessential solutions. He conducts workshops to educate and spread awareness on blameless culture "from tech incidents to biz decisions." He is a Sci-Fi lover, with a keen interest in astronomy, dreams of space exploration, and sailing around the world.

Thursday, October 14 (Day 3 - EMEA/Americas East)

14:00–16:00

Lessons Learned Using the Operator Pattern to Build a Kubernetes Platform

Thursday, 14:00–14:30

Pavlos Ratis, Red Hat

Available Media

In Kubernetes, the Operator pattern helps capture good production practices and the expert knowledge of practitioners for managing a service or a set of services.

An Operator acts as an SRE for an application. Instead of manually following playbooks or scripts to deploy databases or mitigate issues in Kubernetes, SRE teams can use off-the-shelf Operators or develop their own to automate these processes and reduce toil work.

In this session, we will explore the Operator pattern with some examples of how we have used them at Red Hat to build OpenShift. We will discuss some lessons learned, common pitfalls running Operators, and when it makes sense to write one.

Pavlos Ratis is a Senior Site Reliability Engineer at Red Hat, where he works on the OpenShift team. He is the creator and curator of the awesome-sre and awesome-chaos-engineering GitHub repositories.

Connect:

@dastergon

Nine Questions to Build Great Infrastructure Automation Pipelines

Thursday, 14:30–15:00

Rob Hirschfeld, RackN

Available Media

Sure, we love Infrastructure as Code, but it's not the end of the story. This talk steps back to see how different automation types and components can be connected together to create Infrastructure Pipelines. We'll review the nine essential questions that turn great automation into modular, portable, and continuous infrastructure delivery pipelines.

Do you keep wondering why building automation is so hard and even harder to share as a community? That really bugs Rob too! He has been creating software to collaboratively automate infrastructure for over 20 years. His latest startup, RackN, focuses on providing Distributed IaC automation and abstraction layers for provisioning Cloud, Edge, and Enterprise data centers. He is also building a forward-looking operator community at the2030.cloud with weekly DevOps and future hallway-type discussions.

Connect:

@zehicle

Hard Problems We Handle in Incidents but Aren't Recognized

Thursday, 15:00–15:30

John Allspaw, Adaptive Capacity Labs

Available Media

If we know how and where to look closely, we can find a number of dynamics, dilemmas, and sacrifices that people make in handling incidents. In this talk, we'll highlight a few of these often-missed aspects of incidents, what makes them sometimes difficult to notice, and give some descriptive vocabulary that we can use when we do notice them in the future.

John Allspaw has worked in software systems engineering and operations for over twenty years in many different environments. John's publications include the books The Art of Capacity Planning (2009) and Web Operations (2010) as well as the foreword to "The DevOps Handbook." His 2009 Velocity talk with Paul Hammond, "10+ Deploys Per Day: Dev and Ops Cooperation," helped start the DevOps movement. John served as CTO at Etsy, and holds an MSc in Human Factors and Systems Safety from Lund University.

Connect:

@allspaw

Experiments for SRE

Thursday, 15:30–16:00

Debbie Ma, Google LLC

Available Media

Incident management for complex services can be overwhelming. SREs can use experiments to attribute and mitigate production changes that contribute to an outage. With experiments to guard production changes, SREs can also reduce a (potential) outage's impact by preventing further experiment ramp up if the production change is associated with unhealthy metrics. Beyond incident management, SREs can use experiments to ensure that reliable changes are introduced to production.

Debbie is a Site Reliability Engineer/Software Engineer (SRE/SWE) at Google focusing on improving the reliability of experiment infrastructure. Debbie initially worked on experiments for mobile devices but expanded her work area to include server experiments as well. Her current work interest is ensuring SREs and SWEs can easily develop and introduce new features into production safely.

16:00–16:30

Break

16:30–18:15

Reliable Data Processing with Minimal Toil

Thursday, 16:30–17:00

Pieter Coucke, Google; Julia Lee, Slack

Available Media

Learn about the risks involved with data processing pipelines and how Google and Slack mitigate them.

This talk shares insights into making batch jobs safer and less manual. We have researched and implemented ways to do canarying, automated global rollouts on increasingly larger target populations and different kinds of validations. All these are necessary to remove the manual work involved in updating a batch job globally across millions of users.
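The idea of rolling a change out to increasingly larger target populations, with validation between steps, can be sketched roughly as follows. This is a minimal illustration under assumed names (`staged_rollout`, `deploy`, `validate` are hypothetical), not a description of how Google or Slack implement it.

```python
def staged_rollout(stages, deploy, validate):
    """Roll a change out to progressively larger populations.

    stages:   fractions of the population in increasing order, e.g. [0.01, 0.1, 1.0]
    deploy:   callable(fraction) that applies the change to that fraction
    validate: callable(fraction) -> bool, True if the stage looks healthy
    Returns the largest fraction that passed validation (0.0 if none did).
    """
    reached = 0.0
    for fraction in stages:
        deploy(fraction)
        if not validate(fraction):
            # Stop ramping up; a real system would also roll this stage back.
            return reached
        reached = fraction
    return reached

# Hypothetical usage: validation starts failing once half the population is reached.
deployed = []
result = staged_rollout([0.01, 0.1, 0.5, 1.0], deployed.append, lambda f: f < 0.5)
```

In this usage example the first stage acts as the canary: the rollout halts at the 50% stage, so the full population is never touched and `result` is the last healthy fraction.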

Pieter Coucke is a Technical Program Manager at Google SRE Zürich, working within Google Workspace with teams like Gmail, Drive, Calendar, Meet, and Docs.

Julia Lee is a Senior Software Engineer in Infrastructure at Slack, where she leads the development of asynchronous compute services.

SRE "Power Words"—the Lexicon of SRE as an Industry

Thursday, 17:00–17:15

Dave O'Connor, Elastic

Available Media

As the SRE Industry develops, we've come to rely on certain words, phrases, and mnemonics as part of our conversations with ourselves and our stakeholders. Words and naming have power, and the collective definition and use of words like 'toil' as a shorthand can help with any SRE practice. This talk will set out the premise and some examples and includes a call to action around thinking how naming and words can strengthen SRE's position as the function continues to develop.

Dave O'Connor is an SRE Leadership practitioner based in Dublin, Ireland. He's currently Sr. Director of Engineering at elastic.co, building and scaling Elastic Cloud.

He was a Site Reliability Engineer/Manager/Director at Google from 2004 to 2021. As well as technical and management/leadership work on products such as Gmail and Google Analytics, Dave was the global lead for Storage and Databases SRE at Google, as well as the head of Engineering at Google Ireland for a time.

Connect:

@gerrowadat

How Our SREs Safeguard Nanosecond Performance—at Scale—in an Environment Built to Fail

Thursday, 17:15–17:30

Jillian Hawker, Optiver

Available Media

The core principles of SRE—automation, error budgets, risk tolerance—are well described, but how can we apply these to a tightly regulated high-frequency trading environment in an increasingly competitive market? How do you maintain sufficient control of your environment while not blocking the innovation cycle? How do you balance efficiency with an environment where misconfigured components can result in huge losses, monetary or otherwise?

Find out about our production environment at Optiver, how we deal with these challenges, and how we have applied (some of) the SRE principles to different areas of our systems.

Jillian started to learn Python during her Masters in Biological Diversity and has never looked back. This led to a 12-year career in financial technology, specializing in electronic trading support. She joined Optiver's SRE team last year after meeting them at SREcon 2019. She lives with her husband and two daughters just outside Amsterdam.

Panel: Unsolved Problems in SRE

Thursday, 17:30–18:15

Moderator: Kurt Andersen, Blameless

Panelists: Niall Murphy, RelyAbility; Narayan Desai, Google; Laura Nolan, Slack; Xiao Li, JP Morgan Chase; Sandhya Ramu, LinkedIn

Available Media

Every field of endeavor has its leading edge where the answers are unclear and active exploration is warranted. Although the phrase "here be dragons" might be an appropriate warning, this panel of intrepid adventurers will venture into that unknown territory.

Kurt Andersen is the head of strategy for Blameless.com. Prior to that, he was one of the leads for the Product-SRE organization at LinkedIn. Across the full spectrum of IT influence, he is strongly committed to developing the best engineers and teams, and enabling them with the right ideas, tools, and connections at the right time. Kurt has been active in the anti-abuse and IETF standards communities for over 20 years. He has spoken at multiple conferences on various aspects of reliability, authentication, and security and written for O'Reilly. He also serves on the USENIX Board of Directors and as liaison to the SREcon conferences worldwide.

Connect:

@drkurta

Niall Murphy has worked in Internet infrastructure since the mid-1990s, specializing in large online services. He has worked with all of the major cloud providers from their Dublin, Ireland offices, and most recently at Microsoft, where he was global head of Azure Site Reliability Engineering (SRE). He is the instigator, co-author, and editor of the two Google SRE books, and he is probably one of the few people in the world to hold degrees in Computer Science, Mathematics, and Poetry Studies. He lives in Dublin with his wife and two children.

Connect:

@niallm

Narayan is an SRE at Google Cloud, where he is responsible for the reliability of GCP Data Analytics products.

Connect:

@nldesai

Laura Nolan is a Senior Staff Engineer and tech lead at Slack, working mainly on service networking and ingress load balancing, as well as occasionally writing outage reports for the Slack Engineering blog. Laura has contributed to a number of books on SRE, including Site Reliability Engineering: How Google Runs Production Systems, Seeking SRE, and 97 Things Every SRE Should Know. She also regularly writes for USENIX's ;login: magazine, and is a member of the USENIX board and SREcon Steering Committee.

Connect:

@lauralifts

Sandhya Ramu is Sr. Director of Engineering at LinkedIn, where she leads a site reliability engineering team focused on big data and AI/ML platforms. She is a seasoned technology leader with close to two decades of web industry experience building and leading cross-functional teams. She is also passionate about the role of culture and of diversity and inclusion both at work and outside, and actively participates in furthering this cause.

18:15–18:30

Closing Remarks

Program Co-Chairs: Frances Rees, Google, and Vanessa Yiu, Goldman Sachs

Friday, October 15 (Day 3 - APAC/Americas West)

01:00–01:45

Plenary Session

Beyond Goldilocks Reliability

Friday, 01:00–01:45

Narayan Desai, Google

Available Media

Reliability engineering is still in its infancy, with best practices stemming largely from community experiences and hard knocks. SRE practices, including alerting and SLOs, are built around subjective thresholds—institutionalizing an esoteric model of reliability. Increasingly, the cracks in this approach are showing this Goldilocks approach to be insufficient.

Reliability is currently an amorphous concept. If we hope to tackle it robustly, we must first frame reliability concisely. A concrete model provides a foundation for answering complex questions about our services in a principled way. So what is reliability, anyway?

Once we have an understanding of what reliability is, we can scrutinize our current best practices and mitigation strategies. Why do they work, and why are they so effective? Why is aggregation pervasive? Why do backend drains work so well? Identifying underlying mechanisms enables us to reinforce the reliability properties we want, and identify new mitigation strategies when needed.

Narayan is an SRE at Google Cloud, where he is responsible for the reliability of GCP Data Analytics products.

Connect:

@nldesai

01:45–02:45

A Retrospective: Five Years Later, Was Chaos Engineering Worth It?

Friday, 01:45–02:15

Mikolaj Pawlikowski and Sachin Kamboj, Bloomberg

Available Media

We were some of the earliest adopters of Chaos Engineering (especially in the financial industry) as a tool for SRE teams to increase their systems' reliability. We were also lucky enough to contribute to the ecosystem and watch it grow.

This talk will outline what we learned, what worked, and what didn't during the past five years we were practicing Chaos Engineering.

Miko Pawlikowski is an Engineering Team Leader at Bloomberg, author of "Chaos Engineering: Site Reliability Through Controlled Disruption" and speaker. He maintains open source projects like PowerfulSeal, Goldpinger, and Syscall Monkey, which let you implement Chaos Engineering, monitor Kubernetes cluster connectivity, and intercept and modify syscalls, respectively.

Connect:

@mikopawlikowski

Sachin Kamboj is a senior software engineer at Bloomberg, where he is part of the team that's designing and building Bloomberg's on-prem next-generation Platform as a Service platform based upon Kubernetes. He has been using Kubernetes in production since 2016, has presented at KubeCon, and has been nominated for best paper awards twice. Before joining Bloomberg, Sachin was an academic working on distributed systems and multi-agent systems and was the principal software architect behind the University of Delaware's vehicle-to-grid project that led to a successful startup. Sachin loves breaking things to try to understand how they really work and tries to question why things are built and work the way they do. He is a strong proponent of chaos engineering and has used it successfully to make systems that are more robust and resilient to failures. When not breaking things, he spends his time playing with his two kids and enjoys hiking and ultimate frisbee.

The Origins of USAA's Postmortem of the Week

Friday, 02:15–02:45

Adam Newman, USAA

Available Media

WARNING: Bad Joke Alert! Let USAA tell you the story of their journey to build an IT-wide Postmortem Review meeting. You'll also get tips for how you can stand up your own large-scale Postmortem Review meeting. USAA is not responsible for any emotional distress caused by bad jokes.

Adam has been a Site Reliability Engineer at USAA since their SRE organization was founded in 2018. Prior to that, Adam worked for USAA's Bank IT department, helping lead the implementation of Bank's Active/Active infrastructure across multiple platforms. Adam also was a major player in the Remote Deposit Capture world and holds several patents in the space.

Connect:

@Anewman05

02:45–03:15

Break

03:15–05:15

Cache for Cash—Speeding Up Production with Kafka and MySQL binlog

Friday, 03:15–03:45

Barak Luzon, Taboola

Available Media

How can you serve up-to-date information on a high-scale production system? Our company provides content recommendations to billions of people. We have hundreds of frontend servers across 7 regions. They need to withstand a massive load of 500K HTTP requests per second while maintaining a fast response time. In order to do so, we rely heavily on in-memory caching. By design, caching has a tradeoff between data freshness and load. What if I tell you that you can have fresh data without creating the extra load of fetching it? Our journey started with a "fast track" we created using Kafka and MySQL binlog, and ended up with a huge performance and yield improvement across thousands of services with blazing fast information updates.
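One way to picture fresh data without extra fetch load is a cache that is updated by a stream of change events rather than by expiry and re-fetch. The sketch below is an assumption-laden illustration: the class name, the event shape, and the loader are hypothetical, and in a real deployment the events would arrive from a consumer reading a topic fed by the database's change log rather than from direct method calls.

```python
class ChangeFedCache:
    """In-memory cache kept fresh by change events instead of TTL expiry."""

    def __init__(self, load):
        self._load = load  # fallback loader for cache misses, e.g. a database read
        self._data = {}

    def get(self, key):
        if key not in self._data:
            self._data[key] = self._load(key)
        return self._data[key]

    def apply_change(self, event):
        """event: dict with 'op' ('upsert' or 'delete'), 'key', and optional 'value'."""
        if event["op"] == "delete":
            self._data.pop(event["key"], None)
        else:
            self._data[event["key"]] = event["value"]

# Hypothetical usage: the backing store is read once; later updates arrive as events.
db = {"a": 1}
loads = []

def load(key):
    loads.append(key)
    return db[key]

cache = ChangeFedCache(load)
first = cache.get("a")
cache.apply_change({"op": "upsert", "key": "a", "value": 2})
second = cache.get("a")
```

The second read returns the updated value even though the loader ran only once, which is the freshness-without-load property the abstract describes.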

"Trying is the first step towards failure", said the great Homer Simpson, and I would add that "Failing is the first step towards success." I've been around software since 2006, in various companies and positions, from C4 systems for intercepting rockets through E-commerce and Ad-Tech. I'm always keen to learn new technologies and test them to see how far to the edge I can take them. I practice this passion by day at Taboola with our team of rockstars, while by night I spend time on my second passion—brewing my own beer.

Taking Control of Metrics Growth and Cardinality: Tips for Maximizing Your Observability Function

Friday, 03:45–04:15

Rob Skillington, Chronosphere

Available Media

As companies transition to cloud-native architectures, the volume of metrics data being produced is growing exponentially and SRE teams are being forced to adapt to these increased demands, including finding ways to limit or control the cardinality of metrics. As this growth continues, it's critical that cloud-native companies (and their SRE teams) find ways to manage this growth sustainably and reliably.

During this session, Rob will discuss some best practices and tips for efficiently taming metrics data growth and cardinality at scale. He will also share some KPIs and metrics, proven at scale, to keep in mind when running, maintaining, and growing a world-class observability function. Focusing on real-life examples from leaders and engineers across the observability space, the audience will leave with a better understanding of how to implement these learnings with their existing SRE resources, including some ways for tracking and measuring these efforts.
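As a rough illustration of what tracking cardinality can mean in practice, the sketch below counts unique time series per metric and distinct values per label, which tends to point at the labels driving growth. The function name, input shape, and sample metric are assumptions for illustration and are not taken from the talk or from any particular metrics system.

```python
from collections import defaultdict

def cardinality_report(series):
    """series: iterable of (metric_name, labels_dict), one entry per time series."""
    unique_series = defaultdict(set)
    label_values = defaultdict(lambda: defaultdict(set))
    for name, labels in series:
        unique_series[name].add(frozenset(labels.items()))
        for label, value in labels.items():
            label_values[name][label].add(value)
    return {
        name: {
            "series": len(unique_series[name]),
            "values_per_label": {
                label: len(values) for label, values in label_values[name].items()
            },
        }
        for name in unique_series
    }

# Hypothetical sample: a per-user label makes every series unique.
sample = [
    ("http_requests", {"route": "/a", "user_id": "1"}),
    ("http_requests", {"route": "/a", "user_id": "2"}),
    ("http_requests", {"route": "/b", "user_id": "3"}),
]
report = cardinality_report(sample)
```

Here `user_id` has as many distinct values as there are series, which marks it as the label to drop or aggregate first.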

Rob Skillington is the co-founder and CTO of Chronosphere. He was previously at Uber, where he was the technical lead of the Observability team and creator of M3DB, the time series database at the core of M3.

He has worked in both large engineering organizations such as Microsoft and Groupon and a handful of startups. He and his family are based in NYC where he mainly spends weekends exploring all of New York's playgrounds and also following his wife's jazz adventures.

Connect:

@roskilli

Games We Play to Improve Incident Response Effectiveness

Friday, 04:15–04:45

Austin King, OpsDrill

Available Media

Effective incident management is critical, but how can we practice and improve?

We know team cohesion and culture are important, but how do we grow them?

A fun answer is...Games! From Party Games to Gameday Scenarios, we will survey the skills we use every day. We will identify different games you can play that allow teams to practice those skills.

Austin is a recovering Amazon Sr Engineer. Over the past 20+ years, he has contributed to the cloud side of Mozilla, various e-commerce sites, and a streaming music service. Austin is currently helping teams improve their incident response at OpsDrill.com.

Connect:

@OpsDrill

Food for Thought: What Restaurants Can Teach Us about Reliability

Friday, 04:45–05:15

Alex Hidalgo, Nobl9

Available Media

Nothing is ever perfect; all systems will fail at some point. This is true of everything we might define as a complex system. We could be discussing computer services, living organisms, buildings, or societal structures—at some point failure will occur in these systems. It turns out that failure is actually totally fine, and humans know this innately even if they're not always aware of their own fault tolerance. Like so many other things, restaurants are complex systems made up of many independent complex systems that all rely on each other, just like our computer services. In this talk let's use the experiences we've all had dining at, ordering from, or working at restaurants to draw parallels to how we can better think about the reliability of computers. From The Floor to The Bar to The Line: restaurants have many lessons to teach us.

Alex Hidalgo is the Director of Site Reliability Engineering at Nobl9 and author of Implementing Service Level Objectives. During his career, he has developed a deep love for sustainable operations, proper observability, and using SLO data to drive discussions and make decisions. Alex's previous jobs have included IT support, network security, restaurant work, t-shirt design, and hosting game shows at bars. When not sharing his passion for technology with others, you can find him scuba diving or watching college basketball. He lives in Brooklyn with his partner Jen and a rescue dog named Taco. Alex has a BA in philosophy from Virginia Commonwealth University.

Connect:

@ahidalgosre

05:15–05:30

Closing Remarks

Program Co-Chairs: Heidi Waterhouse, LaunchDarkly, and Daria Barteneva, Microsoft