8:45 am–9:00 am
Opening Remarks
Program Co-Chairs: Sarah Butt, Salesforce; Mohit Suley, Microsoft
9:00 am–10:30 am
Opening Plenary Session
The Endgame of SRE
Amy Tobey, Equinix
The containers are deployed and the builds are green. Yaml flows through the system, linted, reviewed, tested, and shipped with ease and regularity. Our intrepid SRE finds themself at a crossroads. The infrastructure is great but teams still struggle to maintain error budgets. These developers need help, and it doesn't seem like anyone else is coming to help them. Priorities seem wrong, people are burning out, there's no obvious fix. Join Amy on an epic quest into sociotechnical engineering, exploring ways SREs can impact reliability at scale beyond the bits and bytes that got us this far.
Amy Tobey, Equinix
Amy Tobey has worked in tech for more than 20 years at companies of every size, working with everything from kernel code to user interfaces. These days she is senior principal engineer leading Applied Resilience Engineering at Equinix. When she's not working, she can be found with her nose in a book, watching anime with her son, making noise with electronics, or doing yoga at home.
SRE's Critical Role in the COVID-19 Pandemic Response in Government
Amy Quispe, U.S. Digital Service, Alum; Marc Alvidrez, U.S. Digital Service; Rick Hawes, U.S. Digital Service, CDC
The COVID-19 pandemic put unprecedented strains on government software services. Now more than ever, software needed to work. How could the government—an institution not known for the speed and flexibility of its software systems—effectively meet the needs of the public?
SREs in government were well positioned to creatively navigate the high burden of regulatory constraints and administrative complexity. Despite the odds, SREs were able to move quickly to employ their service management and incident response skills.
Amy Quispe, U.S. Digital Service, Alum
Amy Quispe was a digital service expert at the United States Digital Service. In her time at USDS, she worked on a variety of projects including the vaccines.gov vaccination finder website. Before joining USDS, she was an engineer at Facebook, Medium, and Google. She serves on the Board of Directors of hackNY.
Marc Alvidrez, U.S. Digital Service
Marc Alvidrez is a digital service expert at the United States Digital Service. In his time at USDS he has worked with the U.S. Department of Health & Human Services and US Postal Service on the launch of covidtests.gov, and on the human organ procurement and transplantation network (OPTN) with Health Resources and Services Administration. Previously Marc spent 17.5 years at Google/Alphabet as an SRE, the last three at Loon.
Rick Hawes, U.S. Digital Service, CDC
Rick Hawes is a digital service expert at the United States Digital Service assigned to the Centers for Disease Control and Prevention. At the CDC, Rick worked on the COVID-19 response and many of the CDC's data modernization efforts. Before joining the USDS, Rick worked as a software architect for many companies, including, most recently, Yahoo and Microsoft.
10:30 am–11:00 am
Break with Refreshments
Grand Ballroom ABGH and Mission City Foyer
11:00 am–12:35 pm
Mission City Ballrooms B1-B3 (SCCC)
We're Still Down: A Metastable Failure Tale
Kyle Lexmond
"The status? The system has been down for hours, and we haven't been able to get it back up yet"—words on an incident conference call that you probably don't want to hear.
This talk explores how a globally distributed CDN experienced a metastable failure, design changes that make future failures less likely, and the unorthodox fix that made a recovery possible (and can hopefully apply to future metastable failures—maybe even yours).
Kyle Lexmond
Kyle is an almost-SWE who learned about Site Reliability Engineering in passing conversation during university, changing the course of his career.
Having worked at big names (Twitter, Amazon, Facebook) and small (CBSA, Kik), he mainly enjoys working on building optimized and efficient systems that break less often after he touches them.
He currently lives in Seattle with a partner and an adorable dog. (Yes, he has pictures.)
Watering the Roots of Resilience: Learning from Failure with Decision Trees
Kelly Shortridge, Fastly, Inc.
Software systems are complex, sociotechnical systems with SREs constituting a critical element; without humans, our software systems can't adapt. Understanding our system's reality and how it adapts in response to changing conditions is no small feat.
In this talk, we'll explore how SREs can align their mental models of the system with reality. We'll start by covering adaptation in complex systems, elucidating the necessity of resilience stress testing to expose our systems' messy reality. After exploring example chaos experiments, we'll discuss how to document and visualize our mental models through decision trees to inform design improvements and further experiments, examining an example tree in detail to inspire application in your own organization.
By the end of the talk, we will understand how decision trees empower us to reason about stressors and surprises in our systems and take away practical, open source tools that we can apply in our everyday work.
Kelly Shortridge, Fastly, Inc.
Kelly Shortridge is a Senior Principal at Fastly. Kelly is co-author of Security Chaos Engineering (O'Reilly Media) and is best known for their work on resilience in complex systems, the application of behavioral economics to cybersecurity, and bringing software systems security out of the dark ages. Kelly has been a successful enterprise product leader as well as a startup founder (with an exit to CrowdStrike) and investment banker. Kelly frequently advises Fortune 500s, investors, startups, and federal agencies and has spoken at major technology conferences internationally, including Black Hat USA, O'Reilly Velocity Conference, and RSA Conference. They are also a member of the ACM Queue Review Board.
Mission City Ballrooms B4-B5 (SCCC)
Scaling Telemetry Systems with Streaming
Liz Fong-Jones and Terra Field, Honeycomb.io
Streaming systems are essential to doing large-scale data analysis and providing real-time insights into data. In this talk, members of Honeycomb's platform engineering team will describe the evolution of our Kafka cluster and producers/consumers over the past 4 years as incoming telemetry volume increased by more than 10x. We'll discuss how we made choices on the broker/server side around instance types and sizing, tradeoffs of local SSD speed vs durability, optimizing network traffic, and doing chaos engineering on brokers to ensure we could survive loss of instances. We'll cover patterns for distributing data between partitions, scheduling updates of consumers, observing results of changes to cluster configuration and more.
Liz Fong-Jones, Honeycomb.io
Liz is a developer advocate, labor and ethics organizer, and Site Reliability Engineer (SRE) with 18+ years of experience. She is currently the Field CTO at Honeycomb, and previously was an SRE working on products ranging from the Google Cloud Load Balancer to Google Flights.
Hacking the Pachyderm: Scaling Servers and People
Hazel Weakly, Hachyderm; Preston Doster, Hachyderm and Twilio
Cast your mind back to November 2022: Twitter's on fire, and Hachyderm has grown from its small 700 users to a roaring 30,000 users in one month. Now, not only is Twitter on fire, but our servers are too.
In this talk, we're going to go over how we handled this, the decisions we made, the limitations we faced, and the friendships we found in each other.
We migrated a distributed system across the Atlantic ocean while dealing with multiple types of failures and limitations in operational capabilities. Yet, we suffered zero data loss and we kept the trust of our beloved community by not letting them down.
Not only that, but nobody burned out! We came out stronger and more prepared than ever. Come along with us on this journey. Together we'll relive the race against Time and the battle against the very fabric of Systemic Chaos itself.
Hazel Weakly, Hachyderm
Hazel spends her days working on building out teams of humans as well as infrastructure, systems, automation, and tooling to make life better for others. She's worked at a variety of companies, across a wide range of tech, and knows that the hardest problems to solve are the social ones. One of her favorite things is watching someone light up when they understand something for the first time, and a life goal of hers is to help as many people as possible experience that joy. She also loves swing dancing, both as a leader and a follower.
Preston Doster, Hachyderm and Twilio
Preston is actively involved in Hachyderm.io's infrastructure team, keeping its services online and scaled ahead of the community's growth.
Outside of Hachyderm, he works for Twilio, Inc. as an Architect in its Platform Engineering group, responsible for building and maintaining a modern, reliable cloud platform Twilio's customers can depend on. Previously, he was Principal Architect at Toyota Connected, Inc., focusing on designing an in-car telematics platform connecting Toyota's drivers to life-critical and convenience services. Prior to that, he spent many years consulting with Slalom Consulting and Sungard Consulting Services, which saw him through many industries and engagements.
12:35 pm–1:50 pm
Luncheon
Sponsored by Observe
Grand Ballroom ABGH
1:50 pm–3:25 pm
Mission City Ballrooms B1-B3 (SCCC)
Logs Told Us It Was DNS, It Looked like DNS, It Had to Be DNS, It Wasn't DNS
Hemanth Malla and Elijah Andrews, Datadog
It all started with a team reaching out because they had DNS issues during rolling updates. Business as usual… Four weeks later: We are reading kernel code to understand the corner cases of dropping Martian packets. Could this be the connection between gRPC client reconnect algorithms and the overflowing conntrack table we can feel but not see? In time, we solved the issue. And for once… it wasn't DNS!
In this talk, we will focus on one of the most complex incidents we have faced in our Kubernetes environment. We will go through the debugging steps in detail, dive deep into the mysterious behaviors we discovered and explain how we finally addressed the incident by simply removing three lines of code.
Hemanth Malla, Datadog
Hemanth Malla is a Software Engineer working on Kubernetes and container networking at Datadog. Previously he worked on various distributed systems in industries like e-commerce, fintech, and high-frequency trading. Apart from computers, he enjoys all things photography, drones, and dark chocolate.
Elijah Andrews, Datadog
I'm a software engineer at Datadog. I'm currently working on our networks, and previously worked on our data ingestion pipelines. Outside of work, I love playing guitar, going to concerts, and spending time with my cat Bao.
Epic Incidents of History: The 1979 NORAD Nuclear Near Miss
Nick Travaglini, Honeycomb.io
Nov 9, 1979—An officer of the North American Air Defense Command unknowingly input dummy data into a computer system built to provide early warning of incoming intercontinental ballistic missiles. This triggered a set of alerts at several military installations which signaled the imminent arrival of a major nuclear attack against the United States, probably launched by its rival superpower: the USSR.
What followed in the next few minutes involved an intricate investigation and evaluation of the legitimacy of the signals and preparations for a counter-offensive. Fortunately, the US declined to "retaliate" and mutually assured destruction was averted.
This talk will consider this dynamic non-event as a "near miss," asking how this situation could have happened and what lessons it may hold for those who maintain and operate distributed computer systems today.
Nick Travaglini, Honeycomb.io
Nick Travaglini works as a Technical Customer Success Manager at Honeycomb.io. His approach goes beyond helping customers to effectively operate the tool to observe their software systems, working with them to understand the complex sociotechnical dynamics that affect the creation and operation of that software.
Nick has previously held roles in CS and Operations at Domino Data Lab, GE Digital, and Solano Labs. He earned an MA in Liberal Studies from The New School and a BA in Philosophy from UC Berkeley.
Mission City Ballrooms B4-B5 (SCCC)
Scaling Terraform at ThousandEyes
Ricard Bejarano, Cisco Systems Inc.
After years of using Terraform to manage our global infrastructure, we ran into a number of problems: code duplication, drift between environments, hacky solutions to Terraform's shortcomings...
We present to you a new way of scaling Terraform that removes code duplication and boilerplate, eliminates drift, and adds many of Terraform's missing features that we now can't live without.
Ricard Bejarano, Cisco Systems Inc.
Ricard is a Senior Site Reliability Engineer at ThousandEyes' Infrastructure team. His background is mostly networking, monitoring, incident management and hunting down the weirdest bugs. He has led the transformation of how we use Terraform to manage our global infrastructure.
OpenTelemetry Metrics 101
Reese Lee, New Relic
Metrics are an integral part of an overall observability strategy that can help you understand what exactly is going on in your systems. For instance, how much time does a specific request take on average, or at what rate are certain errors occurring? However, there are many mysteries around this signal type – for instance, which metrics instruments should you implement to get certain measurements? In fact, what even are metrics instruments? And which metrics can help you better understand your services?
In this session, you will get clarity around these concepts and the value different metrics and types of metrics can provide, with fun analogies and real world examples.
Reese Lee, New Relic
Reese Lee joined the OpenTelemetry team at New Relic in 2021, bringing along her enthusiasm for providing quality technical support and enablement for observability end users. She primarily works in the OpenTelemetry End User Working Group to help increase awareness and adoption of the software, including running the monthly End User Discussion Group. She has spoken on topics related to the project, and is excited to contribute more to the OpenTelemetry community.
Cypress Room
Next Generation Delivery Working Group
There is a direct correlation between delivering software and restoring services. However, software delivery systems have not kept pace with the evolution of cloud-native software. One of the critical responsibilities of delivery is reliability.
In this invitational unconference session, join industry leaders and practitioners going deep on the role of delivery systems in their production environments. Key questions that will be explored are:
- Where does delivery end or begin in cloud native systems?
- Who are the key stakeholders, how do they stay informed, and what accountability models work in a cloud-native world?
- How have compliance and security impacted your delivery systems (e.g., B2B SaaS single-tenant needs)?
Leaders and practitioners that attend this unconference will walk away armed with a deeper understanding of another lever in their reliability toolkit.
This SREcon Working Group will publish a summarized article on the concepts and findings to inform leaders on the value of delivery systems in service of reliability.
Applications for participation in this working group discussion are now closed.
The working group will meet during the same time periods as the talk sessions on Tuesday afternoon of the conference (1:50 pm–5:30 pm).
Accepted participants will be notified via email prior to the conference. If you yourself are not the right person from your company to participate, but you know someone else who would be, please pass this information along to them.
3:25 pm–3:55 pm
Break with Refreshments
Grand Ballroom ABGH and Mission City Foyer
3:55 pm–5:30 pm
Mission City Ballrooms B1-B3 (SCCC)
Incident Commanders to Incident Analysts: How We Got Here
Vanessa Huerta Granda and Emily Ruppe, Jeli.io
You may have heard that the skills required for software development are different from those required for incident management. And that is very much true. Furthermore, the skills required for incident response are different from those for incident analysis even though most organizations expect the same folks to do both jobs.
We are two recovering incident managers turned incident analysts (though we still dabble in the occasional outage). After a combined 20 years of IC experience as well as owning the processes at our individual organizations, we are here to tell you why the fun doesn't end after the incident has been resolved. In fact, true change happens when we are able to look back into what happened during the incident and treat it as a learning opportunity.
Vanessa Huerta Granda, Jeli.io
Vanessa is a Solutions Engineer at Jeli helping companies make the most of their incidents. In 2021 she co-authored Howie: The Post-Incident Guide, an in-depth explanation for how tech organizations can learn from incidents. Previously, she led Resilience Engineering at Enova where she focused on their Production Incident process, learning from incidents, and leading the on-call rotation of Incident Commanders. Vanessa's favorite part about incidents is getting to gossip with everyone.
Emily Ruppe, Jeli.io
Emily Ruppe is a Solutions Engineer at Jeli.io whose greatest accomplishment was once being referred to as "the Bob Ross of incident reviews." Previously Emily has written hundreds of status posts, incident timelines and analyses at SendGrid, and was a founding member of the Incident Command team at Twilio. She's written on human-centered incident management and facilitating incident reviews. Emily believes the most important thing in both life and incidents is having enough snacks.
Handover Communications in Software Operations: Findings from the Field
Chad Todd, CrowdStrike
This talk presents collected research from interviews, showcasing the attributes that contribute to an engineer's increased or decreased confidence in understanding the current state of the system after a handover communication.
The passing of information through handover communications is essential in many workplaces. Handovers are crucial in industries such as health care, nuclear power, and software operations. Handovers in software operations happen on a daily basis, much like in health care.
Handovers in software operations can be verbal or written digitally. They can occur during shift changes or on-call within a network operations center or customer support center. These handovers can occur during both high-stakes and low-stakes scenarios. Ultimately, confidence in and understanding of the information is of the utmost importance.
Chad Todd, CrowdStrike
Chad Todd has worked in software engineering, systems engineering and operations for over twenty years. Chad has particular interest in systems thinking, incident management, incident response, incident analysis, and Resilience Engineering. Chad holds an MBA and recently graduated (2022) with an MSc in Human Factors and Systems Safety from Lund University.
Mission City Ballrooms B4-B5 (SCCC)
On the Wings of SREs; J.P. Morgan's Journey into the Cloud
Fred Moyer, J.P. Morgan Chase
The SRE function at J.P. Morgan Chase was created in the last couple of years to catalyze the move of many of our applications to the public cloud. Stability and reliability are at the forefront of what customers expect and depend on. So a move from on-premises applications to the public cloud faces challenges on a number of fronts: regulatory, technical, organizational, and external pressure from stakeholders wanting this to succeed in a timely manner.
We'll talk about how the SREs at JPMC are shepherding the move with an SRE mindset in the face of a number of challenges, such as building trust between SREs and application developers, transitioning from measuring reliability by incident counting to modern approaches like SLOs, and adopting modern observability solutions.
Fred Moyer, J.P. Morgan Chase
Fred is an Executive Director at J.P. Morgan, a recovering Perl and C programmer, and has spent most of his career with web services, monitoring, and observability. He is a 2018 Google dev award winner for his Istio observability adapter, a 2013 Perl White Camel award winner, Apache Software Foundation member, and has worked in software engineering and reliability roles since the Dotcom boom.
SRE in Transition: From Startup to Established Business
Laura de Vesine, Datadog
Startups are defined by "ship or die". As a result, SRE teams at a startup should be focused on enabling product engineers to ship features as quickly as possible. As your startup transitions from "we'll run out of money in the next 18 months" to "we have more than 1000 engineers", how should the SRE organization evolve and provide the best value through that transition (including booting one up if you don't have one)?
I will discuss specific ways the organization needs to evolve to meet this challenge, how the SRE org can advocate for and support this change (both in direct actions and in "influence"), and how the overhang of startup technical and cultural debt can make this shift more challenging (but also more necessary). The focus will be largely on my own company's experience with this transition, which is heavily influenced by an intentionally bottom-up culture, a somewhat junior overall engineering population, a "you build it you run it" philosophy, and a long startup phase (leading to substantial technical and cultural debt) with a transition to fairly sustained hypergrowth.
Laura de Vesine, Datadog
Laura de Vesine is a 20+ year software industry veteran. She has spent the last 6 years in SRE working in incident analysis and prevention, chaos engineering, and the intersection of technology and organizational culture. Laura is currently a staff engineer at Datadog, Inc. She also has a PhD in computer science, but mostly her cats nap on her diploma.
Lessons Learned from 7 Years of Running Developer Platforms
Tony McCulley, VMware
This talk covers best and worst practices for platform engineering. The trick is, we didn't always use this phrase, so we actually have many years of experience to learn from. Who are you? You're in a DevOps, wait, I mean SRE…oh wait…"platform engineering" team. People are coming at you to get Kubernetes up and running and then build some kind of platform on top of Kubernetes. But you just got a build pipeline in place! Getting Kubernetes ready for developers may be a new problem, but building and running developer platforms has been going on for at least ten years. This talk will cover the lessons those organizations have learned, such as: product managing the platform, attracting and retaining developers, seeding trust and skills, re-skilling existing ops staff, and more. Examples are drawn from organizations like Mercedes-Benz, the US Air Force, insurance companies, banks, and more.
Tony McCulley, VMware
Tony McCulley is a member of the Tanzu Value Advisory team at VMware where he works with the company’s most strategic customers to help accelerate their transformation and modernization initiatives with a focus on helping those customers identify and achieve meaningful outcomes.
Tony has over 20 years of experience in the IT industry ranging from startups to multiple Fortune 50 companies. Prior to joining VMware, Tony spent 9 years helping The Home Depot transform and modernize as Head of Developer Experience and Tech Insights. Prior to The Home Depot, Tony played a key role in GE Energy’s agile transformation in addition to tours at Booz Allen Hamilton and Comcast.
Tony helped pioneer Platform Engineering / Platform as a Product in 2015 when he took over Home Depot's development platforms with no prior experience in infrastructure. He leveraged his skills in software engineering, product development, and UX to roll out their new platform and achieve nearly 100% adoption by 4000+ engineers in less than 2 years.
Cypress Room
(Continued from previous session) Next Generation Delivery Working Group
5:30 pm–6:30 pm
Showcase Happy Hour
Sponsored by DBS
Grand Ballroom ABGH
9:00 am–10:35 am
Mission City Ballrooms B1-B3 (SCCC)
Cognitive Apprenticeship in Practice with Alert Triage Hour of Power
Paige Cruz, Chronosphere
Cognitive apprenticeship is the philosophy that it is more effective to learn in context and real-world situations compared to following a tutorial in a sandboxed environment. In a nutshell, it is "learning-through-guided-experience," and it shines when teaching the problem-solving processes experts use to handle complex tasks like, say, investigating an alert by spelunking in production observability and monitoring data. Learn how Alert Triage Hour of Power became a can't-miss meeting of camaraderie and system surprises!
Paige Cruz, Chronosphere
Paige Cruz is a Senior Developer Advocate at Chronosphere passionate about cultivating sustainable on-call practices and bringing folks their aha moment with observability. She started as a software engineer at New Relic before switching to SRE holding the pager for InVision, Lightstep, and Weedmaps. Off-the-clock you can find her spinning yarn, swooning over alpacas, or watching trash TV on Bravo.
Building a Diverse SRE Talent Pipeline
Salim Virji, Google; Alex Gornet, Major League Hacking
Most early-career SRE hires are university interns, but what about students who would be great interns if they picked up some systems skills? In particular, what about those from underrepresented groups who would be amazing interns if they had only discovered SRE?
To overcome these barriers, Google partnered with Major League Hacking to deliver the SRE Fellowship, an initiative to inspire and train future technologists on the skills needed to launch SRE careers. Many participants come from diverse backgrounds and wouldn't normally be recruited through traditional pathways, but after 12 weeks of immersive, project-based learning they walk away with a firm grasp of what SRE is all about and the skills needed to pursue a career in the field.
In this talk, you'll learn why we partnered, how we administer the initiative, and what we have learned along the way. Our hope is that attendees leave with the confidence to champion similar programs.
Salim Virji, Google
Salim Virji develops reliable engineering practices and processes for Google's SRE organization, and has previously developed distributed consensus and storage systems. Salim's interests include distributed systems and machine learning. Salim received an AB in Classics from the University of Chicago and is a New York City Master Composter.
Alex Gornet, Major League Hacking
After graduating from the University of Louisville with a degree in chemistry and a short stint playing professional tennis, Alex Gornet began his journey in the world of technical education. Now at Major League Hacking (MLH), he oversees delivery of the Fellowship program, an educational pathway and career accelerator for early career developers and technologists, to MLH partners. His interests include AI, yoga, and gluten free baking.
Mission City Ballrooms B4-B5 (SCCC)
The Best SREs Seem to Be the Ones without an SRE Title—And What We Can Do about It?
Kishore Jalleda, Stealth
The SRE discipline has been growing for a while now, with teams springing up everywhere to do, well, SRE. But if Bob or Mary from that other SRE-like team are mostly the ones called for help to solve complex incidents or other ambiguous production issues, what are the rest of the teams with the (official) title doing? Acting as a crutch? Babysitting legacy systems with a lot of toil? Doing low-value work no one else wants to do?
Clearly, this is not the direction we want for the profession or the industry. The goal of this talk is to surface this seemingly universal problem, what brought us here, and the potential risks of inaction, and to offer practical solutions with actionable advice.
Kishore Jalleda, Stealth
After a decade of leading (global) SRE teams at Microsoft, Yahoo, Zynga, and IMVU, Kishore Jalleda pivoted to full-time coding and building products that can organize the world's unstructured data and processes to help people lower their stress, make better decisions, and focus on what matters most.
Given his background, his first use case is incident management; he is offering simple, novel, proactive solutions to the complex problems organizations face with managing incidents.
What Does "High Priority" Mean? The Secret to Happy Queues
Daniel Magliola, IndeedFlex
Like most web applications, yours runs important jobs in the background. And today, some of your urgent jobs are running late. Again. No matter how many changes you make to how you enqueue and run your jobs, the problem keeps happening.
The good news is you're not alone. Most teams struggle with this problem, try more or less the same solutions, and have roughly the same result.
In the end, it all boils down to one thing: keeping latency low. In this talk I will present a latency-focused approach to managing your queues reliably, keeping your jobs flowing and your users happy.
Daniel Magliola, IndeedFlex
Life-long coder, expert procrastinator, maker of weird things, and occasional game programmer obsessed with code performance and weird Lego machinery.
10:35 am–11:05 am
Break with Refreshments
Grand Ballroom ABGH and Mission City Foyer
11:05 am–12:40 pm
Mission City Ballrooms B1-B3 (SCCC)
Confessions of an SRE Manager
Andrew Hatch, LinkedIn
This talk will demystify and clarify for SRE ICs what SRE Managers care about, why they make the decisions and trade-offs they do, and, more importantly, how ICs can work better with SRE Managers to achieve their personal goals. SRE ICs will better understand the pressures and competing priorities that SRE Managers must work within, how to position themselves for career advancement and gain more insight into why specific work and decisions need to be made. They will also understand what bad SRE Management looks like, how to recognize it, and how to avoid repeating it in a future management role.
Andrew Hatch, LinkedIn
I have worked in the technology industry for over 20 years, predominantly in Australia, with time spent in India and, recently, in the USA. My experience ranges from small to large-scale projects in multiple roles and industries spanning software engineering, consulting, and operations. In 2020 I migrated to the San Francisco Bay Area to take up a role at LinkedIn as an SRE Manager. Before this, I spent 6 years working at Australia's biggest online jobs and recruitment platform with the critical role of moving the business into AWS and up-leveling their Platform Engineering and Incident Management practices to support this. Since 2013, I have worked primarily in SRE Management roles and, through this experience, developed a passion for learning and adapting to complex systems and helping teams and organizations learn more from incidents to create better software, more resilient systems, and happier, empowered teams. I am a lifelong surfer and can now be found adapting to the crowds at Santa Cruz in California when not at work or at home with family.
Exploring Disconnects between Reliability Practitioners and Management/Executives
Kurt Andersen; Leo Vasiliou, Catchpoint
Join us to hear the "author's intent" for why they wrote, and why they suggest you read, the latest The SRE Report (https://bit.ly/2023-sre-report). In this session, we'll hear about the logic, emotion, and controversy during survey writing and results interpretation. In Summer 2022, an industry survey was run with almost 600 responses; the initial report on the findings was released in November 2022. Based on the self-identified organizational level by the respondents (e.g., Individual Contributor through Executive), we found quite a few of the inquiry topics had differing perspectives in their answer set. We will highlight these differences and provide some suggestions for bridging the perception gaps through the use of real-world situations. We will take a naked look at some of the most surprising data and explore some of the open-ended questions where survey respondents could type in anything they wanted (and they did!).
Kurt Andersen
Kurt Andersen worked as the head of strategy for Blameless.com. Prior to that he was one of the leads for the Product-SRE organization at LinkedIn. Across the full spectrum of IT-influence, he is strongly committed to developing the best engineers and teams, and enabling them with the right ideas, tools, and connections at the right time to facilitate personal and organizational learning and resilience.
Kurt has been active in the anti-abuse and IETF standards communities for over 20 years. He has spoken at multiple conferences on various aspects of reliability, authentication, and security and written for O'Reilly. He also serves on the USENIX Board of Directors and as liaison to the SREcon conferences worldwide.
Leo Vasiliou, Catchpoint
Leo grew up in technology operations, where web performance, IT operations, and information security activities were part of his charter. Having transitioned to evangelizing DevOps activities in product marketing, Leo now applies his passion for web performance data analysis in the context of monitoring and observability, primarily for customer-facing products and services.
Leo strongly believes correct data are the foundation for transforming information into wisdom. He works to perform activities at the intersection of technology and marketing to help improve IT's business relationship.
Mission City Ballrooms B4-B5 (SCCC)
Beacon: Intelligent Latency-Aware and Load Shedding Service Routing
Jason Griggs and Huajun Qin, Morgan Stanley
At large-scale organizations where a system of record (SOR) is kept in relational databases, database replication is commonly used to provide read-only copies for resilience and performance purposes. Most data queries are directed to replicated databases in an effort to relieve the load from primary database servers.
Due to inherent replication delays, returning up-to-date information to customers is a significant challenge.
Our framework, Beacon, was conceived and delivered in response to the obstacle of guaranteeing up-to-date information to customers with intelligent routing of services to either read-only replicas or primary data servers.
Jason Griggs, Morgan Stanley
Jason has had a varied career working in multiple fields of IT. He is tenacious in his quest to improve process, efficiency, and resiliency wherever he goes. Jason's leadership and attention to detail have helped to deliver a number of key projects to production.
Huajun Qin, Morgan Stanley
Qin has worked as an enterprise architect for over 25 years in multiple companies. He was the leading Web Services architect and a distinguished engineer at E*Trade Financial. Qin is passionate about building evolutionary architectures that stand the test of time.
Resiliency Practices in Managing CDN (Content Delivery Network)
Yeshwenth Jayaraman, Netflix
Our team manages Open Connect, a purpose-built CDN responsible for delivering video, images, and assets to Netflix subscribers all around the world. In this talk, I will explore the various failure domains that we look at and the exercises we conduct every quarter, such as stack failures, overload tests, and regional DNS withdrawal, to measure how our systems behave under various failure scenarios.
Yeshwenth Jayaraman, Netflix
Yesh is a seasoned professional with over 10 years of experience in the networking and SRE fields. He began his career at Microsoft Azure, where he was part of the Networking team building and scaling datacenters worldwide. He has been working for the past 7 years with Netflix, focusing on improving service resiliency among other things.
12:40 pm–1:55 pm
Luncheon
Sponsored by Jeli
Grand Ballroom ABGH
1:55 pm–3:30 pm
Mission City Ballrooms B1-B3 (SCCC)
Why This Stuff Is Hard
Lorin Hochstein, Netflix
We all face challenges in doing the work of keeping our systems up and running, from ever-growing complexity to the time pressure of delivering new features into production. This talk will bring these challenges into focus, treating them as first-class entities that are common to our work rather than being pathologies that are unique to our own local organizations. We'll also explore what we can do to increase our chances of success by recognizing the nature of these constraints.
Lorin Hochstein, Netflix
Lorin Hochstein is a Sr. Software Engineer at Netflix. He was previously Sr. Software Engineer at SendGrid Labs, Lead Architect for Cloud Services at Nimbis Services, Computer Scientist at the University of Southern California's Information Sciences Institute, and Assistant Professor in the Department of Computer Science and Engineering at the University of Nebraska–Lincoln.
Turning an Incident Report into a Design Issue with TLA+
A. Finn Hackett, University of British Columbia; Markus Alexander Kuppe, Microsoft Research
This talk will discuss our experience using modeling-driven techniques as part of a postmortem deep dive into a long-lasting, high-impact outage at Microsoft. We built a precise specification of the micro-service architecture, most notably its foundational distributed database service CosmosDB. Modeling allowed us to go beyond the standard postmortem analysis and accurately determine the outage's root cause. The key to our approach was using TLA+, a formal specification language that let us concisely and formally describe the database's user-visible behavior under concurrency. By building a specification, we could thoroughly understand the underlying design issue and find bugs in the database's documentation. Through this case study, we demonstrate the value of modeling-driven techniques in postmortem analysis and illustrate how they can help avoid similar issues in the future. We also explain how you can use TLA+ to build precise and reusable specifications.
A. Finn Hackett, University of British Columbia
Finn is a Ph.D. student exploring practical applications of domain-specific language design and formal verification. Formal verification can help people with reasoning exercises that unassisted humans often get wrong in practice, where each mistake can become a bug or a feature that's too hard to implement. That is why it's important to make these reasoning tools more accessible, by improving their design and by demonstrating different ways to apply them in practice. Presentation topic aside, Finn's most recent work is PGo, a TLA+-based domain-specific language for developing formally verified distributed systems. For older and/or musical projects, see Finn's personal site.
Markus Alexander Kuppe, Microsoft Research
Markus is a principal research software engineer at Microsoft Research. He has been a member of the TLA+ project for over a decade. In this role, Markus has made significant contributions to the development of the TLA+ tools and has helped engineers at Microsoft formalize their systems using TLA+. Markus also teaches TLA+ to engineers at Microsoft and elsewhere, sharing his knowledge and experience. Last summer, Markus served as Finn's mentor, working alongside fellow engineer Josh Rowe on using TLA+ to model incidents.
Mission City Ballrooms B4-B5 (SCCC)
The Making of an Ultra Low Latency Trading System with Go and Java
Yucong Sun and Jonathan Ting, Coinbase Inc
The Coinbase Exchange team presents learnings from building and operating an ultra-low-latency trading system, implemented in both Golang and Java, with E2E latency in the sub-50-microsecond range. We will discuss challenges with programming languages and runtimes, threading models in both Golang and Java, Raft implementations and their latency characteristics, and everything down to Linux kernel tuning and the challenges of running in bare-metal and AWS environments.
Yucong Sun, Coinbase Inc
Currently Staff Software Engineer at Coinbase, previously SRE at Google and Facebook, and a Linux kernel contributor. Yucong Sun works on Linux performance tuning and other infrastructure and DevX areas, and is the author of the book <SRE: Google 运维解密>, an official Chinese translation of the book <Site Reliability Engineering: How Google Runs Production Systems>, and <架构整洁之道>, an official Chinese translation of book
Jonathan Ting, Coinbase Inc
Currently Software Engineer at Coinbase, previously at FairX. Over the past seven years, Jonathan Ting has been working at financial exchanges, where he has worked on all core components, including FIX gateways and trading systems. In particular, Jonathan developed a keen interest in learning as much as he could about tuning for ultra-low latency. With hard work from the team in optimizing the stack on all layers, FairX has achieved round trip latencies on the order of sub-50 microseconds while maintaining high reliability, scalability and reproducibility.
Seeing the Invisible: Two Years at Wikipedia with W3C's Network Error Logging
Chris Danis, Wikimedia Foundation
If a user's packets get dropped in someone else's network, do they make a sound?
What if you could find out that your users can't reach your website—in near-real-time and completely automatically, with very low rates of both false negatives and false positives? This sounds too good to be true, but it has been exactly Wikipedia's experience as the most popular website in the world to implement W3C's Network Error Logging spec.
We'll give an overview of the technology itself, the tradeoffs Wikimedia made in our implementation, and case studies of several user-visible outages that NEL detected but traditional monitoring missed.
Chris Danis, Wikimedia Foundation
Chris is a lifelong tinkerer and a recovering ex-Googler. Currently an SRE at the Wikimedia Foundation, the non-profit that operates Wikipedia and other related projects, their work responsibilities mostly consist of wild hand-waving about incident response, symptom-focused alerting, and the follies of distributed systems.
3:30 pm–4:00 pm
Break with Refreshments
Grand Ballroom ABGH and Mission City Foyer
4:00 pm–5:35 pm
Mission City Ballrooms B1-B3 (SCCC)
Avoiding Cachepocalypse in the Land of the Monolith
David Amin, Duolingo
Like many companies who switched to a microservice architecture, Duolingo runs a legacy monolith that is both business critical and not entirely owned by any one team. This monolith hit a scaling limit by exhausting memcached connections, but strangely only on a single node of the cluster. What followed was a debugging journey that spanned Python and C, took down production for an hour, and required binary arithmetic to solve. The talk will suggest strategies for mitigating unclear ownership of legacy systems, creating a fast and safe feedback loop for debugging in difficult circumstances, and coordinating a multi-team response to issues that threaten business continuity.
David Amin, Duolingo
David is an SRE on the Observability team at Duolingo, working on measuring what matters to millions of learners. He is based in New York and has previously worked in web hosting, fintech, and defense.
Incident Archaeology: Extracting Value from Paperwork and Narratives
Clint Byrum, Spotify
Why do we fill out incident reports? Why can't we just get to the fixes and move on? We'll discuss how we learn from the aggregate of incidents at Spotify and turn this pile of paperwork into a gold mine of insight into how we work and operate our systems.
Clint Byrum, Spotify
Clint Byrum has been building, breaking, and evolving large scale systems for decades. From humble origins as a sysadmin to leading engineers building resilient infrastructure at scale, Clint has always followed curiosity and tried to stay centered on data and facts to drive real change. Currently Clint is a Staff Engineer at Spotify working on the reliability of Spotify for Artists. Clint lives in Los Angeles with his wife and children.
An Organizational Response to Incidents: Designing for Smooth Coordination in High Tempo, Large Scale Software Incident Response
Laura Maguire, Jeli
In the Incident Command System (ICS) framework, the Incident Commander role has long attracted organizational attention for selection, training, and competency identification. But, the majority of participants in an incident response are "followers" who carry out the direction of the commanders. So, making the incident responder group better overall is a force multiplier.
This talk draws from primary research to explore the role of the follower and the characteristics of good "followship." It also looks at the organizational preconditions that aid in incident response—both the everyday work of incident response and in large-scale software outages.
This talk is relevant to leaders and followers who carry a pager and, more generally, to teams looking to improve communication, coordination, and collaboration both internally amongst members and externally across organizational boundaries. Attendees will leave this session with new ideas of how to structure their incident management functions, along with practical tactics and strategies for use in their next outage.
Laura Maguire, Jeli
Dr. Laura Maguire leads research at Jeli. Her work studies how software engineers can optimize performance through evidence-based practice for incident management, learning from incidents, and organizational change management. She holds a Master's degree in Human Factors & Systems Safety from Lund University and a Ph.D. in Integrated Systems Engineering from Ohio State University.
Mission City Ballrooms B4-B5 (SCCC)
Building an APM with OpenTelemetry and OpenSource
Goutham Veeramachaneni, Grafana Labs
Proprietary APM solutions have provided immense value to developers, yet are very expensive, in part because there was no good way to assemble a similar feature set in OSS. Everything from the agents through the collection layers to the backends was proprietary. This is now changing with OpenTelemetry and its instrumentation SDKs and backends like Prometheus and Jaeger.
This talk will show you how you could leverage the power of OpenTelemetry and CNCF backends like Prometheus and Jaeger to build a compelling APM solution. We will also cover implementation risks and the upcoming improvements in the OTel community that will make things easier and more valuable. All built with and on top of Open Source.
Goutham Veeramachaneni, Grafana Labs
Goutham is a developer from India who started his journey as an infra intern at a large company where he worked on deploying Prometheus. After the initial encounter, he started contributing to Prometheus and interned with CoreOS, working on Prometheus's new storage engine. He is now an active contributor to the Prometheus eco-system and was a maintainer for TSDB, the engine behind Prometheus 2.0. He works at Grafana Labs on OpenTelemetry and other open source observability tools.
Measuring Real-Life Latency of the Internet: A Netflix Story
Thiara Ortiz, Netflix
Any time a Netflix member sits down, reclines in their chair and turns on their TV to Netflix, there's a moment of truth. It's an opportunity to deliver a spectacular service with amazing quality of experience. Misses, errors, or high latency that prevent individuals from streaming, as a result of ISP configuration changes, code deployment, or catastrophic fallback, result in an impact on how our service is perceived. This talk will go over how we have been investing in data analysis tools to make our jobs as SREs more efficient and our members happier. We want folks who attend this presentation to learn about t-digest and how they can apply this to supercharge their insights.
Thiara Ortiz, Netflix
Thiara has worked at some of the largest internet companies in the world, Meta and Netflix. During her time at Meta, Thiara found a passion for distributed systems and bringing new hardware into production. Always curious to explore new solutions to complex problems, Thiara developed Fleet Scanner, internally known as Lemonaid, to perform memory, compute, and storage benchmarks on each Meta server in production. This service runs on over 5 million servers and continues to be utilized at Meta. Since Meta, Thiara has been working at Netflix as a Senior CDN Reliability engineer. Her focus is primarily on resilience and quality of experience for members streaming from Open Connect. When incidents occur and Netflix's systems do not behave as expected, Thiara can be found working and engaging the necessary teams to remediate these issues.
Panel
Founder/CTO Perspectives: The Future of Distributed Tracing
Charity Majors, Honeycomb.io; David Cramer, Sentry; Maggie Johnson-Pint, Stanza
Honeycomb and Sentry are both major providers of distributed tracing software, with very different approaches to the problem space. Hear from their founder/CTOs about how they got started in distributed tracing, where they think distributed tracing is most impactful for engineering teams, and how they see the future of distributed tracing technology.
Charity Majors, Honeycomb.io
Charity is a co-founder and CTO at Honeycomb.io, a tool for software engineers to understand what happens when their code meets production. She has worked at companies like Facebook, Parse, and Linden Lab as a systems engineer and engineering manager but always seems to be responsible for the databases. Co-author of O'Reilly's Database Reliability Engineering and newly-released Observability Engineering. Charity loves free speech, free software, and single malt scotch.
David Cramer, Sentry
David Cramer is a software engineer by trade, and the co-founder and CTO of Sentry, an open source application monitoring platform used by nearly 100,000 technology companies. Prior to Sentry, he focused on infrastructure and developer experience at companies like Dropbox and Disqus, and is a prolific contributor to several open source ecosystems.
Maggie Johnson-Pint, Stanza
Maggie Johnson-Pint is co-founder and head of product at Stanza, a new startup building user interface aware fault tolerance for every team. Prior to co-founding Stanza, Maggie worked in developer tools, SRE and front end development at Microsoft and Stripe. She has also served on TC39—the JavaScript language standards committee. Maggie spends a lot of her spare time with board games and Australian Shepherds.
5:35 pm–7:00 pm
Conference Reception
Sponsored by Sedai
Grand Ballroom ABGH
7:00 pm–8:00 pm
Lightning Talks
Mission City Ballrooms B1-B3 (SCCC)
- Just the Cryptography You Need to Know for TLS
Lerna Ekmekcioglu, AWS
- Metrics Have a DX Problem
Micha Hernandez van Leuffen, Fiberplane
- 9 Common Mistakes When Starting with SLOs and How to Fix Them
Sal Furino, Nobl9
- A Blaming Culture Is Not Your Fault
Nele Lea Uhlemann, Fiberplane
- You Monitor, but You Do Not Observe - SRE Lessons from Sherlock Holmes
Robert Barron, IBM
- How 20 Years of SRE Prepared Me to Be a Dad
Jonah Horowitz
- Making Golang HTTP Apps Observable
Prathamesh Sonpatki
- 30 (More) Interviews Later
Paige Cruz, Chronosphere
9:00 am–10:35 am
Mission City Ballrooms B1-B3 (SCCC)
Human Observability of Incident Response
Matt Davis, FORM.com
In learning from incidents we also learn from each other, and about each other. The choreography of coordinating our response requires us to understand one another at the same level we're attempting to understand the incident, interconnected as a unified socio-technical system. This talk will give you practical advice on producing repeatably better outcomes by revealing the Human in Incident Response.
Matt Davis, FORM.com
Just as at home with electro-acoustic synthesizer electronics as with site reliability engineering, I find joy in operating inherently chaotic complex systems, whether in the jaws of distributed systems or flung across the organization to build sources of Resilience. My work through improvisation and intuition in teams seeks to discover diverse ways to adapt and learn from our emergent universe.
Far from the Shallows: The Value of Deeper Incident Analysis
Courtney Nash, Verica
We begin in the shallow end, where the waves are calm and the familiar sand warms our toes: can Duration tell us anything about our incidents? We test the waters and feel we could linger here a while, but it's somehow…unsatisfying. We wade a bit deeper, gaining comfort and confidence. Could other things tell us more—Severity perhaps? No, it floats away, providing no guidance.
We drift out with the tide and now our toes barely skim the bottom—we're beginning to tread water. Looking out farther, we realize the water becomes murky, the details hard to discern. There's a distinct draw to dive down, where we uncover a whole universe we didn't realize existed. Coordinated schools of vibrantly-colored fish dart by, weaving through intricate mazes of plants and coral, and in the darkest depths lurk surprising and unexpected dangers.
It's here, in the deep end of incidents, that we find the most value if we're willing to dive in. In this talk, I'll show how the shallow data of incidents reveal little, and give examples of incident reports from the VOID (Verica Open Incident Database) that illustrate the value of substantive, qualitative investigation. I examine how this hard work benefits others, our collective messages in bottles feeding back into a safer and more navigable sea for everyone.
Courtney Nash, Verica
Courtney Nash is a researcher focused on system safety and failures in complex sociotechnical systems. An erstwhile cognitive neuroscientist, she has always been fascinated by how people learn, and the ways memory influences how they solve problems. Over the past two decades, she’s held a variety of editorial, program management, research, and management roles at Holloway, Fastly, O’Reilly Media, Microsoft, and Amazon. She lives in the mountains where she skis, rides bikes, and herds dogs and kids.
Mission City Ballrooms B4-B5 (SCCC)
How SRE Makes Electric Vehicles
Adam Shake, Rivian
Having made the move from traditional software engineering companies to the world of electric vehicle manufacturing, I've been asked to apply the principles of SRE to this complex and human-intensive environment. This talk presents the challenges of applying Automation, CI/CD, Observability, and the language of SLOs to a complex production line involving everything from humans to robots. I will discuss how we apply software principles to building a physical thing, and how we have solved some complex challenges at a massive scale.
Adam Shake, Rivian
Adam is a high energy, passionate early adopter of many things IT. He loves to help his team, and others at Rivian, see the vision of the future and get them excited about where we're heading. He believes in what DevOps and SRE can do for a world class technology company!
Adam has 20+ years of experience in web application development, and he has spent the last few deeply involved in DevOps and SRE. Adam leads a team of incredible SREs at Rivian and couldn't be happier!
Outside of work, he suffers from 'Multiple Hobby Syndrome' (much to his wife's dismay) and enjoys hunting, fishing, and the outdoors, singing with the sound of the Illinois barbershop chorus and his Barbershop Quartet, working out, 3D printing, tabletop gaming, and a ton more. Adam's family includes his wife (mentioned above), an 11-year-old son, and 6-year-old twins.
Warding against the Dark Arts: Crafting a Defense Strategy against Botnet DDoS Attacks
Shirleen Sharma and Aaron Heady
Surviving a large-scale DDoS attack is usually not a requirement when designing a service. Yet, the ability to do so often translates into gains in both performance and service hardening and requires an intimate understanding of real-user traffic.
Defending against DDoS requires a defense-in-depth approach to engineering our services. For sophisticated attacks, depending on CDNs alone (almost all of them have some form of mitigation capability) gives some respite, but an attack can still hurt a four-nines availability target.
This talk is for the SRE who has just begun thinking about large-scale DDoS mitigation and aims to provide a structure of how to create a comprehensive defense strategy.
Shirleen Sharma
Having worked on critical failover systems, resource compilers and high performance C#, Shirleen loves to dive deep into ambiguous problems as a software engineer at Microsoft. When she's not off slaying dragons at work, she creates accessible STEM education programs and loves to read.
Aaron Heady
Aaron is a Reliability Engineer recently with Microsoft who focuses on CDN/DNS performance, availability, and traffic routing. 20+ years in tech, let’s talk! When not working, he’s probably snowboarding, sewing, or cooking.
10:35 am–11:05 am
Break with Refreshments
Grand Ballroom ABGH and Mission City Foyer
11:05 am–12:40 pm
Mission City Ballrooms B1-B3 (SCCC)
The Revolution Will Not Be Terraformed: SRE and the Anarchist Style
Austin Parker, Lightstep
Site Reliability Engineering is more than a job title, discipline, or functional role—it's an organizational and cultural force advocating for change. How did this come to be, and how can practitioners become more effective in driving a culture of reliability in their organization? In this talk, I'll discuss the past, present, and future of SRE through the lens of movements such as Agile and DevOps and how they have been co-opted and commoditized. Our analysis will not only include technical and organizational themes, but connect them to social studies and the organization of peoples. You'll walk away with an appreciation for how SRE can become a more effective motivating factor for not only reliability of systems, but of the people that underpin them.
Austin Parker, Lightstep
Austin Parker has been creating problems with computers for the majority of his life. In a stunning face turn, he instead has spent the past five years helping others solve the problems that computers create. Formerly an SRE and DevOps Engineer, he now focuses on observability topics and shitposting on the internet. He is an OpenTelemetry maintainer and community manager, an author, event organizer, public speaker, and general bon vivant.
Implementing SRE in a Regulated Environment
Sandeep Hooda and Fabian Tay, DBS Bank
SRE is an approach that many organisations find attractive and ideal. In particular, for organisations bound by regulatory obligations regarding the use of technology in service delivery, there are valuable insights to be gained to better safeguard organisations from operational and financial risks. In this talk, we will share how we successfully enabled the SRE practices of conducting Root Cause Analysis, Chaos Testing, and Trend Analysis in line with regulatory expectations. We were one of the pioneers in the region's financial industry to adopt SRE.
Sandeep Hooda, DBS Bank
Sandeep is an Engineering Manager at DBS with over 19 years of experience. In this leadership role, he is responsible for engineering innovative and strategic solutions. He has deep technical expertise in Platform engineering, SRE, DevOps, Risk management, solution architecture and systems engineering. He has been instrumental in driving digital transformation and promoting SRE culture. He also had the privilege of speaking at several tech conferences and enjoys writing on SRE and DevOps topics. He enjoys his free time out in the ocean, practicing to sail around the world.
Fabian Tay, DBS Bank
Fabian drives the Site Reliability Engineering Transformation and Observability Programme at DBS. With over 18 years of experience working in the banking industry, he has managed multi-disciplinary teams and has worked on Automation, IT Infrastructure Management, Application modernisation, Product Delivery, and DevOps. Outside of work, he enjoys spending time with his family.
Financial Resiliency Engineering: Taming Cloud Costs
Darren Worrall, Shopify
Systems Engineers (or SREs/Engineers) have always deeply cared about resource efficiency. There is no better time to apply your skills to the cloud than now! This talk will dive deep into the work Shopify undertook in 2022 to significantly reduce our cloud infra footprint and costs, detailing the tools, approaches, and trade-offs we had to navigate along the way.
Darren Worrall, Shopify
Darren has been a Sysadmin/DevOps/SRE practitioner for 20 years, and is currently a Senior Staff Production Engineer at Shopify based in the UK. Darren can often be heard asking questions like "what problem are we actually trying to solve?", "who is our customer?", and "what does good look like?", which his colleagues never get tired of.
Mission City Ballrooms B4-B5 (SCCC)
Sto: A Better Way to Store and Query Profiler Data
Patrick Somaru, Meta
Profiler data is key to enabling engineers to optimize their code. This talk is about techniques we developed to make this data more usable and our open source implementation of these techniques.
Patrick Somaru, Meta
Pat is a member of Meta's Production Engineering team. He currently works on the Efficiency and Capacity Engineering Team, helping to make Meta faster for everyone.
Chaos-Driven Development: TDD for Distributed Systems
Dhishan Amaranath and Tucker Vento, Bloomberg LP
Chaos experimentation is a force multiplier for other Reliability Engineering practices. By taking the time to design an intentional suite of experiments early in the software development life cycle, we are able to more easily validate improvements and changes to the system as they are made. The broad applicability of Chaos Engineering allows us to apply test-driven development (TDD) principles to all manner of changes in a distributed system. In this talk, we will show how early adoption of these testing processes enables us to continually validate our expectations of a system as it grows. These principles can be applied to any distributed system, something we will illustrate through stories from our own experiences with both private and public cloud development.
Dhishan Amaranath, Bloomberg LP
Dhishan Amaranath works as a senior cloud SRE at Bloomberg, where he is building foundational elements on the public cloud. He is diligent about trying to apply all aspects of SDLC to anything he works on, and is especially passionate about all forms of testing and building resilient and reliable architectures. He also promotes the practice of Chaos Engineering across the firm through the company's Reliability Engineering Guild. When he is not geeking out, he immerses himself in listening to podcasts and books about behavioral economics. So, grab him for coffee and discuss why aisles at Target are always stocked the same, while Costco regularly moves stuff around.
Tucker Vento, Bloomberg LP
Tucker Vento is the team lead and product owner for Bloomberg's Resilience Engineering team, which is responsible for maintaining a Chaos Engineering platform and helping engineers run chaos experiments in Bloomberg's data centers. While attending another USENIX conference as an SRE, he learned about the value (and fun) of deliberately breaking distributed systems and subsequently embarked on a mission to make Bloomberg's engineers embrace the power of failure. When he is not busy learning how to break something, Tucker enjoys listening to fast cars, making bad music, and relaxing with his two cats.
Adaptive Concurrency Control for Mixed Analytical Workloads
Dan Kleiman, Klaviyo
Last year we launched a Query Service to provide real-time analytics for a mix of workloads that includes public APIs, dashboards in our app, and report generation.
Consolidating these use cases to a single service was a huge infrastructure cost and complexity win, but we soon started experiencing intermittent waves of timeouts, impacting all our callers at once.
We thought we had provisioned the service with enough capacity, so why were we hitting congestion?
Inspired by Jon Moore's 2017 Strange Loop talk "Stop Rate Limiting! Capacity Management Done Right", Netflix's blog post "Performance Under Load", and their Concurrency Limits library, this talk will share how we iteratively applied the principles of concurrency control to improve the user experience of our Query Service.
Dan Kleiman, Klaviyo
Dan is a software engineer at Klaviyo, a marketing tech company that powers email, SMS, and ecommerce integrations, where he works on query services for Klaviyo's data platform.
12:40 pm–1:55 pm
Luncheon
Sponsored by Cortex
Grand Ballroom ABGH
1:55 pm–3:30 pm
Mission City Ballrooms B1-B3 (SCCC)
If I Can Do It on an Ambulance, You Can Do It in an Office: Scalable Incident Response Using ICS
Thai Wood, Resilience Roundup
I'll outline what my experience managing physical emergencies in an ambulance and ER taught me about responding to incidents: how to start with a small team, scale the response up, and, perhaps more importantly, scale it back down, all based on the Incident Command System (ICS) framework. I'll cover approaches that apply whether or not you already have an incident response plan.
Thai Wood, Resilience Roundup
Thai is an independent consultant who helps teams build better systems and improve their ability to effectively respond to incidents. A former EMT, he applies his experience managing emergency situations to the software industry. He writes about resilience engineering at ResilienceRoundup.com.
Tired Reacting to Certificate Outages? Build Certificate Resilient Distributed Systems Using Chaos Engineering Practices
Kaitlyn Yang and Vikram Raju, Microsoft
Certificate-related disruptions and outages are very costly, causing loss of customer trust, negative media coverage, lost revenue, and employee burnout. Preventing certificate-related outages is hard due to a lack of frameworks, processes, and tools that can scale with ever-changing, complex distributed systems. In this talk, we will go over real-world certificate failure scenarios that are hard to continuously validate in a seamless manner. We will dive deep into how we leveraged Chaos Engineering practices and scaled our solution. We hope the audience will walk away knowing how to hunt, detect, continuously measure, and shift left the certificate resiliency of services in a scalable fashion.
Kaitlyn Yang, Microsoft
Kaitlyn is a Software Engineer at Microsoft where she works on building platforms that power the protection of Microsoft's and its customers' data. Kaitlyn also leads efforts to improve the security, reliability, and resiliency of Azure by enhancing the Microsoft Chaos Studio product.
Vikram Raju, Microsoft
Vikram is a Product Manager at Microsoft working on Azure Chaos Studio, a fully managed service that helps users measure, understand, and build application and service resilience to real-world outages.
Mission City Ballrooms B4-B5 (SCCC)
How To Take Prometheus Planet Scale: Massively Large Scale Metrics Installations
Vijay Samuel and Nick Pordash, eBay
Observability at eBay has been on an exponential growth curve. What was a low 2M/sec ingest rate of time series in 2017 is now roughly 40M/sec with active time series close to 3 billion. Our current Cortex-inspired architecture of Prometheus builds sharding and clustering on top of the Prometheus TSDB. It is relatively simple to shard/replicate tenants of data in centralized clusters. However, large clusters with growing cardinality become less useful as query latencies degrade considerably. In 2020, Google published a paper on its time series database Monarch, which is dubbed a planet-scale TSDB. The paper gave us some useful hints on how we could potentially decentralize our installation and go fully planet scale.
What started off as a humble prototype to federate queries to TSDBs deployed in Sydney, Amsterdam and the US from a centralized query instance is now a living, breathing entity that allows us to deploy our TSDBs anywhere in the world using simple Kubernetes operators, GitOps and intelligence on top of the Prometheus TSDB.
This talk focuses on:
- the development of field hint indices to fingerprint time series and their use for pointed query fanout
- functional query push down on top of Prometheus storage
- the struggles of managing a planet-scale deployment and using GitOps to mitigate pains
- other lessons learned
Vijay Samuel, eBay
Vijay Samuel works with eBay's observability platform as its architect. During his time at eBay, Vijay has transformed eBay's observability platform into a cloud native offering that is primarily built on top of open source technologies. He loves to code in Go and play video games.
Nick Pordash, eBay
Nick Pordash is the lead engineer of the Observability platform at eBay. He solves the hard problems of scaling the platform.
Your Infrastructure Needs to D.I.E.
Timothy Mamo, DigitalOcean
What can a diamond heist that occurred in 2003 teach us about InfoSec's CIA triad? Well, that the old way of doing security will not help us as we try to move towards being more Cloud Native. Simple tools and a lot of ingenuity can easily allow an attacker to ruin your day. So how can we make use of cloud native ways of development whilst also ensuring security?
We need infrastructure that can D.I.E.
Designing infrastructure that is Distributed, Immutable and Ephemeral will integrate security by default by raising the bar for attackers. Further, adding controlled chaos to the process ensures that you are continually learning and improving, increasing confidence in the ability to troubleshoot.
In this talk I'll highlight how the D.I.E. triad can help us be more secure while expecting things to fail and embracing the chaos.
Timothy Mamo, DigitalOcean
Timothy Mamo loves to help growing companies make the most of the cloud by focusing on Cloud Native technologies and processes. He's had a varied career, from studying aerospace engineering and working in the automotive industry to moving into the world of Cloud and becoming a Developer Advocate for DigitalOcean. He enjoys working with and helping others improve and understand, at times with some Mediterranean gusto.
3:30 pm–4:00 pm
Break with Refreshments
Grand Ballroom ABGH and Mission City Foyer
4:00 pm–5:30 pm
Closing Plenary Session
Not All Minutes Are Equal: The Secret behind SLO Adoption Failure
Michael Goins and Troy Koss, Capital One
Join us as we help you understand, calculate, and eventually prove the value of SLOs. Despite SLOs' frequent introduction across industry forums and literature, their actual formulaic definition is obscured. We'll show how to look past the disconnect between marketing and the actual calculations of SLOs. Attendees will learn key differences between time-based and event-based measurements. We will compare and contrast these methods as we review a real-world example of a high-severity incident. After we've established a common definition, we'll review the many key signals that SLOs and error budgets offer: slow burn, error budget recoveries, step function deviations, etc. These signals let teams latch onto empirical evidence to better understand their system's health and discover unreliability. After all, you can't improve what you don't measure.
Michael Goins, Capital One
A linguaphile, Goins turned a hobby in foreign languages into a career in programming ones. From a start in payroll systems to a stint as a Java tech lead, he focused on solving problems via software. After becoming frustrated by disjointed development processes, he transitioned to a role defining and rolling out CI/CD practices at an organizational scale. Currently, Goins has switched to applying those same organizational scale transformation skills to the SRE domain at Capital One.
Troy Koss, Capital One
With what seems to be a natural attraction towards reliability, Troy constantly found himself involved in making things...well...more reliable. After working in software development, he stumbled into operations and saw a clear opportunity to use software to orchestrate such efforts. Currently, he works in Capital One's stability organization leading enterprise Site Reliability Engineering (SRE). He plays a critical part in both evolving the enterprise strategy and leading a team of engineers focused on partnering with and influencing business, architecture, and technology partners in delivering on the strategy.
Hell Is Other Platforms
Alex Hidalgo, Nobl9; Andrew Clay Shafer, Ergonautic
Deliberately or inadvertently, everyone builds a platform to support their business mission. And in doing so we often look to what others have done to get us there. But what if others' principles, practices, and platforms aren't applicable to our context?
In Sartre's 'No Exit' people are brought to a mysterious room in hell where they had all expected medieval torture devices to punish them for eternity, but eventually realize they had been put together to torture each other. Join Andrew Clay Shafer as DevOps and Alex Hidalgo as SRE as they both wait for Platform Engineering to join them in hell. During their suffering they will explain how you can best retain your agency, identify your needs, and understand how you can avoid the afterlife, while also outlining how we all got here in the first place.
Alex Hidalgo, Nobl9
Alex Hidalgo is the Principal Reliability Advocate at Nobl9 and author of "Implementing Service Level Objectives." During his career he has developed a deep love for sustainable operations, proper observability, and using SLO data to drive discussions and make decisions. Alex's previous jobs have included IT support, network security, restaurant work, t-shirt design, and hosting game shows at bars. When not sharing his passion for technology with others, you can find him scuba diving or watching Premier League football. He lives in Brooklyn with his partner Jen and a rescue dog named Taco. Alex has a BA in philosophy from Virginia Commonwealth University.
Andrew Clay Shafer, Ergonautic
Andrew Clay Shafer evangelized DevOps tools and practices when DevOps was not a word before falling in love with SLOs in theory and practice. Living at the intersection of Open Source and Cloud Computing across two decades, they gained experience in every role in software delivery from support and QA to product and development. Andrew now focuses on engineering operable resilient socio-technical systems and communities as a founder of Ergonautic.
5:30 pm–5:35 pm
Closing Remarks
Program Co-Chairs: Sarah Butt, Salesforce; Mohit Suley, Microsoft