SREcon22 Asia/Pacific Conference Program

Wednesday, 7 December

8:00 am–9:00 am

Morning Coffee and Tea

Level 2 Foyer

9:00 am–9:10 am

Opening Remarks and Presentation of the USENIX Lifetime Achievement Award

Grand Ballroom

Liz Fong-Jones, Honeycomb, and Jamie Wilkinson, Google

9:10 am–10:10 am

Opening Plenary Session

Grand Ballroom

Computing Performance 2022: What's on the Horizon

Wednesday, 9:10 am–10:10 am

Brendan Gregg

The constant drive for faster computing performance introduces new hardware and software components for the SRE team to manage, observe, monitor, alert, and consider for capacity planning. This session tours the current state of major technologies, discussing performance improvements underway that you may soon be adopting and managing. Topics include processors (including 3D stacking and cloud vendor CPUs), memory (including DDR5 and HBM), disks (including 3D XPoint), networking (including QUIC and XDP), hypervisors (including lightweight VMs: Firecracker and Cloud Hypervisor), AI-based auto tuning, and more. The future of performance is increasingly cloud-based, with hardware hypervisors and custom processors, observability down to cycle stalls (even as cloud guests), high-speed syscall-avoiding applications (eBPF, FPGAs, and io_uring), and AI-based auto tuning. This session provides ideas for improving performance, reducing latency, and meeting SLOs, and also provides opinions from a performance engineering expert with predictions for the future.

Brendan Gregg

Brendan Gregg is an internationally renowned expert in computing performance. He is an Intel Fellow, focusing on cloud computing performance and eBPF. Previously based in the US where he worked for Netflix and Sun Microsystems, he authored Systems Performance and BPF Performance Tools (Addison-Wesley), and received the USENIX LISA Outstanding Achievement award. He has delivered industry-leading performance for various products, and created widely used performance tools, methodologies, and visualizations, including flame graphs. His work is the basis for multiple startups.

10:10 am–10:40 am

Break with Refreshments

Level 2 Foyer

10:40 am–12:10 pm

Track 1

Grand Ballroom

Move Fast and Learn Things: Principles of Cognition, Teaming, and Coordination to Support High Performance and Resilient Site Reliability Engineering

Wednesday, 10:40 am–11:40 am

Dr. Laura Maguire and Nora Jones, Jeli

What if we designed work systems to better support SREs in building and maintaining large-scale distributed systems?

With software systems running at speed and scale and talent scarce, it is increasingly critical to design work systems that support an SRE's ability to recognize anomalies, adapt to changing conditions, and effectively coordinate across inter- and intra-organizational boundaries.

In this talk, you'll learn about the cognitive and coordinative mechanisms that underlie resilient software engineering. Drawing from engineering psychology, design thinking, cognitive systems engineering, and contemporary management theory, this talk will serve as a primer for better understanding yourself, members of your team, and organizational life in general. Using practical case study examples and engaging stories from 5 years of studying software engineers at work while they build, maintain, and repair large-scale distributed systems, this session will enlighten and inspire your SRE practice.

Laura Maguire, Jeli

Laura leads the research program at Jeli.io. She has a Master's degree in Human Factors & Systems Safety and a PhD in Cognitive Systems Engineering. Her doctoral work focused on distributed incident response practices in DevOps teams responsible for critical digital services. She was a researcher with the SNAFU Catchers Consortium from 2017 to 2020, and her research interests lie in resilience engineering, coordination design, and enabling adaptive capacity across distributed work teams.

Nora Jones, Jeli

Nora Jones is the founder and CEO of Jeli. She is a software engineer and leader with 10+ years of experience at innovative companies including Netflix and Slack. Nora's focus on the sociotechnical aspects of engineering — the intersection between how people and software work together in practice in distributed systems — is a founding pillar of Jeli, as well as the Chaos Engineering movement and learningfromincidents.io community, both of which Nora helped build from the start.

How to Not Destroy Your Production Kubernetes Clusters

Wednesday, 11:40 am–12:10 pm

Qian Ding, Ant Group

This talk presents our real production incident stories from managing hundreds of Kubernetes clusters, particularly when a single cluster scales to 10K+ nodes. We demonstrate that Kubernetes in production can be fragile if not operated skillfully. These operations can be as simple as adding a single node to the cluster or modifying a configmap used by the API server. Yet the chain reactions of such operations may end up causing entire clusters to stop scheduling pods or to drop a significant number of API requests. By sharing the lessons learned from these failures, we conclude with our best practices for maintaining high cluster availability.

Qian Ding, Ant Group

Qian works at Ant Group as a staff engineer focusing on site reliability engineering. He is the SRE tech lead for adopting Kubernetes in the Ant Group production environment. He is passionate about adopting and promoting SRE's philosophy for managing large-scale production systems. His current interests include designing SLOs for Kubernetes from the end user's perspective as well as using SLOs to drive reliability feature development for Kubernetes.

Track 2

Hyde Park Room

The Math behind the Incident Aftermath: A Practical Guide to Measuring Incident Impacts

Wednesday, 10:40 am–11:40 am

Ashish Patel and Sriram Srinivasan, PayPal

Even with world-class reliability systems, incidents do occur, with varying levels of impact on our business. One of the many steps involved in the aftermath of an incident is measuring its impact.

Accurate measurement of financial impact due to an incident is an important part of incident management and is needed in real-time for many reasons including regulatory requirements.

However, it's not easy to calculate the impact given the dynamics of a complex distributed system. As SRE is deep-rooted in automation, we have built an Incident Impact Calculation Framework that accurately measures incident impact using various statistical and ML models. It runs in seconds, is completely automated, and fits well with our other incident management tools.

Building a real-time, reliable, and robust impact assessment framework is not as straightforward as it seems. Join us in this session to learn more about impact assessment, its design, and its challenges.

Ashish Patel, PayPal

Ashish is a software engineer at PayPal and works on the site reliability platform engineering team. He develops and maintains tooling that helps PayPal and its businesses assess their losses and prioritise incidents. His areas of interest include SRE, distributed systems, and machine learning. In his spare time, he reads and draws unrecognisable cartoons.

Sriram Srinivasan, PayPal

Sriram Srinivasan is a Technical Architect at PayPal, where he is focused on building software tools and platforms in the SRE space. He has 20 years of experience building and operating applications and services in production with a focus on reliability. He enjoys solving challenging technological problems and particularly likes software design and architecture. Besides his day job, Sriram mentors women entrepreneurs.

OpenTelemetry and Observability: What, Why, and Why Now?

Wednesday, 11:40 am–12:10 pm

Greg Leffler, Splunk

OpenTelemetry enables you to own your own observability-related telemetry data and take it anywhere. In this talk, I'll explain what the OpenTelemetry project is, give a brief overview of how it enables observability, and discuss why data ownership with OpenTelemetry is so important. You'll learn the components of an OpenTelemetry data pipeline and you'll also learn why you should care as an SRE. OpenTelemetry reduces toil, is the industry standard for observability instrumentation, and is still a relatively new project, so you have lots of opportunity to contribute and shape the direction it goes in.

You'll leave this talk with an understanding of OpenTelemetry's overall design and philosophy, and of why it is the future.

Greg Leffler, Splunk

Greg Leffler heads the Observability Practitioner team at Splunk and is on a mission to spread the good word of Observability to the world. Greg's career has taken him from the NOC to SRE, from SRE to management, with side stops in security and editorial functions at eBay Advertising and LinkedIn. In addition to Observability, Greg's professional interests include hiring, training, SRE culture and operating effective remote teams. Greg holds a Master's Degree in Industrial/Organizational Psychology from Old Dominion University.

12:10 pm–1:30 pm

Lunch

Level 2 Foyer

1:30 pm–3:00 pm

Track 1

Grand Ballroom

Principles of Safety and Reliability Learned from US Navy Landing Signal Officers

Wednesday, 1:30 pm–2:30 pm

Matthew Brahms, OpsLevel

Recovering fighter aircraft aboard an aircraft carrier is an extremely complicated and dangerous process, even under the best conditions. Performing a successful arrestment aboard the boat involves specific parameters and processes that must be adhered to and problems that must be overcome.

While this topic is seemingly unrelated to software engineering, many similarities and principles can be learned from how the role of Landing Signal Officer is performed. As a community, they have crafted and honed their role at great cost to safely and expeditiously recover aircraft. These similarities and principles have implications for both SRE practitioners and the field of SRE as a whole.

Matthew Brahms, OpsLevel

As a Site Reliability Engineer, Matthew works to build scalable/resilient systems and instill SRE culture into the teams he embeds with (SLI, SLO, SLA, anyone?!). Previous roles have included DevOps Engineer, Systems Administrator, and professional classical musician.

Originally from Columbus, OH, Matthew holds degrees from The Ohio State University and Carnegie Mellon University in Pittsburgh. Currently he lives in Austin, TX, and enjoys working with Kubernetes, Go, and other cloud native technologies.

Other favorite activities include spending time with his family; training for a marathon; eating a whole-food, plant-based diet; and talking/listening to all things Classical in music.

Infra Eng to Staff SRE: A Tale of Developing Yourself in an Ever Evolving Industry

Wednesday, 2:30 pm–3:00 pm

Jess Belliveau, Apptio

A reflection on the journey of transitioning from Infrastructure Engineer into the SRE world: how and why would others follow, and what hurdles do we face in an evolving industry? This talk discusses different paths to reaching the "lucrative" staff job title, overcoming imposter syndrome, and dealing with the cognitive dissonance that can easily be felt within the SRE sphere as you interact with others or make a similar job title journey.

Jess Belliveau, Apptio

Jess is an SRE at Apptio working on the Platform and Reliability Engineering team. Jess holds keen interests in automation, developing technologies, coffee, empowering developers, and building resilient platforms (and cycling!).

Track 2

Hyde Park Room

Lifecycle of a Sample in the Prometheus TSDB

Wednesday, 1:30 pm–2:30 pm

Ganesh Vernekar, Grafana Labs

Prometheus is the leading open-source metrics monitoring solution. Prometheus 2.x contains a highly optimised Time Series Database (TSDB). It is ACID compliant and works efficiently with hundreds of millions of time series on a single node. Cortex, Thanos, and Mimir have implemented distributed TSDBs on top of its code base.

The Prometheus TSDB has many components that make this all possible—an in-memory database, a Write-Ahead Log (WAL) for durability, memory-mapping of data from disk, persistent immutable data blocks with highly optimised inverted indices, and various techniques of maintaining this data in the background.

This talk will take you through the entire lifecycle of a time series sample (timestamp int64, value float64) starting from the point it enters the Prometheus TSDB's in-memory database until it gets deleted from the persistent blocks according to the retention policies.
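As a rough illustration of the lifecycle described above, the following toy model walks a sample through a write-ahead log, an in-memory head, an immutable block, and retention-based deletion. It is only a sketch under stated assumptions: `ToyTSDB`, `Block`, `block_range_ms`, and `retention_ms` are hypothetical names invented here, samples are assumed to arrive in timestamp order, and none of this reflects the actual Prometheus code or on-disk format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# (timestamp in ms, value), mirroring the (int64, float64) pair in the abstract.
Sample = Tuple[int, float]


@dataclass(frozen=True)
class Block:
    """Hypothetical stand-in for an immutable persisted block."""
    min_t: int
    max_t: int
    samples: Tuple[Sample, ...]


@dataclass
class ToyTSDB:
    block_range_ms: int   # assumed: time span of head data that forms one block
    retention_ms: int     # assumed: how long persisted blocks are kept
    wal: List[Sample] = field(default_factory=list)    # durability log
    head: List[Sample] = field(default_factory=list)   # in-memory samples
    blocks: List[Block] = field(default_factory=list)  # persisted blocks

    def append(self, t: int, v: float) -> None:
        # Step 1: log the sample for durability, then keep it in memory.
        self.wal.append((t, v))
        self.head.append((t, v))

    def compact(self) -> None:
        # Step 2: once the head spans a full block range, cut an immutable
        # block from it and drop the WAL entries that block now covers.
        if not self.head:
            return
        min_t, max_t = self.head[0][0], self.head[-1][0]
        if max_t - min_t < self.block_range_ms:
            return
        self.blocks.append(Block(min_t, max_t, tuple(self.head)))
        self.head.clear()
        self.wal.clear()

    def apply_retention(self, now: int) -> None:
        # Step 3: delete whole blocks whose newest sample is past retention.
        self.blocks = [b for b in self.blocks
                       if now - b.max_t <= self.retention_ms]


db = ToyTSDB(block_range_ms=100, retention_ms=1000)
for ts in (0, 50, 100):
    db.append(ts, float(ts))
db.compact()                   # head now spans 100 ms, so one block is cut
blocks_after_compaction = len(db.blocks)
db.apply_retention(now=2000)   # the block's newest sample is 1900 ms old
blocks_after_retention = len(db.blocks)
```

The point of the sketch is the ordering of the stages, not their mechanics: the real TSDB adds memory-mapping, indexing, and background maintenance that a toy like this leaves out.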

Ganesh Vernekar, Grafana Labs

Ganesh has been contributing to Prometheus for nearly 5 years and is a Prometheus team member and maintainer of its Time Series Database (TSDB). He is currently working on the new native histograms in Prometheus. He has also contributed to Cortex, Grafana Mimir, and Grafana.

Metrics Stream Processing Using Riemann

Wednesday, 2:30 pm–3:00 pm

Pradeep Chhetri, Cloudflare

In today's world where we have a wide variety of tools to add observability into our software stack, every tool comes with its own set of strengths and weaknesses. In this talk, we will look into one such monitoring tool, Riemann, a stream processing system written in Clojure. We will discuss some of its core concepts, use cases and some of its features which make it unique as compared to others.

Pradeep Chhetri, Cloudflare

Pradeep is an SRE at Cloudflare where he currently works in the Postgres Team. Over the last 10 years, he has been working mostly with startups. He is particularly interested in databases.

3:00 pm–3:30 pm

Break with Refreshments

Level 2 Foyer

3:30 pm–5:00 pm

Track 1

Grand Ballroom

Lifecycle of Reusable Automations: Track, Maintain, Deprecate

Wednesday, 3:30 pm–4:30 pm

Renisha Fernandes and Bharat P, VMware

As SREs, we all spend hours automating every possible manual task. While it is important to write new automated flows, it is equally important to keep existing automated flows relevant. As the number of active automation workflows grows significantly, various scaling problems come into the picture. One of them is maintaining relevant automations and avoiding script rot. When an SRE decides to use an automated workflow on a customer platform, they should be able to trust that the script is still relevant. Join our talk for insight into our experience of scaling an in-house automation platform that hosts 2000+ automation workflows, with close to 5 million executions every month. We will talk about building and scaling an automation platform, tracking automation workflows and executions, understanding script rot, working with execution data, and how we used insights gathered from that data to automatically maintain the relevancy of workflows.

Renisha Fernandes, VMware

Renisha Fernandes has been in software development for close to 10 years, contributing to both back-end and front-end development. For the past 5 years, she has been contributing to the development and scaling of the automation platform that is actively used by VMware VMC SRE. She likes playing around with distributed systems design and effective scaling.

Bharat P, VMware

Bharat has been part of various product development and operations efforts for the last 15+ years. He is the Engineering Manager for SRE Services Cloud Operations. Platform engineering is among his current interests.

Dashboards and Runbooks: Scrapbooking for Engineers

Wednesday, 4:30 pm–5:00 pm

Colin Douch, Cloudflare

With the SRE revolution, alert runbooks and dashboards have become vital tools for engineering teams hoping to adopt better incident response strategies. Unfortunately, these tools are often used in a way that makes them ineffective at this task. In particular, they are often created as knee-jerk responses to incidents, without thought as to where they fit into the overall landscape of the incident response. This leads to hyper-specific tooling that often masks the root causes of incidents and hinders an incident response rather than helping it.

In this talk, I will cover why creating dashboards and runbooks is such an attractive proposition to engineering teams, why it's so easy to fall into the specificity trap, why having these runbooks and dashboards is such an issue, and where these tools should instead fall into your incident response structure.

Colin Douch, Cloudflare

Colin currently tech leads the Observability Platform Team at Cloudflare, orchestrating and inventing solutions to monitor and debug Cloudflare's infrastructure. Having started in mining, he has been working in the observability space for close to 10 years with companies both big and small.

Track 2

Hyde Park Room

Observability Is Not Analytics!

Wednesday, 3:30 pm–4:30 pm

Andrew Cowie

Implementing observability was a game-changer. We dramatically reduced our time to identify problems, isolate causes, and see effects of changes.

But it's not quite as easy to retrofit as we might like to think. Brooks taught us to be wary of doing things over, but we couldn't safely make even basic changes to the existing codebase. Being able to do observability at all was a major motivation for a massive re-engineering. We'll share lessons learned as we rebuilt a large distributed system.

As we iterate the code we iterate our telemetry, too. Once you've learned something and changed the system, it's a new system; telemetry is not a continuous function! This has a drawback: you can't use observability as a substitute for business metrics. Which raises an interesting question: can you actually measure your SLOs using SLIs in a distributed system?

Andrew Cowie

Andrew Cowie has an extensive background of software development, systems operations, production infrastructure, and engineering leadership experience but somewhat unusually started his career as an infantry officer in the Canadian army, having graduated from Royal Military College with a degree in engineering physics. He later ran operations for a new media company in Manhattan and was a part of recovering the firm after the Sept 11 attacks. Since then he has consulted on crisis resolution, change management, robust architectures, and (more interestingly) leveraging Open Source to achieve these ends. Andrew has been working in and around systems engineering and functional programming for many years; his most recent work has been to re-engineer observability into analytics pipelines written in Haskell.

Lessons Learned Building a Global Synthetic Monitoring System

Wednesday, 4:30 pm–5:00 pm

Surajnath Sidh, Grafana Labs

In this talk, I will share lessons learned while building, deploying, and operating a global network of black box monitoring probes built using blackbox exporter as the core of the system.

I will also talk about the design choices we made, how we evolved the deployment and rollout process of probes using RKE, and the security practices used to secure the probes.

Thursday, 8 December

8:00 am–9:00 am

Morning Coffee and Tea

Level 2 Foyer

9:00 am–10:30 am

Track 1

Grand Ballroom

Sustaining Everything, Everywhere, All at Once!

Thursday, 9:00 am–10:00 am

Fanjing Meng, Robert Barron, and Hua Ye, IBM

Sustainability is one of the hot topics in addressing the challenges of climate change. In the context of Site Reliability Engineering, it means improving the power consumption efficiency of the whole stack (from infrastructure to application) while ensuring the SLAs/SLOs of the services it supports. In this talk we will start by discussing the concepts and technical challenges of a full-stack sustainability optimization platform. Then, we'll share the measurement systems, standards, and testing approaches for sustainable computing that we have developed and use. This will not merely be a high-level concept; we'll present the architecture and detailed design of the whole platform. Part of the session will cover our practices, which were developed using a data-driven approach to efficiency optimization. We will also show an end-to-end demonstration of the whole platform in our data center and showcase active resource management and dynamic workload scheduling technologies.

Fanjing Meng, IBM

Dr. Fanjing Meng, Chief Technology Officer of IBM China System Development Lab, has more than 20 years of cutting-edge technology research, development, and management experience, including sustainable computing, AIOps, ITOA, cloud computing, and software and solution engineering. Currently, she is committed to the research and development of sustainable computing technologies by building a full-stack sustainable computing optimization and management platform based on IBM systems, software and services to accelerate the realization of sustainable computing and sustainable digital transformation of enterprises. She has published more than 30 academic papers in international conferences and journals, has more than 40 international patents and patent applications in many innovative fields, and has received more than 30 awards for technological innovation and contribution from IBM and IEEE. In addition, she is actively involved in the establishment and construction of technical and academic communities, serving as General Chair (Co-Chair), Technical Program Committee Chair (Co-Chair), and Technical Program Committee Member of international conferences and as a reviewer for international journals, as a founding member and project leader of the IEEE WIE (Women-in-Engineering) Beijing Affiliate, and as a member and invited speaker of IEEE Women in Services Computing (WISC).

Robert Barron, IBM

Robert is an SRE Architect in the office of the IBM CIO where he enjoys helping others solve problems even more than he enjoys solving them himself. Robert has over 20 years of experience in IT and is still happiest when learning something new. He lives in Israel with his wonderful wife and two children. His hobbies include history, space exploration, and bird photography.

Hua Ye, IBM

Hua Ye, a Senior Architect at IBM China Systems Center, has over 20 years of experience in the design, architecture, implementation, operation, and management of a client-facing data center. He has in-depth knowledge of infrastructure management and is familiar with data center technology, IT hardware, networking, storage, and related software. He is now responsible for IBM data center operations, working closely with the global technical team and leading the local infrastructure team to support client projects and IT infrastructure operations.

Introducing the Reliability Map – r9y.dev

Thursday, 10:00 am–10:30 am

Aaron Bowden, Google

So you want to make your systems more reliable? Sure, there's a book for that, but what does it mean? What do we do next? The r9y.dev map helps you assess what you've already got and where you want to get to, and helps you build a roadmap of tools, process, and culture changes.

Aaron Bowden, Google

Aaron joined Google in 2017 and can now be found leading the Site Reliability Engineering practice for GCP Professional Services across Japan and Asia Pacific. His prior experience spans some 20 years and has centered on solution architecture, software delivery and large scale transformation projects across a variety of global businesses.

Aaron holds a Master of Management (Information Technology) and a Master of Computing Research. When he is not helping customers to scale and respond to global events, you can find him busy trying to build retro arcade machines.

Track 2

Hyde Park Room

Chaos Engineering at Scale

Thursday, 9:00 am–10:00 am

Sharath Reddy and Venkatesh Maligireddy, PayPal

As an SRE or an application owner, it is common to come across the following questions and scenarios during day-to-day activities:

  • "If only we had seen this sooner…" during the course of a site incident.
  • "What happens if one of my service dependencies fails?"
  • "How reliable is my application in the production environment?"

Chaos engineering has evolved into a must-have part of SRE culture. It addresses the above questions and thereby improves the resiliency of internal systems, giving teams the confidence and a path to provide best-in-class products at scale.

In this talk, we will cover:

  1. The Chaos principles
  2. How to prepare for a Chaos journey in an organization
  3. How to conduct Chaos Gamedays
  4. How to measure and track the resiliency of a system
  5. How to leverage existing open source Chaos platforms

Sharath Reddy, PayPal

Sharath is an engineer with 10 years of experience in software. He has worked in product development as well as site reliability, in large enterprises as well as a couple of startups. He has a strong passion for working on complex engineering problems, which generally keeps him going, and a penchant for the elegant design of systems. Apart from this, he follows and sometimes plays cricket and soccer.

Venkatesh Maligireddy, PayPal

Venkatesh Maligireddy is a Senior Software Engineer at PayPal, where he works on building the enterprise Chaos platform. In his prior roles, he led a team that built a ChatOps platform that helps automate incident management and operational efficiency workflows at PayPal. In his role as an SRE, he has also worked on Disaster Recovery and Parity measurement platforms across data centers.

The Multi Layered Cake of Resilience

Thursday, 10:00 am–10:30 am

Joe Chop, Shopify

The key to building resilient distributed systems is accepting that the system will fail, and that it will fail at every layer of the stack, from the edge all the way down to the database. In this talk, we will give an overview of a practical, multi-layer approach to building resilient web applications. Using a simplified version of a web application stack, we will illustrate, through animations, a common failure scenario and how resiliency techniques at various layers protect the system.

Joe Chop, Shopify

Joe Chop is a site reliability engineering manager at Shopify, leading the SRE team spread across Hawai'i, Singapore, Japan, and Australia. When Joe is not on call, he enjoys pager-free time out on the ocean and in the mountains, surfing and hiking the volcanoes of Hawai'i.

10:30 am–11:00 am

Break with Refreshments

Level 2 Foyer

11:00 am–12:00 pm

Track 1

Grand Ballroom

Capacity vs Efficiency: Building a Globally Scalable Cloud Database

Thursday, 11:00 am–12:00 pm

Daniel Marshall, Google LLC

The last 2 years have shown countless examples of products massively scaling up capacity from sudden shifts to a digital-first approach. At the same time, global chip shortages limit the availability of machine capacity, and economic conditions require cost-effectiveness and minimal over-provisioning.

This talk focuses on straddling the line between capacity and efficiency, using lessons learned from operating and scaling Google Cloud Firestore to discuss how and where to make trade-offs, and strategies to optimise for both.

Daniel Marshall, Google LLC

Daniel is an SRE at Google Sydney, where he's been working on Cloud Native Databases for the past 3 years. In a prior life he was an electronics engineer, and helped build the data centers he now runs software in.

Track 2

Hyde Park Room

Improving Observability, Reliability, and Security of Relational Database Ecosystem

Thursday, 11:00 am–12:00 pm

Sundar Raman Ganesh, LinkedIn

As SREs who offer MySQL as a service to developers, we were presented with the challenge of providing support when applications hit connection limits or performance bottlenecks due to incorrect client-side configurations or other resource constraints. To solve these operational problems, we developed a set of custom wrappers over open-source and native MySQL client libraries to provide observability into the client drivers and pools, so that we can quickly identify issues and mitigate them. In this talk, we discuss the following:

  • Benefits of introducing observability into database clients
  • Key metrics to monitor on application clients of databases
  • Improving service availability with client-side intelligence

Further, we speak on how we improved security of our services by developing a custom server plugin for MySQL that performs authentication using mTLS.

Sundar Raman Ganesh, LinkedIn

Sundar has been working in the Internet industry for the last 5 years with mail systems, web servers, containers, and databases. Over the last couple of years, he has worked extensively on improving the resiliency of relational database systems at LinkedIn. He is passionate about building automations that work at scale.

12:00 pm–1:20 pm

Lunch

Level 2 Foyer

1:20 pm–2:50 pm

Track 1

Grand Ballroom

Improving Machine Learning Development Reliability

Thursday, 1:20 pm–2:20 pm

Brian Hansen and Yan Yan, Meta

The Machine Learning Development Lifecycle is not the same as the Software Development Lifecycle. It’s so different that we believe we need to develop new ways to rationalize how we go about building, monitoring, and alerting on ML artifacts as they go through the process. This talk explores those differences. It highlights the challenges of ML reliability and scalability, what we’ve done, and the need for involvement from this community to evolve how we think about the development and productization of machine learning as it explodes across our industry.

Brian Hansen, Meta

Brian leads the AdsML Production Engineering teams for Meta, focused on scaling machine learning in production environments. He has been a successful serial entrepreneur for two decades, taking multiple start-ups from early to late stage growth. Throughout his career Brian has been a leader building global teams leveraging infrastructure to improve business performance.

Yan Yan, Meta

Yan Yan is a production engineer within the AdsML PE: Model Ecosystem team. She’s defining and building a Machine Learning Development Lifecycle with partners across Meta. Yan has been a speaker at the 2019 & 2020 USENIX Operational Machine Learning conferences and the 2018 & 2019 Meta PE Summit. Prior to Meta, she graduated from UCLA with a Master’s degree in computer science.

How Can We Make Data Integrity Easy?

Thursday, 2:20 pm–2:50 pm

Adrian Ratnapala, Google

If you keep valuable data, then you want to look after it. So you probably already know all about backups. And you probably also know that you don't need backups but you need restores, and that you don't need restores but you need recovery.

But if you've heard all that before, you've probably also heard that you need plans, tests, automation, metrics, and... a whole lot of hard stuff. Hard stuff is hard work, so maybe (I'll keep your secret), you don't have all those ducks lined up.

Let's talk about why all this is so hard, and what databases, tools and frameworks can do for you to make it easy. So that you really can keep your valuable data with the care it deserves.

Adrian Ratnapala, Google

Dr. Adrian Ratnapala is a Site Reliability Engineer who has worked on storage systems at Google for six years; which is to say, he is fanatical about being a good custodian of other people's data. Dr. Ratnapala has also been an embedded-systems software engineer and a quantum physicist. He is an experienced professional nerd with a working knowledge of numerical analysis, optics, electronics, the integration of complex hardware/software systems, distributed computing, and really, really, really big data.

Track 2

Hyde Park Room

Real-Time Adaptive Controls for Resilient Distributed Systems

Thursday, 1:20 pm–2:20 pm

Praveen Yedidi, CrowdStrike

Modern services are equipped with hundreds of tunables. Many of these tunables, such as worker pool sizes, autoscaling policies, throttlers, and circuit breakers, directly affect service resilience. Finding ideal initial values for them requires deep technical expertise. Workloads also change over time, requiring regular effort to re-tune stale parameters. As a consequence, configuration errors have become a source of operational toil and one of the major causes of overload and of cascading service and system failures across the industry. Services should aim to expose a minimal configuration surface by dynamically adjusting parameters based on observations. Praveen will provide a deep dive into how CrowdStrike is using real-time Adaptive Controls (inspired by TCP congestion control) to dynamically tune these parameters for improved resiliency using real-time sampling of errors and latencies, removing the need for periodic adjustment. He will also discuss lessons learned deploying the feature, without causing any incidents, to CrowdStrike's massive production systems that handle multiple trillions of events per day.
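The abstract does not say which control rule is used, so the following is only a sketch of one classic idea from TCP congestion control, additive increase and multiplicative decrease (AIMD), applied to a concurrency limit. Every name and threshold here (`AIMDLimit`, `max_error_rate`, `max_latency_ms`, and the default values) is a hypothetical chosen for illustration, not a description of CrowdStrike's system.

```python
from dataclasses import dataclass


@dataclass
class AIMDLimit:
    """Hypothetical self-tuning concurrency limit driven by sampled health."""
    limit: float = 10.0            # current concurrency limit
    min_limit: float = 1.0         # never throttle below this
    max_limit: float = 1000.0      # never grow beyond this
    increase: float = 1.0          # additive step after a healthy sample
    decrease: float = 0.5          # multiplicative factor after an unhealthy one
    max_error_rate: float = 0.05   # assumed health threshold (5% errors)
    max_latency_ms: float = 200.0  # assumed health threshold

    def observe(self, error_rate: float, latency_ms: float) -> float:
        """Feed one sampling window and return the adjusted limit."""
        overloaded = (error_rate > self.max_error_rate
                      or latency_ms > self.max_latency_ms)
        if overloaded:
            # Back off quickly, as TCP does on packet loss.
            self.limit = max(self.min_limit, self.limit * self.decrease)
        else:
            # Probe for spare capacity slowly.
            self.limit = min(self.max_limit, self.limit + self.increase)
        return self.limit


limiter = AIMDLimit()
after_healthy = limiter.observe(error_rate=0.0, latency_ms=50.0)
after_errors = limiter.observe(error_rate=0.2, latency_ms=50.0)
```

The asymmetry is the design choice worth noting: growing slowly and shrinking fast lets a limit follow a changing workload without an operator re-tuning a static value.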

Praveen Yedidi, CrowdStrike

Distributed systems developer and Engineering Manager with experience in mentoring, facilitating and leading teams, offering a decade of experience in large-scale cloud-native application and tooling development. Possesses excellent analytical skills combined with strong knowledge of Go, JavaScript, Kubernetes, AWS, Terraform, Vault, Consul, service meshes, observability and monitoring tools. Active open source contributor to projects such as Kubernetes, gVisor, Grafana, Terraform and firecracker-containerd.

Cognitive and Self-Adaptive System for Effective Distributed-Tracing in Applications

Thursday, 2:20 pm–2:50 pm

Susobhit Panigrahi

The solution is a holistic approach that uses Cognitive Learning to provide better results and capture the useful traces among the heap of traces in the ever more complex world of microservice-based architecture. The solution streamlines the process and reduces multiple pain points for SRE, Dev and Ops. Come join to share thoughts! :)

Susobhit Panigrahi, VMware

Curious, inquisitive, and friendly individual working as SRE and Developer at VMware with a knack for solving interesting problems with scalable and simple solutions.

2:50 pm–3:20 pm

Break with Refreshments

Level 2 Foyer

3:20 pm–4:20 pm

Track 1

Grand Ballroom

Site Reliability Evangelism: Practice Start-up within an Established Web-Presence

Thursday, 3:20 pm–3:50 pm

Piers Chamberlain and Catherine Matheson, Trade Me

Introducing SRE practices within a mature online business is not for the fainthearted. Google lays out several possible approaches to SRE transformation but promoting the central ideas to our engineering colleagues was not really what we'd expected to spend so much time on.

We will talk about the first year of SRE for our organisation, being frank about the reality of our position today given the Initial Roadmap set out and how plans changed along the way.

Piers Chamberlain, Trade Me

Piers currently works as a Site Reliability Practice Lead at Trade Me. With a background in performance engineering and incident management he's unashamedly fascinated by the ways that things fail, the ways that they recover, and what we learn along the way.

Catherine Matheson, Trade Me

Catherine currently works as a Technical Product Owner at Trade Me. She has a background in quality assurance so has been interested in anything that helps improve the quality of the products we release to our customers.

Deploying Humans at the Edge of SRE

Thursday, 3:50 pm–4:20 pm

Jan Peuker, Stripe

SRE principles are fundamentally built around end-user happiness. But talking to actual end users and solving individual issues is commonly seen as reactive, toil, something product teams do or something to generalize and automate.

We are proposing to bring SRE principles and empathy into the end-user facing engineering community, focusing on SRE as a culture rather than a role or product (like a platform team). For that, we give an overview of common user-facing tech-adjacent roles, from sales to support, how they fit into your socio-technical system, and how to optimize your product user journeys for long-term user happiness. Ideally we also provoke some thoughts around SRE career paths and openness for uncommon backgrounds.

Jan Peuker, Stripe

Jan is an Integration Engineer at Stripe, based in Singapore. He keeps running stuff in a way that may resemble what is usually called Staff Engineer, Engineering Manager, Tech Lead, Architect, Data Engineer, Operations Engineer, DevOps, Support, Product Ops, Solution Engineer, Product Operations Manager or simply SRE - but can simply be summarized as “Must fix everything”, which was the motto printed on his farewell mug from Google.

Track 2

Hyde Park Room

Challenges, Best Practices, and Solutions for Monitoring and Alerting with Big Data

Thursday, 3:20 pm–3:50 pm

Daniel O'Dea, Atlassian

Imagine you run an automated ice cream gigafactory. You notice your vanilla ice cream output has dropped. You check your monitoring system—it can't find the problem because the error is hidden somewhere in your thousands of ice cream machines (you make too much ice cream).

After days of manually searching the factory:

  1. You find a pond of vanilla ice cream next to a row of faulty machines, and
  2. You realise that 30% of your ice cream melts on its way to be packaged.

In cloud SaaS products (no, not Sugar as a Service), good system monitoring is similarly crucial. We'll discuss how we monitor Jira at Atlassian for millions of customers, and explore some challenges we've faced. We'll go over best practices and solutions to help you maintain the sweetest systems, with a particular focus on managing big, high-cardinality data.

Daniel O'Dea, Atlassian

Daniel O'Dea is a Site Reliability Engineer at Atlassian, where he is a key player in major incidents involving Jira. Daniel is the co-founder of Thorial, a startup building software tools for managers. Daniel is also a classical pianist, composer, and artist.

A Better Way to Manage Stateful Systems: Design for Observability and Robust Deployment

Thursday, 3:50 pm–4:20 pm

Kazuki Higashiguchi, Autify

Designing stateful system components for monitoring and robust deployment is more complex than designing stateless ones. We prefer to build our system components to be as stateless as possible, since that is one of the best practices in the cloud-native era, but some systems inevitably have state. Without careful design, your application hides its state and becomes a black box that is not observable. Besides, it would be impossible to implement robust deployment without downtime, since we need to verify whether we can release changes by checking the state of running applications.

In this talk, I'm going to discuss some tips for designing better stateful systems for observability and robust deployment, gained from a project in which we built a business-critical WebSocket server to establish a secure long-lived tunnel connection, including:

  • Application design to provide insight into the internal state
  • Blue-green deployment, including business logic
  • Better architecture around stateful applications
  • Filterable logging

Attendees will gain reusable tips and reference examples when building a stateful system.

Kazuki Higashiguchi, Autify

Kazuki Higashiguchi is a Sr. Site Reliability Engineer at Autify, a company serving AI-powered testing automation platforms for web and mobile applications. His day job is building and serving robust and reliable backend infrastructure for handling a bunch of test executions for customers.

4:30 pm–6:30 pm

Conference Reception

Level 2 Foyer

Sponsored by Google

Friday, 9 December

8:00 am–9:00 am

Morning Coffee and Tea

Level 2 Foyer

9:00 am–10:00 am

Track 1

Grand Ballroom

Reliability Reviews in the Wild: Using Data to Drive Production Health

Friday, 9:00 am–9:30 am

Karthik Nilakant, Canva

"Reliability reviews" (also known as "production meetings", or "operational reviews") are regular opportunities for engineering teams (or groups of teams) that own production services to reflect on their operational health. Ideally, these reviews should be based on quantitative insights drawn from multiple sources (such as incidents, post-incident reviews, service levels and on-call alerts), to help teams objectively decide where to prioritize their efforts. In this talk, I aim to share my practical experiences in scaling out this practice across our organization. I'll also share some insights from our work in developing an automated reporting platform to support this practice, focusing on challenges in collating health data from multiple sources and mapping to service owners. This talk should help inform reliability practitioners or leaders that are considering increasing adoption of similar reviews in their own organizations.

Karthik Nilakant, Canva

Karthik is a coach in the Reliability Platform group at Canva and is currently based in Wellington, New Zealand. He's previously worked as a product manager, team lead and engineer within the SRE field.

Leveraging Continuous Production Profiling for Providing Insights into Service Performance

Friday, 9:30 am–10:00 am

Saurabh Badhwar, LinkedIn

Continuous Production Profiling allows engineering teams to gather insights into their services, helping them understand how their service performance is evolving and make data-backed decisions when optimizing a service or performing RCA for it. In this talk, you will learn about:

  • Why have always-on production profiling for your services
  • Managing overheads and discovering services dynamically
  • Extracting performance insights and making it easy for engineers to understand performance from profiling

Saurabh Badhwar, LinkedIn

Saurabh Badhwar is a Staff Software Engineer at LinkedIn in the Service Performance Insights team, where his team works on production Service Performance Insights and Root Cause Analysis tools with the help of tracing and production profiling techniques.

Outside of work, he is fond of writing and has authored two books on large-scale application development with Python.

Track 2

Hyde Park Room

Applying SRE Principles to CI/CD

Friday, 9:00 am–9:30 am

Mel Kaulfuss, Buildkite

The automation of building & testing code with CI/CD enables us to ship code easily and frequently with a high level of trust that bugs won't impact end-users. Why then are our CI/CD systems still often painfully slow, unreliable & our ability to deliver frequently blocked? Site Reliability Engineering (SRE) aims to reduce the pain caused by unhealthy platforms & processes that affect the reliability & stability of production systems. Join Buildkite's Mel Kaulfuss as she looks at CI/CD through the SRE lens. Learn how to define meaningful SLOs (service-level objectives) & SLIs (service-level indicators), and use error budgets to tune your test suites & pipelines to manage CI/CD infrastructure & processes just as you would production systems.
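To make the error-budget idea in this abstract concrete, here is a minimal arithmetic sketch. It is not taken from the talk: it assumes a hypothetical availability-style SLI, defined as the fraction of pipeline runs that finish without an infrastructure failure, and the function and field names are illustrative.

```python
def error_budget(slo_target: float, total_runs: int, failed_runs: int) -> dict:
    """Compute SLI and error-budget figures for one window of pipeline runs.

    slo_target is a fraction strictly between 0 and 1 (for example 0.95).
    All names here are hypothetical and exist only for this example.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_runs <= 0 or not 0 <= failed_runs <= total_runs:
        raise ValueError("need total_runs > 0 and 0 <= failed_runs <= total_runs")

    sli = (total_runs - failed_runs) / total_runs      # measured success ratio
    allowed_failures = (1.0 - slo_target) * total_runs  # the whole error budget
    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_runs,
        "budget_consumed": failed_runs / allowed_failures,
        "slo_met": sli >= slo_target,
    }
```

For example, with a 95% target over 200 runs, the budget is about 10 failed runs. Four failures would consume roughly 40% of it, while 12 failures would overspend it, which is the kind of signal a team might use to pause feature work and fix flaky tests or build infrastructure.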

Mel Kaulfuss, Buildkite

Mel is Lead Developer Advocate at Buildkite. She's spent the past decade delivering software wearing many hats: Software Engineer, Production Coordinator, Project Manager. She's driven by creating safe spaces, inclusive teams and communities. She's organised and emceed numerous RubyConfs in Australia, and most recently launched Buildkite's own developer conference, Unblock. When she's not clickity clacking, she's patting dogs, eating strawberries, learning German and watching Nordic Noir.

Gremlins Exposed: Shining a Light on Mischievous Systems

Friday, 9:30 am–10:00 am

Thomas Cuthbert

Despite our best efforts, the services we operate will occasionally show signs of mischievous gremlins, even if kept dry and fed before midnight!

Overcoming natural instinct is not easy; years of experience have served us well up until now. However, when faced with unknown and complicated gremlins, assumptions and reactive thinking lead to wasted time chasing red herrings.

While adhering to a structured and methodical process, together we'll develop our personal gremlin hunting style by running through some common scenarios and figuring out: What's the right tool for the job, Do I scratch the surface or dig deep, Where should I look for useful information, and How do I ask for help?

Thomas Cuthbert, Canonical

Thomas is an SRE at Canonical for the APAC region. His curiosity and drive to understand the deeper meaning of things has been a catalyst for the engineering of a personal problem solving style worth sharing!

10:00 am–10:30 am

Break with Refreshments

Level 2 Foyer

10:30 am–12:00 pm

Track 1

Grand Ballroom

Burnout at Scale: What to Try When You Just Can't

Friday, 10:30 am–11:30 am

Courtney Eckhardt

In SRE, we say (truthfully) that all problems are people problems, and people problems have not been scaling well during the pandemic. People are exhausted, they're still getting sick, there are still dramatic staffing shortages in our own workplaces and in core societal services, and some days it really feels like everything's coming apart at the seams.

The remedy that seems to come up most often is "be more empathetic", but as SREs, we know that "do more" is not actionable and doesn't constitute a plan. It's not very clear what it means to be more empathetic (what do you actually do?), or how it helps (what does it fix, for you or anyone else?).

So what should we be trying to do? Let's talk about exhaustion, loneliness, friendship, mutual aid, and creating and maintaining the human connections we need to sustain ourselves and our societies.

Courtney Eckhardt

Courtney comes from a background in customer support and internet anti-abuse policy. She combines this human-focused experience with the principle of Conway's Law and the work of Kathy Sierra and Don Norman into a wide-reaching and humane concept of operational reliability. You can find her knitting in the audience of conference talks, and she's always interested in cat pictures.

Backend API Design for SREs

Friday, 11:30 am–12:00 pm

Sam Dunster, Meta, Inc.

APIs are everywhere in our industry. Public clouds, internal systems, microservices—they all depend on APIs. Backend APIs in particular are often neglected, leading to increased operational burden for SREs.

In this talk I will explain why good API design is important for SREs. We will cover 5 useful tips with examples that will help reduce operational toil, clarify your SLA and speed up your team's incident investigations.

Sam Dunster, Meta, Inc.

Sam Dunster is a Production Engineer at Meta. He has worked in a variety of infrastructure teams and dealt with his fair share of major production incidents. These days he spends his time bringing flash into Meta's largest storage systems to provide extremely low latency storage for Meta's largest key-value storage platforms with a focus on reliability and high quality backend systems design.

Track 2

Hyde Park Room

Online Database Reliability, Performance, and Consistency Engineering

Friday, 10:30 am–11:30 am

Yoshinori Matsunobu, Meta

Operating online databases that serve user-facing workloads at scale is challenging. What can you do to prevent database outages? During incidents, how can you find root causes and mitigate quickly?

Databases can fail for various reasons, and you need to find root causes (such as offending queries or diffs) and mitigate quickly. Understanding various database reliability and performance practices will help, such as knowing the common causes of database outages, indexing, and how the query optimizer works.

A database has another obvious, but hard to guarantee, requirement: it should not lose data or return wrong data. Users will treat the system as "unreliable" if database consistency or correctness is lost. How can we continuously verify that our data is correct in production?

In this session, the speaker will show several database reliability and correctness issues that may happen in production. For each issue, the speaker will explain what kinds of workarounds can help to debug or mitigate it, and will share common performance and reliability practices.

Yoshinori Matsunobu, Meta

Yoshinori Matsunobu is a Production Engineer at Meta, specializing in online databases in production. Yoshinori has over 20 years of database industry experience, mainly with MySQL and more recently with RocksDB. Yoshinori created several essential open source products and operated them in production, such as MyRocks, quickstack, and MHA. Yoshinori has spoken at many conferences and shared database practices with communities. Yoshinori has received multiple database industry awards, such as Lifetime Open Source Database Contributor from Percona (2022) and an Honorable Mention from VLDB for the MyRocks paper (2020).

Migrating Datastores

Friday, 11:30 am–12:00 pm

Tessa Bradbury, Buildkite

As software engineers, we're often trying to keep up with growing products and new technologies. This means we often need to move data from an old datastore to a new one while the data is still in use. Unsurprisingly, this comes with some challenges.

Tessa will use a recent Redis migration at Buildkite as a case study to explore:

  • The different types of data you might be working with
  • Strategies to safely and efficiently migrate each type
  • How to test: locally, in CI, and in prod

Also, communicating the value of these types of infrastructure projects outside of engineering can often be difficult. Tessa will explain some of the team's strategies to ensure the value of this work was visible across the wider organization.

Tessa Bradbury, Buildkite

Tessa is a Staff Software Engineer at Buildkite, a CI/CD platform. She is part of the team ensuring build agents run the right code at the right time and all the necessary record keeping is done so everyone can see the results. She's also particularly interested in how we learn and share knowledge, especially as we navigate the increasingly complex world of software development.

12:00 pm–1:20 pm

Lunch

Level 2 Foyer

1:20 pm–2:50 pm

Track 1

Grand Ballroom

Our Experience Tracking and Driving SLO Adoption at Goldman Sachs

Friday, 1:20 pm–2:20 pm

Jordan Li and Ivan Ryabov, Goldman Sachs

SLI/SLO is the fundamental building block for any SRE organization. It creates a language to communicate service reliability. However, applying it in practice and at scale poses some interesting challenges:

  • Challenges for Developers
    • How to move from telemetry to user-centric metrics
    • How to implement a monitoring strategy to measure SLIs for different types of services
    • How to define SLO and the math behind it
  • Challenges for Stakeholders
    • How to interpret and evaluate SLO

At Goldman Sachs, we tackle this problem with the "SLO Repository", a tool we built with open-source technologies to drive SLO adoption. It makes SLOs discoverable, which supports an ongoing feedback loop, and easy for stakeholders to understand.

This talk will convey Goldman Sachs' key lessons learned while driving SLO adoption across the organization and showcase the ecosystem we built around SLO adoption.

Jordan Li, Goldman Sachs

Jordan Li is a Software Engineer and Site Reliability Engineer at Goldman Sachs, focusing on building tools for observability and SLO adoption. Prior to that, he was a network engineer and cloud engineer at HKT. Besides his day job, you can find him doing card tricks.

Operationalizing ML Training Infra at Meta Scale

Friday, 2:20 pm–2:50 pm

Shivam Bharuka, Meta

Machine learning models are growing rapidly in scale to support recommendation and content understanding use cases. To keep up with this growth, we have re-architected the entire AI Infrastructure stack, from creating specialized hardware using powerful GPUs and network devices to designing optimized distributed training algorithms using PyTorch. Traditional reliability practices do not translate well to detecting problems in the ML training stack. In this talk, I will cover the challenges we encountered and the approach we took to redesign and scale reliability for the ML Training Platform.

Shivam Bharuka, Meta

Shivam has been an engineering leader on the AI Infrastructure team at Meta for the last three years. During this time, he has helped scale the machine learning training infrastructure at Meta to support large-scale ranking and recommendation models, serving more than a billion users. He is responsible for driving performance, reliability, and efficiency-oriented designs across the components of the ML training stack at Meta. Shivam holds a B.S. and an M.S. in Computer Engineering from the University of Illinois at Urbana-Champaign.

Track 2

Hyde Park Room

Advanced Linux Kernel Networking Monitoring

Friday, 1:20 pm–2:20 pm

Jizhong Jiang and Shane Xie, Alibaba Cloud

The Linux networking stack is very complex, especially when containers are involved. Heavy use of Linux netns, virtual devices, IPVS, and iptables makes network jitter more likely to occur inside the node and more difficult to troubleshoot. As a cloud Kubernetes provider, we have seen all kinds of networking issues, so we developed a kernel monitoring tool named net-exporter to help us determine their cause.

In this talk, we will describe how Linux container networking works, what happens when a network packet is sent or received, the common networking issues in container environments, and their root causes. We will also illustrate the implementation of net-exporter and how to use it to troubleshoot network issues.

Jizhong Jiang, Alibaba Cloud

Jizhong Jiang is a staff engineer at Alibaba Cloud, mainly responsible for container networking and AIOps. He has worked on containers for more than 8 years.

Shane Xie, Alibaba Cloud

Shane Xie is a senior engineer on the Container Service for Kubernetes (ACK) team at Alibaba Cloud. He has devoted most of his time to container networks and all sorts of observability around them.

Using the Internet as Your Load-Balancer

Friday, 2:20 pm–2:50 pm

Martin Barry

Load-balancing happens at many layers in the infrastructure of the services we all use and work on. Your edge team or CDN needs to think about load-balancing on a global, internet-wide scale. This talk will help you understand how traffic is balanced onto active/active infrastructure using DNS and BGP routing.

Martin Barry, Fastly

Martin is the Interconnection Manager at Fastly, where he focuses on the delivery of new edge capacity to add to their 215 Terabit network (as of June 2022). He has over two decades of experience in technical operations, starting with system and network administration, through to the design, deployment and operation of large-scale internet infrastructure.

2:50 pm–3:20 pm

Break with Refreshments

Level 2 Foyer

3:20 pm–4:20 pm

Closing Plenary Session

Grand Ballroom

A Post Incident Review Review

Friday, 3:20 pm–4:20 pm

Tom Partington, ANZx

Our post incident process is a little different to most, and mainly because of what it doesn't include rather than what it does.

We don't identify a root cause, we don't create or track action items, and we don't report on incident counts or MTTRs. We also work in a highly regulated industry, in a 1000+ person organisation, and repeat incidents are rare.

In this talk I'll discuss how we developed this process, the reasons why, and look at the safety science behind the concepts.

Tom Partington, ANZx

Tom has held many different titles, the most recent of which is Site Reliability Engineer, but despite what the job was called, the work has generally remained the same: trying to keep systems running with sticky-tape and glue (and not always successfully). Understanding how things break became somewhat of an obsession, and he fell deep into the safety science rabbit hole, where he discovered that many other industries have been studying accidents and human performance since long before these computer things became popular.

4:20 pm–4:30 pm

Closing Remarks

Grand Ballroom

Liz Fong-Jones, Honeycomb, and Jamie Wilkinson, Google