SREcon18 Asia/Australia Conference Program

Wednesday, 6 June 2018

8:00 am–9:00 am

Continental Breakfast

Grand Ballroom Foyer

9:00 am–9:10 am

Opening Remarks

Grand Ballroom

Program Co-Chairs: Paul Cowan, Google, and Xiao Li, LinkedIn

9:10 am–11:00 am

Opening Plenary Session

Grand Ballroom

The Evolution of Site Reliability Engineering

Wednesday, 9:10 am–10:05 am

Benjamin Purgason, LinkedIn

Available Media

Few companies invest in SRE before there is a raging operational fire on their hands. As a result SREs often start out as firefighters, desperately trying to keep the company alive for one more day. Once we put out the fires and our site is safe we can begin evolving—but where to?

In this talk we share the five distinct stages of SRE evolution. We’ll also cover how roles and responsibilities shift between site reliability engineers and software engineers over time, the relationship dynamics that impede progress, and the changes in mindset that must occur along the way.

Join us, and we’ll help you transform the reactive growth of your SRE team into directed evolution.

I craft culture and code that supports Site Up, Site Secure, and LinkedIn's ability to change.

Safe Client Behaviour

Wednesday, 10:05 am–11:00 am

Ariel Goh, Google

Available Media

Ubiquitous compute power has created frequent impedance mismatches between client capabilities and server capacity in distributed systems. Careful client behaviour design protects the server from unintended load and enables safe recovery after outages. These techniques improve resiliency both in microservice environments (where they protect microservices from each other) and in more traditional client-server environments (where a large number of clients, such as mobile phone apps, might be stacked against a comparatively small number of servers).

Attendees will learn how to identify types of requests that are potentially unsafe. They will learn about the effects of unsafe client behaviour on the server, demonstrated with pseudocode samples and simulations of the behaviour. They will learn how to modify client behaviour with techniques like back-off and jitter to achieve better results on the server side.
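As a rough illustration of the back-off and jitter techniques mentioned above, here is a minimal Python sketch. It is not taken from the talk: the function names (backoff_delays, call_with_retries) and the parameter values are assumptions chosen for illustration. It uses capped exponential back-off with "full jitter", where each delay is drawn uniformly between zero and the current upper bound.

```python
import random
import time


def backoff_delays(base=0.1, cap=10.0, attempts=5, rng=random):
    """Yield one randomized delay (in seconds) per retry attempt.

    The upper bound doubles on each attempt (exponential back-off) up to
    `cap`, and the actual delay is drawn uniformly from [0, bound] (full
    jitter) so that many clients recovering at once do not retry in lockstep.
    """
    for attempt in range(attempts):
        bound = min(cap, base * (2 ** attempt))
        yield rng.uniform(0.0, bound)


def call_with_retries(request, is_retryable, sleep=time.sleep, **backoff_args):
    """Call `request()` and retry retryable failures with jittered back-off.

    `request` and `is_retryable` are hypothetical callables supplied by the
    caller. The original exception is re-raised once the retries run out or
    when the error is not retryable.
    """
    delays = backoff_delays(**backoff_args)
    while True:
        try:
            return request()
        except Exception as error:
            delay = next(delays, None)
            if delay is None or not is_retryable(error):
                raise
            sleep(delay)
```

Bounding the number of attempts matters as much as the jitter: a client that retries forever keeps adding load to a server that is already struggling.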

11:00 am–11:30 am

Break with Refreshments

Grand Ballroom Foyer

11:30 am–11:55 am

Track 1

Grand Ballroom 1

Service Monitoring Manual—2018 Edition

Wednesday, 11:30 am–11:55 am

Nikola Dipanov, Facebook

Available Media

Monitoring, a.k.a. figuring out what production code is doing, is extremely important for an SRE organization. Monitoring services the right way can have a profound impact on how we do SRE. Modern software systems can be incredibly complex: code runs on thousands of machines, depends on services we don't control, and also runs on user devices. Observing the behavior of such systems means we have to change how we think about monitoring.

This talk will go over what a modern monitoring infrastructure for running software at scale looks like:

Asking the right questions—how to decide what to monitor
Types of data we want to collect and what answers it can help us find
A look at how we build services at Facebook
Collecting, storing and querying monitoring data at scale
When things go wrong—what makes for a good alarm and what makes a bad one
Putting it all together—debugging an outage using data

As an attendee, you will come out of the talk with fresh ideas about logging and monitoring. You will hear how we tackle these problems at Facebook, and why we do things the way we do.

Nikola has spent the last 2 years at Facebook as a Production Engineer, based in Dublin, Ireland. Prior to that he worked as a Software Engineer in several industries, ranging from small internet startups to large Telcos. Due to this prolonged exposure to production software, he's developed an acute need to measure and verify.

Track 2

Grand Ballroom 2

Introduction to Alibaba Monitoring System

Wednesday, 11:30 am–11:55 am

Ren Xinchi

Available Media

We all know that monitoring is one of the most important topics in the field of DevOps, but it also brings pain of its own, such as alarm storms and high deployment costs. In this talk, we will share how Alibaba deals with these problems.

At Alibaba, hundreds of major KPIs have been defined to measure the running status of the business. We have created a CMDB for business monitoring, called Hammurabi, which records business function points together with their priority levels and stakeholders. We map business monitoring onto this CMDB, so that when an alarm fires we can quickly confirm the business impact from the trend of the business indicator and mount an emergency response. An intelligent business monitoring method based on time series analysis is already in use, and it recently helped improve alarm accuracy from a baseline of 20% to 80%.

A case study covering Taobao and Youku will show how business monitoring contributes to the optimization of operations and the business.

I am a senior maintenance engineer from Alibaba group. Now I am taking charge of the business monitoring for all the business units of Alibaba.

Track 3

Kingfisher Room

How Atlassian Is Tackling Error Budgets, Agile Style

Wednesday, 11:30 am–11:55 am

Gui Vieiro, Atlassian

Available Media

Striking a balance between feature delivery and reliability is a challenge many organizations face. Error Budgets, the practice of blocking feature releases when a service fails to meet Service Level Objectives, is an effective way of achieving the right balance. However, getting buy-in to adopt Error Budgets can be difficult, as many see the process as heavy-handed. This talk explores the agile approach Atlassian took in adopting Error Budgets to reap benefits quickly and avoid impacting dev velocity. Come hear about what compelled us to give Error Budgets a try, how we went about it, the challenges we faced, the results to date, and where we are going next with Error Budgets.
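To make the idea concrete, an error budget can be read as the amount of failure an SLO still permits within a given window. The sketch below is an illustration only and not Atlassian's tooling; the function names and the request-based way of counting failures are assumptions.

```python
def error_budget_remaining(slo_target, total_requests, failed_requests):
    """Return the fraction of the error budget still unspent (may be negative).

    `slo_target` is the required success ratio, e.g. 0.999 for "99.9%".
    The budget is the number of failures the SLO allows in the window.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests == 0:
        return 1.0  # no traffic yet, so the budget is untouched
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


def releases_allowed(slo_target, total_requests, failed_requests):
    """Feature releases proceed only while some error budget remains."""
    return error_budget_remaining(slo_target, total_requests, failed_requests) > 0.0
```

For example, with a 99.9% target and one million requests in the window, roughly 1,000 failures are allowed; 250 failures leave about three quarters of the budget, while 1,500 failures would block releases.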

For more than a decade Gui Vieiro has positioned organizations for success by growing and coaching technologists, developing technology strategies, and establishing practices that lead to operational excellence. Joining Atlassian in 2016 gave Gui the opportunity to enter the world of SRE and drive reliability initiatives for the benefit of millions of customers worldwide.

11:55 am–1:25 pm

Luncheon

Waterfront & Riverfront Ballrooms
Sponsored by Baidu

1:25 pm–2:50 pm

Track 1

Grand Ballroom 1

Building SRE: Culture from the Outside In

Wednesday, 1:25 pm–1:50 pm

Todd Palino, LinkedIn

Available Media

Many companies want to grow a site reliability engineering team, but first need to ask “Is my company ready for SRE?” Taking the lid off one of the most mature SRE organizations, I’ll describe the cultural tenets that provide the foundation for the high-trust and inclusive environment that any SRE team needs in order to exist and evolve.

Todd Palino is a Senior Staff Site Reliability Engineer at LinkedIn, tasked with keeping Zookeeper, Kafka, and Samza deployments fed and watered. He is responsible for architecture, day-to-day operations, and tools development, including the creation of an advanced monitoring and notification system. Previously, Todd was a Systems Engineer at Verisign, developing service management automation for DNS, networking, and hardware management, as well as managing hardware and software standards across the company.

In his spare time, Todd is the developer of the open source project Burrow, a Kafka consumer monitoring tool, and can be found sharing his experience on both SRE and Apache Kafka at industry conferences and tech talks. He is also the co-author of Kafka: The Definitive Guide, now available from O'Reilly Media. When everything else is not keeping him busy, you'll find him out on the trails, training for his next marathon.

Quantifying Empathy with Service Level Objectives

Wednesday, 1:55 pm–2:50 pm

Ketan Gangatirkar, Indeed, Inc.

Available Media

The goal of a Site Reliability Engineer is to create a reliable, scalable, performant service. To drive the right results, you need the right measures. Enter Service Level Objectives.

A good SLO starts with your customers. Next, understand the role your service plays in your business. Then turn that understanding into numbers. Dive into the data to discover where diminished reliability and performance harm user experience and business results. Only then can you set good objectives.

Good SLOs will mirror your users' and business needs. When your users hurt, you want to know it. By feeling your users' pain you can discover the levels of service that hit the sweet spot of reliability, product success, and operability. It's empathy, but measured and quantified.

This talk will walk you step by step through finding, gathering, and understanding the inputs that inform good SLOs. We will review different classes of SLO and when to use them, with real examples to guide you. When you leave this talk, you'll be armed with practical advice to measure, validate, and improve the reliability of your products.

Ketan Gangatirkar is the Vice President of Engineering for Indeed's Job Seeker products. For the last 9 years, he's been helping millions of people get jobs. He has broken Indeed's site in dozens of different and creative ways over the years and has finally learned what not to do. For a time, he was responsible for the SRE organization at Indeed, helping the company evolve from centralized operations to a faster, more independent, and more scalable model, so that people like Ketan can't break the site anymore.

Connect:

@ketang

Track 2

Grand Ballroom 2

Doing Things the Hard Way

Wednesday, 1:25 pm–1:50 pm

Chris Sinjakli, SRE at GoCardless

Available Media

Our discipline is one of tropes and maxims—the commoditisation of infrastructure, the golden signals of monitoring, the breaking down of barriers spurred by DevOps.

Surely there are mistakes we won't make again. Surely we've left the bad times behind.

Some mistakes are just too tempting to avoid.

Motivated by examples from GoCardless—a company founded in 2011—we'll explore three failure modes:

dividing product and infrastructure teams early in the company's life
pinning our hopes on the big rework that never arrives
forgetting the basics of SRE while seeking out hard problems

We'll explore what makes each failure mode so tempting, what it might look like if you're experiencing it, and approaches to dig yourself out.

Chris enjoys all the weird bits of computing that fall between building software users enjoy and running distributed systems reliably. All his programs are made from organic, hand-picked, artisanal keypresses.

Connect:

@ChrisSinjo

Achieving Observability into Your Application with OpenCensus

Wednesday, 1:55 pm–2:50 pm

Emil Mikulic, OpenCensus

Available Media

Application metrics and distributed traces are immensely powerful for developers, but are difficult to retrieve automatically. Based on the same technology used at Google, OpenCensus is an open source project that aims to make the collection and submission of app metrics and traces easier for developers.

In this talk you will learn about:

The benefits of traces and metrics, and how we use them at Google
The case for a common instrumentation implementation
An architectural overview of OpenCensus, including integrations and exporters
Introspection via z-pages
Our vision of the future for instrumentation

While the Census project originates from Google, it has evolved into an open source collaboration between multiple cloud and APM vendors and the OSS community, and already supports Prometheus, Zipkin, Stackdriver, and SignalFx.

See opencensus.io for more details.

Emil works on the C++ implementation of OpenCensus. Previously he was an SRE at Google, and also worked on Google's internal Census implementation.

Track 3

Kingfisher Room

Efficient Trouble Shooting of Service Failures with Multi-Tag Data Analysis

Wednesday, 1:25 pm–1:50 pm

Xuan Cao and Junfang Jiang, Baidu

Available Media

One of the most important jobs for SREs is troubleshooting the problems that cause KPI degradation, such as a decrease in page views (PV), advertising income, or feed click rate.

Many problems affect only a portion of the incoming traffic. If the on-call engineers can learn the characteristics of the affected portion, such as its traffic source area, browser type, or access network standard, then diagnosis can be accelerated.

Therefore, we mark each user request with a set of tags. When a failure happens, we look for what the faulty requests have in common. Tagging generates a huge amount of data, which widens the search scope and makes manual troubleshooting inefficient, so automatic analysis is imperative.

In this talk, we will present our work at Baidu, where we apply machine learning techniques to recommend the tags most relevant to the failure. This approach adopts unsupervised anomaly detection and entropy-based dimension reduction techniques, which can automatically recommend key data features for troubleshooting. The proposed approach has been validated on hundreds of real cases. It significantly speeds up the troubleshooting procedure when compared to traditional approaches.
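One simple way to picture entropy-based tag recommendation is to rank tag dimensions by how much knowing the tag value reduces uncertainty about whether a request failed (information gain). The sketch below illustrates that general idea only. It is not Baidu's algorithm, and the tag names and request structure are made up for the example.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty sequence of labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def rank_tags(requests):
    """Rank tag dimensions by information gain about request failure.

    `requests` is a list of (tags, failed) pairs, where `tags` maps a
    dimension name to its value and `failed` is a bool. Dimensions whose
    values separate failed from healthy requests best come first.
    """
    outcomes = [failed for _, failed in requests]
    base = entropy(outcomes)
    dimensions = {name for tags, _ in requests for name in tags}
    gains = {}
    for name in dimensions:
        groups = defaultdict(list)
        for tags, failed in requests:
            groups[tags.get(name)].append(failed)
        conditional = sum(
            len(group) / len(requests) * entropy(group)
            for group in groups.values()
        )
        gains[name] = base - conditional
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)
```

If every failing request shares one browser type while failures are spread evenly across regions, the browser dimension gets the full information gain and the region dimension gets roughly zero, so the browser tag is recommended first.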

Autonomous Workload Rebalancing in Kafka

Wednesday, 1:55 pm–2:50 pm

Indrajeet Kumar, LinkedIn

Available Media

Kafka at LinkedIn processes over 3 trillion messages a day with over 2000 Kafka brokers. At such a scale, maintaining a balanced workload on Kafka clusters as they go through irregular traffic patterns and hardware failures is a daunting task. SREs at LinkedIn expend significant time and effort handling these curveballs and making sure the hardware resources are utilized evenly, which made it evident that intelligent automation was crucial to scale any further. This talk outlines LinkedIn’s approach to solving this problem with the help of Kafka Cruise Control.

Indrajeet is part of the Kafka SRE team at LinkedIn. He builds tools and automation to help manage the Kafka ecosystem at LinkedIn.

2:50 pm–3:20 pm

Break with Refreshments

Grand Ballroom Foyer

3:20 pm–5:15 pm

Track 1

Grand Ballroom 1

Know Thy Enemy: How to Prioritize and Communicate Risks

Wednesday, 3:20 pm–4:15 pm

Matt Brown, Google

Available Media

Every SRE team attempting to manage, mitigate or eliminate the risks facing their system will encounter two fundamental problems:

As humans our intuitive judgement about risk is unreliable.
The work required to address all potential risks far outstrips our available time and resources.

The CRE team (Customer Reliability Engineering—a group of Google SREs who partner with cloud customers to implement SRE practices in their application and across the cloud provider/customer relationship) battles these challenges every day in our interactions with customers. We have drawn on Google’s deep experience managing reliable systems, and the broader field of risk management techniques to develop a process that allows us to communicate an objective ranking of risks and their expected cost to a system. This ranking and the associated cost data can then be used as an input to team and business decision making.

This talk will cover the development of our process, explain how anyone can apply it to any system today and demonstrate how the resulting ranking and costs provide objective, consistent data which can take the tension and subjectivity out of often tense discussions around work priorities and focus (e.g. more features or more reliability?).
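As an illustration of ranking risks by expected cost, the sketch below multiplies how often a risk is expected to occur by how much it costs each time. This is a generic expected-value calculation, not the CRE team's actual process; the field names, the "bad minutes" unit, and the sample figures are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    incidents_per_year: float    # estimated frequency
    minutes_per_incident: float  # time to detect plus time to repair
    fraction_of_users: float     # share of users affected, 0.0 to 1.0

    @property
    def expected_bad_minutes(self):
        """Expected user-weighted outage minutes per year."""
        return (
            self.incidents_per_year
            * self.minutes_per_incident
            * self.fraction_of_users
        )


def rank_risks(risks):
    """Return risks sorted from highest to lowest expected yearly cost."""
    return sorted(risks, key=lambda r: r.expected_bad_minutes, reverse=True)


# Made-up sample figures, for illustration only.
risks = [
    Risk("expired certificate", 0.5, 90, 1.0),
    Risk("bad config push", 4, 30, 1.0),
    Risk("single-zone outage", 1, 240, 0.25),
]
for risk in rank_risks(risks):
    print(f"{risk.name}: {risk.expected_bad_minutes:.0f} bad minutes/year")
```

Writing the estimates down in a common unit is the point: a frequent but short incident and a rare but long one can be compared directly rather than argued over by intuition.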

Matt began his SRE career with Google in Dublin in 2007, shifted to London in 2012, and since 2016 has worked remotely from Cambridge in New Zealand. During this time Matt has worked on and led a diverse range of SRE teams, from Google's internal corporate infrastructure through to the Internet-facing load-balancing infrastructure that keeps Google fast and always available. His current role with the Customer Reliability Engineering team is pioneering how to apply SRE practices across organisations to address the challenges posed by today's world where the traditional boundaries between platforms and their customers are being blurred.

Connect:

@xleem

Data Visualization for SREs—an Essential Skill for Quick Debugging

Wednesday, 4:20 pm–4:45 pm

Yash Shah, LinkedIn

Available Media

SREs are software engineers with a broad skill set who work with systems in general. Depending on the type of work and the team, much of our time is spent correlating incidental data to determine the causes of issues. While we use ELK, Splunk, etc. to visualize our logs, it is also an essential skill to parse a log file by hand and visualize it to make useful observations quickly. Many times, we end up writing APIs and command-line shortcuts to accelerate our debugging. We can make use of some of the techniques I’ll show to visualize this data quickly.

Data is in abundance. Where to focus requires sets of skills. Visualizing data is one of these essential skills.
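As one example of parsing a log by hand and visualizing it quickly, the sketch below buckets error lines per minute and prints a text histogram. It is not from the talk; the log format (a timestamp followed by a level) and the sample lines are assumptions.

```python
import re
from collections import Counter

# Assumed line format: "YYYY-MM-DD HH:MM:SS LEVEL message..."
LINE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2} (\w+) ")


def errors_per_minute(lines, level="ERROR"):
    """Count lines of the given level, keyed by 'YYYY-MM-DD HH:MM'."""
    counts = Counter()
    for line in lines:
        match = LINE.match(line)
        if match and match.group(2) == level:
            counts[match.group(1)] += 1
    return counts


def text_histogram(counts, width=40):
    """Render counts as one bar of '#' per key, scaled to `width` columns."""
    if not counts:
        return ""
    peak = max(counts.values())
    rows = []
    for key in sorted(counts):
        bar = "#" * max(1, round(counts[key] * width / peak))
        rows.append(f"{key} {bar} {counts[key]}")
    return "\n".join(rows)


sample = [
    "2018-06-06 16:20:01 INFO request served",
    "2018-06-06 16:20:07 ERROR upstream timeout",
    "2018-06-06 16:21:02 ERROR upstream timeout",
    "2018-06-06 16:21:30 ERROR upstream timeout",
    "2018-06-06 16:21:45 ERROR connection reset",
]
print(text_histogram(errors_per_minute(sample), width=10))
```

A terminal-only histogram like this needs no dashboard or plotting library, which is useful when the only access available during an incident is a shell on the affected host.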

Sr. Engineer, site reliability @LinkedIn. Fond of oneliners and automating automation.

Connect:

@yashness_

You Can't Stop Fires with an Ambulance

Wednesday, 4:50 pm–5:15 pm

Piers Chamberlain, Xero

Available Media

SRE is often perceived as an emergency response function, dealing with incidents and restoring system health. While it is true that there is much useful work done in this space, reactive processes impact only the MTTR, not the MTBF. Once the low-hanging fruit of detection and remediation improvements is gone, further improvement takes more and more investment. At this point, I'd argue, it is time to start preventative work.

We saw some reduction in incident rates through establishing a Post-Mortem process but these often involved only the poor souls who happened to have been called upon to help triage and fix the issue. Realisation dawned on me that attempting to evangelise to engineers with little influence on the balance of functional/non-functional development effort was going to have limited success.

By embarking on a campaign evangelising reliability (similar to the way forest fire prevention or health promotion campaigns might work) and targeting the right level within the organisation, we're seeing positive cultural and behavioural changes and better operational morale, and we believe it will result in fewer severe outages.

Alternately breaking and fixing things since I un-boxed my Commodore 64

Track 2

Grand Ballroom 2

Comprehensive Container-Based Service Monitoring with Kubernetes and Istio

Wednesday, 3:20 pm–4:15 pm

Fred Moyer, Circonus

Available Media

Operating containerized infrastructure brings with it a new set of challenges. How do you instrument containers? How do you evaluate API endpoint performance? How do you identify bad actors in your infrastructure?

The Istio service mesh enables instrumentation of APIs without code change. Istio provides service latencies for free; how can you make sense of all that data? With math, that’s how. I will demonstrate the use of mathematical techniques to ask and answer business queries. I’ll show how to create RED (Rate, Errors, Duration) dashboards that provide insight into API performance; they are essential for meeting service level objectives. I’ll also show how to monitor at scale cost-effectively with histograms, which preserve metric fidelity and enable statistical analysis.

This talk is targeted at K8s developers and SREs who are faced with the challenge of reporting to business decision makers. Attendees will come away with the know-how to answer the questions posed in this description, an understanding of their infrastructure performance, and the ability to determine whether they are under- or over-provisioned.
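To illustrate why histograms are useful here, the sketch below stores latencies as bucket counts and estimates a percentile from them. Because bucket counts can simply be added, data from many hosts can be merged and still answer percentile questions afterwards, which is not possible with pre-computed averages. This is a generic fixed-bucket illustration, not the Circonus or Istio implementation; the class name and bucket boundaries are assumptions.

```python
import bisect


class LatencyHistogram:
    """Fixed-bucket histogram; each bucket counts values <= its upper bound."""

    def __init__(self, upper_bounds_ms):
        self.bounds = sorted(upper_bounds_ms)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is overflow

    def record(self, latency_ms):
        self.counts[bisect.bisect_left(self.bounds, latency_ms)] += 1

    def merge(self, other):
        """Combine with a histogram from another host with the same buckets."""
        if self.bounds != other.bounds:
            raise ValueError("bucket boundaries differ")
        merged = LatencyHistogram(self.bounds)
        merged.counts = [a + b for a, b in zip(self.counts, other.counts)]
        return merged

    def percentile(self, p):
        """Upper bound of the bucket containing the p-th percentile (0-100)."""
        total = sum(self.counts)
        if total == 0:
            raise ValueError("histogram is empty")
        threshold = p * total / 100.0
        running = 0
        for index, count in enumerate(self.counts):
            running += count
            if count > 0 and running >= threshold:
                if index < len(self.bounds):
                    return self.bounds[index]
                break
        return float("inf")  # the percentile falls in the overflow bucket
```

The answer is only as precise as the bucket boundaries, so the buckets should be chosen with the service level objective thresholds in mind.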

Fred implemented the first external metrics adapter for the Istio service mesh to monitor Docker-based services using Circonus. He is actively involved in connecting with Circonus' users and engineers at a technical level, as well as developing code bridges between Circonus and external systems. Fred is a recovering Perl and C programmer, and these days likes to hack in Go and is learning Lua. He is a 2013 White Camel award winner and an Apache Software Foundation member, and works as an engineer for Circonus.

Connect:

@phredmoyer

How to Make Releases Safer in Baidu

Wednesday, 4:20 pm–4:45 pm

Pingping Xue and Yu Chen, Baidu

Available Media

Changes/updates are a major source of service faults. At Baidu, around 54% of faults are introduced by changes. As a result, progressive rollout becomes imperative to improve service stability. Progressive rollout divides the deployment process into several stages. Each stage deploys the change to only a subset of the instances. Checks are applied between consecutive stages to detect faults. If a fault is detected, the deployment is terminated and rolled back.

Intuitively, we can build a rollout system that enables development engineers to specify checking rules for each stage. Surprisingly, however, the developers are not good at this, although they are the creators of the modules. Therefore, the reliability engineers are forced to add rules on stability indicators. But this leads to numerous false alarms, frequently stalling the release procedure. As a result, we turned to machine-learning-based methods. To obtain satisfactory results, the algorithm must be able to learn the “normal” changes of each indicator and quantitatively measure current changes to decide whether there are faults.
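The staged process described above can be sketched as a loop that deploys to a growing subset of instances, runs a check after each stage, and rolls back on failure. This is a simplified illustration, not Baidu's rollout system: deploy, rollback, and healthy are hypothetical callables supplied by the caller, and the stage percentages are assumptions.

```python
def progressive_rollout(instances, deploy, rollback, healthy,
                        stage_percents=(1, 10, 50, 100)):
    """Deploy in stages; return True on success, False if rolled back.

    After each stage, `healthy(updated)` inspects the instances updated so
    far. If it reports a fault, every updated instance is rolled back and
    the deployment stops.
    """
    updated = []
    for percent in stage_percents:
        # Cumulative target for this stage; always at least one instance.
        target = max(1, len(instances) * percent // 100)
        for instance in instances[len(updated):target]:
            deploy(instance)
            updated.append(instance)
        if not healthy(updated):
            for instance in reversed(updated):
                rollback(instance)
            return False
    return True
```

The hard part in practice is the healthy check itself: rules that are too strict stall releases with false alarms, and rules that are too loose let faults through.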

In this talk, we will present several real cases to demonstrate the dilemma we confront in rollout checking, and how the machine learning algorithm works.

Pingping Xue is a Senior SRE in the SRE Department of Baidu. She has worked on release efficiency and stability for four years. She helped to construct the progressive rollout mechanisms and avoid many release faults. Her work improved Baidu's release efficiency and significantly accelerated product iteration.

Yu Chen is a Data Architect at the IOP group of Baidu's Cloud Unit. His work focuses on service stability issues, including alerting and diagnosis. Previously, he worked at Microsoft Research Asia. His research interests are distributed systems, consensus protocols, search ranking and query recommendation.

Cultural Nuance and Effective Collaboration for Multicultural Teams

Wednesday, 4:50 pm–5:15 pm

Ayyappadas Ravindran, LinkedIn

Available Media

What is considered "good communication" differs between cultures. In some cultures "good communication" means being as explicit as possible, and the responsibility for conveying the message is on the person communicating. In other cultures, "good communication" is more implicit, and the responsibility for understanding the message falls on the person receiving it.

When I started working with Americans, I often heard in meetings “What do you mean?”. In India, this is considered rude, as the question is perceived as a challenge to what was said. However, the person is really just looking for more information. It would have been better if the person said, "I didn’t quite understand". In America, the former question is considered “good communication” because the person was being direct and explicit, but in India, this question would have been perceived as rude.

Americans are also taught to give negative feedback in a positive frame. In European countries, good feedback is direct feedback. In a situation where an American manager may be giving negative feedback to a European employee, there is a high probability that the employee won't understand the feedback since they are used to receiving direct feedback. This may leave the manager wondering why the employee has not improved even after they received the feedback.

Hailing from Kerala, India, I have around 15 years of experience in the IT industry. Based out of Bangalore, India, I have led the Grid SRE team at Yahoo! and the Data SRE team at Intuit. Currently, I lead all of LinkedIn's Data SRE teams in Bangalore. I am deeply passionate about SRE and anything data.

Track 3

Kingfisher Room

Call to ARMs: Adopting an arm64 Server into x86 Infrastructure

Wednesday, 3:20 pm–4:15 pm

Ignat Korchagin, Cloudflare

Available Media

Over the years Cloudflare have built a huge network: today we have over 5000 servers in 120 data centers around the world and operate a dozen workloads in our infrastructure, including Nginx, Kafka, Mesos, ClickHouse, Prometheus, ElasticSearch, Hadoop, etc. But with all the diversity in software stacks, our hardware fleet still shared one common thing: CPU architecture. Like most other SaaS companies around the world, we exclusively used x86-based CPUs.

Recently we began exploring using a second CPU architecture for our hardware. The obvious choice was ARM, because it is the second most popular architecture in the world thanks to the boom of the mobile and IoT markets. Initially we wanted to evaluate the cost-effectiveness of running a different CPU architecture in our data centers and how easy it is to avoid vendor lock-in. The recently published Meltdown and Spectre attacks gave us even more reasons to pursue this goal.

After preliminary tests and some synthetic benchmarks produced promising results, the question became: so what’s next? The answer: we need to put this into the wild...

The talk provides an overview of potential steps and pitfalls of adopting a second CPU architecture in your cloud.

Ignat is a systems engineer at Cloudflare working mostly on platform and hardware security. Ignat’s interests are cryptography, hacking, and low-level programming. Before Cloudflare, Ignat worked as senior security engineer for Samsung Electronics’ Mobile Communications Division. His solutions may be found in many older Samsung smart phones and tablets.

Connect:

@secumod

Randomized Load Balancing, Caching, and Big-O Math

Wednesday, 4:20 pm–4:45 pm

Julius Plenz, Google

Available Media

Randomized load balancing is a common strategy to distribute requests across a server farm. When M requests are randomly assigned to N servers, every server is on average responsible for M/N requests. But in practice the distribution is not uniform: how many requests does the busiest server receive, relative to the average? This is the "peak to average load ratio". It is an important quantity that describes how much capacity is “wasted” when a system is provisioned for peak load. The closer this ratio is to one, the more uniform the utilization of servers is.

This talk gives a quick overview of two theorems from the following papers: "Balls into Bins" and "Small Cache, Big Effect."

Applied to the "requests to servers" scenario, the first theorem gives a closed expression for the number of requests hitting the busiest server (with high likelihood). One somewhat surprising conclusion is that the peak to average ratio gets worse if the number of requests and the number of servers grow proportionally. The second theorem analyzes how effective small, fast caches can be, and how effective they remain as a system scales in size.
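The peak to average load ratio can be explored empirically with a short simulation. The sketch below is only a numerical illustration, not the closed-form result from the papers; the function names are assumptions, and it uses one request per server by default so that requests and servers grow proportionally.

```python
import random
from collections import Counter


def peak_to_average(num_requests, num_servers, rng):
    """Assign requests uniformly at random; return max load / mean load."""
    loads = Counter(rng.randrange(num_servers) for _ in range(num_requests))
    average = num_requests / num_servers
    return max(loads.values()) / average


def mean_ratio(num_servers, requests_per_server=1, trials=5, seed=0):
    """Average the peak to average ratio over several seeded trials."""
    rng = random.Random(seed)
    total = num_servers * requests_per_server
    ratios = [peak_to_average(total, num_servers, rng) for _ in range(trials)]
    return sum(ratios) / trials


if __name__ == "__main__":
    for servers in (10, 100, 1000, 10000):
        print(servers, round(mean_ratio(servers), 2))
```

Running it with growing server counts shows the ratio creeping upward even though the average load per server stays fixed, which is the somewhat surprising behaviour the abstract refers to.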

Julius Plenz is an SRE at Google in Sydney, where he works on a large, internal storage system.

Getting Started with Chaos Engineering

Wednesday, 4:50 pm–5:15 pm

Ana Medina, Gremlin

Available Media

Chaos engineering is the practice of conducting thoughtful, planned experiments designed to reveal weaknesses in our systems. This talk will share how you can get started practicing Chaos Engineering in your organization.

Ana is a Software Engineer living in San Francisco. She is currently working as a Chaos Engineer at Gremlin, helping companies avoid outages by running proactive chaos engineering experiments. She last worked at Uber where she was an engineer on the SRE and Infrastructure teams specifically focusing on chaos engineering and cloud computing. Catch her tweeting at @Ana_M_Medina mostly about traveling, diversity in tech and mental health.

Thursday, 7 June 2018

8:00 am–9:00 am

Continental Breakfast

Grand Ballroom Foyer

9:00 am–10:55 am

Track 1

Grand Ballroom 1

Automatic Datacenter and Service Deployments Based on Capacity Planning Artifacts

Thursday, 9:00 am–9:55 am

Xiaoxiang Jian, Alibaba Group

Available Media

When you first build services in one data center, deployments and changes are always easy. As the number of services increases, you spend more time on service management and governance.

When we package all the services into one cloud offering and need to deploy it in hundreds of data centers, we face new problems. We need to evaluate the capacity, data center, and even network requirements of each service, and then initialize the new data center based on that evaluation. Normally this would take experts from the datacenter, network, and software development teams one or two months. How can all of this be done automatically, based only on the capacity planning artifacts? In this talk, I’ll share some of how we build and ship Apsara Stack, a dedicated cloud offering from Alibaba, to data centers from different vendors.

Specifically:

A full datacenter capacity planning based on the measures from users' perspective.
A unified deploy model for network, operating systems and various cloud products.
Maintaining lightweight, easy-to-change production configuration instead of providing different toolsets.
Validate and test all changes automatically when changing any deployments

Xiaoxiang is the Technical Lead of the Apsara Infrastructure Team at Alibaba, responsible for the technical infrastructure of the Alibaba eCommerce platform and Alibaba Cloud. He studied Electronic Engineering at Tsinghua University, Beijing.

Connect:

@jianxx

From Monitoring to Automated Testing of Your Infrastructure Code

Thursday, 10:00 am–10:55 am

Jesse Reynolds, Puppet, Inc.

Available Media

Don't have time to write automated tests for your infrastructure code? Don't see the point? Or don't know where to start? This talk is for you.

Now that we're writing code to manage our infrastructure with tools like Puppet, Chef, and Ansible, we are effectively developing software. One of the wonderful aspects of this is that we have the world of software development quality best practices to draw on in order to achieve a high rate of change while not compromising on reliability. Writing tests for infrastructure code (and having them execute automatically as part of a continuous integration pipeline) is a key element of this, and is the focus of this talk.

But how do you get started on this? What are some tools to help? How should we think about this problem? This talk will provide an overview of the different types of tests that can be written, from small unit tests to integration and acceptance testing. It will focus on integration testing where existing monitoring checks can come in handy, or at least provide a crossover or an entry point. In some cases the tests can also be used as checks in the monitoring system.
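As a small illustration of the crossover between monitoring checks and integration tests, the sketch below defines one TCP check and reuses it as a test assertion. It is an assumption-laden example rather than anything shown in the talk: the function names are hypothetical, and the status codes follow the common Nagios plugin convention, which may not match the monitoring system in use.

```python
import socket

# Status codes follow the Nagios plugin convention (assumed here):
# 0 = OK, 2 = CRITICAL.
OK, CRITICAL = 0, 2


def check_tcp_port(host, port, timeout=2.0):
    """Return (status, message) for whether host:port accepts connections."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return OK, f"OK: {host}:{port} is accepting connections"
    except OSError as error:
        return CRITICAL, f"CRITICAL: {host}:{port} unreachable ({error})"


def assert_listening(host, port):
    """Reuse the monitoring check as an integration test after provisioning."""
    status, message = check_tcp_port(host, port)
    assert status == OK, message
```

In a monitoring system the status would become the exit code of the check, while in a continuous integration pipeline assert_listening would run against a freshly provisioned test machine.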

Track 2

Grand Ballroom 2

Shopify's Move from the Data Centre to the Cloud

Thursday, 9:00 am–9:55 am

Scott Francis, Shopify

Available Media

Shopify is one of the largest commerce web sites in the world, with over 500,000 merchants including Kylie Jenner and Kanye West. In 2017, we made the decision to move from primarily co-located data centres to the cloud.

This talk will dig into why we made the decision to abandon the DC, one that may interest other companies considering the same move. We'll carefully talk through each step of the process—how we planned, managed and executed the migration.

We'll also dive deep into the tooling we built to make this possible: a tool for performing zero-downtime shard failovers; a live shop mover to migrate shops between shards; among others. These tools are what allowed us to successfully perform the migration with almost no downtime for our merchants.

We'll also go into our performance tuning methodologies and capacity planning process. And of course, no project executes as planned, so we'll also share some of the problems we encountered and lessons learned along the way.

Scott Francis is a senior production engineer lead at Shopify, focusing primarily on reliability, scalability, and performance. He'll take any opportunity to jump into gdb or debug a core dump. He enjoys cooking and sometimes dog walking in what little free time he has.

Ensuring Reliability of High-Performance Applications

Thursday, 10:00 am–10:55 am

Anoop Nayak, LinkedIn

Available Media

This talk goes a bit beyond traditional SRE tasks. It sheds light on the story of LinkedIn Lite, which is now the default mobile web experience in developing countries. We start by describing the product constraints and how, as SREs, we help with debugging, monitoring, and often contributing code to the product to make sure it's highly reliable, available, and performant.

We then describe our hybrid model of both SSR (Server Side Rendering) and CSR (Client Side Rendering), which makes the application load fast (i.e., in under 6 seconds) and perform smoothly even on low-end devices. We also share our experiences implementing the Progressive Web Application, its merits, and the risks involved. We also evaluate whether this level of performance is possible with JavaScript frameworks.

And then we dive into the Android app, whose size is half that of the default Hello World application on Android. The talk also emphasizes how SREs can contribute to production-level mobile application code. Finally, we see how we can monitor "lite" applications (applications which are primarily webview based) and optimize webview-based applications.

Anoop is a Site Reliability Engineer on the LinkedIn India Products SRE team, which handles products developed in India like LinkedIn Lite, LinkedIn Placements, and a number of relevance-related services. He is also one of the major contributors to the LinkedIn Lite Android App, which is less than 1MB in size.

Track 3

Kingfisher Room

Lightning Talks

Thursday, 9:00 am–10:55 am

Managing Distributed Systems with BOSH on GCP
Ronak Banka, Pivotal
Data Integrity: Key to Protect One of the Most Valuable Assets of Your Company
Chongxiu Wang, Google Sydney
Trouble in the Data Center
Dan-Claudiu Dragoș, Facebook
Modern IT Incident Response at Scale
Abhijit Pendyal

5-minute Break

Moving from the First Line of Defence to Last Line of Defence
Pradeep Thangavel, Freshworks
Demystifying the "A" Word—Accountability
Lenin Velu, PayPal India
Using SSH with CA-Signed Certificates
Marlon Dutra, Facebook Inc.
Relentless Reliability—Handling Hotspots
Benjamin Kaehne, ServiceNow

10:55 am–11:25 am

Break with Refreshments

Grand Ballroom Foyer

11:25 am–11:50 am

Track 1

Grand Ballroom 1

Smarter Disasters: End-to-End Automation for Incidents

Thursday, 11:25 am–11:50 am

Karthik Nilakant, Xero

Available Media

In this talk, I will discuss the different aspects of incident management and how we've built automation around each part at Xero. This includes: transforming manual alerts into automatic notifications through an issue report pipeline; a chat bot that streamlines incident coordination by facilitating effective communication and providing guidance through the process; and how we extract data from each incident for postmortem review. I'll also discuss how our tools have evolved and the lessons we learned on the way.

Karthik is a Senior Site Reliability Engineer at Xero. He's been based in the Auckland (New Zealand) office since 2016. Previously, he worked as a computer systems researcher and an enterprise server infrastructure consultant, both in New Zealand and the United Kingdom.

Track 2

Grand Ballroom 2

Debugging at Scale—Going from Single Box to Production

Thursday, 11:25 am–11:50 am

Kumar Srinivasamurthy, Microsoft Corp

Available Media

It's very easy to launch a debugger on your dev box, attach to the right process and step through code. However, things are different when you need to debug an issue in production that's getting tens of thousands of requests per second. What if the issue reproduces only in production? How do you debug without affecting production traffic? What techniques can you use in your development to make it easier to debug issues? Does your application use tracing? What debug logs are written out to aid in analysis?

This talk will cover:

Challenges with debugging in production
Various approaches that are used in the industry
Examples from Bing & Cortana incidents and steady state problems to illustrate the techniques
How to design services so that they are easier to debug

Kumar works at Microsoft and has been in the online services world for several years. He currently runs the Bing and Cortana Live site/SRE team. For the last several years, he has focused on growing the culture around live site quality, incident response and management, service hardening, availability, performance, capacity, SLA metrics, DRI/SRE development and educating teams on how to build services that run at scale.

Connect:

@00kumars

11:50 am–1:20 pm

Luncheon

Grand Ballroom Foyer

1:20 pm–2:45 pm

Track 1

Grand Ballroom 1

Productionizing Machine-Learning Services: Lessons from Google SRE

Thursday, 1:20 pm–2:15 pm

Salim Virji and Carlos Villavieja, Google

Available Media

Have you thought that your model trained on a Monday might not work on Saturday? Or that the model that you trained on users in Florida might not work for all Spanish-speaking users? In this talk, we present lessons learned from deploying and productionizing ML systems across various products at Google.

Salim Virji is a Site Reliability Engineer at Google, where he has built distributed compute, consensus, and storage systems.

Connect:

@salim

Carlos Villavieja is a Computer Architect/Researcher working as a Software/Site Reliability Engineer at Google. He works on Storage optimizations and his interests vary from micro-architecture to machine learning.

Connect:

@villaviejaC

Pro Tip: Save Money on Outages by Having a Bot Do the Heavy Lifting

Thursday, 2:20 pm–2:45 pm

Cezar Guimaraes, Microsoft

Available Media

Humans are slow, unreliable and hard to train. Azure has saved many millions of downtime minutes by using a knowledgeable and intelligent Bot. This Bot enhances and automates impact assessment, mitigation and problem management from your incident management.

You will learn about how to effectively run your outages and the strategy that we used to ensure that our solution was what our users wanted and would in fact lead to the immense time and cost savings that we predicted. We will share our guiding principles and lessons learned along the way.

Cezar Guimaraes is a Site Reliability Engineer Lead on the Microsoft Azure team. He has more than 15 years of experience and has worked at Microsoft for 12 years as a Software Engineer. Currently, he is working on Azure to identify and resolve problems that stand in the way of service uptime through engineering solutions such as bots and intelligence/autonomous engines.

Track 2

Grand Ballroom 2

Evolution of SRE and Rising Need of SRE Catalyzers

Thursday, 1:20 pm–2:15 pm

Isha Ganeriwal, LinkedIn

Available Media

An SRE team is responsible for availability, performance, efficiency, change management, emergency response, and capacity planning. As the need for high-availability systems grows, demands for and from SREs grow even further. To support these demands, an efficient and effective ecosystem is needed around SREs to ensure we deliver what we commit to, and commit to what is needed, in a timely manner.

Isha Ganeriwal currently works at LinkedIn, India, as a Senior Technical Program Manager for the SRE organization. Isha has more than 10 years of experience in the field of project/program management and has previously worked with data analytics and report engineering teams as a program manager. She has been with LinkedIn's SRE organization since last year and has had the opportunity to take a close look at SRE day-to-day functions and work through the evolution of SREs. Her vision is to create an efficient and effective operating system for SREs.

How to Serve and Protect (with Client Isolation)

Thursday, 2:20 pm–2:45 pm

Frances Johnson, Google Australia

Available Media

Client isolation is an important consideration for the reliability of Google Maps. We want to avoid becoming overloaded where possible and degrade gracefully otherwise. Following some user-visible outages in the area, Geo SRE began working on implementing client isolation.

But these things are never easy. There are multiple points in the stack where you can drop traffic so where should you do it and what are the tradeoffs? Are all requests created equal? What if your system changes partway through? Why doesn't exempting important traffic make sense? What other unexpected benefits does client isolation give you? All this and more!

Attendees will learn about what goals you can set for traffic management, identifying characteristics of your traffic and system architecture to leverage, and the strategies which can be used to design the solution.

I'm an SRE at Google in Sydney working on Spanner, formerly Geo (Maps).

2:45 pm–3:15 pm

Break with Refreshments

Grand Ballroom Foyer

3:15 pm–5:10 pm

Track 1

Grand Ballroom 1

A Tale of One Billion Time Series

Thursday, 3:15 pm–4:10 pm

Ruiyao Yao, Baidu

Available Media

A monitoring system is vital for service stability and availability. To support Baidu’s vast number of services and machines, the number of metrics being collected has grown to 1 billion. These metrics must be stored in a reliable and efficient database that supports real-time insertion of new data and a variety of queries, ranging from aggregation and alerting to reports and visualization, at diverse time granularities.

Our time-series database (TSDB) consists of three layers: an in-memory database based on Redis stores hot data, HBase stores warm data, and HDFS stores cold data. To achieve efficient insertion, we extensively apply batching and asynchronous methods on the write path, in addition to relying on HBase’s high write throughput. To improve read performance, we designed a specialized data model and embedded a multi-layer down-sampling mechanism into HBase. The in-memory database incorporates compression techniques to serve real-time, frequent, and small queries while keeping memory consumption at a reasonable level. All the data is backed up in a separate HDFS to support offline analysis.
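To make the idea of multi-layer down-sampling concrete, here is a minimal in-memory sketch. It is not Baidu's implementation and does not model HBase or Redis: the function names, the choice of averaging as the aggregate, the 60-second and 3600-second granularities, and the `max_points` budget are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def downsample(points, bucket_seconds):
    """Average (timestamp, value) points into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


def build_layers(raw_points, granularities=(60, 3600)):
    """Precompute one down-sampled series per granularity (in seconds).

    Every layer is computed from the raw points so that each bucket holds
    a true average rather than an average of averages.
    """
    return {seconds: downsample(raw_points, seconds) for seconds in granularities}


def pick_layer(layers, query_span_seconds, max_points=500):
    """Pick the finest layer that returns at most max_points for the span."""
    for seconds in sorted(layers):
        if query_span_seconds / seconds <= max_points:
            return seconds
    return max(layers)  # Fall back to the coarsest layer.
```

With layers precomputed at write time, a dashboard query covering a week can read from the hourly layer while a query covering the last hour reads from the per-minute layer, which bounds the number of points scanned per query.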

In this talk, we will explore the challenges of large-scale time-series processing and introduce our approach to building a TSDB. We will also share some successful experiences, such as our retention policy and the trade-offs between cost and performance.

Currently a Senior Engineer at Baidu, Beijing. Responsible for the design and development of data platform of AIOps.

Connect:

@ryeyao

Introduction to Linux Container Internals

Thursday, 4:15 pm–4:40 pm

Tyler McMullen, CTO, Fastly

Available Media

Software Fault Isolation, or SFI, is a way of preventing errors or unexpected behavior in one program from affecting others. Sandboxes, processes, containers, and VMs are all forms of SFI. SFI is a deeply important part of not only operating systems, but also browsers, and even server software.

The ways in which SFI can be implemented vary widely. Operating systems take advantage of hardware capabilities, like the MMU (Memory Management Unit). Others, like processes and containers, use facilities provided by the operating system kernel to provide isolation. Some types of sandboxing even use a combination of the compiler and runtime libraries in order to provide safety.

Each of the methods of implementing SFI has advantages and disadvantages, but we don't often think of them as different options toward a similar end goal. When we consider the growing prevalence of things like edge computing and the "Internet of Things", our common patterns start to falter.

In this talk, we'll focus on how sandboxing compilers work. There are important benefits, but also major pitfalls and challenges to making it both safe and fast. We'll talk about machine code generation and optimization, trap handling, memory sandboxing, and how it all integrates into an existing system. This is all based on a real compiler and sandbox, currently in development, that is designed to run many thousands of sandboxes concurrently in server applications.

Tyler McMullen is CTO at Fastly, where he’s responsible for the system architecture and leads the company’s technology vision. As part of the founding team, Tyler built the first versions of Fastly’s Instant Purging system, API, and Real-time Analytics. Before Fastly, Tyler worked on text analysis and recommendations at Scribd. A self-described technology curmudgeon, he has experience in everything from web design to kernel development, and loathes all of it. Especially distributed systems.

Connect:

@tbmcmullen

Automatic Traffic Scheduling for Internet Connectivity Failures

Thursday, 4:45 pm–5:10 pm

Liuqing Zhang, Baidu Inc.

Available Media

When it comes to high availability (HA) or user experience (UE), people often think about the stability of backend services or product design. Network connectivity, especially Internet connectivity, is neglected. This might partly come from the impression that the network is usually stable. Our observations show that network failures are far from scarce, at least in China. Every week, we detect 3-5 PoP failures and 5-10 backbone failures that break connectivity at the province level. Most of the failures can be remedied by modifying DNS settings to bypass the broken path. The remediation depends on two systems: the detection system and the traffic scheduling system.

The detection system must detect failures precisely and punctually. Besides dedicated monitoring agents, we recruit volunteer agents to improve coverage and punctuality. Dealing with their heterogeneity and unpredictable presence is crucial to detection performance.

The traffic scheduling system is responsible for detouring the traffic to the correct path. It must consider not only the connectivity of external network links, but also the users’ experience and the load of target IDCs.

In this talk, we will introduce how to implement and use the above two systems to handle Internet connectivity failures.

Liuqing Zhang, Staff Software Engineer, Baidu Inc.

Track 2

Grand Ballroom 2

Lessons Learned from Our Main Database Migrations at Facebook

Thursday, 3:15 pm–4:10 pm

Yoshinori Matsunobu, Facebook

Available Media

At Facebook, we created a new MySQL storage engine called MyRocks (https://github.com/facebook/mysql-5.6). Our objective was to migrate one of our main databases (UDB) from compressed InnoDB to MyRocks and reduce the amount of storage and number of servers used by half. In August 2017, we finished converting from InnoDB to MyRocks in UDB. The migration was very carefully planned and executed, and it took nearly a year. But that was not the end of the migration. SREs needed to continue to operate MyRocks databases reliably. It was also important to find any production issue and to mitigate or fix it before it becomes critical. Since MyRocks was a new database, we encountered several issues after running in production. In this session, I will introduce several interesting production issues that we have faced, and how we have fixed them. Some of the issues were very hard to predict. These will be interesting for attendees to learn too.

Attendees will learn about the following topics:

What is MyRocks, and why it was beneficial for large services like Facebook
What should be considered for production database migration
How migration should be executed
Four to six real production issues and how they were fixed

Yoshinori Matsunobu is a Production Engineer at Facebook, where he leads the MyRocks project and its deployment. Yoshinori has been part of the MySQL community for over 10 years. He was a senior consultant at MySQL Inc from 2006 to 2010. Yoshinori created a couple of useful open source tools, including MHA (an automated MySQL master failover tool) and quickstack.

PV Monitoring Based on Linear Regression

Thursday, 4:15 pm–4:40 pm

Wang Bo, Baidu

Available Media

The PV (Page View) curve is one of the most important curves for SREs. Every significant drop on the curve is regarded as an incident. Therefore, SREs are badly in need of a good anomaly detection algorithm.

Because PV fluctuates between day and night, detection depends heavily on the expected values. A moving average is a naïve method for generating the expected values, and it suffers from two problems. First, it lags behind the actual trend, so it will miss a drop during a rising trend. Second, it cannot easily differentiate between a drop and the recovery after a rise. Advanced methods such as exponential smoothing also have their own shortcomings. When PVs are large, the local fluctuations of the curve are relatively small, rendering a smooth curve. This inspired us to apply linear regression to generate the expected value. But linear regression is susceptible to abnormal values.

In this talk, we will present a method based on robust linear regression to compute expected values. This method is able to resist the impact of anomalies. Moreover, we will also introduce a statistical hypothesis testing method to detect anomalies, eliminating the need to set different thresholds at different times, as simple methods require.
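The abstract does not say which robust regression the speaker uses. As one possible illustration, the sketch below uses the Theil-Sen estimator (the median of pairwise slopes), a well-known robust alternative to ordinary least squares, and shows an ordinary least-squares fit alongside it for comparison. The function names and sample data are assumptions, and the hypothesis-testing step mentioned above is not covered.

```python
from itertools import combinations
from statistics import median


def theil_sen(xs, ys):
    """Fit y = slope * x + intercept using medians, which resist outliers."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
        if x2 != x1
    ]
    slope = median(slopes)
    intercept = median(y - slope * x for x, y in zip(xs, ys))
    return slope, intercept


def least_squares(xs, ys):
    """Ordinary least-squares fit, shown only for comparison."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def expected_value(fit, x):
    """Evaluate a fitted (slope, intercept) line at x."""
    slope, intercept = fit
    return slope * x + intercept


# Hypothetical PV samples on a steady rise, with one anomalous drop at x = 5.
minutes = list(range(10))
pv = [100 + 10 * m for m in minutes]
pv[5] = 0

robust_fit = theil_sen(minutes, pv)
plain_fit = least_squares(minutes, pv)
```

On this sample, the single drop pulls the least-squares line toward the anomaly, while the median-based fit still follows the underlying trend. That resistance is the property that makes a robust fit attractive for generating expected values.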

I am a Senior Software Engineer at Baidu. I am mainly engaged in operations data analysis, including time-series anomaly detection and fault diagnosis.

Do Docs Better: Practical Tips on Delivering Value to Your Business through Better Documentation

Thursday, 4:45 pm–5:10 pm

Riona MacNamara, Google

Available Media

Missing, incomplete, or stale/inaccurate documentation hurts development velocity, software quality, and—critically—service reliability. And the frustration it causes can be a major cause of job unhappiness.

SREs often spend 35% of their time on operational work, which leaves only 65% for development. Time spent on documentation comes out of the development budget, and this is challenging if there's a perception that creating and maintaining docs is grunge work that may not be recognized or rewarded. To convince fellow engineers and leadership to invest time and resources in documentation, it's essential not only to create good docs, but to gather data that communicates their quality, effectiveness, and value.

Attendees will learn:

How to understand and communicate documentation quality
Functional requirements for SRE documentation
Best practices for creating better, useful documentation
How to better communicate the value of documentation work to your business in order to drive change

Riona is senior staff technical writer at Google, where she has worked for 11 years, and leads the team that builds g3doc, Google's internal platform for engineering documentation, used by thousands of projects within the company. Before Google, she worked at Amazon and spent almost 10 years as a writer, editor, and program manager at Microsoft.

Connect:

@rionam

6:00 pm–8:00 pm

Conference Reception

River Promenade
Sponsored by Huobi