Scaling Reliability at Dropbox: Our Journey towards a Distributed Ownership Model

Sat Kriya Khalsa; Matt Witherow

All sessions will be held at the Grand Copthorne Waterfront Hotel Singapore.


Monday, 22 May 2017

7:30 am–9:00 am

Continental Breakfast

Grand Ballroom Foyer

9:00 am–11:20 am

Opening Plenary Session

SRE in This Region

Grand Ballroom

LinkedIn SRE: From Inception to Global Scale

Monday, 9:00 am–9:55 am PDT

Bruno Connelly and Viji Nair, LinkedIn Corporation

Will be covering the following objectives:

The History of SRE at LinkedIn
Problems behind the scenes
- Infrastructure / Services / Releases
The world before SRE
Establishing the SRE culture
- Core Principles
- SRE Engagement Model
- Ownership model and ownership culture
- People, the most important assets
- Redefining the role & Leveling up
Running LinkedIn @ scale
- Guard-railed multi-phased deployment process
- Error budgeting
- AB Testing & Deployments
- Building it together, self-services
Learning and Evolution
How we leverage global/remote teams
- Value additions
- How to seed and grow them successfully
- Challenges, and how to address them
- How to scale global teams
- Learnings, what worked and what didn't
- Best practices, do’s and don’ts when you seed, feed and grow a global team

Bruno Connelly, LinkedIn Corporation

Bruno Connelly, VP Engineering, Site Reliability, leads the teams responsible for the growth, scale, and day-to-day operation of LinkedIn. Prior to LinkedIn, Bruno spent seven years at Yahoo! where he held leadership positions in search engine operations, the Panama project, and ultimately led the teams responsible for the uptime of all Yahoo! advertising platforms.

Viji Nair, LinkedIn Corporation

Viji Nair is the Site Lead and Director of SRE at LinkedIn's Bangalore SRE Organization. He played a pivotal role in shaping and scaling LinkedIn’s India SRE presence from an idea to a world-class team of 65 engineers. This team is now recognized as one of India’s premier SRE organizations.

Next Generation of DevOps: AIOps in Practice @Baidu

Monday, 9:55 am–10:50 am PDT

Xianping Qu and Jingjing Ha, Baidu

Baidu has thousands of applications and hundreds of thousands of servers. To keep services highly available and reliable, our SREs have developed many operations tools and systems. However, these tools are difficult to reuse and scale because of the variety of operations concepts, runtime environments, and operations strategies involved. We therefore built the AIOps platform (AI stands for automation and intelligence) to help SREs develop operations tools more quickly and efficiently. The platform provides a unified operations abstract layer, operations strategies, and automated scheduling and execution, so SREs can focus on building their custom and advanced features.

In this talk, we demonstrate the core procedure of the AIOps platform using real cases from the production environment of Baidu's core products. We will cover the platform architecture, OKB (operations knowledge base), OPAL (operations abstract layer), and practices in failover, auto scaling, etc.

Xianping Qu, Baidu

Xianping Qu is a manager of the DevOps team at Baidu, the largest search engine in China, and has built Baidu’s monitoring platform and data warehouse. He now leads the DevOps team on challenging projects such as anomaly detection, RCA, and auto-scaling. He is also interested in data analysis and machine learning.

How Could Small Teams Get Ready for SRE

Monday, 10:55 am–11:20 am PDT

Zehua Liu, Zendesk Singapore

Site Reliability Engineering encompasses a large area of topics. The SRE book itself contains 34 chapters in 500+ pages. It’s not easy for a small team to start to adopt these SRE practices. At Zendesk Singapore, we went through the initial chaos of engineering and reliability issues when the engineering team grew from 10 to 40 engineers and the product focus shifted from SME to enterprise customers. Several initiatives that we took during this period helped stabilize the product and got the team into a shape where it’s ready to apply more SRE best practices from the SRE book and other sources. In this talk, we will share the details about some of the projects that we consider as essential in preparing a small and young team to tackle more serious site reliability issues. We will discuss how some of these key ideas could be combined to form foundations of the principles discussed in the SRE book. We hope that this talk could help teams facing similar growth and product change issues better cope with them while keeping the product reliable.

Zehua Liu, Zendesk Singapore

Zehua establishes and leads the tooling team at Zendesk, where he works on making sure that developers are happy developing what they want to develop and the quality of the products the developers deliver is great. He is currently based in Singapore.

11:20 am–11:50 am

Break with Refreshments

Grand Ballroom Foyer

11:50 am–12:15 pm

Track 1

Monitoring and Alerting

Grand Ballroom 1

How We Built TechLadies in Singapore

Monday, 11:50 am–12:15 pm PDT

Elisha Tan, Founder, TechLadies

TechLadies is a community-led initiative for women in Asia to connect, learn, and advance as programmers in the tech industry. Since our launch in 2016, we’ve taught programming skills to 177 ladies in Singapore and Malaysia and seen 7 of them land technical internships or get hired as junior software engineers. In this talk, I’ll share the inspiration behind setting up TechLadies, the what and how of the programs we run, and the lessons learned on how to introduce more women to the tech industry. Hopefully it will be useful for you in increasing gender diversity at your workplace!

Track 2

Service Lifecycle

Grand Ballroom 2

Focal Impact: The Service Pyramid

Monday, 11:50 am–12:15 pm PDT

Michael Elkin, Facebook

The Service Pyramid is a framework based on the hierarchy of needs, with a service-oriented focus. It's used by Production Engineers at Facebook to navigate a sea of tasks and mountains of work by giving clear guidance on the relative priority of different kinds of work. Using this guidance we can avoid common pitfalls, gauge how mature the reliability of a service is, and help divide work between PEs and other disciplines.

Come learn how one team of Production Engineers has been applying the Service Pyramid in the datacenter space: engaging with DCIM, Controls, Mechanical, and Electrical engineers to build better monitoring & automation.

Michael Elkin, Facebook

Production Engineer at Facebook; responsible for an untold number of server reboots and once successfully managed to blame the network.

12:15 pm–1:40 pm

Luncheon, Sponsored by Baidu

Waterfront Ballroom

1:40 pm–3:00 pm

Track 1

Monitoring and Alerting (continued)

Grand Ballroom 1

Smart Monitoring System for Anomaly Detection on Business Trends in Alibaba

Monday, 1:40 pm–2:05 pm PDT

Zhaogang Wang, Alibaba Group

Anomaly detection based on time series analysis approaches has been a focused theme in the field of monitoring, especially for business indicators monitoring. In Alibaba, hundreds of major KPIs need to be monitored in real time to detect the abnormal events and raise alarms. The effectiveness of the alarms is strictly evaluated by human operators. Therefore, we proposed an intelligent anomaly detection method to make the business monitoring system more scalable and easier to maintain.

There are two major problems in traditional anomaly detection approaches:

How to get accurate predictions from seasonal time series data with complex noise and interference.
How to determine the segment thresholds dynamically and learn from human feedback (e.g. manually labelled data) to improve the accuracy of anomaly detection, while tolerating errors, and even contradictions, within the labelled data.

For the first problem, as a tradeoff among computation performance, accuracy and robustness, we introduced a specific method to pre-process the data, and chose the STL method to do predictions on our business data.

For the second problem, we proposed a closed loop feedback method to determine the initial segment thresholds, and utilized the human labelling data to update the thresholds continuously.
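
The closed-loop idea above can be illustrated with a minimal Python sketch. It is not Alibaba's method: a seasonal-mean baseline stands in for STL, and the function names, the `k` multiplier, and the 10% update step are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def seasonal_baseline(history, period):
    """Predict each phase of the season as the mean of past values at that phase.

    Stand-in for STL: a seasonal-mean baseline, chosen only for brevity.
    """
    return [mean(history[phase::period]) for phase in range(period)]


def initial_threshold(history, period, k=3.0):
    """Initial threshold: k standard deviations of the residuals (k is an assumption)."""
    baseline = seasonal_baseline(history, period)
    residuals = [v - baseline[i % period] for i, v in enumerate(history)]
    return k * pstdev(residuals)


def is_anomaly(value, phase, baseline, threshold):
    """Flag a point whose distance from the seasonal baseline exceeds the threshold."""
    return abs(value - baseline[phase]) > threshold


def update_threshold(threshold, flagged, operator_says_anomaly, step=0.1):
    """Closed loop: widen after a false alarm, tighten after a missed anomaly."""
    if flagged and not operator_says_anomaly:
        return threshold * (1 + step)
    if not flagged and operator_says_anomaly:
        return threshold * (1 - step)
    return threshold  # the label agrees with the detector; leave it unchanged
```

Because each label only nudges the threshold by a small step, a single wrong or contradictory label has limited effect, which is one simple way to tolerate noisy human feedback.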

Zhaogang Wang, Alibaba Group

Zhaogang Wang,
Senior technical specialist, Alibaba Group.
2009-2015, Senior Engineer, SRE team, Baidu.

Areas of interest:
Intelligent monitoring system,
Fault diagnosis

Graphite@Scale or How to Store Millions of Metrics per Second

Monday, 2:05 pm–3:00 pm PDT

Vladimir Smirnov, System Administrator, Booking.com

This is a story about dealing with metrics at scale. A lot of metrics.

This is our story of the challenges we’ve faced at Booking.com and how we made our Graphite system handle millions of metrics per second.

You will learn about one of the highest-load Graphite-compatible stacks, the problems it poses, and the challenges of maintaining it and scaling it further, pushing Graphite to its limits and beyond.

System administrators and SREs who are interested in monitoring and scalability will find this useful.

Vladimir Smirnov, Booking.com

I've dealt with large scale systems design and administration in IT for over 6 years. For the last 1.5 years I've been working at Booking.com, specializing in scaling our Graphite stack and improving its reliability and performance.

We at Booking.com have hundreds of backend servers and hundreds of TB of data, which we use to handle millions of metrics per second with our Graphite stack. The growth rate is enormous and still increasing.

Track 2

Service Lifecycle (continued)

Grand Ballroom 2

Data Checking at Dropbox

Monday, 1:40 pm–2:05 pm PDT

David Mah, Dropbox

At Dropbox, we’ve worked incredibly hard to build infrastructure that we are confident in trusting. A major aspect of our confidence comes from the verification of our data at rest, which gives us signal that our data will be properly usable when requests actually come in.

In this talk, we’ll break down the thinking about how to design and build a consistency checker system. We’ll start with the actual needs/goals of such a system, then follow with the sub-components of the system. We’ll include both distributed system design AND how to design your alert escalation workflow to be as simple as possible for human operators.

Attendees are expected to leave the session understanding how they could build consistency checkers for their own systems. This includes:

Do you even need a consistency checker?
What independent components need to exist?
What is a good alerting + triaging workflow?
What is involved in an auto-remediation mechanism for constraint failures?
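
To make the questions above concrete, here is a minimal Python sketch of a consistency checker. It is not Dropbox's system: the index schema, the `Violation` type, and the rule that missing data pages a human while corruption goes to auto-remediation are all assumptions for illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Violation:
    block_id: str
    kind: str  # "missing" or "corrupt"


def check_blocks(index, storage):
    """Verify that every block the index references exists and hashes as recorded.

    index:   dict of block_id -> expected SHA-256 hex digest (hypothetical schema)
    storage: dict of block_id -> raw bytes at rest
    """
    violations = []
    for block_id, expected in sorted(index.items()):
        data = storage.get(block_id)
        if data is None:
            violations.append(Violation(block_id, "missing"))
        elif hashlib.sha256(data).hexdigest() != expected:
            violations.append(Violation(block_id, "corrupt"))
    return violations


def triage(violations):
    """Route violations: page a human for missing data, auto-repair corruption.

    This routing rule is an assumption for illustration only.
    """
    routed = {"page": [], "auto_remediate": []}
    for v in violations:
        routed["page" if v.kind == "missing" else "auto_remediate"].append(v.block_id)
    return routed
```

Keeping detection (`check_blocks`) separate from routing (`triage`) mirrors the "independent components" question: the checker can run continuously while the escalation policy changes on its own schedule.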

David Mah, Dropbox

David Mah is a Site Reliability Engineer at Dropbox who has built several monitoring mechanisms across Dropbox’s block storage and server file system infrastructure. He is also the author of Dropbox’s auto-remediation infrastructure.

Managing Server Secrets at Scale with a Vaultless Password Manager

Monday, 2:05 pm–3:00 pm PDT

Ignat Korchagin, Cloudflare

Operating a datacenter or a distributed network involves handling a lot of secrets. In most cases, you have to deal with at least four types of secrets for each piece of hardware: SSH server key, key to bootstrap your configuration management system, disk encryption key, and some per-server credentials to access other services. And often, these keys have to be set up before your configuration management kicks in, making the automation of this process more difficult.

Security-wise, it is important to control where and when those secrets are generated. Often, keys are generated by startup scripts. However, during initial boot (especially on diskless systems), the system may have a low entropy level in its internal random number generator, resulting in statistically weak generated keys.

And once you have your keys, you need to store them securely. This presents a chicken-and-egg problem. An encrypted disk is a great solution, but you need a key to access an encrypted disk, and where do you store that? Also, where do you store keys for diskless systems?

This talk presents an approach that combines hardware support and cryptography to deal with the above issues and unify and simplify secret management for your hardware fleet.
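
The abstract does not spell out the scheme. One common way to avoid storing per-server secrets at all is to derive them deterministically from a single master secret, sketched below in Python. The function name, the `purpose` labels, and the assumption that the master key is protected by hardware are illustrative, not details from the talk.

```python
import hashlib
import hmac


def derive_secret(master_key: bytes, server_id: str, purpose: str, length: int = 32) -> bytes:
    """Derive a per-server, per-purpose secret with HMAC-SHA256 (no vault lookup).

    master_key is assumed to be held somewhere protected (e.g. by hardware);
    how it is protected is outside the scope of this sketch.
    """
    if not 0 < length <= 32:
        raise ValueError("length must be between 1 and 32 bytes")
    sid = server_id.encode()
    # Length-prefix server_id so ("ab", "c") and ("a", "bc") cannot collide.
    info = len(sid).to_bytes(2, "big") + sid + purpose.encode()
    return hmac.new(master_key, info, hashlib.sha256).digest()[:length]
```

For example, `derive_secret(master, "node-17", "disk-encryption")` always returns the same 32 bytes for the same inputs, so the secret can be recomputed on demand instead of being stored.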

Ignat Korchagin, Cloudflare

Ignat is a systems engineer at Cloudflare working mostly on platform and hardware security. Ignat’s interests are cryptography, hacking, and low-level programming. Before Cloudflare, Ignat worked as senior security engineer for Samsung Electronics’ Mobile Communications Division. His solutions may be found in many older Samsung smart phones and tablets.

3:00 pm–3:30 pm

Break with Refreshments

Grand Ballroom Foyer

3:30 pm–4:55 pm

Track 1

Monitoring and Alerting (continued)

Grand Ballroom 1

Open-Falcon: A Distributed and High-Performance Monitoring System

Monday, 3:30 pm–4:25 pm PDT

Yao-Wei Ou, Wifire; Wei Lai, Didi Corporation

There are already many outstanding open-source monitoring systems. However, given the rapidly growing businesses and specific requirements of internet companies, the existing open-source monitoring systems are not good enough in performance and scalability.

Against this background, Open-Falcon’s main distinguishing features compared to other monitoring systems are:

Scalability: A scalable monitoring system is necessary to support rapid business growth. Each module of Open-Falcon is easy to scale horizontally.
Performance: With the RRA (Round Robin Archive) mechanism, one year of history data for 100+ metrics can be returned in just one second.
High Availability: No critical single point of failure; easy to operate and deploy.
Flexibility: Falcon-agent already has 400+ built-in server metrics. Users can collect custom metrics by writing plugins or by simply running a script or program that relays metrics to falcon-agent.
Efficiency: For easier management of alerting rules, Open-Falcon supports strategy templating, inheritance, multiple alerting methods, and callbacks for recovery.
User-Friendly Dashboard: Open-Falcon can present multidimensional graphs, including user-defined dashboards/screens.
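
The Round Robin Archive idea mentioned under Performance can be sketched in a few lines of Python. This is a simplified illustration, not Open-Falcon's or RRDtool's implementation: the class name is hypothetical, and averaging is only one of several possible consolidation functions.

```python
from collections import deque


class RoundRobinArchive:
    """Fixed-size archive of consolidated points; the oldest row is overwritten.

    steps_per_row raw samples are averaged into one stored row, so a long
    history fits in a small, constant amount of space.
    """

    def __init__(self, steps_per_row, rows):
        self.steps_per_row = steps_per_row
        self.rows = deque(maxlen=rows)  # deque drops the oldest row when full
        self._pending = []

    def add(self, value):
        """Buffer a raw sample and emit a consolidated row once enough arrive."""
        self._pending.append(value)
        if len(self._pending) == self.steps_per_row:
            self.rows.append(sum(self._pending) / self.steps_per_row)
            self._pending = []
```

Because a year of data at coarse resolution is only a bounded number of pre-aggregated rows, reading it back is a small sequential scan rather than a pass over every raw sample.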

Yao-Wei Ou, Wifire

Yao-Wei Ou is the director of the Wifire overseas R&D center. Before he joined 21Vianet Group (NASDAQ: VNET), he worked on 360 Total Security at Qihoo 360 (NYSE: QIHU). Yao-Wei has a Master's degree in Computer Science from Taiwan University. He has accumulated varied development experience, including Windows kernel drivers and UI, the Android RenderScript framework, iOS applications, the LLVM compiler, and distributed systems. Since joining Fastweb in 2015, he has led the development of Fastweb's CDN monitoring system and is a core member of the Open-Falcon organization.

Wei Lai, Didi Corporation

Wei Lai is the technical director of Didi and the founder of the Open-Falcon software and community.

Talking to an OpenStack Cluster in Plain English

Monday, 4:30 pm–4:55 pm PDT

Wei Xu, Tsinghua University

Modern systems are built on layers and layers of abstractions with tons of modules. These abstractions help development but make the systems a nightmare to operate. OpenStack is a system of this kind: its states, including persistent (DB) states, are distributed across dozens of modules in the system. Operators have to access these states using obscure command line tools that have hundreds of switches no one remembers. Integrating it with other open source projects like Ceph further complicates the problem. Reasoning about the inconsistencies of these states, one of the leading causes of user-visible bugs, is beyond the capability of current log-based monitoring systems.

As both system operation practitioners and academic researchers, we discuss our experience in operating a 130-node OpenStack private cloud, as well as our research on automatically building a knowledge graph based on system states and logs. We will demonstrate our natural language interface, which can provide all information about the system across layers and modules, all with plain English queries. Finally, we also present a simple anomaly detection system indicating “why” a problem happens.

Wei Xu, Tsinghua University

Wei Xu is an assistant professor at the Institute for Interdisciplinary Information Sciences of Tsinghua University. He received his Ph.D. from UC Berkeley in 2010. He worked at Google as a software engineer before joining Tsinghua University.

Wei Xu has a broad research interest in distributed system design and big data. He has published 20+ research papers in leading venues.

He is also the director of Open Compute Project (OCP) Certification Lab in China. He is a recipient of the Chinese National Youth 1000 Program, graduate student advising award from Tsinghua, and faculty research awards from Google, IBM, and Microsoft.

Track 2

Service Lifecycle (continued)

Grand Ballroom 2

Distributed Consensus Algorithms

Monday, 3:30 pm–4:25 pm PDT

Laura Nolan, Google

Processes crash or may need to be restarted. Hard drives fail. Natural disasters can take out several datacenters in a region. Site Reliability Engineers need to anticipate these sorts of failures and develop strategies to keep systems running in spite of them.

This usually means running systems across multiple sites, and this means that you need to make tradeoffs between availability and consistency of your system state.

This talk explores distributed consensus algorithms such as Raft and Paxos in production: how they work, how they perform, what can go wrong, and when to use them and when not to.
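
Both Raft and Paxos rely on majority quorums, which is where the availability tradeoff comes from. The short Python sketch below shows only that arithmetic, not either protocol; the function names are illustrative.

```python
def majority(cluster_size):
    """Smallest number of nodes that forms a majority quorum."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size):
    """How many nodes can fail while a majority can still be formed."""
    return (cluster_size - 1) // 2


def is_committed(acks, cluster_size):
    """An entry is safe once a majority of the cluster has acknowledged it."""
    return acks >= majority(cluster_size)
```

A 5-node cluster needs 3 acknowledgements and survives 2 failures. Growing to 6 nodes raises the quorum to 4 but still survives only 2 failures, which is why odd cluster sizes are the usual choice.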

Laura Nolan, Google

Laura Nolan has been a Site Reliability Engineer at Google for four years, working on large data infrastructure projects and, most recently, networking. Her background is in software engineering and computer science. She wrote the 'Managing Critical State' chapter in the O'Reilly SRE book, and is co-chair of SREcon EMEA 2017.

A Scheduling Framework For Large-Scale Based on Ansible

Monday, 4:30 pm–4:55 pm PDT

AiZhen Chen, Qiniu Cloud

This talk introduces how to use Ansible to build a distributed scheduling framework that lets an operations and maintenance platform run data collection, task distribution, and other large-scale scheduling operations across tens of thousands of nodes. The framework is divided into an interface layer, a distributed scheduling layer, and a driver layer, which together handle data collection and task distribution for the underlying operating system, storage, and network. We also modified the Ansible source code to handle validation of the parameters passed in by the business.

AiZhen Chen, Qiniu Cloud

Chen AiZhen is an expert in cloud computing and a Docker product manager at Qiniu Cloud, where she designs large scale computer operation platforms and is responsible for a number of cloud projects. She has many years of computer operations management experience as well as rich experience with PaaS platform architecture design and automated operation and maintenance.

Tuesday, 23 May 2017

8:00 am–9:00 am

Continental Breakfast

Grand Ballroom Foyer

9:00 am–10:50 am

Track 1

Monitoring and Alerting (continued)

Grand Ballroom 1

Draining the Flood—A Combat against Alert Fatigue

Tuesday, 9:00 am–9:55 am PDT

Yu Chen, Baidu Inc.

A monitoring system is an important tool for SREs to guarantee service stability and availability. Baidu’s monitoring system, Argus, tracks hundreds of services in distributed systems. With the increasing complexity of the services, the number of monitoring items and corresponding anomaly detectors has grown to the point where the generated alerts flood in from time to time. On an average day, Argus detects millions of warning events and sends out thousands of SMS alerts to on-call engineers. This works out to more than 100 alerts per person during the daytime and 30 during the night. When a severe failure occurs, the alerts arrive in a massive surge and are of little help to the engineers fixing the problem. It is therefore imperative to improve the detection accuracy of abnormal events, reduce the number of alerts, and organize them in a meaningful way.

In this talk, we will introduce our practice of leveraging machine learning methods to detect anomalies and group alerts in order to solve the above issues. We will also share some successful experiences, such as alert-based datacenter-level failure detection and alert-triggered automatic recovery techniques.
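
As a rough illustration of alert grouping and alert-based datacenter-level detection, here is a minimal Python sketch. It uses plain rule-based time-window bucketing rather than the machine learning methods the talk describes, and the tuple layout, the 5-minute window, and the `min_services` cutoff are all assumptions.

```python
from collections import defaultdict


def group_alerts(alerts, window_seconds=300):
    """Bucket alerts by (datacenter, time window).

    alerts: iterable of (timestamp_seconds, datacenter, service) tuples
            (a hypothetical layout chosen for this sketch).
    """
    groups = defaultdict(list)
    for ts, datacenter, service in alerts:
        groups[(datacenter, ts // window_seconds)].append(service)
    return dict(groups)


def datacenter_incidents(groups, min_services=3):
    """Flag buckets where many distinct services alert at once."""
    return sorted(
        key for key, services in groups.items()
        if len(set(services)) >= min_services
    )
```

Sending one notification per flagged bucket instead of one per alert is the simplest form of the reduction the abstract calls for; a learned model would replace the fixed window and cutoff.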

Yu Chen, Baidu Inc.

Yu Chen is the data architect of the SRE team in Baidu. He previously worked in Microsoft Research Asia as a researcher. His working experience includes data mining, search relevance, and distributed systems.

Good, Better, Best, Mobile User Experience

Tuesday, 9:55 am–10:50 am PDT

Fred Wu, Tingyun

Traditional DC/cloud monitoring platforms focus more on high availability. Restart, reboot, and reimage are the top three quick actions for operations engineers. In the current IT environment, however, mobile/cloud architects need a third-party monitoring platform to help operations engineers understand more about the application and the end-user experience: widening monitoring from the DC alone to the mobile user, deepening monitoring from the system to the application and running code, and changing from alert-only to alerting first and taking action at the same time. The presentation will show the audience why to change, best practices, and the customer benefits of using a third-party monitoring platform, along with case studies of improved customer satisfaction and lower average customer cost.

Fred Wu, Tingyun

Fred Wu, VP of Tech & Service, has been with Tingyun for 2 years and has 18 years of experience in the China IT market, in both technical team management and solution architecture for ADC and APM, with a focus on the OTA, e-commerce, CSP, and banking industries.

Track 2

Service Lifecycle (continued)

Grand Ballroom 2

Reliable Launches at Scale

Tuesday, 9:00 am–9:55 am PDT

Sebastian Kirsch, Google Switzerland GmbH

How do you perform up to 70 product and feature launches per week safely and reliably? Google staffed a dedicated team of Site Reliability Engineers to solve this question: Launch Coordination Engineers (LCEs) work across Google's service space to audit new products and features for reliability, act as liaisons between teams involved in a launch, and be gatekeepers. Google designed its launch process to be easy on developers, scalable, and able to provide a consistent bar for reliability at launch. A common checklist helped make the results of a launch review reproducible. This launch checklist was later automated into a self-service tool curated by the LCE team and re-used by other teams. Specialising in the challenges of launching enabled the LCE team to build a breadth of experience in applicable techniques and allowed them to function as vehicles for knowledge transfer between different parts of the company.

Attendees will learn about challenges unique to launch situations, the advantages of a dedicated LCE team, how to structure a launch process, the outline of a launch checklist, and select techniques for successful launches.

Sebastian Kirsch, Google Switzerland GmbH

Sebastian Kirsch is a Site Reliability Manager for Google in Zürich, Switzerland. He manages the team responsible for the reliability of Google's financial transaction systems. Sebastian joined Google in 2006 in Dublin, Ireland, and has worked both on internal systems like Google's web crawler, as well as on external products like Google Maps and Google Calendar. He specializes in the reliability aspects of new Google products and new features of existing products, ensuring that they meet the same high reliability bar as every other Google service.

Didi: How to Provide a Reliable Ridesharing Service

Tuesday, 9:55 am–10:50 am PDT

Ming Hua and Lin Tan, Didi Corporation

Didi is a ridesharing company similar to Uber. Founded in 2012, Didi now provides transportation services for close to 400 million users across more than 400 cities in China. As the infrastructure team, we faced a wide range of technical challenges:

With 500% annual growth in requests, how do we guarantee service quality for a rapidly expanding system?
With more than 400 releases per day, how do we iterate on applications rapidly while minimizing risk?
How do we prevent a large-scale system failure, which would immediately affect customers in over 400 cities?
How do we put as much effort as possible into system stability alongside rapid business iteration?

In this session, we will share lessons learned from Didi:

The importance of targeted stress testing and system capacity estimation for our ridesharing service.
The best practices behind Didi’s standard delivery, including application requirements and deployment procedures.
The benefits of an intelligent city-based load balancing strategy and a high-availability system that spans different regions and zones.
A measurement mechanism for system stability work.

Ming Hua, Didi Corporation

Ming Hua works on Didi SRE principal architecture with a focus on service stability work.

Lin Tan, Didi Corporation

Lin Tan is an SRE Senior Engineer at Didi.

10:50 am–11:20 am

Break with Refreshments

Grand Ballroom Foyer

11:20 am–12:00 pm

Track 1

Monitoring and Alerting (continued)

Grand Ballroom 1

Measuring the Success of Incident Management at Atlassian

Tuesday, 11:20 am–11:45 am PDT

Gerry Millar, Atlassian

When an incident happens it's the worst possible time to be bogged down with confusing systems and processes. A well defined Incident Management Process that's light-weight and supported by good automation offers a way to get fast, easy and predictable results during an incident, but if you don't implement the right things in the right way you risk bad results, such as high time-to-recovery, at critical times.

Find out how Atlassian drives value out of the Incident Management process and what metrics we use to track it. We'll also cover how we created automation to remove the overhead in managing incidents and deep dive into a case study to explain how it all ties together.
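
Time-to-recovery, the metric named above, is straightforward to compute once incidents carry timestamps. The Python sketch below is a generic illustration rather than Atlassian's tooling; the function name and the `(detected_at, resolved_at)` record layout are assumptions.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average duration from detection to resolution.

    incidents: list of (detected_at, resolved_at) datetime pairs
               (a hypothetical record layout).
    """
    if not incidents:
        raise ValueError("no incidents to measure")
    total = sum(
        (resolved - detected for detected, resolved in incidents),
        timedelta(),  # start value so sum() adds timedeltas, not ints
    )
    return total / len(incidents)
```

Tracking this value per quarter, or per severity level, gives a simple trend line for judging whether process and automation changes are paying off.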

The target audience for this conceptual session is people who are involved in the management of incidents, such as SREs and delivery team members.

Gerry Millar, Atlassian

Gerry Millar is a member of the Reliability Process Group, the team responsible for Atlassian's incident response and recovery process. He trains staff, develops automation, measures incident performance, and drives post-incident reviews at Atlassian.

He was originally hired as a technical operations engineer for the company's cloud-facing products and as this group morphed into SRE Gerry's role morphed along with it. He now enjoys writing a bit of Python and ocean swimming around Sydney's beaches.

Track 2

Service Lifecycle (continued)

Grand Ballroom 2

Managing Changes Seamlessly on Yahoo's Hadoop Infrastructure Servers

Tuesday, 11:20 am–11:45 am PDT

Vineeth Vadrevu, Yahoo

Yahoo's grid-ops team makes frequent changes (15 per week) to the following entities:

Operating system configuration (packages, security patches)
Hadoop (packages)
Supporting applications, i.e. LDAP, Kerberos, Logging, Monitoring

For a platform like Hadoop at Yahoo's scale, stability, reliability, and uptime are crucial because of the sheer magnitude of what the platform caters to. The objectives set for pushing changes, so that the platform doesn't take a hit on stability, reliability, and uptime, are:

Continuous Delivery and DevOps philosophies are strictly adhered to
Every node always has the right configs
Each change is pushed within a specific time period and applied uniformly across all nodes
Changes made can be visualised, tracked, monitored, validated and if required, can be reverted easily
Feasible mechanisms are made available in the pipeline to promote pushes across Hadoop clusters in a staged manner so that change rollovers are smooth
Effective gates are in place to reduce the impact of a wrong change
Any change reviewed and committed makes its way automatically to every node
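
The staged-push and gating objectives above can be sketched as a small Python routine. This is a generic illustration, not Yahoo's pipeline: the function name and the `apply_change`, `is_healthy`, and `revert_change` callbacks are hypothetical.

```python
def staged_rollout(stages, apply_change, is_healthy, revert_change):
    """Push a change stage by stage; revert everything if a gate fails.

    stages: ordered list of lists of node names (e.g. a canary cluster first).
    Returns (succeeded, nodes_left_changed).
    """
    changed = []
    for stage in stages:
        for node in stage:
            apply_change(node)
            changed.append(node)
        # Gate: every node in this stage must be healthy before moving on.
        if not all(is_healthy(node) for node in stage):
            for node in reversed(changed):
                revert_change(node)
            return False, []
    return True, changed
```

A bad change is stopped at the first stage whose gate fails, so later stages (typically the largest clusters) are never touched.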

This presentation will share and discuss:

How the objectives are achieved through automation
Experiences and lessons learned

We believe our practices can easily be adopted by SREs to effectively manage Large-Scale Infrastructure.

Vineeth Vadrevu, Yahoo

Vineeth is a Sr. Production Engineer, Principal at Yahoo. He is with GridOps team at Yahoo working on Large Scale Production Engineering Applications and Infrastructure Monitoring & Management.

Vineeth has B.Tech in Computer Science (Affiliated to Andhra University) and an Executive MBA in Operations Management from Symbiosis Intl. University, Pune, India.

12:00 pm–2:00 pm

Luncheon

Grand Ballroom Foyer

2:00 pm–3:25 pm

Track 1

Incident Management

Grand Ballroom 1

Event Correlation: A Fresh Approach towards Reducing MTTR

Tuesday, 2:00 pm–2:55 pm PDT

Renjith Rajan and Rajneesh, LinkedIn

LinkedIn has hundreds of microservices spanning across different data centres and dependent on each other. Even though microservice architecture has its own advantages, often times, identifying the problematic service during an incident/outage could significantly contribute to a high MTTR and unnecessary escalations. Event Correlation Engine is an attempt to algorithmically identify the responsible service quickly and escalate to the right team.

Event Correlation Engine examines the entire service stack by looking at critical downstream latencies, error metrics, and other monitoring events to recommend a responsible service. It understands caller-callee communication in LinkedIn’s complex microservices architecture. The engine uses dynamic thresholds, which it learns by processing the last 30 days of data, to provide an effective recommendation. We’ll discuss the approach we used in building it and how it is being used at LinkedIn to reduce MTTR and on-call escalations.
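
The general shape of such an engine can be sketched in Python. This is not LinkedIn's implementation: the mean-plus-`k`-standard-deviations threshold, the greedy walk down the call graph, and all names are assumptions chosen to illustrate the idea.

```python
from statistics import mean, pstdev


def dynamic_threshold(history, k=3.0):
    """Threshold learned from past data (e.g. 30 days): mean + k standard deviations."""
    return mean(history) + k * pstdev(history)


def responsible_service(root, callees, latency_now, latency_history, k=3.0):
    """Follow anomalous downstream calls from root; return the deepest anomalous service.

    callees: dict of service -> list of services it calls (caller-callee graph).
    Returns None if the root itself is not anomalous.
    """
    def excess(svc):
        return latency_now[svc] - dynamic_threshold(latency_history[svc], k)

    if excess(root) <= 0:
        return None
    current, seen = root, {root}
    while True:
        bad = [c for c in callees.get(current, []) if c not in seen and excess(c) > 0]
        if not bad:
            return current  # no anomalous callee: blame stops here
        current = max(bad, key=excess)  # follow the callee furthest above its threshold
        seen.add(current)
```

If a frontend is slow only because a database two hops down is slow, the walk ends at the database, which is the team that should be escalated to.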

Renjith Rajan, LinkedIn

Renjith Rajan is a Staff Site Reliability Engineer with LinkedIn's production SRE team. He joined LinkedIn in July 2016 and prior to that he was working with Yahoo in the Ad Serving team.

Rajneesh, LinkedIn

Rajneesh is a Site Reliability Engineering Manager leading the production SRE team at LinkedIn, Bangalore. He joined LinkedIn in April 2015 and prior to that he was leading the Global Platforms team at Yahoo.

Automated Troubleshooting of Live Site Issues

Tuesday, 3:00 pm–3:25 pm PDT

Sriram Srinivasan, PayPal India Private Ltd.

Troubleshooting live site issues can be challenging, especially when the production stack is made up of over 2000 applications and services. PayPal’s SRE team is also involved in troubleshooting and driving resolution of the various live site issues reported by the customer and merchant support teams. Today, to troubleshoot a live site issue, we go to multiple places depending on the issue at hand. Predominantly we look into the centralized application logs, then check the various data sources and the in-house alerts. There is a great deal of information to look through, and a lot of effort goes into gathering data about the failed attempt or transaction from various sources internal to PayPal. We therefore needed an automated way to troubleshoot issues, so we developed an Auto Troubleshooting Platform that aggregates the data from all the underlying data sources, troubleshoots, and records the results. The platform is built so that anyone can post any type of ticket and have it troubleshot automatically. Auto troubleshooting results are available in minutes and can be viewed through a portal. In this talk, I will highlight the journey that we have undertaken in making this happen.

Sriram Srinivasan, PayPal India Private Ltd.

Sriram Srinivasan is a technologist with over 14 years of experience in software development. He has worked on multiple teams at PayPal India Private Ltd. across the software development lifecycle, including conception, design, development, and product support. In his current role as Architect on PayPal's SRE team, he had the opportunity to design and develop an Auto Troubleshooting Platform.

Track 2

Service Expansion

Grand Ballroom 2

"A Unit Test Would Have Caught This:" Small, Cheap, and Effective Testing for Production Engineers

Tuesday, 2:00 pm–2:25 pm PDT

Andrew Ryan, Facebook Inc.

Note: This talk ends at 2:25 pm.

Available Media

Production Engineers write tons of code for automating and managing distributed systems, ranging from one-off shell scripts to complex frameworks of many thousands of lines. But in our profession, formal software testing has historically been weak. While there is a wide variety of ways to test software, in this talk we'll focus on the methods that are easiest to implement and give the best results.

We'll discuss how teams use "small and cheap" tests at Facebook to give us the largest quality improvements with the lowest effort. Examples will include projects written in Bash and Python, as well as our Chef code, which is written in Ruby. We'll also discuss some of the ancillary benefits of testing, including recruiting, training, and team continuity.

Andrew Ryan, Facebook, Inc.

Andrew has been a member of Facebook's Production Engineering team since 2009. He currently works as a member of the Traffic Infrastructure team, helping to make Facebook faster for everyone.

Testing for DR Failover Testing

Tuesday, 3:00 pm–3:25 pm PDT

Zehua Liu, Zendesk

Available Media

Disaster Recovery is an important area in SRE. A simplified scenario is recovery from a full data centre failure. The Zendesk Chat backend infrastructure operates in a single data centre. The way to be sure that DR works is to perform a real failover. Past failover attempts were full of surprises and unexpected issues, most of them involving applications that failed to work after failover for various reasons. These unexpected issues led to failed failover tests and/or extended maintenance windows, because of the extra effort required to bring things back to order, resulting in a bad customer experience.

How can we increase our confidence that the system will work after failing over? We want to confidently declare that it should work, instead of typing the fingers-crossed emoji. In this talk, we share our experiment with setting up testing for the DR environment. The biggest question here is whether we would accept writes to the DR DBs other than those coming from replication from production. The trade-off is between the risk to production stability and the risk of a failed DR failover. We will discuss the alternatives we considered and the final approach we adopted.

Zehua Liu, Zendesk Singapore

Zehua established and leads the tooling team at Zendesk, where he works on making sure that developers are happy developing what they want to develop and that the quality of the products they deliver is great. He is currently based in Singapore.

3:25 pm–3:55 pm

Break with Refreshments

Grand Ballroom Foyer

3:55 pm–4:45 pm

Track 1

Incident Management (continued)

Grand Ballroom 1

Accept Partial Failures, Minimize Service Loss

Tuesday, 3:55 pm–4:20 pm PDT

Daxin Wang, Baidu Inc.

Available Media

Large Internet products are too complex to recover completely from a failure rapidly, as root cause localization and large-scale operations are very time-consuming. If failures can be isolated to a small part of the system, we can transfer user queries to the parts that still work, or cut off the failed minor subsystem, which is much faster than a complete system recovery. This talk will present some practical experiences with failure isolation at Baidu.

First, we should have at least N+1 data center redundancy and eliminate unnecessary global “single-point” modules. All automated operations should first be limited to executing in a single data center. When the system in one data center is damaged by a network or software failure, we can transfer user requests to other data centers rapidly, even automatically.

Second, we make the non-essential components of the system detachable. When one of them fails, it can be detached immediately so that the major functions keep working for users.

Daxin Wang, Baidu Inc.

Daxin Wang has been working on the Baidu SRE team for more than 7 years, focusing on the principles of building and operating highly available products, including monitoring, HA architecture, and safe automation.

Azure SREBot: More than a Chatbot—an Intelligent Bot to Crush Mitigation Time

Tuesday, 4:20 pm–4:45 pm PDT

Cezar Guimaraes, Microsoft

Available Media

Azure SREBot is more than just a Chat Bot. Azure SREBot is a knowledgeable and intelligent engine that replaces tribal knowledge and automates incident response activities. It is also extensible, allowing other teams to add their own knowledge.

In this talk you will hear how SREBot is being developed and used to reduce the Time to Mitigate (TTM) for Azure incidents. We will explain how it was designed and share the main issues we are facing.

Cezar Guimaraes, Microsoft

Cezar Guimaraes is a Site Reliability Engineer Lead on the Microsoft Azure team. He has more than 15 years of experience and has worked at Microsoft for 11 years as a Software Engineer. Currently he is working on Azure to identify and resolve problems that stand in the way of service uptime through engineering solutions such as bots and intelligence/correlation engines.

Track 2

Service Expansion (continued)

Grand Ballroom 2

Merou: A Decentralized, Audited Authorization Service

Tuesday, 3:55 pm–4:20 pm PDT

Luke Faraone, Dropbox

Available Media

Every organization has a system for access control, be it a spreadsheet, LDAP, IAM, something homegrown, or all of the above. Most of these approaches suffer from some combination of hard-to-use interfaces, incomplete coverage, lack of audit/compliance functionality, and bottlenecks for permission grants.

Merou is Dropbox's homegrown, open source authorization service. It manages a wide range of environments, from the corporate network to production data centers and cloud providers like AWS. It is also a transparent system of record, managed in a decentralized manner by the individuals and teams that own the applications and services Merou provides authorization for.

We will present a general overview of our approach, highlight the features that make this system unique, sample some of our current use cases, and present some lessons learned.

Luke Faraone, Dropbox

Luke is a security engineer on Dropbox's Infrastructure Security team, which works to accelerate the secure deployment of internal systems. He is also Dropbox's representative to the TODO Group, an open group of companies who collaborate on running successful and effective open source projects and programs.

In his spare time, Luke contributes to the Debian project as a developer and as a member of the ftpmaster team, which oversees and maintains the well-being of Debian's official package repositories.

Canary in the Internet Mine

Tuesday, 4:20 pm–4:45 pm PDT

Brook Shelley, Turbine Labs

Available Media

Software releases are often costly, slow, and difficult to roll back. This talk will suggest a different way forward with a flexible, incremental, reversible release workflow using tools like a reverse proxy and customer-centric monitoring. I'll also talk about header- and cookie-based dev branch tests, and ways to use an incremental release to test internally as well. There will be a short demo of a release gone bad, and one that succeeds, in order to illustrate the principles of this talk. There are some blinky lights, but not many.

Brook Shelley, Turbine Labs

Brook Shelley lives in Portland, OR with her cat Snorri, where she builds a better software release & response process at Turbine Labs. Her writing has appeared in The Toast, Lean Out, Transfigure, & the Oregon Journal of the Humanities. She speaks at conferences on queer & trans issues, & is co-chair of the board of Basic Rights Oregon.

6:00 pm–8:00 pm

Reception

River Promenade