All sessions will be held at the Grand Copthorne Waterfront Hotel Singapore.
Monday, 22 May 2017
7:30 am–9:00 am
Continental Breakfast
Grand Ballroom Foyer
9:00 am–11:20 am
SRE in This Region
Grand Ballroom
LinkedIn SRE: From Inception to Global Scale
Bruno Connelly and Viji Nair, LinkedIn Corporation
This talk will cover the following objectives:
- The History of SRE at LinkedIn
- Problems behind the scenes
- Infrastructure / Services / Releases
- The world before SRE
- Establishing the SRE culture
- Core Principles
- SRE Engagement Model
- Ownership model and ownership culture
- People, the most important assets
- Redefining the role & Leveling up
- Running LinkedIn @ scale
- Guard-railed multi-phased deployment process
- Error budgeting
- A/B Testing & Deployments
- Building it together, self-services
- Learning and Evolution
- How we leverage global/remote teams
- Value additions
- How to seed and grow them successfully
- Challenges, and how to address them
- How to scale global teams
- Learnings: what worked and what didn't
- Best practices, do’s and don’ts when you seed, feed and grow a global team
Bruno Connelly, LinkedIn Corporation
Bruno Connelly, VP Engineering, Site Reliability, leads the teams responsible for the growth, scale, and day-to-day operation of LinkedIn. Prior to LinkedIn, Bruno spent seven years at Yahoo! where he held leadership positions in search engine operations, the Panama project, and ultimately led the teams responsible for the uptime of all Yahoo! advertising platforms.
Viji Nair, LinkedIn Corporation
Viji Nair is the Site Lead and Director of SRE at LinkedIn's Bangalore SRE Organization. He played a pivotal role in shaping and scaling LinkedIn’s India SRE presence from an idea to a world-class team of 65 engineers. This team is now recognized as one of India’s premier SRE organizations.
Next Generation of DevOps: AIOps in Practice @Baidu
Xianping Qu and Jingjing Ha, Baidu
Baidu has thousands of applications and hundreds of thousands of servers. To keep services highly available and reliable, our SREs have developed many operations tools and systems. But these tools are difficult to reuse and scale because of the variety of operations concepts, runtime environments, and operations strategies involved. Thus, we built a platform called AIOps (here, AI stands for automation and intelligence) to help SREs develop operations tools more quickly and efficiently. The platform provides a unified operations abstraction layer, operations strategies, and automated scheduling and execution, so SREs can focus on building their custom, advanced features.
In this talk, we demonstrate the core workflow of the AIOps platform through real cases from the production environments of Baidu's core products. The following technologies will be covered: the platform architecture, OKB (operations knowledge base), OPAL (operations abstraction layer), and practices in failover, auto scaling, etc.
Xianping Qu, Baidu
Xianping Qu is a manager of the DevOps team at Baidu, the largest search engine in China, and has built Baidu's monitoring platform and data warehouse. He now leads the DevOps team on challenging projects such as anomaly detection, RCA, and auto-scaling. He is also interested in data analysis and machine learning.
How Could Small Teams Get Ready for SRE
Zehua Liu, Zendesk Singapore
Site Reliability Engineering encompasses a large area of topics. The SRE book itself contains 34 chapters in 500+ pages. It’s not easy for a small team to start to adopt these SRE practices. At Zendesk Singapore, we went through the initial chaos of engineering and reliability issues when the engineering team grew from 10 to 40 engineers and the product focus shifted from SME to enterprise customers. Several initiatives that we took during this period helped stabilize the product and got the team into a shape where it’s ready to apply more SRE best practices from the SRE book and other sources. In this talk, we will share the details about some of the projects that we consider as essential in preparing a small and young team to tackle more serious site reliability issues. We will discuss how some of these key ideas could be combined to form foundations of the principles discussed in the SRE book. We hope that this talk could help teams facing similar growth and product change issues better cope with them while keeping the product reliable.
Zehua Liu, Zendesk Singapore
Zehua established and leads the tooling team at Zendesk, where he works on making sure that developers are happy developing what they want to develop and that the quality of the products they deliver is great. He is currently based in Singapore.
11:20 am–11:50 am
Break with Refreshments
Grand Ballroom Foyer
11:50 am–12:15 pm
Monitoring and Alerting
Grand Ballroom 1
How We Built TechLadies in Singapore
Elisha Tan, Founder, TechLadies
TechLadies is a community-led initiative for women in Asia to connect, learn, and advance as programmers in the tech industry. Since our launch in 2016, we've taught programming skills to 177 ladies in Singapore and Malaysia and seen 7 of them get technical internships or be hired as junior software engineers. In this talk, I'll share the inspiration behind setting up TechLadies, the what and how of the programs we run, and the lessons learned on how to introduce more women to the tech industry. Hopefully it will be useful for you in increasing gender diversity at your workplace!
Service Lifecycle
Grand Ballroom 2
Focal Impact: The Service Pyramid
Michael Elkin, Facebook
The Service Pyramid is a framework based on the hierarchy of needs, with a service-oriented focus. It's used by Production Engineers at Facebook to navigate a sea of tasks and mountains of work by giving clear guidance on the relative priorities different kinds of work get. Using this guidance we can avoid common pitfalls, gauge how mature the reliability of a service is, and help divide work between PEs and other disciplines.
Come learn how one team of Production Engineers has been applying the Service Pyramid in the datacenter space: engaging with DCIM, controls, mechanical, and electrical engineers to build better monitoring & automation.
Michael Elkin, Facebook
Production Engineer at Facebook; responsible for an untold number of server reboots and once successfully managed to blame the network.
12:15 pm–1:40 pm
Luncheon, Sponsored by Baidu
Waterfront Ballroom
1:40 pm–3:00 pm
Monitoring and Alerting (continued)
Grand Ballroom 1
Smart Monitoring System for Anomaly Detection on Business Trends in Alibaba
Zhaogang Wang, Alibaba Group
Anomaly detection based on time series analysis has been a major focus in the field of monitoring, especially for business indicators. In Alibaba, hundreds of major KPIs need to be monitored in real time to detect abnormal events and raise alarms. The effectiveness of the alarms is strictly evaluated by human operators. Therefore, we proposed an intelligent anomaly detection method to make the business monitoring system more scalable and easier to maintain.
There are two major problems in traditional anomaly detection approaches:
- How to get accurate predictions based on seasonal time series data with complex noises and interferences.
- How to determine the segmental thresholds dynamically, and learn from human feedback, e.g. the manual labelling data, to improve the accuracy of anomaly detection while tolerating errors and even contradictions within the labelling data.
For the first problem, as a tradeoff among computational performance, accuracy, and robustness, we introduced a specific method to pre-process the data, and chose the STL method to make predictions on our business data.
For the second problem, we proposed a closed loop feedback method to determine the initial segment thresholds, and utilized the human labelling data to update the thresholds continuously.
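As a rough illustration of the approach described above (not Alibaba's production code), the sketch below decomposes a seasonal KPI with STL, flags points whose residual exceeds a threshold, and nudges that threshold using operator feedback; the library choice, field names, and constants are assumptions.

```python
# Illustrative sketch of STL-based anomaly detection with feedback-adjusted
# thresholds. Not Alibaba's production system.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

def detect_anomalies(series, period, k=3.0):
    """Flag points whose STL residual exceeds k robust standard deviations."""
    result = STL(series, period=period, robust=True).fit()
    resid = result.resid
    # Robust spread estimate: median absolute deviation scaled to sigma.
    sigma = 1.4826 * np.median(np.abs(resid - np.median(resid)))
    return np.abs(resid) > k * sigma

def update_threshold(k, labels):
    """Nudge the multiplier k using operator feedback.

    labels: (alerted, operator_confirmed) pairs from manually labelled data.
    False positives push k up; missed anomalies pull it down.
    """
    false_positives = sum(1 for alerted, confirmed in labels if alerted and not confirmed)
    missed = sum(1 for alerted, confirmed in labels if not alerted and confirmed)
    return max(1.0, k + 0.1 * false_positives - 0.1 * missed)

# Example: an hourly KPI with a daily cycle (period=24).
kpi = pd.Series(np.random.default_rng(0).normal(100.0, 5.0, 24 * 30))
flags = detect_anomalies(kpi, period=24)
```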
Zhaogang Wang, Alibaba Group
Zhaogang Wang is a senior technical specialist at Alibaba Group. From 2009 to 2015 he was a senior engineer on the SRE team at Baidu. His areas of interest include intelligent monitoring systems and fault diagnosis.
Graphite@Scale or How to Store Millions of Metrics per Second
Vladimir Smirnov, System Administrator, Booking.com
This is a story about dealing with metrics at scale. A lot of metrics.
This is our story of the challenges we’ve faced at Booking.com and how we made our Graphite system handle millions of metrics per second.
You will learn about one of the highest-load Graphite-compatible stacks, the problems it poses, and the challenges of maintaining it and scaling it further, pushing Graphite to its limits and beyond.
System Administrators and SREs who are interested in monitoring and scalability would find this useful.
Vladimir Smirnov, Booking.com
I've dealt with large-scale systems design and administration in IT for over 6 years. For the last 1.5 years I've been working at Booking.com, specializing in scaling our Graphite stack and improving its reliability and performance.
At Booking.com we have hundreds of backend servers and hundreds of terabytes of data, which we use to handle millions of metrics per second through our Graphite stack. The rate of growth is enormous and ever-increasing.
Service Lifecycle (continued)
Grand Ballroom 2
Data Checking at Dropbox
David Mah, Dropbox
At Dropbox, we’ve worked incredibly hard to build infrastructure that we are confident in trusting. A major aspect of our confidence comes from the verification of our data at rest, which gives us signal that our data will be properly usable when requests actually come in.
In this talk, we'll break down how to think about designing and building a consistency-checker system. We'll start with the actual needs/goals of such a system, then follow with the sub-components of the system. We'll include both distributed system design AND how to design your alert escalation workflow to be as simple as possible for human operators.
Attendees are expected to leave the session understanding how they could build consistency checkers for their own systems. This includes:
- Do you even need a consistency checker?
- What independent components need to exist?
- What is a good alerting + triaging workflow?
- What is involved in an auto-remediation mechanism for constraint failures? (A minimal sketch of a checker follows this list.)
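Here is a minimal sketch of how such a checker can hang together, under assumptions of our own (a stored per-record checksum as the invariant, and injected fetch/re-read/alert callables); this is illustrative and not Dropbox's design.

```python
# Minimal sketch of a data-at-rest consistency checker: scan records, verify
# an invariant (here, a stored checksum over a bytes payload), and escalate
# only when a re-check still fails. Names and layout are illustrative.
import hashlib
import logging

def verify_record(record):
    """Invariant: the stored checksum must match the checksum of the payload."""
    digest = hashlib.sha256(record["payload"]).hexdigest()
    return digest == record["checksum"]

def check_shard(fetch_records, reread, alert):
    """fetch_records yields records; reread re-fetches one by id; alert pages a human."""
    for record in fetch_records():
        if verify_record(record):
            continue
        # Re-read before escalating: transient races are common, real corruption is rare.
        if verify_record(reread(record["id"])):
            logging.info("transient mismatch on %s resolved on re-read", record["id"])
            continue
        alert(f"checksum mismatch on record {record['id']}")
```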
David Mah, Dropbox
David Mah is a Site Reliability Engineer at Dropbox who has built several monitoring mechanisms across Dropbox’s block storage and server file system infrastructure. He is also the author of Dropbox’s auto-remediation infrastructure.
Managing Server Secrets at Scale with a Vaultless Password Manager
Ignat Korchagin, Cloudflare
Operating a datacenter or a distributed network involves handling a lot of secrets. In most cases, you have to deal with at least four types of secrets for each piece of hardware: SSH server key, key to bootstrap your configuration management system, disk encryption key, and some per-server credentials to access other services. And often, these keys have to be set up before your configuration management kicks in, making the automation of this process more difficult.
Security-wise, it is important to control where and when those secrets are generated. Often, keys are generated by startup scripts. However, during initial boot (especially on diskless systems), the system may have a low entropy level in its internal random number generator, resulting in statistically weak generated keys.
And once you have your keys, you need to store them securely. This presents a chicken-and-egg problem: an encrypted disk is a great solution, but you need a key to access the encrypted disk, and where do you store that? Also, where do you store keys for diskless systems?
This talk presents an approach that combines hardware support and cryptography to deal with the above issues and unify and simplify secret management for your hardware fleet.
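The talk's actual approach combines hardware support and cryptography; purely as a generic illustration of the low-entropy concern above (an assumption of ours, not Cloudflare's mechanism), a first-boot script might refuse to generate long-lived keys until the kernel entropy pool looks healthy.

```python
# Illustrative precaution only: wait for the kernel's entropy estimate before
# generating a long-lived SSH host key at first boot.
import subprocess
import time

def wait_for_entropy(minimum_bits=256, timeout_s=300):
    """Poll the kernel entropy estimate until it reaches minimum_bits or we time out."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        with open("/proc/sys/kernel/random/entropy_avail") as f:
            if int(f.read()) >= minimum_bits:
                return True
        time.sleep(1)
    return False

if wait_for_entropy():
    # Only generate the host key once the pool looks healthy.
    subprocess.run(
        ["ssh-keygen", "-t", "ed25519", "-N", "", "-f", "/etc/ssh/ssh_host_ed25519_key"],
        check=True,
    )
else:
    raise RuntimeError("entropy pool never filled; refusing to generate weak keys")
```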
Ignat Korchagin, Cloudflare
Ignat is a systems engineer at Cloudflare working mostly on platform and hardware security. Ignat's interests are cryptography, hacking, and low-level programming. Before Cloudflare, Ignat worked as a senior security engineer in Samsung Electronics' Mobile Communications Division. His solutions may be found in many older Samsung smartphones and tablets.
3:00 pm–3:30 pm
Break with Refreshments
Grand Ballroom Foyer
3:30 pm–4:55 pm
Monitoring and Alerting (continued)
Grand Ballroom 1
Open-Falcon: A Distributed and High-Performance Monitoring System
Yao-Wei Ou, Wifire; Wei Lai, Didi Corporation
There are already many outstanding open-source monitoring systems. However, with rapidly growing businesses and the specific requirements of internet companies, the existing open-source monitoring systems do not perform or scale well enough.
In this situation, Open-Falcon's main distinguishing features compared to other monitoring systems are:
- Scalability: A scalable monitoring system is necessary to support rapid business growth. Each module of Open-Falcon is easy to scale horizontally.
- Performance: With the RRA (Round Robin Archive) mechanism, one year of history for 100+ metrics can be returned in just one second.
- High Availability: No critical single point of failure; easy to operate and deploy.
- Flexibility: Falcon-agent already has 400+ built-in server metrics. Users can collect their own custom metrics by writing plugins or simply running a script/program that relays metrics to falcon-agent.
- Efficiency: For easier management of alerting rules, Open-Falcon supports strategy templating, inheritance, multiple alerting methods, and callbacks on recovery.
- User-Friendly Dashboard: Open-Falcon can present multidimensional graphs, including user-defined dashboards/screens.
Yao-Wei Ou, Wifire
Yao-Wei Ou is the director of Wifire's overseas R&D center. Before joining 21Vianet Group (NASDAQ: VNET), he worked on 360 Total Security at Qihoo 360 (NYSE: QIHU). Yao-Wei has a Master's degree in Computer Science from Taiwan University. He has accumulated development experience in areas such as Windows kernel drivers and UI, the Android RenderScript framework, iOS applications, the LLVM compiler, and distributed systems. Since joining Fastweb in 2015, he has led the development of Fastweb's CDN monitoring system and is a core member of the Open-Falcon organization.
Wei Lai, Didi Corporation
Wei Lai is the technical director at Didi and the founder of the Open-Falcon software and community.
Talking to an OpenStack Cluster in Plain English
Wei Xu, Tsinghua University
Modern systems are built on layers and layers of abstraction, with tons of modules. These abstractions help development but make the system a nightmare to operate. OpenStack is a system of this kind: its states, including persistent (DB) states, are distributed across dozens of modules in the system. Operators have to access these states using obscure command-line tools that have hundreds of switches no one remembers. Integrating it with other open source projects like Ceph further complicates the problem. Reasoning about the inconsistencies of these states, one of the leading causes of user-visible bugs, is beyond the capability of current log-based monitoring systems.
As both system operation practitioners and academic researchers, we discuss our experience operating a 130-node OpenStack private cloud, as well as our research on automatically building a knowledge graph from system states and logs. We will demonstrate our natural language interface that can provide all information about the system, crossing layers and modules, with plain-English queries. Finally, we also present a simple anomaly detection system that indicates why a problem happens.
Wei Xu, Tsinghua University
Wei Xu is an assistant professor at the Institute for Interdisciplinary Information Sciences of Tsinghua University. He received his Ph.D. from UC Berkeley in 2010. He worked at Google as a software engineer before joining Tsinghua University.
Wei Xu has a broad research interest in distributed system design and big data. He has published 20+ research papers in leading venues.
He is also the director of Open Compute Project (OCP) Certification Lab in China. He is a recipient of the Chinese National Youth 1000 Program, graduate student advising award from Tsinghua, and faculty research awards from Google, IBM, and Microsoft.
Service Lifecycle (continued)
Grand Ballroom 2
Distributed Consensus Algorithms
Laura Nolan, Google
Processes crash or may need to be restarted. Hard drives fail. Natural disasters can take out several datacenters in a region. Site Reliability Engineers need to anticipate these sorts of failures and develop strategies to keep systems running in spite of them.
This usually means running systems across multiple sites, and this means that you need to make tradeoffs between availability and consistency of your system state.
This talk explores distributed consensus algorithms, such as Raft and Paxos, in production: how they work, how they perform, what can go wrong, and when to use them and when not to.
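One piece of the arithmetic underneath these algorithms, shown here as a small sketch rather than material from the talk: majority quorums of 2f+1 replicas always intersect, which is why Raft- and Paxos-style systems can tolerate f replica failures without losing agreed-upon state.

```python
# Quorum arithmetic behind majority-based consensus (Raft, Paxos variants).
def majority(n):
    """Smallest quorum size such that any two quorums of n replicas intersect."""
    return n // 2 + 1

def tolerated_failures(n):
    return (n - 1) // 2

for n in (3, 5, 7):
    q = majority(n)
    assert 2 * q - n >= 1  # two quorums always share at least one replica
    print(f"{n} replicas: quorum of {q}, tolerates {tolerated_failures(n)} failures")
```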
Laura Nolan, Google
Laura Nolan has been a Site Reliability Engineer at Google for four years, working on large data infrastructure projects and most recently, networking. Her background is in software engineering and computer science. She wrote the 'Managing Critical State' chapter in the O'Reilly SRE book, and is co-chair of SRECon EMEA 2017.
A Scheduling Framework For Large-Scale Based on Ansible
AiZhen Chen, Qiniu Cloud
This talk introduces how to use Ansible to build a distributed scheduling framework that powers an operations and maintenance platform for tens of thousands of nodes, handling data collection, task distribution, and other large-scale scheduling operations. The framework is divided into an interface layer, a distributed scheduling layer, and a driver layer to carry out data collection and task distribution for the underlying operating system, storage, and network. We also modified the Ansible source code to handle validation of incoming business parameters.
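As a toy sketch of the fan-out idea only (the layering, shard sizes, and inventory handling here are our assumptions, not Qiniu's framework), a scheduler process could shard the fleet and drive Ansible ad-hoc tasks concurrently.

```python
# Toy sketch: a scheduling layer shards the fleet, a driver layer runs one
# Ansible ad-hoc task per shard. Shard sizes and modules are illustrative.
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_on_shard(hosts, module, args=""):
    """Driver layer: run one Ansible ad-hoc task against a shard of hosts."""
    inventory = ",".join(hosts) + ","  # trailing comma = inline host list
    cmd = ["ansible", "all", "-i", inventory, "-m", module, "--forks", "50"]
    if args:
        cmd += ["-a", args]
    return subprocess.run(cmd).returncode

def schedule(all_hosts, module, args="", shard_size=500):
    """Scheduling layer: split the fleet into shards and drive them concurrently."""
    shards = [all_hosts[i:i + shard_size] for i in range(0, len(all_hosts), shard_size)]
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(lambda shard: run_on_shard(shard, module, args), shards))

# Example (data collection): gather facts from the whole fleet.
# schedule(hosts_from_cmdb(), "setup")   # hosts_from_cmdb is hypothetical
```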
AiZhen Chen, Qiniu Cloud
Chen AiZhen is an expert in cloud computing and a Docker product manager at Qiniu Cloud, where she designs large-scale computer operations platforms and is responsible for a number of cloud projects. She has many years of computer operations management experience as well as rich experience with PaaS platform architecture design, automated operations, and maintenance.
Tuesday, 23 May 2017
8:00 am–9:00 am
Continental Breakfast
Grand Ballroom Foyer
9:00 am–10:50 am
Monitoring and Alerting (continued)
Grand Ballroom 1
Draining the Flood—A Combat against Alert Fatigue
Yu Chen, Baidu Inc.
A monitoring system is an important tool for SREs to guarantee service stability and availability. Baidu's monitoring system, Argus, tracks hundreds of services across our distributed systems. With the increasing complexity of these services, the monitoring items and their corresponding anomaly detectors have grown to a magnitude where the generated alerts flood in from time to time. On an average day, Argus detects millions of warning events and sends out thousands of SMS alerts to on-call engineers. This works out to more than 100 alerts per person during the daytime and 30 during the night. When a severe failure occurs, the alerts arrive in a massive surge and are of little help to the engineers trying to fix the problem. Therefore, it is imperative to improve the detection accuracy of abnormal events, reduce the number of alerts, and organize them in a meaningful way.
In this talk, we will introduce our practice of leveraging machine learning methods to detect anomalies and group alerts in order to solve the above issues. We will also share some successful experiences, such as alert-based datacenter-level failure detection and alert-triggered automatic recovery techniques.
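Argus's grouping relies on machine learning, but the basic bucketing idea behind alert grouping can be sketched simply; the window size and field names below are illustrative assumptions, not Baidu's implementation.

```python
# Simple illustration of alert grouping: collapse a surge of raw warning
# events into one page per (service, time window).
from collections import defaultdict

WINDOW_S = 300  # group raw events into 5-minute buckets

def group_alerts(events):
    """events: iterable of dicts with 'service', 'timestamp' (epoch seconds), 'message'."""
    buckets = defaultdict(list)
    for event in events:
        key = (event["service"], event["timestamp"] // WINDOW_S)
        buckets[key].append(event)
    # One page per (service, window) instead of one SMS per raw event.
    return [
        f"{service}: {len(items)} related events in window starting {window * WINDOW_S}"
        for (service, window), items in sorted(buckets.items())
    ]
```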
Yu Chen, Baidu Inc.
Yu Chen is the data architect of the SRE team in Baidu. He previously worked in Microsoft Research Asia as a researcher. His working experience includes data mining, search relevance, and distributed systems.
Good, Better, Best, Mobile User Experience
Fred Wu, Tingyun
Traditional DC/cloud monitoring platforms focus mostly on high availability: restart, reboot, and reimage are the top three quick actions for operations engineers. In the current IT environment, however, mobile/cloud architects need a third monitoring platform that helps operations engineers understand more about the application and the end-user experience: widening monitoring from the data center alone to the mobile user, deepening monitoring from the system to the application and running code, and moving from alerting only to alerting first and taking action at the same time. The presentation will show the audience why this change matters, best practices, and the customer benefits of using such a platform, along with case studies of improved customer satisfaction and lowered average cost per customer.
Fred Wu, Tingyun
Fred Wu, VP of Tech & Service, has been with Tingyun for 2 years and has 18 years of experience in the China IT market, covering both technical team management and solution architecture for ADC and APM, with a focus on the OTA, e-commerce, CSP, and banking industries.
Service Lifecycle (continued)
Grand Ballroom 2
Reliable Launches at Scale
Sebastian Kirsch, Google Switzerland GmbH
How do you perform up to 70 product and feature launches per week safely and reliably? Google staffed a dedicated team of Site Reliability Engineers to solve this question: Launch Coordination Engineers (LCEs) work across Google's service space to audit new products and features for reliability, act as liaisons between teams involved in a launch, and serve as gatekeepers. Google designed its launch process to be easy on developers, scalable, and able to provide a consistent bar for reliability at launch. A common checklist helped make the results of a launch review reproducible. This launch checklist was later automated into a self-service tool curated by the LCE team and re-used by other teams. The specialisation in the challenges of launching enabled the LCE team to build a breadth of experience in applicable techniques and allowed them to function as vehicles for knowledge transfer between different parts of the company.
Attendees will learn about challenges unique to launch situations, the advantages of a dedicated LCE team, how to structure a launch process, the outline of a launch checklist, and select techniques for successful launches.
Sebastian Kirsch, Google Switzerland GmbH
Sebastian Kirsch is a Site Reliability Manager for Google in Zürich, Switzerland. He manages the team responsible for the reliability of Google's financial transaction systems. Sebastian joined Google in 2006 in Dublin, Ireland, and has worked both on internal systems like Google's web crawler, as well as on external products like Google Maps and Google Calendar. He specializes in the reliability aspects of new Google products and new features of existing products, ensuring that they meet the same high reliability bar as every other Google service.
Didi: How to Provide a Reliable Ridesharing Service
Ming Hua and Lin Tan, Didi Corporation
Didi, founded in 2012, is a ridesharing company similar to Uber; today it provides transportation services to close to 400 million users across over 400 cities in China. As the infrastructure team, we face a wide range of technical challenges:
- With 500% annual request growth, how do we guarantee service quality for a rapidly expanding system?
- With more than 400 releases per day, how do we iterate applications rapidly while minimizing risk?
- How do we prevent a large-scale system failure, which would immediately affect customers in over 400 cities?
- How do we put as much effort as possible into system stability alongside rapid business iteration?
In this session, we will share lessons learned from Didi:
- The importance of a specific stress test and system capacity estimation for our ridesharing service.
- The best practice of Didi’s standard delivery including application requirements and deployment procedure.
- The benefits of an intelligent load-balancing strategy based on cities and a high-availability system that spans different regions and zones.
- A measurement mechanism for system stability work.
Ming Hua, Didi Corporation
Ming Hua is a principal architect on Didi's SRE team, with a focus on service stability work.
Lin Tan, Didi Corporation
Lin Tan is an SRE Senior Engineer at Didi.
10:50 am–11:20 am
Break with Refreshments
Grand Ballroom Foyer
11:20 am–12:00 pm
Monitoring and Alerting (continued)
Grand Ballroom 1
Measuring the Success of Incident Management at Atlassian
Gerry Millar, Atlassian
When an incident happens, it's the worst possible time to be bogged down with confusing systems and processes. A well-defined Incident Management Process that's lightweight and supported by good automation offers a way to get fast, easy, and predictable results during an incident, but if you don't implement the right things in the right way, you risk bad results, such as high time-to-recovery, at critical times.
Find out how Atlassian drives value out of the Incident Management process and what metrics we use to track it. We'll also cover how we created automation to remove the overhead in managing incidents and deep dive into a case study to explain how it all ties together.
The target audience for this conceptual session is people who are involved in the management of incidents, such as SREs and delivery team members.
Gerry Millar, Atlassian
Gerry Millar is a member of the Reliability Process Group, the team responsible for Atlassian's incident response and recovery process. He trains staff, develops automation, measures incident performance, and drives post-incident reviews at Atlassian.
He was originally hired as a technical operations engineer for the company's cloud-facing products and as this group morphed into SRE Gerry's role morphed along with it. He now enjoys writing a bit of Python and ocean swimming around Sydney's beaches.
Service Lifecycle (continued)
Grand Ballroom 2
Managing Changes Seamlessly on Yahoo's Hadoop Infrastructure Servers
Vineeth Vadrevu, Yahoo
Yahoo's grid-ops team makes frequent changes (15 per week) to the following entities:
- Operating system configuration (packages, security patches)
- Hadoop (packages)
- Supporting applications, i.e. LDAP, Kerberos, logging, monitoring
For a platform like Hadoop at Yahoo's scale, stability, reliability, and uptime are crucial because of the sheer magnitude of what the platform caters to. The objectives set for pushing changes, so that the platform doesn't take a hit on stability, reliability, or uptime, are:
- Continuous Delivery and DevOps philosophies are strictly adhered to
- Every node always has the right configs
- Each change is pushed within a specific time period and is uniform across all nodes
- Changes made can be visualised, tracked, monitored, validated, and, if required, reverted easily
- Feasible mechanisms are made available in the pipeline to promote pushes across Hadoop clusters in a staged manner so that change rollovers are smooth
- Effective gates are in place to reduce the impact of a wrong change
- Any change reviewed and committed makes its way automatically to every node
This presentation will share and discuss:
- How the objectives are achieved through automation
- Experiences and lessons learned
We believe our practices can easily be adopted by SREs to effectively manage large-scale infrastructure.
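A minimal sketch of the staged, gated promotion idea described above; the stage fractions, soak time, and callbacks are illustrative assumptions rather than Yahoo's actual pipeline.

```python
# Sketch of a staged change rollout with a validation gate between stages.
import time

STAGES = [("canary", 0.01), ("small", 0.10), ("half", 0.50), ("all", 1.00)]

def rollout(change_id, nodes, push, healthy, revert, soak_s=600):
    """push/healthy/revert are callables supplied by the deployment tooling."""
    done = 0
    for stage, fraction in STAGES:
        target = min(len(nodes), max(done + 1, int(len(nodes) * fraction)))
        for node in nodes[done:target]:
            push(node, change_id)
        done = target
        time.sleep(soak_s)  # soak before the gate
        if not all(healthy(node) for node in nodes[:done]):
            # Gate failed: stop the rollout and revert everything touched so far.
            for node in nodes[:done]:
                revert(node, change_id)
            raise RuntimeError(f"{change_id} failed validation at stage {stage}")
```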
Vineeth Vadrevu, Yahoo
Vineeth is a Sr. Production Engineer, Principal at Yahoo. He is with the GridOps team at Yahoo, working on large-scale production engineering applications and infrastructure monitoring & management.
Vineeth has a B.Tech in Computer Science (affiliated with Andhra University) and an Executive MBA in Operations Management from Symbiosis International University, Pune, India.
12:00 pm–2:00 pm
Luncheon
Grand Ballroom Foyer
2:00 pm–3:25 pm
Incident Management
Grand Ballroom 1
Event Correlation: A Fresh Approach towards Reducing MTTR
Renjith Rajan and Rajneesh, LinkedIn
LinkedIn has hundreds of microservices spanning different data centres and depending on each other. Even though a microservice architecture has its advantages, the difficulty of identifying the problematic service during an incident or outage can contribute significantly to high MTTR and unnecessary escalations. The Event Correlation Engine is an attempt to algorithmically identify the responsible service quickly and escalate to the right team.
The Event Correlation Engine examines the entire service stack by looking at critical downstream latencies, error metrics, and other monitoring events to recommend a responsible service. It understands caller/callee communication across LinkedIn's complex microservices architecture. The engine uses dynamic thresholds, which it learns by processing the last 30 days of data, to provide effective recommendations. We'll discuss the approach we used in building it and how it is being used at LinkedIn to reduce MTTR and on-call escalations.
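As a rough sketch of the dynamic-threshold idea (baseline each downstream metric on its trailing 30 days, then rank services by how far they exceed that baseline), the snippet below is illustrative only; the statistics and field names are our assumptions, not the actual engine.

```python
# Illustration: learn a per-metric threshold from ~30 days of samples, then
# rank downstream services by how far their current value breaches it.
import statistics

def dynamic_threshold(history, k=3.0):
    """Threshold learned from trailing per-interval samples for one metric."""
    return statistics.fmean(history) + k * statistics.pstdev(history)

def rank_suspects(downstreams):
    """downstreams: service -> {'history': [floats], 'current': float}."""
    breaches = []
    for service, data in downstreams.items():
        threshold = dynamic_threshold(data["history"])
        if threshold > 0 and data["current"] > threshold:
            breaches.append((service, data["current"] / threshold))
    # Highest breach ratio first: the most likely responsible service.
    return sorted(breaches, key=lambda pair: pair[1], reverse=True)
```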
Renjith Rajan, LinkedIn
Renjith Rajan is a Staff Site Reliability Engineer with LinkedIn's production SRE team. He joined LinkedIn in July 2016 and prior to that he was working with Yahoo in the Ad Serving team.
Rajneesh, LinkedIn
Rajneesh is a Site Reliability Engineering Manager leading the production SRE team at LinkedIn, Bangalore. He joined LinkedIn in April 2015, and prior to that he was leading the Global Platforms team at Yahoo.
Automated Troubleshooting of Live Site Issues
Sriram Srinivasan, PayPal India Private Ltd.
Troubleshooting live site issues can be challenging, especially when our production stack is made up of over 2000 applications and services. PayPal's SRE team is also involved in troubleshooting and driving resolution of the various live site issues reported by the customer and merchant support teams. Today, to troubleshoot a live site issue, we go to multiple places depending on the issue at hand. Predominantly we look into the centralized application logs, and then we also check the various data sources and the in-house alerts. There is a lot of information to look through, and a lot of effort goes into gathering data about the failed attempt/transaction from various sources internal to PayPal. We therefore needed an automated way to troubleshoot issues, so we developed an Auto Troubleshooting Platform that aggregates the data from all the underlying data sources, troubleshoots, and records the results. The platform is built in a way that anyone can post any type of ticket and get it troubleshot automatically. Auto troubleshooting results are available within minutes and can be viewed through a portal. In this talk, I will highlight the journey that we have undertaken in making this happen.
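The aggregation pattern behind such a platform can be sketched in a few lines; the data-source names and fields below are invented for illustration and are not PayPal's.

```python
# Bare-bones sketch of the aggregation pattern: fan a ticket out to several
# data sources, collect the findings, and record a combined result.
from concurrent.futures import ThreadPoolExecutor

def troubleshoot(ticket, sources, record):
    """sources: name -> callable(ticket) returning findings; record persists the result."""
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        futures = {name: pool.submit(query, ticket) for name, query in sources.items()}
        findings = {name: future.result() for name, future in futures.items()}
    result = {"ticket_id": ticket["id"], "findings": findings}
    record(result)
    return result

# Hypothetical wiring: application logs, in-house alerts, and a transactions store.
# troubleshoot(ticket, {"app_logs": query_logs, "alerts": query_alerts}, save_result)
```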
Sriram Srinivasan, PayPal India Private Ltd.
Sriram Srinivasan is a technologist with over 14 years of experience in software development. He has worked in multiple teams at PayPal India Private Ltd. on various aspects of the software development lifecycle, including conception, design, development, and supporting products. In his current role as an architect on PayPal's SRE team, he had the opportunity to design and develop the Auto Troubleshooting Platform.
Service Expansion
Grand Ballroom 2
"A Unit Test Would Have Caught This:" Small, Cheap, and Effective Testing for Production Engineers
Andrew Ryan, Facebook Inc.
Note: This talk ends at 2:25 pm.
Production Engineers write tons of code for automating and managing distributed systems, ranging from one-off shell scripts to complex frameworks of many thousands of lines. But, in our profession, formal software testing has historically been weak. While there are a wide variety of ways to test software, in this talk, we'll focus on the methods that are easiest to implement and give the best results.
We'll discuss how teams use "small and cheap" tests at Facebook to give us the largest quality improvements with the lowest effort. Examples will include projects written in Bash and Python, as well as our Chef code, which is written in Ruby. We'll also discuss some of the ancillary benefits of testing, including recruiting, training, and team continuity.
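In the spirit of "small and cheap," here is the kind of tiny test this approach favors, written against a hypothetical helper of the sort that often hides untested inside ops scripts (illustrative only, not Facebook code).

```python
# A "small and cheap" pytest example for a hypothetical ops-script helper.
import pytest

def parse_host(line):
    """Parse 'hostname:port', defaulting the port to 22."""
    host, _, port = line.strip().partition(":")
    if not host:
        raise ValueError(f"empty hostname in {line!r}")
    return host, int(port) if port else 22

def test_parse_host_with_port():
    assert parse_host("db1.example.com:2222") == ("db1.example.com", 2222)

def test_parse_host_default_port():
    assert parse_host("web1\n") == ("web1", 22)

def test_parse_host_rejects_empty_hostname():
    with pytest.raises(ValueError):
        parse_host(":22")
```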
Andrew Ryan, Facebook, Inc.
Andrew has been a member of Facebook's Production Engineering team since 2009. He currently works as a member of the Traffic Infrastructure team, helping to make Facebook faster for everyone.
Testing for DR Failover Testing
Zehua Liu, Zendesk
Disaster Recovery is an important area in SRE. A simplified scenario is recovery from a full data centre failure. The Zendesk Chat backend infrastructure operates in a single data centre, and the way to be sure that DR works is to perform a real failover. Past failover attempts were full of surprises and unexpected issues, most of them having to do with the applications failing to work after failover for various reasons. These unexpected issues led to failed failover tests and/or extended maintenance windows due to the extra effort required to bring things back into order, causing a bad customer experience.
How can we increase our confidence that the system will work after failing over? We want to confidently declare that it should work, instead of typing the fingers-crossed emoji. In this talk, we share our experiment with setting up testing for the DR environment. The biggest question here is whether we would accept writes to the DR DBs other than those coming from replication from production. The compromise is between risks to production stability and risks of a failed DR failover. We will discuss the alternatives we considered and the final approach we adopted.
Zehua Liu, Zendesk Singapore
Zehua established and leads the tooling team at Zendesk, where he works on making sure that developers are happy developing what they want to develop and that the quality of the products they deliver is great. He is currently based in Singapore.
3:25 pm–3:55 pm
Break with Refreshments
Grand Ballroom Foyer
3:55 pm–4:45 pm
Incident Management (continued)
Grand Ballroom 1
Accept Partial Failures, Minimize Service Loss
Daxin Wang, Baidu Inc.
Large Internet products are too complex to recover completely from a failure in a short time, as root-cause localization and large-scale operations are very time-consuming. If failures can be isolated to a small part of the system, we can transfer user queries to the parts that still work, or cut off the failed minor subsystem, which is much faster than complete system recovery. This talk presents some practical experiences with failure isolation at Baidu.
First, we should have at least N+1 data center redundancy and eliminate unnecessary global "single-point" modules. All automated operations should be limited to executing in one data center first. When the system in one data center is damaged by a network or software failure, we can transfer user requests to other data centers rapidly, even automatically.
Second, we make the non-essential components of the system detachable. When one of them fails, it can be detached immediately to keep the major functions working for users.
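A minimal sketch of the detachable-component idea, assuming a simple failure counter and cool-off period; the thresholds and names are ours, not Baidu's implementation.

```python
# Minimal sketch: wrap a non-essential subsystem so repeated failures detach
# it for a while and callers fall back to a degraded (but working) result.
import time

class Detachable:
    def __init__(self, call, fallback, max_failures=5, retry_after_s=60):
        self.call, self.fallback = call, fallback
        self.max_failures, self.retry_after_s = max_failures, retry_after_s
        self.failures, self.detached_until = 0, 0.0

    def __call__(self, *args):
        if time.time() < self.detached_until:
            return self.fallback(*args)  # detached: degrade gracefully
        try:
            result = self.call(*args)
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.detached_until = time.time() + self.retry_after_s
            return self.fallback(*args)

# Example: recommendations are non-essential when serving a results page.
# recommend = Detachable(recommendation_service, lambda user: [], max_failures=3)
```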
Daxin Wang, Baidu Inc.
Daxin Wang has been working on Baidu's SRE team for more than 7 years, focusing on principles for building and operating highly available products, including monitoring, HA architecture, and safe automation.
Azure SREBot: More than a Chatbot—an Intelligent Bot to Crush Mitigation Time
Cezar Guimaraes, Microsoft
Azure SREBot is more than just a Chat Bot. Azure SREBot is a knowledgeable and intelligent engine that replaces tribal knowledge and automates incident response activities. It is also extensible, allowing other teams to add their own knowledge.
In this talk you will hear how SREBot is being developed and used to reduce the Time to Mitigate (TTM) Azure incidents. We will explain how it was designed and share the main issues we are facing.
Cezar Guimaraes, Microsoft
Cezar Guimaraes is a Site Reliability Engineer Lead on the Microsoft Azure team. He has more than 15 years of experience and has worked at Microsoft for 11 years as a Software Engineer. Currently he is working on Azure to identify and resolve problems that stand in the way of service uptime through engineering solutions such as bots and intelligence/correlation engines.
Service Expansion (continued)
Grand Ballroom 2
Merou: A Decentralized, Audited Authorization Service
Luke Faraone, Dropbox
Every organization has a system for access control, be it a spreadsheet, LDAP, IAM, something homegrown, or all of the above. Most of these approaches suffer from some combination of hard-to-use interfaces, incomplete coverage, lack of audit/compliance functionality, and bottlenecks for permission grants.
Merou is Dropbox's homegrown, open source authorization service. It manages a wide range of environments, from the corporate network to production data centers and cloud providers like AWS. And it is a transparent system of record that is managed in a decentralized manner by the individuals and teams that own the applications and services Merou provides authorization for.
We will present a general overview of our approach, highlight the features that make this system unique, sample some of our current use cases, and present some lessons learned.
Luke Faraone, Dropbox
Luke is a security engineer on Dropbox's Infrastructure Security team, which works to accelerate the secure deployment of internal systems. He is also Dropbox's representative to the TODO Group, an open group of companies who collaborate on running successful and effective open source projects and programs.
In his spare time, Luke contributes to the Debian project as a developer and as a member of the ftpmaster team, which oversees and maintains the well-being of Debian's official package repositories.
Canary in the Internet Mine
Brook Shelley, Turbine Labs
Software releases are often costly, slow, and difficult to roll back. This talk will suggest a different way forward with a flexible, incremental, reversible release workflow using tools like a reverse-proxy and customer-centric monitoring. I'll also talk about header and cookie based dev branch tests, and ways to use an incremental release to test internally as well. There will be a short demo of a release gone bad, and one that succeeds, in order to illustrate the principles of this talk. There are some blinky lights, but not many.
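The core of an incremental, reversible release can be reduced to a small routing rule; the header name and weights below are illustrative assumptions, not Turbine Labs' product.

```python
# Routing rule for an incremental, reversible release: a request goes to the
# canary when it carries a dev header or falls inside the current traffic weight.
import hashlib

CANARY_WEIGHT = 0.05  # ramp this up in small steps, or back to 0 to roll back

def choose_backend(headers, user_id):
    # Header/cookie override lets engineers exercise a dev branch directly.
    if headers.get("X-Canary") == "always":
        return "canary"
    # A deterministic hash keeps each user pinned to one side during the ramp.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    return "canary" if bucket < CANARY_WEIGHT * 10_000 else "stable"

assert choose_backend({"X-Canary": "always"}, "u1") == "canary"
```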
Brook Shelley, Turbine Labs
Brook Shelley lives in Portland, OR with her cat Snorri, where she builds a better software release & response process at Turbine Labs. Her writing has appeared in The Toast, Lean Out, Transfigure, & the Oregon Journal of the Humanities. She speaks at conferences on queer & trans issues, & is co-chair of the board of Basic Rights Oregon.
6:00 pm–8:00 pm
Reception
River Promenade
Wednesday, 24 May 2017
8:00 am–9:00 am
Continental Breakfast
Grand Ballroom Foyer
9:00 am–10:50 am
Performance Tuning
Grand Ballroom 1
InnoDB to MyRocks Migration in Main MySQL Database at Facebook
Yoshinori Matsunobu, Facebook
At Facebook, we open sourced MyRocks, a flash-optimized, space- and write-efficient MySQL database engine. We are in the process of migrating our main MySQL databases, which store the Facebook social graph and are massively sharded, low-latency, automated services, from InnoDB to MyRocks. We have been very successful so far and have reduced database size by half.
Compared to deploying new database software for new or non-critical services, replacing an existing, stable database running very critical services is much harder. You need to pay attention to a lot of things, like how to migrate existing data without stopping or slowing down services, how to migrate within a reasonable amount of time, and how to continuously verify that no data is corrupted.
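One common way to verify continuously during an online migration, sketched here under our own assumptions about chunking and fetchers (not Facebook's tooling), is to compare per-key-range checksums between the source and target replicas.

```python
# Illustrative per-chunk verification between an old and a new replica.
import hashlib

def chunk_checksum(rows):
    """Checksum a key range; rows must arrive in the same order on both sides."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(repr(row).encode())
    return digest.hexdigest()

def verify_chunk(fetch_source, fetch_target, table, lo, hi):
    """Compare one primary-key range [lo, hi) between the old and new replicas."""
    return chunk_checksum(fetch_source(table, lo, hi)) == chunk_checksum(fetch_target(table, lo, hi))

# Walk tables in modest key ranges and re-check mismatches later, since
# replication lag (not corruption) explains most transient differences.
```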
In this session, the speaker will tell the MyRocks production deployment story. The following topics will be covered:
- Overview of MySQL at Facebook
- What is MyRocks, and why we decided to create yet another database engine
- How the Facebook MySQL SRE team collaborated with the engineering team
- How we prepared, executed and monitored InnoDB to MyRocks migration
- Lessons learned from the migration
Yoshinori Matsunobu, Facebook
Yoshinori Matsunobu is a Production Engineer at Facebook and is leading the MyRocks project and deployment. Yoshinori has been around the MySQL community for over 10 years. He was a senior consultant at MySQL Inc. from 2006 to 2010. Yoshinori has created a couple of useful open source products/tools, including MHA (an automated MySQL master failover tool) and quickstack.
Golang's Garbage
Andrey Sibiryov, Uber
In the first part of this talk, we’ll chat about the design of Golang’s Garbage Collector (GC) and the theory behind modern garbage collection in general. We’ll go over common GC traits, such as whether a given GC is exact, generational, compacting and so on, try to figure out each trait’s upsides and downsides and how they relate to the current and future Golang’s GC designs.
In the second part, I’ll use a particular in-memory database project as an example to demonstrate some easy and some quite complicated tricks on memory management aimed at overcoming certain design shortcomings and trade-offs made in Golang’s GC. We’ll chat about the ubiquity and the cost of pointers, the runtime optimizations out there that might help reduce the pointer burden (like pointerless maps and channels, and uintptr-based weak pointers), about object pooling and how sync.Pool is a very different animal from your common channel-based pools.
In the third and final part, we’ll hit the rock bottom and chat about some dark magic with native heap allocations and alternative unsafe heaps.
Andrey Sibiryov, Uber
Andrey Sibiryov is an SRE at Uber, New York City. He is mostly focusing on infrastructure monitoring, observability & performance. Some of you might know Andrey through his open-source projects:
- Gorb: In-kernel load balancer. Based on IPVS, the same technology that powers the Docker Swarm routing mesh.
- Tesson: NUMA-aware application sharding tool.
Previously, Andrey led the Cloud Technologies department at Yandex where he built the Cocaine Cloud Platform, and worked on Helios, a Docker-based CI/CD stack at Spotify.
Capacity Planning
Grand Ballroom 2
Capacity Planning and Flow Control
Jun Zhang, Alibaba; Jiajiang Lin
Alibaba has a very rich set of businesses; behind every business there are many corresponding business systems, and each business system is distributed across multiple servers. In such a large distributed system architecture, how to allocate resources to each system becomes a major challenge for site reliability (especially during Double 11):
- How much resources do we need to allocate for each system?
- How can we ensure that our resource allocation results are consistent with our final business expectations?
We do a reasonable allocation of system resources through Capacity Planning and ensure that the results of Capacity Planning are in line with business expectations through Full-link Pressure Test.
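For the first of those questions, the sizing itself reduces to simple arithmetic once peak demand and per-server capacity have been measured (typically via pressure tests); all numbers below are made up for illustration.

```python
# Back-of-the-envelope capacity sizing; every number is illustrative.
import math

peak_qps = 1_200_000      # expected peak traffic, e.g. during Double 11
per_server_qps = 3_000    # sustainable QPS per server, measured by pressure testing
target_utilization = 0.6  # headroom for failover and unexpected spikes
spares = 1                # N+1-style spare capacity, in servers

servers = math.ceil(peak_qps / (per_server_qps * target_utilization)) + spares
print(servers)  # 668: the allocation to then validate with a full-link pressure test
```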
From this topic you can learn:
- Capacity Planning methods and how exactly Alibaba does Capacity Planning
- The technical implementation of the Full-link Pressure Test, which is the most important weapon for site reliability at Alibaba
Jun Zhang, Alibaba
Jun works in Alibaba as a Senior Technical Specialist of High Availability Architecture. He focuses on how to improve site availability, leading to the construction of a series of high-availability solutions, such as Capacity Planning, Full-link Pressure Test, Flow Limit, Automatic Degradation, Traffic Scheduling, and so on.
Managing Capacity @ LinkedIn
Anuprita Harkare, LinkedIn
Have you ever struggled with planning and managing your data ingestion platform? Have you spent countless nights figuring out whether you have enough capacity or have over-provisioned? Is cost to serve a concern for you? If yes, you are not alone.
LinkedIn as a platform serves its content to millions of unique users. This generates a huge volume of data from member profiles, connections, posts, and other activities on the platform. These voluminous and fast-moving datasets need to be effortlessly ingested from different data sources and made available for analysis with low latency and the same level of data quality. In this talk you will learn how we are tackling this at LinkedIn.
Anuprita Harkare, LinkedIn
Anuprita works for LinkedIn on the Data Systems team as a Site Reliability Engineer. In her current role, she is responsible for LinkedIn's ETL infrastructure and takes care of critical data pipelines. Her day job consists of automation and building tools using Python and Java; apart from that, she actively works on Hadoop, Hive, Pig, Gobblin, and a multitude of other big data technologies.
10:50 am–11:20 am
Break with Refreshments
Grand Ballroom Foyer
11:20 am–11:55 am
The Future of SRE, What's New and Next?
Grand Ballroom
Distributed Scheduler Hell
Matthew Campbell, DigitalOcean
At DigitalOcean we went through the growing pains of trying out five of the top Docker container schedulers (Mesos, Kubernetes, Docker Swarm, Nomad), and we even tried manual scheduling of containers. We will walk through how we chose different schedulers for different applications, and share tips and tricks for choosing a scheduler to use. We will also discuss in detail the internals of distributed schedulers and why most are written in Go.
This talk will start off by teaching the basics of how process schedulers work in Linux. Then we will expand to how schedulers like Nomad and Kubernetes work across entire clusters, and go into detail about how to build a simple scheduler in Go. DigitalOcean also uses a Go-based scheduler for deploying virtual machines, and we will delve into how scheduling different types of resources, like containers vs. virtual machines, works. We will also glance over some popular scheduling systems.
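A toy version of the placement loop most cluster schedulers share (filter the nodes that fit, then score the survivors) is sketched below in Python for brevity; the talk builds its example scheduler in Go, and the resource fields here are assumptions.

```python
# Toy placement loop: filter nodes with enough free resources, then best-fit.
def schedule(task, nodes):
    """task and nodes carry 'cpu'/'mem' demands and 'cpu_free'/'mem_free' capacities."""
    fits = [n for n in nodes if n["cpu_free"] >= task["cpu"] and n["mem_free"] >= task["mem"]]
    if not fits:
        return None  # nothing fits: queue the task or trigger a scale-up
    # Best fit: pick the node left with the least slack, packing onto fewer machines.
    best = min(fits, key=lambda n: (n["cpu_free"] - task["cpu"]) + (n["mem_free"] - task["mem"]))
    best["cpu_free"] -= task["cpu"]
    best["mem_free"] -= task["mem"]
    return best["name"]

nodes = [{"name": "n1", "cpu_free": 4, "mem_free": 8},
         {"name": "n2", "cpu_free": 2, "mem_free": 4}]
print(schedule({"cpu": 2, "mem": 4}, nodes))  # n2: the tighter (best) fit
```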
Matthew Campbell, DigitalOcean
Matthew Campbell is a Microservices scalability expert at DigitalOcean where he builds the future of cloud services. He is writing a book called "Microservices in Go". He has spoken at over 20 international conferences, including GothamGO, Hashicorp Conf, JS Conf, GO India, UK GOlan, MicroXchng, and Prometheus Conf. He blogs at http://kanwisher.com. Matthew was a founder of Errplane and http://www.langfight.com. In the past, he's worked at Thomson Reuters, Bloomberg, Gucci, and Cartoon Network.
12:00 pm–2:00 pm
Luncheon
Waterfront Ballroom
2:00 pm–3:25 pm
The Future of SRE, What's New and Next? (continued)
Grand Ballroom
SRE Your gRPC—Building Reliable Distributed Systems (Illustrated with gRPC)
Grainne Sheerin and Gabe Krabbe, Google
Distributed systems have sharp edges, and we have a wealth of experience cutting ourselves on them. We want to share our experience with SREs elsewhere, so they can skip making the same mistakes and join us making exciting new ones instead!
We will share practical suggestions from 14 years of failing gracefully:
- In a distributed service, every component is a frontend to another one down the stack. How can it deal with backend failures so that the service as a whole does not go down?
- In a distributed service, every component is a backend for another one up the stack. How can it be scaled and managed, avoiding overload and under-use?
- In a distributed service, latency is often the biggest uncertainty. How can it be kept predictable?
- In a distributed service, availability, processing, and latency cost contributions are hard to assign. When things (inevitably) go wrong, which components are to blame? When they work, where are the biggest opportunities for improvement?
We will cover best and worst practices, using specific gRPC examples for illustration.
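As one example of the "failing gracefully" basics (a generic sketch, not gRPC-specific code): every call can carry a deadline and retry with capped, jittered backoff, so a struggling backend is not hammered into a full outage.

```python
# Deadlines plus jittered exponential backoff: fail fast rather than amplify overload.
import random
import time

def call_with_retries(call, timeout_s=0.5, attempts=3, base_backoff_s=0.1):
    """call(timeout=...) is assumed to raise TimeoutError on a missed deadline."""
    for attempt in range(attempts):
        try:
            return call(timeout=timeout_s)  # every attempt carries a deadline
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # retry budget exhausted: let the caller degrade gracefully
            # Full jitter keeps a fleet of clients from retrying in lockstep.
            time.sleep(random.uniform(0, base_backoff_s * 2 ** attempt))
```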
Grainne Sheerin, Google
Grainne is a Site Reliability Engineer at Google Ireland. She's a tech lead responsible for Ad Serving infrastructure and has 5 years of experience in production engineering. She is a physicist, having earned a doctorate in nanoscience from Dublin City University. Prior to Google, she masqueraded as a strategic relationship manager for Reuters and a network engineer for HEAnet.
Gabe Krabbe, Google
Gabe Krabbe has been a Site Reliability Engineer at Google for over 12 years. He has worked on, and sometimes against, multiple generations of the Ads management and serving infrastructure. Before joining Google, he worked for various companies as a system administrator. He frequently tells his servers and his children that he doesn't care who started it, because it takes two to fight.
Operationalizing DevOps Teaching
John Contad and Matt Witherow, REA Group
We started with a premise: if we provided knowledge to developers on how to operationalize services, then they would do it.
In this talk, I'll cover the many ways that REA Group has experimented with teaching, from broadcast 50-person classes (think: SNS), to one-to-one scheduled mentorship roles (cronjobs), to pull-down models (SQS). We explore why teaching and mentorship are important to the future of the web, and how formative an experience it is for the mentors.
These experiments have now formed an internal curriculum, as well as an open-source model for teaching DevOps fundamentals to women who want to explore the space (see: DevOps Girls Melbourne).
John Contad, REA Group
John Contad is a Senior Systems Engineer @ REA Group, motorcyclist, passionate teacher of all things DevOps and AWS. Always with whiteboard markers. Prior to this, he was a startup rat in Sunshine Coast, working with companies in Malaysia, Singapore, and the US.
Matt Witherow, REA Group
Matt Witherow is an engineer on the Global Architecture Team @ REA Group. He grew up on a remote cattle farm in country Australia. In the Melbourne community he's been a tutor, lecturer, and now aspiring mentor, helping to empower others.
He cooks a pretty good breakfast too.
3:30 pm–4:00 pm
Break with Refreshments
Grand Ballroom Foyer
4:00 pm–5:00 pm
The Future of SRE, What's New and Next? (continued)
Grand Ballroom
Scaling Reliability at Dropbox: Our Journey towards a Distributed Ownership Model
Sat Kriya Khalsa, Dropbox
Reliability is a concern for every company. Dropbox has grown enormously in its 10 years. Given the size of our user base and the data entrusted to us, we’ve come to understand that reliability cannot be owned by any one person or team. This talk will share how Dropbox infrastructure has scaled the responsibility for reliability and provide a framework in order to apply these learnings to companies of any size!
Sat Kriya Khalsa, Dropbox
Sat Kriya is a TPM at Dropbox focused on all things Reliability. She's obsessed with data-driven problem solving and skiing.