LISA18 Program Grid
View the program in mobile-friendly grid format.
Already Registered for LISA18?
Curious to know who else at LISA does what you do? Conferences are a great way to meet your birds-of-a-feather, and this year we've listed various roles on the LISA18 mobile app website. If you'd like, create a profile with your picture and add yourself to a role. You can choose whether to make your profile and schedule public. Then choose the contact details you would like to share, so people can find you, and you can find them.
Downloads for Registered Attendees
(Sign in to your USENIX account to download these files.)
Monday, October 29, 2018
7:30 am–5:00 pm
On-Site Registration and Badge Pickup
5th Ave Prefunction
7:30 am–8:45 am
Continental Breakfast
Sponsored by Squarespace
Legends Prefunction
9:00 am–10:30 am
Keynote Addresses
The Beginning, Present, and Future of Sysadmins
Tameika Reed, Founder of Women In Linux
The journey of System Administration has changed over time. We have seen the early years of containers (jails), then virtualization, and now the rise of containers again. Currently, plenty of automation tools ease the job of migrating from VMware to the cloud with Terraform, moving monolithic applications to containers, and building a disaster recovery plan in the cloud. So, what is next for system administrators? You have probably seen the rise of:
- Infrastructure Engineer
- Automation Engineer
- Site Reliability Engineer
- Blockchain Infrastructure Engineer
- AI and ML Infrastructure Engineer
- Chaos Engineer
- System Architect
Potential up-and-coming areas to explore include:
- Quantum Computing
- Quantum Key Distribution and Cryptography
- Intuition Engineering
In this keynote I will explore the components that are transferable across all of those disciplines, which system administration skills you should keep sharpened, and which new skills you might want to add to your professional toolbox.
10:30 am–11:00 am
Break with Refreshments
Legends Prefunction
11:00 am–12:30 pm
Talks I
Introducing Reliability Toolkit: Easy-to-Use Monitoring and Alerting
Janna Brummel and Robin van Zijll, ING
By definition, SREs are responsible for the reliability of sites, but what if they don’t own any sites themselves? Within ING, the largest bank in the Netherlands, BizDevOps teams are autonomous and responsible for both the build and the run of their services. In theory, that could make the existence of SRE obsolete, right? How can you improve availability for end customers in an environment of engineers with full service ownership? How do you convince without the power of intervention? How do you improve without assigning blame?
We’ll explain how we, a team of 8 SREs among 1700 DevOps engineers, try to improve stability by focusing on software engineering. We created the Reliability Toolkit to help BizDevOps teams with their reliability challenges in the fields of white-box monitoring and alerting while minimizing toil.
This talk will explain:
- Our SRE team purpose and why we think our approach with heavy focus on software engineering works for our organization
- The concept of the Reliability Toolkit and an introduction to its components and their setup (Prometheus, Alertmanager, Grafana, NGINX Log Aggregator, SMS and ChatOps functionalities)
- How we provision the Reliability Toolkit
- How we convince, onboard and educate BizDevOps teams to use the Reliability Toolkit
We will conclude our talk with a demo of the Reliability Toolkit.
Janna Brummel, ING
Janna is IT chapter lead for the site reliability engineering squad within the Domestic Bank (Retail) for ING in the Netherlands. Her job is to help other teams within the bank to know more about their services' performance and to be able to respond more efficiently to incidents. Before this, Janna worked as business manager and dev engineer of credit cards and debit cards back end systems.
Robin van Zijll, ING
Robin is a Site Reliability Engineer @ ING and PO of the SRE Team, and has years of experience in being on-call for all services offered to our retail customers.
Incident Management at Netflix Velocity
Dave Hahn, Netflix
Netflix—as a service and a system—goes through an enormous amount of change all the time. Our engineering teams make 1000s of changes a day while our customers stream 100,000,000s of hours of entertainment every day. At that velocity, an outage seconds or minutes long has a real and noticeable impact on our customers. Stir in some Chaos Engineering and things become even more unpredictable.
The talk begins with a story. Netflix had a healthy relationship with Chaos Monkey—our tool to ensure that instance loss didn’t affect a running service. We’d had such good luck that we extended our plans from just Chaos Monkey to more Monkeys that would do nasty things to our environment. A new entry, Latency Monkey, would help us improve the health of the interactions between our microservices by injecting latency and errors at our common IPC layer. What we thought was a safe, little experiment went completely off the rails. The centralized SRE team, called CORE, realized that we’d have to think differently about outages and managing them if the company was going to be successful moving forward.
This is the story of how the centralized SRE team at Netflix changed and adapted to help the service and our engineering teams prepare for and handle problems—big and small—when they do occur.
Key Takeaways:
- How Netflix prepares for failures
- Incident Handling at velocity requires special expertise
- Preparation and training of everyone that runs services is key for quick recovery
- You should be spending more time learning after an incident than you spend managing during it
- Outages being unique is an excellent goal—it takes work to make it happen
Dave Hahn, Netflix
Dave Hahn is a Senior SRE in the Cloud Operations & Reliability Engineering organization at Netflix. He has designed tools and systems used by many teams in the organization to support the Netflix service. He has decades of experience in systems operations, networks, reliability, cron jobs, cable termination, grep, and not taking himself very seriously.
Talks II
SLO Burn—Reducing Alert Fatigue and Maintenance Cost in Systems of Any Size
Jamie Wilkinson, Google
Based on a true story.
As systems grow, they get more components, and more ways to fail. The alerts of the last system's design can slowly "boil the frog", and all of a sudden the SRE team finds they have no time left to address scaling problems because they're constantly firefighting. Alert fatigue sets in and the team will burn out.
Naturally, maintenance work will always increase as the system itself grows. To make alerting sustainable, page only on symptoms instead of causes, and even then only by declaring what the acceptable threshold of the symptom is, also known as the SLO (and its complement, the error budget).
Even at Google scale, many teams have yet to implement the change in their monitoring needed to realize SLO-based alerts. But systems don't need to be the size of a planet to benefit from these patterns.
Whether you're on call for 10 machines or 10 datacenters, in this talk a well-rested champion of work/life balance will show you how to select service objectives and then construct robust, low-maintenance alerting rules, using Prometheus for a live demonstration. We'll also discuss the tooling required to help such a system retain observability in the absence of noisy cause-based alerts, now that they're no longer telling you exactly which components are failing.
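As a rough, editorial illustration of the error-budget idea (not material from the talk itself), here is a minimal Python sketch of burn-rate math; the SLO target, the 14.4 paging threshold, and the request counts are all assumptions.

```python
# Minimal sketch of SLO burn-rate alerting (illustrative only; the talk
# demonstrates the idea with Prometheus rules, not this script).

SLO_TARGET = 0.999               # hypothetical availability objective
ERROR_BUDGET = 1 - SLO_TARGET    # fraction of requests allowed to fail

def burn_rate(errors: int, requests: int) -> float:
    """How many times faster than 'sustainable' the budget is being spent."""
    if requests == 0:
        return 0.0
    return (errors / requests) / ERROR_BUDGET

def should_page(errors: int, requests: int, threshold: float = 14.4) -> bool:
    # A sustained burn rate of ~14.4 would exhaust a 30-day budget in
    # roughly 2 days, so it is a common short-window paging threshold
    # (assumption for this sketch).
    return burn_rate(errors, requests) > threshold

if __name__ == "__main__":
    # e.g. 60 failed out of 10,000 requests in the last few minutes
    print(burn_rate(60, 10_000))     # 6.0: elevated, maybe a ticket
    print(should_page(60, 10_000))   # False
    print(should_page(200, 10_000))  # True: paging-worthy burn
```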
Jamie Wilkinson, Google
Jamie Wilkinson is a Site Reliability Engineer at Google. A contributing author to the "SRE Book", he has presented on contemporary topics at prominent conferences such as linux.conf.au, Monitorama, PuppetConf, Velocity, and SREcon. His interests began in monitoring and automation of small installations and continue with human factors in automation and systems maintenance on large systems. Despite over 15 years in the industry, he is still trying to automate himself out of a job.
Serverless Data Processing and Machine Learning
Sunil Mallya, AWS
Serverless computing reduces infrastructure complexity, provides fine-grained billing, and offers easy scalability. Setting up concurrent data processing pipelines that support many users is a complex task; moreover, utilization, cost, and performance are hard to tune for these pipelines. Machine Learning (ML) workloads are on the uptick, and the likes of Apache Spark aim to provide an end-to-end data-to-ML story, but they run into the same complexities previously mentioned. These aren't two disjoint data and ML workflows; they share a lot in common.
In this talk, I will present a serverless data and machine learning pipeline that includes a MapReduce framework built using Amazon S3 and AWS Lambda. We'll see how it can help alleviate issues like concurrent processing, cost, and scaling. I will also showcase how machine learning algorithms like K-Means clustering can be built on top of this framework, exploiting its inherently distributed architecture. We'll then discuss the benefits and challenges of the framework with a focus on production deployments for ML models.
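To make the pattern concrete (this is an editorial sketch, not the speaker's framework), the following Python/boto3 snippet shows a toy word-count flow in which Lambda "map" invocations write partial results to S3 and a "reduce" invocation merges them; the bucket name, key layout, and event shape are hypothetical.

```python
# Sketch of a MapReduce-style flow on AWS Lambda + S3 (illustrative only).
import json
from collections import Counter

import boto3

s3 = boto3.client("s3")
BUCKET = "example-mapreduce-bucket"   # hypothetical

def mapper_handler(event, context):
    """Lambda 'map' task: count words in one input object, write a partial result."""
    key = event["input_key"]                                   # hypothetical event shape
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read().decode()
    counts = Counter(body.split())
    out_key = f"intermediate/{key.rsplit('/', 1)[-1]}.json"
    s3.put_object(Bucket=BUCKET, Key=out_key, Body=json.dumps(counts))
    return {"partial_key": out_key}

def reducer_handler(event, context):
    """Lambda 'reduce' task: merge all partial results into a final object."""
    total = Counter()
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix="intermediate/"):
        for obj in page.get("Contents", []):
            part = json.loads(
                s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
            )
            total.update(part)
    s3.put_object(Bucket=BUCKET, Key="output/result.json", Body=json.dumps(total))
    return {"result_key": "output/result.json"}
```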
Sunil Mallya, AWS
Sunil is a Sr. AI Solutions Architect focused on Deep Learning in the Machine Learning Lab team at AWS, working with customers in various industry verticals. Prior to that, he co-founded Neon Labs, a neuroscience- and machine-learning-based image analysis and video thumbnail recommendation company. He’s also worked on building large-scale, low-latency systems at Zynga and has an acute passion for serverless computing. He holds a master’s degree in computer science from Brown University.
Training I
Modern Infrastructure Provisioning and CI/CD with Terraform & Terratest
Duncan Hutty, Oracle Cloud Infrastructure
The days of handcrafted systems, services, and infrastructure are over: it's the era of Infrastructure as Code.
We write code for orchestration tools to run configuration management systems to ensure that individual application systems are easily reproducible. We write test suites to verify the behaviour of those applications.
In the same way, with HashiCorp's Terraform, we can write code to provision backend infrastructure such as instances, networks, storage, and databases, plus ancillary services such as load balancers/proxies, DNS, TLS termination, and more. We can then reproduce the whole thing in multiple environments on many platforms, including AWS, Azure, Oracle Cloud Infrastructure, and GCP, for testing, development, and deployment.
This introduction to Terraform will show how to declaratively build and modify your infrastructure with code that can be versioned, tested and deployed with the rigour of modern release engineering, all the way to Continuous Integration and Deployment.
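The training itself uses Terraform and Terratest (a Go library); purely as a hedged illustration of wiring Terraform into a CI pipeline, the sketch below shells out to the standard Terraform CLI from Python and gates the build on the plan result. The ./infra directory is a made-up module path.

```python
# Rough CI gate around Terraform (illustration, not Terratest).
# Assumes the terraform binary is on PATH and "./infra" holds the configuration.
import subprocess
import sys

TF_DIR = "./infra"  # hypothetical module directory

def run(*args: str) -> int:
    print("+ terraform", *args)
    return subprocess.call(["terraform", *args], cwd=TF_DIR)

def main() -> None:
    if run("init", "-input=false") != 0:
        sys.exit("terraform init failed")
    # -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes pending
    rc = run("plan", "-input=false", "-detailed-exitcode", "-out=tfplan")
    if rc == 1:
        sys.exit("terraform plan failed")
    if rc == 2:
        print("plan has pending changes; a later pipeline stage could apply tfplan")
    else:
        print("infrastructure matches the code; nothing to do")

if __name__ == "__main__":
    main()
```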
Duncan Hutty, Oracle Cloud Infrastructure
Duncan Hutty tries to make/find something useful and "Make it Work, Make it Right, Make it Better". Mostly this entails reading and writing code to integrate other people's software and trying to not rant too hard when sharing the vision of code & systems that are reproducible, testable and no more complex than necessary.
Training II
Linux Container Internals 1 & 2
Scott McCarty, Red Hat
Have you ever wondered how Linux Containers work? How they really work, deep down inside? How do sVirt/SELinux, SECCOMP, namespaces, and isolation really work? How does the Docker Daemon work? How does Kubernetes talk to the Docker Daemon? How are container images made?
Well, we will answer these questions and more. If you want a deep technical understanding of containers, this is the lab for you. Join Red Hat engineers as we walk you through the deep, dark internals of the container host and what’s packaged in the container image. These hands-on labs will give you the knowledge and confidence it takes to leverage your current Linux technical knowledge and apply it to containers.
Scott McCarty, Red Hat
At Red Hat, Scott McCarty helps drive the roadmap around container runtimes, tools, and images within the OpenShift Container Platform and Red Hat Enterprise Linux. He also liaises with engineering teams, both at the product and upstream project level, to help drive innovation by using feedback from Red Hat customers and partners as drivers to enhance and tailor container features and capabilities for the real world of enterprise IT.
2:00 pm–3:30 pm
Talks I
Designing for Failure: How to Manage Thousands of Hosts Through Automation
Brandon Bercovich, Uber
At Uber, we run thousands of services on top of many thousands of hosts using Apache Mesos with the Apache Aurora framework. This setup ensures that when a host breaks, a service will automatically get rescheduled to another host, but what happens to the host? What happens when a host is still running services but is misconfigured or has a hardware fault that may be affecting the performance of the service? What about when you want to upgrade the kernel or other software across your fleet? At Uber, we created CLM, or Cluster Lifecycle Manager, which is used to answer these questions in a safe and automated way. In this talk we will go through the architecture we are using to make this possible and how we ensure our actions don't impact services.
Brandon Bercovich, Uber
I've been working in the industry for over 18 years as a Systems Administrator, a DBA, and now an SRE. At Uber my team manages our Compute platform through automation.
Familiar Smells I’ve Detected in Your Systems Engineering Organization and How to Fix Them
Dave Mangot
Over the course of my career, I’ve had the opportunity to work with a number of organizations on their operational maturity. After doing “systems archeology” a number of times when starting at new organizations, I began recognizing certain signature “smells” that indicated something could be improved, and I often had a pretty good idea of how those situations came to be.
Things like the volume of pager alerts can be indicators of poor signal-to-noise ratios, overworked infrastructure, or broken architectures. Things like elaborate change control can be signs of inadequate testing or a lack of automation (as if a review by people unfamiliar with the changes makes them safer). Recovery mechanisms that are never tested are never going to actually work when they are needed, except in the most trivial of cases.
There are many such examples with single points of failure, competing change mechanisms, scaling challenges, outsourcing of manual automation (not a typo), badly scoped runbooks, immature monitoring, multi-generational monitoring systems, and more, that are signs that we can do better.
In this talk, we’ll talk about some fun that was had over the years, maturing different infrastructures, learning from failure and success, and how we can take lessons from “mistakes were made” scenarios to increase our performance, lower our MTTR, and help those in the systems engineering organization love their job.
Dave Mangot
Dave Mangot is the author of Mastering DevOps from Packt Publishing. He’s currently the head of Site Reliability Engineering (SRE) for the SolarWinds Cloud companies and an accomplished systems engineer with over 20 years' experience. He has held positions in various organizations, from small startups to multinational corporations such as Cable & Wireless and Salesforce, from systems administrator to architect. He has led transformations at multiple companies in operational maturity and in a deeper adherence to DevOps thinking. He enjoys time spent as a mentor, speaker, and student to so many talented members of the community.
Code-Yellow: Helping Operations Top-Heavy Teams the Smart Way
Michael Kehoe and Todd Palino, LinkedIn
All engineering teams run into trouble from time to time. Alert fatigue, caused by technical debt or a failure to plan for growth, can quickly burn out SREs, overloading both development and operations with reactive work. Layer in the potential for communication problems between teams, and we can find ourselves in a place so troublesome we cannot easily see a path out. At times like this, our natural instinct as reliability engineers is to double down and fight through the issues. Often, however, we need to step back, assess the situation, and ask for help to put the team back on the road to success.
We will look at the process for Code Yellow, the term we use for this process of “righting the ship”, and discuss how to identify teams that are struggling. Through a look at three separate experiences, we will examine some of the root causes, what steps were taken, and how the engineering organization as a whole supports the process.
Michael Kehoe, LinkedIn
Michael Kehoe is a Staff SRE at LinkedIn who works on building scalable monitoring infrastructure, reliability principles and incident management. Michael previously interned at NASA Ames on their PhoneSat project. Michael's key interests lie in network engineering and automation.
Todd Palino, LinkedIn
Todd Palino is a Senior Staff Site Reliability Engineer at LinkedIn, tasked with keeping some of the largest Kafka and Zookeeper deployments fed and watered. Previously, he was a Systems Engineer at Verisign, developing service management automation for DNS, networking, hardware and operating systems. In his spare time, Todd is the developer of the open source project Burrow, a Kafka consumer monitoring tool, and can be found sharing his experience on both SRE and Apache Kafka at industry conferences and tech talks. He is also the co-author of Kafka: The Definitive Guide, available from O’Reilly Media.
Talks II
How Bad Is Your Toil?: Measuring the Human Impact of Process
Kurt Andersen, LinkedIn
Across a distributed, mostly embedded SRE organization, we knew that some teams were being hit harder by toil, and especially by on-call response load, but how do you quantify that in order to identify contributing factors and improve quality of life for everyone? I'll talk about a series of strategies that we have used separately and in combination over the course of several years to surface some indications of the human load impact. While some of the systems use internal terminology, the processes can be used by any organization that is interested and willing to see what the data might reveal—they just need the local jargon translated into their own dialect.
The talk will include the specific surveys and other tooling that we have experimented with over the last three years as well as what we have learned along the way.
Kurt Andersen, LinkedIn
Kurt Andersen is one of the co-chairs for SREcon18 Americas and has been active in the anti-abuse community for over 15 years. He is currently the senior IC for the Product SRE (site reliability engineering) team at LinkedIn. He also works as one of the Program Committee Chairs for the Messaging, Malware, and Mobile Anti-Abuse Working Group (M3AAWG.org). He has spoken at M3AAWG, Velocity, SREcon, and SANOG on various aspects of reliability, authentication, and security.
How to Be Your Security Team's Best Friend
Emily Cole, Agari Data, Inc.
Operations teams and Security teams don't always get along, and Security teams have a reputation for being grumpy and saying "no" a lot. However, there are a few simple things you can do (or may already be doing) that will make your job easier, and make your Security team less grumpy. In this session, you will get examples of security best practices as well as tips from someone who has worked in both Ops and Security.
Emily Cole, Agari Data, Inc.
Emily is currently a Senior Security Engineer for Agari Data, Inc., and spends a lot of time thinking about the ways that DevOps and Security intersect. Emily has performed critical organizational roles including security research, incident response, product security, DevOps engineer, system administrator, tech support, security expert, operations specialist, and project lead. Emily specializes in Unix security, is a co-author of a book on Solaris Security for the SANS Institute, and serves as a Mentor for SANS' CyberTalent Immersion Academy for Women. She currently holds GSEC, GCED, GPPA, GCIH, and ITIL certifications, and is a Certified Scrum Master.
Isolation without Containers
Tyler McMullen, Fastly
Software Fault Isolation, or SFI, is a way of preventing errors or unexpected behavior in one program from affecting others. Sandboxes, processes, containers, and VMs are all forms of SFI. SFI is a deeply important part of not only operating systems, but also browsers, and even server software.
The ways in which SFI can be implemented vary widely. Operating systems take advantage of hardware capabilities, like the MMU (Memory Management Unit). Others, like processes and containers, use facilities provided by the operating system kernel to provide isolation. Some types of sandboxing even use a combination of the compiler and runtime libraries in order to provide safety.
Each of the methods of implementing SFI has advantages and disadvantages, but we don't often think of them as different options toward a similar end goal. When we consider the growing prevalence of things like edge computing and the "Internet of Things", our common patterns start to falter.
In this talk, we'll focus on how sandboxing compilers work. There are important benefits, but also major pitfalls and challenges to making it both safe and fast. We'll talk about machine code generation and optimization, trap handling, memory sandboxing, and how it all integrates into an existing system. This is all based on a real compiler and sandbox, currently in development, that is designed to run many thousands of sandboxes concurrently in server applications.
Tyler McMullen, Fastly
Tyler McMullen is CTO at Fastly, where he’s responsible for the system architecture and leads the company’s technology vision. As part of the founding team, Tyler built the first versions of Fastly’s Instant Purging system, API, and Real-time Analytics. Before Fastly, Tyler worked on text analysis and recommendations at Scribd. A self-described technology curmudgeon, he has experience in everything from web design to kernel development, and loathes all of it. Especially distributed systems.
Training I
Setting Up CI/CD Pipelines with GitLab CI and Jenkins
Aleksey Tsalolikhin, Vertical Sysadmin, Inc.
Attendees will learn how to implement CI/CD pipelines in two popular tools, Jenkins and GitLab CI.
Aleksey Tsalolikhin, Vertical Sysadmin, Inc.
Aleksey Tsalolikhin is a practitioner in operations of information systems. Aleksey strives to improve the lives of fellow practitioners through effective training in excellent technologies. Aleksey is the principal at Vertical Sysadmin, which provides on-site training and consulting in IT Operations and DevOps.
Training II
Containerize Your App Using Docker
Naga Sowjanya Mudunuri, Heroku
Have you tried replicating consistent environments in development, staging, and production? Have you tried to keep operational costs low by efficiently using your computing resources? How about scaling your application when it gets popular? Containerization is one of the core tenets of operating at scale. Containers enable predictable environments for your deployments in dev/staging/production and add to developer efficiency. Containerization lets you run your application on any compute resource, whether it is AWS, GCE, or a datacenter. In this tutorial you are going to learn about the basics of Docker, a popular containerization technology, and get hands-on experience containerizing an application. You will learn how to Dockerize an application written in Go, Ruby, or Node. You do not need to know all of these languages; knowing one of them helps. If you have an existing app in any of these languages you can containerize it in the class, or a sample app will be provided for you to download.
Naga Sowjanya Mudunuri, Heroku
Sowjanya is a Senior Software Engineer at Heroku. She found her love and passion for Software Design and Engineering only 5 years ago. Prior to that she worked as a hardware engineer building SystemVerilog Components at Intel & Xilinx. For the past 5 years she has worked as a Full-Stack & DevOps Engineer building applications and services using Ruby, Rails, Go, Docker, Terraform. Today she is an Engineer on the Runtime team at Heroku orchestrating dynos using Go & Ruby. In her free time she is a volunteer mentor at Women Who Code San Francisco.
3:30 pm–4:00 pm
Break with Refreshments at the Expo
Sponsored by Packet
Legends Ballroom EFG and Music Rows Foyer
4:00 pm–5:30 pm
Talks I
Pass the Torch Without Dropping the Ball: Lessons in Community Management
Rich Bowen, Red Hat
A replacement plan/document is a great community resource, even when you’re not being replaced. A year ago, as the role of OpenStack community manager at Red Hat was moving from one person to another, we started thinking about what needs to be in place to effectively transition a role. More generally, we started thinking about planning, and documenting, for your eventual replacement. We’ll talk about what worked, what didn’t, and what had unexpected benefits for the larger community.
Containers and Security on Planet X
Michael Jennings, Los Alamos National Laboratory
Containers and the modern container ecosystem have thoroughly revolutionized the development, deployment, and delivery of web applications and microservices, but in many ways, the world of high-performance computing (HPC)—traditionally the innovator and trailblazer of the scalable system space—has found itself playing catch-up instead of leading the pack as it once did. No longer the target audience for the community of solution developers, the architects and innovators in HPC initially sought ways to tweak, tailor, configure, or even fork the industry-standard container platform (Docker) to address the unique needs of HPC and the scientific community. Faced with the reality that they would have to implement something themselves, the community produced multiple solutions that dealt with many of the same challenges in fundamentally different ways and with different implications for performance, usability, reliability, scalability...and of course, security!
For system administrators in the unique world of HPC—whether in academia, government, or the technical computing industry—the current view of the container landscape can be incredibly confusing. We'll sift through the hype, dispel some misconceptions, explore the unique use case of containers in HPC, and examine the modern runtime options for high-performance containerized applications in terms of their primary differentiators—their approaches to image management, runtime execution, and security. We'll also look at future developments going on both inside and outside the HPC container arena and discuss how the landscape will be changing over the next few years. Finally, we will provide some simple, straightforward guidance to help sysadmins new to this space understand the options and make the best possible choice for addressing their own HPC container needs, ensuring their users' success and their own peace of mind!
Michael Jennings, Los Alamos National Laboratory
Michael Jennings has been a UNIX/Linux Systems Administrator and a C/Perl developer for over 20 years and has been author of or contributor to numerous open source software projects including Eterm, Mezzanine, RPM, Warewulf, and TORQUE. Additionally, he co-founded the Caos Foundation, creators of CentOS, and has been lead developer on 3 separate Linux distributions. He currently works as a Scientist at Los Alamos National Laboratory and is the primary author/maintainer for the LBNL Node Health Check (NHC) project. He is also the Vice President of HPCXXL, the extreme-scale HPC users group.
Getting the Most Out of Your Mesos
David Morrison, Yelp
Apache Mesos is a powerful tool for running workloads across a distributed cluster of machines. It has powered Yelp’s production infrastructure since 2014 and runs an increasing variety of workloads, from traditional stateless services to batch and machine learning workloads. In this talk we describe how we use data and analytics to push Mesos to the limits, while providing a better experience for our users and our developers.
David Morrison, Yelp
David R. Morrison is a software engineer working in scheduling and optimization on the Distributed Systems team at Yelp, where he has developed autoscaling code for Yelp’s most expensive compute clusters. Previously, David worked in research and development at Inverse Limit, where he received federal funding from DARPA and Google’s ATAP program. David received his PhD in computer science from the University of Illinois, Urbana-Champaign under the supervision of Dr. Sheldon Jacobson. David has spoken at the INFORMS Business Analytics conference in 2017, at AWS re:Invent 2016, and given multiple presentations at the INFORMS Annual Meetings and other venues.
Talks II
SRE (and DevOps) at a Startup
Craig Sebenik, Split
The Google "SRE book" gives a great explanation of what SRE does. That model works great for LinkedIn or Facebook. But, what happens when you loose some of those economies of scale? How do you implement those paradigms when the company (or team within a larger corporation) is only a couple of dozen people?
In this talk, I will discuss two different approaches: a centralized team that supports development versus a more distributed team that includes the developers themselves.
I have witnessed both approaches at different startups. I will finish up with the lessons learned and the pitfalls of implementing either approach at another company.
Craig Sebenik, Split
Currently an Infrastructure Engineer at Split. I have been at large companies and small ones. I have seen amazing growth (at LinkedIn and NetApp) and a couple of companies completely fall apart. I am passionate about how SRE (and DevOps) can change how online software is developed and managed.
Cross Atlantic: Scaling Instagram Infrastructure from US to Europe
Sherry Xiao, Facebook
Deploying a service across multiple continents is difficult, especially when you have a stateful service. Facebook now has multiple datacenters across the US and Europe, while Instagram's infrastructure still remains only in the US. How can we scale Instagram across the ocean? What are the problems we need to solve?
One of the databases Instagram uses heavily is Cassandra. Running Cassandra with too many copies increases the complexity of maintaining the database, not to mention that having quorum requests travel across the ocean is just... slow. So, we partitioned our dataset! The idea is to have a European Cassandra partition and a US partition, and to send users to their nearest partition.
When we started to put together the plan for deploying Instagram in current European datacenters, we encountered several problems. How do we make sure users have all the data they need stored in the same partition? When one of the European datacenters fails, how do we failover and where do we send that traffic?
This talk will cover:
- The challenges we had during the infrastructure design and disaster recovery planning
- How we use social hash to make sure all the data belonging to one user stays in the same partition as much as possible, and how it helps improve the cache miss rate
- The failover plan when one European datacenter fails, including how we shift the traffic around
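As a deliberately simplified, editorial sketch of the partitioning idea described above (Instagram's real social-hash placement is far more sophisticated), the Python below assigns each user a home partition and falls back to a stable hash; the partition names, assignments, and failover rule are hypothetical.

```python
# Simplified illustration of routing users to a "home" partition
# (not Instagram's actual social-hash implementation).
import hashlib

PARTITIONS = ["us", "eu"]                   # hypothetical partition names
ASSIGNMENTS = {"alice": "eu", "bob": "us"}  # precomputed placements, e.g. from a social-hash job

def home_partition(user_id: str) -> str:
    # Prefer an explicit assignment that keeps a user (and the accounts they
    # interact with most) together; otherwise fall back to a stable hash.
    if user_id in ASSIGNMENTS:
        return ASSIGNMENTS[user_id]
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return PARTITIONS[int(digest, 16) % len(PARTITIONS)]

def failover(partition: str) -> str:
    # If a partition's datacenters are unavailable, send traffic elsewhere.
    remaining = [p for p in PARTITIONS if p != partition]
    return remaining[0]

print(home_partition("alice"))   # eu
print(home_partition("carol"))   # hash-based fallback
print(failover("eu"))            # us
```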
Sherry Xiao, Facebook
I'm a Production Engineer working on scaling Instagram infrastructure. My team supports all engineering teams at Instagram, and gets involved with a large number of areas like rapidly scaling infrastructure, capacity planning, designing and practicing disaster recovery plans for Instagram.
Delete This: Decommissioning Servers at Scale
Anirudh Ra, Facebook
Facebook's datacenter footprint has increased significantly; we now have 12 locations across the USA and Europe. As these new locations come online, we have had to plan for the end-of-life process: decommissioning server racks and replacing them in a timely and streamlined manner. Until recently, decommissioning a cluster entailed a lot of manual work: service oncalls were ticketed by project managers and then migrated their services off the old hardware onto new hardware, after which the old hardware was unplugged and rolled out.
We realized the need for automation that covered all of this. We started with a framework that allows for automated service migration, given a list of retiring machines and a list of replacements. We moved on to an automated process that looks at a decommission schedule and kicks off jobs to drain server clusters on time so that old racks can be taken away and new racks rolled into their place.
With this automated process in place, we have learned lessons and figured out how to minimize the time that old servers spend without services running on them before being rolled out of the datacenter. We are also exploring ways to reuse parts of this framework in other ways to increase efficiency.
Anirudh Ra, Facebook
Customer support tech turned production engineer, Anirudh tries to remember that his job is still about helping people succeed. He builds frameworks for service owners to run their services with minimal bother and enjoys baking bread and reading fiction and histories and fictional histories.
Training I
What I Wish I Knew before Going On-call
Chie Shu, Dorothy Jung, and Wenting Wang, Yelp
Being a software engineer means owning a production system—you have many users and your company's revenue relying on your products. Firefighting a broken system is a time-sensitive and stressful part of life as an engineer. New engineers entering an on-call rotation may be overwhelmed by this responsibility. How should we act in an emergency? How can we make a system emergency-friendly? In this workshop, we will share how new and future on-call engineers can be successful by guiding participants through exercises to triage real-life engineering emergency scenarios. We will also cover how on-call engineers can share learnings within an organization to prevent future fires.
Chie Shu, Yelp
Chie Shu is a backend Software Engineer at Yelp. She has worked on improving the revenue-critical Ads data pipeline to be more resilient to system failures and designed heuristics used by executives and Product Managers to assess the financial impact. She is a leader for Yelp’s Awesome Women in Engineering support group, and has organized events to foster an inclusive community for incoming women engineers and allies. Chie holds a Bachelor’s degree in Computational Biology from Cornell University.
Dorothy Jung, Software Engineer, Yelp
Dorothy Jung is a Software Engineer with multiple years of on-call experience. At Yelp she has served as a “pushmaster”, managing and monitoring company-wide deployments to production; and as a release engineering deputy, helping to set up CI/CD pipelines within the Ads organization. She was previously at DreamWorks Animation R&D, where she worked on upgrading the studio’s build management tools. Dorothy holds a bachelor’s degree in Computer Science and French from the University of California, Berkeley.
Wenting Wang, Software Engineer, Yelp
Wenting Wang is a Software Engineer with three years of industry experience. She has been on-call for different teams at Yelp: on the BizApp backend team, where she worked closely with mobile developers and monitored mobile user traffic; and on the Ads team, where she currently develops and maintains revenue-critical real-time processing systems. Wenting received her master’s degree in Computer Science from Shanghai Jiao Tong University and was previously a doctoral candidate in Computer Science focusing on distributed systems at the University of Illinois at Urbana-Champaign.
Training II
30 Years of Making Lives Easier: Perl for System Administrators
Ruth Holloway, cPanel, Inc.
A lot of system administration tasks can be made easier with some “glue” code—and Perl is an excellent glue language, useful for many things. As its creator Larry Wall says, “when is the last time you used duct tape on a duct?”
In this session, attendees will learn a little about Perl, with a focus on its use in system administration. We’ll explore Perl’s mature infrastructure, step through the basics of creating Perl programs, and discuss use cases for system administration tools in Perl.
Ruth Holloway, cPanel, Inc.
Ruth Holloway got into system administration early in her 30-year career, and has spent at least part of her time ever since working to make it easier for herself, and for the people who follow her in the admin's seat. She has been a Perl developer for over 15 years, and currently works for cPanel in Houston, TX.
Ruth is a wife, mother, grandmother, writer, activist, technologist, and mommy to the cutest dog you'll ever meet.
6:00 pm–7:00 pm
Expo Mixer
Sponsored by Apple
Legends Ballroom EFG
Join us at the LISA18 Expo for refreshments, and take the opportunity to learn about the latest products and technologies. Don’t forget to get your Expo passport stamped!
7:00 pm–11:00 pm
Birds-of-a-Feather Sessions (BoFs)
View the full schedule of BoFs on the LISA18 BoFs page.
Tuesday, October 30, 2018
8:00 am–5:00 pm
On-Site Registration and Badge Pickup
5th Ave Prefunction
8:00 am–9:00 am
Continental Breakfast
Sponsored by Microsoft Azure
Legends Prefunction
9:00 am–10:30 am
Keynote Addresses
Anatomy of a Crime: Secure DevOps or Darknet Early Breach Detection
Dr. Sarah Lewis Cortes, Salesforce
Criminals lurk on the darknet, where child abuse forums have exploded in recent years alongside other crime: intellectual property theft, narcotics, and carding, to name a few. Roman Seleznev, convicted in 2017 and now serving 27 years in Seattle, brought darknet carding data breaches to a new level. What role does the darknet play in data breaches? What can operations and infrastructure developers consider when building their systems to help prevent breaches? We review a real-life darknet breach at a major retailer, whose 2017 settlement for the first time requires darknet monitoring. We demonstrate how assumptions that hackers could never reach the RAM of Point of Sale (PoS) devices facilitated some of the largest breaches in history. Credit card information stored briefly in plain-text memory buffers in PoS systems is just one risk that may not occur in development, or may be dismissed as far-fetched. We present considerations for operations and infrastructure developers when building systems.
Sarah Cortes, Salesforce
Dr. Sarah Lewis Cortes earned her undergraduate degree at Harvard University, studied Forensic Sciences at Boston University Medical School, and holds a PhD in Computer Science, Cybersecurity from Northeastern University, specializing in the Darknet, Privacy and Privacy Law as well as IT Security, topics on which she has published extensively. She conducts training and research with the FBI, the Alameda County Sheriff’s Office Digital Forensics Crime Lab, and other LEAs. Prior to undertaking her PhD, Sarah was SVP for Security, IT Audit and Disaster Recovery at Putnam Investments, an investment management firm with over $1 trillion in assets under management.
Do the Right Thing: Building Software in an Age of Social Responsibility
Jeffrey Snover, Microsoft
In an increasingly software-driven world, our role is more critical than ever. We are literally forging the fabric of the future and we need to stop and ask ourselves what kind of world do we want to live in? Our code impacts real people in real ways. Ethical concerns, social responsibility, and inclusive design are important to the work we do, and addressing them well takes time and resources. Join Jeffrey Snover, Technical Fellow at Microsoft, to learn how modernizing development processes can free you up to focus on building great software, minimizing technical debt, protecting sensitive data, and empowering your users.
Jeffrey Snover, Microsoft
Jeffrey Snover is a Technical Fellow at Microsoft and the Chief Architect for Azure Storage and Edge Cloud Group where he focuses on Azure Stack. Snover is the inventor of Windows PowerShell, an object-based distributed automation engine, scripting language, and command line shell. He was also the Chief Architect for Windows Server. Snover joined Microsoft in 1999 as divisional architect for the Management and Services Division, providing technical direction across Microsoft's management technologies and products.
Training I
Developing Applications with Kubernetes
Sean Dague and Will Plusnick, IBM
Microservices have revolutionized the way we look at app development and are now one of the most popular programming architectures. Now, Docker alongside Kubernetes is changing the way teams look at deployments of these microservices. Kubernetes provides powerful, production-grade orchestration for your "Dockerized" microservices.
In this workshop, you'll get an overview of Kubernetes, and what it provides for application development. You'll then go through the process of building and deploying a microservice application on Kubernetes.
This is a hands-on-keyboard lab; everyone should come with a laptop and a desire to learn. Attendees can use minikube locally, or cloud accounts will be provided. We'll cover:
- Kubernetes basics
- Building Container Images
- Deploying the application with Kubernetes
- Upgrading and scaling the application with Kubernetes
- Debugging your application in Kubernetes
Sean Dague, IBM
Sean Dague has been an Open Source developer for most of his professional life. He's worked on numerous Open Source projects over the years including SystemImager, OpenHPI, Xen, OpenSim, NFS Ganesha, and OpenStack.
In 2003 he created HV Open, which runs a monthly lecture series on open technology topics. It serves the Hudson Valley of New York state.
He is currently a Developer Advocate at IBM, focusing on the areas of Cloud, Containers, Infrastructure, and Open Source.
Training II
Running Excellent Retrospectives: What Happened?
Courtney Eckhardt, Heroku, a Salesforce company
Your site’s back up, you’re back in business. Do you have a way to make sure that problem doesn’t happen again? And if you do, do you like how it works?
Heroku uses a blameless retrospective process to understand and learn from our operational incidents. This tutorial will share the process we use and give you a chance to practice analyzing operational problems using the internal and external communications of a real Heroku operational incident. Along the way, we’ll discuss how Heroku developed this process, what issues we were trying to solve, and how we’re still iterating on it.
Courtney Eckhardt, Heroku, a Salesforce company
Courtney Eckhardt first got into retrospectives when she signed up for comp.risks as an undergrad (and since then, not as much has changed as we’d like to think). Her perspectives on engineering process improvement are strongly informed by the work of Kathy Sierra and Don Norman (among others).
10:30 am–11:00 am
Break with Refreshments at the Expo
Legends Ballroom EFG and Music Rows Foyer
11:00 am–12:30 pm
Talks I
The History of Logging @ Facebook (Abridged)
KC Braunschweig, Facebook
Ten years ago Facebook released the first version of Scribe, which has formed the basis of our logging infrastructure ever since. Back then, logging was mostly about aggregating files full of text, but since then “logs” have evolved to include abstract streams of events and have blurred the line with messaging and pub/sub systems. Scribe continues to aggregate massive amounts of log data but now also forms the basis for many real-time stream processing applications at Facebook. Having a common transport layer allows us to apply new stream processing and data science tools to business or operational data with equal ease.
I'll look back at some of the early design choices that have allowed Scribe to evolve with our needs. This evolution has required us to replace major components and scale up the infrastructure by several orders of magnitude. I'll walk through the steps in this evolution with particular emphasis on our most recent upgrade: replacing the HDFS-based persistence layer with LogDevice and how we accomplished that upgrade with little user impact.
KC Braunschweig, Facebook
KC Braunschweig has spent the last 6 years as a production engineer working on Facebook infrastructure. He's currently working on distributed coordination services built on Apache Zookeeper. He's previously worked in areas including logging, stream processing, search infrastructure and configuration management.
Container Security
Daniel Walsh, Red Hat
This talk examines all of the technologies used to keep containers separate. We will examine what needs to be considered when containerization is happening, including whether to run your apps in containers, VMs, or on bare metal. It will examine technologies like Linux Capabilities, SECCOMP, SELinux, Device Cgroups, and read-only containers.
I will talk about new container technologies like KATA Containers, which use KVM for separation, and advances in the User Namespace.
Daniel Walsh, Red Hat
Consulting engineer at Red Hat leading the Container Engineering team, which covers CRI-O, Buildah, podman, containers/storage, and containers/image. Docker/Moby project contributor. Led the SELinux project, concentrating on the application space and policy development.
Talks II
How Houghton Mifflin Harcourt Went from Months to Minutes with Infrastructure Delivery
Robert Allen
By leveraging Apache Mesos and Apache Aurora, HMH was able to deliver its applications while cutting annual costs by six figures and begin the process of empowering engineers to focus on innovative products and not the infrastructure. Embracing common open source practices, agile methods, and open source technologies, HMH began a journey of transforming how innovation occurs in its engineering organizations. This conversation focuses attention on that journey and provides candid examples of the mistakes and successes along the way, as well as practical insights into changing the hearts and minds of engineers, product owners, and management, ultimately leading to a more productive, creative, agile, and happier organization.
The odyssey from "Lift and Shift" operations to a Apache Mesos centric self-service platform that establishes the technology foundations which enables engineers to go from idea to delivery in a matter of minutes has lead to a remarkable evolution in ideas and customer satisfaction. This technology exists for everyone, there is no "secret sauce". The challenges that are often less known is the "people problem" which merits understanding, preparation and sharing at all levels of an organization.
Robert Allen, Houghton Mifflin Harcourt
An automotive technician, cabinet maker, cellist, bassist, cook, software engineer, systems admin and, most recently, Director of Engineering for platform strategy at the global education publisher Houghton Mifflin Harcourt. Robert leads the team at HMHCo known as the Bedrock Technical Services group. His team is responsible for designing, developing and supporting the systems that serve as the current and future of educational services, a unique platform that allows engineers to focus on the things that matter most to students and educators worldwide.
Training I
Up and Running Serverless: A Practical Development Workflow for AWS Serverless
Tom McLaughlin, ServerlessOps
This workshop will walk participants through a practical development workflow for AWS serverless systems. It introduces participants to both serverless architecture and the tools to get up and running. The workshop starts by introducing the audience to a basic monolithic 3-tier serverless web application to familiarize them with the basics of serverless architecture. It proceeds to decompose the application into microservices and introduces how the Serverless Framework is used to create serverless services. Finally, participants will be tasked with adding a rudimentary feature to the application to apply what was learned in the first two modules.
Participants should leave the workshop with an introductory understanding of serverless and the knowledge to continue exploring on their own.
Tom McLaughlin, ServerlessOps
Tom is the founder of ServerlessOps and an experienced operations engineer. He started ServerlessOps after he asked the question: what would he do if servers went away? At a loss for an answer and interested in the future of his profession, he decided to pursue the answer. Tom is actively engaged in promoting serverless infrastructure and engaging with the community to learn more about their thoughts, wants, and concerns around the topic.
Training II
Running Excellent Retrospectives: Talking for Humans
Courtney Eckhardt, Heroku, a Salesforce company
How many awful meetings have you been to in your life, where people are talking forever and saying nothing, or where they're saying things that make everyone feel bad? Have you been in retrospectives like that? (Did it make you never want to attend a retrospective again?)
Let's do better! Come learn practical techniques for facilitating pleasant, productive, welcoming meetings. We will talk about the structure of welcoming language and discuss when it's necessary to interrupt someone. We'll examine what it means for language to include blame and how to reframe blaming conversations. We'll practice the mental work of understanding things that seem counterfactual but are actually just confusing. When you leave, you'll be ready to make any meeting or retrospective you're in more comfortable and effective, as a leader or an attendee.
Courtney Eckhardt, Heroku, a Salesforce company
Courtney Eckhardt first got into retrospectives when she signed up for comp.risks as an undergrad (and since then, not as much has changed as we’d like to think). Her perspectives on engineering process improvement are strongly informed by the work of Kathy Sierra and Don Norman (among others).
2:00 pm–3:30 pm
Talks I
Kafka on Kubernetes: From Evaluation to Production at Intuit
Shrinand Javadekar, Intuit, Inc.
Kubernetes is fast becoming the platform of choice for running distributed, containerized applications in the cloud. It has great features for availability, scalability, monitoring, ease of deployment, a rich set of tools and an extremely fast-growing ecosystem that is making it ever more useful.
However, running stateful applications such as Kafka on Kubernetes is not a common practice today. At Intuit, we took an experimentation- and data-driven approach to evaluating Kafka on Kubernetes in AWS. In this talk, we provide details of our requirements, the experimental setup, and the tests themselves. Tests included basic functional tests for producing/consuming messages, network isolation tests, cross-region produce/consume, and performance and scale tests. We’ll also talk about the various tools used during this testing. We focus on the problems we ran into and how we addressed them.
This talk will demonstrate a Kubernetes cluster running Kafka, along with the details of how each component is configured: specifically, the Kafka and ZooKeeper StatefulSets, the ConfigMaps used for storing the server.properties used by all brokers, the Service objects for enabling access to the brokers, and, last but not least, integration with Splunk and Wavefront for logging and monitoring, respectively.
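As a small, hedged companion to the component list above (not part of the talk), here is how one might inspect those kinds of objects with the official Kubernetes Python client; the kafka namespace and the assumption that the cluster is reachable via a local kubeconfig are illustrative.

```python
# Inspect Kafka-related Kubernetes objects with the Python client
# (illustrative; the "kafka" namespace is an assumption).
from kubernetes import client, config

config.load_kube_config()   # or config.load_incluster_config() inside a pod
apps = client.AppsV1Api()
core = client.CoreV1Api()

NAMESPACE = "kafka"          # hypothetical

for sts in apps.list_namespaced_stateful_set(NAMESPACE).items:
    print("statefulset:", sts.metadata.name, "replicas:", sts.spec.replicas)

for cm in core.list_namespaced_config_map(NAMESPACE).items:
    # e.g. a ConfigMap holding the server.properties shared by all brokers
    print("configmap:", cm.metadata.name, "keys:", list((cm.data or {}).keys()))

for svc in core.list_namespaced_service(NAMESPACE).items:
    print("service:", svc.metadata.name, "type:", svc.spec.type)
```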
Shrinand Javadekar, Intuit, Inc.
Shrinand Javadekar is a "Kubernetes junkie" on the Platform team at Intuit. His primary focus is to make simple, scalable, and reliable Kubernetes clusters the de facto platform for developing, deploying, and running apps. In the past he has been part of large-scale file system and virtualization projects at EMC and VMware. However, his most fun stints have been working on cloud-native platforms and services at startups such as Maginatics and Applatix!
Overcoming the Challenges of Centralizing Container and Kubernetes Operations
Oleg Chunikhin, Kublr
Containers and Kubernetes are becoming the de-facto standard for software distribution, management, and operations. Development teams see and realize the power of these technologies to improve efficiencies, save time, and enable focus on the unique business requirements of each project, but the process of deploying, running, managing, and handling upgrades for Kubernetes is time-consuming and requires significant in-house expertise. InfoSec, infrastructure, and software operations teams, for example, face myriad challenges when managing a new set of tools and technologies as they integrate them into an existing enterprise infrastructure.
In this session, Oleg Chunikhin, CTO at Kublr, will provide an overview of the general architecture of a centralized Kubernetes operations layer based on open-source components such as Prometheus, Grafana, the ELK Stack, Keycloak, etc. From there, he will outline the unique challenges to consider for each of the following critical elements before embarking on a containerization and Kubernetes strategy:
- Centralized monitoring and log collection; audit
- Identity and access management
- Backup and disaster recovery
- Infrastructure management
By looking at each of these areas, attendees will leave with an understanding of how to ensure successful container and Kubernetes ‘Day 2’ operations.
Oleg Chunikhin, Kublr
Oleg Chunikhin is CTO of Kublr. With nearly 20 years of software architecture and development experience, Oleg is responsible for defining the company’s technology strategy and innovative standards, while forging growth plans. To this end, Oleg has championed the standardization of DevOps in all the company does. He is also committed to driving the adoption of automation and AI applications in innovation-driven industries such as genetics, 3D printing, robotics, neural interfaces, mesh technologies, etc. A popular speaker on all things Kubernetes, Oleg’s speaking credentials include Interop ITX (Las Vegas), Cloud Computing Expo (NYC), and the 21st Cloud Expo (Silicon Valley).
Automating Multi-service Deployments on Kubernetes
Michael Hrivnak, Red Hat, Inc.
Many applications consist of multiple services, such as a database, API service, and frontend. Provisioning them as a single application in Kubernetes can be a challenge, especially if one or more services runs outside your cluster.
The Service Catalog provides a new way to publish, provision, and manage applications on Kubernetes through the use of Service Brokers. The Automation Broker allows users to leverage Ansible Automation to define and orchestrate simple to complex multi-service deployments.
In this session you will learn:
- How to provision a multi-service application on Kubernetes using the Automation Broker.
- How to include external service provisioning in your application’s deployment.
- How to package Ansible Playbooks into a single meta-container for orchestrating the deployment of your application.
- How to publish your own applications in the Kubernetes Service Catalog.
Michael Hrivnak, Red Hat, Inc.
Michael Hrivnak is a Principal Software Engineer at Red Hat. During his time as Team Lead for the Pulp project, he became involved in solving real-world container orchestration problems. He now works in that domain as part of the Automation Broker project. With experience in both software and systems engineering, he is excited to be writing software for systems engineers. Michael is passionate about open source software, live music, general aviation, and reducing energy consumption.
Talks II
Debugging Linux Issues with eBPF
Ivan Babrou, Cloudflare
This is a technical dive into how we used eBPF to solve real-world issues uncovered during an innocent OS upgrade. We'll see how we debugged a 10x CPU increase in Kafka after a Debian upgrade and what lessons we learned. We'll go from high-level effects like increased CPU, to flamegraphs showing us where the problem lies, to tracing timers and function calls in the Linux kernel.
The focus is on tools that operational engineers can use to debug performance issues in production. This particular issue happened at Cloudflare on a Kafka cluster doing 100Gbps of ingress and many multiples of that in egress.
This is also an introductory talk to a training on ebpf_exporter by Alexander Huyhn.
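For readers new to the tooling, here is a classic "hello world" of the approach using the BCC Python bindings: attach a kprobe to a kernel function and stream the trace output. It is an editorial example, not the Cloudflare investigation; it requires root and kernel headers, and the traced function is arbitrary.

```python
# Minimal BCC example of kernel function tracing with eBPF (illustrative;
# not the Cloudflare/Kafka investigation itself). Run as root.
from bcc import BPF

# Attach a kprobe to tcp_v4_connect and emit a line for every outbound
# IPv4 TCP connection attempt.
program = r"""
int kprobe__tcp_v4_connect(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    bpf_trace_printk("tcp_v4_connect by pid %d\n", pid);
    return 0;
}
"""

b = BPF(text=program)
print("Tracing tcp_v4_connect... Ctrl-C to stop")
b.trace_print()   # stream lines from the kernel trace pipe
```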
Ivan Babrou, Cloudflare
Ivan is a Performance Engineer at Cloudflare. He spends his days finding performance bottlenecks, fixing them, and making sure a large chunk of the internet runs as fast and as efficiently as possible.
Debugging and Optimizing User Experience with Deep Visual Analytics
Sarvesh Nagpal and Toby Walker, Microsoft
Go beyond standard analytics and dive into a world where you can see your product from your user’s perspective. In the wild west of devices and browsers, what you think you are serving and what your users see can be completely different—devices, plugins, proxies, and networks can all change and degrade the quality of your experience. And these problems are expensive. In one case at Microsoft, a user experience issue caused millions of dollars in lost revenue.
In this talk we'll share how to close the gap between what the user experience you think you are providing and what your actual users experience. By combining rich client and server telemetry, integrating it with big data visualization and ML-based detection, we show how you detect user experience issues and understand their impact on your product. We’ll share case studies of how we apply this to Microsoft’s large scale product in solving real-world problems and identifying opportunities that drive business growth.
Come join us to learn more and see how this can help solve your own product issues.
Sarvesh Nagpal, Microsoft
Sarvesh leads the performance team at Bing, Microsoft, and is passionate about solving complex data problems with rich visualizations. Sarvesh holds an M.S. in Computer Science from Columbia University, NY.
What Breaks Our Systems: A Taxonomy of Black Swans
Laura Nolan
Black swan events: unforeseen, unanticipated, and catastrophic issues. These are the incidents that take our systems down, hard, and keep them down for a long time.
By definition, you cannot predict true black swans. But black swans often fall into categories that we’ve seen before.
This talk examines those categories, and how we can harden our systems against these categories of events, which include unforeseen hard capacity limits, cascading failures, queries of death, hidden system dependencies, various forms of deadlock, and more.
Laura Nolan, N/A
Laura Nolan’s background is in Site Reliability Engineering, software engineering, distributed systems and computer science. She wrote the 'Managing Critical State' chapter in the O'Reilly ‘Site Reliability Engineering’ book, and is co-chair of SREcon18 Europe/Middle East/Africa. Laura is currently enjoying a well-earned sabbatical (and tinkering with some of her own projects) after 15 years in industry, most recently at Google.
Training I
Deploying a Load-Balanced Python Web Application with Amazon Web Services
Nicholas Hunt-Walker, Starbucks Corporation
For many developers, the experience of deploying an application consists of finding a server to host the app, copying it over, and serving it using a system like Nginx or Apache.
And that's ok!
It'll work for small applications that aren't meant to handle large loads.
However, if your product is growing or you're adding a new service to an existing, mature application, chances are you're going to need to learn how to scale your web application.
You can do it manually, of course, but Amazon Web Services provides great tools for managing and scaling your applications automatically.
In this tutorial we'll walk through how to set up those services in the AWS environment, enabling your web application to stay reliable whether it is big or small.
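As a taste of the kind of automation involved, here is a minimal, hypothetical sketch using boto3. It assumes an existing Auto Scaling group named "web-asg" behind a load balancer; the group name, size limits, and CPU target are placeholders, not the tutorial's actual configuration.

```python
# Illustrative sketch only: "web-asg", the size limits, and the CPU target
# are placeholders, not the tutorial's actual configuration.
import boto3

autoscaling = boto3.client("autoscaling")

# Let AWS grow and shrink the fleet instead of resizing it by hand.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="web-asg",
    MinSize=2,
    MaxSize=10,
)

# Target-tracking policy: scale out when average CPU rises above 50%, scale in below it.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-asg",
    PolicyName="keep-cpu-near-50",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
    },
)
```

With a policy like this in place, the group adds and removes instances to hold average CPU near the target, so the application keeps up with load without manual resizing.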
Nicholas Hunt-Walker, Starbucks Corporation
After 5.5 years working toward a Ph.D. in astrophysics, I left the field to move into teaching software development at a Seattle-area coding school. Since that move, I've been able to have a hand in training over 100 new software developers. I now work as an application developer at Starbucks on the Emerging Technologies team, helping foster innovation and produce proofs of concept for Starbucks Technology. I'm fiercely interested in good mentorship both as a mentee and a mentor of others, and am always looking for ways to grow personally and professionally.
Training II
A Hacker’s View of Your Network—Analyzing Your Network with Nmap
Joe Schottman
Nmap is a common open-source network mapping and exploration tool used by attackers to enumerate your servers and services. It also features a Lua-based script engine that can be used to rapidly detect vulnerabilities across your network, helping you stay one step ahead of threats. Learn the basics of what your network looks like to attackers and how to use their tool to monitor your perimeter, automate vulnerability scanning, and help keep your systems secure.
Joe Schottman, N/A
Joe is a former web developer and system administrator now working in security but whose job duties still manage to end up being "other duties as required." His professional experience includes online video in higher education, high volume news websites, and financial industries. He is focused on testing and working smarter, increasing collaboration between offensive and defensive staff, and helping people understand the underlying concepts rather than relying on ineffable processes and procedures.
3:30 pm–4:00 pm
Break with Refreshments
Sponsored by Facebook
Legends Prefunction
4:00 pm–5:30 pm
Talks I
MySQL Infrastructure Testing Automation at GitHub
Jonah Berquist and Gillian Gunson, GitHub
The database team at GitHub is tasked with keeping the data available and with maintaining its integrity. Our infrastructure automates away much of our operation, but automation requires trust, and trust is gained by testing. This session highlights three examples of infrastructure testing automation that helps us sleep better at night:
- Backups: auto-restores and backup validation. Making backup data accessible to our engineers. What metrics and alerts we have in place.
- Failovers: how we continuously test our failover mechanism, orchestrator. How we set up a failover scenario, what defines a successful failover, how we automate away the cleanup. What we do in production.
- Schema migrations: how we ensure that gh-ost, our schema migration tool, which keeps rewriting our (and your!) data, does the right thing. How we test new branches in production without putting production data at risk.
Jonah Berquist, GitHub
Jonah is the Engineering Manager of the Database Infrastructure team at GitHub. He previously worked as a Senior DBA at Twitter and he began his career with databases as a remote DBA for a variety of customers at Blue Gecko. He enjoys looking at graphs and writing scripts to do his job for him.
Gillian Gunson, GitHub
Gillian has been a database infrastructure engineer at GitHub for two years. Her focus has been on performance troubleshooting and observability. Previous employers include Okta, PalominoDB, Oracle, and Disney. Gillian is based in Vancouver, BC, Canada.
Change Management for Humans
Tiffany Longworth, Zapproved
Unfortunately, rolling out changes for humans is not as easy as merging a pull request. How many times have you seen a new project management tool get rolled out, sometimes with much fanfare and polish, and just not get adopted? Have you ever seen your company announce a new business unit which led to a minor revolt? How scary is the word “reorganization”? If you have ideas on how your group’s processes could be better, want to launch a new tool that will work better as your company grows, or have to adjust the way you do things to meet new regulations, learning the basics of change management will help you to get your plans going, launch them effectively, and ensure they stick around.
I’ve helped implement several tools, projects, and process changes over the years. In this talk, I’ll walk you through the basics of organizational change management with specific examples of:
- Why it’s so hard
- Points to consider while implementing
- Tactics to increase adoption
- Keeping the change alive once launched
- Pitfalls to avoid
Tiffany Longworth, Zapproved
Tiffany Longworth is a Site Reliability Engineer at Puppet. She has launched successful projects large and small, but has also worked on projects that were spectacular failures! She likes using her background as a Marine, her training as an English teacher, automation, and cat gifs as much as possible.
Operations Reform: Tom Sawyer-ing Your Way to Operational Excellence
Thomas Limoncelli, Stack Overflow, Inc.
How do you reform your operations organization without demotivating the entire team? They’re proud of the mess they maintain and may be very resistant to change. How can you remove blockers to a devops transformation in operations?
In this talk, Tom will show how the operations of 150 services were reformed without blame or hurt feelings. Like Tom Sawyer convincing others to paint a fence, the team was self-motivated to fix broken processes and reform the system themselves.
Rather than a detailed 500-question audit, the teams were asked to complete a 4-question “gut check” for each subservice. Because the process took minutes instead of weeks, the team could easily produce the result. Because it was a self-evaluation, they were motivated to transform and improve their operational practices. No one felt shamed because it was a self-assessment.
No one felt blamed because services were criticized, not people. The technique also made it easy to evaluate new projects against the “theory of constraints.” Shared solutions were easier to identify. The system helped drive training and coaching plans, made the business impact of tech debt visible, and created a data-driven management system where KPIs had failed.
Soon the program was expanded to include more gut checks and more services. This eased the organization into more formal assessments, and provided management roll-ups by service, team and organization.
Lastly, Tom will show how to apply these techniques in any organization.
Thomas Limoncelli, Stack Overflow, Inc.
Tom is the SRE Manager at StackOverflow.com and internationally recognized co-author of books such as the newly updated Volume 1: The Practice of System and Network Administration, 3rd Edition. He is also known for Time Management for Sysadmins and Volume 2: The Practice of Cloud System Administration. He is an internationally recognized author, speaker, system administrator, and DevOps advocate. He’s previously worked at small and large companies including Google, Bell Labs/Lucent, and AT&T. His blog is http://EverythingSysadmin.com and he tweets @YesThatTom.
Talks II
Managing OS Release Transitions at Netflix Scale
Edward Hunter, Netflix
Netflix runs more than 150,000 instances of Ubuntu inside the Amazon cloud (AWS), supporting hundreds of microservices that serve over 125 million customers worldwide. A small team of engineers is responsible for maintaining and evolving the base OS (BaseAMI) on which every service depends. Over the past year or so we have migrated the majority of the fleet from Ubuntu's Trusty release to Xenial. When Bionic was released, we were ready to start moving services very shortly after the release date.
Our goals with the migration were simple:
- Don't break Netflix
- Minimize developer pain/complexity during the migration
- Be ready for the next release of Ubuntu as soon as practical after its release
Meeting these goals required changes to packaging, tools and processes. This talk will reveal some of what we do to manage the OS and allow Netflix to deploy it quickly to thousands of VMs on a daily basis. It will also look at what it takes to stay up-to-date with patches and other changes in the ecosystem all while supporting our users, both internal and external, 24x7.
Edward Hunter, Netflix
Ed is an engineering leader at Netflix responsible for Performance, OS, and Capacity engineering. Prior to Netflix he spent time at Juniper Networks as a director of OS engineering for Junos. He also spent many years at Sun Microsystems managing part of the Solaris team, working in Sunsoft and Sunlabs. He finished his time at Sun as chief of staff to the CTO.
Solving All the Problems with systemd
Alvaro Leiva Geisse, Instagram
System administrators often have to choose one of two options: on one end, traditional service management starts a service with all privileges and a full view of your system; on the other end, containers give a restrictive, more controlled view of your system. But with a modern kernel and systemd, it is no longer one or the other: you can take the best of both approaches and decide which components to apply to your service.
Do you like how containers package their dependencies, but also like the idea of sharing the host's network the way a traditional service manager does? Do you want to restrict access to the files on your system, as containers do, but still manage your service directly from the host the way traditional service management allows? It turns out that you can have it all.
In this presentation I will show techniques for deploying services on Linux that use and abuse systemd, from spinning up a simple service, to running your service isolated in a systemd container, and everything in between. I'll also show how to combine these features with other traditional techniques, like socket and path activation, service watchdogs, scheduling tasks to be executed later, and handling what happens when a service goes down.
You already have systemd installed on your server... why not take full advantage of its capabilities?
Alvaro Leiva Geisse, Instagram
I love Python, I grew up in a small town in Chile and one weekend, 16 years ago, I had the flu and could not go out. I decided to learn how to code in Python and that was the beginning of the road that would move us all to Northern California so that I could join the Production Engineering team at Instagram. I also like eating and cooking (in that order).
Unikraft: Unikernels Made Easy
Simon Kuenzer, NEC Laboratories Europe GmbH
Unikernels are still terrible to develop and maintain: applications have to be ported manually to non-standard OSes before they can benefit from impressive advantages like superb performance, strong isolation, and a small trusted computing base. Unikernels can be instantiated in tens of milliseconds or less. They are tiny, with memory footprints of a few MBs or even KBs. They can achieve network throughput of 10-40 Gb/s with a single CPU core, and they enable running thousands of concurrent instances.
We are going to present Unikraft, an open source project under the Xen Project and the Linux Foundation. Its high-level goal is to provide an automated tool for building unikernels without the time-consuming, expert work required today. In addition, Unikraft targets multiple "platforms": Xen, KVM, containers, and bare metal. Images are produced automatically for any of these platforms without requiring additional time from users.
Simon Kuenzer, NEC Laboratories Europe GmbH
Simon is a senior systems researcher passionate about virtualization and unikernels. He has worked at NEC Labs for the past six years and has expertise in NFV and fast packet frameworks like Intel DPDK and Netmap. He is the maintainer of Unikraft, a unikernel build framework released as open source and incubated under the Xen Project and the Linux Foundation. Simon is currently pursuing his Ph.D. at the University of Liège, and received his Diploma degree in computer science, with a focus on robotics and operating systems, from the Karlsruhe Institute of Technology (KIT) in Germany.
Training I
De-Obfuscating Talent Acquisition
Lori Barfield, RaiseMe
Managers: Is your team’s workload growing faster than you can hire new talent? Do you agonize over which candidate would be best for your team? Do you have trouble persuading candidates to say yes, even when it looks like a perfect fit? Come to our workshop, we can help.
Executives: Do you have a manager under you who has open headcount requisitions so stale, they have birthdays? Send them to our workshop, we can help.
HR Professionals: Do candidate quizzes and cookie cutter requirements complicate the selection process so much, you’re turning off the real talent? Let us help you design a better process, a process that will resonate with you as well as your engineering managers.
The engineering team expansion process is daunting. When you look closely at how the sausage is made at most companies, it’s clear that engineering managers aren’t usually aware of best practices. The few who are lucky enough to receive strong mentoring can leverage that good advice their whole careers; the rest have to learn by making mistakes.
The goal of this class is to give attendees a distillation of best practices for creating new job positions and filling them. Taught by a long time engineering leader and hiring manager, it has a unique analytical approach that will resonate with engineers. This workshop will be the mentor they always needed.
Lori Barfield, RaiseMe
Lori joined her first Internet startup as a senior UNIX system administrator. When that company went public, she was hooked, and helping smaller firms prevail against well established rivals has been her passion ever since. She is currently COO at VerticalSysadmin and a chair at the SCaLE and ShellCon conferences in Southern California. In 2017 she developed RaiseMe, a unique career development effort for nonprofits, which has successfully helped people obtain their first engineering roles in the information security industry.
Lori is a mother of five and enjoys the occasional escape to the woods with her husband.
Training II
Attacking & Auditing Docker Containers
Madhu Akula, Appsecco
Developers and operations teams (DevOps) have moved towards containers and modern technologies. Attackers are catching up with these technologies and finding security flaws in them. In this workshop, we will look at how we can test for security issues and vulnerabilities in Dockerised environments. Throughout the workshop we will learn how to find security misconfigurations and insecure defaults, and how container escape techniques can be used to gain access to the host operating system or clusters. We will also look at real-world scenarios where attackers compromised containers to gain access to applications, data, and other assets.
By the end of workshop participants will be able to:
- Understand Docker security architecture
- Audit containerised environments
- Perform container escapes to get access to host environments
The participants will get the following:
- A GitBook (PDF, EPUB, MOBI) with the complete workshop content
- Virtual machines to learn & practice
- Other references to learn more about topics covered in the workshop
Madhu Akula, Appsecco
Madhu is a security ninja and published author. His research papers are frequently selected for major security industry conferences, including DEF CON 24 and 26, Black Hat USA 2018, AppSec EU 2018, All Day DevOps (2016, 2017), DevSecCon (London, Singapore, Boston), DevOpsDays India, c0c0n, Serverless Summit, null, etc. His research has identified vulnerabilities in over 200 organisations, including the US Department of Homeland Security, Google, Microsoft, Yahoo, Adobe, LinkedIn, eBay, AT&T, BlackBerry, Cisco, and Barracuda. He is co-author of the book Security Automation with Ansible 2, published by Packt Publishing in December 2017, which is listed as a resource by Red Hat's Ansible project itself.
6:00 pm–8:00 pm
Conference Reception
Broadway Ballroom A–E
Take a break from your laptops, and join us at our annual LISA Reception for dinner, drinks, and the opportunity to connect with other attendees, speakers, and conference organizers.
8:00 pm–11:00 pm
Birds-of-a-Feather Sessions (BoFs)
View the full schedule of BoFs on the LISA18 BoFs page.
Wednesday, October 31, 2018
8:00 am–12:00 pm
On-Site Registration and Badge Pickup
8:00 am–9:00 am
Continental Breakfast
Legends Prefunction
9:00 am–10:30 am
Talks I
How Our Security Requirements Turned Us into Accidental Chaos Engineers
Paul Carleton, Stripe Inc.
This talk will cover a security focused project that evolved into a chaos injection system.
The system is called “Lifespan Management” and it enforces a lifespan on each cloud-hosted VM. After the lifespan expires, the host is terminated and a replacement is brought up. This makes it easier to apply fixes for CVEs (a CVE comes out on day X; we know hosts will age out by day Y) and reduces the value of a compromised machine (“I’ve finally captured a host! It’s being shut down?? No!”).
This seemed simple enough, but the complexity it uncovered made for a fun, year-long adventure in chaos engineering.
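To make the mechanism concrete, here is a minimal sketch of the core loop in Python with boto3. It is not Stripe's implementation; the 14-day lifespan, the running-state filter, and the assumption that an Auto Scaling group brings up replacements are all illustrative.

```python
# Illustrative sketch, not Stripe's system: the 14-day lifespan and the assumption
# that an Auto Scaling group replaces terminated hosts are both made up here.
import datetime
import boto3

MAX_AGE = datetime.timedelta(days=14)

def expired_instances(ec2):
    """Return IDs of running instances whose launch time exceeds the allowed lifespan."""
    now = datetime.datetime.now(datetime.timezone.utc)
    expired = []
    paginator = ec2.get_paginator("describe_instances")
    for page in paginator.paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    ):
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                if now - instance["LaunchTime"] > MAX_AGE:
                    expired.append(instance["InstanceId"])
    return expired

ec2 = boto3.client("ec2")
to_terminate = expired_instances(ec2)
if to_terminate:
    # Replacements are expected to come from the Auto Scaling group.
    ec2.terminate_instances(InstanceIds=to_terminate)
```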
In this talk, I’ll cover the evolution of the system, and some lessons we learned along the way like:
- All termination API calls are not created equal
- Zero failing health checks does not mean a host is healthy
- Answering “Was this the chaos system?” quickly is essential
I’ll also include anecdotes like how it helped with Spectre/Meltdown mitigations, how it mercilessly killed all our Kubernetes workers, and how it locked us out of our QA environment.
Paul Carleton, Stripe Inc.
I am an infrastructure engineer on the Cloud team at Stripe and I want to make systems that fix themselves. Outside of work, I write about tech, hike, bike, and spend too much time thinking about where the moon is in relation to Earth.
Securing a Security Company
Patrick Cable, Threat Stack, Inc.
Security is hard. Organizations and businesses tend to sacrifice security for speed, which often leads to undesirable security outcomes. There's good news though: the systems engineers, administrators, and ops professionals of the world are in a unique spot to make security in their organization better! This is especially true for engineers in smaller organizations and startups, because you don't need to be a Security Person™ to make an organization more secure.
In this talk we'll dig into how a security company thinks about and acts on security internally, and the lessons you can take away from it. What did we start with? Where are our pain points? Where are we going? We'll talk about threat models, the pain of constraints, how you can get into trouble with cryptography, the importance of UX, vendor assessments, and incident response. At the end, you'll have cultural, engineering, and architecture ideas to take back to your organization and implement.
Patrick Cable, Threat Stack, Inc.
Patrick Cable is a Sr. Infrastructure Security Engineer at Threat Stack. He works to ensure the security of the Threat Stack Platform by collaborating with other departments, implementing security tools, and building new technology to make security easier for everyone in the organization. Prior to working at Threat Stack, Patrick was Associate Staff in the Secure and Resilient Systems Group at MIT Lincoln Laboratory where he worked on improving cloud security in research environments.
We Already Have Nice Things, Use Them!
Sabice Arkenvirr
An alarming number of companies are using custom tools to provide functionality readily provided by standard, and frequently open source, tools in the systems administrator's toolbox. From deploying servers to services, and maintenance throughout the life of a deployment, there is much to be gained by using an industry standard, and there are very few instances when something entirely custom is actually required.
This talk will delineate the most painful pitfalls of rolling your own tooling. It will also walk through the rare scenarios where custom solutions are acceptable and the best way to approach the process of writing and maintaining these projects.
Sabice Arkenvirr
Sabice is an automation & operations engineer at Quid and former consultant based out of San Francisco. She has worked remotely and on site with customers across the industry to help improve their processes and create a clear path forward for maintenance of their systems.
Talks II
Five-sigma Network Events (and How to Find Them)
John O'Neil, Edgewise Networks
Networks are complex systems, and too often, despite everyone's best efforts, no one knows everything about what's going on. Most of our knowledge about a network concerns its typical activity. But what about the atypical activity?
There are many reasons to want to find unusual behavior in your network. The biggest reason is that it may be a sign of something new and unexpected, rather than the usual stuff, driving the activity. This doesn't necessarily imply that a network intrusion is underway. There are many other possibilities, both innocuous and dangerous. In any case, though, unusual behavior is probably something you want to know about.
There are a variety of tools related to "anomaly detection" or "outlier detection," and this talk isn't about any of them. Instead, this talk is an introduction to writing your own tools for detecting unusual network events. We'll use Python, with some easily available pip installations, and look at some simple approaches to the problem that answer some interesting questions and scale well.
The code will be made available, but the point is not that the code is a complete solution—the point is, rather, that it's a starting point for creating tools that tell netops folks what they want (and need) to know.
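As a flavor of the kind of home-grown tool the talk describes, here is a minimal sketch in plain Python that scores events by how surprising (rare) they are. The flow tuples, field layout, and threshold below are made up for illustration and are not the speaker's released code.

```python
# Illustrative sketch only: field layout, threshold, and example data are made up.
import math
from collections import Counter

def surprising_events(events, surprise_threshold=10.0):
    """Score each event by -log2(frequency); high scores mean rare/unusual."""
    counts = Counter(events)
    total = sum(counts.values())
    scores = {event: -math.log2(count / total) for event, count in counts.items()}
    return {event: score for event, score in scores.items()
            if score >= surprise_threshold}

# 5,000 routine web connections plus one odd outbound connection on port 4444.
flows = [("10.0.0.1", "10.0.0.9", 443)] * 5000 + [("10.0.0.2", "198.51.100.7", 4444)]
print(surprising_events(flows))  # only the one-off connection is reported
```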
John O'Neil, Edgewise Networks
John O’Neil is the Data Scientist at Edgewise Networks. He writes and designs software for data analysis and analytics, search engines, natural language processing and machine learning. He has a PhD in linguistics from Harvard University, and is the author of more than twenty papers in Computer Science, Linguistics, and associated fields, and has given talks at numerous professional and academic conferences.
Reducing Chaos in Production: Testing vs. Monitoring
Robert Treat, OmniTI
While no one disputes the good in finding and fixing issues before deploying to production, relying on traditional testing methods in the age of data-intensive, internet-scale software has proven to be incomplete. The ability to identify and fix production issues quickly is crucial and requires insight into usage patterns and trends across the entire application architecture. This talk touches on deficiencies of common testing methods, provides real-world examples of discovering odd edge cases with both testing and monitoring, and offers recommendations on metric instrumentation to help companies identify and act on business-affecting problems.
Robert Treat, OmniTI
After a career spent building data-intensive, mission-critical systems at Fortune 500 companies across the e-commerce, healthcare, financial services, and SaaS industries, Robert currently spends his days as CEO of OmniTI, a technical services firm focused on providing web services and infrastructure management. A long-time open source contributor, he is a published author and speaks at conferences worldwide on topics such as open source, databases, and managing operations at scale. He occasionally blogs at http://xzilla.net.
Don't Burn Out or Fade Away: Conquering Occupational Burnout
Avleen Vig, Facebook
Occupational burnout is felt by many people at some point in their career. We'll discuss what burnout is, the causes, symptoms, and impacts of burnout, as well as ways to recover from it. This talk brings together scientific research, with a personal story of burning out and returning to health over the course of 12 months.
Occupational burnout is a concern for employees and employers. It is common across all industries and companies regardless of the number of employees, revenue, types of projects or hours worked.
Burnout is also deceptive: often the person experiencing it doesn’t even realize they are in the middle of an episode. At best, a burnt-out employee can work their way through it. At worst, they break friendships, burn bridges, and leave what would otherwise be happy employment with a trail of sadness in their wake.
During this talk, we’ll take a walk through a recent, long-lasting episode of burnout, discussing each point, how it relates to the research on the topic, and what the current research teaches us. We’ll especially look at methods to identify burnout in yourself and others, and how to break the cycle before it’s too late.
Avleen Vig, Facebook
Avleen is a Production Engineer at Facebook, where he helps scale Facebook’s infrastructure. Before joining Facebook he worked at several large tech companies, including EarthLink, Google, and Etsy.
Training I
Inclusion of Women and Underrepresented Individuals in IT Workplaces
Jeri-Elayne Smith, The Citadel, The Military College of South Carolina
In this experiential training session, individuals will discuss and reflect upon inclusion in the workplace. After taking an individual focus, they will learn a research-based model of Ubuntic Inclusion. The session will end with each attendee creating the start of an "Inclusion Action Plan."
Jeri-Elayne Smith, The Citadel, The Military College of South Carolina
J. Goosby Smith, Ph.D. is an Associate Professor of Management and Assistant Provost for Diversity Equity and Inclusion at The Citadel, The Military College of South Carolina located in Charleston. Smith holds Ph.D. and M.B.A. degrees in Organizational Behavior from Case Western Reserve University, and a B.S. in Computer Science from Spelman College. After a decade in IT for Digital Equipment Corporation (DEC) and other firms, Smith made the transition to a career as a diversity and inclusion professional.
Training II
Systems Tracing and Trace Visualization Lab—Part I
Suchakrapani Sharma, Shiftleft Inc; Geneviève Bastien, École Polytechnique de Montréal; Mohamad Gebai, SUSE
Trace Visualization Lab is a two-part lab session that introduces participants to system tracing and trace visualization, an invaluable technique for understanding in-depth system behavior and reaching the root cause of problems faster than CLI tracing modes alone. The focus of this lab is on post-mortem analysis (the system failed and we want to understand the root cause). The tutorial first introduces attendees to system tracing, trace collection, storage/aggregation/filtering, and eventually visualization techniques such as flamecharts, flamegraphs, timeline views, critical flow views (inter-process flow), container and VM views, resource usage, etc.
The introductory session (Part I) also explains why and when systems tracing is required and its role in supporting related performance analysis techniques such as distributed tracing, profiling, debugging, and service log analysis. Various visual techniques will be shown, and hands-on activities will be carried out to showcase how different "views" can help in scenarios such as resource contention and latency analysis.
For this lab, trace collection will primarily be done with LTTng and Perf/Ftrace, and the primary trace visualization system will be Trace Compass.
The Trace Visualization Lab has been open sourced and is publicly accessible at: https://github.com/tuxology/tracevizlab.
Suchakrapani Sharma, Shiftleft Inc
Suchakra is currently a Scientist at ShiftLeft Inc. He completed his PhD in Computer Engineering from École Polytechnique de Montréal where he worked on eBPF and hardware-assisted tracing techniques for advanced systems performance analysis. He has been involved in research on systems performance analysis and has delivered talks at LISA 2017 (San Francisco), Tracing Summit 2015 and 2016 (Seattle and Berlin) where he has demonstrated advanced kernel and userspace tracing tools. He also developed one of the first hardware-trace based VM analysis techniques. More information about him can be found at : https://suchakra.wordpress.com/about
Geneviève Bastien, École Polytechnique de Montréal
Geneviève Bastien is a research associate at the Dorsal Laboratory of École Polytechnique de Montréal. She is a contributor to the Trace Compass and LTTng projects. Her mission is to make students' lives easier when it comes to prototyping cool new analyses and to make sure that their contributions make it into the hands of end users. She's also involved in Community Wireless Networks in Montréal and in organizations that promote the use of free software in the public sphere.
Mohamad Gebai, SUSE
Mohamad Gebai is a Senior Software Engineer at SUSE. He currently works on software-defined storage solutions, notably Ceph, and has previously been part of the Azure Storage team at Microsoft. He has also been a research associate and a guest lecturer at Polytechnique Montreal. He lectured on Cloud computing technologies.
10:30 am–11:00 am
Break with Refreshments
Legends Prefunction
11:00 am–12:30 pm
Talks I
Engineering an IP Portfolio
Matthew Horton, Arnold & Porter
A strong IP portfolio can be a powerful asset to a company. Of course, it provides economic value to the company or inventors. But a well-orchestrated IP portfolio development effort can also stimulate inventiveness and problem-solving, foster a culture of innovation, and enhance the prestige of the company.
This talk will focus on the ways that engineers can help capture IP value for their organizations. Engineers play a crucial role in the development of a robust IP portfolio. Streamlining communication between the legal and engineering teams and the processes for developing and capturing new developments gives the company a valuable advantage in achieving this goal. This talk will discuss strategies on improving these communication lines, and advise engineers on how to take a more active role in the development of the company's IP portfolio.
In addition, the talk will touch on the broad strokes of IP law and pressing issues relevant to the computing industry, such as what counts as “patent-eligible subject matter” after the Supreme Court’s Alice decision and how Oracle v. Google is affecting the ability to copyright software code. We will also address practical applications and considerations of various portfolio development strategies in day-to-day settings, such as whether a company should pursue an IP program when it is primarily an open source shop, and how that could be accomplished.
Matthew Horton, Arnold & Porter
Matt Horton counsels clients on strategic intellectual property portfolio development, focusing on business-oriented and industry-oriented portfolio strategies. Matt also supports IP litigation and cybersecurity efforts. He has extensive information technology and cybersecurity experience, serving as a systems engineer and cybersecurity engineer on teams responsible for federal network infrastructures. Matt holds an LLM in Intellectual Property from The George Washington University Law School, and a JD from Tulane University. He also has a Masters in Computer & Information Technology from the University of Pennsylvania, and a BS in Information Technology from George Mason University, with a concentration in Networking and Cybersecurity.
Operational Excellence in April Fools’ Pranks: Being Funny Is Serious Work!
Thomas Limoncelli, Stack Overflow, Inc.
One of the most “high stakes” launches you can do is the April Fools’ Prank (AFP). The usual best practices don’t always work, and there is no second chance: you can’t just re-launch on April 2nd. Many silos must sign off on the plan, yet the plan must be kept secret. Beta testing is particularly difficult. Loads shift from zero to millions in one day with little room for mistakes.
In this talk, I’ll discuss how to communicate across the company to get buy-in, how to load test using “dark launches,” how to roll back using feature flags, and other techniques for ensuring a high-stakes launch happens correctly. I’ll include stories from many AFPs that I’ve both observed and been a part of.
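One of the techniques mentioned above, percentage-based feature flags for a dark launch, can be sketched in a few lines of Python; the flag name and rollout percentage below are hypothetical, not from the talk's examples.

```python
# Illustrative sketch of deterministic percentage rollout; names are hypothetical.
import hashlib

def flag_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """Bucket a user into 0-99 from a stable hash, so the same user always gets the same answer."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < rollout_percent

# Dark-launch to 1% of users for load testing, ramp to 100% on April 1st,
# and roll back instantly by setting the percentage to 0.
print(flag_enabled("afp-2019", "user-42", rollout_percent=1))
```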
Thomas Limoncelli, Stack Overflow, Inc.
Tom is the SRE Manager at StackOverflow.com and internationally recognized co-author of books such as the newly updated Volume 1: The Practice of System and Network Administration, 3rd Edition. He is also known for Time Management for Sysadmins and Volume 2: The Practice of Cloud System Administration. He is an internationally recognized author, speaker, system administrator, and DevOps advocate. He’s previously worked at small and large companies including Google, Bell Labs/Lucent, and AT&T. His blog is http://EverythingSysadmin.com and he tweets @YesThatTom.
Talks II
Keeping the Balance: Load Balancing Demystified
Murali Suriar, Google, and Laura Nolan
Can you explain the entire path that an IP packet takes from your users to your binary? What about a web request? Do you understand the tradeoffs that different kinds of load balancing techniques make? If not, this talk is for you.
Load balancing is hard, and it is made up of many disparate technologies. It cuts across network, transport and application layers. We'll describe different flavours of load balancing (network, naming, application) and how they work.
We will then discuss example use cases, and which load balancing approaches are most appropriate in each case. We'll also relate these to several design patterns for high-availability services that have developed over the years. Finally, we'll relate the techniques we've been discussing to well-known open source technologies and to the major cloud load balancing services.
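As one small example of the application- and naming-layer techniques in this space, here is a sketch of consistent hashing in Python, which keeps most keys mapped to the same backend when the backend set changes; the backend addresses and replica count are illustrative only.

```python
# Illustrative sketch of consistent hashing; backend addresses are placeholders.
import bisect
import hashlib

class ConsistentHashRing:
    """Map keys to backends so adding or removing a backend moves only ~1/N of the keys."""

    def __init__(self, backends, replicas=100):
        # Place each backend at many pseudo-random points on the ring.
        self.ring = sorted(
            (self._hash(f"{backend}:{i}"), backend)
            for backend in backends
            for i in range(replicas)
        )
        self.points = [point for point, _ in self.ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def backend_for(self, key):
        # Walk clockwise to the first backend point at or after the key's hash.
        index = bisect.bisect(self.points, self._hash(key)) % len(self.points)
        return self.ring[index][1]

ring = ConsistentHashRing(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
print(ring.backend_for("user-42"))  # the same key always lands on the same backend
```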
Murali Suriar, Google
Murali Suriar is a lapsed computer science graduate, turned network engineer, turned SRE. Currently working at Google running cluster filesystem and locking services. Left Google to get on a boat. Got bored and came back.
Laura Nolan, N/A
Laura Nolan’s background is in Site Reliability Engineering, software engineering, distributed systems and computer science. She wrote the 'Managing Critical State' chapter in the O'Reilly ‘Site Reliability Engineering’ book, and is co-chair of SREcon18 Europe/Middle East/Africa. Laura is currently enjoying a well-earned sabbatical (and tinkering with some of her own projects) after 15 years in industry, most recently at Google.
Apache Kafka and KSQL in Action: Let’s Build a Streaming Data Pipeline!
Robin Moffatt, Confluent
Have you ever thought that you needed to be a programmer to do stream processing and build streaming data pipelines? Think again!
Apache Kafka is a distributed, scalable, and fault-tolerant streaming platform, providing low-latency pub-sub messaging coupled with native storage and stream processing capabilities. Integrating Kafka with RDBMS, NoSQL, and object stores is simple with the Kafka Connect API, which is part of Apache Kafka. KSQL is the open-source SQL streaming engine for Apache Kafka, and makes it possible to build stream processing applications at scale, written using a familiar SQL interface.
In this talk we’ll explain the architectural reasoning for Apache Kafka and the benefits of real-time integration, and we’ll build a streaming data pipeline using nothing but our bare hands, the Kafka Connect API, and KSQL.
Gasp as we filter events in real time! Be amazed at how we can enrich streams of data with data from RDBMS! Be astonished at the power of streaming aggregates for anomaly detection!
This will be a practical talk, after which attendees will have a clear idea of the power of stream processing, and how to get started with it using the open-source Apache Kafka and KSQL projects.
Robin Moffatt, Confluent
Robin is a Developer Advocate at Confluent, as well as an Oracle ACE Director and Developer Champion. His career has always involved data, from the old worlds of COBOL and DB2, through the worlds of Oracle and Hadoop, and into the current world with Kafka. His particular interests are analytics, systems architecture, performance testing and optimization. He blogs at http://cnfl.io/rmoff and http://rmoff.net/ (and previously http://ritt.md/rmoff) and can be found tweeting grumpy geek thoughts as @rmoff. Outside of work he enjoys drinking good beer and eating fried breakfasts, although generally not at the same time.
Training I
How to Write Effective Training
Ever wanted to teach a class but didn't know how to get started? This presentation will walk students through an organized process that can help take a class idea and develop an outline, class materials, exercises, and finally a proposal suitable for a conference or event. We will discuss how to use idea construction to break an objective into component parts, and then discuss different methods for teaching those components so that a student can grasp the concept and apply it. We will write simple workshop exercises that help reinforce the objective and give students enough information to take the idea back to their environment and apply it. We'll then go over general proposal requirements and some tips on how to write a winning submission.
Branson Matheson, Cisco
Branson is a 30-year veteran of system architecture, administration, and security. He started as a cryptologist for the US Navy and has since worked on NASA shuttle and aerospace projects, TSA security and monitoring systems, secure mobile communications, and Internet search engines. He has also run his own company while continuing to support many open source projects. Branson speaks to and trains sysadmins and security personnel world wide; and he is currently a senior technical lead for Cisco Cloud Services. Branson has several credentials; and generally likes to spend time responding to the statement "I bet you can't...."
How to Give Effective Training
Branson Matheson, Cisco
As conference attendees, we have respect for those that stand before us to give forth upon topics diverse and engaging. It's always been said that public speaking is a challenge, and in front of technical peers it can seem even more difficult, but it doesn't have to be. We will discuss taking a training course and developing it into an entertaining presentation that will facilitate the learning process and leave a lasting impression upon attendees. We will discuss several methods for preparing to give a presentation, ways to improve existing material, and tips and techniques for enhancing the experience for both students and instructor. Shakespeare said "the play's the thing," and while you won't be ready to perform as Elphaba, you will have a toolkit of methods and thoughts to improve your presentations.
Branson Matheson, Cisco
Branson is a 30-year veteran of system architecture, administration, and security. He started as a cryptologist for the US Navy and has since worked on NASA shuttle and aerospace projects, TSA security and monitoring systems, secure mobile communications, and Internet search engines. He has also run his own company while continuing to support many open source projects. Branson speaks to and trains sysadmins and security personnel world wide; and he is currently a senior technical lead for Cisco Cloud Services. Branson has several credentials; and generally likes to spend time responding to the statement "I bet you can't...."
Training II
Systems Tracing and Trace Visualization Lab—Part II
Suchakrapani Sharma, Shiftleft Inc; Geneviève Bastien, École Polytechnique de Montréal; Mohamad Gebai, SUSE
Trace Visualization Lab is a two-part lab session that introduces participants to system tracing and trace visualization, an invaluable technique for understanding in-depth system behavior and reaching the root cause of problems faster than CLI tracing modes alone. The focus of this lab is on post-mortem analysis (the system failed and we want to understand the root cause). The tutorial first introduces attendees to system tracing, trace collection, storage/aggregation/filtering, and eventually visualization techniques such as flamecharts, flamegraphs, timeline views, critical flow views (inter-process flow), container and VM views, resource usage, etc.
The advanced session (Part II) covers views in Trace Compass that support analysis of containerized workloads, VMs, and IRQs, as well as creating custom views in Trace Compass. The session also challenges users to solve a scenario and hunt bugs in real workloads, using all the visual tools and techniques in tandem. Attending Part I of this lab is a prerequisite for this session.
For this lab, trace collection will primarily be done with LTTng and Perf/Ftrace, and the primary trace visualization system will be Trace Compass.
The Trace Visualization Lab has been open sourced and is publicly accessible at: https://github.com/tuxology/tracevizlab.
Suchakrapani Sharma, Shiftleft Inc
Suchakra is currently a Scientist at ShiftLeft Inc. He completed his PhD in Computer Engineering from École Polytechnique de Montréal where he worked on eBPF and hardware-assisted tracing techniques for advanced systems performance analysis. He has been involved in research on systems performance analysis and has delivered talks at LISA 2017 (San Francisco), Tracing Summit 2015 and 2016 (Seattle and Berlin) where he has demonstrated advanced kernel and userspace tracing tools. He also developed one of the first hardware-trace based VM analysis techniques. More information about him can be found at : https://suchakra.wordpress.com/about
Geneviève Bastien, École Polytechnique de Montréal
Geneviève Bastien is a research associate at the Dorsal Laboratory of École Polytechnique de Montréal. She is a contributor to the Trace Compass and LTTng projects. Her mission is to make students' lives easier when it comes to prototyping cool new analyses and to make sure that their contributions make it into the hands of end users. She's also involved in Community Wireless Networks in Montréal and in organizations that promote the use of free software in the public sphere.
Mohamad Gebai, SUSE
Mohamad Gebai is a Senior Software Engineer at SUSE. He currently works on software-defined storage solutions, notably Ceph, and has previously been part of the Azure Storage team at Microsoft. He has also been a research associate and a guest lecturer at Polytechnique Montreal. He lectured on Cloud computing technologies.
Hidden Linux Metrics with Prometheus eBPF Exporter
Alexander Huynh, Cloudflare, and Ivan Babrou
While there are plenty of readily available metrics for monitoring the Linux kernel, many gems remain hidden. With the help of recent developments in eBPF, it is now possible to run safe programs in the kernel to collect arbitrary information with little to no overhead. A few examples include:
- Disk latency and io size histograms
- Run queue (scheduler) latency
- Page cache efficiency
- Directory cache efficiency
- LLC (aka L3 cache) efficiency
- Kernel timer counters
- System-wide TCP retransmits
Practically any event from "perf list" output and any kernel function can be traced, analyzed and turned into a Prometheus metric with almost arbitrary labels attached to it.
If you are already familiar with BCC tools, you can think of ebpf_exporter as BCC tools turned into Prometheus metrics.
In this tutorial we’ll go over eBPF basics, how to write programs and get insights into a running system.
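To give a sense of the mechanics (this is not the ebpf_exporter code itself), here is a rough sketch using the BCC Python bindings and the Prometheus client library: a kprobe on tcp_retransmit_skb counts retransmissions, and a small loop exposes the count as a metric. The port, polling interval, and metric name are assumptions chosen for illustration.

```python
# Illustrative sketch only; not the ebpf_exporter implementation.
# Assumes bcc and prometheus_client are installed and the kernel supports kprobes.
from bcc import BPF
from prometheus_client import Counter, start_http_server
import time

# Kernel-side program: count calls to tcp_retransmit_skb in a BPF hash map.
bpf_text = """
BPF_HASH(counts, u32, u64);
int trace_retransmit(struct pt_regs *ctx) {
    u32 key = 0;
    counts.increment(key);
    return 0;
}
"""

b = BPF(text=bpf_text)
b.attach_kprobe(event="tcp_retransmit_skb", fn_name="trace_retransmit")

RETRANSMITS = Counter("tcp_retransmits_total",
                      "TCP retransmissions observed via kprobe")
start_http_server(9435)  # expose /metrics for Prometheus to scrape

last = 0
while True:
    time.sleep(5)
    total = sum(v.value for v in b["counts"].values())
    RETRANSMITS.inc(total - last)  # export only the delta since the last poll
    last = total
```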
Alexander Huynh, Cloudflare
Performance engineer, specializing in fine tuning existing systems, and the R&D of future systems.
Ivan Babrou, Cloudflare
Ivan is a Performance Engineer at Cloudflare. He spends his days finding performance bottlenecks, fixing them, and making sure a large chunk of the internet runs as fast and as efficiently as possible.
12:30 pm–2:00 pm
Conference Luncheon
Broadway Ballroom E–K
2:00 pm–3:30 pm
Talks I
Mastering Near-Real-Time Telemetry and Big Data: Invaluable Superpowers for Ordinary SREs
Ivan Ivanov, Netflix
One of the fundamental requirements for a typical SRE team is solid operational insight into the holistic state of the supported system. This usually involves collecting, aggregating, correlating, visualizing, and reacting to data generated by a diverse set of data sources.
As part of this talk I will go through the high level approach, sample data sources, and some implementation details, used by the Netflix Open Connect CDN Reliability engineering team while supporting the infrastructure and services hosted on thousands of physical servers and providing streaming video delivery for hundreds of millions of clients.
I will cover some samples of the usage of Hive, Presto, Spark, Elasticsearch, Tableau, and Netflix developed tools for monitoring, alerting, debugging, long term analysis, planning, etc. I will also speak about the benefits of correlating detailed data from server and client reported telemetry.
For clarity—this is not a talk about how to build, implement, develop or support Big Data or near real time telemetry systems. This is all about how you can use them as a platform and a powerful toolset for making your operations team stronger.
Ivan Ivanov, Netflix
Ivan is a Senior CDN Reliability Engineer on the Netflix Open Connect team. He has been designing, deploying, supporting, and optimizing online services on a global scale in various operations roles for the last 17+ years. He is focusing on service reliability, availability, scalability, quality of experience (QoE), service optimizations, investigating and troubleshooting server/client/network/application related issues. Prior to joining Netflix Ivan was Principal Service Engineer at Microsoft working on Windows Update, Microsoft.com and Windows Store.
Datastore Axes: Choosing the Scalability Direction You Need
Nicolai Plum, Booking.com Ltd
This talk will discuss the different orthogonal axes that data storage systems scale up along. Storage system designs can grow successfully in one or more of read rate, write rate, data size and data complexity. They cannot grow in all of these directions; modern highly scalable storage systems are a compromise. This talk will discuss the restrictions of hardware that force the need for smart software, then describe how to scale up an RDBMS and NoSQL datastores. It will cover strong and weak points of a variety of system architectures with examples of how they succeeded and failed in operational use, as well as the types of solution that can be applied to solve challenges in organisational objectives that need scalability along particular axes.
Nicolai Plum, Booking.com Ltd
Currently working as a Database Engineer in the Database Engineering Team of Booking.com, previously systems administration and systems architecture positions at Booking.com and Systems Administrator positions at UUNET UK.
Make Your System Firmware Faster, More Flexible and Reliable with LinuxBoot
Andrea Barberio and David Hendricks, Facebook
UEFI has largely replaced the traditional BIOS in servers, but it is still often limited and has bugs which can take a long time to be fixed by vendors. LinuxBoot, an open source effort, offers a solution to these and other problems by embedding Linux into a machine's firmware, and using it to initialise hardware and start the boot process. Amongst other advantages, this provides significantly faster boot time, enables us to more securely netboot systems over HTTPS and gives access to real debugging tooling. We will talk about the advantages of running LinuxBoot for reliability, speed and flexibility, and how you can try it today.
Andrea Barberio, Facebook
Production Engineer at Facebook Dublin.
Currently working on LinuxBoot and Provisioning services. Previously worked on Facebook's DNS infrastructure and AWS's network monitoring systems as a software engineer.
My interests span from network analysis and security, to reverse code engineering, from systems lifecycle to firmware automation. More about me at https://insomniac.slackware.it.
David Hendricks, Facebook
David ("dhendrix") has been involved with open source firmware and tools for many years, first as an intern at Los Alamos National Laboratory and later as a software engineer at Google working on ChromeOS and at Facebook working on Open Compute Project and Telecom Infrastructure Project. He is working to address challenges of development, deployment, maintenance, and security of system firmware at scale.
Talks II
Modern HTTP Routing
Sandor Szücs, Zalando SE
Modern HTTP routing should support traffic switching, provide fine-grained visibility, add resiliency patterns, be extensible, and be easily configured by development teams. At Zalando SE, we run more than 80 Kubernetes clusters and have developed open source tools to support these features. One core component, Skipper, is what we internally call the Swiss Army knife of HTTP. It provides filters to modify HTTP data, headers, and body, and lets us specify route matching with predicates. Using these principles, we can support deployment patterns like blue-green, shadow traffic, or A/B tests.
Shadow traffic is one less known feature that enables users to test new architecture with production traffic, without disrupting production.
Modern L7 software load balancers have to be simple to scale, provide visibility in depth, and enhance resiliency. Retries, circuit breakers, and rate limits are interesting mechanisms that can help your applications be more resilient to failure. How do you implement rate limits in a cluster that grows and shrinks, changing the number of backend replicas as well as the number of load balancer instances?
The talk will show a modern HTTP router and its features, along with some practical use cases from our production setup that can be reused by the audience.
Sandor Szücs, Zalando SE
Sandor Szücs builds infrastructure software with Kubernetes that enables feature teams to be productive. He strongly believes in DevOps and open source. He likes to code in Go and focuses on the Ingress part of Kubernetes. Sandor was formerly a System Engineer managing production systems in bare-metal datacenters with LXC, Puppet, hardware load balancers, and storage systems.
Taking Over & Managing Large Messy Systems
Steve Mushero, ChinaNetCloud
People love to build new systems from scratch, but the real world is full of large, messy systems with limited documentation, manual processes, huge technical debt, and a myriad of problems. These often need to be taken over, stabilized, managed, optimized, and upgraded for the long term, usually without downtime, a process that can take a week, a month, or a year. This is a large-scale Managed Service Provider's look at the tools and processes it has used over the last decade to do this on 1, 10, and 100+ million-user systems on the ever-chaotic Chinese Internet.
Steve Mushero, ChinaNetCloud
Steve Mushero is CEO of Ops Platform provider Siglos, and CEO of ChinaNetCloud, China's first Internet Managed Service Provider, AWS Partner, and manager of hundreds of large-scale (up to hundreds of millions of users each) systems. He's previously been CTO in a variety of organizations in Silicon Valley, Seattle, New York, and around the world.
Reborn in IT
Lori Barfield, RaiseMe
Some of the most talented people we work with come into the IT profession orthogonally, after starting out in different fields at the beginning of their careers. Also, seasoned engineers who have always been in IT often find themselves in a rut after a few years and want to shift to a different industry vertical. Making the move into management or out of management poses similar challenges. These are all re-born candidates, and they are uniquely qualified to handle some of the hardest jobs in IT. Yet they all face the same challenge: how to get picked for their dream role when coming from a different background. First they have to do battle with the notoriously cookie-cutter hiring guidelines in the IT industry.
Based on a successful non-profit career development program in Southern California, this talk addresses the big challenges with making IT engineering dreams come true:
- Which certifications and training programs are the most respected
- What you should and shouldn't do with your resume
- What types of roles and employers you should target
- How to help recruiters help you
- Sample language for hiring managers trying to justify a special placement
- How LUGs and Conferences help
Lori Barfield, RaiseMe
Lori joined her first Internet startup as a senior UNIX system administrator. When that company went public, she was hooked, and helping smaller firms prevail against well established rivals has been her passion ever since. She is currently COO at VerticalSysadmin and a chair at the SCaLE and ShellCon conferences in Southern California. In 2017 she developed RaiseMe, a unique career development effort for nonprofits, which has successfully helped people obtain their first engineering roles in the information security industry.
Lori is a mother of five and enjoys the occasional escape to the woods with her husband.
Training I
Test Infrastructure Driven Design
Joaquin Menchaca, RocketLawyer
Many organizations struggle to eliminate the tech debt in their existing bubble-gum-and-scripts change and configuration code, or look toward adopting immutable production patterns with Docker and Kubernetes. The question is how to do this while building confidence that everything will still work when modules or whole systems are swapped out.
Joaquin Menchaca, RocketLawyer
Joaquin started in the industry straight out of high school at Apple Computer as a QA Engineer and later a Tool Developer. As QA roles transformed, he moved into information systems as a system administrator, then a systems engineer, and now a Senior DevOps Engineer (yes, he knows DevOps is a philosophy). Currently, Joaquin is partnered with RocketLawyer, helping them transform their organization to apply DevOps principles that break down silos and foster collaboration.
Training II
Linux Systems Troubleshooting
Thomas Uphill, Narrabilis
In the days of immutable infrastructure, what's the point of troubleshooting anything?
In my experience, when there is a problem with the image, application, or container, the immutable infrastructure will just keep recreating the problem. Knowing how to diagnose common problems is still an important skill. Moreover, the machines that control images and containers are longer-lived than the containers they run, and problems on these machines still need to be found and fixed.
In this tutorial we will look at problems commonly seen on Linux systems. We'll start with how various subsystems work: networking, filesystems, users/groups, and permissions. We'll then move on to tools for inspecting running systems, including of course strace. The focus will be on "off the shelf" tools and how to use them.
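(Illustrative sketch only, not tutorial material.) Many of those off-the-shelf tools ultimately read the /proc filesystem; the short Python 3 script below shows the idea by inspecting a running process directly. The fields printed are just examples, and inspecting another user's process may require root.

    #!/usr/bin/env python3
    """Inspect a running Linux process via /proc (defaults to this script's own PID)."""
    import os
    import sys
    from pathlib import Path

    def inspect(pid: int) -> None:
        proc = Path("/proc") / str(pid)
        # /proc/<pid>/status lists the process name, state, memory use, and thread count
        for line in (proc / "status").read_text().splitlines():
            if line.startswith(("Name:", "State:", "VmRSS:", "Threads:")):
                print(line)
        # Each entry in /proc/<pid>/fd is a symlink to an open file, socket, or pipe
        print("Open file descriptors:", len(list((proc / "fd").iterdir())))

    if __name__ == "__main__":
        inspect(int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid())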
Thomas Uphill, Narrabilis
Thomas is a veteran system administrator who has recently switched to a development role. He's the author of several books on Puppet and has spoken at past LISA conferences on a variety of topics. His primary work environment is Linux, and he's a Vim user.
3:30 pm–4:00 pm
Break with Refreshments
Legends Prefunction
4:00 pm–4:15 pm
Closing Remarks
Legends Ballroom ABCD
LISA18 Takeaways
Program Co-Chairs: Rikki Endsley, Opensource.com, and Brendan Gregg, Netflix
4:15 pm–5:15 pm
Keynote Addresses
Serverless Ops: What Do We Do When the Server Goes Away?
Tom McLaughlin, ServerlessOps
With the rise of serverless architecture, many of the common day-to-day operations tasks will change dramatically, if not disappear completely. We as Operations professionals will be challenged to redefine our roles and responsibilities within the technology organization as serverless abstracts away the server and its respective OS to cloud service providers. No stranger to this scenario, we will not only be tasked with solving these engineering obstacles introduced by the new serverless paradigm, but we will also need to prove our value to the business in the face of a changing technology landscape… again.
This is a combination professional/cultural and technical talk. We’ll start by discussing the disruption that serverless presents to operations and why. While DevOps and public cloud are becoming commonplace, serverless is the beginning of a new disruption cycle. We need to understand why serverless is disruptive and learn from the lessons of the past.
The talk will continue by discussing the value of operations work and the relationship between work and value. Not all work has the same value, and we need to understand this so we can prioritize the best use of our time.
Finally, I will walk through the current state of serverless engineering and tools, and show how and where we fit in. For our career longevity and security, we need to understand how we fit.
A few areas I’ll discuss include:
- DevOps and public cloud, and how serverless is starting a new disruption cycle 10 years later
- Understanding and determining the value of your work
- Moving up the value chain and closer to customer success metrics
- Team reorganization to better align with business success
- Serverless architectural decision making
- Performance management and cost containment
- Failure monitoring and service handling
- Security risks and concerns
If you're an operations engineer and all of your host- and OS-related work were taken away, would you know what to do to stay busy and demonstrate your value to your organization?
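(Hypothetical sketch, not taken from the talk.) One flavor of the ops work that remains looks like the snippet below: with no host or OS to monitor, instrumentation moves into the function itself, here as a custom CloudWatch failure metric emitted from an AWS Lambda handler using boto3. The namespace, metric name, and process() helper are invented for illustration.

    # Hypothetical example: failure monitoring for a serverless function.
    import boto3

    cloudwatch = boto3.client("cloudwatch")

    def handler(event, context):
        try:
            return process(event)  # stand-in for the real business logic
        except Exception:
            # Emit a custom metric that dashboards and alerts can key on,
            # since there is no host-level agent to catch the failure.
            cloudwatch.put_metric_data(
                Namespace="MyService",
                MetricData=[{"MetricName": "ProcessingFailures", "Value": 1.0, "Unit": "Count"}],
            )
            raise

    def process(event):
        ...  # placeholder for real work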
Tom McLaughlin, ServerlessOps
Tom is the founder of ServerlessOps and an experienced operations engineer. He started ServerlessOps after asking himself what he would do if servers went away; at a loss for an answer and interested in the future of his profession, he decided to pursue it. Tom is actively engaged in promoting serverless infrastructure and engaging with the community to learn more about their thoughts, wants, and concerns around the topic.
Chaos Engineering
Chaos Engineering is a helpful tool for understanding your system's unknowns, but it is not, by itself, the means to achieving resilience. Instead, it helps instill higher confidence in a system's ability to cope and remain resilient in the face of inevitable failures.
This session will go over lessons learned and the impact Chaos Engineering has had on that confidence at Netflix. As John Allspaw has said, "Resilience is the story of the outage that didn’t happen." The session will cover those stories: chaos vulnerabilities the Netflix team has found, how we follow up on those vulnerabilities, and how Chaos Engineering is incorporated into our day-to-day culture.
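(Toy sketch, not Netflix's tooling.) The underlying idea can be shown in a few lines of Python: deliberately inject failures or latency into a small fraction of calls to a dependency and observe whether fallbacks, timeouts, and alerts behave as expected. The failure rate, added latency, and fetch_recommendations() function are made up for illustration.

    import random
    import time
    from functools import wraps

    def inject_faults(failure_rate=0.05, added_latency_s=2.0):
        """Wrap a dependency call so a fraction of requests slow down and fail."""
        def decorator(fn):
            @wraps(fn)
            def wrapper(*args, **kwargs):
                if random.random() < failure_rate:
                    time.sleep(added_latency_s)           # simulate a slow dependency
                    raise TimeoutError("injected fault")  # simulate a failed call
                return fn(*args, **kwargs)
            return wrapper
        return decorator

    @inject_faults(failure_rate=0.1)
    def fetch_recommendations(user_id):
        return ["placeholder"]  # stand-in for a real downstream call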
Nora Jones, Netflix
Nora is a Senior Software Engineer at Netflix and a student of Human Factors and Systems Safety at Lund University. She is passionate about resilient software, people, and the intersection of those two worlds.
She recently co-wrote the book on Chaos Engineering with her teammates at Netflix and keynoted AWS re:Invent to an audience of over 40,000 people about the benefits and business case behind implementing Chaos Engineering.