All the times listed below are in Eastern Standard Time (EST).
Monday, December 7
9:00 am–10:00 am
10:00 am–10:15 am
Program Co-Chairs: Nora Jones, Jeli.io, and Mike Rembetsy, Bloomberg
10:15 am–11:45 am
Opening Plenary Session
Laura Maguire, PhD
If you ask a group of engineers how they resolved a particularly difficult outage they typically talk about the dashboards that got pulled up, the logs they looked at, the node someone restarted, or the jobs they killed on their way to restoring the service. But that doesn't do much to tell us how, given conditions of uncertainty and time pressure, practitioners flexibly apply their knowledge to novel problems.
In other words, how did an engineer know what was the 'right' thing to do in spite of ambiguous data or who was the 'right' person to help diagnose a particularly out-of-control problem?
Over the last 3 years, as part of her dissertation work, Dr. Maguire studied both established and ad hoc teams of engineers responding in real-time to service outages ranging from minor disruptions to potentially organizational-viability-crushing events. In her research, she examined 62 cases of incident response across 4 organizations of varying scale and complexity to understand how engineering teams manage these costs of coordination under differing circumstances.
In this talk, Dr. Maguire will highlight some surprising (and provocative!) findings such as:
- Incident management works very differently than existing domain models (like Google SRE) suggest and incident command can actually be counterproductive to fast resolution;
- The choreography of the cognitive work in this joint activity is shown to be much more subtle and highly integrated into the technical efforts of dynamic fault management than previously understood;
- Tooling designed to aid coordination can incur additional cognitive costs for practitioners;
- Strategies of 'adaptive choreography' enable practitioners to cope with dynamic events and dynamic coordination demands;
- How tooling and intra-organizational dependencies can shift costs of coordination across time and organizational boundaries, increasing complexity for SREs.
Laura leads the research program at Jeli.io, where she studies software engineers as they cope with the cognitive complexities of keeping distributed, continuous deployment systems reliably functioning and helps to translate those findings into a product that is advancing the state of the art of incident management in the software industry. Her research interests lie in resilience engineering, coordination design, resilient systems control, and building tooling to enable adaptive capacity across distributed work teams. She was a researcher with the SNAFU Catchers Consortium from 2017–2020, working closely with large and medium-sized digital service companies to identify and support resilient performance within their engineering teams. Laura has a Master's degree in Human Factors & Systems Safety, a Ph.D. in Integrated Systems Engineering from the Ohio State University, and extensive experience working in industrial safety & risk management.
Liz Fong-Jones, honeycomb.io
You don't need to write automation or deploy on Kubernetes to gain benefits from resilience engineering! Learn how Honeycomb improved the reliability of our Zookeeper, Kafka, and stateful storage systems through terminating nodes on purpose. We'll discuss the initial manual experiments we ran, the bugs in our automatic replacement tools we uncovered, and what steps we needed to progress towards continuously running the experiments. Today, no node at Honeycomb lives longer than 12 months, and we automatically recycle nodes every week.
Liz is a developer advocate, labor and ethics organizer, and Site Reliability Engineer (SRE) with 16+ years of experience. She is an advocate at Honeycomb for the SRE and Observability communities, and previously was an SRE working on products ranging from the Google Cloud Load Balancer to Google Flights.
11:55 am–12:35 pm
Alex Elman, Indeed
In the aftermath of a serious outage, leadership aims to improve performance and avoid future incidents. With so much data to analyze, it's difficult to know where to direct attention. How do we know we're getting better? Focusing solely on shallow one-dimensional measures of progress like MTTR, incident count, and severity obscures the deeper lessons. Holding teams to performance metrics based on things they can't control can be demoralizing. Incidents subtly influence organizations through system changes, new designs, budgets, policies, procedures, and hiring. Thorough incident analysis uncovers these unseen influences and their contributions to safety. Incident analysis produces artifacts such as interview transcripts, annotated timelines, contributing factors, and themes. This enables meta-analyses across incidents, uncovering previously unseen opportunities. Providing leaders with richer data unlocks insights into reliability, organizational learning, and opportunities for strategic investments. This fosters deeper trust between leaders and practitioners and yields healthier, happier teams.
For the past nine years Alex Elman has been helping Indeed cope with ever-increasing complexity and scale. He is a founding member of the Site Reliability Engineering team. Alex leads the Resilience Engineering team focused on learning from incidents, chaos engineering, and fault-tolerant design patterns.
Zachary Meath, Two Sigma Investments, LP
What makes for a healthy tech culture, and how do our technical decisions influence it? Zachary Meath works on Two Sigma's DevOps platform “COIN”, which uses Jenkins hosted in Kubernetes to provide a scalable environment in which teams can run their CI/CD pipelines. Due to its ubiquity, COIN plays a key role in DevOps culture at Two Sigma. In this session, Zack will discuss how culture was at the forefront of COIN’s design, and how the two influenced each other over time.
Zachary Meath is a Software Engineer at Two Sigma, where he is responsible for various SDLC tools specializing in all things DevOps. Before his current role, Zack worked at Goldman Sachs and IBM, where he focused on network automation. Zack's key interests lie in reliability best practices, scalable systems, and hiking.
12:40 pm–1:20 pm
Marco Coulter, AppDynamics
The concepts of SLI, SLO, and Error Budgets are there to balance risk (rates of change) and reward (business contentment). Using such metrics as red lines to punish teams, or to force acceptance of risk by the business, misses the point. Experience with SLAs in service contracts shows that SLIs, SLOs, and Error Budgets are better used as a basis for conversations about the stress an application can withstand. This must include the business stress, as well as the infrastructure stress.
This session takes Goodhart's law from economic policy as a frame for reconsidering SLIs and SLOs. Leave this session inspired to approach your SLO negotiations in the best possible way.
As the Technical Evangelist for AIOps at AppDynamics, Marco Coulter is passionate about the experience humans have when interacting with technology. A former startup CTO, Marco has progressed from operator to leadership roles at CSC, CA Technologies, and more recently 451 Research, where he led the data science team. He earned the nickname "the tech-whisperer" for his skills in translating business drivers for a technical audience and technical concepts for business leaders. When taking the rare break from technology, Marco can be found harvesting fresh vegetables from his NYC garden.
Mohit Suley, Microsoft
Observability systems are usually designed to answer two broad questions - 'Is my service doing OK?', and - 'Is my business doing OK?'. There is a third perspective that often doesn't get enough attention (unless it's clearly linked to the first two) - 'Is my Customer Experience OK?'. It is fair to say that there's never a clear metric for this question.
This talk explains our motivation for stepping out of our metrics-centered 'comfort zone' and the journey that ensued: developing a habit of engaging face-to-face with some of our customers; figuring out ways to experience what they did; open-sourcing a high-scale tool to capture this data; setting up broader direct-to-team channels of communication from customers; and rethinking performance metrics.
If you are curious about why being a Customer Advocate makes you a better SRE, this talk is for you.
Mohit is an Engineering Manager on Bing's UX Foundation team. Designing systems to proactively improve availability and make customers happy is a core mission for them. In his spare time, he loves going for long walks, tinkering with hardware, and chasing his unachievable goal of reading more books than Bill Gates.
1:20 pm–1:50 pm
1:50 pm–2:30 pm
Despite thousands of squawking alerts and a morass of dashboards, our complex systems remain firmly mysterious. Incidents continue to pop up in places that, frankly, they should not. In this talk, we'll leverage techniques from dozens of companies to learn successes and failures, how to spread that hard-earned knowledge via observability and visualizations, and how to productize the process internally to drive down incident impact, improve customer experience, and reduce stress.
Cory Watson is an engineer at Stripe, leading high impact, customer-focused projects around reliability. Cory started his journey of observability as an SRE at Twitter, founded the observability team at Stripe, and spent time at vendors SignalFx and Splunk. He is a strong voice in the observability community, through OSS, popular tweets, blog posts, and speaking engagements.
Cory has over 20 years of software engineering experience and is an active founder/contributor of several successful Open Source projects. Before finding his passion for reliability, he worked in several industries such as e-commerce, consulting, healthcare, and fintech.
Daniel "Spoons" Spoonhower, Lightstep
Adopting Kubernetes, deploying a service mesh, or breaking up a monolith are all ways of building distributed software systems, but if we are going to build and operate software at scale, we need to think about how to build scalable and distributed people systems too.
In this talk, I'll cover a journey from a monolithic team (and a small set of collectively owned services) to a set of teams and many more services. I'll talk about how to use documentation, divide oncall responsibilities, and set clear objectives, as well as when to ask humans to drive and maintain the process (be it system documentation or alert runbooks) and when to depend on automated processes that use telemetry from the application itself.
Successfully building distributed ownership requires not just defining how we are going to hold teams accountable, but also giving those teams agency to make things better. That agency is often overlooked but is critical to success.
Daniel "Spoons" Spoonhower is CTO and a co-founder at Lightstep. He is an author of Distributed Tracing in Practice (O'Reilly Media, 2020). Previously, Spoons spent almost six years at Google as part of Google's infrastructure and Cloud Platform teams. He has published papers on the performance of parallel programs, garbage collection, and real-time programming. He has a Ph.D. in programming languages from Carnegie Mellon University but still hasn't found one he loves.
2:35 pm–2:55 pm
Daria Barteneva, Microsoft
Noisy pagers can lead to pager fatigue, but noisiness is a subjective idea: every team will have their own answer as to how many alerts lead to a noisy pager. There's an assumption that more must mean worse. That led me to wonder: does the number of pages correlate to on-call satisfaction? In this talk, we will look at the results of the study that confirms that a higher number of pages doesn't correlate with on-call satisfaction. The sense of agency to improve the on-call experience and drive the change in a collaborative manner had a stronger impact on team morale than just a number of pages. We will look at different aspects of team culture that have a high impact on engineers' on-call satisfaction and discuss behavioral patterns that contribute to a positive on-call experience.
Daria Barteneva is currently a Senior Software Engineer in the Observability Platform in Azure. With a background in Applied Mathematics, Artificial Intelligence, and Music, Daria is passionate about machine learning, diversity in tech, and opera. In her current role, Daria is focused on changing organizational culture, processes, and platforms to improve service reliability and on-call experience. Daria is originally from Moscow, Russia, spent 20 years in Portugal, and now lives in Dublin, Ireland.
Dave Stanke, Google
Somewhere out there, many layers of infrastructure and many miles of fiber away from us, there is someone whose product experience is dependent on the systems that we develop and maintain. Site Reliability Engineering (SRE) asserts that it is that person—the user, not our monitoring—who decides whether our systems are performing acceptably. But people have the unfortunate qualities of being inconsistent, irrational, and emotional. You know: human. What in the world do they want? And why does it even matter to those of us who maintain [insert your backend system here]? This talk will explore how tools like SLOs and Error Budgets help align platform engineering and operations work with product-oriented business outcomes. Along the way, we'll attempt to reconcile our objective, rational techniques with the uncomfortable truth that the ultimate arbiters of technical success are squishy, subjective humans.
Dave Stanke (twitter: @davidstanke) is a Developer Advocate for Google Cloud Platform, specializing in DevOps, Site Reliability Engineering (SRE), and other flavors of technical relationship therapy. He loves chatting with practitioners: listening to stories, telling stories, sharing a healthy cry. Prior to Google, he was the CTO of OvationTix/TheaterMania, a SaaS startup in the performing arts industry, where he specialized in feeding memory to Java servers. He chose on purpose to live in New Jersey, where he enjoys baking, indie rock, and fatherhood.
3:00 pm–3:20 pm
Anika Mukherji, Pinterest
Code ownership is critical as your codebase grows and the company scales. SREs need to have an answer to the question, “Who owns this piece of code?” However, implementing an ownership framework involves a tactical dance of technical solutions, communications, and client contracts, as defining and enforcing ownership is a tricky task. Come learn how Pinterest designed a scalable ownership framework for its API Platform, then applied it to over 1700 API endpoints, involving over 70 teams across the engineering organization.
Anika joined Pinterest a year and a half ago as a backend performance engineer. She then transitioned into a new role as the single SRE for the Core Product organization at Pinterest, responsible for the reliability of the "Pinner" facing products. She works closely with both platform and product teams, acting as a liaison between infrastructure and product. Anika is passionate about learning and serving customers well, whether they be those of the Pinterest product or other engineers.
Shiri Yitzhaki, Itzhak Tueg, and Tapan Shah, Amdocs
How do you implement SRE in a big managed services organization (~5,500 employees), running operations for fifty big Telecommunications & Media companies?
The Amdocs managed services organization has implemented SRE to revolutionize the way we run operations, treating operations as a software challenge with inner sourcing and reuse as key principles, in order to improve customer experience and operational efficiency.
To make the change, we established a program that focused on three dimensions: people (upskilling the teams and implementing blameless postmortems), processes (changing the day-to-day to include development of automations in agile), and technology (introducing new tools and technologies).
Itzhak Tueg worked as a software program manager in the telecommunication and media industry. For the past two years, he has led the SRE implementation program at Amdocs Managed Services group.
3:25 pm–3:45 pm
Moshe Zadka, Twisted Matrix Laboratories
Jupyter is commonly thought of as a "data science tool". But the same features that make it appealing to data scientists make it appealing for Site Reliability Engineering: dynamic exploration and ability to share results. The talk will set up an "incident" where a cache slowdown is causing site problems and will show how we can use Jupyter to triage and remediate the problem. I'll also cover post-incident best practices: how to make sure that what has been done is properly documented and ready for the incident retrospective.
Moshe has been a DevOps/SRE since before those terms existed, caring deeply about software reliability, build reproducibility, and other such things. He has worked in companies as small as three people and as big as tens of thousands—usually someplace around where software meets system administration.
Bill Johnson, Microsoft
An SRE's primary goal is to balance the technical and operational aspects of a system to drive reliability. Reliability and sustainability are very similar, and SREs are uniquely positioned to balance a third area: environmental sustainability. This talk will detail these three areas and how they relate to SREs, and will provide some tangible examples of what you can do in your team today to be a champion of Technical, Operational, and Environmental sustainability efforts.
Bill is a Principal SRE responsible for the Azure SRE AKS (Azure Kubernetes Service) Engagement and a member of the Azure Sustainability v-team focused on reducing Azure's carbon footprint for internal teams and customers.
Tuesday, December 8
9:00 am–10:00 am
10:00 am–10:40 am
J. Paul Reed, Netflix
One of the core tenets of both modern SRE and DevOps professional practices is "Automate all the things!"
But are there processes we shouldn't automate? And what if HOW we automate actively causes us and the systems we're responsible for harm? We'll take a look at what human factors have to do with automation as well as at some of the impacts and challenges pervasive automation has presented for system administrators and SREs, along with some important considerations when automating our complex, living socio-technical systems, and some strategies to cope when the shell scripts strike back!
J. Paul Reed began his career in the trenches as a build/release and operations engineer. After launching a successful consulting firm, he now spends his days as a Senior Applied Resilience Engineer on Netflix's Critical Operations & Reliability Engineering (CORE) team, focusing on incident analysis, systemic risk identification and mitigation, applied Resilience Engineering, and human factors expressed in the streaming leader's various sociotechnical systems.
Effie Mouzeli and Giuseppe Lavagetto, Wikimedia Foundation
At Wikimedia, we are running one of the top 15 traffic websites on the internet! Our infrastructure is powered by free software, with MediaWiki at its core. To improve performance, in 2014 we happily migrated from mod-php to Facebook's HHVM (Hip Hop virtual machine), and everything was well until September 2017, when Facebook announced that it would be dropping PHP support.
This is the story of the long project to migrate our application clusters from HHVM to php-fpm, and the application itself from PHP5 to PHP7, while serving billions of page views per month. We want to share the good, the bad, the ugly, and the questionable decisions we made in order to successfully migrate and give the SRE perspective of a complex migration, broken down into small pieces. Moreover, the centerpiece of this talk is how we benefited from testing in production, which played a key role during this project.
Effie studied physics and distributed scientific computing but didn't turn out to be a physicist or a scientific computer scientist. She has worked as a systems engineer/SRE at a number of startups and small organizations (most of which are not with us anymore), where her responsibilities were usually automation, infrastructure architecture, and working closely with developers. Currently, she is on the SRE team that takes care of Wikipedia and its sister projects at the Wikimedia Foundation.
Giuseppe is an astrophysicist turned wannabe technologist. He has worked for the last 12 years on the infrastructure of large-scale websites, and for the last 6 on the infrastructure of your favourite free encyclopedia.
10:45 am–11:25 am
Mikolaj Pawlikowski, Bloomberg
Chaos Engineering is steadily transforming from a gimmick to a serious, scientific discipline focused on observing and measuring the effects of the failure in systems of all shapes and sizes, in order to verify their behavior experimentally.
Unfortunately, the Internet is still full of slogans like "breaking things in production," which, while well-intentioned, can be harmful to the understanding of what Chaos Engineering is really about. In this talk, I'd like to argue that adopting Chaos Engineering can prove to be a very good investment, regardless of the nature of the system in question.
To do that, I'm going to cover three case studies: a single process, a JVM application, and a set of microservices running on Kubernetes.
Mikolaj Pawlikowski is a Software Engineering Team Leader at Bloomberg and author of Chaos Engineering: Crash test your applications. You might also know him for his Kubernetes tools PowerfulSeal and Goldpinger.
Dan Lüdtke, Google Inc.
May I introduce "Skinny", an education-focused, distributed lock service.
With the help of Skinny, we will:
- briefly look at the Paxos protocol
- see an example of a typical Paxos run
- design a simple distributed consensus protocol
- learn the tricky parts of implementing our simple distributed consensus protocol
- gradually move from theory-level to coding-level, solving small challenges (network, availability, fault-tolerance) along the way
This talk addresses engineers who have had little exposure to the inner workings of distributed consensus, who want to learn about distributed consensus as they start building distributed systems, or who have worked with ready-made distributed consensus solutions such as ZooKeeper and etcd but strive to understand the underlying theory as well.
Disclaimer: This work is not affiliated with any company (including Google) and purely educational!
Dan served his country, worked as a security consultant, once wrote a book about IPv6, contributes to open source software projects, helps to organize large hacker events, runs an autonomous system for fun, and dreams of space travel. He works at Google Stadia as a Site Reliability Engineering Manager.
11:30 am–12:10 pm
Incident Response in Unfamiliar Sociotechnical Systems: One Incident Commander's Challenges Supporting Inter-organizational Anomaly Response in the Age of COVID-19
Morgan Collins, Salesforce
The Incident Command System (ICS) has become an increasingly utilized tool for incident response within modern enterprise software and cloud computing organizations. With the unique challenges of the 2020 COVID-19 pandemic, companies are also seeing the introduction of additional strain on organizations that are working to facilitate mission critical support while adjusting to rapidly increasing service load as a result of a historic increase in remote work and web commerce opportunities. I welcome you to join me as I go into hilarious detail on the origins of ICS, the ways that I have seen it adopted within private enterprise, and the challenges that have been encountered from the increased interorganizational incident response as a result of COVID-19. From there, I'll discuss ways I recommend tackling these problems, as well as how to utilize immediate post incident discussions to strengthen lessons learned and expertise moving forward.
Morgan Collins has worked in technical engineering and operations for almost twenty years. From ISPs to managed security and cloud providers, Morgan's experience has had a strong focus on systems engineering and incident response. Most recently Morgan has worked as a Director of Site Reliability at Salesforce before moving to his current role within the company as a Principal SRE focused on incident response and analysis.
Shubheksha Jalan and Leslie Carr
Infrastructure and related domains aren't very friendly to beginners and people early in their tech career. There is a lack of resources and information about how to transition to these domains or how to get a job with little to no prior experience. We were all once in that position and this conversation between Shubheksha and Leslie will share what Shubheksha has learned and what we can do as a community to make it better for those coming after us. We strongly believe that we should be leaving doors we're lucky enough to pass through wide open for those coming after us.
Shubheksha Jalan is a software engineer keenly interested in complex problems at the intersection of distributed systems, reliability, and infrastructure at scale. She enjoys breaking down hard technical concepts through doodles, illustrations, and easy-to-understand blog posts. When she's not too busy fighting computers, she enjoys being outdoors and dabbling in arts and crafts.
Leslie Carr is an Engineering Manager at Quip. Leslie transformed from a productive engineer into a pointy-haired manager while at Clover Health. In her past life, Leslie worked at Cumulus Networks in DevOps, helping to push automation in the network world. Prior to that, she was on the production side of the world at many large websites, such as Google, Craigslist, and Wikimedia. Leslie is a lover and user of open source and automation. She dreams of robots taking over all of our jobs one day.
12:15 pm–12:55 pm
Tom Limoncelli, Stack Overflow, Inc.
A low-context SRE culture is one where the info you need to do your job is available, visible, and accessible. Creating a low-context culture improves remote teamwork and new-hire onboarding, enables project hopping, and helps you better handle a page you receive at 2 am.
A high-context SRE culture is one where most knowledge is unspoken, or people depend on history/experience rather than explicit telling. This can be very frustrating for new employees and a disaster in a "remote only"/COVID-19 team.
You'd rather work in a low-context SRE culture, right? I'll explain how to lower the context using (1) the power of defaults, (2) "make right easy", and (3) a culture of ubiquitous documentation. I conclude with ways to get people to write documentation (even if they hate writing documentation).
Tom is an internationally recognized author, speaker, system administrator, and DevOps advocate. He manages the SRE teams at Stack Overflow, Inc., and previously worked at Google, Bell Labs/Lucent, AT&T, and others. His books include Time Management for System Administrators (O'Reilly), The Practice of System and Network Administration (3rd edition), and The Practice of Cloud System Administration. In 2005, he received the USENIX SAGE Outstanding Achievement Award.
Zach Thomas, Genesys
You want to make the most of your SRE initiatives, but you don't have Google's scale. You can run an effective SRE program with only one person, but you need leverage. We will show you how to amplify your reach when you can't be everywhere at once. You will learn how to use data, tools, and communication to have an outsized impact on your organization.
Zach leads the cloud SRE team for Genesys. He is fascinated by the way complex human systems influence complex technology systems and vice versa. He has 20 years of experience building information systems for the web, in the domains of education, collaboration, and telecommunications. In a past life, he was a founding member of Okkervil River, quite a good indie rock band.
1:25 pm–2:05 pm
Christina Yakomin, Vanguard
Weather alert!! Vanguard's Chaos and Resilience Engineering team has implemented a home-grown Chaos Engineering platform, "The Climate of Chaos", featuring self-service experiments with meteorology-themed names like Cyclone, Tornado, and Blizzard. In this session, we'll break down the fundamentals of chaos engineering, the value of the practice, and why we decided to build our own platform from scratch, rather than leveraging a vendor product or an open source library. We'll also get into the details of our architecture, our client engagement model, our risk management strategy, and our integrated observability and reporting tooling. By running targeted experiments with our custom tool suite, we have ensured that our systems are ready to weather any storm, no matter how much chaos is in the forecast!
This structured technical talk will be geared towards an audience of engineers, and will be especially applicable to those working in large enterprise environments.
Christina is a Site Reliability Engineer in Vanguard's Chief Technology Office. She has worked at the company's Malvern, PA headquarters since graduating from Villanova University in 2015 with an undergraduate degree in Computer Science. Throughout her career, she has developed an expansive skill set in front- and back-end web development, as well as cloud infrastructure and automation, and she has earned three Amazon Web Services associate certifications. Christina has also worked closely with the Women's Initiative for Leadership Success at Vanguard, both internally at the company and externally in the local community, to further the career advancement of women and girls - in particular within the tech industry. In her spare time, Christina is passionate about traveling; she has visited over 20 different countries and 25 U.S. states so far!
David Argent, Amazon
There's no holy book of best practices for running large online services. We rely on what we've learned along the way, often taught to us by having things break. Failure is a great but expensive teacher, and it's usually better to learn from someone else's mistakes. I've had a long career of mistakes I've made or experienced first hand to draw on, and built a list of conceptual lessons to be learned from them. This is a non-exhaustive list of things to think about when designing and running a large scale online service, rather than a prescriptive checklist.
With over 20 years of experience in the tech industry, and job titles ranging from Technical Writer, to Systems Engineer, to Program Manager, to Lead Problem Engineer (my personal favorite), I've worn more than a few hats and been victimized by more than a few badly designed online services. 19 years at Microsoft had me working on various TV-division projects, Windows Phone, Cortana, and Bing until my new adventure with Amazon, where I help run what, depending on who you talk to, is the largest no-SQL installation in the world.
2:10 pm–2:50 pm
James Wickett, Verica.io
Doing security in an SRE context with a particular focus on the software delivery pipeline can be a bit disjointed. This talk takes a pragmatic approach to each phase of the software delivery pipeline, and you will be armed with philosophy, questions, and tooling that will get security moving at the speed of your SRE organization.
James works as a Sr. Security Engineer and Developer Advocate at Verica and is the author of several courses on DevOps and DevSecOps at LinkedIn Learning. His courses include DevOps Foundations, Infrastructure as Code, DevSecOps: Automated Security Testing, Continuous Delivery (CI/CD), Site Reliability Engineering, and more.
James is the creator and founder of the Lonestar Application Security Conference, which is the largest annual security conference in Austin, TX. He also runs DevOps Days Austin and Serverless Days Austin. He previously served on the global DevOps Days board.
In his spare time, he is trying to learn how to make a perfect BBQ brisket.
Leonid Belkind, StackPulse
The COVID-driven new WFH/all-remote model has amplified traditional challenges remote teams can face with incident response and reliability. Silos and reduced information exchange, challenges onboarding or cross training engineers, increased noise and toil – just to name a few. All of these make it harder for teams to continue to deliver reliable services, at a time when reliable software is what's keeping the world connected.
Instead of trying to translate existing roles and responsibilities, processes, and methods to the new normal – we'll share how to improve reliability by 'dis-organizing' and democratizing the SRE function – empowering the entire engineering team to own reliability and adopt SRE mindset. We'll cover goals for SREs/SWEs, training for SWEs, automating knowledge management/sharing, getting started with code-based incident response playbooks – and the role of the SRE in orchestrating it all.
Leonid Belkind is a Co-Founder and CTO at StackPulse, a Site Reliability Engineering orchestration platform. Prior to StackPulse, Leonid co-founded (and was CTO of) Luminate where he guided this enterprise-grade service from inception, to widespread Fortune 500 adoption to acquisition by Symantec. Before Luminate, Leonid managed software development organizations at CheckPoint.
Through his career, Leonid has witnessed modern Software Engineering practices come and replace the traditional ones, first around Continuous Integration and Delivery pipelines, then Infrastructure Management and Monitoring, and onwards as software services have replaced on-premise products. Throughout this journey Leonid has become passionate about building reliability-first architectures, methodologies and organizational culture.
2:55 pm–3:35 pm
Robert Barron, IBM
2020 marked the 50th anniversary of the Apollo 13 mission - what was supposed to be a "routine" third landing on the Moon turned into a dramatic odyssey as an explosion nearly destroyed the spacecraft when it was on the way to the Moon.
It was the incredible efforts of the astronauts in space and the engineers on the ground that ensured the safe return of the astronauts.
What were these efforts? While the terms "Site Reliability Engineering", "Observability", "Chaos Engineering", and others did not exist in 1970, NASA's training and preparation of the astronauts and engineers cover many of the principles and practices of modern Site Reliability Engineering.
How did NASA prepare for the mission? How did they work to solve the problems which occurred during the missions? And how did they learn and improve their systems for future missions?
Find out by attending this session.
Robert is the AIOps lead in IBM's Garage for Technical Solution Acceleration. He is an SRE and ChatOps evangelist who enjoys helping others solve problems even more than he enjoys solving them himself. Robert has over 20 years of experience in IT development & operations and is happiest when learning something new. Robert lives in Israel with his wonderful wife and two children. His hobbies include history, space exploration, and bird photography.
Pauline Narvas, Wayne Bridgman, Graeme Bye, Amreen Firdouse, Anand Bobade, Shiv Patil, BT
Implementing Site Reliability Engineering principles and values and building an SRE team at a large enterprise have proven to be quite challenging. It turns out creating an SRE team is much more complex than just copying Google or renaming your Ops team to SRE.
Instead of jumping on what Google has done, to begin our SRE journey, we identified issues that were of most importance to the Business first. For us, it was security, reducing cloud sprawl, and getting our cloud costs under control. Off the back of this, we established our own standards.
This will be a structured talk where we will share the journey of building our SRE team, the main challenges that we've faced in a large enterprise, some reflections of what we learned along the way, and advice to newly formed SRE teams.
Pauline is a Site Reliability Engineer at BT, where she is part of the newly formed team to bring the SRE values to life within the organization. She's also a Women in Tech advocate, blogger, and enjoys weight training.
Wayne is a Principal DevOps and Site Reliability Engineer consultant at BT. He has over 15 years of experience in leading strategic cloud architecture transformations and enterprise-wide DevOps and CI/CD initiatives in the Finance, Insurance, and Telecommunications sectors.
Graeme is a DevOps Engineering Manager at BT. Having worked in IT across financial and telecom sectors over 20 years, leading engineering teams across all stages of the Software development lifecycle, he now oversees the Environment governance in Digital Engineering. His latest challenge? Leading the progressive development of a new SRE team from inception.
Amreen is a Site Reliability Engineer at THBS with over 5 years of experience in Cloud, Operations, and DevOps. She began her career as a Technical Consultant (AWS and Azure) at a start-up, later joined the Production Operations Team and then Platform Services for EE/BT, handling security, monitoring, and incident management, and learning infrastructure as code. Amreen is now part of the SRE team.
Anand has 7+ years of experience in designing, implementing, and managing cloud infrastructure and containers. He has experience in virtualization, NOC engineering, orchestration, AWS cloud, and IaaS & PaaS in medial, US government, and telecom sectors. He is now a Site Reliability Engineer at BT.
Shiv has over 4.5 years of experience in managing AWS infrastructure and automation. He enjoys writing scripts in Python and finding new ways to improve the performance of systems. He is currently a Site Reliability Engineer at BT.
3:40 pm–4:00 pm
Prathyusha Charagondla, Adobe
This talk follows the tale of a young SRE navigating her way through a growing, multi-faceted field. Coupled with her gains, mistakes, and learnings, this talk covers her top 3 learnings and occasional parallels to workplace navigation. After graduation, I started as a Site Reliability Engineer. This talk will briefly cover the core values of the trending field of Site Reliability Engineering, homing in on my top 3 SRE learnings: effective incident management, adequate monitoring coverage, and favoring proactive over reactive work. In my journey across the border from academia to industry, I have learned a lot about navigating the workforce in this multifaceted field, so the talk will also draw some parallels between my SRE learnings and day-to-day adventures navigating the workplace.
Prathyusha is a Site Reliability Engineer at Adobe, engaging with core internal platforms to improve their reliability. Having worked previously with machine learning applications, she looks for creative ways to integrate it into service reliability. She is currently working towards her Masters in Information and Data Science from the University of California, Berkeley, with an emphasis in Machine Learning. In her free time, she enjoys teaching yoga, traveling, and trying new food.
Ryan Doherty, LinkedIn
Racing crapcans ($500 cars) is a lot like being an SRE. Build a reliable system, monitor it, go oncall, and complain about how everything is broken all the time. Learn what racing crapcans is like and the lessons I've learned over 8 years of mechanical, personal and team failures that apply to being an SRE. Lots of anecdotes of racing incidents, pictures of cars in various states of rapid unscheduled disassembly and occasionally photos of cars racing.
Ryan is a Staff Site Reliability Engineer at LinkedIn. His passions at work are performance tuning, usability and monitoring. In his spare time he races in the 24 Hours of Lemons racing series in the United States.
4:05 pm–4:25 pm
Alex Nauda, Nobl9
Kubernetes promised that your pets would become cattle, but cloud environments in practice behave much more like rabbits. They aren't quite a pet, but still high maintenance. They reproduce too quickly, and you know someone in your team cares about each one. How can you securely manage the proliferation of virtual infrastructure across dev, staging, pre-production, and real honest-to-goodness prod environments? How can you ensure product velocity and rapid iteration while ensuring that everything gets the attention and support that it deserves?
The key is understanding how much you and your organization care about each environment. One way to capture this in metrics is by defining and measuring service level objectives (SLOs).
Alex Nauda is CTO of Nobl9, where he helps organizations improve the reliability and performance of their cloud-native applications. He started his career in the performance management of Data Warehousing in the days of magnetic storage and backplanes. Since the days of the web, he has focused on product development in media and the public cloud.
Alex lives in Boston where he grows vegetables under LEDs and teaches juggling at a non-profit community circus school.
Fred Moyer, Zendesk
Learn how Zendesk developed formulas for implementing SLIs, SLOs, and Error Budgets at scale across a team of 1,000 engineers.
Error Budgets tell us when we should stop working on features and instead work on reliability. Because we use them to prioritize expensive resources (not to mention protect our revenue streams), we want them to be as accurate as possible. How do you empower 1,000+ engineers to solve these problems correctly in systems at scale?
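As a rough illustration of the idea (not Zendesk's formulas, which the abstract does not spell out), an error budget can be derived from an SLO target and a count of good and bad events. The function name and the event-count model below are assumptions made for this sketch.

```python
def error_budget_remaining(slo_target: float, total_events: int, bad_events: int) -> float:
    """Return the fraction of the error budget still unspent.

    Hypothetical helper for illustration only.
    slo_target   -- desired success ratio, e.g. 0.999 for "three nines"
    total_events -- events (e.g. requests) observed in the SLO window
    bad_events   -- events in the window that violated the SLI
    A negative result means the budget has been overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        raise ValueError("total_events must be positive")

    # The budget is the number of bad events the SLO tolerates in the window.
    allowed_bad = (1.0 - slo_target) * total_events
    return 1.0 - bad_events / allowed_bad


if __name__ == "__main__":
    # 99.9% SLO over one million requests, 250 of which failed.
    remaining = error_budget_remaining(0.999, 1_000_000, 250)
    print(f"error budget remaining: {remaining:.1%}")
```

Under this simple model, a value near zero or below it is the signal to shift effort from features to reliability work.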
Fred is an SRE and resident SLOgician (like statistician, not magician) at Zendesk. He previously worked with high scale telemetry at Circonus, and scaled large web systems at Turnitin. Fred developed the first Istio community adapter in 2018, and was a White Camel Award winner in 2013. He likes to daydream about SLOs and Error Budgets while riding his mountain bike.
4:30 pm–4:50 pm
Abhishek Srikanth, Facebook, Inc.
This talk covers the evolution of traffic and load balancing solutions on a team focused on providing real-time streaming infrastructure. It starts with the change in architecture from a co-located service model to a distributed one and then covers the system's evolution to a full-mesh connection based architecture. The rationale behind these migrations, the lessons learned, and the benefits reaped are to be covered by the talk. Key takeaways include the advantages of not co-locating services, rendezvous hashing (highest random weight hashing algorithm that allows clients to achieve distributed agreement) and its practical applications in solving traffic problems, benefits of a full mesh connection and the opportunities it unlocks, along with a few other lessons learned along the way.
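Rendezvous (highest random weight) hashing, mentioned above, can be sketched in a few lines. This is a minimal illustration and not the speaker's implementation: the host names, the stream identifiers, and the choice of SHA-256 as the scoring function are all assumptions.

```python
import hashlib
from typing import Sequence


def score(node: str, stream_id: str) -> int:
    """Deterministic pseudo-random weight for a (node, stream) pair."""
    digest = hashlib.sha256(f"{node}|{stream_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def pick_node(nodes: Sequence[str], stream_id: str) -> str:
    """Pick the node with the highest weight for this stream.

    Every client that knows the same node list computes the same answer
    without coordinating, which is the distributed agreement property.
    """
    if not nodes:
        raise ValueError("no nodes available")
    return max(nodes, key=lambda node: score(node, stream_id))


if __name__ == "__main__":
    hosts = ["host-a", "host-b", "host-c", "host-d"]  # hypothetical hosts
    print(pick_node(hosts, "stream-42"))
```

When a node is removed, only the streams that were assigned to that node move elsewhere; every other stream keeps its highest-scoring node, which limits churn during host failures or deployments.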
Abhishek is a Software Engineer at Facebook's Real-Time Infrastructure team. He has spent the past 3 years working on various problems in the traffic routing and load balancing space.
Nishant Roy, Pinterest
Despite Go's growing popularity as a systems language, Go programs are susceptible to severe performance regressions at large scale. In systems with high memory usage, garbage collection (GC) can cause performance regressions by cannibalizing resources from the main program. Heavy GC cycles can add hundreds of milliseconds of latency to a request, resulting in degraded user experience.
This talk will provide an overview of how Go GC works and how it may cause performance regressions in your system. It will also provide some ways to profile your system's memory usage and identify which part of the code is the culprit. Finally, we will discuss some methods through which you can optimize your system for better performance.
Nishant Roy is a software engineer at Pinterest, responsible for designing the ads-serving architecture to enable product launches, improving performance and reliability, while simplifying systems to minimize cost and maximize developer velocity.
4:50 pm–6:00 pm
Wednesday, December 9
9:00 am–10:00 am
10:00 am–10:40 am
Kyle Guichard and Venus So, Bill.com
In the early days of the pandemic, the Internet was melting as everyone adjusted to working remotely and to sudden increases in traffic. Remember the days of routine outages and meltdowns of various top brands? We saw a significant increase in customer activity as we helped small businesses manage their back-office processes from their home offices and mobile phones. During this time, we experienced issues with many of our service providers.
However, instead of melting down, we improved our production resilience, safeguarded against customer-facing incidents, and increased our availability. In this talk, we will discuss how we managed to grow our business and improve our availability by focusing on three things:
- Leading a cross-functional culture change towards extreme ownership of production
- Improving resilience in our payments platform with additional redundancy, self-healing, and retries
- Focusing on third-party management and business processes to scale
Kyle Guichard is a VP of Engineering at Bill.com leading the Data Platform and Technology Operations team in the Bay Area. With over 15 years of experience administering mission critical systems from payroll to managing back-office payments, he is passionate about leveraging the cloud and maximizing resilience in systems.
Venus is Sr. Director of Engineering at Bill.com leading Digital Payments and Risk teams. She has over 10 years of experience leading mission critical engineering teams from small startups to Fortune 500 companies. At Bill.com, her team went through explosive growth—operational with 130+ countries with $96B+ money moved as the start-up scaled to a successful public company.
Rob Hirschfeld, RackN
What is PXE anyway and why does it work? System bootstrapping is one of the great mysteries in IT. In this talk, we'll break the entire bootstrapping process down into components. That will allow us to discuss ways to make it faster, simpler, and more reliable. Most importantly, we'll show how you can improve the automation options involved in day two operations for anything that boots.
Rob has been creating software to automate infrastructure for over 20 years. His latest startup, RackN, focuses on providing IaC infrastructure and provisioning for distributed Cloud, Edge, and Enterprise data centers. He has been involved in numerous open source efforts including LF Edge (Board), CNCF (SIG Chair), and OpenStack (Board). He is also building a forward-looking operator community at http://the2030.cloud. Rob received degrees from Duke University and Louisiana State University.
10:45 am–11:25 am
Raj Shekhar and Mehmet Can Kurt, Quantcast
This talk is about how we migrated from an in-house legacy datastore that handles 1.5 million lookup requests (per second) to a more reliable, flexible, and cheaper system.
We will talk about how to achieve three major goals for the migration process:
- achieve performance at least similar to, or better than, the legacy system,
- ensure the quality and correctness of the data served by the new system, and
- do the first two without any downtime in production and without affecting the company's main revenue-generating product.
This talk is aimed at software engineers and site reliability engineers who are thinking of replacing a critical part of their distributed system.
Raj is a Staff System Engineer at Quantcast, working on maintaining the uptime and reliability of the servers tracking Quantcast pixels across the web. He enjoys poking sleeping dragons in large scale distributed systems. When not working, you can find him planning his next getaway, usually to a place accessible by motorcycle.
Andreas Grabner, Dynatrace
In 2019, I helped 100+ teams analyze performance and architectural issues in their distributed applications. In this session, I present the common patterns such as N+1 Call & Query, Payload Flood, Too Granular, Tight Coupling, Bad Timeouts/Retries/Backoff, and Inefficient Dependencies. I teach you how you can detect these patterns yourself fully automated in your CI/CD pipeline using the CNCF project Keptn and its SLI/SLO based Quality Gate approach.
Andreas Grabner (@grabnerandi) has 20+ years of experience as a software developer, tester, and architect and is an advocate for high-performing cloud scale applications. He is a regular contributor to the DevOps community, a frequent speaker at technology conferences, and regularly publishes articles on blog.dynatrace.com. In his spare time, you can most likely find him on one of the salsa dance floors of the world!
11:30 am–12:10 pm
Boyan Krosnov, StorPool Storage
Many companies build new-age KVM clouds, only to find out that their applications & workloads do not perform well. In this talk, I'll show the audience how to get the most out of their KVM cloud and how to optimize it for performance: People will understand why performance matters and how to measure it properly. I'll show how to optimize CPU and memory for ultimate performance and how to tune the storage layer for performance. I'll outline the main components of an efficient new-age cloud and which network components work best. In addition, I'll show how to select the right hardware to achieve unmatched performance for new-age cloud and applications.
Boyan Krosnov is a Co-Founder and Chief Product Officer of StorPool Storage, an SDS company specializing in high performance, high-reliability storage systems for public and private clouds. He is in charge of cloud infrastructure in the KVM ecosystem, helping small and large cloud operators build and maintain all aspects of their public and private clouds. In previous lives, Boyan was a co-founder of a packet processing (SDN) company and built several service providers (ISPs, FTTH) from scratch.
John Blaho, Catchpoint
Creating the perfect cocktail is a subjective effort. You may select the finest ingredients, using high performing equipment with a highly trained hand, but judgment falls on the consumer. It is always about the end-user experience. SREs are some of the most skilled professionals of their craft and need knowledge of end-user expectations to assure a good experience.
In this session, we will measure out the ingredients and prepare the perfect network cocktail with use cases and monitoring examples:
- Lay out the basics: Measure SLOs to ensure ongoing vendor obligations are met.
- Alcohol base: Monitor DNS & BGP for consistent reachability.
- Fresh ingredients: Assure CDN performance doesn't affect the service.
- Mix and Serve: Take the outside-in approach to optimize the customer experience.
With more than 15 years of product marketing experience, John Paul (JP) Blaho has focused on understanding the buyer journey and the unique personas that influence IT network purchase decisions. Mr. Blaho has worked for leading IT organizations such as NETSCOUT, Sungard Availability Services, Symantec, and IBM Security. JP received his BS degree from Bethany College in Bethany, West Virginia, and received his MBA from Northeastern University's D'Amore-McKim School of Business in Boston, Massachusetts.
12:15 pm–12:55 pm
Markus A. Kuppe, Microsoft
TLA+ is a language for the specification and verification of discrete systems, including concurrent and distributed algorithms. The behavior of systems is described in the form of state machines, written in a language based on mathematical set theory and temporal logic. The same language also serves to express safety and liveness properties. TLA+ is supported by tools for computer-assisted verification, including model checkers for verifying finite instances and an interactive proof system. In summary, TLA+ is a design tool to build reliable systems that don't need patching at 3 am on weekends.
In this talk, we will take the hands-on approach and study the powers of TLA+ by collectively solving a subtle concurrency issue that (as we will pretend) has brought our system down even though the system was heavily tested.
Markus Kuppe is a Principal Research Software Development Engineer at Microsoft Research in Redmond, WA. As an engineer, his focus is on making spec-driven development (with TLA+) more popular among fellow engineers: This includes scaling verification to larger, real-world problems, and building tools to combine spec-driven development with established software engineering processes.
1:25 pm–2:05 pm
Danny Chen, Bloomberg LP
Page reference sampling provides an important view into the intersection of an operating system's memory management subsystem and program behavior. Historically, page reference sampling has involved modifications to kernel code. Solaris provides a little known/advertised interface for doing page reference sampling without kernel changes.
We will show how to use this interface to get data that can be useful for capacity planning and for tweaking some of the memory management knobs that the system provides. We'll also compare this with what we can get on Linux systems.
Danny Chen started his career 40 years ago as a UNIX performance engineer at Bell Laboratories where he was a co-developer of one of the first general purpose UNIX kernel tracing facilities (USENIX/1988: CASPER the Friendly Daemon). He also contributed performance improvements to the SVR4 virtual memory implementation (USENIX/1990: "Insuring Improved VM Performance - Some No-Fault Policies"). He has worked on low latency market data systems, messaging systems, distributed transaction management, capacity planning, and enterprise systems monitoring.
Mark Hahn and Thomas Hahn, TCB Technologies, Inc.
Security is important in almost all applications and TLS is used to secure communications between components and to end users or outside APIs. One difficulty with TLS can be managing certificates.
Mark Hahn is Director of Cloud Strategies and DevOps for Ciber Global, a consulting firm. In this role he is responsible for all things related to software velocity.
Ted Hahn is an SRE for hire working on planet-scale distributed systems.
2:10 pm–3:45 pm
Closing Plenary Session
Moderator: Nora Jones, Jeli.io
Panelists: Fred Hebert, Postmates; Lorin Hochstein, Netflix; Vanessa Huerta Granda, Enova
We are all in the middle of a worldwide incident in which we are all incident responders in some way. The large-scale disruption that Covid-19 has introduced to day-to-day functioning of society has profound implications for the software systems that are a part of daily life. And the experts tasked with keeping those systems running are no exception.
In this panel, we will talk through the 5 key themes that emerged from a series of two facilitated video calls with engineers and managers from several different organizations, along with expert suggestions on how to adapt to some of these problems, all backed up by research. We will interview 3 participants from organizations that had to unexpectedly scale and adjust during this time and how they managed this.
Nora is the Founder and CEO of Jeli.io. She is a dedicated and driven technology leader and software engineer with a passion for the intersection between how people and software work in practice in distributed systems. In November 2017 she keynoted at AWS re:Invent to share her experiences helping organizations large and small reach crucial availability with an audience of ~40,000 people, helping kick off the Chaos Engineering movement we see today. She created and founded the www.learningfromincidents.io movement to develop and open-source cross-organization learnings and analysis from reliability incidents across various organizations, and the business impacts of doing so.
Fred Hebert is the author of three programming books about Erlang and Elixir, and using these languages for testing and production work. He co-founded and is a board member at the Erlang Ecosystem Foundation, and is currently a staff engineer at Postmates, with a focus on architecture reviews, learning from incidents, and poking at various things. He was previously Systems Architect, principal member of technical staff on cloud platforms, worked in real-time bidding, and provided commercial programming training.
Lorin Hochstein is a Sr. Software Engineer on the Managed Delivery Team at Netflix. He was previously Sr. Software Engineer at SendGrid Labs, Lead Architect for Cloud Services at Nimbis Services, Computer Scientist at the University of Southern California’s Information Sciences Institute, and Assistant Professor in the Department of Computer Science and Engineering at the University of Nebraska–Lincoln.
Vanessa Huerta Granda (she/her) manages Resilience Engineering at Enova where she focuses on their Production Incident process, learning from incidents, and leading the on-call rotation of Incident Commanders. She has led the Chicago Women in Technology Conference and started the Hispanic or Latinx Alliance @ Enova. She is passionate about continuous improvement, getting teams to talk to each other, and Diversity and Inclusion in Tech.
Program Co-Chairs: Nora Jones, Jeli.io, and Mike Rembetsy, Bloomberg
What do Kubernetes, React, a compiler, and even the OS kernel all have in common?
They're all a form of abstraction! You might be asking yourself: is that a bad thing? If so, why?
Our systems are too complex and we keep building layers of abstractions in a vain effort to simplify these same systems. We start out with a simple problem and somehow end up with a solution that requires more investment and maintenance than the original problem itself.
Why do we do this? You can argue that our industry incentivizes this because you can use it to get that new job, get a raise, or gain that 31337 clout. You can argue that we do this to ourselves by conflating simplicity with lack of intelligence (or experience). Either way, we're paying a cost for all this abstraction:
- The abstraction has taken on a life of its own and takes you with it
- We lose proficiency with the foundation
- We lose sight of the original problem
- We forget about the people this is supposed to help
- We keep using and modifying the abstraction even if it is no longer appropriate
Not all abstractions are bad; we just don't know where to stop. So what do we do about it? We can't reasonably expect never to use an abstraction again. In my session, I'd like to share my experience with conquering the never-ending chase down the abstraction rabbit hole and to hear your stories and solutions too.
André Henry is a systems engineer at Venmo and a lifelong hacker. If you ever wondered who would buy a supercomputer off eBay, you've thought about him. André lives at the intersection of lasers, cats, and tech. You can find him at a bookstore or conference home, with his cat Zuko, always learning and sharing.