Jose Leitao and David Rothera, Facebook

Want to learn how Facebook operates its global network to support more than 1.3 billion users? We will describe the technologies and methods we use to manage Facebook's production network. The neteng org at Facebook has built and leverages several systems for managing and operating the production network, including an audit framework, alarm daemons, drainers, and an automatic remediation engine. This talk will focus on these technologies and how they have helped improve user experience, manage complexity, automate day-to-day operations, mitigate impact, and increase reliability.
Jose Leitao and David Rothera are production netengs in the Network Infrastructure Engineering team at Facebook. Their team responsibilities include maintaining, monitoring, and improving the global production network infrastructure.
Containers have been at the core of how many web-scale companies build their distributed systems for years. More recently containers have become an increasingly popular tool for developers everywhere. As containers are moved from a developer's workstation to production, they are generally managed by cluster management systems like Kubernetes or Omega.
These cluster management systems offer significant opportunities to improve operations across the data center. First, by separating concerns along a variety of axes (developers from managing machines, applications from the OS on which they run, etc.), container cluster management systems enable ops specialization. Different operations teams can focus on kernel and machine maintenance, cluster operations, and application operations, and they can do these jobs in relative isolation. This focus and isolation means that the teams are significantly more productive, and less likely to make mistakes due to inexperience or interactions that they don't fully understand.
Additionally, the homogenization of the container cluster, and the presence of the cluster management system, make management tools (rollouts, monitoring, health maintenance) a property of the cluster environment, not of each individual application. This means that these tools are deployed once for an entire cluster, and they are designed and deployed by experts who are, again, specialized to their specific task.
Finally, container clusters and cluster management APIs enable the switch to immutable infrastructure and the development of patterns that make it possible to do operations without manually manipulating individual machines.
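To make this concrete, here is a minimal sketch, assuming the official Kubernetes Python client, of what "operations through the cluster API" can look like; the deployment name, namespace, and image are hypothetical, and the talk itself is not tied to this particular client:

```python
# Minimal sketch, assuming the official Kubernetes Python client
# (pip install kubernetes); deployment name, namespace, and image
# are hypothetical.
from kubernetes import client, config

config.load_kube_config()  # credentials from ~/.kube/config
apps = client.AppsV1Api()

# Declare a new desired state (an immutable image tag); the cluster
# manager rolls containers forward without anyone touching a machine.
patch = {"spec": {"template": {"spec": {"containers": [
    {"name": "web", "image": "registry.example.com/web:v42"}]}}}}

apps.patch_namespaced_deployment(name="web", namespace="prod", body=patch)
```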
Brendan Burns is a Staff Software Engineer at Google, Inc and a founder of the Kubernetes project. He works in the Google Cloud Platform, leading engineering efforts to make the Google Cloud Platform the best place to run containers. He also has managed several other cloud teams, including the Managed VMs team and Cloud DNS. Prior to Cloud, he was a lead engineer in Google's web search infrastructure, building backends that powered social and personal search. Prior to working at Google, he was a professor at Union College in Schenectady, NY. He received a PhD in Computer Science from the University of Massachusetts Amherst, and a BA in Computer Science and Studio Art from Williams College.
Facebook is one of the largest sites in the world, with multiple datacenters (and POPs on multiple continents) hosting a very large number of machines. This talk is about the evolution of the DHCP production infrastructure at Facebook.
In this talk we will use the DHCP case as an example to discuss why it's good to design your systems to be stateless, and the fine line between leveraging OSS projects where possible and taking a “Not Invented Here” approach instead. We will also talk about the challenges of driving large-scope projects from remote offices and the importance of possessing skills in both the systems and software development fields.
We'll look at DHCP at Facebook in both the IPv4 and IPv6 worlds, dive into the old architecture and its limitations, and then talk about how the Cluster Operations team in Dublin leveraged the ISC Kea open source project to migrate from a stateful service to a stateless one, discussing the challenges faced in the process and the benefits we gained.
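To illustrate the stateless idea in miniature (this is a toy sketch, not Facebook's or Kea's actual code), a responder can compute every answer from inventory data rather than from a shared lease database:

```python
# Illustrative sketch only: a "stateless" DHCP responder derives each
# answer from inventory data, so any server instance can answer any
# request and no instances share lease state. All entries are invented.
INVENTORY = {
    # MAC address -> (IPv4 address, hostname), sourced from a config/CMDB.
    "00:25:90:aa:bb:cc": ("10.1.2.10", "web001.example.com"),
    "00:25:90:dd:ee:ff": ("10.1.2.11", "web002.example.com"),
}

def handle_discover(mac: str):
    """Answer a DHCPDISCOVER purely from static inventory data."""
    entry = INVENTORY.get(mac)
    if entry is None:
        return None  # unknown host: stay silent rather than invent a lease
    ip, hostname = entry
    return {"yiaddr": ip, "hostname": hostname, "lease_time": 3600}
```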
Angelo is a Production Engineer at Facebook. He joined the company in early 2011 as a Site Reliability Engineer and recently moved to the Cluster Operations Team. In this period he has contributed to various projects, like our cluster turnup tool Kobold and F.B.A.R. (the Facebook Auto Remediation tool). More recently he has been involved in revamping the DHCP architecture for the Facebook production network, which he will discuss in this talk. He is interested in automation tools and large-scale distributed systems.
Matt Provost, Avere Systems

Over the 2014 Christmas holiday there was major engineering work scheduled on some of London's main train lines into King's Cross Station. This overran the outage window by a day and significantly disrupted the travel plans of tens of thousands of passengers. In January, Network Rail published a detailed report on the causes of the overrun.
What can SREs learn from physical infrastructure maintenance and outage procedures? Even in this new era of cloud computing, someone still has to build and maintain the infrastructure that keeps the cloud working. Even with redundant systems and data centres, there are still going to be times when the underlying infrastructure needs an outage, for preventative maintenance to replace older equipment or for planned work on electrical or HVAC systems.
What lessons can be learned from the Christmas King's Cross outage that we can apply to data centre infrastructure? Even with cloud providers this is still relevant: Verizon Cloud had a 40-hour scheduled outage in January. Although there were underlying technical problems during the railway maintenance which added up and caused the delays, the larger problems were around planning, staff rotation, and communication. Teams in the field were focused on solving technical problems as they came up, and not escalating these problems to managers who could see the bigger picture and communicate with other teams to form an overall strategy and make go/no-go decisions based on accurate information. This situation can be even worse in a data centre where time estimation is notoriously unreliable.
In this talk, I will break down and analyse the King's Cross report and relate each finding back to the data centre environment.
Matt Provost started as a systems administrator in 1998. Before moving to London in 2014 he was the Systems Manager at Weta Digital in Wellington, New Zealand, where he oversaw the commissioning of a new water cooled data centre which hosted the render infrastructure for Avatar, Rise of the Planet of the Apes, Tintin, and the Hobbit films. During the production of Avatar, this data centre hosted 7 systems in the Top 500 Supercomputer list. Matt has presented at LISA conferences about storage performance management, monitoring, and complex systems failure analysis.
(session starts at 09:30)
Cloud-scale automation requires significant changes in the operation of a data center. Depending on the usage pattern of the cloud, either cloud provider or cloud consumer, the impact to your data center differs. In the industry, the technologies covered under the term "Software Defined Data Center (SDDC)" are supposed to address these changes. But are Software Defined Network and Software Defined Storage really enough? I will show that there is more required to achieve a true "Software Defined Environment (SDE)." This talk will describe an architectural view on all aspects of SDE and how it addresses the requirements triggered by cloud environments.
After achieving a degree in Computer Science from RWTH Aachen, Astrid joined IBM Research and Development as a developer. She went through a career as software and firmware developer for System z and Power Systems, reaching the level of an IT architect. After that Astrid changed direction into technical sales as an IT architect for IBM systems as well as for IBM cloud solutions. She consulted clients in Europe, the U.S. and Asia on their transition into the cloud. The clients either wanted to expand their business and become a cloud provider or wanted more efficiency in their data centers by consuming cloud offerings. This gave Astrid a good background to understand the benefits of a Software Defined Environment. In 2014 she joined a team that developed a client adoption model for SDE.
For years, the conversation about using TLS was little more than an argument over whether proper key management was worth the effort. But today’s news reports are riddled with stories about data theft and Internet espionage, and secure content delivery is the new normal. Now SSL is dead, new CVEs show up fast and furious, and our reaction time to the latest bug reports is measured not only in hours but in customers at risk.
In a world where we expect forward secrecy and elliptic curves to save us, we have to realize that it’s never as easy as flipping a few switches. We have to balance the performance and cost implications of different grades of security while keeping an eye on both compatibility and the latest threats.
This talk will discuss what forward secrecy is and how it’s achieved. It will also describe the mechanics of Diffie-Hellman exchanges and how we measure the “cost” of enabling them on different platforms, as well as the benefits of ECC. We will discuss how we validate and benchmark different points of encryption termination (notably appliance ADCs), and specifically describe how we used our methodology to accomplish the HTTP-to-HTTPS migration of our webmail platform and how we overcame the problems we ran into along the way.
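As a taste of those mechanics, the sketch below performs an ephemeral elliptic-curve Diffie-Hellman exchange using the pyca/cryptography library; real TLS stacks run this inside the handshake, and the HKDF parameters here are purely illustrative:

```python
# A minimal sketch of the ephemeral ECDH exchange that underpins forward
# secrecy, using pyca/cryptography (recent versions). Both key pairs are
# throwaway, which is the whole point.
from cryptography.hazmat.primitives.asymmetric import ec
from cryptography.hazmat.primitives.kdf.hkdf import HKDF
from cryptography.hazmat.primitives import hashes

# Each side generates a fresh (ephemeral) key pair per session...
server_priv = ec.generate_private_key(ec.SECP256R1())
client_priv = ec.generate_private_key(ec.SECP256R1())

# ...exchanges only the public halves, and derives the same shared secret.
shared = client_priv.exchange(ec.ECDH(), server_priv.public_key())

# Derive a session key, then discard the private keys: a later compromise
# of the server's long-term key cannot decrypt this session's traffic.
session_key = HKDF(algorithm=hashes.SHA256(), length=32, salt=None,
                   info=b"handshake").derive(shared)
```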
Chris Niemira is an AOL veteran who spent over seven years running the public gateways for the AOL Mail system, one of the world's largest email platforms. Today, he works as a reliability engineer, writing tools and running analyses to help to ensure the performance and availability of many of AOL's high traffic properties across the Internet. He previously spent time building solutions and running web properties in the Banking and Pharmaceutical industries (as well as some dot-coms we won't talk about), and also is currently pursuing an MBA and Master of Finance.
SRE tooling is often limited to command line interfaces, making it inaccessible to a "non-SRE" audience. The barrier to entry for building usable UIs for SRE tooling often seems insurmountable and plagued with horror stories, compounded by the natural tendency of SRE engineers not to have the training needed for UI design. This talk will present a framework and approach for integrating and building UIs into the SRE toolbox. Topics covered will include general architectural approaches to library design for SRE tooling in order to facilitate UI buildout, an overview of some basic UI pitfalls and best practices, an approach to crafting your own unique toolset with minimal development investment, and ongoing maintenance.
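One plausible shape for that architecture, sketched here with Flask and a hypothetical drain_host helper, is to keep the operational logic in an importable library so the CLI and a thin web UI share a single code path:

```python
# Illustrative sketch only: business logic lives in an importable
# function, and both the CLI and a thin Flask UI are adapters over it.
# drain_host is a hypothetical placeholder, not a real API.
from flask import Flask, jsonify

def drain_host(hostname: str) -> dict:
    """Shared operational logic, callable from CLI, web UI, and tests."""
    # ... call the real drain workflow here ...
    return {"host": hostname, "status": "drained"}

app = Flask(__name__)

@app.route("/drain/<hostname>", methods=["POST"])
def drain_endpoint(hostname):
    # The web handler is a thin adapter over the shared library call.
    return jsonify(drain_host(hostname))

if __name__ == "__main__":
    # The CLI is an equally thin adapter over the same function.
    import argparse
    parser = argparse.ArgumentParser(description="Drain a host")
    parser.add_argument("hostname")
    print(drain_host(parser.parse_args().hostname))
```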
Michael Avrukin is a Site Reliability Engineering Manager at Google. Prior to Google Michael worked at numerous startups, helping to scale and build out teams across multiple countries and continents working with AWS, AliYun, GCE, and other cloud platforms to deliver enterprise-level QoS solutions and services. Michael spent his early career doing UI development on Windows, OS X, and later with early JavaScript before transitioning to back-end and infrastructure engineering where he brought his passion for visual eye candy.
The Unconference area is a reserved room for attendees to propose their own talks. This may also be useful for follow-up discussion on topics of interest.
Laura Nolan and Diego Elio Pettenò, Google

This is a half-day version of the SRE Classroom event that has been run at other USENIX conferences in 2013 and 2014. It won't include much in the way of formal presentations, as previous SRE Classroom events have done. We expect this event to appeal to senior and experienced engineers.
The workshop problem will be a real-world large-scale engineering problem that requires some distributed-systems knowhow to tackle. Attendees will work in groups with Googlers on the problem, trying to come up with:
- A high-level design that scales horizontally
- Initial SLIs, SLOs
- Estimates for hardware required to run each component
- If time permits, monitoring design and disaster testing scenarios
Attendees will be provided with a workbook that guides them through tackling a problem like this, with some worked examples, so junior attendees should be able to make progress and learn too.
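As a flavour of the workbook's worked examples, here is a back-of-envelope hardware estimate of the kind attendees will produce; every number below is an invented assumption:

```python
# A worked back-of-envelope hardware estimate; all figures are
# illustrative assumptions, not from the actual workshop problem.
peak_qps = 500_000       # assumed peak load for the design problem
qps_per_core = 1_000     # assumed single-core throughput of one task
replication = 3          # copies of each shard for redundancy
headroom = 1.4           # 40% spare capacity for failures and spikes

cores = peak_qps / qps_per_core * replication * headroom  # 2,100 cores
machines = cores / 32                                     # ~66 machines
print(f"~{cores:.0f} cores, ~{machines:.0f} 32-core machines")
```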
Laura Nolan has been a Site Reliability Engineer at Google for the past two years, and has worked with multiple large systems that use distributed consensus to achieve reliable, stable multihoming that really works without human babysitting. Prior to that she kept the e-commerce site gilt.com fast and stable during flash sales and worked for the Irish software company Curam Software, which is now part of IBM Smarter Cities. She is a keen traveller and scuba diver and an international-level weightlifting referee.
Melita Mihaljevic, Facebook

“There are only two hard things in Computer Science: cache invalidation and naming things.”—Phil Karlton
Facebook serves 1.3 billion users across multiple regions. To make sure that all users have a consistent experience on the site, we built a cache invalidation pipeline. This talk will cover the cache invalidation pipeline for both of our caching solutions, Memcache and TAO, across multiple regions. The talk will also touch on how we monitor and debug cache consistency problems.
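As a simplified illustration of the general pattern (not Facebook's actual pipeline), an invalidation service can tail the database replication log and delete affected keys from every region's cache tier; the event shape and endpoints below are invented:

```python
# Illustrative sketch only, not Facebook's pipeline: tail the database
# replication log and delete affected keys from each region's cache,
# so stale copies disappear after every write. Endpoints are invented.
import memcache  # python-memcached, standing in for a real regional client

REGIONAL_CACHES = [memcache.Client(["cache.us-east.example.com:11211"]),
                   memcache.Client(["cache.eu-west.example.com:11211"])]

def on_replication_event(event):
    """Called for each row change pulled from the DB replication log."""
    key = f"{event['table']}:{event['primary_key']}"
    for cache in REGIONAL_CACHES:
        # Delete rather than update: the next reader repopulates from
        # the DB, avoiding races that put stale values back in cache.
        cache.delete(key)
```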
Melita Mihaljevic is a Production Engineer on the Global Consistency team at Facebook. The team is the first responder for keeping the cache consistent and for the health of Facebook's proprietary real-time data streaming infrastructure. They ensure that users have a consistent experience on the site across all their devices. Melita is actively developing and maintaining one of the cache invalidation pipeline services.
Andre Masella, Ontario Institute for Cancer Research

Creating configuration files has always been pushed into the domain of “not programming,” but configuration files have a way of growing more complex. There is a struggle between keeping a configuration terse, by having the system infer information automatically, and explicit, without having duplication. Either the configuration file develops embedded domain-specific programming languages (e.g., Apache, Asterisk, Postfix) or a text-based macro language is put in front (e.g., the M4-based wrapper for sendmail, automake as an M4-based wrapper around Make). The middle path is to put a structured macro language in front of a configuration: a language with the smarts to semantically verify the configuration (unlike a text-based macro language) and that has well-defined, observable semantics outside the binaries being configured (unlike an in-configuration DSL).
There are three languages currently working toward this goal: NixOS, Jsonnet, and Flabbergast. Both Jsonnet and Flabbergast descend from Google's proprietary configuration language[4] with the intention of changing design decisions that made this language difficult to use.
In particular, one of the common myths is that the complexity of a configuration is tied to the structure of the configuration format; that is, there is a tacit assumption that INI configurations are semantically simpler than JSON ones, which is demonstrably false. By understanding how semantically rich configurations really are, writing configurations can be elevated to a real discipline, related to but distinct from general-purpose programming, that focuses on making configurations easier to understand, easier to write, and more sophisticated.
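A toy Python analogue, standing in for none of the three languages above, shows the key property: a structured template is merged and semantically validated before any output is produced, which a text macro cannot guarantee:

```python
# Toy sketch of "structured macro over a configuration": overrides are
# merged into a base template and checked semantically, on structure
# rather than on flattened text. All keys and values are invented.
BASE = {"workers": 4, "port": 8080, "tls": False}

def make_config(overrides: dict) -> dict:
    cfg = {**BASE, **overrides}
    # Semantic verification happens before anything is emitted.
    if not (0 < cfg["port"] < 65536):
        raise ValueError(f"invalid port: {cfg['port']}")
    if cfg["tls"] and "certificate" not in cfg:
        raise ValueError("tls enabled but no certificate given")
    return cfg

prod = make_config({"workers": 32, "tls": True,
                    "certificate": "/etc/ssl/frontend.pem"})
```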
Andre Masella previously worked at Google as an SRE supporting the AdSense serving stack. While there, he spent most of his time refactoring the configuration files of the serving stack in Google's much-despised proprietary configuration language and developing idioms to manage the complexity. After leaving, he set out to create a configuration language with the same expressive power, but simpler to understand and write effectively, yielding Flabbergast.
Florian Pfeiffer, Gutefrage.net GmbH

"Treat your data center as a single machine." This idea has been getting more and more traction over the last few years. Among other projects, Apache Mesos provides a solution to easily create this one big single pool of your resources. There are companies running it on tens of thousands of machines, but for an architecture that relies on easy scalability and good resilience, it completely makes sense to run it on a small cluster as well. One framework that runs on top of Mesos is the scheduler Aurora, which takes care of how many instances of a job should run on which machines, and reschedules running jobs if machines in your cluster die.
After introducing these projects, I will show you how and why Gutefrage.net has glued these technologies together with Jenkins to implement a continuous deployment workflow that handles our 100+ daily deployments on a relatively small Mesos cluster, with the goal of providing a fault-tolerant, low-latency user experience to our customers.
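For readers unfamiliar with Aurora, its job definitions are themselves written in a Python-based DSL; the following is a simplified, illustrative sketch modelled on the shape of Aurora's tutorial examples, with all names invented:

```python
# Illustrative sketch only: Aurora job configurations are a Python-based
# DSL (.aurora files). Process, Task, Resources, Service, and MB are
# injected by the aurora client when it evaluates the file; all names
# and numbers here are invented.
server = Process(
    name='frontend',
    cmdline='./frontend --port={{thermos.ports[http]}}')

task = Task(
    processes=[server],
    resources=Resources(cpu=1.0, ram=256 * MB, disk=512 * MB))

jobs = [Service(
    cluster='mesos-prod',
    role='www',
    environment='prod',
    name='frontend',
    instances=4,   # Aurora keeps 4 instances running and reschedules
    task=task)]    # them onto healthy machines when one dies
```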
Florian is Head of Data and Infrastructure at gutefrage.net. Before that he learned the ropes at Yahoo! Together with his team he accepts the challenges that running Germany's biggest Q&A site brings. As an agile company, this not only means the usual scaling and high availability topics, but also multiple daily releases and branchless development with feature switches.
In February 2014 he introduced Mesos as the basic building block for the next-generation platform for gutefrage.net.
Dave O'Connor, Google Dublin

Lots of thought is given to how to organise oncall rotations — around people's schedules, around periods of critical coverage, and with fairness in mind. Less thought is given to the human aspect of oncall — how it affects people's ability to get other work done, their general cognitive flow state, and burnout rates. This talk will present a paper used internally in several SRE teams at Google to organise rotations around people, bearing in mind that people are not machines.
Dave O'Connor is a Senior Site Reliability Manager at Google. He has been at Google for almost 11 years, 9 of which were spent oncall, and organising oncall rotations. He has spent time on several teams in Google SRE, and currently manages the teams that run Google's storage in their Dublin, Ireland office. His specialty is being spectacularly grumpy at being interrupted, both for himself and on people's behalf; he has spent many years building teams that handle heavy interrupt load without everyone hating their lives.
Taking some lessons from Meryl Streep as Anna Wintour, and the ancient Greek philosopher Heraclitus, we look at the role of SRE and devops within organisations. What is it that we do, and why? What are some organisational factors that contribute to the success of an SRE/devops group, and its failure? What incentives need to be in place, and what happens if they are missing?
Niall Murphy is currently head of the Ads Reliability Engineering function in Google Dublin, ostensibly in charge of the infrastructure supporting $18bn/quarter. Prior to Google he was in Amazon.com Network Engineering, and a variety of self- or other-founded startups and Irish Internet institutions. He is the author of a book, a lecture course, and numerous articles, and is probably one of the few people in the world with a degree in Computer Science & Mathematics, and Poetry Studies.
Time Series metrics can be an important part of a comprehensive monitoring solution. Betfair will present a talk on their experiences running OpenTSDB and a new open source tool called OpenTSP, designed to streamline the process of gathering and delivering system metrics quickly and reliably to multiple endpoints so that you can use any of your favourite tools to analyse the stream.
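For context, OpenTSDB ingests metrics over a plain-text "put" protocol; the sketch below sends a single data point directly, the kind of stream a collector pipeline such as OpenTSP is designed to gather and fan out (hostnames are invented):

```python
# Illustrative sketch: feed OpenTSDB one data point over its plain-text
# "put" protocol (default port 4242). A collector such as OpenTSP would
# normally sit in front, fanning the same stream out to many endpoints.
import socket
import time

def send_metric(metric: str, value: float, tags: dict,
                host: str = "opentsdb.example.com", port: int = 4242):
    # Wire format: put <metric> <timestamp> <value> <tagk=tagv> ...
    line = "put %s %d %s %s\n" % (
        metric, int(time.time()), value,
        " ".join(f"{k}={v}" for k, v in tags.items()))
    with socket.create_connection((host, port)) as sock:
        sock.sendall(line.encode())

send_metric("sys.cpu.user", 42.5, {"host": "web01", "core": "0"})
```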
James Brooks is a Senior Engineer with Betfair's site reliability engineering team. Having spent time at Sun Microsystems and Interoute Communications, James is keenly aware of the power of well-implemented monitoring to be a driver for continuous improvement.
Vilian Atmadzhov is the youngest member of Betfair's SRE team. Fresh from university, he has taken on a number of projects at Betfair, including the design of a config generator for Riemann, to allow our teams to rapidly deploy alerting and dashboarding functionality.
The programme committee will select a number of 5-minute lightning talks from those proposed by attendees, on topics that interest them. This is usually used to spur more discussion in the breakout areas.
(session starts at 12:30)
Many organizations want to learn from failures. Postmortem debriefings and documents are a part of that learning process. However, leading a debriefing is not straightforward and done haphazardly can bring cultural and technical damage to an organization.
This 3-hour session will cover the theory and fundamentals of the “New View” on complex systems failure and “human error”, as well as techniques for facilitating an adverse event debriefing. Attendees should walk away with a more evolved sense of accident/outage investigation and a model to explore in their own organizations.
What will be covered:
- Foundations and limitations of generating post-hoc narratives
- Fundamentals of the New View: accountability, responsibility, risk, and "safety"
- Debriefing techniques to facilitate dialogue with diverse perspectives and potential cognitive biases
- Plotting your exploration of dynamic fault management: the phases of anomaly response, communication, and diagnosis
- Interviewing tips and tricks: handling defensiveness and setting the stage for a productive and blame-free environment
- We will use case studies of known accidents and outages to discuss concepts
- How to think about the scope, purpose, and implementation of recommendations and remediation items
The session is intended to be very interactive, and sections will require back and forth with the attendees on the various topics.
Background reading for attendees can be found here: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.411.4985&rep=rep1&type=pdf
(continued from 09:00 session. Prometheus workshop starts at 12:30)
Laura Nolan and Diego Elio Pettenò, Google

This is a half-day version of the SRE Classroom event that has been run at other USENIX conferences in 2013 and 2014. It won't include much in the way of formal presentations, as previous SRE Classroom events have done. We expect this event to appeal to senior and experienced engineers.
The workshop problem will be a real-world large-scale engineering problem that requires some distributed-systems knowhow to tackle. Attendees will work in groups with Googlers on the problem, trying to come up with:
- A high-level design that scales horizontally
- Initial SLIs, SLOs
- Estimates for hardware required to run each component
- If time permits, monitoring design and disaster testing scenarios
Attendees will be provided with a workbook that guides them through tackling a problem like this, with some worked examples, so junior attendees should be able to make progress and learn too.
Laura Nolan has been a Site Reliability Engineer at Google for the past two years, and has worked with multiple large systems that use distributed consensus to achieve reliable, stable multihoming that really works without human babysitting. Prior to that she kept the e-commerce site gilt.com fast and stable during flash sales and worked for the Irish software company Curam Software, which is now part of IBM Smarter Cities. She is a keen traveller and scuba diver and an international-level weightlifting referee.
Julius Volz and Björn Rabenstein, SoundCloud Ltd.

Prometheus is a popular open-source monitoring system and time series database written in Go. It features a multi-dimensional data model, a flexible query language, and integrates aspects all the way from client-side instrumentation to alerting.
In an introductory talk, we will explain the architecture of Prometheus and the motivation behind it. Taking the instrumentation and monitoring of services at SoundCloud as an example, we will demonstrate how Prometheus helps us stay on top of a growing microservice architecture and detect and investigate outages.
In this workshop, participants will set up all critical components of the Prometheus ecosystem to monitor some toy services.
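As a flavour of the client-side instrumentation the workshop touches, here is a minimal example using the official Python client, prometheus_client; the metric names are invented:

```python
# Minimal Prometheus instrumentation sketch using the official Python
# client (pip install prometheus_client). The Prometheus server scrapes
# the /metrics endpoint this process exposes. Metric names are invented.
import random
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests", ["status"])
LATENCY = Histogram("app_request_latency_seconds", "Request latency")

@LATENCY.time()  # records each call's duration into the histogram
def handle_request():
    time.sleep(random.random() / 10)  # stand-in for real work
    REQUESTS.labels(status="200").inc()

if __name__ == "__main__":
    start_http_server(8000)  # exposes metrics on :8000/metrics
    while True:
        handle_request()
```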
Julius and Björn are production engineers at SoundCloud. Coincidentally, both were SREs at Google in their respective previous lives. Julius is a co-founder of the Prometheus project, and both are maintainers and main contributors.