Responding to and Learning from Distributed Incidents

May 8, 2023

Case Study

Authors:

Article shepherded by:

Laura Nolan

An Incident Begins

Picture the situation: users are flooding our support channel saying that their jobs are not starting.

Ideally we would have received an automated alert, something like "Number of started jobs over the last 5 to 10 minutes is 0". This is an example of a symptom-based alert: it says nothing about why jobs may not be starting. There are two main reasons to alert on symptoms instead of causes. Firstly, symptom-based alerts [1] are directly related to what users expect from your platform. They want to run their jobs in a timely fashion, and we need to know quickly if user needs are not being met. Secondly, in complex systems there are so many possible causes for problems that finding them all ahead of time and configuring alerting rules for all of them is impossible. We would inevitably have gaps in our alerting. When we alert on symptoms we can configure one alert which covers all potential causes for those symptoms. Furthermore, it means we don’t have to try to determine alert queries and thresholds for all potential causes — this normally reduces alert noise and makes for a more maintainable alerting configuration.
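As a rough illustration of such a symptom-based check, the sketch below flags the case where no job has started within a trailing window. The function name `no_jobs_started`, the 10-minute window, and the sample timestamps are assumptions made for this example; in practice the check would more likely be a query in whatever monitoring system is in use.

```python
from datetime import datetime, timedelta


def no_jobs_started(job_start_times, now, window=timedelta(minutes=10)):
    """Return True when no job started within the trailing window.

    This looks only at the symptom (jobs are not starting) and says
    nothing about the cause.
    """
    cutoff = now - window
    return not any(cutoff <= started <= now for started in job_start_times)


# Illustrative timestamps only (not taken from the incident).
now = datetime(2022, 10, 10, 11, 0)
healthy = [now - timedelta(minutes=2), now - timedelta(minutes=30)]
stalled = [now - timedelta(minutes=45)]

print(no_jobs_started(healthy, now))  # False
print(no_jobs_started(stalled, now))  # True
```

Note that the check never mentions runners, the API, or the DB: any cause that stops jobs from starting trips the same alert.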

The only downside of symptom-based alerting is that once you receive an alert, you have to get from the symptom — your alert — to the cause. This is also true when users report problems. Let's return to our case study: users were unable to run their jobs. We had no idea yet what had caused this symptom. At the start of the incident, the impact was not known either. Only a couple of big consumers had raised the issue, so during the initial investigation we assumed some part of the fleet was still operational and just backlogged.

The system in question was our version control system, GitLab. We run a fleet of servers (GitLab runners) that run builds, tests, and other automated tasks when new commits are made. Additionally, we run GitLab Rails servers, which provide administrative access to our GitLab instance. At the time this incident occurred, we had just upgraded our GitLab system.

Our GitLab update process was:

1. Drain traffic away from the system
2. Enable maintenance mode on the API, in order to avoid writes to the DB
3. Wait for all in-progress tasks to complete and write to the DB
4. Shut down services on the GitLab Rails servers
5. Upgrade the OS packages
6. Run DB migrations
7. Restart all services on the GitLab Rails servers
8. Check that core functionality works, including accessing the web UI, cloning a repository, and pushing changes to our main repositories
9. Disable maintenance mode on the API and re-enable traffic

Spot-checking logs on some GitLab runners, we saw something unusual: many messages like "Runner 12345 is not healthy and will be disabled!"

While we were investigating, the system recovered fully by itself, exactly 1 hour after the outage started.

Troubleshooting and Investigation Methodologies

After the situation mysteriously resolved itself, we needed to know what happened: we didn’t know whether we were at risk of it happening again.

In order to troubleshoot any system, we need to know (or be able to find out) how it works.

How do GitLab runners know what jobs to run? Every 3 seconds (the default; the interval can be configured), every runner sends a request to an API endpoint within GitLab with a POST body containing its runner ID and other identifying information, such as tags that specify what jobs the runner is capable of executing. The GitLab API queries its DB to identify potential jobs for the runner and, if there are jobs waiting that match the selection criteria, returns one job to the runner to execute.

Figure 1: Runners poll the GitLab API every 3 seconds for runnable jobs. The GitLab API retrieves information from the GitLab DB.
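The polling exchange described above can be sketched roughly as follows. This is a simplified model, not GitLab's implementation: `FakeGitLabAPI`, `Job`, and `request_job` are hypothetical names, the in-memory list stands in for the DB, and the tag-matching rule (a runner must carry every tag a job requires) is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Job:
    job_id: int
    required_tags: frozenset


@dataclass
class FakeGitLabAPI:
    """Hypothetical in-memory stand-in for the GitLab API and its DB."""
    pending: list = field(default_factory=list)

    def request_job(self, runner_id: int, runner_tags: set) -> Optional[Job]:
        # runner_id is part of the request body in the article's description;
        # this simplified matcher only looks at the tags.
        for index, job in enumerate(self.pending):
            # Assumed rule: the runner must carry every tag the job requires.
            if job.required_tags <= runner_tags:
                return self.pending.pop(index)  # hand out exactly one job
        return None  # nothing suitable; the runner polls again later


api = FakeGitLabAPI(pending=[
    Job(1, frozenset({"gpu"})),
    Job(2, frozenset({"docker"})),
])
first = api.request_job(runner_id=12345, runner_tags={"docker", "linux"})
second = api.request_job(runner_id=12345, runner_tags={"docker", "linux"})
print(first.job_id)  # 2
print(second)        # None
```

Each call corresponds to one poll: the runner either receives a single matching job or nothing, and in both cases it asks again on its next interval.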

To get one job to one runner we make four network calls: runner to GitLab API, GitLab API to GitLab DB, and the return path for each of these. This is a simplification: we ignore things like load balancers, DNS and SQL proxies for convenience. Any of these network calls can potentially fail.

Figure 2: Failure modes in network calls - request may not arrive, server may fail while serving the request, the response may not arrive.

Scanning the logs for the GitLab runner fleet, we found that every single runner was affected by the outage. The following grep command found the same log line on all runners:

# We know from past experience that this operation does not consume
# large amounts of resources, so it was deemed safe. Be careful running
# something like this without prior experience!
grep '2022-10-10T1[01].*Runner [0-9]* is not healthy and will be disabled!' \
  /var/log/gitlab-runner/current.log

Command to search for the relevant error on GitLab runners, using the tool grep (plus a script or tool such as parallel-ssh that can run commands across multiple hosts).

The CI infrastructure was nonfunctional for an hour. This indicated to us that the problem was something global and systemic, not something that affected only a few hosts. But we still didn’t know why the runners disabled themselves, what it meant for a runner to be disabled, or why they all came back after an hour.

We applied a standard troubleshooting methodology: form a hypothesis to explain observed symptoms, and gather data to prove or disprove that hypothesis. Iterate until the incident has been satisfactorily explained.

Figure 3: Troubleshooting methodology. Iteratively create hypotheses and gather data to prove or disprove them. Write a detailed analysis after the problem is resolved and perform any follow-up actions identified.

The same troubleshooting methodology can be applied both to locally-running software and to distributed systems, although distributed systems complicate matters because of their greater complexity. Distributed systems can throw up new and unforeseen problems, which can be especially difficult to deal with due to a lack of established debug tooling or documentation. Novel problems need to be tackled on the fly. An outage is not where we want to start learning about new behavior and building a whole new mental model.

Here's how we can build a hypothesis systematically. First, we identify a component that has an effect on the problem we're seeing. Next, we confirm whether that component is behaving according to our mental model of its intended behavior. If it seems to be behaving correctly, we move on to the next component. If it isn't behaving as we think it should, we correct our mental model of that component or subsystem's behavior and, if appropriate, develop a strategy to bring the system back to the "correct" behavior. If our outage has not been resolved or explained, we proceed to the next mismatch between our mental model and the system's behavior.

This loop drives learning: it is the reason that incidents (and simulated incidents) can be such powerful sources of information about system behavior. Established teams which have a lot of experience with a particular system normally have a good baseline understanding of their systems, which, combined with solid knowledge of debugging and observability tools, usually means that they can analyze and mitigate incidents rapidly.

Solving the GitLab Runner Incident

Returning to our GitLab problem, we inferred from the error message that there was a fleetwide problem with the runners’ health-checking mechanism, caused by the upgrade procedure.

The source code for GitLab runners is publicly available, so we can start by finding the code that emits log lines like "Runner 12345 is not healthy and will be disabled!"

Reading the code, we find that this message is issued when the runner has 3 consecutive failures when communicating with the GitLab API. HTTP status codes 403 Forbidden and 400 Client Error are considered unhealthy responses. On receiving the third consecutive unhealthy response, the runners disable themselves for one hour.

Figure 4: Healthy and unhealthy responses from the GitLab API. HTTP 403 forbidden and HTTP 400 client error are unhealthy responses. HTTP 201 created and HTTP 204 no content as well as any other code not in the unhealthy responses list are considered healthy responses.
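The health-check behavior described above can be modeled with a small state machine. The sketch below is a paraphrase of that description, not the runner's actual source code: the class `RunnerHealth`, its method names, and the choice to reset the failure counter once the runner is disabled are all assumptions made for illustration.

```python
from dataclasses import dataclass

UNHEALTHY_STATUSES = {400, 403}  # per the article; every other code is healthy
FAILURE_LIMIT = 3                # consecutive unhealthy responses tolerated
DISABLE_SECONDS = 60 * 60        # the runner sits out for one hour


@dataclass
class RunnerHealth:
    consecutive_failures: int = 0
    disabled_until: float = 0.0

    def can_poll(self, now: float) -> bool:
        """A disabled runner does not contact the API until the hour is up."""
        return now >= self.disabled_until

    def record_response(self, status: int, now: float) -> None:
        if status in UNHEALTHY_STATUSES:
            self.consecutive_failures += 1
            if self.consecutive_failures >= FAILURE_LIMIT:
                self.disabled_until = now + DISABLE_SECONDS
                self.consecutive_failures = 0  # assumed: start fresh afterwards
        else:
            self.consecutive_failures = 0  # any healthy response clears the streak


health = RunnerHealth()
for t in (0, 3, 6):  # three polls at the default 3-second interval
    health.record_response(403, now=t)

print(health.can_poll(now=9))     # False
print(health.can_poll(now=3606))  # True
```

In this model, three unhealthy responses in a row are enough to take a runner out of service for an hour, even if the API becomes healthy again seconds later.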

Because we saw the "Runner 12345 is not healthy and will be disabled!" log messages, the runners must have received at least 3 consecutive responses with either HTTP 403 or HTTP 400. Now we could start investigating reasons why this could have occurred. Some potential hypotheses:

The new version of the GitLab runner is broken and all runners are sending invalid requests to GitLab
Something else related to our upgrade has been causing all runners to receive an HTTP 403 response
This has nothing to do with the upgrade and was something that just happened to coincide with it

The version of the GitLab runner we upgraded to had already been out for a few months, so it was unlikely that we would be the first to discover such a significant bug.

However, it was worth asking whether any step in the upgrade procedure was capable of disrupting the runners by causing 403 Forbidden responses to be returned by the GitLab API. The maintenance mode seemed like a good candidate to investigate since it affects the API, the component of the system that did not seem to be behaving as anticipated.

In the documentation for the maintenance mode feature we find the culprit:

“For most JSON requests, POST, PUT, PATCH, and DELETE are blocked, and the API returns a 403 response with the error message: You cannot perform write operations on a read-only instance.”
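To make the documented behavior concrete, here is a minimal sketch of how an API in maintenance mode might answer different HTTP methods. The function `api_status` is hypothetical, and the model ignores the exceptions implied by the word "most" in the documentation.

```python
WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


def api_status(method: str, maintenance_mode: bool, normal_status: int = 200) -> int:
    """Return the status code a simplified API would send back."""
    if maintenance_mode and method.upper() in WRITE_METHODS:
        return 403  # write operations are rejected on a read-only instance
    return normal_status


print(api_status("POST", maintenance_mode=True))   # 403
print(api_status("GET", maintenance_mode=True))    # 200
print(api_status("POST", maintenance_mode=False))  # 200
```

Because the runners ask for jobs with POST requests, every poll made while maintenance mode is enabled falls into the first case and counts as an unhealthy response.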

From our logs, we saw that the runners started to accept jobs again exactly an hour after we disabled maintenance mode on the GitLab API, confirming the hypothesis.

Unless we re-enabled maintenance mode, we were unlikely to see a recurrence of the problem with the GitLab runners.

Updating Mental Models: Post-Incident Follow Up

The post-incident follow up is critical: it is how we institutionalize and share the knowledge we have gained during the incident investigation. While your memory is fresh (and relevant logs and metrics still exist), write a document that includes a timeline and description of your investigation, the assumptions and hypotheses you made during it, and the contributing factors you discovered.

You may notice improvements that you can make to your systems as a result of what you have learned. In this case, you might amend the upgrade process, perhaps modify the runners to retry the API sooner, or update your alerting configuration (recall that this incident was discovered through user reports).

The true value of the post-incident follow up, however, is learning. During normal work and during incidents, operators interact with various representations of the system: monitoring graphs, architecture documentation, and user reports. From these representations, they form a view of the system. The map is not the territory, and these representations are not perfect reflections of the system: no system is completely observable. As a result, system operators develop incorrect or incomplete mental models about system behavior. In our example incident, operators did not know that the API maintenance mode could impact the runner behavior for the following hour.

During incidents and during normal day-to-day work, it is important to remember that you’re dealing with representations of your system and that your mental model of the system can never exactly match the real system. This is why diverse teams with lots of different perspectives are so valuable: combining several mental models of the same system makes us less likely to miss crucial insights. This is also why incident response and incident analysis are so vital: they are our best mechanism for updating our mental models of our systems to better match reality, so that the gap between representations and reality does not grow over time. Incidents are an incredibly valuable opportunity for learning: don’t let your incidents go to waste!

Appendix

References:

[1] Rob Ewaschuk, ‘My Philosophy on Alerting’. https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit

Article Categories:

SRE

Last updated May 5, 2023

Authors:

Bio: Philipp Böschen is a Site Reliability Engineer at Booking.com, working on some of the core developer experience systems and maintaining the source code repositories. His main focus apart from these areas is with build systems and how we can strategically learn from our mistakes and outages.
