Improving On-call Fatigue

September 6, 2021

Research

Article shepherded by:

Effie Mouzeli

Many years ago, I remember being sleepless in anticipation of an on-call shift. Being on-call means being available for work at any time when needed, especially for an emergency situation. Back then, my on-call experience was extremely stressful and highly unpredictable. Discussing it with my team at the time didn't change much. The general feeling was that on-call is trying, stressful, and no one likes it. It is not supposed to be fun, and nothing can be done about it.

Later on, I joined a team where the on-call experience was completely different. My colleagues would describe their on-call shifts positively: “I love it, I am learning so much!”, and “It’s fun and I look forward to it”. This was surprising, and motivated me to discover what can make an on-call experience positive.

Multiple studies in other fields like medicine and emergency response show that on-call generates high physiological stress, and sometimes results in chronic fatigue and burnout [1, 2, 3]. While noisy pagers can lead to pager fatigue, noisiness is subjective: every team has its own answer as to how many alerts make a pager noisy. That led me to wonder: is it just the number of pages that correlates with on-call satisfaction, or are there other factors involved?

I surveyed and interviewed ~240 engineers of different levels. Each one rated their on-call satisfaction and provided detailed feedback about their experience. A year later, I repeated the same experiment with an even larger study group, evaluating additional dimensions of the on-call experience. My initial results showed that factors such as team culture, autonomy, and empowerment to drive change contribute significantly to on-call satisfaction.

In the next sections we will look at how your team can improve the on-call experience and use on-call to learn and grow a great team culture in a measurable way.

Building blocks of on-call satisfaction

Let's review some engineering practices from teams with high on-call satisfaction.

“Remember teamwork begins by building trust. And the only way to do that is to overcome our need for invulnerability.”
― Patrick Lencioni

Technical literacy and hands-on experience were among the biggest contributors to on-call well-being. Successful teams had established effective onboarding processes, invested in training, and kept their documentation up to date.

Good communication and collaboration have a force multiplier effect on team efficiency. As George Bernard Shaw once said: “The single biggest problem in communication is the illusion that it has taken place”. Teams I surveyed with high on-call satisfaction had a bottom-up culture with top-down support (managers avoided dominating the discussion). During regular weekly meetings, engineers would review incident trends, systemic problems, documentation improvements, and automation, then log work items and identify owners. Teams conducted blameless incident retrospectives, focusing on the platform and processes rather than on the individual. Engineers felt safe sharing their opinions.

Successful teams had a culture of accountability and ownership, where achievements were celebrated. Teams had recurring Engineering Reviews and Demo meetings.

Teams with high on-call satisfaction had established an effective feedback loop. On-call engineers would fill in surveys as soon as their on-call was over, including detailed feedback about short/mid-term improvements.

Finally, happy teams demonstrated high levels of empathy, looking proactively at opportunities to support each other.

Understanding stress: The NUTS model

Dr. Sonia Lupien at the University of Montreal [4] developed the NUTS model to identify stress triggers: Novelty, Unpredictability, Threat to the Ego, and Sense of Control. We can apply the NUTS model to engineering on-calls to address stressors in a targeted way.

Novelty: You find out about a new platform or process you are supposed to use during your upcoming on-call shift, but you haven’t had time to practice it.
Unpredictability: A partner team keeps sending urgent and unexpected requests to your team.
Threat to the ego: Your colleague questions decisions you made during an incident.
Sense of Control: You are working from home and dealing with an escalation when your internet provider has a service interruption.

Improving on-call

Leadership support is always an important factor in the success of organisational change. Survey or interview your team members to identify improvement opportunities ([5] and [6] have useful advice on this). Get key stakeholders and allies among product groups, operations, engineering and, of course, leadership on board by providing data and arguments relevant to them. Agree on expectations for improvements, which could include reducing the time engineers spend dealing with incidents, reducing toil, improving retention, creating more effective onboarding processes, and improving knowledge sharing and cross-team collaboration. And don’t forget the key argument: happy teams are more productive [7]!

Defining an on-call satisfaction SLI/SLO

Service Level Objectives (SLOs) are best known for their use with production systems, but they have broader uses. We can define SLOs for our team’s on-call satisfaction and stress levels. Some examples of potential on-call satisfaction Service Level Indicators (SLIs) that you can track automatically are:

Number of alerts (although, as we have seen, this is not the only important indicator)
Sleep interruptions during on-call
Duration of incidents
How many other interruptions an engineer had during their on-call (unexpected calls, meetings, or requests for assistance)

Other SLIs can be collected during meetings or via surveys:

How many incidents were novel? Were engineers aware of recent and upcoming changes to the system, and did they know about potential issues?
How many incidents were unexpected? Were there means to anticipate the incident?
Did on-call engineers feel judged? Nick Stenning’s talk on learning from incidents [8] and the Etsy Debriefing Facilitation Guide [9] can help you solve this issue.
Did engineers have all the tools and training they needed to feel in control, such as runbooks, dashboards and troubleshooting guides?

The most important thing is to discuss potential on-call SLIs with your colleagues, and to agree on what the acceptable thresholds for those indicators are.
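
As a minimal sketch of how the automatically tracked indicators could be compared against team-agreed thresholds, the Python below summarises a single shift. The `ShiftRecord` fields, the `THRESHOLDS` values, and the `evaluate_shift` function are hypothetical names and numbers chosen for illustration; real thresholds should come out of the discussion with your colleagues.

```python
from dataclasses import dataclass


@dataclass
class ShiftRecord:
    """Indicators for one on-call shift (all field names are illustrative)."""
    alerts: int
    sleep_interruptions: int
    incident_minutes: list[int]  # duration of each incident, in minutes
    other_interruptions: int


# Hypothetical thresholds; agree on real values with your team.
THRESHOLDS = {
    "alerts": 10,
    "sleep_interruptions": 1,
    "longest_incident_minutes": 120,
    "other_interruptions": 5,
}


def evaluate_shift(shift: ShiftRecord) -> dict[str, bool]:
    """Return, per indicator, whether the shift stayed within its threshold."""
    observed = {
        "alerts": shift.alerts,
        "sleep_interruptions": shift.sleep_interruptions,
        "longest_incident_minutes": max(shift.incident_minutes, default=0),
        "other_interruptions": shift.other_interruptions,
    }
    return {name: observed[name] <= limit for name, limit in THRESHOLDS.items()}


shift = ShiftRecord(
    alerts=7,
    sleep_interruptions=2,
    incident_minutes=[35, 90],
    other_interruptions=3,
)
result = evaluate_shift(shift)
breached = [name for name, ok in result.items() if not ok]
print(breached)  # ['sleep_interruptions']
```

Keeping the thresholds in one plain dictionary makes them easy to review and change as a team, separately from the code that collects the data.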

Before on-call

Hands-on experience is critical to gain confidence in using the available tools and to reduce the novelty factor in dealing with incidents. Here are some ways to prepare new engineers for on-call:

Provide opportunities to shadow others during their on-call.
Extract long-tenured engineers’ “tribal knowledge” about the systems, their failure modes, and their contacts across the organisation; it is critical for giving other engineers the same level of autonomy and confidence in dealing with critical issues. You can build an “on-call survival kit” where everything that doesn’t fit into a service runbook can go.
Organize regular on-call drills to develop hands-on experience with systems, tools, and processes.
Show empathy and support, especially if someone is preparing to be on-call for the first time. New on-callers should have someone they can escalate to if necessary.
Create a checklist with everything an engineer may need to validate in advance. There is nothing worse than realizing you forgot your access token, your phone number is not up to date in the escalation system, or that your access to a critical system has expired.

During on-call

Team members and managers play a crucial role during their colleagues’ on-calls. On-call engineers who are swamped with ongoing issues may need support but be too overloaded to pause and ask for it. After a tough shift, you may need to proactively offer on-callers some time to recover, while engaging on-call backup. If possible, create automated “on-call overload” alerts, as you would for any service issue, and make sure someone, such as a secondary on-caller or a manager, is available to help. Small acts that demonstrate empathy, such as bringing lunch during an incident or covering the pager so an on-caller can take a break, can mean a lot to an individual on-caller and can strengthen your team.
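
One way to approximate an automated “on-call overload” alert is to count pages inside a sliding time window and flag the shift when the count crosses a limit. The sketch below assumes page timestamps are available as `datetime` objects; the two-hour window, the limit of five pages, and the `is_overloaded` function are hypothetical choices, not recommendations.

```python
from datetime import datetime, timedelta


def is_overloaded(page_times, window=timedelta(hours=2), max_pages=5):
    """Return True if more than max_pages pages fall inside any sliding window."""
    times = sorted(page_times)
    start = 0
    for end, current in enumerate(times):
        # Move the window start forward until it is within `window` of this page.
        while current - times[start] > window:
            start += 1
        if end - start + 1 > max_pages:
            return True
    return False


base = datetime(2021, 9, 6, 1, 0)
quiet_shift = [base + timedelta(hours=3 * i) for i in range(6)]     # one page every 3 hours
busy_shift = [base + timedelta(minutes=15 * i) for i in range(6)]   # six pages in 75 minutes

print(is_overloaded(quiet_shift))  # False
print(is_overloaded(busy_shift))   # True
```

When the check returns True, the signal could be routed to a secondary on-caller or a manager rather than to the person who is already overloaded.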

After on-call

After on-call is over, it’s time for learning, planning improvements, and recovery. Learn from any incidents that occurred, and, where relevant, share what you learn with other teams by organizing cross-team incident retrospectives.

Plan for improvements/fixes identified as part of post-incident reviews or from on-call engineer feedback. Every individual engineer should be empowered to drive for improvement [10]. Treat your runbooks and troubleshooting guides as first-class citizens and track any potential improvements that are identified, or new scenarios that need to be included.

Healthy teams continuously review potential risks in their systems, making sure monitoring and mitigation steps are in place to address them. Incident reviews and on-call engineer surveys are valuable channels for identifying new risks.

As for recovery, consider implementing an “on-call day off” for engineers to use after their shift is over. Asking “How was your on-call? How do you feel? Do you need anything?” helps others to feel supported and heard.

Finally, capture all the indicators you defined for your on-call satisfaction SLI, and review your on-call SLI/SLO trend and improvement opportunities.
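
To review the trend, one simple option is to record a satisfaction score after each shift and track what fraction of recent shifts met the target. In the sketch below, the 1-to-5 survey scale, the target score of 4, the 80% objective, the four-shift window, and both function names are assumptions made for illustration.

```python
def slo_compliance(scores, target_score=4):
    """Fraction of shifts whose satisfaction score met or exceeded the target."""
    if not scores:
        return None
    return sum(score >= target_score for score in scores) / len(scores)


def compliance_trend(scores, window=4, target_score=4):
    """Compliance for each consecutive window-sized run of shifts, oldest first."""
    return [
        slo_compliance(scores[i:i + window], target_score)
        for i in range(len(scores) - window + 1)
    ]


# Hypothetical post-shift survey scores (1 = very poor, 5 = excellent), oldest first.
scores = [2, 3, 4, 3, 4, 5, 4, 4]
SLO_OBJECTIVE = 0.8  # assumed objective: 80% of shifts score 4 or higher

trend = compliance_trend(scores)
meets_slo = trend[-1] >= SLO_OBJECTIVE
print(trend)      # [0.25, 0.5, 0.75, 0.75, 1.0]
print(meets_slo)  # True
```

A rising sequence suggests that recent improvements are paying off, while a falling one is a prompt to revisit the feedback collected after each shift.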

Summing up

The number of pages alone is not enough to describe the on-call experience. A culture of trust, ownership, accountability, effective communication, mutual support, and collaboration is critical to building a successful team with a healthy on-call rotation. This culture establishes the foundation for improving processes and technology, which in turn drives a better on-call experience and better service reliability.

Appendix

References:

[1] Desta Fekedulegn et al, ‘Fatigue and on-duty injury among police officers: The BCOPS study’, Journal of Safety Research (60, February 2017), 43-51. https://doi.org/10.1016/j.jsr.2016.11.006

[2] Susana Rodrigues et al, ‘Stress among on-duty firefighters: an ambulatory assessment study’, PeerJ Life and Environment (December 11 2018). https://doi.org/10.7717/peerj.5967

[3] Luke Witherspoon et al, ‘Is it time to rethink how we page physicians? Understanding paging patterns in a tertiary care hospital’, BMC Health Services Research (19, December 2019). https://doi.org/10.1186/s12913-019-4844-0

[4] M Gabrielle Pagé et al, ‘The Stressful Characteristics of Pain That Drive You NUTS: A Qualitative Exploration of a Stress Model to Understand the Chronic Pain Experience’, Pain Medicine (22:5, May 2021), 1095-1108. https://doi.org/10.1093/pm/pnaa370

[5] Patrick M. Lencioni, The Five Dysfunctions of a Team: A Leadership Fable (Jossey-Bass, 2011).

[6] Alex Hidalgo, Implementing Service Level Objectives (O'Reilly Media, 2020).

[7] Clement Bellet et al, ‘Does Employee Happiness have an Impact on Productivity?’, Saïd Business School WP (October 14, 2019). http://dx.doi.org/10.2139/ssrn.3470734

[8] Nick Stenning, ‘Building Resilience: How to Learn More from Incidents’, at SREcon19 EMEA. https://www.usenix.org/conference/srecon19emea/presentation/stenning

[9] John Allspaw et al, ‘Debriefing Facilitation Guide’ (Etsy, 2016). https://extfiles.etsy.com/DebriefingFacilitationGuide.pdf

[10] Amy Tobey, ‘One on One SRE’, at SREcon19 EMEA. https://www.usenix.org/conference/srecon19emea/presentation/tobey

Article Categories:

SRE

Last updated February 8, 2023

Authors:

Daria Barteneva is currently Principal Site Reliability Engineer in Observability Engineering at Azure. With a background in Applied Mathematics, Artificial Intelligence and Music, Daria is passionate about data mining, diversity in tech and opera. In her current role, Daria is focused on changing organizational culture, processes and platform to improve service reliability and on-call experience. Originally from Moscow, Russia, Daria spent 20 years in Portugal and now lives in Dublin, Ireland.
