SAGE - Feature


Riding the Escalator


by Mark K. Mellis
<mkm@mellis.com>

Mark K. Mellis is principal engineer of Mellis and Associates, a Silicon Valley consulting firm specializing in Unix systems and IP networking. His interests include computer telephony, reliable network infrastructure design, jazz, and rubber chickens.

I've always been challenged by the management of flipped-out users: ones who have a problem that, to them, is the most pressing thing in the world. I "feel their pain," yet I have to balance their needs in the context of the entire organization that I support. As usual, I'm not the only one with this problem, and also as usual, the other folks have a pretty good tool for managing the problem. In my best "egoless sysadmin" style, I've adopted it for my own use. My new magic bullet is the defined escalation procedure.

How It Works

You classify your incoming calls (or tickets or issues or whatever your favorite euphemism might be) by priority. Each classification has expected service goals associated with it. Those service goals are published. If a call isn't progressing according to the service goals for its classification, it moves to a higher classification, and those involved are notified. This continues until the call is closed or it reaches the top of the escalation ladder. The goal is never to have to escalate calls beyond their initial level. In addition to autoescalation, a call can be escalated at will by the submitter or by the sysadmin organization. Either way, the path for each call is defined, and all parties know in advance what their options are when progress isn't as apparent as they would like. By defining the path through which a problem is addressed and providing a clear mechanism to escalate the issue when necessary, you help move frustration away from the personalities involved and focus on the problem and its resolution.
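The mechanism above can be sketched in a few lines of Python. This is a minimal illustration rather than a real ticketing system: the `Call` class, the `tick` and `escalate` functions, and the `notify` callback are all hypothetical names, and the thresholds are borrowed from the example procedure later in this article.

```python
from dataclasses import dataclass

# Hypothetical ladder, lowest to highest. The names and thresholds are
# borrowed from the example procedure later in this article.
LEVELS = ["routine", "minor", "major", "disaster"]

# Business hours a call may sit at each level before it autoescalates.
# None marks the top of the ladder.
ESCALATE_AFTER = {"routine": 24, "minor": 8, "major": 4, "disaster": None}


@dataclass
class Call:
    summary: str
    level: str = "routine"
    hours_at_level: float = 0.0
    closed: bool = False


def escalate(call, notify):
    """Move an open call up one rung and notify; return True if it moved."""
    idx = LEVELS.index(call.level)
    if call.closed or idx == len(LEVELS) - 1:
        return False
    call.level = LEVELS[idx + 1]
    call.hours_at_level = 0.0  # assumption: the clock restarts at the new level
    notify(call)
    return True


def tick(call, business_hours, notify):
    """Advance the call's clock and autoescalate if its goal was missed."""
    if call.closed:
        return
    call.hours_at_level += business_hours
    limit = ESCALATE_AFTER[call.level]
    if limit is not None and call.hours_at_level >= limit:
        escalate(call, notify)


notified = []
call = Call("CAD package update")
tick(call, 24, notified.append)   # service goal missed: routine -> minor
escalate(call, notified.append)   # escalated at will: minor -> major
print(call.level, len(notified))  # major 2
```

Autoescalation and escalation at will go through the same `escalate` function, so both follow the same defined path and both trigger the same notifications.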

Implications for the Organization

A defined and published escalation procedure helps relieve users' stress because it gives them a way to bring their problems to higher authority when they don't think they are getting the attention they deserve.

Implicit within the escalation procedure is a defined priority for each issue. When you have a clear idea of the priority of a problem, you can allocate resources to it in a fair manner. If it is escalated, the increased resources of the higher level become available, so that appropriate resources are channelled to critical problems.

Managers don't like to be blindsided. There's nothing like having your boss called on the carpet for a problem you haven't told her about yet. If escalations are actively managed, these situations are less likely to occur.

Don't forget, you must define the escalation procedure in writing and publish it throughout your user community and their management. If users don't know about it, they can't use it.

Like all policies and procedures, defined escalation is unlikely to succeed as a sysadmin tool without management support. Your boss needs to "sign up" to be on the escalation path and needs to back you up in a constructive manner when contacted by a constituent who is escalating an issue.

Defined escalation is not a project management tool. Project work should be managed separately.

Here is an example of the escalation procedure for a medium-sized company:

Routine request
-Submit via Web form, email, or telephone call to x4357 (HELP), or call: Duty Sysadmin pager +1 408 555 5554.
-Worked by duty sysadmin.
-Routine requests are account creations, file restores, network moves, alias maintenance, and so forth.
-Status reported at time of request and every 16 business hours thereafter.
-Escalates to next level in 24 business hours.
Minor outage, impairment
-Submit via Web form, email, or telephone call to x4357 (HELP), or call: Duty Sysadmin pager +1 408 555 5554, Duty manager pager +1 408 555 5555.
-Worked by duty sysadmin. Duty sysadmin manager and submitter's manager automatically notified.
-Minor outages affect eight or fewer users. For example, a single 10BaseT hub failure or a single workstation failure is a minor outage. Impairments are error conditions that have not yet caused an outage. For example, high error rates on a disk drive or network connection are impairments.
-Status reported at time of request and every 4 business hours thereafter.
-Escalates to next level in 8 business hours.
Major outage
-Submit via Web form, email, or telephone call to x4357 (HELP), or call: Duty Sysadmin pager +1 408 555 5554, Duty Manager pager +1 408 555 5555, IS Director pager +1 408 555 5556.
-Worked by duty sysadmin and other resources as dispatched by management. Duty Sysadmin manager, IS Director, and submitter's manager and director automatically notified.
-Major outages affect nine or more users or a major service, such as DNS or Internet connectivity. A major fileserver failure or a security incident in progress is a major outage.
-Status reported at time of request and hourly thereafter.
-Escalates to next level in 4 business hours.
Disaster
-Submit via Web form, email, or telephone call to x4357 (HELP), or call: Duty Sysadmin pager +1 408 555 5554, Duty Manager pager +1 408 555 5555, IS Director pager +1 408 555 5556, Office of the President pager +1 408 555 5557.
-Worked by all available resources. All Sysadmin management, all directors, and the Office of the President automatically notified.
-Disaster is a company-wide failure of IS infrastructure. A fire in the data center or a major security penetration is a disaster.
-Status reported at time of request and hourly thereafter.
-This is the highest escalation level.
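
If you keep the procedure in a tracking tool as well as on paper, the four levels above reduce to a small table. The following Python sketch is one possible encoding, not a prescribed one: the `Level` tuple, the `LADDER` table, and the `classify` helper are hypothetical names, and `classify` only suggests a starting level from the definitions above. The submitter remains free to pick a different one.

```python
from typing import NamedTuple, Optional


class Level(NamedTuple):
    name: str
    status_every_hours: int               # business hours between status reports
    escalates_after_hours: Optional[int]  # None at the top of the ladder
    notified: tuple                       # who is automatically notified


LADDER = (
    Level("routine request", 16, 24, ()),
    Level("minor outage, impairment", 4, 8,
          ("duty sysadmin manager", "submitter's manager")),
    Level("major outage", 1, 4,
          ("duty sysadmin manager", "IS director",
           "submitter's manager and director")),
    Level("disaster", 1, None,
          ("all sysadmin management", "all directors",
           "Office of the President")),
)


def classify(users_affected=0, impairment=False,
             major_service_down=False, company_wide=False):
    """Suggest a starting level from the published definitions."""
    if company_wide:
        return LADDER[3]
    if major_service_down or users_affected >= 9:
        return LADDER[2]
    if impairment or users_affected >= 1:
        return LADDER[1]
    return LADDER[0]


print(classify(users_affected=8).name)         # minor outage, impairment
print(classify(major_service_down=True).name)  # major outage
```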

Let's walk through a few cases. The network drop for Bill's workstation fails. He calls x4357 and opens a routine ticket. The ticket number he gets in return is his initial notification. The duty sysadmin comes by in 20 minutes, repatches him to a working port, and closes the ticket. Work flow is normal, so there is no escalation.

Susan needs an alias created to support her latest project and wants it done immediately. She knows that alias creation is a routine event and can reasonably take up to three business days. She can escalate it, but her management will be automatically notified if she does. Susan chooses to wait. The alias is created in a timely manner and the ticket is closed.

Jake needs a special CAD package updated on his workstation. He opens a routine ticket on Monday, but his request gets lost in the workload. On Thursday, it is autoescalated from routine request to minor outage, and the appropriate managers are notified. Jake's software is updated early Friday morning by a chastened sysadmin.
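
Jake's Monday-to-Thursday wait is just the 24-business-hour routine threshold counted out on a working calendar. Here is a small Python sketch of that arithmetic. It assumes a 09:00 to 17:00, Monday to Friday business day and ignores holidays; the procedure above doesn't define business hours, so those values, the `add_business_hours` helper, and the dates used are illustrative only.

```python
from datetime import datetime, time, timedelta

# Assumed business day; the published procedure does not define one.
OPEN, CLOSE = time(9), time(17)


def add_business_hours(start, hours):
    """Return the moment `hours` business hours after `start`."""
    remaining = timedelta(hours=hours)
    now = start
    while True:
        # Weekend or after closing: jump to the next day's opening.
        if now.weekday() >= 5 or now.time() >= CLOSE:
            now = datetime.combine(now.date() + timedelta(days=1), OPEN)
            continue
        # Before opening: wait for the doors to open.
        if now.time() < OPEN:
            now = datetime.combine(now.date(), OPEN)
        end_of_day = datetime.combine(now.date(), CLOSE)
        available = end_of_day - now
        if remaining <= available:
            return now + remaining
        remaining -= available
        now = end_of_day


opened = datetime(1998, 7, 6, 10, 0)   # a Monday, 10:00 (illustrative date)
print(add_business_hours(opened, 24))  # 1998-07-09 10:00:00, a Thursday
```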

Dorothy requests that her workgroup be moved to its own subnet to improve performance. Router interfaces are in short supply and the request can't be fulfilled in the designated time, so the sysadmin escalates the request. During the subsequent automatic management review, Dorothy is persuaded that she didn't need her own router interface after all.

Defined escalation is a tool, and will benefit you only if you use it properly. It must be accepted by your organization, including your user community. You have to be willing to live with its consequences. In return, you may reap the benefits of more businesslike interactions with your constituency and emergencies that are really "emergent."


7th July 1998 efc
Last changed: 7th July 1998 efc