SAGE - Sage feature


On Reliability - Restores and Recovery

by John Sellens
<jsellens@uunet.ca>

John Sellens is Director, Network Engineering, at UUNET Canada Inc. in Toronto, after many years as a system administrator. He is also proud to be husband to one and father to two.

This time around I'd like to discuss some aspects of disasters, how to avoid becoming too much of a victim, and how to put things back together should your avoidance measures prove to be inadequate. I'm going to review things primarily from a system administration standpoint (this is the SAGE section of ;login: after all), but it's important to remember that disaster recovery and avoidance is a far larger topic.

Computing professionals like us typically look at disaster recovery planning (DRP) primarily from a computing systems point of view, which is only natural when you consider which budget pool we're paid out of. Let's look at the big picture and hope that will help put the system administration issues into perspective.

What kinds of disasters might befall a company?

  • the obvious ones: hurricane, flood, earthquake, explosion, etc.
  • a tragic plane crash while all key staff are flying to a much-deserved off-site "retreat"
  • primary product found to cause cancer in every living mammal except laboratory mice
  • armed insurrection
  • complete breakdown of municipal transit systems, preventing staff from getting to the office
  • massive chemical spill and fire with toxic fumes at the company's plant, resulting in mass evacuations, health and environmental concerns, and virtually unlimited personal liability for the senior management and directors
  • disgruntled key employees who start a competing company and lure away every employee with a non-zero IQ
  • loss of the keys to accounts receivable filing cabinets, leading to billion-dollar write-offs

There's a lot more to DRP than making sure that the computers are running and the printers are printing. That said, let's concentrate on DRP for computing systems.

As I've tried to get across in the previous articles in this series, the key to appropriate levels of reliability is the balancing of the exposure to and costs of risks with the costs of avoiding those exposures. What's the worst that can happen (typically)? The company goes out of business, everyone is unemployed and without a pension, and the boss goes to jail. There's a story, which is probably apocryphal, of a mid-level executive who was charged with disaster recovery planning for his organization. His DRP? Keep an up-to-date copy of his resume at home. Most of us, however, would probably prefer to have at least something in place to provide some protection and recoverability.
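One rough way to frame that balance is to compare the expected yearly loss from a risk with and without protection. The sketch below is only an illustration: the function names, the dollar figures, and the simple probability-times-cost model are all assumptions made for the example, and real risk estimates are rarely this tidy.

```python
def annual_expected_loss(probability_per_year, cost_if_it_happens):
    """Expected yearly loss from one risk: likelihood times impact."""
    return probability_per_year * cost_if_it_happens


def mitigation_is_worthwhile(probability_per_year, cost_if_it_happens,
                             residual_probability, mitigation_cost_per_year):
    """Compare the expected loss without protection against the expected
    loss with protection plus the yearly cost of that protection."""
    unprotected = annual_expected_loss(probability_per_year, cost_if_it_happens)
    protected = (annual_expected_loss(residual_probability, cost_if_it_happens)
                 + mitigation_cost_per_year)
    return protected < unprotected


# Hypothetical figures: a 2% yearly chance of a $500,000 outage,
# reduced to 0.5% by a standby arrangement costing $5,000 a year.
print(mitigation_is_worthwhile(0.02, 500_000, 0.005, 5_000))
```

If the protection costs more per year than the loss it is expected to prevent, the comparison comes out the other way, which is the resume-at-home end of the spectrum.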

Let's try to divide the problem space into three major areas:

  1. major physical damage to computing hardware or communications infrastructure
  2. utility (power, HVAC, telecommunications) failures
  3. physical inaccessibility due to weather, structural damage to the building, civil unrest, or evacuation (due to chemical spill, fire, etc.)

    The key to dealing with such problems is planning and documentation (on paper, both on- and offsite). Leaving your DRP until disaster strikes only increases its severity.

    Physical Damage

    If your entire computing infrastructure consists of a single clone PC, a modem, and a cheap printer, it's probably not worthwhile worrying too much about protecting your equipment from loss or damage. If something gets damaged, just go to any of the consumer electronics stores in your area and get a replacement off the shelf. If, however, your equipment is not typically available at the mall on a Saturday afternoon, you probably want to consider how to limit your potential damage and how to get access to replacement equipment in a timely fashion.

    Physical damage to your computing and communications equipment can happen in a number of ways. Two of the most obvious are fire and water, but there are a number of other possibilities that you might consider protecting against.

    Fire and Smoke

    The best protection against fire and smoke is a safe, fire code-compliant building and an appropriate fire detection and suppression system. In past years, Halon was widely used as a fire suppression agent in computer rooms, but it was not environmentally friendly. Fire suppression systems are currently available based on carbon dioxide and other chemicals, but water-based sprinkler systems are still the most common suppression method. You might also consider the use of an emergency power-off system, to power down your systems in the event of an alarm. Among other things, this will help avoid damage to your equipment from smoke and residue being drawn into the chassis through the cooling fans.

    Water

    There are a few ways for water to attack your equipment. Plumbing failure is probably the most common, but you may also wish to worry about water damage from fire suppression systems or flooding. You should consider two primary attacks from water: falling down from above and seeping up from below.

    From above, there are burstable pipes (both on your premises and feeding the bathtub or dishwasher in the unit above) and fire hoses. I've seen some installations with drainage trays mounted under the pipes in the computer room, draining off to the side of the room. And if you have control (or knowledge) of whatever it is in the rooms above you, you might want to worry about whatever plumbing there is up there. Otherwise, your best protection is to keep your equipment in racks or cabinets with a roof over them (and make sure that the ventilation fan outlet isn't in the middle of the top of the cabinet).

    From below, consider a raised floor, with in-floor drains (including backflow prevention valves), pedestals, or some other device to keep your electrical connections off a potentially wet floor and an alarm system to warn you when it gets wet. And if your computer room is below grade, you may wish to reconsider its location. The farther you are above the water table, the safer you are.

    Earthquake, Tornado

    Two approaches to these problems are building integrity and equipment safety. If you are located in an area that is at risk for earthquakes or tornados, consider how your building would be affected if either hits. What parts of the building are most likely to be damaged: large plate glass windows, overhangs, trailer parks? Try to locate your equipment as far away from these as possible. Consider bolting your equipment down in some appropriate fashion, and don't forget to fasten your rolling equipment racks down, too. There's no sense having your equipment bounce across the room or fall out the window every time there's a tremor.

    Vandalism

    Depending on your industry and location, you may wish to consider what vandalism, looting, revenge, or a disgruntled employee might do to your equipment. Is your computer room unlocked? Do you have big glass display windows to impress random strangers? Do you store a selection of fire axes in and near your computer room?

    Alternatively, are you careful to collect keys and change security codes when an employee leaves (or is pushed)? Can employees enter on their own, or are two people required to act together to gain access to the computer room? Do you have 7x24 physical security monitoring?

    One of the most important things, if you have a nontrivial computer room, is to consult a local expert who can advise you on what is the most appropriate protection in your area and for your situation.

    How can you attempt to recover from physical damage? The classic answer is, of course, to have a redundant offsite installation that can be used for recovery (see "Redundant Premises" below). Alternatively, consider such options as

    • emergency recovery agreements with key suppliers
    • strategically selected and located spares
    • planning for what processes and activities (if any) can be performed manually while computing system recovery is under way

    Utility Failures

    Just about everyone is in a position to be affected by some form of utility failure, the most obvious being electrical power failure. Even if you generate your own electricity with wind turbines and backup batteries, you're still at risk of extended calm or physical failure of your generating equipment. Most people can survive short outages on an occasional basis. But if you're in an area where utilities can be unreliable (poor infrastructure, frequent thunderstorms) or if you worry about extended outages such as those suffered in Quebec and New England this winter due to the ice storms (some places were without electricity for several weeks), you may wish to consider some suitable form of backup or redundancy for your utilities.

    Electricity

    Most of us rely on electricity from the local power company. The obvious way to protect yourself against outages is through the use of an uninterruptible power supply (UPS) with a diesel generator for backup and extended outages.

    But it's important to remember that, in a disaster, you won't be worried only about your central computing systems. You'll need power to run heating or cooling equipment, ventilation, at least some room lighting, your telephone switch, and so on. This isn't just a system administration issue; it's a facilities-wide issue.
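As a back-of-the-envelope illustration of why the whole facility matters, the sketch below adds up every load that has to stay powered and checks it against a generator rating. The load names, the wattages, and the 25% headroom are hypothetical, and the sum ignores things a facilities engineer would care about, such as motor start-up surges.

```python
def required_generator_watts(loads_watts, headroom=0.25):
    """Sum every load that must stay up, then add a safety margin."""
    return sum(loads_watts.values()) * (1 + headroom)


def generator_is_adequate(loads_watts, generator_watts, headroom=0.25):
    """True if the generator rating covers the load plus the margin."""
    return generator_watts >= required_generator_watts(loads_watts, headroom)


# Hypothetical loads, in watts. Note how much is not computing gear.
loads = {
    "servers and network gear": 6000,
    "air conditioning": 4000,
    "room lighting": 500,
    "telephone switch": 300,
}
print(required_generator_watts(loads))
print(generator_is_adequate(loads, generator_watts=15000))
```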

    Water

    From a system administration perspective, the primary use for water is in air conditioning equipment. In some situations, a reservoir or cistern could provide spare water during an outage, and water tanker trucks are sometimes available. Otherwise, hope for a cool spell.

    Gas, Oil, Propane

    Again, these are used primarily for environmental control. Fortunately, alternate heat sources are often available, even if you have to resort to electric space heaters.

    Communication Links

    Most of us rely on some form of communications, whether it's ordinary telephone connections, leased lines, fiber, or various forms of wireless communications (though it's probably safe to say that wireless use is in the minority). Most of us rely on these links as a regular part of our workday, and for some of us, the business stops when the communication links go down.

    The best way to protect your communication links is through the use of redundant connections. For Internet connectivity, many organizations are "dual homed" to two providers, and prudent organizations make a point of ordering links from multiple carriers (and hope that the carriers don't simply buy capacity from each other). Even if your redundant links leave your premises through different paths over different carriers, it's still possible for them to terminate in or pass through the same carrier central office, which does limit your redundancy. If you're provisioning multiple links, try to get specific physical routes from your carriers so that you'll have a better idea of where your exposures are.
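Once you have route information from your carriers, checking for common exposures is a simple set comparison. This is a minimal sketch under the assumption that each link's path can be written down as a set of named facilities; the link names and facility names are invented for the example.

```python
from itertools import combinations


def shared_facilities(routes):
    """For each pair of links, return the facilities both pass through.

    `routes` maps a link name to the set of facilities (central offices,
    conduits, building entrances) on that link's physical path.
    """
    overlaps = {}
    for name_a, name_b in combinations(sorted(routes), 2):
        common = routes[name_a] & routes[name_b]
        if common:
            overlaps[(name_a, name_b)] = common
    return overlaps


# Hypothetical routes from two carriers.
routes = {
    "carrier_a_link": {"north entrance", "Main St CO", "Downtown CO"},
    "carrier_b_link": {"south entrance", "Elm St CO", "Downtown CO"},
}
for pair, facilities in shared_facilities(routes).items():
    print(pair, "share", sorted(facilities))
```

An empty result means no overlap in the facilities you were told about, which is only as good as the route data the carriers gave you.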

    One alternative for redundancy or backup that is becoming more common and more feasible is the use of metropolitan area wireless communications and/or satellite links. A satellite link, although typically slower and more expensive (or at least not cheap), provides nice redundancy because it can enable you to isolate your communications from any local problems. Of course, if all your connections are to systems in the same area as your office, remote satellite connectivity might not help too much. (And, of course, this is where I remind you all of the recent satellite outage that disabled huge numbers of pagers and the difficulty of dealing with failures in your backup communication systems.)

    Physical Inaccessibility

    I'll dredge up the ice storms from this past winter as an example of how your office can be fine, but it's just not possible to get there. Other examples are, of course, earthquake, flood, trucker blockades on European highways, bombs in the World Trade Center, and major parades. If your business relies on physical access (e.g., a printing company, a warehousing company, etc.), you could be in trouble. If you're an organization that deals in knowledge or computing, you might be better off. An easy way to deal with the latter is to ensure that your staff has home computing and an account on a reliable ISP (or run your own remote access servers, with lots of capacity for emergencies), and just have them dial in for the duration. Voice mail, remote phone forwarding, cell phones, and pagers all help limit (if you're lucky) the impact of this kind of disaster.

    Redundant Premises

    The classic disaster recovery plan (for computing and communications, at least) involves a redundant recovery site, just sitting and waiting for something to go wrong. This is still common in the mainframe world, where large data-processing capacity is needed on an ongoing basis. If you absolutely need ongoing computing, an alternate site is likely going to be part of your plan. Even if your needs are much simpler, you can benefit from some forms of offsite redundancy.

    Standby Sites

    In the traditional case, a large, climate-controlled, raised-floor computing center is loaded up with millions of dollars of equipment and sits there idle waiting for something to go wrong somewhere else. These sites have taken a number of forms. One of the most common is to be run by a service company, providing backup services to a number of clients. But some large organizations have backup computing centers that are dedicated to them. It is also not unheard of for system vendors to offer recovery services for their customers, and some cooperative ventures also exist for their members' mutual benefit. This typically isn't a cheap method of protection, but if you need it, you need it.

    Distributed Sites

    What is much more feasible, and much more practical in these days of high-speed Internet connectivity, is the use of distributed computing sites, which can provide backup for each other in the event of a disaster. An obvious example is the use of Web server hosting at service providers using some form of load sharing across multiple servers. This can be expanded by distributing your primary computing resources across multiple sites, taking care that you have similar equipment at each site. This kind of distribution also makes it possible to automatically store your backups offsite (given large enough bandwidth). However, it is harder to justify distributing your computing when your staff is all located in one place.
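Whether the bandwidth is "large enough" for offsite backups is easy to estimate. The sketch below is a rough calculation only: the function names are made up, and the 70% link efficiency is an assumed allowance for protocol overhead and competing traffic, not a measured figure.

```python
def transfer_hours(backup_gigabytes, link_megabits_per_second, efficiency=0.7):
    """Estimate hours needed to copy a backup over a link.

    Uses decimal units (1 GB = 10**9 bytes, 1 Mbit = 10**6 bits).
    `efficiency` is the assumed fraction of the link actually usable.
    """
    bits_to_send = backup_gigabytes * 8 * 10**9
    usable_bits_per_second = link_megabits_per_second * 10**6 * efficiency
    return bits_to_send / usable_bits_per_second / 3600


def fits_in_window(backup_gigabytes, link_megabits_per_second,
                   window_hours, efficiency=0.7):
    """True if the copy should finish within the nightly window."""
    hours = transfer_hours(backup_gigabytes, link_megabits_per_second, efficiency)
    return hours <= window_hours


# Hypothetical case: 20 GB of nightly backups over a 1.544 Mbit/s T1,
# with an 8-hour overnight window.
print(round(transfer_hours(20, 1.544), 1), "hours")
print(fits_in_window(20, 1.544, window_hours=8))
```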

    Recovering

    I mentioned it before, as many others have mentioned before me, but I'll reiterate that the key to successful recovery is a proper plan and proper documentation. Space limits restrict how much I can say here, and I will refer you to the bookstores for DRP books, but I will mention a few points.

    • Hardware. Is your recovery hardware compatible? Do you have documentation of what makes your systems and installations unique?
    • Backups. You have them, of course, and they are offsite, of course. But do you have the index to the media that will allow you to find the necessary backups when you need them?
    • Names and addresses. Do you know what your machines should be named and numbered?
    • People. Do you know how to get in touch with your staff at home, or is your only phone list the office numbers on your desktop machine that was just destroyed in the fire? Do you have a list of who to contact first, who should do what, in what order?
    • Communications and connectivity. Do you know whom to call at your service providers and carriers to get your connectivity adjusted for your new or temporary location?
    • Status. Do you know who is in charge of the entire recovery operation for your organization and how to get in touch with them and when?
    • Food. And finally, do you have a list of food delivery places at your recovery site so that you won't pass out from lack of sustenance while working feverishly to put things back together?
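To make the point about backup indexes concrete, here is a small sketch of what an index lets you answer: which media do I need to restore this filesystem as of this date? The record layout, host names, and tape labels are all hypothetical, and the logic assumes a simple scheme in which every incremental taken after the most recent full backup must be restored in order.

```python
from collections import namedtuple
from datetime import date

BackupRecord = namedtuple(
    "BackupRecord", "host filesystem taken_on kind media_label")


def media_needed(index, host, filesystem, restore_date):
    """Return media labels, in restore order, for one filesystem.

    Picks the latest full backup on or before `restore_date`, followed
    by every later backup up to that date. Returns [] if no full
    backup exists in the index for that period.
    """
    candidates = sorted(
        (r for r in index
         if r.host == host and r.filesystem == filesystem
         and r.taken_on <= restore_date),
        key=lambda r: r.taken_on,
    )
    last_full = None
    for position, record in enumerate(candidates):
        if record.kind == "full":
            last_full = position
    if last_full is None:
        return []
    return [r.media_label for r in candidates[last_full:]]


# Hypothetical index entries.
index = [
    BackupRecord("mail1", "/var", date(1998, 7, 26), "full", "TAPE-0101"),
    BackupRecord("mail1", "/var", date(1998, 7, 27), "incremental", "TAPE-0102"),
    BackupRecord("mail1", "/var", date(1998, 8, 2), "full", "TAPE-0110"),
    BackupRecord("mail1", "/var", date(1998, 8, 3), "incremental", "TAPE-0111"),
    BackupRecord("web1", "/home", date(1998, 8, 2), "full", "TAPE-0112"),
]
print(media_needed(index, "mail1", "/var", date(1998, 8, 3)))
```

An index like this is only useful if a copy of it survives the disaster, so it belongs offsite (and on paper) along with the media it describes.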

    Summary

    In summary, plan, prepare, and pray you never need it.


7th August 1998 efc