physical inaccessibility due to weather, structural damage to
the building, civil unrest, or evacuation (due to chemical spill, fire,
etc.)
The key to dealing with such problems is planning and documentation (on
paper, both on- and offsite). Leaving your DRP until disaster strikes
only increases its severity.
Physical Damage
If your entire computing infrastructure consists of a single clone PC,
a modem, and a cheap printer, it's probably not worthwhile worrying too
much about protecting your equipment from loss or damage. If something
gets damaged, just go to any of the consumer electronics stores in your
area and get a replacement off the shelf. If, however, your equipment
is not typically available at the mall on a Saturday afternoon, you
probably want to consider how to limit your potential damage and how to
get access to replacement equipment in a timely fashion.
Physical damage to your computing and communications equipment can
happen in a number of ways. Two of the most obvious are fire and water,
but there are a number of other possibilities that you might consider
protecting against.
Fire and Smoke
The best protection against fire and smoke is a safe, fire
code-compliant building and an appropriate fire detection and
suppression system. In past years, Halon was widely used as a fire
suppression agent in computer rooms, but it was not environmentally
friendly. Fire suppression systems are currently available based on
carbon dioxide and other chemicals, but water-based sprinkler systems
are still the most common suppression method. You might also consider
the use of an emergency power-off system, to power down your systems in
the event of an alarm. Among other things, this will help avoid damage
to your equipment from smoke and residue being drawn into the chassis
through the cooling fans.
Water
There are a few ways for water to attack your equipment. Plumbing
failure is probably the most common, but you may also wish to worry
about water damage from fire suppression systems or flooding. You
should consider two primary attacks from water: falling down from above
and seeping up from below.
From above, there are burstable pipes (both on your premises and
feeding the bathtub or dishwasher in the unit above) and fire hoses.
I've seen some installations with drainage trays mounted under the
pipes in the computer room, draining off to the side of the room. And
if you have control (or knowledge) of whatever it is in the rooms above
you, you might want to worry about whatever plumbing there is up there.
Otherwise, your best protection is to keep your equipment in racks or
cabinets with a roof over them (and make sure that the ventilation fan
outlet isn't in the middle of the top of the cabinet).
From below, consider a raised floor with in-floor drains (including
backflow prevention valves), pedestals, or some other device to keep
your electrical connections off a potentially wet floor, plus an alarm
system to warn you when it gets wet. And if your computer room is below
grade, you may wish to reconsider its location. The farther you are
above the water table, the safer you are.
Earthquake, Tornado
Two approaches to these problems are building integrity and equipment
safety. If you are located in an area that is at risk for earthquakes
or tornados, consider how your building would be affected if either
hits. What parts of the building are most likely to be damaged:
large plate glass windows, overhangs, trailer parks? Try to locate your
equipment as far away from these as possible. Consider bolting your
equipment down in some appropriate fashion, and don't forget to fasten
your rolling equipment racks down, too. There's no sense having your
equipment bounce across the room or fall out the window every time
there's a tremor.
Vandalism
Depending on your industry and location, you may wish to consider what
vandalism, looting, revenge, or a disgruntled employee might do to your
equipment. Is your computer room unlocked? Do you have big glass
display windows to impress random strangers? Do you store a selection
of fire axes in and near your computer room?
Alternatively, are you careful to collect keys and change security
codes when an employee leaves (or is pushed)? Can employees enter on
their own, or are two people required to act together to gain access to
the computer room? Do you have 7x24 physical security monitoring?
One of the most important things, if you have a nontrivial computer
room, is to consult a local expert who can advise you on what is the
most appropriate protection in your area and for your situation.
How can you attempt to recover from physical damage? The classic answer
is, of course, to have a redundant offsite installation that can be
used for recovery (see "Redundant Premises" below). Alternatively,
consider such options as
-
emergency recovery agreements with key suppliers
-
strategically selected and located spares
-
planning for what processes and activities (if any) can be
performed manually while computing system recovery is under way
Utility Failures
Just about everyone is in a position to be affected by some form of
utility failure, the most obvious being electrical power failure. Even
if you generate your own electricity with wind turbines and backup
batteries, you're still at risk of extended calm or physical failure of
your generating equipment. Most people can survive short outages on an
occasional basis. But if you're in an area where utilities can be
unreliable (poor infrastructure, frequent thunderstorms) or if you
worry about extended outages such as those suffered in Quebec and New
England this winter due to the ice storms (some places were without
electricity for several weeks), you may wish to consider some suitable
form of backup or redundancy for your utilities.
Electricity
Most of us rely on electricity from the local power company. The
obvious way to protect yourself against outages is through the use of
an uninterruptible power supply (UPS) with a diesel generator for
backup and extended outages.
But it's important to remember that, in a disaster, you won't be
worried only about your central computing systems. You'll need power to
run heating or cooling equipment, ventilation, at least some room
lighting, your telephone switch, and so on. This isn't just a system
administration issue; it's a facilities-wide issue.
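As a rough illustration of treating power as a facilities-wide budget, here is a minimal Python sketch that adds up everything that must stay powered and compares it with a generator's rating. The load names, the wattages, and the 20 percent headroom are all made-up assumptions for the example, not recommendations; get real figures from your equipment and your electrician.

```python
# Hypothetical load inventory; every name and wattage here is illustrative only.
def total_load_watts(loads):
    """Sum the draw of every item that must stay powered during an outage."""
    return sum(loads.values())


def fits_generator(loads, generator_watts, headroom=0.2):
    """Return True if the total load leaves `headroom` (a fraction) of capacity spare.

    The 20 percent default is an assumption for this sketch, not a standard.
    """
    return total_load_watts(loads) <= generator_watts * (1 - headroom)


loads = {
    "servers": 4000,
    "cooling": 3000,
    "room_lighting": 300,
    "telephone_switch": 500,
}

if __name__ == "__main__":
    print("Total load (W):", total_load_watts(loads))
    print("Fits a 10 kW generator:", fits_generator(loads, 10_000))
```

The point of the sketch is only that the cooling, lighting, and telephone switch sit in the same sum as the servers.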
Water
From a system administration perspective, the primary use for water is
in air conditioning equipment. In some situations, a reservoir or
cistern could provide spare water during an outage, and water tanker
trucks are sometimes available. Otherwise, hope for a cool spell.
Gas, Oil, Propane
Again, these are used primarily for environmental control. Fortunately,
alternate heat sources are often available, even if you have to resort
to electric space heaters.
Communication Links
Most of us rely on some form of communications, whether it's ordinary
telephone connections, leased lines, fiber, or various forms of
wireless communications (though it's probably safe to say that wireless
use is in the minority). Most of us rely on these links as a regular
part of our workday, and for some of us, the business stops when the
communication links go down.
The best way to protect your communication links is through the use of
redundant connections. For Internet connectivity, many organizations
are "dual homed" to two providers, and prudent organizations make a
point of ordering links from multiple carriers (and hope that the
carriers don't simply buy capacity from each other). Even if your
redundant links leave your premises through different paths over
different carriers, it's still possible for them to terminate in or
pass through the same carrier central office, which does limit your
redundancy. If you're provisioning multiple links, try to get specific
physical routes from your carriers so that you'll have a better idea of
where your exposures are.
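Once you have physical routes in hand, finding the shared exposures is a simple comparison. The following is a minimal Python sketch, assuming each carrier gives you its route as an ordered list of facilities; the link and facility names are hypothetical.

```python
from itertools import combinations


def shared_facilities(routes):
    """Map each pair of links to the set of facilities both pass through.

    `routes` maps a link name to the list of facilities on its physical path.
    Pairs with nothing in common are left out of the result.
    """
    exposures = {}
    for (name_a, path_a), (name_b, path_b) in combinations(sorted(routes.items()), 2):
        common = set(path_a) & set(path_b)
        if common:
            exposures[(name_a, name_b)] = common
    return exposures


# Hypothetical routes; real ones would come from your carriers.
routes = {
    "carrier_a_leased_line": ["office_north_conduit", "central_office_1", "pop_east"],
    "carrier_b_fiber": ["office_south_conduit", "central_office_1", "pop_west"],
}

if __name__ == "__main__":
    for pair, common in shared_facilities(routes).items():
        print(pair, "share", sorted(common))
```

In this made-up example the two links leave the building by different conduits but still meet at the same central office, which is exactly the kind of exposure worth asking about.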
One alternative for redundancy or backup that is becoming more common
and more feasible is the use of metropolitan area wireless
communications and/or satellite links. A satellite link, although
typically slower and more expensive (or at least not cheap), provides
nice redundancy because it can enable you to isolate your
communications from any local problems. Of course, if all your
connections are to systems in the same area as your office, remote
satellite connectivity might not help too much. (And, of course,
this is where I remind you all of the recent satellite outage that
disabled huge numbers of pagers and the difficulty of dealing with
failures in your backup communication systems.)
Physical Inaccessibility
I'll dredge up the ice storms from this past winter as an example of
how your office can be fine while it's just not possible to get there.
Other examples are, of course, earthquake, flood, trucker blockades on
European highways, bombs in the World Trade Center, and major parades.
If your business relies on physical access (e.g., a printing company, a
warehousing company, etc.), you could be in trouble. If you're an
organization that deals in knowledge or computing, you might be better
off. An easy way to deal with the latter is to ensure that your staff
has home computing and an account on a reliable ISP (or run your own
remote access servers, with lots of capacity for emergencies), and just
have them dial in for the duration. Voice mail, remote phone
forwarding, cell phones, and pagers all help limit (if you're lucky)
the impact of this kind of disaster.
Redundant Premises
The classic disaster recovery plan (for computing and communications,
at least) involves a redundant recovery site, just sitting and waiting
for something to go wrong. This is still common in the mainframe world,
where large data-processing capacity is needed on an ongoing basis. If
you absolutely need ongoing computing, an alternate site is likely
going to be part of your plan. Even if your needs are much simpler, you
can benefit from some forms of offsite redundancy.
Standby Sites
In the traditional case, a large, climate-controlled, raised-floor
computing center is loaded up with millions of dollars of equipment and
sits there idle waiting for something to go wrong somewhere else. These
sites have taken a number of forms. One of the most common is to be run
by a service company, providing backup services to a number of clients.
But some large organizations have backup computing centers that are
dedicated to them. It is also not unheard of for system vendors to
offer recovery services for their customers, and some cooperative
ventures also exist for their members' mutual benefit. This typically
isn't a cheap method of protection, but if you need it, you need it.
Distributed Sites
What is much more feasible, and much more practical in these days of
high-speed Internet connectivity, is the use of distributed computing
sites, which can provide backup for each other in the event of a
disaster. An obvious example is the use of Web server hosting at
service providers using some form of load sharing across multiple
servers. This can be expanded by distributing your primary computing
resources across multiple sites, taking care that you have similar
equipment at each site. This kind of distribution also makes it
possible to automatically store your backups offsite (given large
enough bandwidth). However, it is harder to justify distributing your
computing when your staff is all located in one place.
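Whether your bandwidth is "large enough" for offsite backups is a matter of arithmetic: data size divided by usable link speed. Here is a minimal Python sketch of that estimate. The function name and the 70 percent efficiency figure are assumptions for illustration; measure your own links before relying on any number.

```python
def transfer_hours(backup_gigabytes, link_megabits_per_second, efficiency=0.7):
    """Estimate hours needed to copy a backup of the given size over a link.

    Uses decimal units (1 GB = 10**9 bytes, 1 Mbit/s = 10**6 bits/s).
    `efficiency` is the fraction of nominal link speed assumed to be usable;
    the 0.7 default is a guess for this sketch, not a measured value.
    """
    if backup_gigabytes < 0 or link_megabits_per_second <= 0 or not 0 < efficiency <= 1:
        raise ValueError("size must be >= 0, speed > 0, and 0 < efficiency <= 1")
    bits_to_send = backup_gigabytes * 8 * 10**9
    usable_bits_per_second = link_megabits_per_second * 10**6 * efficiency
    return bits_to_send / usable_bits_per_second / 3600


if __name__ == "__main__":
    # Hypothetical figures: a 10 GB nightly backup over a 1.5 Mbit/s link.
    print(f"{transfer_hours(10, 1.5):.1f} hours")
```

If the answer is longer than your backup window, the offsite copy will never catch up, and you are back to carrying media between sites.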
Recovering
I mentioned it before, as many others have mentioned before me, but
I'll reiterate that the key to successful recovery is a proper plan and
proper documentation. Space limits restrict how much I can say here,
and I will refer you to the bookstores for DRP books, but I will
mention a few points.
-
Hardware. Is your recovery hardware compatible? Do you have
documentation of what makes your systems and installations unique?
-
Backups. You have them, of course, and they are offsite, of
course. But do you have the index to the media that will allow you to
find the necessary backups when you need them?
-
Names and addresses. Do you know what your machines should be
named and numbered?
-
People. Do you know how to get in touch with your staff at home,
or is your only phone list the office numbers on your desktop machine
that was just destroyed in the fire? Do you have a list of who to
contact first, who should do what, in what order?
-
Communications and connectivity. Do you know whom to call at
your service providers and carriers to get your connectivity adjusted
for your new or temporary location?
-
Status. Do you know who is in charge of the entire recovery
operation for your organization and how to get in touch with them and
when?
-
Food. And finally, do you have a list of food delivery places at
your recovery site so that you won't pass out from lack of sustenance
while working feverishly to put things back together?
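One way to keep a list like this honest is to record the answers in a structured form and check for blanks before you need them. The following is a minimal Python sketch of that idea; the class, its field names, and the sample value are all hypothetical, and a printed copy kept offsite matters more than any script.

```python
from dataclasses import dataclass, field


@dataclass
class RecoveryPlan:
    """Hypothetical record of the checklist items above; the fields are illustrative."""
    hardware_docs: str = ""                  # where the system documentation lives
    backup_index_location: str = ""          # where the index to the backup media is kept
    host_names_and_addresses: dict = field(default_factory=dict)
    staff_home_contacts: dict = field(default_factory=dict)
    carrier_contacts: dict = field(default_factory=dict)
    recovery_lead: str = ""                  # who is in charge of the recovery
    food_delivery: list = field(default_factory=list)


def missing_items(plan):
    """Return the names of checklist fields that are still empty."""
    return [name for name, value in vars(plan).items() if not value]


if __name__ == "__main__":
    plan = RecoveryPlan(recovery_lead="operations manager")
    print("Still missing:", ", ".join(missing_items(plan)))
```

Running a check like this during a quiet week is much cheaper than discovering the gaps during the recovery.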
Summary
In summary, plan, prepare, and pray you never need it.