Friday, January 14, 2011

Disaster Recovery

For the last decade or so, disaster recovery has been a hot topic among Information Technology professionals. We all know that disasters lurk around every corner. They can happen at any time, anywhere, and without warning. All of us in IT have, at one time or another, dealt with disasters of greater or lesser magnitude. Our measure of preparedness dictates how well we weather such storms - do we come out unharmed, or looking like we've been through a major war?
Sometimes disasters are relatively minor, such as the failure of a hard disk. Occasionally, and thankfully less often, they are horrendous, such as the First Interstate Bank fire of 1988 or the World Trade Center attacks of 2001. A disaster can be of limited scope, as with water dripping onto a computer system, or it can be of city-wide magnitude, as with a major earthquake.
All of these extremes fall within the realm of disaster planning. All levels of disaster must be examined, thought out, discussed, planned, implemented, and most importantly (and most often forgotten) tested over and over again.
This is the first in a series of articles examining this huge topic, level by level. My philosophy here is simple - you must start from the basics of good security, excellent backups, well-thought-out RAID, and well-managed daily procedures before you can even consider major disasters such as a significant earthquake or fire. For example, a perfectly running hot site is totally useless if your backups are not operating smoothly.
I classify disasters by order of magnitude. First, there are the minor disasters that occur in the normal course of business. These include hard disk failures, data corruption, network glitches, and everything else that seems to happen with computer systems. These kinds of problems are handled with solid warranty contracts, training, spare parts, RAID arrays, excellent backups, and good procedures.
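Even at this first level, "excellent backups" means backups that are verified every morning, not merely scheduled. Here is a minimal sketch of such a verification, assuming (hypothetically) that nightly archives land as .tar.gz files under /backups - the paths and thresholds below are placeholders, not a prescription:

    #!/usr/bin/env python
    """Nightly backup sanity check - a minimal sketch, not a product.

    Assumes date-stamped .tar.gz archives accumulate under BACKUP_DIR
    (a hypothetical layout - adjust everything below for your own site).
    """
    import os
    import sys
    import tarfile
    import time

    BACKUP_DIR = "/backups"        # assumed location of nightly archives
    MAX_AGE_HOURS = 26             # fail if the newest backup is older than this
    MIN_SIZE_BYTES = 1024 * 1024   # fail if suspiciously small (under 1 MB)

    def newest_backup(directory):
        """Return the most recently modified .tar.gz in directory, or None."""
        candidates = [os.path.join(directory, name)
                      for name in os.listdir(directory)
                      if name.endswith(".tar.gz")]
        return max(candidates, key=os.path.getmtime) if candidates else None

    def main():
        backup = newest_backup(BACKUP_DIR)
        if backup is None:
            sys.exit("FAIL: no backups found in %s" % BACKUP_DIR)
        age_hours = (time.time() - os.path.getmtime(backup)) / 3600.0
        if age_hours > MAX_AGE_HOURS:
            sys.exit("FAIL: newest backup %s is %.1f hours old" % (backup, age_hours))
        if os.path.getsize(backup) < MIN_SIZE_BYTES:
            sys.exit("FAIL: %s is suspiciously small" % backup)
        try:
            # Reading the member list end to end catches truncated archives.
            with tarfile.open(backup) as archive:
                members = archive.getmembers()
        except tarfile.TarError as exc:
            sys.exit("FAIL: %s is unreadable: %s" % (backup, exc))
        print("OK: %s (%.1f hours old, %d members)" % (backup, age_hours, len(members)))

    if __name__ == "__main__":
        main()

The point of a check like this is that it fails loudly, so a silently broken backup job is caught the next morning rather than on the day of the disaster.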
Next, there are relatively minor environmental issues. These span a vast array of real-life problems: insects chewing on wires, water seeping onto equipment, power failures (common here in California lately), small fires, and even chemical leaks into a building. These kinds of occurrences generally take IT departments totally by surprise - and often produce problems far greater than anticipated.
Next come the larger issues: major failures of multiple pieces of equipment. This could be as simple as a long power failure keeping the computer room down for twelve hours, or as complex as the roof of the IT department caving in. The point is that the computer equipment is non-functional, but the rest of the business remains intact.
Finally, I will touch on the larger issue of business continuity planning. This means that not only are the computers unable to operate, but the business unit itself (such as the building) has been destroyed or made unusable. The collapse of the World Trade Center buildings or a major earthquake would fall into this category.
As you work up the four levels of disaster planning, it typically becomes more and more difficult to sell management on the necessity of spending money in that area. In fact, many organizations have trouble getting the funds for a single tape drive, much less a working, operational disaster site. 
A huge portion of the job of disaster recovery is being able to communicate to management why this is important and how it protects the business. The concept of an insurance policy comes in very handy in these conversations - managers at all levels understand insurance.
Perhaps the most critical component of disaster recovery is also the one overlooked more often than any other: testing. Believe me, I've learned from hard experience that a disaster plan is completely worthless if it has not been tested and dry-run at least a dozen times. In addition, the plan must be maintained and re-tested constantly - at least once a quarter - or it quickly becomes unusable.
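Part of that quarterly re-test can even be automated. As a minimal sketch, assuming the newest archive is reachable at a hypothetical /backups/latest.tar.gz, the script below restores it into a scratch directory and checks that a few sentinel files actually came back. A real test also exercises the people and the full runbook, not just a script:

    #!/usr/bin/env python
    """Automated restore drill - a minimal sketch under stated assumptions.

    Assumes the newest archive sits at a hypothetical fixed path and that
    the sentinel file list is maintained by hand. Substitute files that
    actually matter to your business.
    """
    import os
    import shutil
    import tarfile
    import tempfile

    LATEST_BACKUP = "/backups/latest.tar.gz"          # assumed newest archive
    MUST_EXIST = ["etc/passwd", "var/db/orders.db"]   # hypothetical sentinels

    def restore_drill(archive_path, sentinels):
        """Unpack archive_path into a scratch dir and verify sentinel files."""
        scratch = tempfile.mkdtemp(prefix="restore-drill-")
        try:
            # The archive comes from our own backup job, so we trust its contents.
            with tarfile.open(archive_path) as archive:
                archive.extractall(scratch)
            missing = [p for p in sentinels
                       if not os.path.exists(os.path.join(scratch, p))]
            if missing:
                raise RuntimeError("restore incomplete, missing: " + ", ".join(missing))
            print("Restore drill passed: %d sentinel files verified" % len(sentinels))
        finally:
            shutil.rmtree(scratch)   # always remove the scratch copy

    if __name__ == "__main__":
        restore_drill(LATEST_BACKUP, MUST_EXIST)

A backup that has never been restored is not a backup; it's a hope. An automated drill like this turns that hope into a pass/fail answer every quarter.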
Another issue that is commonly overlooked is the training of personnel. People in IT tend to think about computers and code and screens, and we tend to forget that all of these things are operated by human beings. And those human beings need to know what they are doing. In other words, if there is a disaster, who does what? This folds neatly into the idea of testing - your testing must include the people who will actually be operating the disaster plan when a disaster occurs!
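One simple habit that helps: keep the role assignments in a single place and generate the call sheet from them, so the printed plan and the people named in it never drift apart. A toy sketch, with placeholder names and duties:

    """Who does what - a toy sketch of role assignments for the runbook.

    The names and duties are placeholders; the point is that assignments
    live in one place, print into the plan, and get exercised in every test.
    """
    ROLES = {
        "incident commander": ("A. Nguyen", "declares the disaster, runs the call tree"),
        "systems lead":       ("B. Ortiz",  "restores systems from the latest backup"),
        "communications":     ("C. Patel",  "keeps management and business units informed"),
        "facilities liaison": ("D. Kim",    "coordinates power, access, and safety"),
    }

    def print_call_sheet(roles):
        """Print a one-page call sheet to include in the printed plan."""
        for role, (person, duty) in sorted(roles.items()):
            print("%-20s %-12s %s" % (role, person, duty))

    if __name__ == "__main__":
        print_call_sheet(ROLES)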
Perhaps the most important lesson that I've learned about disaster planning is a simple datum: the disaster will NOT be what you've planned for. It is wise to look at your plans from multiple viewpoints. Will this work for an earthquake? A fire? A terrorist bomb? A hostage crisis? Or just water leaking through the ceiling over the weekend?
You also need to plan for the psychology of the people involved in the disaster. I don't know about you, but in the event of a major earthquake my mind will most definitely not be in its most sane and rational state! So don't expect people to perform any but the most robotic, thoroughly documented tasks - and don't expect them to do even those correctly!
So buckle your seatbelts and get ready for the next installment of this series - Laying the groundwork for disaster recovery.
