Information Security Jobs: Business Continuity and Disaster Recovery Planning

By Peter H. Gregory

Disasters, natural and man-made, strike with alarming unpredictability, throwing the organizations in their path into chaos. Sometimes an organization doesn’t survive, or survives only as a shadow of its former self. Much can be done to reduce the impact of disasters, giving organizations a far better chance of survival.

Disaster recovery planning (DRP) and business continuity planning (BCP) may not seem as though they should be part of information security. However, the core information security triad of confidentiality, integrity, and availability (CIA) does include them: DRP and BCP are vital activities for ensuring the availability of key systems in an organization.

BCP and DRP have their own array of concepts that are essential to information security professionals. Even if you don’t anticipate working in the BCP or DRP space, familiarity with these concepts may lead you or your organization to opportunities to improve disaster preparedness.

Types of disasters

Several types of man-made and natural disasters have a direct or an indirect effect on organizations. The types of disasters include the following:

  • Natural:

    • Weather: hurricane, tornado, ice storm, blizzard, or heavy rain

    • Geological: earthquake, tsunami, volcano, landslide, avalanche, or sinkhole

    • Other: pandemic, forest or range fire, flood, or solar storm

  • Man-made:

    • Social or political: war, riot, demonstration, or strike

    • Utilities: utility outage or fuel shortage

    • Material: hazardous material spill or radioactive materials leak

These and other types of disasters can have a direct or an indirect effect on organizations, including the following:

  • Interruptions in transportation

  • Communications outages

  • Workforce shortage

Business continuity planning and disaster recovery planning

Two primary activities take place after a disaster strikes:

  • Continuation of business processes using alternate facilities, equipment, or personnel, which is the purview of Business Continuity Planning (BCP)

  • Salvage of buildings and equipment, and restoration of primary work facilities, which is the purview of Disaster Recovery Planning (DRP)

These two activities are both concerned with getting the organization back on its feet after a disaster. Both are needed for the long-term survival of the organization.

Business impact assessment (BIA)

A business impact assessment (BIA) is a special type of risk assessment that is performed periodically to determine two key things: the most critical business processes in the organization, and the resources and other business processes that those critical processes depend on for continuous operation.

Upon completion, a BIA generally portrays the most important business processes in order of criticality (the most critical processes are listed first).

For each critical process, the maximum tolerable downtime (MTD) value is identified. MTD is the greatest amount of time that a business process can be incapacitated before the organization’s survival is at risk. The value of an MTD is difficult to determine, so setting it is largely a matter of judgment.
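
The ordering a BIA produces can be sketched in a few lines of Python. This is only an illustration: it assumes that a shorter MTD means a more critical process, which is a simplification, and the process names, MTD values, and dependencies are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class BusinessProcess:
    name: str
    mtd: timedelta          # maximum tolerable downtime
    depends_on: tuple = ()  # systems or processes this one relies on


def rank_by_criticality(processes):
    """Return processes with the shortest MTD first (assumed most critical)."""
    return sorted(processes, key=lambda p: p.mtd)


# Hypothetical sample data, not taken from any real assessment.
processes = [
    BusinessProcess("payroll", timedelta(days=7), ("hr_database",)),
    BusinessProcess("order_entry", timedelta(hours=4), ("web_store", "payments")),
    BusinessProcess("customer_support", timedelta(days=1), ("ticketing",)),
]

ranked = rank_by_criticality(processes)
for p in ranked:
    print(f"{p.name}: MTD {p.mtd}, depends on {', '.join(p.depends_on)}")
```

In practice, criticality reflects more than the MTD alone, so a real BIA would weigh additional factors before settling on the final order.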

Security professionals can derive value from the BIA by understanding which processes and underlying systems are the most important in an organization. Those systems will be the ones requiring the best protection.

Recovery targets

After identifying the most important business processes and systems in the BIA, the organization needs to establish recovery targets. These define how quickly, and in what condition, processes and IT systems must be running again. The recovery targets are as follows:

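  • Recovery time objective (RTO): Expressed as minutes, hours, or days, the period of time from disaster onset until the process or system is operational. The value of MTD should drive the RTO value: because the MTD is the outer limit of tolerable downtime, the RTO needs to be shorter than the MTD.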

  • Recovery point objective (RPO): Expressed as minutes, hours, or days, the period of maximum data loss after a disaster strikes. For instance, if an organization wants to lose no more than one hour’s worth of transactions, the RPO would be one hour.

  • Recovery consistency objective (RCO): A measure of the integrity and consistency of data in the emergency operations system compared to the original production system. RCO is expressed as a percentage and calculated as 1 minus the ratio of inconsistent entries to total entries: RCO = 1 - (number of inconsistent entries / number of entries).

  • Recovery capacity objective (RCapO): Expressed as a percentage, the capacity of temporary processing systems compared to production systems.
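
The two percentage-based targets above can be worked through with a small Python sketch. The function names and the sample figures are assumptions made for illustration; only the RCO formula comes from the definition in the list.

```python
def recovery_consistency_pct(inconsistent_entries: int, total_entries: int) -> float:
    """RCO as a percentage: 1 - (inconsistent / total), scaled to 100."""
    if total_entries <= 0:
        raise ValueError("total_entries must be positive")
    if not 0 <= inconsistent_entries <= total_entries:
        raise ValueError("inconsistent_entries must be between 0 and total_entries")
    return 100 * (total_entries - inconsistent_entries) / total_entries


def recovery_capacity_pct(temporary_capacity: float, production_capacity: float) -> float:
    """RCapO as a percentage: temporary capacity relative to production capacity."""
    if production_capacity <= 0:
        raise ValueError("production_capacity must be positive")
    return 100 * temporary_capacity / production_capacity


# Hypothetical figures: 50 inconsistent records out of 10,000 after failover,
# and a recovery site that handles 400 transactions per second versus 1,000
# in production.
rco = recovery_consistency_pct(50, 10_000)
rcapo = recovery_capacity_pct(400, 1_000)
print(f"RCO: {rco:.1f}%  RCapO: {rcapo:.1f}%")
```

Capacity can be measured in whatever unit suits the system (transactions per second, concurrent users, and so on), as long as the same unit is used for both the temporary and the production figures.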

Often, an organization will determine that a given system does not have sufficient resilience to successfully meet the recovery objectives after a disaster. In this case, the organization must change its recovery objectives to less ambitious figures or invest in equipment and processes that will facilitate recovery within targets.
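
A simple gap check along these lines might look like the following Python sketch. It is a rough illustration under stated assumptions: the function and parameter names are hypothetical, the estimated recovery time is assumed to come from testing or vendor figures, and the worst-case data loss is assumed to equal the interval between backups.

```python
from datetime import timedelta


def find_recovery_gaps(mtd, rto, rpo, estimated_recovery_time, backup_interval):
    """Return a list of reasons the system would miss its recovery objectives."""
    gaps = []
    if rto > mtd:
        gaps.append("RTO exceeds MTD")
    if estimated_recovery_time > rto:
        gaps.append("estimated recovery time exceeds RTO")
    # Assumption: data written since the last backup is lost in a disaster.
    if backup_interval > rpo:
        gaps.append("backup interval exceeds RPO")
    return gaps


# Hypothetical system: nightly backups and a six-hour rebuild.
gaps = find_recovery_gaps(
    mtd=timedelta(hours=8),
    rto=timedelta(hours=4),
    rpo=timedelta(hours=1),
    estimated_recovery_time=timedelta(hours=6),
    backup_interval=timedelta(hours=24),
)
for gap in gaps:
    print(gap)
```

Each gap reported corresponds to one of the two choices described above: relax the objective, or invest in faster recovery or more frequent backups.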

Contingency planning

Organizations need to develop written contingency plans that personnel can follow when a disaster occurs. These contingency plans should include the following considerations:

  • Primary operations personnel may be unwilling or unable to assist in the continuation and recovery of critical systems.

  • Personnel who will be following contingency plans may have less familiarity with these processes and systems.

Testing contingency plans

To determine the quality of contingency plans, organizations should periodically test them. These tests should include both primary and backup personnel. They may also serve as training, helping personnel better understand the procedures to follow during a disaster.

There are five types of tests:

  • Document review: Personnel read through contingency planning documents, and note any errors or omissions they find.

  • Walkthrough: Personnel review contingency planning documents in group sessions, noting errors and omissions they find.

  • Simulation: A scripted disaster is recited to personnel, who respond as though a real disaster is taking place.

  • Parallel test: Recovery systems are activated and process live data but in isolation so as not to disturb production systems that are still running. A parallel test helps test workload and whether recovery systems work properly.

  • Cutover test: Production systems are shut down or disconnected, and recovery systems are activated to manage live workload. This test is a complete end-to-end test of the capacity and integrity of the recovery system. It is also the riskiest test: if a cutover test fails, the recovery systems carrying the live workload may stop working, bringing key business processes to a halt.