Domain 7. Chapter 18 Flashcards
(42 cards)
Chapter 18 Disaster Recovery Planning
1. The Nature of Disaster
1.1 Natural Disasters
- Earthquakes
- Floods
- Fires
- Pandemics
- Other Natural Events
1.2 Human-Made Disasters
- Fires
- Acts of Terrorism
- Bombings/Explosions
- Power Outages
- Network, Utility, and Infrastructure Failures
- Hardware/Software Failures
- Strikes/Picketing
- Theft/Vandalism
Natural disasters reflect the occasional fury of our habitat: violent occurrences that result from changes in the earth’s surface or atmosphere that are beyond human control.
During the BCP/DRP process, your assessment team should analyze all of your organization’s operating locations and gauge the impact that such events might have on your business.
If your business is geographically diverse, it is prudent to include local emergency response experts on your planning team.
2. Understand System Resilience, High Availability, and Fault Tolerance
A primary goal of system resilience and fault tolerance is to eliminate single points of failure in critical business systems.
A single point of failure (SPOF) is any component that can cause an entire system to fail. If a database-dependent website includes multiple web servers all served by a single database server, the database server is a single point of failure.
System resilience refers to the ability of a system to maintain an acceptable level of service during an adverse event.
Fault tolerance is the ability of a system to suffer a fault but continue to operate. Fault tolerance is achieved by adding redundant components, such as additional disks within a properly configured RAID array or additional servers within a failover clustered configuration.
High availability is the use of redundant technology components to allow a system to quickly recover from a failure after experiencing a brief disruption. High availability is often achieved through the use of load balancing and failover servers.
Technology professionals measure the objective and effectiveness of these controls by the percentage of the time that a system is available. For example, a fairly low availability threshold would be to specify that a system must be available 99.9 percent of the time (or “three nines” of availability). This means that the system may only experience 0.1 percent of downtime during whatever period is measured. If you apply this metric to a 30-day month of system operation, 99.9 percent availability would require less than 44 minutes of downtime. If you move to a 99.999 percent (or “five nines”) requirement, the system would be permitted only about 26 seconds of downtime over the same 30-day month.
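The availability arithmetic above can be checked with a short sketch. The function name `allowed_downtime_minutes` and the default 30-day period are illustrative assumptions, not part of the source material.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: int = 30) -> float:
    """Return the downtime budget, in minutes, for an availability target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    total_minutes = period_days * 24 * 60  # minutes in the measured period
    return total_minutes * (100.0 - availability_percent) / 100.0


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% over 30 days: {minutes:.2f} min ({minutes * 60:.0f} s)")
```

For a 30-day month this gives roughly 43.2 minutes at three nines and about 26 seconds at five nines.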
2.1 Protecting Hard Drives
A RAID array includes two or more disks, and most RAID configurations will continue to operate even after one of the disks fails. Some of the common RAID configurations are as follows:
- RAID-0 This is also called striping. It uses two or more disks and improves the disk subsystem performance, but it does not provide fault tolerance.
- RAID-1 This is also called mirroring. It uses two disks, which both hold the same data. If one disk fails, the other disk includes the data so that a system can continue to operate after a single disk fails.
- RAID-5 This is also called striping with parity. It uses three or more disks with the equivalent of one disk holding parity information. This parity information allows the reconstruction of data through mathematical calculations if a single disk is lost. If any single disk fails, the RAID array will continue to operate, though it will be slower.
- RAID-6 This offers an alternative approach to disk striping with parity. It functions in the same manner as RAID-5 but stores parity information on two disks, protecting against the failure of two separate disks but requiring a minimum of four disks to implement.
- RAID-10 This is also known as RAID 1 + 0 or a stripe of mirrors, and it is configured as two or more mirrors (RAID-1), with the mirrors combined in a striped (RAID-0) configuration. It uses at least four disks but can support more as long as disks are added in pairs. It will continue to operate even if multiple disks fail, as long as at least one drive in each mirror continues to function. However, if both drives in any one mirror fail, the entire array fails.
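The parity reconstruction described for RAID-5 and the survival rule described for RAID-10 can be illustrated with a minimal sketch. It assumes XOR parity over equal-length blocks and ignores details of real arrays such as rotating parity across disks; all function and disk names are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def compute_parity(data_blocks):
    """Parity block for one stripe (simplified RAID-5 style XOR parity)."""
    return xor_blocks(data_blocks)


def rebuild_missing(surviving_blocks, parity):
    """Recover the single missing data block from the survivors plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


def raid10_survives(failed_disks, mirrors):
    """RAID-10 keeps running while every mirror still has one healthy disk."""
    failed = set(failed_disks)
    return all(any(disk not in failed for disk in mirror) for mirror in mirrors)


if __name__ == "__main__":
    stripe = [b"ABCD", b"EFGH", b"IJKL"]
    parity = compute_parity(stripe)
    # Pretend the disk holding the second block has failed.
    print(rebuild_missing([stripe[0], stripe[2]], parity))

    mirrors = [("a1", "a2"), ("b1", "b2")]
    print(raid10_survives(["a1", "b2"], mirrors))  # one disk lost per mirror
    print(raid10_survives(["a1", "a2"], mirrors))  # a whole mirror lost
```

XOR parity can rebuild only one missing block per stripe, which matches the single-disk tolerance of RAID-5.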
2.2 Protecting Servers
Fault tolerance can be added for critical servers with failover clusters. A failover cluster includes two or more servers, and if one of the servers fails, another server in the cluster can take over its load in an automatic process called failover. Failover clusters can include multiple servers (not just two), and they can also provide fault tolerance for multiple services or applications.
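The failover behavior can be sketched as a toy model. This is not how any particular clustering product works; the `Node` and `FailoverCluster` classes and the "least-loaded survivor" placement rule are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    healthy: bool = True
    services: set = field(default_factory=set)


class FailoverCluster:
    def __init__(self, nodes):
        self.nodes = list(nodes)

    def fail(self, name):
        """Mark a node failed and move its services to healthy nodes."""
        failed = next(n for n in self.nodes if n.name == name)
        failed.healthy = False
        survivors = [n for n in self.nodes if n.healthy]
        if not survivors:
            raise RuntimeError("no healthy nodes left; the cluster is down")
        while failed.services:
            # Assumed placement rule: pick the survivor with the fewest services.
            target = min(survivors, key=lambda n: len(n.services))
            target.services.add(failed.services.pop())

    def owner_of(self, service):
        """Return the healthy node currently running the service, if any."""
        for node in self.nodes:
            if node.healthy and service in node.services:
                return node.name
        return None


if __name__ == "__main__":
    cluster = FailoverCluster([
        Node("a", services={"web", "db"}),
        Node("b", services={"mail"}),
        Node("c"),
    ])
    cluster.fail("a")
    print(cluster.owner_of("web"), cluster.owner_of("db"), cluster.owner_of("mail"))
```

The example uses three nodes and several services to reflect the point that a cluster is not limited to two servers or a single application.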
2.3 Protecting Power Sources
Fault tolerance can be added for power sources with a UPS, a generator, or both. In general, a UPS provides battery-supplied power for a short period of time, between 5 and 30 minutes, and a generator provides long-term power. The goal of a UPS is to provide power long enough to complete a logical shutdown of a system, or until a generator is powered on and providing stable power.
2.4 Trusted Recovery
Trusted recovery provides assurances that after a failure or crash, the system is just as secure as it was before the failure or crash occurred. Depending on the failure, the recovery may be automated or require manual intervention by an administrator.
Systems can be designed so that they fail in a fail-secure state or a fail-open state. A fail-secure system will default to a secure state in the event of a failure, blocking all access. A fail-open system will fail in an open state, granting all access.
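The difference between the two failure modes can be shown with a small sketch. The `check_access` function and the `policy_lookup` callable are hypothetical; the only point illustrated is what happens when the control itself fails.

```python
def check_access(user, policy_lookup, fail_mode="secure"):
    """Return True if access is granted.

    policy_lookup is a callable that returns True or False for a user and
    may raise an exception if the access control component has failed.
    """
    if fail_mode not in ("secure", "open"):
        raise ValueError("fail_mode must be 'secure' or 'open'")
    try:
        return bool(policy_lookup(user))
    except Exception:
        # The control failed: fail-secure blocks everyone, fail-open admits everyone.
        return fail_mode == "open"


def broken_lookup(user):
    raise ConnectionError("policy server unreachable")


if __name__ == "__main__":
    print(check_access("alice", broken_lookup, fail_mode="secure"))
    print(check_access("alice", broken_lookup, fail_mode="open"))
```

Which mode is appropriate depends on whether availability or protection matters more for the system in question.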
The Common Criteria define four types of trusted recovery:
- Manual Recovery If a system fails, it does not fail in a secure state. Instead, an administrator is required to manually perform the actions necessary to implement a secured or trusted recovery after a failure or system crash.
- Automated Recovery The system is able to perform trusted recovery activities to restore itself against at least one type of failure.
- Automated Recovery without Undue Loss This is similar to automated recovery in that a system can restore itself against at least one type of failure. However, it includes mechanisms to ensure that specific objects are protected to prevent their loss. A method of automated recovery that protects against undue loss would include steps to restore data or other objects.
- Function Recovery Systems that support function recovery are able to automatically recover specific functions. This state ensures that the system is able to successfully complete the recovery for the functions, or that the system will be able to roll back the changes to return to a secure state.
2.5 Quality of Service
Quality of service (QoS) controls protect the availability of data networks under load. Many different factors contribute to the quality of the end-user experience, and QoS attempts to manage all of those factors to create an experience that meets business requirements.
Some of the factors contributing to QoS are as follows:
- Bandwidth The network capacity available to carry communications.
- Latency The time it takes a packet to travel from source to destination.
- Jitter The variation in latency between different packets.
- Packet Loss Some packets may be lost between source and destination, requiring retransmission.
- Interference Electrical noise, faulty equipment, and other factors may corrupt the contents of packets.
In addition to controlling these factors, QoS systems often prioritize certain traffic types that have low tolerance for interference and/or have high business requirements.
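Several of the factors listed above can be computed from simple measurements. The sketch below assumes one latency sample per packet that arrived and treats jitter as the mean absolute difference between consecutive latencies, which is one simplified definition among several; the function name is hypothetical.

```python
from statistics import mean


def qos_summary(latencies_ms, packets_sent):
    """Summarize latency, jitter, and packet loss for one measurement run.

    latencies_ms holds one latency value for each packet that arrived.
    """
    received = len(latencies_ms)
    if packets_sent <= 0 or received > packets_sent:
        raise ValueError("packets_sent must be positive and >= packets received")
    diffs = [abs(b - a) for a, b in zip(latencies_ms, latencies_ms[1:])]
    return {
        "avg_latency_ms": mean(latencies_ms) if latencies_ms else None,
        "jitter_ms": mean(diffs) if diffs else 0.0,
        "loss_percent": 100.0 * (packets_sent - received) / packets_sent,
    }


if __name__ == "__main__":
    print(qos_summary([20, 22, 21, 25], packets_sent=5))
```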
3. Recovery Strategy
When a disaster interrupts your business, your disaster recovery plan should kick in nearly automatically and begin providing support for recovery operations. The disaster recovery plan should be designed so that the first employees on the scene can immediately begin the recovery effort in an organized fashion, even if members of the official DRP team have not yet arrived on site.
If your property insurance includes an actual cash value (ACV) clause, then your damaged property will be compensated based on the fair market value of the items on the date of loss, less all accumulated depreciation since the time of their purchase.
3.1 Business Unit and Functional Priorities
To recover your business operations with the greatest possible efficiency, you must engineer your disaster recovery plan so that those business units with the highest priority are recovered first. You must identify and prioritize critical business functions as well so that you can define which functions you want to restore after a disaster or failure and in what order. The business impact analysis (BIA) you developed during your business continuity work is an excellent resource when performing this task.
The output from this task should be a simple listing of business units in priority order.
However, a more detailed list, broken down into specific business processes listed in order of priority, would be a much more useful deliverable.
The final result should be a checklist of items in priority order, each with its own risk and cost assessment, and a corresponding set of recovery objectives and milestones. As discussed in Chapter 3, these include the mean time to repair (MTTR), maximum tolerable downtime (MTD), recovery time objective (RTO), and recovery point objective (RPO).
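A prioritized checklist of this kind can be represented as a small data structure. The sketch below is illustrative only: the field names, the sample processes, and the tie-breaking rule (shorter RTO first) are assumptions. The check that an RTO must not exceed the MTD reflects the usual relationship between those two objectives.

```python
from dataclasses import dataclass


@dataclass
class RecoveryItem:
    process: str
    priority: int      # 1 = recover first
    mtd_hours: float   # maximum tolerable downtime
    rto_hours: float   # recovery time objective
    rpo_hours: float   # recovery point objective


def build_checklist(items):
    """Validate the objectives and return the items in recovery order."""
    for item in items:
        if item.rto_hours > item.mtd_hours:
            raise ValueError(f"{item.process}: RTO exceeds MTD")
    return sorted(items, key=lambda i: (i.priority, i.rto_hours))


if __name__ == "__main__":
    checklist = build_checklist([
        RecoveryItem("Payroll", priority=3, mtd_hours=72, rto_hours=48, rpo_hours=24),
        RecoveryItem("Order processing", priority=1, mtd_hours=8, rto_hours=4, rpo_hours=1),
        RecoveryItem("Customer support", priority=2, mtd_hours=24, rto_hours=12, rpo_hours=4),
    ])
    for step, item in enumerate(checklist, start=1):
        print(step, item.process, f"RTO {item.rto_hours}h", f"RPO {item.rpo_hours}h")
```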
3.2 Crisis Management
If a disaster strikes your organization, panic is likely to set in. The best way to combat this is with an organized disaster recovery plan. The individuals in your business who are most likely to first notice an emergency situation (such as security guards and technical personnel) should be fully trained in disaster recovery procedures and know the proper notification procedures and immediate response mechanisms.
3.3 Emergency Communications
When a disaster strikes, it is important that the organization be able to communicate internally as well as with the outside world. A disaster of any significance is easily noticed, but if an organization is unable to keep the outside world informed of its recovery status, the public is apt to fear the worst and assume that the organization is unable to recover. It is also essential that the organization be able to communicate internally during a disaster so that employees know what is expected of them—whether they are to return to work or report to another location, for instance.
3.4 Workgroup Recovery
When designing a disaster recovery plan, it’s important to keep your goal in mind—the restoration of workgroups to the point that they can resume their activities in their usual work locations. It’s easy to get sidetracked and think of disaster recovery as purely an IT effort focused on restoring systems and processes to working order.
3.5 Alternate Processing Sites
One of the most important elements of the disaster recovery plan is the selection of alternate processing sites to be used when the primary sites are unavailable.
3.5.1 Cold Sites
Cold sites are standby facilities large enough to handle the processing load of an organization and equipped with appropriate electrical and environmental support systems. The major advantage of a cold site is its relatively low cost: there’s no computing base to maintain and no monthly telecommunications bill when the site is idle.
3.5.2 Hot Sites
A hot site is the exact opposite of the cold site. In this configuration, a backup facility is maintained in constant working order, with a full complement of servers, workstations, and communications links ready to assume primary operations responsibilities. The servers and workstations are all preconfigured and loaded with appropriate operating system and application software.
If the data at the hot site is not kept current through replication from the primary site, disaster recovery managers have three options to activate the hot site:
- If there is sufficient time before the primary site must be shut down, they can force replication between the two sites right before the transition of operational control.
- If replication is impossible, managers may carry backup tapes of the transaction logs from the primary site to the hot site and manually reapply any transactions that took place since the last replication.
- If there are no available backups and it isn’t possible to force replication, the disaster recovery team may simply accept the loss of some portion of the data. This should only be done when the loss is within the organization’s recovery point objective (RPO).
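The three options above amount to a short decision procedure, sketched below. The enum, the function name, and the parameters are hypothetical. The final `ESCALATE` outcome is an assumption added for the case the text does not cover, where the expected loss exceeds the RPO.

```python
from enum import Enum


class Activation(Enum):
    FORCE_REPLICATION = "force replication before transferring control"
    REAPPLY_LOGS = "carry transaction log backups to the hot site and reapply them"
    ACCEPT_LOSS = "accept the data loss (within the RPO)"
    ESCALATE = "expected loss exceeds the RPO; escalate the decision"


def choose_activation(can_force_replication, backups_available,
                      expected_loss_hours, rpo_hours):
    """Pick a hot site activation option in the order the options are listed."""
    if can_force_replication:
        return Activation.FORCE_REPLICATION
    if backups_available:
        return Activation.REAPPLY_LOGS
    if expected_loss_hours <= rpo_hours:
        return Activation.ACCEPT_LOSS
    return Activation.ESCALATE  # assumption: not covered by the source text


if __name__ == "__main__":
    print(choose_activation(False, True, expected_loss_hours=6, rpo_hours=4).value)
```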
3.5.3 Warm Sites
Warm sites occupy the middle ground between hot and cold sites for disaster recovery specialists. They always contain the equipment and data circuits necessary to rapidly establish operations. As with hot sites, this equipment is usually preconfigured and ready to run appropriate applications to support an organization’s operations. Unlike hot sites, however, warm sites do not typically contain copies of the client’s data. The main requirement in bringing a warm site to full operational status is the transportation of appropriate backup media to the site and restoration of critical data on the standby servers.
3.5.4 Mobile Sites
Mobile sites are non-mainstream alternatives to traditional recovery sites. They typically consist of self-contained trailers or other easily relocated units. These sites include all the environmental control systems necessary to maintain a safe computing environment. Larger corporations sometimes maintain these sites on a “fly-away” basis, ready to deploy them to any operating location around the world via air, rail, sea, or surface transportation. Smaller firms might contract with a mobile site vendor in their local area to provide these services on an as-needed basis.
3.5.5 Cloud Computing
Many organizations now turn to cloud computing as their preferred disaster recovery option. Infrastructure-as-a-service (IaaS) providers, such as Amazon Web Services (AWS), Microsoft Azure, and Google Compute Engine, offer on-demand service at low cost.
3.5.6 Mutual Assistance Agreements (MAAs)
Mutual assistance agreements (MAAs), also called reciprocal agreements, are popular in disaster recovery literature but are rarely implemented in real-world practice. Under an MAA, two organizations pledge to assist each other in the event of a disaster by sharing computing facilities or other technological resources.
3.6 Database Recovery
3.6.1 Electronic Vaulting
In an electronic vaulting scenario, database backups are moved to a remote site using bulk transfers. The remote location may be a dedicated alternative recovery site (such as a hot site) or simply an offsite location managed within the company or by a contractor for the purpose of maintaining backup data.
If you use electronic vaulting, remember that there may be a significant delay between the time you declare a disaster and the time your database is ready for operation with current data. If you decide to activate a recovery site, technicians will need to retrieve the appropriate backups from the electronic vault and apply them to the soon-to-be production servers at the recovery site.
3.6.2 Remote Journaling
With remote journaling, data transfers are performed in a more expeditious manner. Data transfers still occur in a bulk transfer mode, but they occur on a more frequent basis, usually once every hour and sometimes more frequently. Unlike electronic vaulting scenarios, where entire database backup files are transferred, remote journaling setups transfer copies of the database transaction logs containing the transactions that occurred since the previous bulk transfer.
3.6.3 Remote Mirroring
Remote mirroring is the most advanced database backup solution. Not surprisingly, it’s also the most expensive! Remote mirroring goes beyond the technology used by remote journaling and electronic vaulting; with remote mirroring, a live database server is maintained at the backup site. The remote server receives copies of the database modifications at the same time they are applied to the production server at the primary site. Therefore, the mirrored server is ready to take over an operational role at a moment’s notice.
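One way to compare the three approaches is by the worst-case window of lost transactions, which is roughly the time between transfers. In the sketch below, the hourly interval for remote journaling follows the text, while the nightly interval for electronic vaulting is an assumed value chosen for illustration. Remote mirroring is modeled as a zero interval because changes reach the remote server as they are applied.

```python
# Minutes between transfers for each approach (vaulting value is an assumption).
TRANSFER_INTERVAL_MINUTES = {
    "electronic vaulting": 24 * 60,  # assumed nightly bulk backup transfer
    "remote journaling": 60,         # transaction logs, typically hourly
    "remote mirroring": 0,           # changes copied as they are applied
}


def worst_case_loss_minutes(method):
    """Approximate worst-case data loss as the time since the last transfer."""
    return TRANSFER_INTERVAL_MINUTES[method]


def methods_meeting_rpo(rpo_minutes):
    """List the approaches whose worst-case loss fits within the RPO."""
    return [
        method
        for method, interval in TRANSFER_INTERVAL_MINUTES.items()
        if interval <= rpo_minutes
    ]


if __name__ == "__main__":
    for rpo in (5, 60, 24 * 60):
        print(rpo, methods_meeting_rpo(rpo))
```

This ignores restore time, which matters as well: vaulted backups must still be retrieved and applied before the database is usable.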
4. Recovery Plan Development
Once you’ve established your business unit priorities and have a good idea of the appropriate alternative recovery sites for your organization, it’s time to put pen to paper and begin drafting a true disaster recovery plan.
The following list includes various types of documents worth considering:
- Executive summary providing a high-level overview of the plan
Managers and public relations personnel will have a simple document that walks them through a high-level view of the coordinated symphony that is an active disaster recovery effort without requiring interpretation from team members busy with tasks directly related to that effort.
- Department-specific plans
Personnel who need to refresh themselves on the disaster recovery procedures that affect various parts of the organization will be able to refer to their department-specific plans.
- Technical guides for IT personnel responsible for implementing and maintaining critical backup systems
- Checklists for individuals on the disaster recovery team
Critical disaster recovery team members will have checklists to help guide their actions amid the chaotic atmosphere of a disaster.
- Full copies of the plan for critical disaster recovery team members
4.1 Emergency Response
A disaster recovery plan should contain simple yet comprehensive instructions for essential personnel to follow immediately upon recognizing that a disaster is in progress or is imminent.
Emergency-response plans are often put together in the form of checklists provided to responders. When designing such checklists, keep one essential design principle in mind: arrange the checklist tasks in order of priority, with the most important task first!
Among these essential tasks is the formal declaration of a disaster. The response plan should include clear criteria for activation of the disaster recovery plan, define who has the authority to declare a disaster, and describe the notification procedures covered in the next section.
4.2 Personnel and Communications
A disaster recovery plan should also contain a list of personnel to contact in the event of a disaster. Usually, this includes key members of the DRP team as well as personnel who execute critical disaster recovery tasks throughout the organization.
This response checklist should include alternate means of contact (for example, pager numbers and mobile phone numbers) as well as backup contacts for each role in case the primary contact cannot be reached or is unable to get to the recovery site.
Many firms organize their notification checklists in a “telephone tree” style: each member of the tree contacts the person below them, spreading the notification burden among members of the team instead of relying on one person to make lots of telephone calls.
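A telephone tree can be modeled as a mapping from each person to the people they call. The sketch below walks such a tree level by level to show who is notified in each round; the role names and the tree shape are hypothetical.

```python
def notification_rounds(tree, root):
    """Return the people notified in each round, starting from the root caller.

    tree maps each person to the list of people they are responsible for calling.
    """
    rounds = []
    current = [root]
    seen = {root}
    while current:
        rounds.append(current)
        next_round = []
        for caller in current:
            for callee in tree.get(caller, []):
                if callee not in seen:  # avoid calling anyone twice
                    seen.add(callee)
                    next_round.append(callee)
        current = next_round
    return rounds


def max_calls_per_person(tree):
    """Largest number of calls any single person has to make."""
    return max((len(callees) for callees in tree.values()), default=0)


if __name__ == "__main__":
    tree = {
        "DRP lead": ["IT manager", "Facilities manager"],
        "IT manager": ["DBA", "Network admin"],
        "Facilities manager": ["Security supervisor"],
    }
    for number, people in enumerate(notification_rounds(tree, "DRP lead")):
        print(number, people)
    print("max calls per person:", max_calls_per_person(tree))
```

In this example six people are reached in three rounds, and nobody makes more than two calls.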