
You can’t predict events such as sudden power outages or natural disasters, but you can prepare your IT infrastructure to cope with them by developing a comprehensive disaster recovery (DR) plan. With the necessary tools and resources in place, you can minimize the damage caused by an unplanned event.

Key Elements of Disaster Recovery Planning

A well-devised DR plan consists of instructions and procedures that ensure fast business recovery and help to avoid or mitigate the consequences of a disaster.

A good DR plan must include the following components:

  • Documentation. Prepare well-structured, thorough DR documentation that lists:
    • All vital components of your IT infrastructure (hardware and software).
    • The sequence of measures needed to resume business operations.
    • Responsible team members.
  • Scope and Dependencies. Your recovery scope does not have to include your entire IT infrastructure, because not all components are equally critical for business continuity. Determine which VMs to include in your recovery scope to achieve your recovery time objectives. These can be VMs that store business-critical information or run IT systems and important applications. Also consider that some VMs depend on others. For example, information on one VM can depend on information housed on another. Document these dependencies so that, in the event of a disaster, your staff can resume working with minimal interruption.
  • Responsible Team and Staff Training. Assign specific staff members to be responsible for disaster recovery activities and staff coordination. Communicate the DR plan to all your employees and introduce the DR team to them.
  • Recovery Time Objective (RTO) and Recovery Point Objective (RPO). Set both objectives. RTO determines how long your business can go without a particular application, system, or VM running. RPO defines how much data your business can afford to lose without a negative impact on business operations. Keep in mind that you can’t afford long downtime or data loss for VMs housing customer-related applications. On the other hand, VMs housing administrative applications can withstand some data loss and downtime.
  • Testing and Optimization. To make sure your DR plan is effective, test it to find weak points and inconsistencies. As your IT infrastructure is upgraded and enhanced, revise and optimize your DR plan accordingly. Keep the plan consistent and up to date so that it remains effective when a disaster occurs.
  • Automation. Use a reliable DR solution to automate the entire recovery process. Automated DR operations, from failover to failback, free IT managers from a large amount of complex manual work. Moreover, the whole recovery process takes less time and saves money, thanks to minimal downtime.
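
The scope, dependency, and RTO/RPO points above can be captured in a small planning script. The following is a minimal sketch and not part of NAKIVO Backup & Replication: the VM names, the objective values, and the `VmPlan`, `recovery_order`, and `rpo_violations` names are all hypothetical. It orders VMs so that each one is recovered after the VMs it depends on, and flags VMs whose replication interval is too long to meet their RPO.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class VmPlan:
    """Hypothetical planning record for one VM in the recovery scope."""
    name: str
    rto_minutes: int  # maximum tolerable downtime
    rpo_minutes: int  # maximum tolerable data loss, expressed as time
    depends_on: list[str] = field(default_factory=list)


def recovery_order(plans: list[VmPlan]) -> list[str]:
    """Return VM names so that every VM comes after its dependencies."""
    graph = {p.name: set(p.depends_on) for p in plans}
    # static_order() raises graphlib.CycleError on circular dependencies.
    return list(TopologicalSorter(graph).static_order())


def rpo_violations(plans: list[VmPlan], interval_minutes: dict[str, int]) -> list[str]:
    """List VMs that are replicated less often than their RPO allows."""
    return [p.name for p in plans if interval_minutes[p.name] > p.rpo_minutes]


# Illustrative values only: customer-related VMs get tight objectives,
# the administrative VM tolerates more downtime and data loss.
plans = [
    VmPlan("web-frontend", rto_minutes=15, rpo_minutes=15, depends_on=["app-server"]),
    VmPlan("app-server", rto_minutes=15, rpo_minutes=15, depends_on=["database"]),
    VmPlan("database", rto_minutes=15, rpo_minutes=5),
    VmPlan("admin-tools", rto_minutes=480, rpo_minutes=1440),
]
intervals = {"web-frontend": 15, "app-server": 15, "database": 30, "admin-tools": 1440}

order = recovery_order(plans)
late = rpo_violations(plans, intervals)
print("Recovery order:", order)
print("RPO not met:", late)
```

In this example the database is replicated every 30 minutes against a 5-minute RPO, so it is the only VM reported as not meeting its objective.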

NAKIVO Backup & Replication – Effective DR Solution

NAKIVO Backup & Replication allows you to address all major DR planning points by creating automated DR workflows for VMware, Microsoft Hyper-V, and AWS EC2 environments. The product significantly reduces the complexity of DR planning, improves disaster recovery preparedness, and helps to achieve tighter RTOs. By installing the product on a physical or a virtual machine, you can automatically create VM backups and replicas, recover individual objects or VMs, and fail over to a VM replica. Moreover, Enterprise editions of the product allow you to recover an entire site, not just VMs.

Site Recovery

Your DR workflow is a set of actions that can vary in complexity, depending on your needs and objectives. With the Site Recovery feature of NAKIVO Backup & Replication, you can include up to 200 actions in a single job. Available actions are: failover, failback, start or stop VMs and instances, run or stop jobs, run script, attach or detach backup repositories, send emails, wait, and check condition. By arranging actions and conditions into one automated algorithm, you can create Site Recovery jobs of any complexity to meet your business needs.
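
To make the idea of a job as an ordered list of actions more concrete, here is a minimal sketch of how such a workflow could be modeled. It is not the product's API or job format: the `ActionType`, `Action`, and `SiteRecoveryJob` names, the target strings, and the email address are hypothetical. Only the action types and the 200-action limit come from the description above.

```python
from dataclasses import dataclass, field
from enum import Enum

MAX_ACTIONS = 200  # per-job limit mentioned in the text


class ActionType(Enum):
    FAILOVER = "failover"
    FAILBACK = "failback"
    START_VMS = "start VMs"
    STOP_VMS = "stop VMs"
    RUN_JOBS = "run jobs"
    STOP_JOBS = "stop jobs"
    RUN_SCRIPT = "run script"
    ATTACH_REPOSITORY = "attach repository"
    DETACH_REPOSITORY = "detach repository"
    SEND_EMAIL = "send email"
    WAIT = "wait"
    CHECK_CONDITION = "check condition"


@dataclass
class Action:
    kind: ActionType
    target: str = ""  # hypothetical free-text target (VM, job, address, ...)


@dataclass
class SiteRecoveryJob:
    name: str
    actions: list[Action] = field(default_factory=list)

    def add(self, kind: ActionType, target: str = "") -> "SiteRecoveryJob":
        """Append an action, enforcing the per-job action limit."""
        if len(self.actions) >= MAX_ACTIONS:
            raise ValueError(f"a job can hold at most {MAX_ACTIONS} actions")
        self.actions.append(Action(kind, target))
        return self  # returning self allows chained calls


# Illustrative workflow: check a condition, fail over, start VMs, notify staff.
job = (
    SiteRecoveryJob("primary-site-outage")
    .add(ActionType.CHECK_CONDITION, "primary site unreachable")
    .add(ActionType.FAILOVER, "database")
    .add(ActionType.START_VMS, "app-server, web-frontend")
    .add(ActionType.SEND_EMAIL, "dr-team@example.com")
)

for step, action in enumerate(job.actions, start=1):
    print(step, action.kind.value, action.target)
```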

Site Recovery jobs can be run in two modes: test and production. By testing your jobs in advance, you can verify their efficiency and validity, which helps the jobs run smoothly in production mode when disaster strikes. In test mode, actions such as failover, failback, start/stop VMs (instances), and attach/detach repositories are reversed upon job completion, bringing your environment back to its initial state. Site Recovery tests can be run automatically and on schedule. In production mode, jobs can only be launched manually.
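
The difference between the two modes can be illustrated with a small simulation. This is only a sketch of the general technique (recording an undo step for each action and replaying the undo steps after a test run), not a description of how the product is implemented. The environment is modeled as a plain dictionary of VM power states, the `start_vm`, `stop_vm`, and `run_job` helpers are hypothetical, and undoing in reverse order is an assumption.

```python
from typing import Callable

State = dict[str, str]  # hypothetical model: VM name -> "on" or "off"
Undo = Callable[[], None]
Step = Callable[[State, str], Undo]


def _set_power(state: State, vm: str, power: str) -> Undo:
    """Change a VM's power state and return a function that restores it."""
    previous = state[vm]
    state[vm] = power

    def undo() -> None:
        state[vm] = previous

    return undo


def start_vm(state: State, vm: str) -> Undo:
    return _set_power(state, vm, "on")


def stop_vm(state: State, vm: str) -> Undo:
    return _set_power(state, vm, "off")


def run_job(state: State, steps: list[tuple[Step, str]], test_mode: bool) -> None:
    """Run all steps; in test mode, reverse them once the job completes."""
    undo_stack: list[Undo] = []
    for step, vm in steps:
        undo_stack.append(step(state, vm))
    if test_mode:
        # Assumption: actions are undone in reverse order of execution.
        while undo_stack:
            undo_stack.pop()()


env = {"db-source": "on", "db-replica": "off"}
steps = [(stop_vm, "db-source"), (start_vm, "db-replica")]

run_job(env, steps, test_mode=True)
after_test = dict(env)        # environment is back in its initial state

run_job(env, steps, test_mode=False)
after_production = dict(env)  # changes are kept

print(after_test)
print(after_production)
```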

Among the key advantages of the Site Recovery feature are:

  • Flexibility
  • Ease of use
  • Cost-efficiency

Failover as Part of Site Recovery Workflows

Failover is the process of switching from a source (production) VM to a VM replica for the purpose of transferring workloads. It is an important part of a company’s DR plan and is dictated by the RPO/RTO values set for the VMs. With a standby copy of your production VMs at a DR site, you can simplify and automate the DR process by creating VM failover jobs. Additionally, you can specify which VM replicas should be powered on and create rules to reconfigure VM network settings.

NAKIVO Backup & Replication allows you to protect VMs running within a cluster, replicate VMs, and fail over to replicas. The application automatically tracks the host on which a VM resides so that it can replicate that VM. Clusters as well as standalone ESXi or Hyper-V hosts are supported as source and destination points for replication. NAKIVO Backup & Replication can also change the VM network settings automatically upon failover. To do so, use the Network Mapping and Re-IP features when configuring a replication or failover job.

An automated VM failover requires the following actions:

There are three types of failover:

  • Test failover is used for testing purposes, for example, to determine RTO and RPO values or to simulate the recovery procedure in your test environment. It allows you to make sure that everything functions properly and can run smoothly when needed.
  • Planned failover is used for migrating workloads from one site to another, including cases when a disaster is predicted. Examples include a weather alert about a tornado risk or planned maintenance at your primary site.
  • Regular failover is an unplanned failover performed when a disaster occurs unexpectedly and a critical VM (or the whole primary site) goes offline. This could be caused by a sudden power outage, natural disaster, virus attack, or any other incident. Hosts and VM replicas should be prepared for unplanned failover.

Failback as Part of Site Recovery Workflows

Failback is the process of switching workloads from the VM replica back to the source VM after the disaster damage has been remedied. It also involves identifying the changes that were made while the DR site was substituting for the production site, and transferring that data back to the original VM.

A failback job can only run after a failover, because failback returns an application that has been failed over to its original state. Therefore, to start a failback operation, you need to create a Site Recovery workflow that includes the failover action. You can fail back to the primary site or to a new location from a VM replica that has replaced the original VM.

Failback can be run in either of the following modes:

  • Failback in test mode (on demand or on schedule) helps to identify whether a workflow would run successfully in production mode.
    During test failback, the VM replica runs all operations and remains powered on, while the original VM is powered off. A protective snapshot of the original source VM is created. Thereafter, incremental or full replication from the VM replica to the source VM is performed. Replication is run only once, which is sufficient for testing purposes. It is important to check that the IP address and network settings are correct in order to establish a connection between the sites. This ensures that the source VM and the VM replica can be synchronized for smooth data transfer. Finally, the source VM is powered on.

    Note

    All changes made in your VMs during the failback process are discarded after the test is complete, and your virtual environment reverts to its original state.

  • Failback in production mode allows you to recover your environment after a disaster. The process is similar to failback in test mode. However, replication from the VM replica to the source VM is performed twice to ensure zero data loss. Finally, the source VM is powered on, and the VM replica at the DR site is powered off.

    Note

    VM replicas are only powered off in production mode.
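
The two failback modes differ mainly in the number of replication passes and in what happens to the VM replica at the end. The sketch below restates the sequence described above as data so that the modes can be compared side by side. It performs no real operations, and the `failback_steps` function and step wording are hypothetical.

```python
def failback_steps(mode: str) -> list[str]:
    """Return the failback sequence for 'test' or 'production' mode."""
    if mode not in ("test", "production"):
        raise ValueError("mode must be 'test' or 'production'")

    # One replication pass in test mode, two in production mode.
    passes = 1 if mode == "test" else 2

    steps = [
        "VM replica keeps running; source VM is powered off",
        "create protective snapshot of source VM",
    ]
    steps += [
        f"replicate VM replica -> source VM (pass {n} of {passes})"
        for n in range(1, passes + 1)
    ]
    steps.append("power on source VM")

    if mode == "production":
        steps.append("power off VM replica")
    else:
        steps.append("discard changes and revert environment to original state")
    return steps


test_steps = failback_steps("test")
production_steps = failback_steps("production")

for mode, steps in (("test", test_steps), ("production", production_steps)):
    print(f"Failback in {mode} mode:")
    for number, step in enumerate(steps, start=1):
        print(f"  {number}. {step}")
```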
