Best Practices: NetApp HCI DR with Cleondris
Recommendations for Success
The following tips can help you be more successful with your BCDR work.
Know your applications and what makes them work. The more time you spend on them, the more successful you will be with your real and test failovers. When there are issues, you will be able to solve them faster.
Protect one application first. Choose a relatively simple one, and demo the test failover to your peers and management. The demonstration will help you with management and peer support, and the test will help you learn more before you protect other applications.
Your tier 1 applications should be on their own volume.
You need to practice often in as realistic a scenario as possible. For example, practice off-site, sometimes with poor internet in a hotel conference room. Practicing often is key, and try changing the teams around so that application team X is recovering application Y; this approach will help with knowledge sharing.
Make sure to have an executive sponsor. You’ll need executive support when teams are not working well together, or when you need application teams to be reasonable about recovery time.
Plan for Partial or Full Outage
Most disaster recovery events are partial ones, so make sure your tier 1 applications can be recovered without having to recover everything.
Practice the failovers, but also practice managing others who are authorized to trigger a failover. They need to practice, and they need to know how successful or unsuccessful the failovers are. Make sure they practice with you in as realistic a scenario as possible. You can do a sand-table-type exercise in which operations people bring up issues and managers discuss their response.
Why Does Disaster Recovery Fail?
There are several possible reasons for a disaster recovery plan failing:
BCDR is needed.
Attitude is missing: People do not care as much as they should.
The executive sponsor is missing or not assigned.
There isn’t enough practice, or it isn’t real enough.
Data from the test gets into the product. This situation is serious and must be avoided.
Additional Uses for Disaster Recovery Orchestration Tools
Over time, customers have found other uses for disaster recovery orchestration tools. For example, they test application and OS upgrades in a test failover. This testing is better than testing in a lab, because it uses the actual production bits—which means that, when done in production, the process will be as smooth as in the test failover. I have also seen security vulnerability testing done as a test failover first to determine what applications might be negatively affected.
Currently, to protect an active-active site, you must install HCC on both sites and protect as normal. There is currently no overview of the protection. Active-active is the best model, because you can split your applications over two sites; when there is an outage, you only need to fail over half.
Allowing Extra Resources in Test Failover
Sometimes it is necessary to have more resources in the test failover so that a proper application test can occur. For example, these resources might consist of things like physical anti-spam appliances or load balancers. You can also include things like databases, which has the potential to cause problems, because you must make sure test data does not get into production. To perform this process reliably, use the following steps as a guideline.
A script executes in the disaster recovery test process (or use a manual process if necessary).
A separate logical partition (LPAR) is created.
A virtual network is added to the separate LPAR, and it is already connected to the test network.
A script exports and copies the appropriate data to the separate and new LPAR. It’s likely that you’ll need to have the application on the separate partition, too.
You might need to tweak DNS names or the configuration of the application in the test network to access this new server.
The test completes successfully.
After the test is done, and the cleanup occurs, another script runs, and it deletes the separate partition. That step keeps anything from getting into production accidentally.
You can use a similar process to get a domain controller into a test failover:
Power off the domain controller in the disaster recovery site. Make sure there is another domain controller still running.
After the domain controller is off, clone it.
Power on the original domain controller.
Put the cloned domain controller on the test network.
Power on the clone domain controller.
You should be able to use the domain controller in the test now, whether for authentication or DNS.
When the test is done, delete the cloned domain controller. Don’t skip this step, because you don’t want that domain database talking to the production domain.
It’s best to script these steps and execute the script from the recovery plan. However, to do that, you need a script or batch file that can tell whether it is executing in test or real failover—and in real failover, it does nothing.
It is useful to capture events from Cleondris by using syslog. Groups such as security or operations might benefit.
To do this, use the Setup page and the Events tab. Then use the Add Receiver button.
Specify which event to send. In this example, the best idea might be to send all of them for now. Select the boxes; some do not apply to Cleondris HCC and BCDR, but they will not be generated if not used.
You can see the BCDR events in the Events section at the bottom of the list.
The VM state is preserved during a failover. A VM that is powered on or off in production remains in the same state after a failover or during a test failover. However, be aware that HCC scans vCenter every 20 minutes. Therefore, you need to wait for that scan or use the refresh button in HCC to immediately refresh.
Add an Execute-Only Account
An execute-only account can be useful for a manager to trigger a failover without saving the changes. You create this account yourself.
First, create a role that has the following privileges:
When the role is done, create a user with that role; the resulting account is an execute-only account. This set of privileges lets the user look at and change things but not save the changes.
Idle Time Out
This parameter can be set to perform an automatic log out when there is no activity in the browser. Working on a different tab counts as activity.
Select the Setup option and then select the Advanced tab to see the Advanced Configuration window.
Click the Add Option button to add the option and value. In the screenshot above, 360 seconds must pass before a timeout if there is no activity in the browser.
The inventory rescan setting is used when a VM state is not preserved when it should be. For example, a VM should not be powered on in a failover if it is off in production. The value for the rescan interval can be set between 5 minutes and 1440 minutes; it is set to 20 minutes by default.
In the previous screenshot, the interval is set for 10 minutes.
Be aware that this setting changes the vCenter rescan time and also the Solidfire rescan time.
The following best practices can improve your experience with Cleondris and assist with support.
Always include a support bundle when you ask for support.
With certain edge cases, additional logging is very helpful for support. Enable the additional logging, and then perform the action that you are having trouble with again. You can then delete
log.levelbecause you do not want to routinely debug this level.
A busy vCenter Server Appliance (VCSA) can cause issues under some conditions. To minimize this problem, add more memory to the VCSA.
Issues can also be caused by the fact that one or two VMs might not be cleaned up in a test failover. You can clean these VMs up with the following steps:
Power off the VMs. This may take some time.
Remove the VMs from inventory.
Often, these two steps allow the datastore to disappear. You can then perform a Rescan Storage operation.