Data protection planning
The right Oracle database data protection architecture depends on the business requirements surrounding data retention, recoverability, and tolerance for disruption during various events.
For example, consider the number of applications, databases, and important datasets in scope. Building a backup strategy for a single dataset that ensures compliance with typical SLAs is fairly straightforward because there are not many objects to manage. As the number of datasets increases, monitoring becomes more complicated and administrators might be forced to spend an increasing amount of time addressing backup failures. As an environment reaches cloud and service provider scales, a wholly different approach is needed.
Dataset size also affects strategy. For example, many options exist for backup and recovery with a 100GB database because the data set is so small. Simply copying the data from backup media with traditional tools typically delivers a sufficient RTO for recovery. A 100TB database normally needs a completely different strategy unless the RTO allows for a multiday outage, in which case a traditional copy-based backup and recovery procedure might be acceptable.
Finally, there are factors outside the backup and recovery process itself. For example, are there databases supporting critical production activities, making recovery a rare event that is only performed by skilled DBAs? Alternatively, are the databases part of a large development environment in which recovery is a frequent occurrence and managed by a generalist IT team?
Is a snapshot a backup?
One commonly raised objection to the use of snapshots as a data protection strategy is the fact that the "real" data and the snapshot data are located on the same drives. Loss of those drives would result in the loss of both the primary data and the backup.
This is a valid concern. Local snapshots are used for day-to-day backup and recovery needs, and in that respect the snapshot is a backup. Close to 99% of all recovery scenarios in NetApp environments rely on snapshots to meet even the most aggressive RTO requirements.
Local snapshots should, however, never be the only backup strategy, which is why NetApp offers technology such as SnapMirror replication to quickly and efficiently replicate snapshots to an independent set of drives. In a properly architected solution with snapshots plus snapshot replication, the use of tape can be minimized to perhaps a quarterly archive or eliminated entirely.
Crash-consistent backups
A crash-consistent backup involves capturing the state of a dataset (or multiple datasets) at a single atomic point in time. As a database example, this includes all database components, such as datafiles, log files, and other files directly associated with the database. This works with almost all relational database products, including Oracle RDBMS, Microsoft SQL Server, SAP HANA, PostgreSQL, MySQL, and MariaDB. Protecting a VMware configuration with a consistency group backup would be similar - capturing all of the datastores and potentially the ESX boot LUNs in a single atomic point in time.
Creation of a such a backup is essentially simulating a crash, which is why such backups are called crash-consistent backups. There are sometimes concerns with support for recovery scenarios, but it is important to understand that no recovery procedure is usually required. When the application starts up after restoring a consistency group backup, it performs the usual log recovery processes, file system journal replays, and other tasks to replay any I/O that was in-flight at the point of the backup. The application then starts as usual.
Essentially, any application that can withstand a power failure or server crash without data corruption can be protected in this way. The fact that this works can also be demonstrated by the huge number of applications protected with synchronous and asynchronous mirroring products from many different vendors. If a disaster suddenly strikes the primary site, then the replica site contains a consistent image of the original environment at the moment the disaster occurred. Once again, no special recovery procedure is required.
The RPO for this approach is usually limited to the point of the backup. As a general rule, the minimum RPO for crash-consistent backup is one hour. For example, 48 hourly snapshots plus another 30 days of nightly snapshots is reasonable and would not require the retention of an excessive number of snapshots. An RPO lower than one hour becomes more difficult to achieve and it is not recommended without first consulting NetApp Professional Services to understand the environment, scale, and data protection requirements.
The RTO can usually be measured in seconds. A an application is shut down, the NFS volumes, LUNs, or NVMe namespaces are restored, and the application is restarted.
The simplest approach is to place all the files or LUNs in a single AFF volume or ASA LUN/namespace. Where a dataset must span multiple storage resources, a consistency group is required. This can be configured using System Manager or RESTful API calls, plus SnapCenter is capable of creating a simple consistency group snapshot.
Replication and disaster recovery architecture
This section addresses remote data protection, for which data is replicated to a remote site for the purposes of secure offsite storage and disaster recovery. Note that these tables do not address synchronous mirroring data protection. For this requirement, see the NetApp MetroCluster documentation including MetroCluster and SnapMirror active sync
The DR RPO is limited by the available network bandwidth and the total size of the data being protected. After the initial baseline transfer is created, the updates are only based on the changed data, which typically is a low percentage of the total data footprint, although exceptions do exist.
For example, a 10TB database with a 10% weekly change rate averages approximately 6GB per hour of total changes. With 10Gb of connectivity, this database requires approximately six minutes to transfer. The change rate varies with fluctuation in the database change rate, but overall a 15-minute update interval and thus a 15-minute RPO should be achievable. If there are 100 such databases, then 600 minutes is required to transfer the data. Therefore, an RPO of one-hour is not possible. Likewise, a replica of a single database 100TB in size with a 10% weekly change rate cannot be updated reliably in one hour.
Additional factors can affect replication, such as replication overhead and limitations on the number of concurrent replication operations. However, overall planning for a single-resource replication strategy can be based on available bandwidth, and a replication RPO of one hour is generally achievable. An RPO lower than one hour becomes more difficult to achieve and should only be performed after consulting NetApp Professional Services. In some cases, 15 minutes is feasible with very good site-to-site network connectivity. However, overall, when an RPO below one hour is required, the multi-resource log replay architecture yields better results.
The RTO with consistency group replication in a disaster recovery scenario is excellent, typically measured in seconds from a storage point of view. The most straightforward approach is to simply break the mirror, and the database is ready to be started. Database startup time is typically about 10 seconds, but very large databases with a lot of logged transactions could take a few minutes.
The more important factor in determining RTO is not the storage system but rather the application and the host operating system upon which it runs. For example, the replicated data can be made available in a second or two, but this only represents the data. There must also be a correctly configured operating system with application binaries that is available to use the data.
In some cases, customers have prepared disaster recovery instances ahead of time with the storage prediscovered on operating systems. In these cases, activating the disaster recovery scenario can require nothing more than breaking a mirror and starting the application. In other cases, the OS and associated applications might be mirrored alongside the database as an ESX virtual machine disk (VMDK). In these cases, the RPO is determined by how much a customer has invested in automation to quickly boot the VMDK so that the applications can be started.
The retention time is controlled in part by the snapshot limit. For example, AFF volumes and ASA LUNs in ONTAP have a limit of 1024 snapshots. In some cases, customers have multiplexed replication to increase the limit. For example, if 2000 days of backups are required, a source can be replicated to two different destinations with updates occurring on alternate days. This requires an increase in the initial space required, but it still represents a much more efficient approach than a traditional backup system, which involves multiple full backups.
Single-volume consistency group (AFF)
The simplest approach is to place all the files or LUNs in a single volume. A volume is a consistency group by itself, which means SnapMirror and SnapVault updates can be scheduled directly on the storage system. No external software is required.
Single-LUN/namespace consistency group (ASA)
While it may seem strange to place an entire database on a single LUN/namespace, it is becoming increasingly common. The reason volume managers were originally created was to overcome the performance and capacity limits of physical drives. With modern arrays, a single block device presented to a host is already abstracted across multiple drives. The size can be adjusted dynamically, and performance is far greated than an individual drive.
The result is workloads ranging from an ESX datastore to an Oracle database can often be hosted on just a single block device. This not only vastly simplifies data management, but has helpful side effects, such as faster host reboot times due to the reduced number of devices that need to be discovered.
Multi-volume consistency group
When a database must span AFF volumes or ASA LUNs/namespaces, a consistency group snapshot is required. As mentioned above, this can be configured using System Manager or RESTful API calls, plus SnapCenter is capable of creating a simple consistency group snapshot.
There is also one additional consideration on the use of AFF multi-volume or ASA multi-LUN consistent snapshots for disaster recovery purposes. When performing an update of multiple snapmirror relationships, it is possible that a disaster could occur while a transfer is still in progress. The result would be a set of volumes, LUNs or namespaces that are not consistent with one another. If this happened, some of the volumes must be restored to an earlier snapshot state to deliver a database image that is crash-consistent and ready for use.
Disaster recovery: activation
NFS
The process of activating the disaster recovery copy depends on the type of storage. With NFS, the file systems can be premounted on the disaster recovery server. They are in a read-only state and become read-write when the mirror is broken. This delivers an extremely low RPO, and the overall disaster recovery process is more reliable because there are fewer parts to manage.
SAN
Activating SAN configurations in the event of disaster recovery become more complicated. The simplest option is generally to temporarily break the mirrors and mount the SAN resources, including steps such as discovering LVM configuration (including application-specific features such as Oracle Automatic Storage Management [ASM]), and adding entries to /etc/fstab.
The result is that the LUN device paths, volume groups names, and other device paths are made known to the target server. Those resources can then be shut down, and afterward the mirrors can be restored. The result is a server that is in a state that can rapidly bring the application online. The steps to activate volumes groups, mount file systems, or and start databases and applications are easily automated.
Care must be taken to make sure that the disaster recovery environment is up to date. For example, new LUNs are likely to be added to the source server, which means the new LUNs must be prediscovered on the destination to make sure that the disaster recovery plan works as expected.