Recovering failed storage volumes and rebuilding the Cassandra database
You must run a script that reformats and remounts storage on failed storage volumes, and rebuilds the Cassandra database on the Storage Node if the system determines that it is necessary.
-
You must have the
Passwords.txt
file. -
The system drives on the server must be intact.
-
The cause of the failure must have been identified and, if necessary, replacement storage hardware must already have been acquired.
-
The total size of the replacement storage must be the same as the original.
-
You have checked that a Storage Node decommissioning is not in progress, or you have paused the node decommission procedure. (In the Grid Manager, select Maintenance > Maintenance Tasks > Decommission.)
-
You have checked that an expansion is not in progress. (In the Grid Manager, select Maintenance > Maintenance Tasks > Expansion.)
-
You have reviewed the warnings about storage volume recovery.
-
As needed, replace failed physical or virtual storage associated with the failed storage volumes that you identified and unmounted earlier.
After you replace the storage, make sure you rescan or reboot to make sure that it is recognized by the operating system, but do not remount the volumes. The storage is remounted and added to
/etc/fstab
in a later step. -
Log in to the failed Storage Node:
-
Enter the following command:
ssh admin@grid_node_IP
-
Enter the password listed in the
Passwords.txt
file. -
Enter the following command to switch to root:
su -
-
Enter the password listed in the
Passwords.txt
file.
-
When you are logged in as root, the prompt changes from
$
to#
.-
Use a text editor (vi or vim) to delete failed volumes from the
/etc/fstab
file and then save the file.Commenting out a failed volume in the /etc/fstab
file is insufficient. The volume must be deleted fromfstab
as the recovery process verifies that all lines in thefstab
file match the mounted file systems. -
Reformat any failed storage volumes and rebuild the Cassandra database if it is necessary. Enter:
reformat_storage_block_devices.rb
-
If storage services are running, you will be prompted to stop them. Enter: y
-
You will be prompted to rebuild the Cassandra database if it is necessary.
-
Review the warnings. If none of them apply, rebuild the Cassandra database. Enter: y
-
If more than one Storage Node is offline or if another Storage Node has been rebuilt in the last 15 days. Enter: n
The script will exit without rebuilding Cassandra. Contact technical support.
-
-
For each rangedb drive on the Storage Node, when you are asked:
Reformat the rangedb drive <name> (device <major number>:<minor number>)? [y/n]?
, enter one of the following responses:-
y to reformat a drive that had errors. This reformats the storage volume and adds the reformatted storage volume to the
/etc/fstab
file. -
n if the drive contains no errors, and you do not want to reformat it.
Selecting n exits the script. Either mount the drive (if you think the data on the drive should be retained and the drive was unmounted in error) or remove the drive. Then, run the reformat_storage_block_devices.rb
command again.Some StorageGRID recovery procedures use Reaper to handle Cassandra repairs. Repairs occur automatically as soon as the related or required services have started. You might notice script output that mentions “reaper” or “Cassandra repair.” If you see an error message indicating the repair has failed, run the command indicated in the error message. In the following example output, the drive
/dev/sdf
must be reformatted, and Cassandra did not need to be rebuilt:root@DC1-S1:~ # reformat_storage_block_devices.rb Storage services must be stopped before running this script. Stop storage services [y/N]? **y** Shutting down storage services. Storage services stopped. Formatting devices that are not in use... Skipping in use device /dev/sdc Skipping in use device /dev/sdd Skipping in use device /dev/sde Reformat the rangedb drive /dev/sdf (device 8:64)? [Y/n]? **y** Successfully formatted /dev/sdf with UUID c817f87f-f989-4a21-8f03-b6f42180063f Skipping in use device /dev/sdg All devices processed Running: /usr/local/ldr/setup_rangedb.sh 12075630 Cassandra does not need rebuilding. Starting services. Reformatting done. Now do manual steps to restore copies of data.
-
-
-