Identify and retry failed replication operations
After resolving the Cross-grid replication permanent failure alert, you should determine if any objects or delete markers failed to be replicated to the other grid. You can then reingest these objects or use the Grid Management API to retry replication.
The Cross-grid replication permanent failure alert indicates that tenant objects can't be replicated between the buckets on two grids for a reason that requires user intervention to resolve. This alert is typically caused by a change to either the source or the destination bucket. For details, see Troubleshoot grid federation errors.
Determine if any objects failed to be replicated
To determine if any objects or delete markers have not been replicated to the other grid, you can search the audit log for CGRR (Cross-Grid Replication Request) messages. This message is added to the log when StorageGRID fails to replicate an object, multipart object, or delete marker to the destination bucket.
You can use the audit-explain tool to translate the results into an easier-to-read format.
-
You have Root access permission.
-
You have the
Passwords.txt
file. -
You know the IP address of the primary Admin Node.
-
Log in to the primary Admin Node:
-
Enter the following command:
ssh admin@primary_Admin_Node_IP
-
Enter the password listed in the
Passwords.txt
file. -
Enter the following command to switch to root:
su -
-
Enter the password listed in the
Passwords.txt
file.When you are logged in as root, the prompt changes from
$
to#
.
-
-
Search the audit.log for CGRR messages, and use the audit-explain tool to format the results.
For example, this command greps for all CGRR messages in the past 30 minutes and uses the audit-explain tool.
# awk -vdate=$(date -d "30 minutes ago" '+%Y-%m-%dT%H:%M:%S') '$1$2 >= date { print }' audit.log | grep CGRR | audit-explain
The results of the command will look like this example, which has entries for six CGRR messages. In the example, all cross-grid replication requests returned a general error because the object could not be replicated. The first three errors are for "replicate object" operations, and the last three errors are for "replicate delete marker" operations.
CGRR Cross-Grid Replication Request tenant:50736445269627437748 connection:447896B6-6F9C-4FB2-95EA-AEBF93A774E9 operation:"replicate object" bucket:bucket123 object:"audit-0" version:QjRBNDIzODAtNjQ3My0xMUVELTg2QjEtODJBMjAwQkI3NEM4 error:general error CGRR Cross-Grid Replication Request tenant:50736445269627437748 connection:447896B6-6F9C-4FB2-95EA-AEBF93A774E9 operation:"replicate object" bucket:bucket123 object:"audit-3" version:QjRDOTRCOUMtNjQ3My0xMUVELTkzM0YtOTg1MTAwQkI3NEM4 error:general error CGRR Cross-Grid Replication Request tenant:50736445269627437748 connection:447896B6-6F9C-4FB2-95EA-AEBF93A774E9 operation:"replicate delete marker" bucket:bucket123 object:"audit-1" version:NUQ0OEYxMDAtNjQ3NC0xMUVELTg2NjMtOTY5NzAwQkI3NEM4 error:general error CGRR Cross-Grid Replication Request tenant:50736445269627437748 connection:447896B6-6F9C-4FB2-95EA-AEBF93A774E9 operation:"replicate delete marker" bucket:bucket123 object:"audit-5" version:NUQ1ODUwQkUtNjQ3NC0xMUVELTg1NTItRDkwNzAwQkI3NEM4 error:general error
Each entry contains the following information:
Field Description CGRR Cross-Grid Replication Request
The name of the request
tenant
The tenant's account ID
connection
The ID of the grid federation connection
operation
The type of replication operation that was being attempted:
-
replicate object
-
replicate delete marker
-
replicate multipart object
bucket
The bucket name
object
The object name
version
The version ID for the object
error
The type of error. If cross-grid replication failed, the error is "General error".
-
Retry failed replications
After generating a list of objects and delete markers that were not replicated to the destination bucket and resolving the underlying issues, you can retry replication in either of two ways:
-
Reingest each object into the source bucket.
-
Use the Grid Management private API, as described.
-
From the top of the Grid Manager, select the help icon and select API documentation.
-
Select Go to private API documentation.
The StorageGRID API endpoints that are marked "Private" are subject to change without notice. StorageGRID private endpoints also ignore the API version of the request. -
In the cross-grid-replication-advanced section, select the following endpoint:
POST /private/cross-grid-replication-retry-failed
-
Select Try it out.
-
In the body text box, replace the example entry for versionID with a version ID from the audit.log that corresponds to a failed cross-grid-replication request.
Be sure to retain the double quotes around the string.
-
Select Execute.
-
Confirm that the server response code is 204, indicating that the object or delete marker has been marked as pending for cross-grid replication to the other grid.
Pending means the cross-grid replication request has been added to the internal queue for processing.
Monitor replication retries
You should monitor the replication retry operations to make sure they complete.
It might take several hours or longer for an object or delete marker to be replicated to the other grid. |
You can monitor retry operations in either of two ways:
-
Use an S3 HeadObject or GetObject request. The response includes the StorageGRID-specific
x-ntap-sg-cgr-replication-status
response header, which will have one of the following values:Grid Replication status Source
-
COMPLETED: The replication was successful.
-
PENDING: The object hasn't been replicated yet.
-
FAILURE: The replication failed with a permanent failure. A user must resolve the error.
Destination
REPLICA: The object was replicated from the source grid.
-
-
Use the Grid Management private API, as described.
-
In the cross-grid-replication-advanced section of the private API documentation, select the following endpoint:
GET /private/cross-grid-replication-object-status/{id}
-
Select Try it out.
-
In the Parameter section, enter the version ID you used in the
cross-grid-replication-retry-failed
request. -
Select Execute.
-
Confirm that the server response code is 200.
-
Review the replication status, which will be one of the following:
-
PENDING: The object hasn't been replicated yet.
-
COMPLETED: The replication was successful.
-
FAILED: The replication failed with a permanent failure. A user must resolve the error.
-