Identify and retry failed replication operations

Contributors netapp-madkat netapp-perveilerk

After resolving the Cross-grid replication permanent failure alert, you should determine if any objects or delete markers failed to be replicated to the other grid. You can then reingest these objects or use the Grid Management API to retry replication.

The Cross-grid replication permanent failure alert indicates that tenant objects can't be replicated between the buckets on two grids for a reason that requires user intervention to resolve. This alert is typically caused by a change to either the source or the destination bucket. For details, see Troubleshoot grid federation errors.

Determine if any objects failed to be replicated

To determine if any objects or delete markers have not been replicated to the other grid, you can search the audit log for CGRR (Cross-Grid Replication Request) messages. This message is added to the log when StorageGRID fails to replicate an object, multipart object, or delete marker to the destination bucket.

You can use the audit-explain tool to translate the results into an easier-to-read format.

Before you begin
  • You have Root access permission.

  • You have the Passwords.txt file.

  • You know the IP address of the primary Admin Node.

Steps
  1. Log in to the primary Admin Node:

    1. Enter the following command: ssh admin@primary_Admin_Node_IP

    2. Enter the password listed in the Passwords.txt file.

    3. Enter the following command to switch to root: su -

    4. Enter the password listed in the Passwords.txt file.

      When you are logged in as root, the prompt changes from $ to #.

  2. Search the audit.log for CGRR messages, and use the audit-explain tool to format the results.

    For example, this command greps for all CGRR messages in the past 30 minutes and uses the audit-explain tool.

    # awk -vdate=$(date -d "30 minutes ago" '+%Y-%m-%dT%H:%M:%S') '$1$2 >= date { print }' audit.log | grep CGRR | audit-explain

The results of the command will look like this example, which has entries for four CGRR messages. In the example, all cross-grid replication requests returned a general error because the object could not be replicated. The first two errors are for "replicate object" operations, and the last two errors are for "replicate delete marker" operations.

CGRR Cross-Grid Replication Request tenant:50736445269627437748 connection:447896B6-6F9C-4FB2-95EA-AEBF93A774E9 operation:"replicate object" bucket:bucket123 object:"audit-0" version:QjRBNDIzODAtNjQ3My0xMUVELTg2QjEtODJBMjAwQkI3NEM4 error:general error
CGRR Cross-Grid Replication Request tenant:50736445269627437748 connection:447896B6-6F9C-4FB2-95EA-AEBF93A774E9 operation:"replicate object" bucket:bucket123 object:"audit-3" version:QjRDOTRCOUMtNjQ3My0xMUVELTkzM0YtOTg1MTAwQkI3NEM4 error:general error
CGRR Cross-Grid Replication Request tenant:50736445269627437748 connection:447896B6-6F9C-4FB2-95EA-AEBF93A774E9 operation:"replicate delete marker" bucket:bucket123 object:"audit-1" version:NUQ0OEYxMDAtNjQ3NC0xMUVELTg2NjMtOTY5NzAwQkI3NEM4 error:general error
CGRR Cross-Grid Replication Request tenant:50736445269627437748 connection:447896B6-6F9C-4FB2-95EA-AEBF93A774E9 operation:"replicate delete marker" bucket:bucket123 object:"audit-5" version:NUQ1ODUwQkUtNjQ3NC0xMUVELTg1NTItRDkwNzAwQkI3NEM4 error:general error

Each entry contains the following information:

Field                                  Description

CGRR Cross-Grid Replication Request    The name of the request
tenant                                 The tenant's account ID
connection                             The ID of the grid federation connection
operation                              The type of replication operation that was being attempted: replicate object, replicate delete marker, or replicate multipart object
bucket                                 The bucket name
object                                 The object name
version                                The version ID for the object
error                                  The type of error. If cross-grid replication failed, the error is "general error".
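The search and the field layout above can be combined into a small helper that collects the version IDs needed for a retry. The following is a minimal sketch, not part of StorageGRID: the function names cgrr_since and cgrr_fields are hypothetical, GNU date is assumed, and the parsing assumes audit-explain output in exactly the form shown in the example (quoted operation and object values, and object names that do not themselves contain double quotes).

```shell
# Print raw CGRR audit lines from the last N minutes (default log: audit.log).
# Assumes GNU date and the same timestamp-first line layout that the
# command in step 2 relies on. Hypothetical helper, not a StorageGRID tool.
cgrr_since() (
  minutes="$1"
  log="${2:-audit.log}"
  since=$(date -d "$minutes minutes ago" '+%Y-%m-%dT%H:%M:%S') || exit 1
  awk -v date="$since" '$1$2 >= date' "$log" | grep CGRR
)

# Read audit-explain output on stdin and print one tab-separated line per
# CGRR entry: operation, bucket, object, version.
cgrr_fields() {
  awk '
    /^CGRR / {
      op = bucket = obj = ver = ""
      # Each offset skips the field label (and opening quote, if any).
      if (match($0, /operation:"[^"]*"/)) op     = substr($0, RSTART + 11, RLENGTH - 12)
      if (match($0, / bucket:[^ ]+/))      bucket = substr($0, RSTART + 8,  RLENGTH - 8)
      if (match($0, / object:"[^"]*"/))    obj    = substr($0, RSTART + 9,  RLENGTH - 10)
      if (match($0, / version:[^ ]+/))     ver    = substr($0, RSTART + 9,  RLENGTH - 9)
      printf "%s\t%s\t%s\t%s\n", op, bucket, obj, ver
    }'
}

# Example (run as root on the primary Admin Node, in the audit log directory):
#   cgrr_since 30 | audit-explain | cgrr_fields | cut -f4 > failed-versions.txt
```

Keeping the operation, bucket, and object columns next to each version ID makes it easier to decide later whether to reingest an object or retry it through the API.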

Retry failed replications

After generating a list of objects and delete markers that were not replicated to the destination bucket and resolving the underlying issues, you can retry replication in either of two ways:

  • Reingest each object into the source bucket.

  • Use the Grid Management private API, as described in the following steps.

Steps
  1. From the top of the Grid Manager, select the help icon and select API documentation.

  2. Select Go to private API documentation.

    Note: The StorageGRID API endpoints that are marked “Private” are subject to change without notice. StorageGRID private endpoints also ignore the API version of the request.

  3. In the cross-grid-replication-advanced section, select the following endpoint:

    POST /private/cross-grid-replication-retry-failed

  4. Select Try it out.

  5. In the body text box, replace the example entry for versionID with a version ID from the audit.log that corresponds to a failed cross-grid-replication request.

    Be sure to retain the double quotes around the string.

  6. Select Execute.

  7. Confirm that the server response code is 204, indicating that the object or delete marker has been marked as pending for cross-grid replication to the other grid.

    Note: Pending means the cross-grid replication request has been added to the internal queue for processing.
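When many version IDs need to be retried, repeating the steps above by hand becomes tedious. The sketch below shows one way the loop might be scripted. It deliberately does not show the request itself: the URL prefix, authentication, and body schema for the private endpoint are not given here, so the request is left to a caller-supplied command (the hypothetical submit_retry in the usage comment) that sends the POST for one version ID and prints the HTTP status code. Only the check for a 204 response comes from the procedure above.

```shell
# retry_all CMD
# Read version IDs from stdin, one per line, and run "CMD <version-id>" for
# each. CMD is a caller-supplied, hypothetical command that submits the
# retry request and prints the HTTP status code it received.
# Returns 0 only if every request was answered with 204.
retry_all() (
  submit="$1"
  failed=0
  while IFS= read -r version_id; do
    [ -n "$version_id" ] || continue          # skip blank lines
    # Detach stdin so CMD cannot consume the remaining version IDs.
    code=$("$submit" "$version_id" </dev/null)
    if [ "$code" = "204" ]; then
      printf 'pending\t%s\n' "$version_id"
    else
      printf 'not accepted (%s)\t%s\n' "$code" "$version_id"
      failed=1
    fi
  done
  exit "$failed"
)

# Example, assuming submit_retry is a function you have written and verified
# against the private API documentation for your StorageGRID version:
#   retry_all submit_retry < failed-versions.txt
```

A line marked as pending only means the request was queued. The replication itself still has to be monitored.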

Monitor replication retries

You should monitor the replication retry operations to make sure they complete.

Tip: It might take several hours or longer for an object or delete marker to be replicated to the other grid.

You can monitor retry operations in either of two ways:

  • Use an S3 HEAD Object or GET Object request. The response includes the StorageGRID-specific x-ntap-sg-cgr-replication-status response header, which will have one of the following values:

    Grid           Replication status

    Source         SUCCESS: The replication was successful.
                   PENDING: The object hasn't been replicated yet.
                   FAILURE: The replication failed with a permanent failure. A user must resolve the error.
    Destination    REPLICA: The object was replicated from the source grid.

  • Use the Grid Management private API, as described in the following steps.

Steps
  1. In the cross-grid-replication-advanced section of the private API documentation, select the following endpoint:

    GET /private/cross-grid-replication-object-status/{id}

  2. Select Try it out.

  3. In the Parameter section, enter the version ID you used in the cross-grid-replication-retry-failed request.

  4. Select Execute.

  5. Confirm that the server response code is 200.

  6. Review the replication status, which will be one of the following:

    • PENDING: The object hasn't been replicated yet.

    • COMPLETED: The replication was successful.

    • FAILED: The replication failed with a permanent failure. A user must resolve the error.
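Because replication can take hours, either monitoring method is usually repeated until the status leaves PENDING. The following sketch shows one possible polling loop under stated assumptions. The function names are hypothetical, and how the status is fetched is left to a caller-supplied command that prints a single status word for a version ID. The header name and the status values are the ones listed above. cgr_status_from_headers only extracts the header from raw HTTP response headers and does not make the S3 request.

```shell
# Extract the x-ntap-sg-cgr-replication-status value from raw HTTP response
# headers on stdin (for example, saved from a HEAD Object response).
cgr_status_from_headers() {
  tr -d '\r' |
    awk -F': *' 'tolower($1) == "x-ntap-sg-cgr-replication-status" { print $2; exit }'
}

# Map the S3 header values and the private API values onto one vocabulary.
cgr_outcome() {
  case "$1" in
    SUCCESS|COMPLETED) echo ok ;;
    PENDING)           echo pending ;;
    FAILURE|FAILED)    echo failed ;;
    *)                 echo unknown ;;
  esac
}

# wait_for_replication CMD VERSION_ID [ATTEMPTS] [DELAY_SECONDS]
# CMD is a caller-supplied, hypothetical command that prints the current
# status for VERSION_ID. Exit status: 0 replicated, 1 permanent failure,
# 2 still pending (or unknown) after the last attempt.
wait_for_replication() (
  fetch="$1"
  version_id="$2"
  attempts="${3:-12}"
  delay="${4:-300}"
  i=1
  while :; do
    status=$("$fetch" "$version_id")
    case "$(cgr_outcome "$status")" in
      ok)     echo "$status"; exit 0 ;;
      failed) echo "$status"; exit 1 ;;
    esac
    [ "$i" -ge "$attempts" ] && break
    i=$((i + 1))
    sleep "$delay"
  done
  echo "$status"
  exit 2
)
```

The default of 12 attempts at 300-second intervals is an arbitrary starting point. An exit status of 1 corresponds to a permanent failure that a user must resolve before retrying again.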