Troubleshooting platform services

If a tenant has issues configuring or using platform services, you might be able to use the Grid Manager to identify the cause of their issue.

Issues with endpoint creation

Before a tenant can use platform services, they must create one or more endpoints using the Tenant Manager. Each endpoint represents an external destination for one platform service, such as a StorageGRID Webscale S3 bucket, an Amazon Web Services bucket, a Simple Notification Service topic, or an Elasticsearch cluster hosted locally or on AWS. Each endpoint includes both the location of the external resource (expressed as a URI) and the credentials needed to access that resource.

When a tenant creates an endpoint, the StorageGRID Webscale system validates that the endpoint exists and that it can be reached using the specified credentials. If validation fails, an error message explains why. The tenant user should resolve the issue, then try creating the endpoint again.
Note: Endpoint creation will fail if platform services are not enabled for the tenant account.

Issues with endpoint validation

Tenant users can test an existing endpoint by selecting the endpoint and clicking Test. A success message is displayed if the endpoint can be reached using the specified credentials.

If endpoint validation fails, an error message is displayed. To correct the error, the tenant user should edit the endpoint and click Save to validate the changes.

Client operations fail

Some platform services issues might cause client operations on the S3 bucket to fail. For example, S3 client operations will fail if the internal Replicated State Machine (RSM) service stops, or if there are too many platform services messages queued for delivery.

To check the status of services:
  1. Select Support > Grid Topology.
  2. Select site > Storage Node > SSM > Services.

Recoverable and unrecoverable endpoint errors

After endpoints have been created, platform service request errors can occur for various reasons. Some errors are recoverable with user intervention. For example, recoverable errors might occur for the following reasons:
  • The user's credentials have been deleted or have expired.
  • The destination bucket does not exist.
  • The notification cannot be delivered.

If StorageGRID Webscale encounters a recoverable error, the platform service request is retried until it succeeds.

Other errors are unrecoverable. For example, an unrecoverable error occurs if the endpoint is deleted.
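
The difference between the two error classes can be illustrated with a minimal Python sketch. It is not the StorageGRID Webscale implementation: the exception classes, the deliver_with_retry function, and the fixed delay between attempts are hypothetical names and choices made only for this example.

```python
import time
from typing import Callable


class RecoverableError(Exception):
    """Hypothetical: expired credentials, missing bucket, undeliverable notification."""


class UnrecoverableError(Exception):
    """Hypothetical: for example, the endpoint has been deleted."""


def deliver_with_retry(
    send: Callable[[], None],
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it succeeds and return the number of attempts.

    A RecoverableError causes another attempt after a pause.
    An UnrecoverableError is not caught, so it propagates to the caller,
    which is where an alarm would be raised in this sketch.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            send()
            return attempts
        except RecoverableError:
            # Assumed policy: wait a fixed interval, then retry.
            sleep(delay_seconds)
```

In this sketch, a request that fails because of expired credentials keeps being retried and succeeds once the tenant corrects the credentials, while a request to a deleted endpoint fails immediately.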

If StorageGRID Webscale encounters an unrecoverable endpoint error, the Total Events (SMTT) alarm is triggered in the Grid Manager. To view the Total Events alarm and correct the issue:
  1. Select Nodes.
  2. Select site > grid node > Events.
  3. View Last Event at the top of the table.

    Event messages are also listed in /var/local/log/bycasterr.log.

  4. Follow the guidance provided in the SMTT alarm contents to correct the issue.
  5. Click Reset event counts.
  6. Notify the tenant of the objects whose platform services messages have not been delivered.
  7. Instruct the tenant to trigger the failed replication or notification by updating the object's metadata or tags.

    The tenant can resubmit the existing values to avoid making unwanted changes.
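
For the last step, a tenant might resubmit an object's existing tags with a short script. The following is a sketch only. It assumes an S3 client object that offers the get_object_tagging and put_object_tagging calls in the style of the AWS SDK for Python (boto3). The function name resubmit_object_tags and the bucket and key arguments are hypothetical.

```python
def resubmit_object_tags(s3_client, bucket: str, key: str) -> list:
    """Read an object's current tag set and write the same tag set back.

    The tag values are unchanged; the write is issued only so that the
    object is updated again. s3_client is assumed to be a boto3-style
    S3 client.
    """
    current = s3_client.get_object_tagging(Bucket=bucket, Key=key)
    tag_set = current.get("TagSet", [])

    # Resubmit exactly what was read to avoid making unwanted changes.
    s3_client.put_object_tagging(
        Bucket=bucket,
        Key=key,
        Tagging={"TagSet": tag_set},
    )
    return tag_set
```

The tenant would call this function once for each object on the list provided by the grid administrator, using a client configured for the tenant's own S3 account.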

Platform services messages cannot be delivered

If the destination encounters an issue that prevents it from accepting platform services messages, the client operation on the bucket succeeds, but the platform services message is not delivered. For example, this error might happen if credentials are updated on the destination such that StorageGRID Webscale can no longer authenticate to the destination service.

If platform services messages cannot be delivered because of an unrecoverable error, the Total Events (SMTT) alarm is triggered in the Grid Manager.

Slower performance for platform service requests

StorageGRID Webscale might throttle incoming S3 requests for a bucket if the rate at which the requests are being sent exceeds the rate at which the destination endpoint can receive the requests. Throttling occurs only when there is a backlog of requests waiting to be sent to the destination endpoint.

The only visible effect is that the incoming S3 requests will take longer to execute. If you start to detect significantly slower performance, you should reduce the ingest rate or use an endpoint with higher capacity. If the backlog of requests continues to grow, client S3 operations (such as PUT requests) will eventually fail.
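
The relationship between backlog, request latency, and eventual failure can be pictured with a simplified model. The sketch below is illustrative and does not describe how StorageGRID Webscale actually throttles requests: the BacklogThrottle class, the linear delay, and the max_backlog limit are all assumptions made for this example.

```python
class BacklogFullError(Exception):
    """Hypothetical: raised when the backlog is too large to accept more requests."""


class BacklogThrottle:
    """Toy model: delay grows with the backlog; requests fail past a limit."""

    def __init__(self, delay_per_message: float = 0.001, max_backlog: int = 1000):
        self.delay_per_message = delay_per_message  # assumed seconds of delay per queued message
        self.max_backlog = max_backlog              # assumed limit before requests fail

    def admit(self, backlog: int) -> float:
        """Return the delay in seconds to apply to an incoming request.

        No backlog means no delay. A growing backlog means slower
        requests. A backlog at or above the limit means the request fails.
        """
        if backlog >= self.max_backlog:
            raise BacklogFullError(f"{backlog} messages are queued for delivery")
        return backlog * self.delay_per_message
```

In this model, reducing the ingest rate or using a higher-capacity endpoint shrinks the backlog, which in turn shortens the delay applied to each incoming request.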

CloudMirror requests are more likely to be affected by the performance of the destination endpoint because these requests typically involve more data transfer than search integration or event notification requests.