
Troubleshooting

Contributors netapp-mwallis netapp-rlithman

Learn how to work around some common problems you might encounter.

Common issues and solutions

If you encounter one of these issues, use the steps listed under Workaround to try to resolve it.

Each entry below lists the area, the issue, its cause, and a workaround.

Area: Deployment

Issue: Deployment fails because the volume already exists.

Cause: Workload Factory for GenAI needs to create a new volume during deployment, but a volume with the name you specified already exists.

Workaround: Specify a unique name for the new volume, and try deploying again.
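Before you retry, you can check whether a candidate name is already in use. This is a minimal bash sketch, not part of Workload Factory for GenAI: `volume_name_taken` is a hypothetical helper, and the example in its comments assumes the AWS CLI is installed and configured for the account and Region of your FSx for ONTAP file system.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of Workload Factory for GenAI.
# volume_name_taken NAME: reads existing volume names from stdin, one per line,
# and exits 0 if NAME is already in use, 1 otherwise.
volume_name_taken() {
  local name=$1
  # -F fixed string, -x whole-line match, -q quiet (exit status only)
  grep -Fxq -- "$name"
}

# Example (assumes the AWS CLI is configured for your account and Region;
# describe-volumes can list volumes from every file system in that Region):
#   aws fsx describe-volumes --query 'Volumes[].Name' --output text \
#     | tr '\t' '\n' | volume_name_taken netapp_ai && echo "netapp_ai is taken"
```

Listing names across the whole Region is deliberately conservative: a name that appears anywhere in the output is treated as taken.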

Area: Deployment

Issue: Deployment fails because Workload Factory for GenAI can't mount the volume.

Cause: One or more of the inbound ports required by FSx for NetApp ONTAP are closed or filtered.

Workaround: Open the ports listed in Security group rules for FSx for ONTAP.
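After you update the security group rules, you can confirm from the EC2 instance that the SVM endpoint accepts TCP connections. This bash sketch is an assumption-laden check, not an official tool: `check_port` and `check_nfs_ports` are hypothetical helpers, and the port list (rpcbind 111, mountd 635, NFS 2049, NLM 4045, NSM 4046) is assumed from the usual NFS ports, so compare it with Security group rules for FSx for ONTAP. It tests TCP only; UDP rules can't be verified this way.

```shell
#!/usr/bin/env bash
# ASSUMPTION: typical NFS-related TCP ports; confirm against the
# "Security group rules for FSx for ONTAP" page before relying on this list.
NFS_TCP_PORTS=(111 635 2049 4045 4046)

# check_port HOST PORT [TIMEOUT_SECONDS]  (hypothetical helper)
# Prints "HOST:PORT open" or "HOST:PORT closed-or-filtered".
check_port() {
  local host=$1 port=$2 secs=${3:-3}
  # bash's /dev/tcp redirection opens a TCP connection; `timeout` stops the
  # attempt from hanging when a firewall silently drops packets (filtered port).
  if timeout "$secs" bash -c ': <"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null; then
    echo "$host:$port open"
  else
    echo "$host:$port closed-or-filtered"
  fi
}

# check_nfs_ports HOST: run check_port for every port in NFS_TCP_PORTS.
check_nfs_ports() {
  local port
  for port in "${NFS_TCP_PORTS[@]}"; do
    check_port "$1" "$port"
  done
}

# Example, run on the EC2 instance (replace with your SVM's DNS name):
#   check_nfs_ports svm-0df66b96a890d8a72.fs-0d673008aaca12bc3.fsx.us-east-1.amazonaws.com
```

A "closed-or-filtered" result doesn't say which security group is at fault, so check the rules on both the file system and the EC2 instance.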

Area: Maintenance

Issue: The backend Docker instance used by Workload Factory for GenAI fails to start.

Cause: The volume was deleted and the EC2 instance was restarted.

Workaround: Use the following recovery steps:
  1. Create a new volume on FSx for NetApp ONTAP. For example, the volume name can be netapp_ai and the volume path can be /netapp_ai.

  2. SSH to the Amazon EC2 instance.

  3. List the Docker volumes on the instance:

    docker volume list
  4. Remove the old Docker volume, which is still configured with the path of the deleted volume:

    docker volume rm ec2-user_persistent_folder
  5. Open the docker-compose.yml file using a text editor.

  6. In the volumes section, change the device path to the new volume path. For example:

    volumes:
      persistent_folder:
        driver_opts:
          type: 'nfs'
          # NFS mount options; addr is the NFS endpoint of your SVM (keep your own value)
          o: "addr=svm-0df66b96a890d8a72.fs-0d673008aaca12bc3.fsx.us-east-1.amazonaws.com,nolock,soft,rw"
          device: ':/netapp_ai' # Path to new volume
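If you prefer the AWS CLI to the console for step 1, the following bash sketch wraps `aws fsx create-volume`. The helper name `create_ontap_volume`, the 1024 MB default size, and the example SVM ID are illustrative assumptions, and the settings Workload Factory for GenAI expects on this volume aren't documented here, so check the options against the AWS CLI reference for your version. Setting `DRY_RUN=1` prints the command instead of running it.

```shell
#!/usr/bin/env bash
# create_ontap_volume SVM_ID NAME [SIZE_MB]  (hypothetical helper)
# Creates an FSx for ONTAP volume whose junction path is /NAME, matching the
# netapp_ai -> /netapp_ai example in step 1. SIZE_MB defaults to 1024 (assumption).
create_ontap_volume() {
  local svm_id=$1 name=$2 size_mb=${3:-1024}
  if [[ -z $svm_id || -z $name ]]; then
    echo "usage: create_ontap_volume SVM_ID NAME [SIZE_MB]" >&2
    return 2
  fi
  local cmd=(aws fsx create-volume
    --volume-type ONTAP
    --name "$name"
    --ontap-configuration "JunctionPath=/$name,SizeInMegabytes=$size_mb,StorageVirtualMachineId=$svm_id,StorageEfficiencyEnabled=true")
  if [[ ${DRY_RUN:-0} == 1 ]]; then
    echo "${cmd[*]}"   # print the command without calling AWS
  else
    "${cmd[@]}"
  fi
}

# Example: DRY_RUN=1 create_ontap_volume svm-0df66b96a890d8a72 netapp_ai
```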

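Steps 3, 4, and 6 of the recovery procedure can also be scripted. This bash sketch assumes the Docker volume is named ec2-user_persistent_folder, as in step 4, and that docker-compose.yml contains a single single-quoted `device:` entry, as in the example in step 6; `remove_old_volume` and `set_device_path` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# remove_old_volume [NAME]  (hypothetical helper): steps 3 and 4.
# docker volume rm refuses to remove a volume that any container, even a
# stopped one, still references; remove that container first if it complains.
remove_old_volume() {
  local vol=${1:-ec2-user_persistent_folder}
  docker volume ls --filter "name=$vol"   # show what matches before removing
  docker volume rm "$vol"
}

# set_device_path FILE NEW_PATH  (hypothetical helper): step 6.
# Rewrites every line of the form  device: '...'  so it reads  device: ':NEW_PATH',
# keeping any trailing comment and saving the original as FILE.bak.
# NEW_PATH must not contain '|', '&', or '\', which are special to this sed command.
set_device_path() {
  local file=$1 new_path=$2
  sed -i.bak -E "s|^([[:space:]]*device:[[:space:]]*')[^']*(')|\1:${new_path}\2|" "$file"
}

# Example, run from the directory that holds docker-compose.yml:
#   remove_old_volume && set_device_path docker-compose.yml /netapp_ai
```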
Area: Maintenance

Issue: The backend Docker instance used by Workload Factory for GenAI fails to start.

Cause: The root volume was deleted.

Workaround: Create a new volume on FSx for NetApp ONTAP, specifying a name and a path, and then restart the backend Docker instance from Amazon EC2.
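The restart itself isn't spelled out above. The ec2-user_ prefix on the Docker volume name suggests the backend runs as a Docker Compose project in /home/ec2-user; assuming that, restarting from the EC2 instance could look like the following bash sketch. `restart_backend`, its default directory, and the `DRY_RUN` switch are assumptions for illustration, not documented Workload Factory behavior.

```shell
#!/usr/bin/env bash
# restart_backend [DIR]  (hypothetical helper)
# Starts, or recreates if the configuration changed, the services defined in
# DIR/docker-compose.yml. ASSUMPTIONS: the backend is a Compose project in
# /home/ec2-user, and the Compose v2 plugin (`docker compose`) is installed;
# older hosts may only have the standalone `docker-compose` binary.
restart_backend() {
  local dir=${1:-/home/ec2-user}
  local cmd=(docker compose -f "$dir/docker-compose.yml" up -d)
  if [[ ${DRY_RUN:-0} == 1 ]]; then
    echo "${cmd[*]}"   # print the command without running it
  else
    "${cmd[@]}"
  fi
}

# Example: DRY_RUN=1 restart_backend    # check the command first
#          restart_backend              # then run it
```

Using `-f` with the full path keeps the project directory, and therefore the ec2-user project name, the same as when the stack was first started from that directory.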