GenAI troubleshooting

09/09/2025

Learn how to work around some common problems you might encounter.

Common issues and solutions

If you have one of these issues, try the workaround listed for it. Each entry below gives, in order, the area, the issue, its cause, and the workaround.

Deployment

The deployment fails because the volume already exists.

NetApp Workload Factory for GenAI needs to create a new volume during the deployment process, but a volume with the name you specified already exists.

Specify a unique name to use for the new volume, and try deploying again.
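One way to choose a name that does not collide is sketched below. The helper `pick_unique_name` is hypothetical and not part of Workload Factory; it assumes you have already collected the existing volume names by some other means and pass them in as arguments.

```shell
# Hypothetical helper: prints the first name in the sequence
# BASE, BASE_2, BASE_3, ... that is not among the existing names.
# Usage: pick_unique_name BASE [EXISTING_NAME ...]
pick_unique_name() {
  base="$1"
  shift
  candidate="$base"
  suffix=2
  while :; do
    taken=0
    for name in "$@"; do
      if [ "$name" = "$candidate" ]; then
        taken=1
        break
      fi
    done
    if [ "$taken" -eq 0 ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
    # Try the next numbered variant of the base name.
    candidate="${base}_${suffix}"
    suffix=$((suffix + 1))
  done
}

# Example: netapp_ai and netapp_ai_2 are assumed to be in use already.
pick_unique_name netapp_ai netapp_ai netapp_ai_2
```

The suffix scheme is only an illustration; any name that is not already in use should work.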

Deployment

The deployment fails because NetApp Workload Factory for GenAI is unable to mount the volume.

One or more of the inbound ports required for FSx for NetApp ONTAP are closed or filtered.

Open the following inbound ports:

| Protocol | Port | Purpose |
| --- | --- | --- |
| All ICMP | All | Pinging the instance |
| HTTPS | 443 | Access from the Connector to the fsxadmin management LIF to send API calls to FSx |
| SSH | 22 | SSH access to the IP address of the cluster management LIF or a node management LIF |
| TCP | 111 | Remote procedure call for NFS |
| TCP | 139 | NetBIOS service session for CIFS |
| TCP | 161-162 | Simple network management protocol |
| TCP | 445 | Microsoft SMB/CIFS over TCP with NetBIOS framing |
| TCP | 635 | NFS mount |
| TCP | 749 | Kerberos |
| TCP | 2049 | NFS server daemon |
| TCP | 3260 | iSCSI access through the iSCSI data LIF |
| TCP | 4045 | NFS lock daemon |
| TCP | 4046 | Network status monitor for NFS |
| TCP | 10000 | Backup using NDMP |
| TCP | 11104 | Management of intercluster communication sessions for SnapMirror |
| TCP | 11105 | SnapMirror data transfer using intercluster LIFs |
| UDP | 111 | Remote procedure call for NFS |
| UDP | 161-162 | Simple network management protocol |
| UDP | 635 | NFS mount |
| UDP | 2049 | NFS server daemon |
| UDP | 4045 | NFS lock daemon |
| UDP | 4046 | Network status monitor for NFS |
| UDP | 4049 | NFS rquotad protocol |
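If you manage the security group with the AWS CLI, the rules above could be generated roughly as follows. This is a sketch, not an official procedure: the function `print_ingress_commands` is hypothetical, the security group ID and source CIDR are placeholders, and the ICMP rule is left out and would need to be added separately. The function only prints the commands so that you can review them before running anything.

```shell
# Hypothetical helper: prints (does not run) one AWS CLI command per
# TCP/UDP rule. SG_ID and SOURCE_CIDR are placeholders you supply.
# Usage: print_ingress_commands SG_ID SOURCE_CIDR
print_ingress_commands() {
  sg_id="$1"
  cidr="$2"
  # 443 is HTTPS and 22 is the default SSH port (assumed here).
  tcp_ports="443 22 111 139 161-162 445 635 749 2049 3260 4045 4046 10000 11104 11105"
  udp_ports="111 161-162 635 2049 4045 4046 4049"
  for port in $tcp_ports; do
    printf '%s\n' "aws ec2 authorize-security-group-ingress --group-id $sg_id --protocol tcp --port $port --cidr $cidr"
  done
  for port in $udp_ports; do
    printf '%s\n' "aws ec2 authorize-security-group-ingress --group-id $sg_id --protocol udp --port $port --cidr $cidr"
  done
}

# Example with placeholder values; review the output before running it.
print_ingress_commands sg-0123456789abcdef0 10.0.0.0/16
```

Restricting the source CIDR to the network that actually needs access is generally safer than opening the ports to all addresses.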

Maintenance

The AI engine fails to start, and you see the error "AI engine instance error" on the Knowledge bases page.

The AI engine instance was corrupted or doesn't exist.

Select the Rebuild button. NetApp Workload Factory for GenAI rebuilds the infrastructure and displays the rebuild progress. When complete, your knowledge bases are reconnected to the rebuilt infrastructure and the list of knowledge bases is displayed.

Maintenance

The AI engine fails to start, and you see the error "The GenAI engine instance is stopped" on the Knowledge bases page.

The AI engine instance is not running.

Use the AWS Management Console or the AWS CLI to start the AI engine instance.
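With the AWS CLI, starting the instance might look like the sketch below. The function name `start_engine_instance` is hypothetical, and the instance ID is a placeholder; look up the actual ID of the AI engine instance in your account.

```shell
# Hypothetical helper: starts an EC2 instance and waits until it runs.
# Usage: start_engine_instance INSTANCE_ID
start_engine_instance() {
  instance_id="$1"
  if [ -z "$instance_id" ]; then
    echo "usage: start_engine_instance INSTANCE_ID" >&2
    return 2
  fi
  # Request the start, then block until EC2 reports the instance as running.
  aws ec2 start-instances --instance-ids "$instance_id" &&
    aws ec2 wait instance-running --instance-ids "$instance_id"
}

# Example (placeholder instance ID):
# start_engine_instance i-0123456789abcdef0
```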

Maintenance

The AI engine fails to start, and you see the error "The GenAI engine server is not responding" on the Knowledge bases page.

The AI engine instance is not responding.

Use the following recovery steps:

Steps

Modify the GenAI engine instance security group to enable SSH access to the GenAI engine instance.
Log in to the instance using SSH.
Run the following command:
```
docker-compose up
```
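The recovery steps above can be scripted roughly as follows. This is a sketch under several assumptions: the function names are hypothetical, the login user is assumed to be `ec2-user`, the docker-compose.yml file is assumed to be in that user's login directory, and `-d` is added so the containers keep running after the SSH session ends.

```shell
# Step 1 (hypothetical helper): allow inbound SSH (TCP 22) from one range.
# Usage: allow_ssh SG_ID SOURCE_CIDR
allow_ssh() {
  aws ec2 authorize-security-group-ingress \
    --group-id "$1" --protocol tcp --port 22 --cidr "$2"
}

# Steps 2 and 3 (hypothetical helper): log in and start the containers
# in the background. The ec2-user login name is an assumption.
# Usage: restart_engine KEY_FILE HOST
restart_engine() {
  ssh -i "$1" "ec2-user@$2" 'docker-compose up -d'
}

# Example (placeholder values):
# allow_ssh sg-0123456789abcdef0 203.0.113.10/32
# restart_engine ~/.ssh/genai-key.pem 10.0.0.5
```

Consider removing the SSH rule again once the engine is back up.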

Maintenance

The backend Docker instance used by NetApp Workload Factory for GenAI failed to start.

The volume was deleted and the EC2 instance was restarted.

Use the following recovery steps:

Steps

Create a new volume on FSx for NetApp ONTAP. For example, the volume name can be netapp_ai and the volume path can be /netapp_ai.
SSH to the Amazon EC2 instance.
List the volumes:
```
docker volume list
```

Remove the old volume:
```
docker volume rm ec2-user_persistent_folder
```

Open the docker-compose.yml file using a text editor.

In the volumes section, change the device path to the new volume path. For example:
```
volumes:
  persistent_folder:
    driver_opts:
      type: 'nfs'
      o: "addr=svm-0df66b96a890d8a72.\
      fs-0d673008aaca12bc3.\
      fsx.us-east-1.amazonaws.com,nolock,soft,rw"
      device: ':/netapp_ai' # Path to new volume
```
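If you prefer not to edit the file by hand, the device path change could be made with `sed`, as in this sketch. The function `set_device_path` is hypothetical, and it assumes the `device:` line uses the single-quoted `':/path'` form shown in the example above. Back up docker-compose.yml first.

```shell
# Hypothetical helper: rewrites the path in lines of the form
#   device: ':/old_path'
# and leaves every other line, and any trailing comment, unchanged.
# NEW_PATH must not contain the characters | or &.
# Usage: set_device_path COMPOSE_FILE NEW_PATH
set_device_path() {
  compose_file="$1"
  new_path="$2"
  tmp_file="$(mktemp)" || return 1
  sed "s|^\([[:space:]]*device:[[:space:]]*\)':/[^']*'|\1':${new_path}'|" \
    "$compose_file" > "$tmp_file" &&
    cat "$tmp_file" > "$compose_file"
  status=$?
  rm -f "$tmp_file"
  return "$status"
}

# Example:
# cp docker-compose.yml docker-compose.yml.bak
# set_device_path docker-compose.yml /netapp_ai
```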

Maintenance

The backend Docker instance used by NetApp Workload Factory for GenAI failed to start.

The root volume was deleted.

Create a volume with a name and path, and then restart the backend Docker instance from Amazon EC2.
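With the AWS CLI, this workaround might be sketched as below. The function names are hypothetical, all IDs and the size are placeholders, and rebooting the EC2 instance is only one reading of "restart"; the volume name, junction path, and size that your deployment expects are not stated here, so treat them as assumptions.

```shell
# Hypothetical helper: creates an ONTAP volume whose junction path
# matches its name. All arguments are placeholders you supply.
# Usage: recreate_volume SVM_ID VOLUME_NAME SIZE_MB
recreate_volume() {
  aws fsx create-volume \
    --volume-type ONTAP \
    --name "$2" \
    --ontap-configuration "JunctionPath=/$2,SizeInMegabytes=$3,StorageVirtualMachineId=$1,StorageEfficiencyEnabled=true"
}

# Hypothetical helper: reboots the EC2 instance that hosts the backend.
# Usage: restart_backend_instance INSTANCE_ID
restart_backend_instance() {
  aws ec2 reboot-instances --instance-ids "$1"
}

# Example (placeholder values):
# recreate_volume svm-0123456789abcdef0 netapp_ai 102400
# restart_backend_instance i-0123456789abcdef0
```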
